Location: Remote (May - August 2024); Bangkok, Thailand (16-20 September 2024)

Background

The development of digitized economic activities increasingly supports buying and selling transactions without face-to-face interaction. Meanwhile, recommendations and reviews of goods offered digitally foster a trustworthy ecosystem, enabling consumers to confidently purchase products without needing to physically inspect them.

As digital commerce expands, the data generated from these transactions becomes a valuable resource. Price data from online markets, in particular, offers new opportunities for economic analysis and the calculation of Consumer Price Indices (CPI). However, accessing and using this data requires specialized skills in web scraping and data analysis. This is where the ESCAP Big Data Project comes in, providing training that equips participants with the knowledge and tools to gather, process, and analyze online price data for CPI production.

The support comprised a series of remote training webinars from May to August 2024, followed by an in-person workshop held in Bangkok, Thailand, 16-20 September 2024. The workshop was organized to enhance the capacity of National Statistical Offices (NSOs) in Azerbaijan, Cambodia, Fiji, Lao PDR, Maldives, Nepal, Thailand, Tonga, Kiribati, and Uzbekistan in web scraping techniques for Consumer Price Indices (CPI).

The Regional Hub on Big Data and Data Science for Asia and the Pacific supported this capacity development initiative by providing a team of mentors from Badan Pusat Statistik (BPS) - Statistics Indonesia. These experts played a pivotal role throughout the training program, both during the online sessions and in the in-person workshop. The BPS mentors co-facilitated the workshop alongside the lead facilitators, Frances Krsinich, Christophe Bontemps, and Ceri Regan, contributing deep knowledge of web scraping, data processing, and CPI methodologies.


Objectives

The objectives of the programme are to develop skills to scrape price data from relevant websites and understand how it can be utilized to produce consumer price indices (CPI), including:

  • To understand the extent and range of online price data, and to assess their usefulness for calculating the CPI.
  • To understand the methodological considerations relating to the use of web-scraped prices in CPI calculations.
  • To appreciate the ethical considerations prior to embarking on web scraping projects.
  • To build skills and applications using Python, including the construction of web-scrapers using appropriate libraries and commands.
  • To apply these tools and methods to websites relevant to producing national price indices, and to utilize the resulting data for consideration in CPI production.
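One concrete starting point for the ethical considerations listed above is checking a site's robots.txt before scraping it. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the user agent name `nso-price-bot`, and the `example.com` URLs are hypothetical, and a real project would also review the site's terms of use and, where appropriate, seek the retailer's permission.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real scraper would instead call set_url(...) followed by read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="nso-price-bot"):
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/products/rice-5kg"))  # product page
print(is_allowed(parser, "https://example.com/checkout/cart"))      # disallowed path
print(parser.crawl_delay("nso-price-bot"))  # requested pause between requests, in seconds
```

Honouring the crawl delay (for example with `time.sleep`) keeps the load on the retailer's server low, which is part of scraping responsibly.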

Ongoing Mentorship Support

The mentorship program during the ESCAP Capacity Development Programme on Web Scraping for Price Statistics was supported by a diverse group of international experts. In addition to the four mentors from BPS - Statistics Indonesia who attended the in-person workshop in Bangkok, the program also benefited from the contributions of online mentors: Cem Baş (TÜİK, Turkey), Dominik Dabrowski (Stats Poland), Serge Goussev (Lead for Data Science and Engineering, Consumer Prices Division, Statistics Canada), and Luigi Palumbo (Researcher, Bank of Italy). These online mentors played a crucial role in the earlier phases of the training program, providing technical guidance and support to participants throughout the online training sessions held from May to September 2024.

Although these international mentors did not attend the in-person workshop in Bangkok, their expertise and involvement significantly enhanced the learning experience during the online phase, laying a strong foundation for the in-person sessions.

Activities During the Workshop

The workshop provided an intensive, hands-on learning environment, focusing on both theoretical and practical applications of web scraping. The key activities included:

Python Coding and Web Scraping: Participants worked through real-world examples of scraping online price data using the Python programming language, focusing on e-commerce websites in each country. The BPS mentors provided direct support during these sessions, guiding participants through code development and troubleshooting.

Automating Web Scraping Pipelines: A dedicated session focused on automating web scraping processes, which is essential for integrating web-based data into NSO workflows. BPS mentors led by example, sharing best practices on how automated pipelines have been implemented in Indonesia.

Methodological Challenges and Solutions: Methodological sessions addressed common issues related to bias, representativity, and data quality when incorporating web-scraped data into CPI calculations.
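As an illustration of the kind of price extraction practised in these sessions, the sketch below parses a small, made-up product listing using only Python's standard `html.parser` module. The HTML snippet, the class names `product`, `name`, and `price`, and the `PriceListingParser` class are assumptions for this example. Real e-commerce pages differ in structure, and the workshop code may well have used other libraries.

```python
from html.parser import HTMLParser

# Invented listing standing in for a real e-commerce page.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Jasmine rice 5 kg</span>
      <span class="price">THB 189.00</span></li>
  <li class="product"><span class="name">Cooking oil 1 L</span>
      <span class="price">THB 62.50</span></li>
</ul>
"""


class PriceListingParser(HTMLParser):
    """Collect one {'name': ..., 'price': ...} dict per product list item."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = {}
        self._field = None  # 'name' or 'price' while inside such a span

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "li" and css_class == "product":
            self._current = {}
        elif tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Accumulate text only while inside a name or price span.
        if self._field is not None:
            self._current[self._field] = self._current.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "span" and self._field is not None:
            self._current[self._field] = self._current.get(self._field, "").strip()
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def extract_prices(html):
    """Return the list of product rows found in the HTML text."""
    parser = PriceListingParser()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_prices(SAMPLE_HTML)
for row in rows:
    print(row["name"], "->", row["price"])
```

Prices are kept as raw strings at this stage. Converting them to numbers is a separate cleaning step, which makes it easier to spot when a site changes its price format.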

Follow-up and Future Plans

During the closing session, Calvin, a representative of the Regional Hub on Big Data and Data Science and BPS - Statistics Indonesia, outlined the Hub’s commitment to supporting further development in this area:

1. Compiling Materials and Videos: The Regional Hub will compile and upload all workshop materials and recorded sessions onto its website, making these resources available not only to the participating countries but also to other countries in the Asia-Pacific region interested in learning from this content.

2. Ongoing Mentorship Support: The Regional Hub will continue providing technical mentorship through its team of experts from Indonesia. This support will be offered both virtually and, if requested, in person. Countries that wish to engage in further learning and capacity development can invite the mentors for on-site assistance under a self-funded arrangement. The Hub is committed to being a continuous source of guidance and knowledge sharing, ensuring that participating countries receive tailored support in implementing and expanding web scraping techniques for price statistics.

3. Future Initiatives: The Regional Hub plans to develop additional programs and resources to further enhance the capacity of NSOs in utilizing big data and innovative technologies for official statistics. These initiatives will be designed to meet the evolving needs of countries in the Asia-Pacific region and to support the broader objectives of the 2030 Agenda for Sustainable Development.

Outcomes and Achievements

The workshop achieved the following key outcomes:

• Participants improved their technical skills in using Python for web scraping and were able to apply these skills directly to websites relevant to their respective NSOs.

• Automation processes for web scraping were better understood, laying the groundwork for integrating such processes into routine NSO operations.

• Participants gained insights into processing web-scraped data for CPI calculations, reducing reliance on traditional methods of data collection.
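To make the step from scraped prices to an index concrete, the sketch below computes a Jevons elementary index, that is, the geometric mean of price relatives for products observed in both periods. This is one common elementary formula, chosen here for illustration only. The source does not say which index formula the programme used, and the product identifiers and prices are invented.

```python
import math


def jevons_index(base_prices, current_prices):
    """Jevons elementary index (base period = 100) over products present in both periods.

    Both arguments map a product identifier to a strictly positive price.
    """
    matched = sorted(set(base_prices) & set(current_prices))
    if not matched:
        raise ValueError("no products matched across the two periods")
    # Average the log price relatives, then exponentiate: a geometric mean.
    log_relatives = [math.log(current_prices[p] / base_prices[p]) for p in matched]
    return 100.0 * math.exp(sum(log_relatives) / len(log_relatives))


# Hypothetical scraped prices for two collection periods.
base = {"rice-5kg": 100.0, "oil-1l": 50.0, "eggs-10": 40.0}
current = {"rice-5kg": 110.0, "oil-1l": 55.0, "milk-1l": 30.0}

print(round(jevons_index(base, current), 2))  # prints 110.0
```

Only `rice-5kg` and `oil-1l` appear in both periods, so `eggs-10` and `milk-1l` drop out of the calculation. That silent loss of unmatched products is one reason representativity needs attention when web-scraped data feed into a CPI.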

The four BPS mentors, Erma Purnatika Dewi (Directorate of Price Statistics), I Nyoman Setiawan (Directorate of Statistical Analysis and Development), and Wahyu Calvin Frans Mariel and Muhammad Ghozy Al Haqqoni (both Directorate of Statistical Information System), were instrumental in the workshop's accomplishments, sharing their practical experience and expertise to help participants navigate the technical and methodological aspects of web scraping.

Participants and Mentors

Each participating country was assigned a mentor experienced in web scraping, who was available to provide technical guidance and support.

| Country | Mentee 1 | Mentee 2 | Mentor | Mentor's Origin |
| --- | --- | --- | --- | --- |
| Azerbaijan | Kanan Mammadrzazada | Nargiz Israfilova | Cem Baş | TÜİK, Turkey |
| Maldives | Aishath Nuha | Ismail Mahfooz | Luigi Palumbo | Bank of Italy |
| Nepal | Arjun Pandey | Sanjay Kumar Chaudhary | Luigi Palumbo | Bank of Italy |
| Tonga | Felix Feiloaki | Ofa Taulani | Muhammad Ghozy Al Haqqoni | BPS Indonesia |
| Fiji | Siti Sikivou (ssikivou) | Leba Wakolo (lebas) | Erma Purnatika Dewi | BPS Indonesia |
| Cambodia | Somethea Buoy (bsomethea) | Ly Sim (simly441) | I Nyoman Setiawan | BPS Indonesia |
| Thailand | Thitiwat Kaew-Amdee | Prawit Banjong | Wahyu Calvin Frans Mariel | BPS Indonesia |
| Lao PDR | Saykham Saysombath (Saysombath9) | Sounidda Saiyasith | Dominik Dabrowski | Stats Poland |
| Kiribati | Tom Benitera | - | Dominik Dabrowski | Stats Poland |
| Uzbekistan | Shaknoza Uktamovna (shohruh odilov) | Murodjon Anvarjon (depit) | Serge Goussev | Stats Canada |

Programme

| Activity | Date & Time | Module and Description |
| --- | --- | --- |
| Skills Assessment | 22 – 30 Apr (15 mins) | • Issue Skills Assessment to all participants for completion |
| Induction (Virtual) | 15 May (12:00 – 13:30) | • Welcome all – quick introductions • Aims, objectives, ways of working & expectations • High-level background to web scraping prices for CPI and approach to be taken, including selection of e-commerce retail outlets in country: are they online only or not, do they transport products across the country or not, is there any reliable info on magnitude of sales from elsewhere (for weights)? • Homework: select an online retail outlet website for scraping during this programme (a set of questions will be provided); set up the Python environment (a list of software requirements will be provided for installation onto laptops) |
| Python Environment set-up surgery (Virtual) | 22 May (12:00 – 13:00) | • Check Python environment set up • Issue Python commands to run as a class • Troubleshoot any issues |
| Mentoring (Virtual) | Available throughout programme | • Mentoring will be made available to Prices and IT experts from NSOs throughout the programme. A variety of mentors will be on hand to support with all questions • Mentors will be able to provide a wide range of support from Python coding and web scraping to methodology and developing pipelines |

Virtual Training Modules

| Activity | Date & Time | Module and Description |
| --- | --- | --- |
| 1. Background & concepts for web scraping Prices for CPI | 29 May (12:00 – 14:00) | • Outline the basic requirements for web-scraping • The requirements that must be in place (e.g. ethical permissions and sample of websites) • Introduce concept and requirements for a strategy for price scraping: what areas of CPI do they want to improve? Which product groups, etc.? What other information do you need? • Introduction to Methodology considerations – coverage, volume, adherence and integration with other data streams |
| 2. Define country team outcomes | 05 June (12:00 – 14:00) | • Country teams to (quickly) present results from the precourse work • Review country strategies with the aim of finalizing each team approach • Definition of a minimum viable pilot for each country team as outcome from the cooperation project, including: target data source(s); scraping schedule, data pipeline, and monitoring; final data product; project documentation |
| 3. Showcase of examples where web scraped Prices used for Official Statistics | 19 June (12:00 – 14:00) | • Webinar that introduces NSO examples of application & discussion of challenges and how they overcame those |
| 4. Python training 1 | 26 June (12:00 – 14:00) | • Building on the online course already undertaken • Developing Python coding in readiness for building web scraper • Exercise session – providing tasks to work on |
| 5. Python training 2 (Virtual) | 10 July (12:00 – 14:00) | • Building on learning • Developing Python coding in readiness for building web scraper • Exercise session – providing tasks to work on |
| 6. Building a web scraper in Python 1 | 24 July (12:00 – 14:00) | • Applying Python code to web scraping – using different examples (including prices) • Scrape dummy websites – data extraction & assessment of the website’s structure • Data Cleaning • Exercise session – providing tasks to work on |
| 7. Building a web scraper in Python 2 | 31 July (12:00 – 14:00) | • Building on learning – applying Python code to web scraping • Scrape dummy websites – data extraction & assessment of the website’s structure • Data Cleaning • Exercise session – providing tasks to work on |
| 8. Building a web scraper in Python 3 | 7 Aug (12:00 – 14:00) | • Building on learning – applying Python code to web scraping • Scrape dummy websites – data extraction & assessment of the website’s structure • Data Cleaning • Exercise session – providing tasks to work on |
| 9. Building a web scraper in Python 4 | 21 Aug (12:00 – 14:00) | • Building on learning – applying Python code to web scraping • Scrape dummy websites – data extraction & assessment of the website’s structure • Data Cleaning • Exercise session – providing tasks to work on |
| 10. Developing a Reproducible Analytical Pipeline | 11 Sept (12:00 – 14:00) | |
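The data cleaning and pipeline themes in the modules above can be sketched together as follows: one helper normalises scraped price strings to numbers, and a second appends each cleaned observation to a CSV file with its collection date, so that repeated runs build up a price history. The function names, column names, file name, and sample rows are all hypothetical. This is an illustration under those assumptions, not the programme's actual pipeline.

```python
import csv
import re
import tempfile
from datetime import date
from pathlib import Path

FIELDS = ["collected_on", "product", "price"]


def clean_price(raw):
    """Turn a scraped string such as 'THB 1,299.00' into a float, or None if no number is found.

    Assumes ',' is a thousands separator and '.' is the decimal mark.
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


def append_snapshot(rows, path, collected_on):
    """Append cleaned rows to a CSV file; write the header only when the file is new.

    Returns the number of rows written.
    """
    path = Path(path)
    is_new = not path.exists()
    written = 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        for row in rows:
            price = clean_price(row["price"])
            if price is None:
                continue  # skip rows without a usable price, e.g. "Out of stock"
            writer.writerow({
                "collected_on": collected_on.isoformat(),
                "product": row["name"],
                "price": price,
            })
            written += 1
    return written


# Hypothetical scraped rows, as a scraper might return them.
scraped = [
    {"name": "Rice cooker 1.8 L", "price": "THB 1,299.00"},
    {"name": "Cooking oil 1 L", "price": "Out of stock"},
]
out_file = Path(tempfile.mkdtemp()) / "prices.csv"
print(append_snapshot(scraped, out_file, date(2024, 9, 16)))
print(out_file.read_text(encoding="utf-8"))
```

Storing the collection date with every row is what makes the file usable later for period-to-period price comparisons. Scheduling the script (for example with cron or a task scheduler) would be the next step towards automation.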

In-person Week

| Activity | Date & Time | Module and Description |
| --- | --- | --- |
| 10. Web scraping for CPI – Day 1 | Mon 16 Sep | Web scraping Prices in Python: • Using the strategy developed pre-course and in Modules 1 & 2 – walk through steps to web scrape prices • Start scraping their website of choice (selected during precourse work) |
| 11. Web scraping for CPI – Day 2 | Tue 17 Sep | Web scraping Prices in Python: • Assess scraped results • Refine code to improve the scraped pool |
| 12. Web scraping for CPI – Day 3 | Wed 18 Sep | Discussion (am): • Continuation of discussions (buffer) • Other web-scraping applications in Official Statistics – examples of this? |
| 13. Web scraping for CPI – Day 4 | Thu 19 Sep | NSO presentations of results & next steps • Country 3 & 4 NSO presentations of results & next steps • Country 5 & 6 |
| 14. Web scraping for CPI – Day 5 | Fri 20 Sep | Morning only: • Summary of learning • Review of Training • Next steps |