Data Scraping

As the internet has grown, so have the diversity and volume of data circulating on the web.

One of the priorities of modern analytics is to automate tedious processes – data collection, particularly from websites, among them. This frees the analyst to focus on the more creative and strategically important part of the work: interpreting the data, drawing conclusions and making important business decisions on their basis.

Consider a few hypothetical scenarios involving data from the web.

1. You are interested in a certain currency exchange rate and its historical changes. Bank Y provides daily data in an Excel file that you can download, but each file sits on a separate page. To prepare two years of data, you would have to visit more than 700 pages; for ten years, that means more than 3,500 clicks.

2. For your social campaign, you plan to analyse the headlines of articles published over the last two years on major industry portals. Each headline must be selected with the mouse, copied and pasted into a local file.

3. You were planning to present data at an upcoming meeting to support your arguments for several decisions in the area in question. One website displays the data you need as a table, but for some reason copying it into Excel fails. Even if it succeeded, the table is split across more than 100 pages, so you would have to repeat the action 100 times – and time does not allow for this.


In all of these situations, manual data collection would be inefficient or even impossible. This is where data extraction and, specifically, web scraping come in.
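As an illustration, scenario 1 could be scripted in a few lines of Python. This is only a sketch: the URL pattern (bank-y.example.com) and the file naming are assumptions, and the real pages would have to be inspected to find the actual download links.

    # Sketch of scenario 1: fetching one Excel file per day in a loop.
    # The URL pattern and file names below are hypothetical placeholders.
    from datetime import date, timedelta

    import requests

    BASE_URL = "https://bank-y.example.com/rates/{d:%Y-%m-%d}/rates.xlsx"

    day, end = date(2023, 1, 1), date(2024, 12, 31)
    while day <= end:
        resp = requests.get(BASE_URL.format(d=day), timeout=30)
        if resp.ok:  # some days (weekends, holidays) may have no file
            with open(f"rates_{day:%Y-%m-%d}.xlsx", "wb") as f:
                f.write(resp.content)
        day += timedelta(days=1)

Two years of downloads become a single unattended run rather than 700 manual visits.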

What are web scraping and data extraction?

Web scraping, i.e. the programmatic extraction of data from websites, is part of the more general process of data extraction: retrieving information from unstructured or poorly structured sources.

  • An example of a structured source might be a familiar Excel table or a properly formatted CSV file. Such sources are designed with analysis in mind and are easy to handle.
  • An unstructured source might be the body of an email or a speech captured in an audio recording. Websites lie somewhere between these two poles: they have their own structure (HTML), although it is not as obvious as that of a table. It is this structure that makes it possible to automate, simplify and standardise data collection, greatly enriching the analytical process – as the sketch below illustrates.
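As a simple demonstration of exploiting that HTML structure, the sketch below pulls the rows of a table out of a page. The URL is a placeholder, and the code assumes the data sits in a plain <table> element; it uses the widely available requests and BeautifulSoup libraries, and a real page might need different selectors.

    # Sketch: extract a plain HTML <table> into a CSV file.
    # The URL is a placeholder; adjust the selector for a real page.
    import csv

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/data", timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for tr in soup.select("table tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    with open("table.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

The same few lines work whether the table has 10 rows or 10,000, which is what makes scenarios like those above tractable.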

Example tasks or questions to be answered by PMR experts in the field of web scraping:

  • What is the main topic of discussion on a given internet forum?
  • At what prices is a given product offered across e-commerce sites?
  • Which features of company X do its customers describe online?
  • How can historical data scattered across multiple Excel files be downloaded efficiently and made consistent? (See the sketch after this list.)
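The last question, for example, often comes down to a short consolidation script. The sketch below assumes a folder of daily Excel files (the rates_*.xlsx naming and the column layout are assumptions, echoing scenario 1) and merges them into one consistent CSV with pandas.

    # Sketch: merge many downloaded Excel files into one consistent dataset.
    # File naming and column layout are assumed for illustration.
    from pathlib import Path

    import pandas as pd

    frames = []
    for path in sorted(Path(".").glob("rates_*.xlsx")):
        df = pd.read_excel(path)  # needs the openpyxl engine for .xlsx
        df.columns = [str(c).strip().lower() for c in df.columns]  # unify headers
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)
    combined.to_csv("rates_combined.csv", index=False)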

Benefits of data extraction and web scraping from PMR:

  • time and mental resources saved by automating a tedious and boring process
  • greater data accuracy
  • better control of the information extraction procedure and standardisation of the data collection process
  • easier data management

Projects realised

Entering the aggregate and cement market in Russia

Challenge: What is the development potential of a new project in Russia? A manufacturer operating on the mining and construction …


Segmentation of the agricultural market using marketing personas

Challenge: How to effectively segment the agricultural market, so that the segments are useful in sales activities? Our client, a …


Irradiation and sterilization services market in Poland

Working for Łódź University of Technology (Politechnika Łódzka), PMR prepared a report on the irradiation and sterilization services market in …
