In this article we will discuss:
- The cost and effort required to create a dataset are significant
- Maintaining dataset relevance with constant updates is paramount
- ‘Datasets’ offers a variety of customized ‘refresh mechanisms’
- The infrastructure behind the datasets
The cost and effort required to create a dataset are significant
Before we discuss the ‘real-time’ aspect of datasets, let’s first take a glance at the challenge they present. To generate datasets independently, any given company would need to invest significant effort and make a hefty monetary commitment. The task at hand includes:
- Discovery of all data records, and data points
- Data validation, i.e. ensuring that all data is correct and that no crucial elements are missing
- Data enrichment, i.e. cross-referencing data from other sources in order to increase the value of the dataset at hand
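As an illustration of the validation step, the sketch below flags records that are missing crucial fields before they enter a dataset. The field names (`id`, `name`, `price`) are hypothetical examples, not any provider’s actual schema:

```python
# Minimal validation sketch: split incoming records into valid and invalid,
# based on a set of required fields. Field names are illustrative only.
REQUIRED_FIELDS = {"id", "name", "price"}

def validate(records):
    """Return (valid, invalid) lists; a record is invalid if any required field is missing."""
    valid, invalid = [], []
    for rec in records:
        missing = REQUIRED_FIELDS - rec.keys()
        (invalid if missing else valid).append(rec)
    return valid, invalid
```

A record such as `{"id": 2}` would land in the invalid list, ready for re-collection or manual review.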
Businesses are left with two options:
- Option 1: Leveraging the know-how, infrastructure, and network of an existing provider that can deliver ready-to-use information – ‘Datasets’ does exactly that.
- Option 2: Developing and maintaining in-house infrastructure, plus team members who can collect, clean, structure, and feed the necessary datasets into systems.
Maintaining dataset relevance with constant updates is paramount
Whichever option you choose, once a dataset has been collected and is ready for operational use, maintaining its relevance is key. This may include:
- Periodically updating data records – keep in mind that some datasets may require more frequent refreshes, depending on the industry, segment, and nature of the data. For example, a digital vendor’s shop may want bestseller ranking data refreshed in the short term, while its marketing campaigns may benefit from social influencer data refreshed in the mid to long term. Hedge funds, on the other hand, may only need monthly or quarterly updates of target-company data.
- Identifying which subsets have changed since the last collection job. For example, have any of the employees in question changed positions (something that would be displayed on a social/business network such as LinkedIn)? Has the company raised additional capital in a funding round (something that may be displayed on a company stats platform such as Crunchbase)? Has the product in question become a ‘best seller’, or has its price changed (something that may appear on an eCommerce marketplace such as Amazon, Walmart, or Lazada)?
- Storing key historical data points that enable the detection of industry consumer trends/habits, as well as market patterns/cycles that may be repeating themselves; once identified, these can generate insights that let business models work in synergy with the relevant datasets.
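The change-detection step above can be sketched as a simple diff between two collection jobs keyed by record ID. The record shapes and field names here are hypothetical examples:

```python
# Minimal sketch: detect which records changed (or appeared) between two
# collection jobs, by comparing records with the same ID field by field.
def diff_records(previous, current, key="id"):
    """Return the subset of `current` that is new or differs from `previous`."""
    prev_by_key = {rec[key]: rec for rec in previous}
    changed = []
    for rec in current:
        old = prev_by_key.get(rec[key])
        if old is None or old != rec:  # new record, or at least one field changed
            changed.append(rec)
    return changed

# Illustrative data: one product's price dropped between jobs.
last_job = [
    {"id": "B0001", "price": 19.99, "bestseller_rank": 42},
    {"id": "B0002", "price": 5.49, "bestseller_rank": 7},
]
this_job = [
    {"id": "B0001", "price": 17.99, "bestseller_rank": 42},  # price dropped
    {"id": "B0002", "price": 5.49, "bestseller_rank": 7},    # unchanged
]
changed = diff_records(last_job, this_job)  # only the B0001 record
```

Only the changed subset then needs to be re-validated and pushed downstream, rather than the full dataset.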
When creating a dataset independently, you need to invest constantly in maintaining its relevance and validity. When you tap into pre-collected Datasets, this is handled automatically.
‘Datasets’ offers a variety of customized ‘refresh mechanisms’
When taking a closer look at Bright Data’s ‘Datasets’ solution specifically, one can see that several ‘refresh mechanisms’ are on offer, based on user needs and the nature of the dataset in question. These include:
- A full refresh of an entire dataset – for example, a company that wants a quarterly ‘re-collection’ of the target website or industry directories.
- Automated daily or weekly updates of specific data records that may have changed, such as product price, bestseller ranking, employee position, or social sentiment regarding a specific brand or entity.
- Automated daily/weekly updates of any data records that have since been added or recollected; over time, the dataset in its entirety is gradually updated with the cumulative information.
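The incremental mechanism in the last bullet can be sketched as a merge of each update batch into the full dataset by record ID, so the dataset converges toward the latest state over time. The field names are illustrative, not Bright Data’s actual schema:

```python
# Minimal sketch of an incremental refresh: each daily/weekly batch of
# added or recollected records is merged into the dataset by ID, either
# adding a new record or overwriting a stale one.
def apply_refresh(dataset, updates, key="id"):
    merged = {rec[key]: rec for rec in dataset}
    for rec in updates:
        merged[rec[key]] = rec  # add new record or overwrite the stale copy
    return list(merged.values())

base = [{"id": 1, "price": 10}, {"id": 2, "price": 20}]
batch = [{"id": 2, "price": 18}, {"id": 3, "price": 30}]  # one update, one addition
refreshed = apply_refresh(base, batch)
```

After the merge, record 2 carries its new price and record 3 has been added, while record 1 is untouched.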
The infrastructure behind the datasets
This last section takes a quick look at the ‘data collection effort’ and the infrastructure that lies behind it. To successfully collect datasets from public websites, suitable infrastructure must either be built internally or leveraged from an external provider. Here are some examples of commonly used dataset collection infrastructure, and the value each provides in achieving the desired end result:
- Proxy network – When performing mass-scale data collection jobs, multiple IP addresses must be utilized. Also, since many websites present localized, customized content based on geolocation, it is more effective to use IP addresses that originate from the same GEOs as the targets of a given dataset.
- Unlocking – Many websites have anti-bot mechanisms that make it hard to perform large-scale data crawling without human intervention. Businesses use designated web unlocking tools and algorithms to overcome these limitations (e.g. rate/geolocation limitations or User-Agent variants).
- Data collection – Software that manages the full-cycle data collection process, using algorithms that detect collection issues or limitations and automatically perform retries, employing alternative access approaches until the desired target data is obtained.
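The retry-with-alternative-approaches idea can be sketched as a loop that rotates proxy geolocation and User-Agent on each attempt until a fetch succeeds. Everything here is hypothetical: `fetch` is a stand-in for a real HTTP client, and the GEO/User-Agent lists are placeholders:

```python
# Minimal sketch of a retry loop that cycles through alternative access
# approaches (proxy GEO, User-Agent) until a fetch succeeds or the
# attempt budget is exhausted. `fetch` is an injected, hypothetical client
# that returns the page body on success and None on failure.
PROXY_GEOS = ["us", "de", "jp"]
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

def collect(url, fetch, max_attempts=6):
    for attempt in range(max_attempts):
        geo = PROXY_GEOS[attempt % len(PROXY_GEOS)]        # rotate proxy GEO
        ua = USER_AGENTS[attempt % len(USER_AGENTS)]       # rotate User-Agent
        result = fetch(url, proxy_geo=geo, user_agent=ua)
        if result is not None:  # success: stop retrying
            return result
    raise RuntimeError(f"all {max_attempts} attempts failed for {url}")
```

A real collector would also vary request timing and handle per-site response parsing, but the rotate-and-retry core is the same.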
The bottom line
For companies where building and maintaining data infrastructure is not part of their core business and capabilities, purchasing ready-to-use Datasets is the way to go. Instead of wasting precious time and resources, businesses can simply define which aspects of their Datasets need refreshing, and when. That’s it.