Web Crawling Is So 2019
In this article we will discuss:
- Pre-collected Datasets are more effective and create more value than web crawling
- How Datasets are being leveraged across different industries
Pre-collected Datasets are more effective and create more value than web crawling
Since Bright Data introduced ready-to-use Datasets, many companies have been moving away from in-house web crawling. Instead, they have a snapshot of an entire site, or a smart subset tailored to their data needs, delivered directly to their teams.
This option is helping businesses become more efficient in terms of their:
- Agility – Datasets enable a high degree of workflow and budgetary flexibility, as you have no ‘ongoing commitment’ to your data collection operations. This means that you can custom order a Dataset for a specific project one month, take a break, and then order another for a Proof of Concept (PoC) later down the line. Access to data supports your work instead of constraining it.
- Resources – Datasets require no maintenance or upkeep and no in-house hardware or software, nor do they require you to maintain teams of IT, engineering, and DevOps personnel.
- Time – Datasets can shorten the time between the ‘ideation stage’ and the rollout of a new product, feature, or capability. Because the data is pre-collected, there is no collection time, meaning the data your algorithms need can be delivered in a matter of minutes. Additionally, datasets are regularly refreshed, ensuring that you are relying on up-to-date information.
- Cost-efficiency – Datasets are a more cost-effective option, as the costs of scaling, access, and upkeep are spread among multiple corporations. This ‘data sharing model’ reduces the cost for each individual participant.
How Datasets are being leveraged across different industries
Insurance, investment, and lending are all heavily regimented industries that can benefit from datasets in general, and alternative datasets in particular.
For example, institutional lenders try to mitigate risk by creating a profile on the company or person requesting a line of credit. Typically they use ‘classic data’ such as:
- Credit history/scores
- Debt-to-income ratio
But feeding algorithms an additional layer of information on which to base decisions about applicants can open institutions up to new, previously overlooked low- to mid-risk customers.
When evaluating the financial strength of a company, datasets such as industry rankings, job postings, and employee reviews, alongside more “traditional” data points such as revenue, company size, and investment rounds, can provide relevant insights into that company’s strengths and credit rating while widening the lender’s understanding of it.
For individuals, lenders can use social media profiles to gain a better understanding of who the person is and how that might influence a loan’s level of risk (do they skydive? Party every night?).
Lenders can also order a ready-to-use dataset on the average time it takes applicants in their target audience to fill out online loan applications. The First Bank of Omaha’s compliance team, for example, collects this information and takes a closer look at applications with an unusual time lag, because its internal statistics show that these applications are more likely to fit one of many fraud profiles.
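The time-lag check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bank's actual method: the application IDs, the sample durations, and the choice of a median-based outlier score (the modified z-score with the commonly used 3.5 cutoff) are all invented for the example.

```python
from statistics import median


def flag_unusual_durations(durations, threshold=3.5):
    """Return the application IDs whose completion time is an outlier.

    `durations` maps a hypothetical application ID to the number of seconds
    the applicant took to complete the form. Outliers are scored with the
    modified z-score, which uses the median absolute deviation (MAD) so that
    a few extreme values do not distort the baseline.
    """
    values = list(durations.values())
    if len(values) < 3:
        return []  # too little data to say what "usual" looks like
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # at least half the values are identical; score undefined
    flagged = []
    for app_id, seconds in durations.items():
        score = 0.6745 * abs(seconds - med) / mad
        if score > threshold:
            flagged.append(app_id)
    return flagged


# Made-up sample data, for illustration only
sample = {
    "A-101": 410,
    "A-102": 395,
    "A-103": 430,
    "A-104": 12,    # suspiciously fast
    "A-105": 405,
    "A-106": 2900,  # suspiciously slow
}
print(flag_unusual_durations(sample))
```

A flagged application is not necessarily fraudulent. In this sketch the flag only marks it for the closer manual look the compliance team would give it.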
As far as investors are concerned, venture capital firms are leveraging datasets in order to get in on companies at an early stage. This is driven by a huge rise in investment capital while the pool of startups remains stagnant. Relevant ready-to-use datasets in this context include:
- Entire startup accelerator sites, which can be scanned for companies with stats that yell ‘monetization opportunity’ (such as growth in the number of employees over a short period of time, a rise in the number of job postings, heightened activity in industry forums, or a recent successful product launch)
- Full app store sites, which can be searched for applications with strong performance, download numbers, and star ratings, all of which can be indicative of a company’s growth and adoption rates among target audiences.
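As a rough sketch of how such signals might be screened once a dataset has been delivered, the Python below shortlists companies by headcount growth and job-posting growth. The record fields, the company names, and both thresholds are hypothetical and chosen only for the example. A real dataset would have its own schema.

```python
from dataclasses import dataclass


@dataclass
class CompanySnapshot:
    # Hypothetical fields; a delivered dataset would define its own schema.
    name: str
    employees_then: int
    employees_now: int
    open_jobs_then: int
    open_jobs_now: int


def growth_rate(before, after):
    """Relative growth between two counts, or None if there is no baseline."""
    if before <= 0:
        return None
    return (after - before) / before


def shortlist(companies, min_headcount_growth=0.5, min_job_growth=0.25):
    """Return (name, headcount growth, job growth), fastest-growing first."""
    picks = []
    for c in companies:
        headcount = growth_rate(c.employees_then, c.employees_now)
        jobs = growth_rate(c.open_jobs_then, c.open_jobs_now)
        if headcount is None or jobs is None:
            continue  # no baseline to compare against
        if headcount >= min_headcount_growth and jobs >= min_job_growth:
            picks.append((c.name, round(headcount, 2), round(jobs, 2)))
    return sorted(picks, key=lambda p: p[1], reverse=True)


# Made-up records, for illustration only
companies = [
    CompanySnapshot("Startup A", 10, 20, 4, 6),
    CompanySnapshot("Startup B", 50, 55, 10, 30),
    CompanySnapshot("Startup C", 8, 14, 2, 3),
    CompanySnapshot("Startup D", 0, 5, 0, 2),
]
for row in shortlist(companies):
    print(row)
```

The thresholds are deliberately exposed as parameters, since what counts as fast growth will differ by sector and company stage.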
Social media Datasets
Many companies have business models and digital services that rely heavily on social media input. Good examples are fitness apps, wearables, and ‘health tracking as a business model’ companies. In this context, businesses are ordering pre-collected datasets such as:
- Top-followed influencers in the health, beauty, and sports industries – This may include entire profiles or just trending posts with high engagement metrics. These can serve as very real indicators of target audience interest, sentiment, and workout routines. For example, multiple posts discussing a desire ‘to get rid of belly fat’ may indicate a market need for a new product that targets this issue specifically, or shed light on advertising messaging that may work well for existing product lines.
- Secondary wearable or app achievement data – Many people use fitness apps and wearables such as smart watches to track their workout sessions. This information is private and cannot be collected, but many people choose to share their achievements on social media, which is where this alternative/secondary dataset can be collected. This information can be extremely important in understanding what type of workout routine people are doing (running? yoga?) as well as where they are doing it (in a gym? in the park?). This data can inform ad campaigns, product lines, and new fitness app features, and yield a host of other insights that can help your company become a consumer-first market leader.
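To make the workout type and location questions concrete, here is a minimal Python sketch that tallies both from the text of shared posts using simple keyword matching. The keyword lists and sample posts are assumptions made up for the example, and a production pipeline would likely need something more robust than keyword lookup.

```python
import re
from collections import Counter

# Hypothetical keyword lists; tune these to the dataset actually delivered.
ACTIVITY_KEYWORDS = {
    "running": {"run", "ran", "running", "5k", "10k"},
    "yoga": {"yoga"},
    "cycling": {"ride", "cycling", "bike"},
}
LOCATION_KEYWORDS = {
    "gym": {"gym", "treadmill"},
    "outdoors": {"park", "trail", "outdoors"},
}


def tally_posts(posts):
    """Count how many posts mention each activity and each location."""
    activities = Counter()
    locations = Counter()
    for text in posts:
        # Lowercase and split into alphanumeric tokens so "Park!" matches "park"
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        for label, keywords in ACTIVITY_KEYWORDS.items():
            if words & keywords:
                activities[label] += 1  # at most once per post
        for label, keywords in LOCATION_KEYWORDS.items():
            if words & keywords:
                locations[label] += 1
    return activities, locations


# Made-up posts, for illustration only
posts = [
    "Just finished a 5k run in the park!",
    "Morning yoga at the gym",
    "Ran 10k on the treadmill today",
    "New personal best!",
]
activities, locations = tally_posts(posts)
print(activities.most_common())
print(locations.most_common())
```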
The bottom line
Actively crawling the internet for the datasets your company needs in order to make smarter business decisions is ‘passé’. It is a resource-heavy, time-consuming, and clunky way to run a business. Datasets allow you to focus on your core business and order the data you need, whenever you need it and in whatever format you need it (parsed JSON, CSV, or Excel).