In this article we will discuss:
- Which datasets can be problematic to collect?
- The best ways to keep your data collection compliant
Which website data can be problematic to collect?
You should avoid collecting:
- Password-protected data
- Copyright-protected information
- Personally Identifiable Information (PII), for example, name, email address, date of birth, phone number, billing information, etc.
Collecting this type of data may have significant legal and financial implications for your company. This is due to the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), which stipulate that companies may be fined for non-compliance.
**None of the content in this post constitutes legal advice. Before making any changes or decisions that affect the way in which you collect data or the type of data you collect, please consult legal counsel.**
Best ways to ensure your data collection is done correctly
#1: Perform targeted data collection
Instead of collecting huge volumes of data or entire websites, which may contain private data, pinpoint the data that is essential to your project and collect only that. For example, instead of collecting entire social media profiles, collect only the posts and comments pertaining to your product or industry in order to gauge target audience sentiment.
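The targeted approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Post` structure, the sample posts, and the keyword list are all hypothetical, and a real project would likely need more robust matching than a simple substring check.

```python
from dataclasses import dataclass


@dataclass
class Post:
    # Hypothetical shape of a collected post; real sources will differ.
    author: str
    text: str


def relevant_texts(posts, keywords):
    """Return only the text of posts that mention at least one keyword.

    The author field is deliberately left out of the result, so the
    output contains nothing beyond what the analysis needs.
    """
    lowered = [keyword.lower() for keyword in keywords]
    return [
        post.text
        for post in posts
        if any(keyword in post.text.lower() for keyword in lowered)
    ]


posts = [
    Post("alice", "The new Acme blender is great"),
    Post("bob", "Went hiking today"),
]

print(relevant_texts(posts, ["acme"]))
```

Only the post mentioning the (made-up) product name is kept, and the author is never stored alongside it.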
#2: Only collect publicly available data
Data collection can be tricky. Many open source data points may be technically challenging to collect due to complex target site architecture, but this data is public and generally legal to crawl. However, if the data is password-protected or defined by law as PII, meaning it is indicative of an individual’s personal identity, then it should be avoided. Put data collection policies and procedures in place so that collectors only gather publicly available, open source data.
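One way such a policy might be enforced in code is to strip PII from every record before it is stored. The sketch below is an assumption-laden example, not a complete PII filter: the field names in `PII_FIELDS` are hypothetical, and the regular expression only catches common email address formats.

```python
import re

# Hypothetical field names treated as PII in this sketch.
PII_FIELDS = {"name", "email", "date_of_birth", "phone", "billing_info"}

# Simplified pattern for common email formats; not exhaustive.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def strip_pii(record):
    """Drop known PII fields and redact email addresses in remaining text."""
    cleaned = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            continue  # never store fields flagged as PII
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned


record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "comment": "Contact me at jane@example.com",
    "rating": 5,
}

print(strip_pii(record))
```

Running the cleanup step at ingestion time, rather than later, means PII never reaches storage in the first place.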
#3: Review target site robots.txt files
Most websites publish a robots.txt file, which essentially defines the on-site dos and don’ts for robots, spiders, and crawlers. It is a plain text file in the site’s root directory and can be found by adding ‘/robots.txt’ to the end of the site’s domain. Be sure to check it and ensure your web crawlers follow its guidelines when crawling target sites.
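As a rough illustration of the check described above, Python's standard library includes `urllib.robotparser`. The sketch below parses an inline sample file so that it runs without network access. The sample rules, the `my-crawler` user agent, and the example.com URLs are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL.
print(parser.can_fetch("my-crawler", "https://example.com/blog/post"))     # True
print(parser.can_fetch("my-crawler", "https://example.com/private/data"))  # False
```

Against a live site, the same parser can load the file itself by calling `set_url()` with the site's robots.txt address followed by `read()`. A crawler would then call `can_fetch()` before each request and skip any URL for which it returns `False`.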
#4: Use a sophisticated data collection tool
Tools such as Bright Data’s Web Scraper IDE allow you to specify exactly which data fields to collect, thereby avoiding personal data as well as any other undesired datasets.
Keeping private data private is of the highest importance to Bright Data, which is why we have rolled out a tool that lets you find out whether your publicly available data was collected by Bright Data’s data collection platforms. You can then ask for this information to be removed, which is part of our commitment to full transparency and legal compliance.
The bottom line
Making sure you only collect datasets that are 100% ethical and compliant with regulations is extremely important to the long-term value of your business. Reduce risk by implementing one or more of the above-mentioned suggestions today.