Data Collection Best Practices

In this article we will discuss:

Which website data can be problematic to collect?
Best ways to ensure your data collection is done correctly

Which website data can be problematic to collect?

You should avoid collecting:

Password-protected data
Copyright-protected information
Personally Identifiable Information (PII), for example: name, email address, date of birth, phone number, billing information, etc.

Collecting this type of data may have significant legal and financial implications for your company. In particular, privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) stipulate that companies may be fined for non-compliance.

**None of the content in this post constitutes legal advice. Before making any changes or decisions that affect the way in which you collect data or the type of data you collect, please consult legal counsel.**

Best ways to ensure your data collection is done correctly

#1: Perform targeted data collection

Instead of collecting huge volumes of data or entire websites, which may contain private data, pinpoint which data is essential to your project and collect only that. For example, instead of collecting entire social media profiles, collect only the posts/comments that pertain to your product or industry in order to gauge target audience sentiment.
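As a rough illustration of targeted collection, the sketch below keeps only posts that mention a keyword of interest and discards everything else before it is stored. The function name `filter_relevant`, the `text` field, and the sample posts are all hypothetical; simple substring matching is an assumption made for brevity, and a real project may need something more precise.

```python
from typing import Iterable


def filter_relevant(posts: Iterable[dict], keywords: Iterable[str]) -> list[dict]:
    """Keep only posts whose text mentions at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        post
        for post in posts
        if any(keyword in post.get("text", "").lower() for keyword in lowered)
    ]


# Hypothetical sample data: only posts 1 and 3 mention the product.
posts = [
    {"id": 1, "text": "The new Acme router drops connections"},
    {"id": 2, "text": "Happy birthday to my sister!"},
    {"id": 3, "text": "Anyone compared ACME with other routers?"},
]

relevant = filter_relevant(posts, ["acme"])
print([post["id"] for post in relevant])  # prints [1, 3]
```

Filtering at collection time, rather than after storing everything, means unrelated and potentially private content never enters your dataset.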

#2: Only collect publicly available data

Data collection can be tricky. Many publicly available data points are technically challenging to collect because of complex target site architecture, but the data itself is public and is generally permissible to crawl. However, if the data is password-protected or is defined by law as PII, meaning it can identify an individual, it should be avoided. Put data collection policies and procedures in place to ensure collectors only gather publicly available data.
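One way a collection policy might be backed up in code is a screening step that masks obvious personal identifiers before text is stored. The sketch below is a heuristic only: the two regular expressions and the `redact_pii` helper are illustrative assumptions, they will miss many forms of PII (names, addresses, dates of birth) and can produce false positives, and they are not a substitute for legal review.

```python
import re

# Heuristic patterns (assumptions for illustration, not a legal definition of PII).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email-like and phone-like substrings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


sample = "Contact jane.doe@example.com or +1 (555) 123-4567 for details."
print(redact_pii(sample))  # prints: Contact [EMAIL] or [PHONE] for details.
```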

#3: Review target site robots.txt files

Many websites publish a robots.txt file that defines the on-site dos and don’ts for robots/spiders/crawlers. It is a plain-text file in the site’s root directory and can be found by adding ‘/robots.txt’ to the end of the domain, although not every site has one. Check this file and make sure your web crawlers follow its rules when crawling target sites.
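A crawler can perform this check automatically before each request. The sketch below uses `urllib.robotparser` from the Python standard library; the `is_allowed` helper, the `my-crawler` user agent, and the sample rules are assumptions made for illustration. The rules are parsed from a string here so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/products/1"))      # prints True
print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/private/report"))  # prints False
```

Against a live site, the same parser can fetch the file itself: call `set_url()` with the site's robots.txt address, then `read()`, before calling `can_fetch()`.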

#4: Use a sophisticated data collection tool

Tools such as Bright Data’s Web Scraper API allow you to specify exactly which data fields to collect, thereby avoiding personal data as well as any other undesired datasets.
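The general idea of specifying fields up front can be sketched as an allowlist applied to each collected record. This is a generic illustration and does not represent the Web Scraper API or any Bright Data interface; `ALLOWED_FIELDS`, `select_fields`, and the sample record are hypothetical.

```python
# Hypothetical allowlist: only these fields are ever stored.
ALLOWED_FIELDS = {"product_name", "price", "rating"}


def select_fields(record: dict, allowed: set[str] = ALLOWED_FIELDS) -> dict:
    """Return a copy of record containing only the allowlisted fields."""
    return {key: value for key, value in record.items() if key in allowed}


raw = {
    "product_name": "Router X",
    "price": 79.99,
    "rating": 4.5,
    "reviewer_email": "a@example.com",  # personal data that should not be kept
}

clean = select_fields(raw)
print(sorted(clean))  # prints ['price', 'product_name', 'rating']
```

An allowlist is generally safer than a blocklist here, because any unexpected new field is dropped by default instead of being collected by accident.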

Keeping private data private is of the highest importance to Bright Data, which is why we have rolled out a tool that lets you find out whether your publicly available data was collected by Bright Data’s data collection platforms. You can then ask for this information to be removed, as part of our commitment to full transparency and legal compliance.

The bottom line

Making sure you only collect datasets that are ethical and compliant with regulations is extremely important to the long-term value of your business. Reduce risk by implementing one or more of the above-mentioned suggestions today. Interested in learning more about the products? Register now and start your free proxy trial or download free dataset samples!

Amitai Richman

Product Marketing Manager

Amitai is a Product Marketing Manager at Bright Data, responsible for the Web Scraper IDE product. He is committed to making public web data easily accessible to all, thereby keeping markets openly competitive, benefiting everyone.
