To Collect Or Not To Collect, That Is The Question

This guide will provide you with the best techniques for keeping your data collection in line with privacy rules and best practices. From performing highly targeted data collection to reviewing robots.txt files, we've got you covered.
Amitai Richman | Product Marketing Manager

In this article we will discuss:

  • Which website data can be problematic to collect
  • Best ways to ensure your data collection is done correctly

Which website data can be problematic to collect? 

You should avoid collecting:

  • Password-protected data
  • Copyright-protected information
  • Personally Identifiable Information (PII), such as names, email addresses, dates of birth, phone numbers, and billing information

Collecting this type of data may have significant legal and financial implications for your company under the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), both of which stipulate that companies may be fined for non-compliance.

**None of the content in this post constitutes legal advice. Before making any changes or decisions that affect the way in which you collect data or the type of data you collect, please consult legal counsel.**

Best ways to ensure your data collection is done correctly 

#1: Perform targeted data collection

Instead of collecting huge volumes of data or entire websites, which may contain private data, pinpoint which data is essential to the project you are collecting for, and collect only that. For example, instead of collecting entire social media profiles, collect only the posts and comments pertaining to your product or industry in order to gauge target audience sentiment.
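One simple way to enforce this in code is to whitelist the fields your project needs and discard everything else before storage. The sketch below assumes records arrive as Python dicts from some upstream scraper; the field names are hypothetical.

```python
# Hypothetical whitelist of fields the sentiment project actually needs.
ALLOWED_FIELDS = {"post_text", "timestamp", "topic"}

def filter_record(record: dict) -> dict:
    """Keep only whitelisted fields, silently dropping everything else."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

raw = {
    "post_text": "Loving the new widget!",
    "timestamp": "2023-01-05T10:00:00Z",
    "author_email": "jane@example.com",  # PII -- dropped by the filter
    "date_of_birth": "1990-01-01",       # PII -- dropped by the filter
}
clean = filter_record(raw)
```

Because the filter is an allow-list rather than a block-list, any new field that appears on the target site is excluded by default until you deliberately add it.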

#2: Only collect publicly available data 

Data collection can be tricky. Many open source data points may be technically challenging to collect due to complex target site architecture, but this data is public and legal to crawl. However, if the data is password-protected or defined by law as PII, meaning it is indicative of an individual’s personal identity, it should be avoided. Put data collection policies and procedures in place that guarantee your collectors only monitor open source data.
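A policy like this can be backed by an automated screen that flags likely PII before anything is stored. The sketch below is illustrative only: the regular expressions catch common email and phone formats, not the full legal definition of PII.

```python
import re

# Rough, illustrative patterns -- not an exhaustive legal definition of PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def contains_pii(text: str) -> bool:
    """Flag text that appears to contain an email address or phone number."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))
```

In practice a check like this would run in the collection pipeline, routing flagged records for review or deletion rather than straight into storage.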

#3: Review target site Robots.txt files

Most websites include a robots.txt file, which essentially defines the on-site dos and don’ts for robots, spiders, and crawlers. It is a plain-text file in the site’s root directory, which can be found by adding ‘/robots.txt’ to the end of any public domain. Be sure to check it and ensure your web crawlers follow its guidelines when crawling target sites.

#4: Use a sophisticated data collection tool 

Tools such as Bright Data’s Web Scraper IDE allow you to specify exactly which data fields to collect, helping you avoid personal data as well as any other undesired datasets.

Keeping private data private is of the highest importance to Bright Data, which is why we have rolled out a tool that lets you check whether your publicly available data was collected by Bright Data’s data collection platforms. You can then request that this information be removed, as part of our commitment to full transparency and legal compliance.

The bottom line 

Making sure you only collect datasets that are ethical and compliant with regulations is essential to the long-term value of your business. Avoid risk by implementing one or all of the suggestions above today.


Amitai is a Product Marketing Manager at Bright Data, responsible for the Web Scraper IDE product. He is committed to making public web data easily accessible to all, thereby keeping markets openly competitive, benefiting everyone.
