In this article we will discuss:
What makes a data collection network ethical from a technical standpoint?
[1] Performing Know Your Customer (KYC) procedures
- All potential network users should undergo a strict vetting process conducted by a senior employee and/or a Compliance Officer.
- If a potential customer is a company, the following items should be reviewed: (i) the company registration, to ensure the company is real, (ii) the company’s website, (iii) the company’s email domain, and (iv) corporate social media profiles.
- For potential freelance customers, video interviews and physical proof of identification should be a prerequisite. Data collection networks should also confirm that the physical address is legitimate and that the location of the IP address matches the credit card billing address.
[2] Blocking actionable API endpoints
Ethical data collection networks should block API endpoints that could be misused for abusive activities, including:
- Creating fake accounts (on social media, on review sites, at financial institutions, etc.)
- Ad fraud (e.g. click fraud)
- Fictitious reviews (including fake product ratings, service reviews, and mass voting)
[3] Overseeing global network usage
Global network usage should be monitored to ensure that it never comes close to Distributed Denial-of-Service (DDoS) attack rates. If traffic rates start climbing, the traffic should automatically be throttled down.
The traffic monitor should not only cover the traffic of a specific customer towards a specific target domain. It should cover the aggregated traffic of all customers, on all products, towards that target domain, so that an unintentional DDoS cannot occur.
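The aggregated monitoring described above can be sketched as a sliding-window limiter keyed by target domain only. This is a minimal illustration, not a description of any provider's actual system: the class name AggregateDomainThrottle, its parameters, and the in-memory storage are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class AggregateDomainThrottle:
    """Sliding-window limiter keyed by target domain only (hypothetical sketch).

    Requests from every customer count against the same per-domain window,
    so no combination of customers can push a domain past the cap.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        # domain -> deque of (timestamp, customer_id) for admitted requests
        self._hits = defaultdict(deque)

    def allow(self, customer_id, domain):
        now = self.clock()
        hits = self._hits[domain]
        # Drop entries that have aged out of the window.
        while hits and now - hits[0][0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # throttled, regardless of which customer is asking
        hits.append((now, customer_id))
        return True


throttle = AggregateDomainThrottle(max_requests=3, window_seconds=10)
for customer in ("customer-a", "customer-b", "customer-c", "customer-d"):
    print(customer, throttle.allow(customer, "shop.example"))
```

The customer ID is stored only so that a throttling event can later be attributed. The decision itself depends on the domain's combined traffic, which is the point of the rule.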
Additionally, account managers should perform granular monitoring of client event logs once network permissions have been granted. If a discrepancy is found between a client’s KYC use case and their practical account activity, their account should be permanently terminated.
For example, a customer may claim to be performing website testing but instead attempt to use the network to carry out ad fraud. Through monitoring, compliance teams can identify such network abuse and stop it in its tracks.
[4] The per-site traffic limit rule – Do no harm
Data collection networks must ensure that their activity does not interfere with the site’s regular quality of service. Even if the collection task takes 10% of the site’s resources and does not come close to DDoS proportions, it can still degrade performance and skew the operational statistics the site gathers, causing the website’s product team to reach the wrong conclusions about their users’ behavior.
As such, data collection networks should study their targets and set per-domain limits according to each site’s standard operational traffic levels. This ensures that no harm is done to service tiers and helps keep the site’s usage statistics accurate.
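One way to derive a per-domain limit from a site's normal traffic is to cap collection at a small fraction of an estimated baseline. The sketch below assumes the baseline is a median of sampled requests per second. The function names, the 1% share, and the floor value are illustrative assumptions, not figures from this article.

```python
import statistics


def baseline_from_samples(samples_rps):
    """Estimate a site's normal load as the median of sampled requests/second."""
    return statistics.median(samples_rps)


def per_domain_limit(baseline_rps, max_share=0.01, floor_rps=0.1):
    """Return the collection cap (requests/second) for one target domain.

    max_share is an assumed fraction, chosen well below the 10% level that
    the text says can already distort a site's statistics. floor_rps is an
    assumed minimal rate used when the baseline is unknown or tiny.
    """
    if baseline_rps < 0:
        raise ValueError("baseline_rps must be non-negative")
    if not 0 < max_share <= 1:
        raise ValueError("max_share must be in (0, 1]")
    return max(floor_rps, baseline_rps * max_share)


baseline = baseline_from_samples([1800, 2100, 2000, 1950, 2400])
print(per_domain_limit(baseline))
```

A median is used rather than a mean so that a short traffic spike in the samples does not inflate the cap.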
[5] Blacklisting non-public domains
Ethical data providers should blacklist domains that do not contain public, open-source information and that can be targeted for abusive activities. Such activities may include:
- Payment server attacks – anything from illegal purchases using fake or stolen credentials to hacking and DDoS attacks.
- API server disruption – a direct attack on web servers, applications, or both.
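A domain blacklist of this kind can be sketched as a check that runs before any request is routed. The example below is a minimal sketch: the BLOCKED_DOMAINS entries are made-up placeholders, and the choice to also block every subdomain of a listed domain, and to reject URLs with no parseable host, are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical entries; a real list would be curated by a compliance team.
BLOCKED_DOMAINS = {"payments.example", "api.bank.example"}


def is_blocked(url, blocked=BLOCKED_DOMAINS):
    """Return True if the URL's host, or any parent domain of it, is listed."""
    host = urlsplit(url).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return True  # fail closed when the target cannot be identified
    labels = host.rstrip(".").split(".")
    # Test "a.b.c", then "b.c", then "c" against the list.
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


print(is_blocked("https://checkout.payments.example/pay"))
print(is_blocked("https://notpayments.example/"))
```

Matching on whole labels rather than on a string suffix means that a host such as notpayments.example is not caught by an entry for payments.example.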
[6] Peer consent
Legitimate data collection networks only route traffic through peer devices once the peer has actively consented to a detailed terms-of-use description. By default, the user is not opted in. This should be a fair exchange: data collection networks can route traffic through peer devices, and peers are compensated for the resources they share. Compensation may include a free upgraded subscription, an ad-free version of the app, or anything else that improves the user experience.
[7] Idle resources
Ethical data collection networks make it their business to only use peer resources (i.e. route traffic) under strict conditions, ensuring little to no change as far as User Experience is concerned. These conditions should ensure at a minimum that user devices:
- Are idle (not in use) when traffic is being routed
- Are connected to Wi-Fi, or use only very limited amounts of 3G/LTE data
- Have sufficient battery power
The median bandwidth per peer should vary according to geolocation. The global recommended average in practice should be 8 MB per peer, per day – i.e. half the size of any given Amazon product page.
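The conditions above can be expressed as a single eligibility check that runs before each routed request. This is a sketch under stated assumptions: the PeerState fields, the 50% battery threshold, and the 1 MB cellular allowance are invented for illustration, and the 8 MB figure, which the text gives as a recommended average, is treated here as a hard daily cap for simplicity.

```python
from dataclasses import dataclass

DAILY_BUDGET_BYTES = 8 * 1000 * 1000     # 8 MB per peer per day, from the text
CELLULAR_BUDGET_BYTES = 1 * 1000 * 1000  # assumed "very limited" 3G/LTE allowance


@dataclass
class PeerState:
    is_idle: bool
    on_wifi: bool
    battery_percent: int
    is_charging: bool
    bytes_routed_today: int


def can_route(peer, request_bytes, min_battery_percent=50):
    """Return True only if every idle-resource condition holds."""
    if not peer.is_idle:
        return False
    # A low battery is acceptable only while the device is charging.
    if peer.battery_percent < min_battery_percent and not peer.is_charging:
        return False
    # Simplification: one daily counter is compared against whichever
    # budget applies to the current connection type.
    budget = DAILY_BUDGET_BYTES if peer.on_wifi else CELLULAR_BUDGET_BYTES
    return peer.bytes_routed_today + request_bytes <= budget


peer = PeerState(is_idle=True, on_wifi=True, battery_percent=80,
                 is_charging=False, bytes_routed_today=7_500_000)
print(can_route(peer, 250_000))    # within the daily budget
print(can_route(peer, 1_000_000))  # would exceed the daily budget
```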
[8] Network limitations set
Ethical data collection platforms monitor and limit the traffic through individual peer devices so that it consumes negligible device resources compared to the user’s own usage. For example, if an average user visits several websites during the day, listens to music, and watches a few short videos, the usage of an ethical data collection platform would, by comparison, be equivalent to loading a single Amazon product page in a regular browser.
[9] Opt-in / Opt-out
Data collection networks must be based on a democratic peer-to-peer network. The individuals who make up this network need to be free to opt in, and to opt out, at any point in time. This is a basic tenet of internet transparency, which must be upheld to ensure a decentralized, free flow of data and information.
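The two consent rules in this article, that peers are not opted in by default and that they can opt out at any time, can be modeled as a small state object that every routing decision consults. The PeerConsent class and its method names below are hypothetical, and the in-memory history stands in for whatever audit storage a real network would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PeerConsent:
    """Consent state for one peer; routing is off until the peer opts in."""

    opted_in: bool = False  # default: not opted in
    history: list = field(default_factory=list)  # (event, terms_version, time)

    def opt_in(self, terms_version):
        # Record which terms-of-use version the peer actively accepted.
        self.opted_in = True
        self.history.append(("opt_in", terms_version, datetime.now(timezone.utc)))

    def opt_out(self):
        # Allowed at any point; takes effect for the next routing decision.
        self.opted_in = False
        self.history.append(("opt_out", None, datetime.now(timezone.utc)))

    def may_route(self):
        return self.opted_in


consent = PeerConsent()
print(consent.may_route())  # False until the peer opts in
consent.opt_in("terms-v1")
print(consent.may_route())  # True
consent.opt_out()
print(consent.may_route())  # False again
```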
[10] GDPR-compliant (security, storage, and PII)
Ethical data collection networks should adhere to GDPR rules including but not limited to:
- Only collecting IPs as PII, and only with full user consent. This protects users’ privacy: no other private information or behavioral statistics, and no other data from or about the user, is collected.
- Fully adhering to GDPR and CCPA rules for the security and storage of collected data, as GDPR data controllers and GDPR data processors
The bottom line
Bright Data adheres to all 10 ethical data collection commandments. Due diligence is recommended when selecting a data collection platform, in order to ensure:
- The long-term value of information
- The legal viability of the data, and derived analysis, products, and services
- The safety of your networks, systems and software
And finally, the ingredient that rises above all else is transparency: you must exhibit transparency and demonstrate trustworthiness. As a guideline, be open to frequent changes, and check and test your guidelines frequently. This is a rapidly evolving domain, and getting it right is not easy.