How to scrape a website without getting blocked or misled (cloaked)?
Why should I care?
When a target website detects crawlers from a proxy (data-center) IP, it typically:
- Blocks the IP
- Presents the IP with purposely misleading information
- Throttles the response rate
How does the target website identify my crawling activity?
Target websites log the IPs of whoever visits them and analyze the activity of these IPs. Assuming you are using a traditional data center proxy, the target website can:
- Identify that the activity from a single IP (the rate of requests) is much greater than what a human can accomplish in a given time frame
- Identify that the IP address originated from a proxy server list, which these target websites have access to
- Identify that the IPs share the same subnet block range
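As a rough illustration of the rate and subnet checks above, here is a minimal sketch of how a site might flag traffic from its access log. The function name `flag_suspicious`, the thresholds, and the choice of a /24 block are assumptions made for this example, not a description of how any particular website works. It also assumes IPv4 addresses and a log that covers a single time window.

```python
from collections import Counter
import ipaddress


def flag_suspicious(log, max_requests_per_ip=100, max_ips_per_subnet=3):
    """Return (busy_ips, crowded_subnets) for one time window of log entries.

    log: iterable of (ip_string, timestamp) tuples. Thresholds are hypothetical.
    """
    # Count how many requests each IP made in the window.
    per_ip = Counter(ip for ip, _ in log)
    busy_ips = {ip for ip, n in per_ip.items() if n > max_requests_per_ip}

    # Group the distinct IPs by their /24 block and count IPs per block.
    subnets = Counter(
        ipaddress.ip_network(f"{ip}/24", strict=False) for ip in per_ip
    )
    crowded_subnets = {str(net) for net, n in subnets.items() if n > max_ips_per_subnet}
    return busy_ips, crowded_subnets
```

A single IP that sends far more requests than a person could, or several visitors clustered in one subnet block, would both show up in the returned sets.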
How to prevent being detected?
- To prevent being detected by the number of requests per IP, you can reduce the number of requests per second. However, this will reduce your crawling speed. Learn about super fast crawling capacities here.
- To prevent the target website from identifying your IP as coming from a proxy server, you must rotate your requests through residential IPs. You should rotate through enough IPs that the target website cannot detect your activity.
- To prevent being detected by subnet block range, use residential IPs, which do not share a single subnet block range.
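The first two points can be sketched together: cap the request rate and rotate each request through a pool of proxy addresses. This is only a sketch under stated assumptions. The function name `schedule_requests`, its parameters, and the proxy strings are hypothetical, and the code only plans which proxy each URL goes through. It does not send any traffic.

```python
import itertools
import time


def schedule_requests(urls, proxies, max_per_second, sleep=time.sleep):
    """Yield (url, proxy) pairs, pausing between them to cap the request rate.

    proxies is cycled round-robin, so consecutive requests use different IPs.
    The sleep argument is injectable so the pacing can be tested without waiting.
    """
    delay = 1.0 / max_per_second
    rotation = itertools.cycle(proxies)
    for i, url in enumerate(urls):
        if i:
            # Wait before every request except the first one.
            sleep(delay)
        yield url, next(rotation)
```

Each yielded pair can then be handed to whichever HTTP client you use. The larger the proxy pool, the fewer requests each individual IP sends within a given time frame.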
You can learn more from this guide on how to prevent getting blacklisted or blocked when crawling a website.
With a traditional proxy solution, it’s only a matter of time before the target website identifies your crawling activity and either blocks you or serves you the wrong information.