Make your crawler be seen as a 'real-user' and not get blocked
- Select the right tools for browser automation
- Selenium vs. Puppeteer, pros and cons
- Set up Selenium and Puppeteer to work with a proxy
- Proxy manipulation for an automated crawler
- Set up your Proxy Manager with Selenium and Puppeteer
Don't want to watch the webinar? Read it.
How do sites know when you are using a "bot" or "crawler"?
Mostly it is due to cookies, the browser user agent and your IP.
When getting or posting information on a website, the visited website saves cookies on your browser.
The website recognizes a real browser by checking the IP address and reading the request headers which include information about the user agent.
Missing cookies or an incorrect user agent may trigger the website to block you from retrieving information.
Both can be programmed to ensure this doesn't happen.
Your IP, however, is the one thing that can't be coded, since it is part of the network infrastructure.
I'll go to 'whoer.net' and check what information my computer is sending over the web.
So currently I'm browsing from our Bright Data Headquarters in Netanya, Israel over 013 ISP.
WebRTC is used for real-time communication from browser to browser and retrieves my IP, port, and protocol to compare with the network data.
Also, your computer's time can be collected and compared to the time zone of your IP.
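To make the time comparison concrete, here is a minimal sketch of how a site might check it. This is only an illustration: the function name `clock_matches_ip` and the sample offsets are my own assumptions, and real websites may use very different logic.

```python
from datetime import datetime, timedelta, timezone


def clock_matches_ip(browser_time: datetime, ip_utc_offset: timedelta) -> bool:
    """Return True when the browser's UTC offset equals the offset expected for the IP.

    Hypothetical check: browser_time is the timezone-aware time reported by the
    browser, ip_utc_offset is the UTC offset of the place the IP geolocates to.
    """
    offset = browser_time.utcoffset()
    if offset is None:
        raise ValueError("browser_time must be timezone-aware")
    return offset == ip_utc_offset


# Sample values only: a browser clock set to UTC+2.
browser_time = datetime(2024, 1, 15, 12, 0, tzinfo=timezone(timedelta(hours=2)))

print(clock_matches_ip(browser_time, timedelta(hours=2)))   # IP in the same offset
print(clock_matches_ip(browser_time, timedelta(hours=-5)))  # IP geolocates elsewhere
```

With these sample values, the first call prints True and the second prints False, which is the kind of mismatch a proxy in another region can produce.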
I can also find the *request header* sent to the target website in the browser console by pressing F12 and going to the Network tab in Chrome, then pressing Ctrl+R to reload the page so the requests are recorded.
Clicking on the document, I will find the header information and the request I sent to whoer.net.
The request header contains the cookies and the user agent, which identifies the browser type and version.
Here are my user agent values, and as you can see the cookie of 'whoer' is also here.
Target websites that compare my PC information and check sessions with cookies require that, when I send requests over an API, I provide a new cookie every few requests.
I also need to send the exact same user agent and sometimes even the same accept-language.
This is because a broken or missing user agent may by itself trigger an error response from the target website.
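When I send requests from a script instead of a browser, I have to set these headers myself. Here is a minimal sketch using Python's standard library. The function name `build_request`, the user agent string, and the cookie value are placeholders I made up for illustration; in practice you would copy real values from your own browser's Network tab.

```python
from typing import Dict
from urllib.request import Request


def build_request(url: str, user_agent: str, cookies: Dict[str, str],
                  accept_language: str = "en-US,en;q=0.9") -> Request:
    """Build (but do not send) a request that carries browser-like headers."""
    headers = {"User-Agent": user_agent, "Accept-Language": accept_language}
    # Cookies travel in a single "Cookie" header as name=value pairs.
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    if cookie_header:
        headers["Cookie"] = cookie_header
    return Request(url, headers=headers)


# Placeholder values, not real browser data.
request = build_request(
    "https://whoer.net/",
    user_agent="Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    cookies={"session": "abc123"},
)
print(request.header_items())
```

Rotating the cookie every few requests then means calling `build_request` with a fresh `cookies` value while keeping the same `user_agent`.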
I can manage complex scraping operations by using automation tools such as Selenium or Puppeteer, which actually open a real browser, or even use an open-source headless browser like Chromium.
Browsers are a tool I can use if I need to run JS or don't want to write complex request chains myself.
The downside is that running a browser is slower and takes more RAM than using a custom script.
Automation on top of a real browser makes scraping simpler, as it means I don't need to bother collecting a database of cookies to rotate or correct user agents.
This will increase my success rate with target websites that are checking for these.
Now you may be thinking, why a headless browser?
Mainly to run the scraping automatically.
A headless browser also lacks Flash Player and digital rights management components, which can send more information about my PC, so I can easily increase my success rate by not having them at all.
Which automation tool should I work with?
Well, this depends on your technical skills and the website you are targeting.
For example, I will compare Puppeteer, an easy-to-use automation tool, to Selenium, which requires more technical expertise.
Puppeteer, developed by Google, is easy to install with one command line:
npm install puppeteer
It supports only Chromium-based browsers and it is based on Node.
Very quickly I am able to start working with Puppeteer and test a quick demo.
On the other hand, Puppeteer does not support cross-browser automation, whereas Selenium is the most flexible tool in terms of automation and functionality out there.
Selenium will work on most target websites and can be automated for any reasonable scenario.
However, installing and using Selenium requires some technical skills in terms of understanding web technologies and APIs.
With Selenium WebDriver you can upload or download files, work with pop-ups and overcome dialog barriers.
Puppeteer works only with Chrome and Chromium, whereas Selenium supports all common browsers, such as Chrome, Firefox, and Internet Explorer, as well as headless browsers.
I would like to mention that Puppeteer and Selenium DO NOT support native mobile applications out of the box.
However, Appium, developed by the JS Foundation, or Selendroid, developed by the eBay Software Foundation, will allow you to automate native mobile applications.
Both Selendroid and Appium require Selenium and knowledge of how to work with the Selenium API and WebDriver.
How to Connect Puppeteer with Proxy
There are two steps to follow to route your traffic through the proxy IPs:
- First, route through the proxy server and port
- Second, authenticate the browser page with the proxy Zone, username, and password
In this example, I will connect to our Bright Data super proxy when launching Puppeteer by defining the proxy server as
zproxy.lum-superproxy.io and port
For 'newPage' I'm adding the page authentication credentials of the proxy Zone.
These consist of the Zone's full username and password.
If I remove the authentication credentials and run my Puppeteer example, the browser opens and asks me for them.
Now I paste my username here and my password here, which opens Bright Data's homepage in the browser and takes a screenshot. I'll bring back my credentials and run my example again.
And I get the page screenshot without being asked for credentials.
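The same two steps, proxy server and port first and Zone credentials second, are not specific to Puppeteer. As a rough sketch outside the webinar's demo, here they are with Python's standard library rather than Puppeteer. The function names, the port, the username, and the password below are all placeholders I introduced; take the real values from your own dashboard.

```python
from urllib.parse import quote
from urllib.request import OpenerDirector, ProxyHandler, build_opener


def proxy_url(host: str, port: int, username: str, password: str) -> str:
    """Combine step 1 (server and port) with step 2 (credentials) into one proxy URL."""
    # Percent-encode the credentials so characters such as "@" or ":" stay unambiguous.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def opener_via_proxy(host: str, port: int, username: str, password: str) -> OpenerDirector:
    """Return an opener that routes HTTP and HTTPS requests through the proxy."""
    url = proxy_url(host, port, username, password)
    return build_opener(ProxyHandler({"http": url, "https": url}))


# Placeholder port and credentials, for illustration only.
example = proxy_url("zproxy.lum-superproxy.io", 12345, "my-zone-username", "p@ssword")
print(example)
```

Calling `opener_via_proxy(...).open("https://example.com")` would then send the request through the proxy, with the credentials supplied up front instead of typed into a browser prompt.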
Where can I find my Zone username and password?
Just go to your Bright Data dashboard, and click on API examples.
Here I'll select the Zone and it will show me the relevant username and password.
The user name consists of three parts:
- My Bright Data account ID
- And the name of the Zone
Here is your Zone name, which you can change when you need to.
I would like to remind you that the password is unique for each Zone and this should be copied into your code as well.
The password can also be found and updated in the settings of each Zone.
In Selenium, I will start by setting the proxy and authentication credentials in the WebDriver function.
Here I will add my proxy
zproxy.lum-superproxy.io and my port
And now the username and password that we found before.
I will run the example and as you can see, it takes a screenshot of Bright Data's homepage just like we did before.
The most essential tool for proxy manipulation is the Proxy Manager that is installed locally on your machine or VM.
Why do I need the Proxy Manager?
- It will retry my requests in case of failures or error codes
- I can automatically blacklist IPs
- I can route my requests through residential, data-center, and mobile IP networks
- I can create rules for rotating and refreshing IPs
- It allows for geo and ISP targeting
- It reduces response bandwidth
- It allows you to save a pool of the fastest IPs
- And it provides a complete request history with debug information for troubleshooting
You can download the Proxy Manager from Bright Data's dashboard by going to the Proxy Manager tab found on the left, or from the Proxy Manager page.
Starting the Proxy Manager will open a black command window or terminal that shows debugging information. You must keep this window open at all times, since closing it will stop the Proxy Manager and terminate all communication with the super proxy.
In the browser go to http://127.0.0.1:22999/ which is the address of the Proxy Manager and will take you to the Proxy Manager dashboard.
When sending a request, the Proxy Manager is a middleman between my automated browser (Selenium or Puppeteer in our case) and the super proxy.
Before starting, I will download the Proxy Manager SSL certificate and install it. This will allow me to view HTTPS requests and debugging information and to apply rules to HTTPS traffic.
Starting with creating a new proxy port, I'll select my Zone and a preset configuration. As you can see, we have created a few different options for you based on your needs; just hover over each option and it will tell you what it is best for.
I'll go to the Targeting tab and select the country and city I want the IP to come from. This can even be a specific ISP or mobile carrier.
At the Request speed tab, I can choose to resolve remotely, by the peer. This means the translation of the URL to an IP address will be made on the peer side, in other words by our real user in the location of interest.
Also, I can set a number of parallel requests when a specific load and time is needed.
Here under the Rules tab, I can choose a rule type and how I want it to be handled.
For this example, I will choose error code 504. If I get this error code, I want the Proxy Manager to automatically refresh the IP.
I can even add a second rule, saying that when I hit a 403 error code I want to automatically retry with a new IP or choose the waterfall.
The waterfall allows you to retry with a different type of IP or network, and here I will use our mobile IP network.
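To show the idea behind these rules, here is a small sketch of the logic in plain Python. It is not the Proxy Manager's configuration format or code; the action names `refresh_ip` and `retry_new_ip`, the `RULES` table, and the `run_with_rules` helper are names I made up to illustrate how an error code maps to an action before the request is tried again.

```python
from typing import Callable, Dict, List, Tuple

# Illustrative rule table: status code -> action to take before the next attempt.
RULES: Dict[int, str] = {
    504: "refresh_ip",    # rule 1: on 504, refresh the IP
    403: "retry_new_ip",  # rule 2: on 403, retry with a new IP
}


def run_with_rules(send: Callable[[int], int], max_attempts: int = 3) -> Tuple[int, List[str]]:
    """Call send(attempt) until no rule matches the status or attempts run out.

    Returns the last status code and the list of actions that were triggered.
    """
    actions: List[str] = []
    status = 0
    for attempt in range(max_attempts):
        status = send(attempt)
        action = RULES.get(status)
        if action is None:
            break  # no rule matched, so the response is passed through
        actions.append(action)
    return status, actions


# Simulated responses instead of real network calls.
responses = [504, 403, 200]
status, actions = run_with_rules(lambda attempt: responses[attempt])
print(status, actions)
```

With the simulated responses above, the first attempt triggers the refresh, the second triggers the retry with a new IP, and the third succeeds with 200.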
When working with HTTPS, remember to go to the General tab and enable SSL logs. This lets you track your success rate, view debugging details, and get error information for troubleshooting.
Connecting the Proxy Manager to Selenium and Puppeteer is done by updating the proxy-server URL to point to your local machine.
I'll go to my Puppeteer example and change the server from the Bright Data super proxy to the Proxy Manager on my local machine, using 127.0.0.1 and the port I created earlier in the Proxy Manager dashboard, which was port 24000.
When working with Bright Data's Proxy Manager there is no need for authentication credentials in my code since the Proxy Manager is already authenticated with Bright Data's Super Proxy.
Therefore I will remove the credentials.
Going to our Selenium example I'll do the same for connecting Selenium to the Proxy Manager, again connecting it through
IP 127.0.0.1 and port
And again remove the username and password.
Don't forget to press Save.
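For reference, pointing Selenium at the local Proxy Manager could look roughly like the sketch below, written with Selenium's Python bindings. It assumes Chrome and the third-party `selenium` package are installed and the Proxy Manager is running on port 24000; the helper names are my own, and this is not the exact script from the demo.

```python
from typing import List

PROXY_MANAGER_HOST = "127.0.0.1"
PROXY_MANAGER_PORT = 24000  # the port created earlier in the Proxy Manager dashboard


def chrome_proxy_arguments(host: str, port: int, headless: bool = True) -> List[str]:
    """Build Chrome command-line flags that send traffic through a local proxy."""
    arguments = [f"--proxy-server=http://{host}:{port}"]
    if headless:
        arguments.append("--headless=new")
    return arguments


def screenshot_via_proxy_manager(url: str, output_path: str) -> None:
    """Open url through the Proxy Manager and save a screenshot to output_path."""
    # Imported here so the helper above works even without Selenium installed.
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.ChromeOptions()
    for argument in chrome_proxy_arguments(PROXY_MANAGER_HOST, PROXY_MANAGER_PORT):
        options.add_argument(argument)

    # No username or password: the Proxy Manager is already authenticated.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(output_path)
    finally:
        driver.quit()
```

Calling `screenshot_via_proxy_manager` with the page you want and an output file name such as `homepage.png` would then reproduce the screenshot step from the demo, this time routed through the Proxy Manager.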
As we get to the end of our webinar, I hope you see how web automation can be achieved with simple or more advanced tools, depending on what you need.