User-Agents For Web Scraping 101
What is a user agent?
The term refers to any piece of software that lets an end user interact with web content. A user agent (UA) string is a piece of text that the client software sends along with each request.
The user agent string helps the destination server identify the browser, device type, and operating system being used. For example, the string tells the server you are using the Chrome browser on Windows 10. The server can then use this information to tailor the response to that device, OS, and browser.
Most browsers send a user agent header in the following format, though there’s not much consistency in how user agents are chosen:
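A common general shape, as documented on Mozilla's MDN, is:

```
Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>
```

For instance, Firefox on Windows 10 sends something along the lines of `Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0` (the version numbers here are illustrative).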
Every browser adds its own comment components, such as the platform or rv: (the release version). Mozilla also offers examples of strings to be used by crawlers:
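For example, Googlebot, one of the crawler strings Mozilla documents, identifies itself as:

```
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
```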
You can learn more about the different strings used by Firefox on Mozilla's developer site.
Below is an example from Chrome's developer site of how the UA string format looks for a given device and browser:
Chrome for Android
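Per Chrome's documentation, the phone UA follows the template below (placeholders in angle brackets); underneath it is a representative string with illustrative version numbers:

```
Mozilla/5.0 (Linux; <Android Version>; <Build Tag etc.>) AppleWebKit/<WebKit Rev> (KHTML, like Gecko) Chrome/<Chrome Rev> Mobile Safari/<WebKit Rev>

Mozilla/5.0 (Linux; Android 10; SM-G960U) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36
```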
Why should you use a user agent?
When you are web scraping, you will sometimes find that the web server blocks certain user agents. This is usually because it identifies the origin as a bot, and certain websites don't allow bot crawlers or scrapers. More sophisticated websites do this the other way around, i.e., they only allow user agents they consider valid to perform crawling jobs. The really sophisticated ones check that the browser's behavior actually matches the user agent you claim.
You may think that the correct solution is simply not setting a user agent in your requests. However, this just causes your tool to fall back to a default UA, and in many cases the destination web server has that default blacklisted and blocks it.
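You can see this for yourself with a minimal sketch using Python's requests library and httpbin.org, a service that echoes your request headers back:

```python
import requests

# httpbin.org/headers returns the headers it received as JSON,
# which makes the library's default UA visible
resp = requests.get("https://httpbin.org/headers")
print(resp.json()["headers"]["User-Agent"])
# Prints something like "python-requests/2.31.0" -- exactly the kind
# of default UA that many web servers blacklist on sight
```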
So how do you ensure your user agent doesn’t get banned?
Tips to avoid getting your UA banned when scraping:
#1: Use a real user agent
If your user agent doesn’t belong to a major browser, some websites will block its requests. Many bot-based web scrapers skip the step of defining a UA, with the consequence of being detected and banned because of the missing or default UA.
You can avoid this problem by setting a widely used UA for your web crawler. You can find a large list of popular user agents here. You can compile a list of popular strings and rotate through them when making your requests, for example with cURL. Nevertheless, we recommend using your own browser’s user agent, because your browser’s behavior is more likely to match what is expected from that user agent if you don’t change it too much.
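As a minimal sketch with the requests library (the UA string below is an illustrative Chrome-on-Windows value; ideally you’d copy the exact string your own browser sends):

```python
import requests

# An illustrative real-browser UA string; substitute the one your
# own browser actually sends
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

resp = requests.get("https://example.com", headers=HEADERS)
print(resp.status_code)
```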
#2: Rotate user agents
When you make numerous requests while web scraping, you should randomize them. This will minimize the possibility of the web server identifying and blocking your UAs.
How do you randomize requests?
One solution is to change the request’s IP address using rotating proxies while also varying the headers you send. This way, each request arrives with a different IP and a different set of headers, so on the web server’s end it looks like the requests are coming from different computers and different browsers.
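A minimal sketch of manual proxy rotation with requests (the proxy addresses are hypothetical placeholders; substitute your provider’s endpoints):

```python
import random
import requests

# Hypothetical proxy endpoints -- replace with real gateways
PROXIES = [
    "http://user:pass@203.0.113.10:8000",
    "http://user:pass@203.0.113.11:8000",
    "http://user:pass@203.0.113.12:8000",
]

# Pick a different proxy (and thus a different exit IP) per request
proxy = random.choice(PROXIES)
resp = requests.get(
    "https://example.com",
    proxies={"http": proxy, "https": proxy},
)
```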
Pro tip: A user agent is a header, but headers include much more than just user agents. You can’t just send random headers; you need to make sure that the user agent you send matches the rest of the headers you’re sending.
You can use botcheck.luminati.io to check whether the headers you’re sending match what’s expected for your user agent.
How to rotate user agents
First, you need to collect a list of user agent strings. We recommend using strings from real browsers, which can be found here. The next step is adding the strings to a Python list. Finally, define that every request picks a random string from the list, as in the sketch below.
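Here is a minimal sketch of that flow using requests (the pool below is deliberately short and the strings are illustrative; in practice you’d load a much larger list of real-browser UAs):

```python
import random
import requests

# Illustrative real-browser strings; in practice, load a larger list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]

# Every request picks a random string from the list
resp = requests.get(
    "https://example.com",
    headers={"User-Agent": random.choice(USER_AGENTS)},
)
```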
You can see an example of how to rotate user agents using Python 3 and Selenium 4 in this Stack Overflow discussion. A sketch of that approach looks like this:
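This is a sketch in the same spirit, not a verbatim copy of the answer; the UA pool is illustrative:

```python
import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

USER_AGENTS = [
    # Reuse the same pool of real-browser strings as above
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

options = Options()
# Chrome accepts a UA override as a command-line argument
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
# Confirm the override took effect
print(driver.execute_script("return navigator.userAgent"))
driver.quit()
```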
Whichever program or method you choose for rotating your UA headers, you should follow the same techniques to avoid getting detected and blocked:
- #1: Rotate a full set of headers that are associated with each UA
- #2: Send headers in the order a real browser typically would
- #3: Set the previous page you visited as the Referer header (note that the HTTP header name is spelled ‘Referer’)
Pro tip: You need to make sure the IP address and cookies don’t change when using a Referer header. Ideally, you’d actually visit the previous page, so that there is a record of it on your target server. The sketch below shows these three techniques combined.
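A minimal sketch that combines all three, using a requests Session (which keeps cookies across requests); the header values are illustrative of what a Chrome-on-Windows browser typically sends, and note that low-level HTTP libraries won’t perfectly reproduce a real browser’s header order:

```python
import requests

session = requests.Session()
session.headers.clear()  # drop the library's defaults so only ours are sent

# #1: a full header set matched to the UA; #2: inserted in a
# browser-like order (requests preserves insertion order)
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",  # a real Chrome also advertises "br"
})

# #3: actually visit the "previous" page first, so the target server
# has a record of it, then send it as the Referer
session.get("https://example.com/")
resp = session.get(
    "https://example.com/target-page",
    headers={"Referer": "https://example.com/"},
)
```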
#3: Rotate user agents using a proxy
You can avoid the headache and hassle of manually defining UA lists and rotating IPs by using a rotating proxy network. Such proxies can set up automatic IP rotation and UA string rotation, so your requests look like they originated from a variety of web browsers. This sharply reduces blocks and increases success rates, since requests appear to come from real web users. Keep in mind that only very specific proxies that employ Data Unlocking technology are able to properly manage and rotate your user agents.
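From the client side, using such a network can be as simple as pointing requests at a single gateway (the endpoint below is a hypothetical placeholder; the provider rotates the exit IP, and with some services the UA and headers, behind it):

```python
import requests

# Hypothetical rotating-proxy gateway -- substitute your provider's
# real endpoint and credentials
GATEWAY = "http://username:password@gateway.example-proxy.com:22225"
proxies = {"http": GATEWAY, "https": GATEWAY}

for url in ["https://example.com/page1", "https://example.com/page2"]:
    # Each request can exit through a different IP without any
    # rotation logic on our side
    resp = requests.get(url, proxies=proxies)
    print(url, resp.status_code)
```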
The Bottom Line
Since most websites block requests that lack a valid or recognizable browser user agent, learning how to properly rotate UAs is important for avoiding blocks. Using the correct user agent tells your target website that your request came from a valid origin, enabling you to freely collect data from your desired target sites.