User-Agents For Web Scraping 101

Using the correct user agent when performing data scraping tasks is crucial to your success in collecting your target data while avoiding being blocked. This is the only guide you will need to get started.
Josh Vanderwillik | Product Manager
03-Dec-2020

In this post you will learn:

What is a user agent?

The term “user agent” refers to any piece of software that facilitates end-user interaction with web content. A user agent (UA) string is a line of text that the client software sends with each request, in the User-Agent HTTP header.

The user agent string helps the destination server identify which browser, type of device, and operating system is being used. For example, the string tells the server you are using Chrome browser and Windows 10 on your computer. The server can then use this information to adjust the response for the type of device, OS, and browser.
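To make this concrete, here is a minimal sketch of how a server might read a UA string. The function name `summarize_user_agent` and the substring rules are assumptions made for illustration; real servers and parsing libraries handle far more cases.

```python
def summarize_user_agent(ua: str) -> dict:
    """Roughly classify a UA string by substring checks (illustrative only)."""
    # Android UAs also contain "Linux", and iPhone UAs contain "like Mac OS X",
    # so the more specific tokens are checked first.
    if "Android" in ua:
        os_name = "Android"
    elif "iPhone" in ua or "iPad" in ua:
        os_name = "iOS"
    elif "Windows" in ua:
        os_name = "Windows"
    elif "Mac OS X" in ua:
        os_name = "macOS"
    elif "Linux" in ua:
        os_name = "Linux"
    else:
        os_name = "unknown"

    # Edge and Chrome both include "Chrome/" and "Safari/", so order matters here too.
    if "Edg/" in ua:
        browser = "Edge"
    elif "Chrome/" in ua:
        browser = "Chrome"
    elif "Firefox/" in ua:
        browser = "Firefox"
    elif "Safari/" in ua:
        browser = "Safari"
    else:
        browser = "unknown"

    device = "mobile" if "Mobile" in ua else "desktop"
    return {"os": os_name, "browser": browser, "device": device}


# An example Chrome-on-Windows string (the version number is illustrative).
SAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
print(summarize_user_agent(SAMPLE_UA))
```

A server could use a summary like this to choose between a mobile and a desktop layout, which is the kind of adjustment described above.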

Most browsers send a user agent header in the following format, though there’s not much consistency in how user agents are chosen:

User-Agent: Mozilla/5.0 (<system-information>) <platform> (<platform-details>) <extensions>


Every browser adds its own comment components, such as the platform or rv: (release version). Mozilla offers examples of strings to be used for crawlers:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


You can learn more about the different strings you can use for the Mozilla browser on their developers’ site.

Below you can find examples from Chrome’s developer site on how the UA string format looks for different devices and browsers:

Chrome for Android

Phone UA:

Mozilla/5.0 (Linux; <Android Version>; <Build Tag etc.>) AppleWebKit/<WebKit Rev> (KHTML, like Gecko) Chrome/<Chrome Rev> Mobile Safari/<WebKit Rev>


Tablet UA:

Mozilla/5.0 (Linux; <Android Version>; <Build Tag etc.>) AppleWebKit/<WebKit Rev> (KHTML, like Gecko) Chrome/<Chrome Rev> Safari/<WebKit Rev>


Why should you use a user agent?

When you are web scraping, you will sometimes find that the web server blocks certain user agents. This is mostly because it identifies the origin as a bot, and certain websites don’t allow bot crawlers or scrapers. More sophisticated websites do this the other way around, i.e., they only allow user agents they consider valid to perform crawling jobs. The most sophisticated ones check that the browser behavior actually matches the user agent you claim.

You may think that the correct solution would be not setting a user agent in your requests. However, this causes tools to use a default UA. In many cases, the destination web server has it blacklisted and blocks it.
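As a quick illustration of such a default, the sketch below inspects the UA that Python’s standard `urllib` attaches when you don’t set one. Other tools (cURL, Requests, and so on) have their own defaults; only the `urllib` case is shown here.

```python
import urllib.request

# An opener built with no custom headers carries urllib's default User-Agent.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The value has the form "Python-urllib/<version>", which is trivial
# for a server to recognize as a script rather than a browser.
print(default_headers["User-agent"])
```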

So how do you ensure your user agent doesn’t get banned?

Tips to avoid getting your UA banned when scraping:

#1: Use a real user agent

If your user agent doesn’t belong to a major browser, some websites will block its requests. Many bot-based web scrapers skip the step of defining a UA, with the consequence of being detected and banned for sending a missing or default UA.

You can avoid this problem by setting a widely used UA for your web crawler. You can find a large list of popular user agents here. You can compile a list of popular strings and rotate them across your requests, for example when performing cURL requests to a website. Nevertheless, we recommend using your own browser’s user agent, because your browser behavior is more likely to match what is expected from that user agent if you don’t change it too much.
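Setting a browser-style UA might look like the following minimal sketch, which uses Python’s standard `urllib`. The UA string, the `build_request` helper, and the `example.com` URL are illustrative assumptions; in practice you would copy the string your own browser sends.

```python
import urllib.request

# Illustrative Chrome-on-Windows string; replace it with your own browser's UA.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Create a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names via str.capitalize(), hence "User-agent" here.
print(req.get_header("User-agent"))
```

The request is only built, not sent; passing `req` to `urllib.request.urlopen` would perform the actual fetch.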

#2: Rotate user agents

When you make numerous requests while web scraping, you should randomize them. This will minimize the possibility of the web server identifying and blocking your UAs.

How do you randomize requests?

One solution is to change the request IP address using rotating proxies while also sending a different set of headers each time. On the web server end, it will look like the requests are coming from different computers and different browsers.

Pro tip: A user agent is a header, but headers include much more than just user agents. You can’t just send random headers; you need to make sure that the other headers you send match the user agent you claim.

You can use botcheck.luminatio.io to check if the headers you’re sending match what’s expected for the user agent.

How to rotate user agents

First, you need to collect a list of user agent strings. We recommend using strings from real browsers, which can be found here. The next step is to add the strings to a Python list. Finally, have every request pick a random string from the list.

You can see an example of how to rotate user agents using Python 3 and Selenium 4 in this Stack Overflow discussion.
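The three steps above can be sketched as follows. Note that this is not the Selenium code from the linked discussion: it is a simplified stand-in using only Python’s standard library, and the UA strings, the `request_with_random_ua` helper, and the URL are illustrative assumptions.

```python
import random
import urllib.request

# Step 1 and 2: a list of UA strings from real browsers (versions are illustrative).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
    "Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


def request_with_random_ua(url: str) -> urllib.request.Request:
    """Step 3: build a request whose User-Agent is picked at random from the list."""
    user_agent = random.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Each call may carry a different UA; nothing is sent over the network here.
for _ in range(3):
    req = request_with_random_ua("https://example.com/")
    print(req.get_header("User-agent"))
```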

Whichever program or method you choose to use to rotate your UA headers, you should follow the same techniques to avoid getting detected and blocked:

  • #1: Rotate a full set of headers that are associated with each UA
  • #2: Send headers in the order a real browser typically would
  • #3: Use the previous page you visited as the Referer (referrer) header

Pro tip: You need to make sure the IP address and cookies don’t change when using a referrer header. Ideally, you’d actually visit the previous page so that there is a record of it on your target server.
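A minimal sketch of these three techniques, under stated assumptions: each profile below is an ordered list of headers that belong together, and the previously visited page is added as the Referer. The header values, their order, and the `pick_headers` helper are illustrative and have not been verified against a specific browser version; capture the real values from your own browser’s network tools.

```python
import random

# Each profile keeps a UA together with headers that fit it, in a fixed order.
HEADER_PROFILES = [
    [
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                       "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"),
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
        ("Accept-Language", "en-US,en;q=0.9"),
        ("Accept-Encoding", "gzip, deflate, br"),
    ],
    [
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
                       "Gecko/20100101 Firefox/121.0"),
        ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
        ("Accept-Language", "en-US,en;q=0.5"),
        ("Accept-Encoding", "gzip, deflate, br"),
    ],
]


def pick_headers(previous_url=None):
    """Return one full header profile, plus a Referer if a previous page is known."""
    headers = list(random.choice(HEADER_PROFILES))  # copy so the profile is not mutated
    if previous_url:
        headers.append(("Referer", previous_url))
    return headers


headers = pick_headers(previous_url="https://example.com/page-1")
for name, value in headers:
    print(f"{name}: {value}")
```

Whether the order survives on the wire depends on the HTTP client you use, so it is worth checking what your client actually sends.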

#3: Rotate user agents using a proxy

You can avoid the hassle of manually defining lists and rotating IPs by using a rotating proxy network. Proxies can be set up to rotate IPs and UA strings automatically. This means that your requests look like they originated from a variety of web browsers. This significantly reduces blocks and increases success rates, as requests appear to have originated from real web users. Keep in mind that only very specific proxies that employ Data Unlocking technology have the ability to properly manage and rotate your user agents.
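On the client side, routing requests through such a proxy can be as small as the sketch below, which uses Python’s standard `urllib`. The proxy address and the `build_proxied_opener` helper are hypothetical placeholders; the real host, port, and credentials come from your proxy provider, and the rotation itself happens on the provider’s side.

```python
import urllib.request

# Hypothetical proxy endpoint; substitute the address your provider gives you.
PROXY_URL = "http://proxy.example.com:8080"


def build_proxied_opener(proxy_url: str = PROXY_URL) -> urllib.request.OpenerDirector:
    """Build an opener that sends both HTTP and HTTPS traffic through the proxy."""
    proxy_handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy_handler)


opener = build_proxied_opener()
# opener.open("https://example.com/") would now go through the proxy.
# It is left commented out because the placeholder proxy does not exist.
```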

The Bottom Line

Since most websites block requests that lack a valid or recognizable browser user agent, learning how to properly rotate UAs is important for avoiding site blocks. Using the correct user agent will tell your target website that your request came from a valid origin, enabling you to freely collect data from your desired target sites.


Josh is a product manager at Bright Data working on next-gen technology,
specifically in the field of automated data collection: building fingerprint-proof, high
scale web crawlers that are simple to use. He is an active participant in global
webinars which help companies learn cutting edge data collection techniques, and
is now expanding that knowledge base through blogging.
