HTTP Headers for Web Scraping

Learn about the most common HTTP headers, why they are important for web scraping, and how to optimize them.

Context, whether cultural, environmental, or relational, is present in all communication, and context influences the effectiveness of communication. In web communication, HTTP headers are the technical context that web servers and clients exchange when sending HTTP requests or receiving HTTP responses. This context can be used to facilitate authentication, determine caching behavior, or manage session state. It also helps web servers determine the origin of the HTTP request and how to respond to it. This response could include rendering a website to meet the requirements of your client device or delivering data to you. When the latter is accomplished using a bot, the operation is referred to as web scraping, which is useful when you need to automatically obtain data from a website.

When building a scraper, it’s easy to overlook configuring your HTTP headers because the default values allow your requests to proceed. However, without properly configured HTTP headers, it’s difficult to maintain continuous communication between your scraper and the web server. This is because web servers can be set up to detect bots and automated scripts based on information in default HTTP headers, such as User-Agent, Referer, and Accept-Language.

However, if you correctly configure your headers, you can simulate normal user traffic, enhancing the reliability of your scraping operations. In this article, you’ll learn all about HTTP headers, their role in web scraping, and how to optimize them for effective data collection.

Why You Need HTTP Headers

HTTP headers are key-value pairs in requests and responses that are required for web communication. The web server receives information and instructions about the client and resource of interest via request headers. Meanwhile, response headers give the client more information about the fetched resource and the response received. While there are numerous HTTP headers, the following are some of the most important for web scraping:

User-Agent

User-Agent is a string that identifies the client you use to send a request. This string’s contents may include the type of application, operating system, software version, and software vendor.

By default, this header is set to a value that allows your scraper to be easily identified as a bot. For instance, if you want to scrape price data from an e-commerce website using a Python requests script, your scraper will send a User-Agent similar to the following in its HTTP header:

"python-requests/X.X.X"

You can avoid being detected by changing the User-Agent to mimic different browsers and devices. To do so, you need to replace the Python requests User-Agent header with the following:

    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

This new header includes information about the browser and the native platform on which it runs.
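As a minimal sketch, here’s how you might override the default User-Agent with the Python requests library (the target URL is the demo site used later in this article; any page works for illustration):

import requests

# Hypothetical example: replace the default python-requests User-Agent
# with a desktop Chrome string so the request looks like normal browser traffic.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

response = requests.get("https://books.toscrape.com/", headers=headers)
print(response.request.headers["User-Agent"])  # prints the User-Agent that was actually sent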

Accept-Language

The Accept-Language header lets you specify which language(s) you want to receive the requested resource in. If necessary, you can include the country code or alphabet type. For example, if you set Accept-Language to "en-US", it means that you expect the resource to be in English as spoken in the United States of America, even if you’re on another continent. You can also use the alphabet type to define the header as the Latin script version of the Serbian language by specifying "sr-Latn". This ensures that you retrieve the appropriate localized data.

When multiple languages are acceptable, the Accept-Language header becomes a comma-separated list of languages with quality values that define the priority order. An example of this is "en-GB;q=1.0, en-US;q=0.9, fr;q=0.8", where q ranges from 0 to 1 and higher values of q indicate higher priority.
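As a small sketch, you could send that preference list with the requests library; whether the response actually changes depends on the target server supporting language negotiation (the URL here is only a placeholder):

import requests

# Hypothetical example: prefer British English, then US English, then French.
headers = {"Accept-Language": "en-GB;q=1.0, en-US;q=0.9, fr;q=0.8"}

response = requests.get("https://example.com/", headers=headers)
# The Content-Language response header may be absent if the server does not negotiate.
print(response.headers.get("Content-Language"))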

Cookie

The Cookie header contains data that lets the web server identify a user session across multiple request-response cycles. While scraping, you can generate cookies on the client side (or use previously stored ones) and include them in the HTTP header of a new request. This enables the web server to associate your request with a valid user session and return the data needed. For example, if you need to make multiple requests to obtain user-specific data from an e-commerce website, you should include session cookies in the HTTP request Cookie header to keep your scraper logged in, hold relevant data, and avoid cookie-based bot detection systems.

The Cookie header consists of a list of one or more key-value pairs separated by a semicolon and a space ("; "). It typically takes the form "name0=value0; name1=value1; name2=value2".
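Instead of assembling the Cookie string by hand, you can also let a requests session carry cookies across requests automatically. Here’s a minimal sketch; the login URL and form fields are placeholders, not a real site’s API:

import requests

# Hypothetical sketch: a Session object stores cookies set by earlier responses
# and sends them back with later requests, keeping the scraper "logged in".
session = requests.Session()

# Any Set-Cookie headers returned here are saved in session.cookies...
session.post("https://example.com/login", data={"username": "demo", "password": "demo"})

# ...and automatically included in the Cookie header of subsequent requests.
response = session.get("https://example.com/account")
print(session.cookies.get_dict())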

Referer

Referer contains the absolute or partial URL of the page from which you requested the resource. For example, while scrolling through the home page of an e-commerce website, you may choose to click on a link that piques your interest. The Referer header in the HTTP request that opens the next web page points to the home page of the e-commerce website from which you initiated the request. If you navigate to other web pages from the current one, each previously viewed page serves as the Referer for the next. This is analogous to how referrals work in human interactions.

Naturally, some websites check this header as part of their antiscraping mechanism. That means if you want to simulate natural traffic flow from other websites and avoid blocking, you need to set the Referer header to a valid URL, such as the site’s home page or a search engine’s URL.

How to Optimize HTTP Headers for Web Scraping

When scraping, keep in mind that the data you need is valuable to its owners, and they may be hesitant to share it. As a result, many owners take steps to detect automated agents attempting to access their content. If they succeed, they can block you or return irrelevant data.

HTTP headers help you get around these security measures by making it appear as if your scraper is a regular internet user browsing their website. By correctly setting headers, such as User-Agent, Accept, Accept-Language, and Referer, you can effectively mimic normal web traffic while making it difficult for the web server to identify your bot as a scraper.

Retrieving and Setting Custom Headers

To demonstrate how you can optimize your HTTP headers, let’s use the Python requests library to scrape books in the Mystery category from the dummy e-commerce website Books to Scrape. Before doing this, you need to get HTTP headers from your browser’s developer tools.

To start, open the website in a new browser tab.

Then launch the developer tools in your browser. One way to do this is to right-click anywhere on the page and select Inspect; you can also find the developer tools in your browser’s tools menu. Then click the Network tab in the developer tools’ top menu.

With the Network tab open, check the box next to Disable cache. This lets you see the entire request header. Then click the link to the Mystery category in the list of categories on the website. This opens a page with books in that category, and, more importantly, a list of requests appears in the Network tab of the developer tools window.

Scroll to the top of the list and click on the first item. This opens a smaller pane in the developer tools. Scroll down to the Request Headers section.

Under Request Headers, you’ll find the HTTP request headers, particularly those that you just learned about. To use these headers with a scraper, create a Python script with variables for the User-Agent, Accept, Accept-Language, Cookie, and Referer headers:

import requests

referer = "https://books.toscrape.com/"
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8"
accept_language = "en-GB,en;q=0.6"
cookie = "zero-chakra-ui-color-mode=light-zero; AMP_MKTG_8f1ede8e9c=JTdCJTIycmVmZXJyZXIlMjIlM0ElMjJodHRwcyUzQSUyRiUyRnd3dy5nb29nbGUuY29tJTJGJTIyJTJDJTIycmVmZXJyaW5nX2RvbWFpbiUyMiUzQSUyMnd3dy5nb29nbGUuY29tJTIyJTdE; AMP_8f1ede8e9c=JTdCJTIyZGV2aWNlSWQlMjIlM0ElMjI1MjgxOGYyNC05ZGQ3LTQ5OTAtYjcxMC01NTY0NzliMzAwZmYlMjIlMkMlMjJzZXNzaW9uSWQlMjIlM0ExNzA4MzgxNTQ4ODQzJTJDJTIyb3B0T3V0JTIyJTNBZmFsc2UlMkMlMjJsYXN0RXZlbnRUaW1lJTIyJTNBMTcwODM4MjE1NTQ2MCUyQyUyMmxhc3RFdmVudElkJTIyJTNBNiU3RA=="
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"

custom_headers = {
    "User-Agent": user_agent,
    "Accept": accept,
    "Accept-Language": accept_language,
    "Cookie": cookie,
    "Referer": referer
}

In this code snippet, you import the requests library and define variables for each HTTP header as strings. Then you create a dictionary called custom_headers to map the HTTP header names to the defined variables.

Next, add the following code to the script to send an HTTP request without the custom headers and print the result:

URL = 'https://books.toscrape.com/catalogue/category/books/mystery_3/index.html'

r = requests.get(URL)
print(r.request.headers)

Here, you assign the mystery books URL to a variable. Then you call the requests.get method with this URL as the only parameter and print the request headers.

Your output should look like this:

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

As you can see, the default HTTP headers are likely to identify your scraper as a bot. Update the requests.get line by passing an additional parameter to the function:

r = requests.get(URL, headers=custom_headers)

Here, you pass the custom_headers dictionary you created to the requests.get method along with the URL. Run the script again to print the headers that are now sent.

Your output should look like this:

{'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8', 'Connection': 'keep-alive', 'Accept-Language': 'en-GB,en;q=0.6', 'Cookie': 'zero-chakra-ui-color-mode=light-zero; AMP_MKTG_8f1ede8e9c=JTdCJTIycmVmZXJyZXIlMjIlM0ElMjJodHRwcyUzQSUyRiUyRnd3dy5nb29nbGUuY29tJTJGJTIyJTJDJTIycmVmZXJyaW5nX2RvbWFpbiUyMiUzQSUyMnd3dy5nb29nbGUuY29tJTIyJTdE; AMP_8f1ede8e9c=JTdCJTIyZGV2aWNlSWQlMjIlM0ElMjI1MjgxOGYyNC05ZGQ3LTQ5OTAtYjcxMC01NTY0NzliMzAwZmYlMjIlMkMlMjJzZXNzaW9uSWQlMjIlM0ExNzA4MzgxNTQ4ODQzJTJDJTIyb3B0T3V0JTIyJTNBZmFsc2UlMkMlMjJsYXN0RXZlbnRUaW1lJTIyJTNBMTcwODM4MjE1NTQ2MCUyQyUyMmxhc3RFdmVudElkJTIyJTNBNiU3RA==', 'Referer': 'https://books.toscrape.com/'}

Here, you can see that the headers have been updated with the information obtained from your browser. This makes it more difficult for any web server to detect that you’re automatically visiting their sites, lowering your chances of being blocked.

Benefits of Optimizing Headers

Properly optimizing your HTTP headers is crucial to the continued success of your scraping operation.

The first advantage is a reduced block rate. With optimized headers, your scraper’s interactions with websites resemble those of a typical user, so you can avoid some bot detection systems and lower the likelihood of your scraper being blocked over time (the block rate).

Another benefit is an increased success rate: with fewer requests blocked, more of your requests return the data you’re after.

In addition, optimizing your HTTP headers improves the efficiency of your scraping operation, since headers such as Accept-Language help ensure that the server returns relevant data that meets your needs.

Header Optimization Tips

While properly configuring your header is important for ensuring the success of your web scraping projects, it’s not the end of the story—especially when you need to scrape data on a large scale. Following are some tips for increasing your scraper’s success rate:

Rotate Headers

Going beyond defining headers like the User-Agent to mimic normal traffic from a user, you can define several distinct HTTP headers and rotate between them per request. This allows you to simulate multiple users accessing the web server and distribute the traffic generated between them. This further reduces your chances of being identified as a bot and blocked.

Depending on the scale of your scraping operation, you can use anywhere from ten to hundreds of User-Agents. The more requests you need to send within a short time, the more reasonable it is for you to switch between more User-Agents.
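As a minimal sketch of this idea with the requests library, you might pick a random User-Agent from a small pool for each request (the pool and URLs here are only illustrative; a production setup would use a larger, regularly refreshed list):

import random
import requests

# A small, illustrative pool; real rotation setups may use dozens or hundreds of strings.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:122.0) Gecko/20100101 Firefox/122.0",
]

urls = [
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
]

for url in urls:
    # Pick a different identity for each request to spread traffic across "users".
    headers = {"User-Agent": random.choice(user_agents)}
    response = requests.get(url, headers=headers)
    print(url, response.status_code)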

Keep Headers Updated

Another consideration when it comes to optimizing HTTP headers is regular maintenance. Users typically update their browsers as new versions are released, so there is a good chance that valid headers at any given time will correspond to those of the most recent browser versions. If you set up your headers with one or more User-Agent strings that refer to outdated browser or software versions, web servers will be able to distinguish you from the crowd of regular users and potentially block your requests. The same applies to other headers that require frequent updates.

Avoid Bad Header Configurations

You should also strive to avoid bad header configurations. This can happen when a header, such as User-Agent, does not match all the other standard headers that you’ve set. For example, having the User-Agent set to a Mozilla Firefox browser running on Windows while the remaining headers are defined for a Chromium browser running on Windows is likely to result in your scraper being blocked.

Additionally, when you use a proxy server, which acts as an intermediary between the client and the web server, it may unintentionally add headers, such as Via or X-Forwarded-For, that allow the website’s detection systems to identify your requests as automated. To check your headers, send test requests and ensure that your proxy server does not add identifying headers.
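One simple way to do this is to route a test request through your proxy to a header-echo endpoint such as https://httpbin.org/headers, which returns the headers it received. Here’s a sketch, assuming a placeholder proxy address:

import requests

# Placeholder proxy address; replace with your own proxy host and credentials.
proxies = {
    "http": "http://user:pass@proxy.example.com:8000",
    "https": "http://user:pass@proxy.example.com:8000",
}

# httpbin.org/headers echoes back the headers it received, so any header your
# proxy adds (for example, Via or X-Forwarded-For) will show up in the output.
response = requests.get("https://httpbin.org/headers", proxies=proxies)
print(response.json()["headers"])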

Conclusion

In this article, you learned about HTTP headers, including the User-Agent, Referer, Accept-Language, and Cookie headers, which are some of the most important headers for web scraping. You must optimize your HTTP headers to ensure the longevity and usefulness of your scraping operation.

Properly using HTTP headers to make requests in your web scraping projects reduces your block rate and increases your success rate by making it easier to get past antiscraping mechanisms. It also makes your scraping operation more efficient. However, advanced antiscraping mechanisms that involve JavaScript challenges and CAPTCHAs can still be a hindrance. Bright Data simplifies your scraping operations by providing you with an award-winning and user-friendly proxy network, an advanced scraping browser, a comprehensive Web Scraper IDE, and a Web Unlocker. Whether you’re a beginner or an expert, these product offerings can help you achieve your scraping goals. Start a free trial and explore the offerings of Bright Data today.