Robots.txt for Web Scraping Guide

In this guide, you will learn about robots.txt, why it’s important for web scraping, and how to use it in the scraping process.

This robots.txt web scraping guide will cover:

  • What is robots.txt?
  • Why is it important for web scraping?
  • Consequences of ignoring it when scraping a site
  • Common directives you need to know for web scraping
  • How to use it in a web scraping process

What Is robots.txt?

robots.txt is a text file used to implement the Robots Exclusion Protocol (REP), a standard for instructing web robots on how to interact with a site. In detail, robots.txt specifies which bots are allowed to visit the site, what pages and resources they can access, at what rate, and more. These bots are usually web crawlers used by search engines such as Google, Bing, and DuckDuckGo to index the Web.

According to the Google specification, each domain (or subdomain) can have a robots.txt file. The file is optional but, when present, must be placed in the root directory of the domain. In other words, if the base URL of a site is https://example.com, then the robots.txt file will be available at https://example.com/robots.txt.
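To see this in practice, here is a minimal Python sketch that builds the robots.txt URL from a base URL and downloads it, assuming the requests library is installed (https://example.com is just a placeholder):

from urllib.parse import urljoin

import requests

base_url = "https://example.com"  # placeholder target site
robots_url = urljoin(base_url, "/robots.txt")  # -> https://example.com/robots.txt

response = requests.get(robots_url, timeout=10)
if response.ok:
    print(response.text)  # the raw robots.txt rules
else:
    print(f"No robots.txt found (HTTP {response.status_code})")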

For example, here is what Bright Data’s robots.txt looks like:

User-agent: *
Disallow: /lum/
Disallow: /www/*.html
Disallow: /use-cases/fintech
Disallow: /products/datasets2/
Disallow: /events/*
Disallow: /wp-stage/*
Disallow: /www/*
Disallow: /svc/*
Host: brightdata.com
Sitemap: https://brightdata.com/sitemap_index.xml

As you can see, this is just a text file containing a set of rules and directives for web bots.

Keep in mind that directive names like User-agent and Disallow are not case-sensitive, but their values are. So /lum/ is not the same as /Lum/.

Why Is It Important for Web Scraping?

The bots that robots.txt provides instructions to are not only search engine crawlers. The rules also apply to any automated software that interacts with a site, including web scrapers. At the end of the day, scraping programs are nothing more than automated bots. Plus, they typically perform web crawling just like search engine bots do. Learn more in our comparison guide on web crawling vs. web scraping.

When scraping a site, it is therefore crucial to respect the target site’s robots.txt file. Doing so brings several benefits:

  • Legal compliance: Web scrapers should abide by the rules defined by site owners, for an ethical approach to web scraping.
  • Reduced server load: Crawlers and scrapers are resource-intensive, and respecting the directives helps to prevent overloading a site.
  • Avoiding triggering anti-bot measures: Many sites monitor incoming web traffic to block unauthorized bots that do not comply with the declared rules.

Now the question is, what happens if your scraping script does not respect robots.txt? Let’s find that out in the section below!

Consequences of Ignoring robots.txt When Scraping a Site

Sites react very differently to violations of their robots.txt file. Sometimes, nothing happens. Other times, you may face serious consequences. In general, here are the scenarios you need to take into consideration when ignoring robots.txt in web scraping:

  • Blocks and disrupted operations: Anti-scraping and anti-bot technologies are likely to temporarily or permanently ban your IP. This compromises the efficiency of a scraping process.
  • Legal action: The site owner may pursue legal action against you, especially if the fingerprint left by your scraping script reveals your identity. Protect your privacy with a web scraping proxy.
  • Increased scrutiny: Web security professionals, ISPs, and cybersecurity organizations may start investigating your scraping activities.

These are only some examples, but they are enough to understand the relevance of the matter. To adhere to robots.txt, you must first understand the meaning of its instructions. Time to dig into that!

Common robots.txt Directives You Need to Know for Web Scraping

The REP specification defines only a few directives a robots.txt file can contain. Over time, search engines have introduced additional rules. These represent a de facto standard and must be taken into account as well.

Now, take a look at the most relevant robots.txt web scraping directives.

User-agent

User-agent specifies which bot the rules that follow apply to. A user agent can be a web crawler, spider, scraper, or any other bot. Website admins usually use this directive to limit or instruct crawlers from specific search engines.

The syntax of the directive is:

User-agent: <user_agent_name>
Disallow: [value]

If <user_agent_name> is *, the rules that follow apply to any bot visiting the site. The Disallow instruction contains relative paths and restricts access to those specific parts of the site.

Popular user agent strings are:

Search Engine | User-agent name
Baidu         | baiduspider
Bing          | bingbot
Google        | Googlebot
Yahoo!        | slurp
Yandex        | yandex

Consider the example below:

User-agent: *
Disallow: /private/

The above robots.txt file disallows all user agents from visiting pages under the /private/ path. Thus, /private/admin-login.php is disallowed, but also /private/platform/dashboard.php. This implies that files within subfolders are affected by the Disallow rule too.

Note that the same User-agent can have more than one Disallow rule:

User-agent: *
Disallow: /private/
Disallow: /admin/

This time, both /private/ and /admin/ paths are disallowed.
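If you want to verify such rules programmatically, Python’s standard library includes a robots.txt parser. Here is a minimal sketch based on urllib.robotparser, where the bot name and page URLs are hypothetical:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Pages under /private/ and /admin/ are blocked for any bot
print(parser.can_fetch("my-scraper", "https://example.com/private/dashboard.php"))  # False
print(parser.can_fetch("my-scraper", "https://example.com/blog/post-1.html"))       # True

As expected, any page under /private/ or /admin/ is reported as off-limits, while the rest of the site remains accessible.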

When Disallow has no value, then all pages are allowed for access:

User-agent: *
Disallow:

If it instead contains the / value, it means that every page is disallowed:

User-agent: *
Disallow: /

The official robots.txt standard does not mention regular expressions or wildcards for Disallow, but all major search engines support them. So, it is pretty common to see something like this:

Disallow: /*.php
Disallow: /resources/*.pdf

This prevents bots from accessing PHP files anywhere on the site and PDF files under /resources/.

A non-standard counterpart of Disallow is Allow, which specifies the paths bots are permitted to access. Take a look at the following example:

User-agent: *
Allow: /

That corresponds to:

User-agent: *
Disallow:

Allow can override specific routes blocked by Disallow:

User-agent: *
Disallow: /private/
Allow: /private/terms-and-conditions.php

In this robots.txt example, all pages under /private/ are disallowed except for /private/terms-and-conditions.php.

Keep in mind that the same robots.txt file can have multiple User-agent directives to target different web robots:

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /blog/
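To see how the two bots end up with different permissions, the sketch below feeds the rules above to urllib.robotparser and checks the same blog page for both crawlers (the URL is hypothetical):

from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /blog/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Googlebot may visit everything, while bingbot must stay out of /blog/
print(parser.can_fetch("Googlebot", "https://example.com/blog/post-1.html"))  # True
print(parser.can_fetch("bingbot", "https://example.com/blog/post-1.html"))    # False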

Sitemap

Sitemap is a non-standard directive that contains the location of a website’s XML sitemap:

Sitemap: https://www.example.com/sitemap.xml

This rule informs bots about the location of the XML sitemap, which provides useful information about the site’s structure. Following the URLs contained in a sitemap makes it easier to scrape an entire site. Explore our sitemap scraper!

Note that the URL pointing to the sitemap file must be absolute.

As a site can have multiple sitemaps, robots.txt can include many Sitemap directives:

Sitemap: https://www.example.com/sitemaps/page-sitemap.xml
Sitemap: https://www.example.com/sitemaps/post-sitemap.xml
Sitemap: https://www.example.com/sitemaps/author-sitemap.xml
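As a hint of how a scraper can take advantage of this, here is a minimal sketch that downloads one of the sitemaps above with requests and extracts its page URLs with the standard library (the URL is still a placeholder):

import xml.etree.ElementTree as ET

import requests

sitemap_url = "https://www.example.com/sitemaps/page-sitemap.xml"  # placeholder
response = requests.get(sitemap_url, timeout=10)

# Sitemap files declare the sitemaps.org namespace on every tag
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(response.content)
urls = [loc.text for loc in root.findall(".//sm:loc", ns)]

print(f"Found {len(urls)} page URLs to scrape")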

Crawl-delay

The unofficial and less common Crawl-delay directive defines how many seconds web crawlers should wait between successive requests to the site:

User-agent: *
Crawl-delay: 5

It is a User-agent-specific directive whose goal is to prevent overloading servers. In this example, all user agents are instructed to wait 5 seconds between page visits.
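A simple way to honor this rule in Python is to read the delay with urllib.robotparser and sleep between page visits. This is just a sketch based on the rules above, with hypothetical page URLs:

import time
from urllib.robotparser import RobotFileParser

import requests

parser = RobotFileParser()
parser.parse(["User-agent: *", "Crawl-delay: 5"])

delay = parser.crawl_delay("my-scraper") or 0  # 5 in this example

for url in ["https://example.com/page-1", "https://example.com/page-2"]:
    requests.get(url, timeout=10)
    time.sleep(delay)  # wait before the next page visit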

Request-rate

The rare, User-agent-specific, and non-standard Request-rate directive specifies the maximum number of requests a user agent can make to the site within a given time frame:

User-agent: *
Request-rate: 1/10

For example, this rule instructs all user agents to limit their requests to one every 10 seconds.

The format Request-rate values follow is:

<number_of_requests>/<seconds>

This directive is similar to Crawl-delay, in that both help to avoid server overload. The main difference is that Crawl-delay achieves that by imposing a delay between requests, while Request-rate does so by enforcing rate-limiting restrictions.
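On Python 3.6+, urllib.robotparser can read this value as well, so a scraper can turn it into a pause between requests. Again, this is only a sketch based on the rules above:

import time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Request-rate: 1/10"])

rate = parser.request_rate("my-scraper")  # RequestRate(requests=1, seconds=10)
pause = rate.seconds / rate.requests      # one request every 10 seconds

print(f"Sleeping {pause} seconds after each request")
time.sleep(pause)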

How To Use robots.txt in a Web Scraping Process

You now know what robots.txt is and how it works. It only remains to see how to integrate its use into a web scraping process. Here’s what you need to do to respect the robots.txt file for web scraping:

  1. Reach the robots.txt file of the target site:
    • Send an HTTP GET request to the /robots.txt path to download the file or open it in the browser.
  2. Examine its content:
    • Read the directives contained in the robots.txt file.
    • Check if there are Disallow rules that restrict access to specific URLs or directories.
    • Look for Allow rules that may grant access to certain areas within disallowed paths.
    • Examine the Crawl-delay and Request-rate directives, if specified.
  3. Build your scraping script:
    • Create or modify your scraper, making sure it complies with the rules set in robots.txt.
    • Avoid accessing URLs that are disallowed for your user agent.
    • Implement throttling mechanisms in your scraper to respect the Crawl-delay or Request-rate limits.

As you can see, you must analyze the directives contained in robots.txt before building your scraper. Only in this way can you avoid the consequences mentioned earlier.
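To make the whole flow concrete, below is a minimal end-to-end sketch that ties these steps together with urllib.robotparser and requests. The target site, bot name, and paths are placeholders rather than a ready-to-use implementation:

import time
from urllib.robotparser import RobotFileParser

import requests

BASE_URL = "https://example.com"  # placeholder target site
USER_AGENT = "my-scraper"         # hypothetical bot name

# 1. Download and parse the target site's robots.txt
parser = RobotFileParser()
parser.set_url(f"{BASE_URL}/robots.txt")
parser.read()

# 2. Derive a throttling delay from Crawl-delay or Request-rate,
#    falling back to a conservative default of 1 second
delay = parser.crawl_delay(USER_AGENT)
if delay is None:
    rate = parser.request_rate(USER_AGENT)
    delay = rate.seconds / rate.requests if rate else 1

# 3. Visit only the allowed pages, waiting between requests
for path in ["/blog/post-1", "/private/dashboard"]:
    url = f"{BASE_URL}{path}"
    if not parser.can_fetch(USER_AGENT, url):
        print(f"Skipping disallowed URL: {url}")
        continue
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    # ... extract the data you need from response.text here ...
    time.sleep(delay)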

Et voilà! You are now a robots.txt web scraping expert!

Conclusion

In this article, you saw what robots.txt is, why sites use it, and how it can help your scraper avoid getting blocked. In detail, you analyzed its directives that can affect your online data retrieval goals. You also learned how to structure an ethical scraping process around it.

Unfortunately, no matter how robots.txt compliant your script is, anti-scraping solutions can still block you. How to avoid that? With a proxy server. There are several providers available online, and trying them all would take months. Fortunately, we have sorted out that problem for you.

Bright Data controls the best proxy servers, serving over 20,000 customers and Fortune 500 companies through its outstanding worldwide proxy network.

Overall, that is one of the largest and most reliable scraping-oriented proxy infrastructures on the market. Talk to one of our sales reps and see which of Bright Data’s products best suits your needs.
