This robots.txt web scraping guide will cover:
- What is robots.txt?
- Why is it important for web scraping?
- Consequences of ignoring it when scraping a site
- Common directives you need to know for web scraping
- How to use it in a web scraping process
What Is robots.txt?
robots.txt is a text file used to implement the Robots Exclusion Protocol (REP), a standard for instructing web robots on how to interact with a site. In detail, robots.txt specifies which bots are allowed to visit the site, what pages and resources they can access, at what rate, and more. These bots are usually web crawlers used by search engines such as Google, Bing, and DuckDuckGo to index the Web.
According to the Google specification, each domain (or subdomain) can have a robots.txt file. The file is optional, but when present it must be placed in the root directory of the domain. In other words, if the base URL of a site is https://example.com, then the robots.txt file will be available at https://example.com/robots.txt.
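As a minimal sketch of that rule, the Python helper below derives the robots.txt location from any page URL on the same host. The function name robots_txt_url is hypothetical, and the example.com URLs are placeholders:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; path, query, and fragment are dropped
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=1"))
print(robots_txt_url("https://shop.example.com/cart"))
```

Because the host is part of the result, a subdomain such as shop.example.com resolves to its own robots.txt file rather than the one of the main domain.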
For example, here is what Bright Data’s robots.txt looks like:
As you can see, this is just a text file containing a set of rules and directives for web bots.
Keep in mind that directive names like User-agent and Disallow are not case-sensitive. Their values, by contrast, are case-sensitive. So /lum/ is not the same as /Lum/.
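You can observe this behavior with Python's standard-library urllib.robotparser. The sketch below is only an illustration: the rules string, the MyScraper agent name, and the example.com URLs are assumptions, and the directive names are written in mixed case on purpose:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: directive names in unusual casing, path in lowercase
rules = """\
USER-AGENT: *
disallow: /lum/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The lowercase path matches the rule, the capitalized one does not
print(parser.can_fetch("MyScraper", "https://example.com/lum/page"))  # False
print(parser.can_fetch("MyScraper", "https://example.com/Lum/page"))  # True
```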
Why Is It Important for Web Scraping?
The robots.txt file does not address search engine crawlers only. Its instructions apply to any automated software that interacts with a site, including web scrapers. At the end of the day, scraping programs are nothing more than automated bots. Plus, they typically perform web crawling like search engine bots. Learn more in our comparison guide on web crawling vs. web scraping.
When scraping a site, it is therefore crucial to respect the target site’s robots.txt file. Doing so brings the following benefits:
- Legal compliance: Web scrapers should abide by the rules defined by site owners, for an ethical approach to web scraping.
- Reduced server load: Crawlers and scrapers are resource-intensive, and respecting the directives helps to prevent overloading a site.
- Avoiding triggering anti-bot measures: Many sites monitor incoming web traffic to block unauthorized bots that do not comply with the declared rules.
Now the question is, what happens if your scraping script does not respect robots.txt? Let’s find that out in the section below!
Consequences of Ignoring robots.txt When Scraping a Site
Sites react very differently to violations of their robots.txt file. Sometimes, nothing happens. Other times, you may face serious consequences. In general, here are the scenarios you need to take into consideration when ignoring robots.txt in web scraping:
- Blocks and disrupted operations: Anti-scraping and anti-bot technologies are likely to temporarily or permanently ban your IP. This compromises the efficiency of a scraping process.
- Legal actions: If you do not comply, legal action may follow. This is possible if the fingerprint left by the scraping script can reveal your identity. Protect your privacy with a web scraping proxy!
- Increased scrutiny: Web security professionals, ISPs, and cybersecurity organizations may start investigating your scraping activities.
These are only some examples, but they are enough to understand the relevance of the matter. To adhere to robots.txt, you must first understand the meaning of its instructions. Time to dig into that!
Common robots.txt Directives You Need to Know for Web Scraping
The REP specification defines only a few directives that a robots.txt file can contain. Over time, search engine specifications have introduced other possible rules. These represent a de facto standard and must be taken into account as well.
Now, take a look at the most relevant robots.txt web scraping directives.
User-agent specifies which bots the rules that follow it apply to. A user agent can be a web crawler, spider, scraper, or any bot. Usually, website admins use this directive to limit or instruct crawlers from specific search engines.
The directive takes the form User-agent: <user_agent_name> and is followed by one or more rules, such as Disallow: <path>.
If <user_agent_name> is *, the rules that follow apply to every bot. The Disallow instruction must contain paths relative to the site root and restricts access to those specific parts of the site.
Popular user agent strings are:
|Search Engine|User-agent name|
|Google|Googlebot|
|Bing|Bingbot|
|DuckDuckGo|DuckDuckBot|
Consider a robots.txt file that contains User-agent: * followed by Disallow: /private/.
Such a file disallows all user agents from visiting pages under the /private/ path. Thus, /private/admin-login.php is disallowed, but so is /private/platform/dashboard.php. This implies that files within subfolders are affected by the Disallow rule too.
Note that the same User-agent can have more than one Disallow rule, for example Disallow: /private/ and Disallow: /admin/ in the same group.
In that case, both the /private/ and /admin/ paths are disallowed.
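The prefix behavior of Disallow can be checked with Python's standard-library urllib.robotparser. In this sketch, the rules string, the MyScraper agent name, and the example.com URLs are assumptions made for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical group with two Disallow rules for every bot
rules = """\
User-agent: *
Disallow: /private/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

paths = [
    "/private/admin-login.php",
    "/private/platform/dashboard.php",  # subfolders are covered too
    "/admin/users",
    "/blog/",
]
for path in paths:
    allowed = parser.can_fetch("MyScraper", "https://example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

Only /blog/ is reported as allowed, since every other path starts with one of the two disallowed prefixes.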
When Disallow has no value (Disallow:), all pages are allowed for access.
If it instead contains the / value (Disallow: /), every page is disallowed.
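These two extremes can be compared side by side in Python. The is_allowed helper below is a hypothetical wrapper around the standard-library urllib.robotparser, and the agent name and URL are placeholders:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules: str, url: str, agent: str = "MyScraper") -> bool:
    """Parse robots.txt text and check whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


allow_all = "User-agent: *\nDisallow:\n"    # empty value: nothing is blocked
block_all = "User-agent: *\nDisallow: /\n"  # "/" value: everything is blocked

url = "https://example.com/products/1"
print(is_allowed(allow_all, url))  # True
print(is_allowed(block_all, url))  # False
```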
The original robots.txt standard does not mention regular expressions or wildcards for Disallow, but all major search engines support the * and $ pattern characters. So, it is pretty common to see rules such as Disallow: /*.php and Disallow: /resources/*.pdf.
Rules like these prevent your bots from accessing PHP files and PDF files under /resources/.
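Python's built-in urllib.robotparser matches rule paths as plain prefixes and does not interpret these wildcards, so a scraper has to handle them itself. The sketch below is one possible approach, assuming Google-style semantics where * matches any run of characters and a trailing $ anchors the end of the URL. The function names and the sample patterns are hypothetical:

```python
import re


def pattern_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' back into '.*'
    body = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path, like prefix matching does
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.php", "/index.php"))                             # True
print(matches("/resources/*.pdf", "/resources/guides/intro.pdf"))  # True
print(matches("/resources/*.pdf", "/resources/intro.html"))        # False
print(matches("/*.php$", "/index.php?page=2"))                     # False
```

A path that matches a Disallow pattern in this way should be skipped by your scraper.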
The opposite instruction to Disallow is Allow, which specifies paths that bots are permitted to access. It was not part of the original REP, but it is supported by all major search engines. Take a look at the following example:
That corresponds to:
Allow can override specific routes blocked by Disallow, for example by combining Disallow: /private/ with Allow: /private/terms-and-conditions.php.
With these two rules, all pages under /private/ are disallowed except for /private/terms-and-conditions.php.
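Search engines such as Google resolve conflicts between Allow and Disallow by picking the most specific (longest) matching path. Python's urllib.robotparser instead applies the first matching rule in file order, so the same file can give a different answer there. The sketch below shows the longest-match approach in plain Python. It ignores wildcards, and the is_allowed helper and the rule tuples are hypothetical:

```python
from typing import List, Tuple


def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """Apply longest-match precedence to (directive, prefix) rules.

    The longest matching prefix wins; on a tie, Allow beats Disallow.
    A path that matches no rule is allowed.
    """
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules never block a path
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/terms-and-conditions.php"),
]

print(is_allowed("/private/terms-and-conditions.php", rules))  # True
print(is_allowed("/private/dashboard.php", rules))             # False
print(is_allowed("/blog/", rules))                             # True
```

With this precedence, the order in which the two rules appear in the file does not change the outcome.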
Keep in mind that the same robots.txt file can have multiple User-agent directives to target different web robots:
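As a sketch of such a file, the Python snippet below parses two hypothetical groups, one for Googlebot and one for every other bot, with the standard-library urllib.robotparser. The paths, the MyScraper agent name, and the example.com URLs are assumptions:

```python
from urllib.robotparser import RobotFileParser

# Two groups separated by a blank line, each with its own rules
rules = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot follows its own group only
print(parser.can_fetch("Googlebot", "https://example.com/drafts/"))   # False
print(parser.can_fetch("Googlebot", "https://example.com/private/"))  # True

# Any other bot falls back to the * group
print(parser.can_fetch("MyScraper", "https://example.com/drafts/"))   # True
print(parser.can_fetch("MyScraper", "https://example.com/private/"))  # False
```

Your scraper should look for the group that names its own user agent first and fall back to the * group only when no such group exists.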
Sitemap is a non-standard directive that contains the location of a website’s XML sitemap:
This rule informs bots about the location of the XML sitemap, which provides useful information about the site’s structure. Following the URLs contained in a sitemap makes it easier to scrape an entire site. Explore our sitemap scraper!
Note that the URL pointing to the sitemap file must be absolute.
As a site can have multiple sitemaps, robots.txt can include many Sitemap directives:
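A short Python sketch of that case, using urllib.robotparser from the standard library (its site_maps() method requires Python 3.8 or later). The sitemap URLs and the Disallow rule are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Sitemap lines are independent of User-agent groups
rules = """\
User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemaps/page-sitemap.xml
Sitemap: https://www.example.com/sitemaps/post-sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns the declared URLs in order, or None if there are none
for sitemap_url in parser.site_maps() or []:
    print(sitemap_url)
```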
The unofficial and not widely supported Crawl-delay directive defines how many seconds web crawlers should wait between successive requests to the site.
It is a User-agent-specific directive whose goal is to prevent overloading servers. For example, Crawl-delay: 5 under User-agent: * instructs all user agents to wait 5 seconds between page visits.
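To read the declared delay in Python, you can rely on the crawl_delay() method of the standard-library urllib.robotparser. The rules string and the MyScraper agent name below are assumptions made for this sketch:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# crawl_delay() returns the value in seconds, or None when it is not set
delay = parser.crawl_delay("MyScraper")
print(delay)  # 5
```

A scraper would then pause for that many seconds between requests, for example with time.sleep(delay), whenever delay is not None.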
The rare, User-agent-specific, and non-standard Request-rate directive specifies the maximum number of requests a user agent can make to the site within a specified time frame.
Its values follow the format <number_of_requests>/<seconds>.
For example, Request-rate: 1/10 under User-agent: * instructs all user agents to limit their requests to one every 10 seconds.
This directive is similar to Crawl-delay, in that both help to avoid server overload. The main difference is that Crawl-delay imposes a fixed pause between requests, while Request-rate caps the number of requests allowed in a time window.
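The standard-library urllib.robotparser exposes this value through request_rate(), which returns a named tuple with requests and seconds fields. The sketch below assumes a hypothetical rule of one request every 10 seconds and converts it into a minimum interval between requests:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Request-rate: 1/10
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# request_rate() returns None when the directive is missing
rate = parser.request_rate("MyScraper")
if rate is not None:
    # Spread the allowed requests evenly over the time window
    interval = rate.seconds / rate.requests
    print(rate.requests, rate.seconds, interval)  # 1 10 10.0
```

Spacing requests evenly is only one way to respect the limit. A scraper could also send its requests in bursts as long as it stays under the cap for each window.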
How To Use robots.txt in a Web Scraping Process
You now know what robots.txt is and how it works. All that remains is to see how to integrate it into a web scraping process. Here’s what you need to do to respect the robots.txt file for web scraping:
- Reach the robots.txt file of the target site:
  - Send an HTTP GET request to the /robots.txt path to download the file or open it in the browser.
- Examine its content:
  - Read the directives contained in the robots.txt file.
  - Check if there are Disallow rules that restrict access to specific URLs or directories.
  - Look for Allow rules that may grant access to certain areas within disallowed paths.
  - Examine the Crawl-delay and Request-rate directives, if specified.
- Build your scraping script:
  - Create or modify your scraper, making sure it complies with the rules set in robots.txt.
  - Avoid accessing URLs that are disallowed for your user agent.
  - Implement throttling mechanisms in your scraper to respect the Crawl-delay or Request-rate limits.
As you can see, you must analyze the directives contained in robots.txt before building your scraper. Only this way can you avoid the consequences mentioned earlier.
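The steps above can be sketched in Python with the standard-library urllib.robotparser. This is a simplified illustration and not a complete scraper: plan_crawl and crawl are hypothetical helpers, the robots.txt text is passed in as a string, and the fetch argument stands for whatever HTTP client function you use:

```python
import time
from urllib.robotparser import RobotFileParser


def plan_crawl(robots_txt, urls, agent="MyScraper"):
    """Return (allowed_urls, delay_seconds) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Drop every URL that is disallowed for this user agent
    allowed = [url for url in urls if parser.can_fetch(agent, url)]

    # Honor the stricter of Crawl-delay and Request-rate, if any is set
    delay = parser.crawl_delay(agent) or 0
    rate = parser.request_rate(agent)
    if rate and rate.requests:
        delay = max(delay, rate.seconds / rate.requests)
    return allowed, delay


def crawl(robots_txt, urls, fetch, agent="MyScraper"):
    """Fetch every allowed URL, pausing between consecutive requests."""
    allowed, delay = plan_crawl(robots_txt, urls, agent)
    results = {}
    for index, url in enumerate(allowed):
        if index and delay:
            time.sleep(delay)  # throttle all requests after the first one
        results[url] = fetch(url)
    return results


robots_txt = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
urls = ["https://example.com/", "https://example.com/private/data"]
print(plan_crawl(robots_txt, urls))  # (['https://example.com/'], 2)
```

In a real script, you could download the file first, for example with the set_url() and read() methods of RobotFileParser, and pass your own fetch function to crawl().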
Et voilà! You are now a robots.txt web scraping expert!
In this article, you saw what robots.txt is, why sites use it, and how it can help your scraper avoid getting blocked. In detail, you analyzed its directives that can affect your online data retrieval goals. You also learned how to structure an ethical scraping process around it.
Unfortunately, no matter how robots.txt compliant your script is, anti-scraping solutions can still block you. How to avoid that? With a proxy server. There are several providers available online, and trying them all would take months. Fortunately, we have sorted out that problem for you.
Bright Data controls the best proxy servers, serving over 20,000 customers and Fortune 500 companies. Its outstanding worldwide proxy network includes:
- Datacenter proxies – Over 770,000 datacenter IPs.
- Residential proxies – Over 72M residential IPs in more than 195 countries.
- ISP proxies – Over 700,000 ISP IPs.
- Mobile proxies – Over 7M mobile IPs.
Overall, that is one of the largest and most reliable scraping-oriented proxy infrastructures on the market. Talk to one of our sales reps and see which of Bright Data’s products best suits your needs.