Crawler Index
A large sample of crawlers that are blocked by websites.
Updated on 15-November-24
75.6%
of websites have at least a partial disallow command.
% of websites explicitly blocking user agent | % of websites blocking explicitly or with the * wildcard | Company | Purpose | User Agent |
---|---|---|---|---|
2.08% | 62.79% | OpenAI | GPT | GPTBot |
1.46% | 62.62% | Common Crawl | Training Data | CCBot |
0.94% | 62.66% | Google | Bard/Gemini/PaLM/Bison | Google-Extended |
0.81% | 62.69% | OpenAI | ChatGPT | chatgpt-user |
0.80% | 66.85% | Amazon | Alexa | amazonbot |
0.53% | 64.42% | Meta AI | LLaMA | FacebookBot |
0.44% | 66.85% | Brandwatch | Magpie Crawler | magpie-crawler |
0.56% | 66.83% | ByteDance | ByteDance LLM N/A | Bytespider |
0.48% | 62.64% | Anthropic | Claude | Anthropic-AI |
0.48% | 66.82% | Anthropic | Claude | claudebot |
0.35% | 64.54% | Anthropic | Claude | claude-web |
0.35% | 64.57% | Perplexity | Chatbot | perplexitybot |
0.26% | 64.53% | Cohere | Cohere Command | Cohere-AI |
0.30% | 64.60% | Apple | Apple's foundational models | Applebot-Extended |
0.25% | 66.69% | Apple | Siri | Applebot |
0.21% | 66.81% | Diffbot | training data | diffbot |
0.19% | 66.80% | Meta | All Meta AI | meta-externalagent |
0.12% | 66.80% | OpenAI | SearchGPT | oai-searchbot |
0.08% | 66.80% | Timpi | Wilson AI | timpibot |
0.07% | 66.81% | webz.io | webzio-extended | webzio-extended |
0.04% | 66.79% | Google | Bard/Gemini/PaLM/Bison | googleother |
0.01% | 66.85% | Perplexity | perplexity-ai | perplexity-ai |
0.01% | 66.81% | Meta | All Meta AI | meta-externalfetcher |
% of websites explicitly blocking user agent | % of websites blocking explicitly or with the * wildcard | Company | Purpose | User Agent |
---|---|---|---|---|
24.09% | 87.37% | OpenAI | ChatGPT | gptbot |
16.80% | 87.11% | Common Crawl | Training Data | ccbot |
12.63% | 86.85% | Google | Bard/Gemini/PaLM/Bison | google-extended |
11.20% | 76.04% | OpenAI | ChatGPT | chatgpt-user |
9.11% | 86.59% | Anthropic | Claude | anthropic-ai |
8.46% | 86.33% | Anthropic | Claude | claudebot |
6.64% | 86.33% | Anthropic | Claude | claude-web |
6.77% | 86.07% | Meta | LLaMA | facebookbot |
6.51% | 86.46% | ByteDance | ByteDance LLM N/A | bytespider |
6.38% | 86.33% | Perplexity | Chatbot | perplexitybot |
5.86% | 86.59% | Cohere | Cohere Command | cohere-ai |
5.73% | 86.85% | Apple | Apple's foundational models | applebot-extended |
4.04% | 86.98% | Brandwatch | Magpie Crawler | magpie-crawler |
3.26% | 86.72% | Amazon | Alexa | amazonbot |
2.73% | 86.85% | Apple | Siri | applebot |
0.78% | 87.11% | Google | Bard/Gemini/PaLM/Bison | googleother |
0.52% | 86.98% | webz.io | webzio-extended | webzio-extended |
0.91% | 86.85% | Timpi | Wilson AI | timpibot |
1.17% | 87.11% | Perplexity | perplexity-ai | perplexity-ai |
1.30% | 87.11% | Meta | All Meta AI | meta-externalfetcher |
1.95% | 86.72% | OpenAI | SearchGPT | OAI-searchbot |
3.39% | 87.11% | Meta | All Meta AI | meta-externalagent |
Bright Data scrapes the world’s most sought-after public web data on billions of top websites. Through our compliance product, Bright Shield, we collect the allow and disallow commands that each website's robots.txt file sets for user agents. Our current sample covers 1,509,084 websites, from which we have collected about 1,700 unique user agents.
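As a rough illustration of what collecting allow and disallow commands can involve, the Python sketch below groups the rules in one robots.txt file by user agent. It is not Bright Shield's implementation: the function name `parse_robots` and the sample file are hypothetical, and the sketch assumes that only `User-agent`, `Allow`, and `Disallow` lines matter.

```python
from collections import defaultdict


def parse_robots(text: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased user agent to its (directive, path) rules.

    Hypothetical helper for illustration; other fields such as
    Sitemap or Crawl-delay are ignored.
    """
    rules = defaultdict(list)
    current_agents = []   # agents named by the group being read
    in_rules = False      # True once the current group has a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:              # a rule was seen, so a new group starts
                current_agents = []
                in_rules = False
            agent = value.lower()
            current_agents.append(agent)
            rules[agent]              # register the agent even without rules
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                rules[agent].append((field, value))
    return dict(rules)


# Hypothetical robots.txt content, not taken from any real site.
sample_text = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
Allow: /
"""
print(parse_robots(sample_text))
```

User agent names are lowercased here because robots.txt matching is case-insensitive, which also lets the same agent be counted once across sites that spell it differently.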
Our research team has measured how often each user agent of interest is explicitly blocked within our sample, and how often it is blocked either explicitly or through the wildcard (*) command. We also track the overall percentage of websites that disallow all crawlers. Each user agent is identified to the best of our ability by company, use, and a link that includes additional information such as how to block it.
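The two percentage columns can be sketched as a simple count over the sample. The Python example below is only an assumption about how such figures might be computed, not the research team's actual method. The names `blocks` and `block_rates` are hypothetical. It assumes each site's rules are already grouped by lowercased user agent, that "blocked" means at least one non-empty `Disallow` path, and that a site's agent-specific group takes precedence over its `*` group.

```python
def blocks(rules):
    """True if the rule list has at least one non-empty Disallow path."""
    return any(directive == "disallow" and path for directive, path in rules)


def block_rates(sites, agent):
    """Return (explicit %, explicit-or-wildcard %) for `agent`.

    `sites` is a list with one dict per website, mapping a lowercased
    user agent (or "*") to its list of (directive, path) rules.
    """
    if not sites:
        return (0.0, 0.0)
    agent = agent.lower()
    explicit = combined = 0
    for site_rules in sites:
        if agent in site_rules:
            # Assumption: the agent's own group overrides the "*" group.
            blocked = blocks(site_rules[agent])
            explicit += blocked
            combined += blocked
        else:
            # No dedicated group, so the wildcard group applies.
            combined += blocks(site_rules.get("*", []))
    total = len(sites)
    return (100 * explicit / total, 100 * combined / total)


# Four hypothetical sites, not drawn from the real sample.
sites = [
    {"gptbot": [("disallow", "/")]},
    {"*": [("disallow", "/private/")]},
    {"gptbot": [("allow", "/")], "*": [("disallow", "/")]},
    {},
]
print(block_rates(sites, "GPTBot"))  # (25.0, 50.0)
```

In this toy sample, one site in four names the agent and disallows it, and a second site blocks it only through the wildcard group, which gives 25% explicit and 50% combined.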
Comments on user agents? Email comments to [email protected]