Cloudflare serves a dual function of security and performance. Its global network speeds up content delivery and protects websites against bot attacks and malicious actors. However, if you’re trying to scrape a website that Cloudflare protects, you’re going to run into issues.
In this article, you’ll learn about some of the best methods to bypass Cloudflare’s protection for web scraping without getting blocked.
Understanding Cloudflare’s Mechanisms
To protect web applications against threats like distributed denial-of-service (DDoS) and zero-day attacks, Cloudflare offers a web application firewall (WAF) service that runs on the Cloudflare global network and sits in front of web applications to stop attacks in real time. Among its defense mechanisms, Cloudflare detects and blocks malicious bots using a proprietary algorithm that likely looks at the following traits:
- Transport Layer Security (TLS) fingerprints: When a client connects to a server over HTTPS, the parameters it sends in its TLS handshake (such as the TLS version, cipher suites, and extensions) can be hashed into a fingerprint, such as a JA3 fingerprint. The fingerprint reflects the client’s TLS library and configuration, so it can distinguish a real browser from many HTTP libraries. Cloudflare uses this fingerprint to help determine whether the client is a real user or an automated bot.
- HTTP/2 fingerprints: HTTP/2 fingerprinting works similarly to TLS fingerprinting but uses the HTTP/2 parameters sent by the client to produce a fingerprint and then compares it against known bot fingerprints.
- HTTP details: Cloudflare looks at the HTTP request details, such as HTTP headers and cookies, to detect typical configurations used by bots.
- JavaScript fingerprints: Cloudflare runs JavaScript in the client to extract details about it, such as the browser, OS, and hardware. These details are then used to determine whether the client is a bot.
- Behavior analysis: One of the main ways to detect a bot is via its behavior. If a client sends too many requests in a short amount of time, then it’s likely a bot. Cloudflare also looks at user behavior, such as mouse movements and idle times, and uses machine learning to identify whether the traffic is coming from a bot or not.
If Cloudflare suspects bot-like behavior, it produces a JavaScript challenge for the client to solve. This challenge is non-interactive and invisible to the user, and the verification takes place in the background. If the JavaScript challenge fails to produce a certain result, the user is shown a simple CAPTCHA.
Techniques to Bypass Cloudflare
Now that you know how Cloudflare works, let’s take a look at some methods that you can use to bypass Cloudflare. Because Cloudflare uses a complicated proprietary algorithm to detect bots, the following methods aren’t guaranteed to work. You need to experiment with different methods and figure out the best one for your use case.
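When you experiment with these methods, it helps to tell a real page apart from a Cloudflare challenge page. The following is a rough heuristic sketch, not an official check: it assumes challenge responses carry a `cf-mitigated: challenge` header, and it falls back to a few body markers that are guesses based on commonly seen challenge pages. The function name is hypothetical.

```python
from typing import Mapping


def looks_like_cloudflare_challenge(
    status_code: int, headers: Mapping[str, str], body: str = ""
) -> bool:
    """Heuristically decide whether a response is a Cloudflare challenge page."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Assumption: challenge pages are served with this header.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Fallback: Cloudflare-served error status plus typical challenge markers.
    # These markers are assumptions and may change at any time.
    served_by_cloudflare = "cloudflare" in normalized.get("server", "")
    markers = ("just a moment", "challenge-platform")
    has_marker = any(marker in body.lower() for marker in markers)
    return served_by_cloudflare and status_code in (403, 503) and has_marker
```

With the `requests` library, you could call it as `looks_like_cloudflare_challenge(response.status_code, response.headers, response.text)` and log the result for each method you try.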
Use Proxy Solutions
One of the most common ways Cloudflare detects bots is by looking at the number of requests sent from a specific IP address. If a particular IP address sends too many consecutive requests within a short period, it’s likely a web scraper.
To reduce the chances of Cloudflare blocking you in this scenario, you can use proxy servers with IP address rotation. With proxy rotation, you can switch proxies as soon as your current proxy is caught.
While there are no proxy servers dedicated to bypassing Cloudflare, most premium residential proxies can evade Cloudflare’s bot detection with varying degrees of success. However, keep in mind that this method doesn’t work if Cloudflare employs user-agent detection. No matter what proxy you’re using, the user agent can reveal that the requests are coming from a bot. If you experience this scenario, you should use user-agent spoofing, which we’ll talk about next.
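As a rough illustration of proxy rotation, the sketch below cycles through a list of proxies and moves on whenever the current one appears to be blocked. The proxy URLs, the set of status codes treated as "blocked", and the function names are all assumptions made for the example. The HTTP call is passed in as a function so the rotation logic stays independent of any particular client.

```python
import itertools

# Assumption: these status codes mean the current proxy has been caught.
BLOCKED_STATUSES = {403, 429}


def fetch_with_rotation(url, proxy_urls, fetch, max_attempts=5):
    """Try proxies in round-robin order until one is not blocked.

    `fetch(url, proxy_url)` must return a (status_code, body) tuple.
    Returns (proxy_url, status_code, body), or None if every attempt failed.
    """
    if not proxy_urls:
        raise ValueError("at least one proxy URL is required")
    pool = itertools.cycle(proxy_urls)
    for _ in range(max_attempts):
        proxy_url = next(pool)
        try:
            status_code, body = fetch(url, proxy_url)
        except OSError:
            # Connection errors (including requests' exceptions) count as a failure.
            continue
        if status_code not in BLOCKED_STATUSES:
            return proxy_url, status_code, body
    return None


def requests_fetch(url, proxy_url):
    """Example fetcher built on the requests library."""
    import requests  # imported here so the rotation logic has no hard dependency

    response = requests.get(
        url, proxies={"http": proxy_url, "https": proxy_url}, timeout=10
    )
    return response.status_code, response.text
```

A call such as `fetch_with_rotation('https://example.com', ['http://proxy-a.example:8000', 'http://proxy-b.example:8000'], requests_fetch)` would try the first proxy and switch to the second if the first returns a blocked status. Many commercial rotating proxies handle this switching on their side, in which case a single proxy URL is enough.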
Not sure which rotating proxy to choose? Read about our top 10 picks for this year.
Spoof HTTP Headers
HTTP headers provide critical information about the client making the request. Cloudflare often examines these headers to identify the origin of the traffic. A real browser often sends a complex collection of headers, whereas a scraper would employ only a handful of headers.
Thankfully, almost every tool that you can use to write scrapers provides a way to modify or add headers, which can help them appear more like a real browser. Following are some of the most common headers you can use:
User-Agent Header
The User-Agent header is used to identify the browser and operating system. Cloudflare might block requests from unusual or known bot User-Agent strings. By spoofing this header to mimic a legitimate browser (e.g., Chrome, Firefox, or Safari), a script can evade detection. For example, if you’re using the Python requests library, you can set the User-Agent header like this:
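import requests

# Present a desktop Chrome User-Agent instead of the default python-requests one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# httpbin echoes back the headers it received, so you can verify what was sent
response = requests.get('http://httpbin.org/headers', headers=headers)
print(response.status_code)
print(response.text)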
Referer Header
Cloudflare may check the Referer header to determine if the request came from a trusted source. By setting this header to a legitimate URL, a scraper can make its request appear as if it were coming from within a trusted context:
import requests

headers = {
    'Referer': 'https://trusted-website.com'
}

response = requests.get('http://httpbin.org/headers', headers=headers)
print(response.status_code)
print(response.text)
Accept Headers
Accept headers inform the server of the types of content the client can handle. Legitimate browsers often send a complex set of Accept headers, which you can mimic to help avoid detection:
import requests

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

response = requests.get('http://httpbin.org/headers', headers=headers)
print(response.status_code)
print(response.text)
Other than these commonly used headers, Cloudflare also detects bots by looking at whether there are any header mismatches or outdated headers. For example, if you use a Firefox user agent together with the Sec-CH-UA-Full-Version-List header, you might get blocked as Firefox doesn’t support this header.
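A simple local check can catch this kind of mismatch before you send a request. The sketch below is only a heuristic with a hypothetical function name: it assumes that client hint headers (those starting with `Sec-CH-UA`) are only plausible when the User-Agent string contains `Chrome/`, which covers Chromium-based browsers but is not a complete model of what Cloudflare checks.

```python
def find_header_mismatches(headers):
    """Return a list of human-readable problems found in a header set."""
    # Header names are case-insensitive, so compare them in lowercase.
    normalized = {name.lower(): value for name, value in headers.items()}
    user_agent = normalized.get("user-agent", "")
    problems = []

    if not user_agent:
        problems.append("missing User-Agent header")

    # Assumption: only Chromium-based browsers send Sec-CH-UA client hints,
    # and their User-Agent strings contain "Chrome/".
    client_hints = sorted(n for n in normalized if n.startswith("sec-ch-ua"))
    if client_hints and "Chrome/" not in user_agent:
        problems.append(
            "client hint headers sent with a non-Chromium User-Agent: "
            + ", ".join(client_hints)
        )
    return problems
```

Running it on a header set that combines a Firefox User-Agent with `Sec-CH-UA-Full-Version-List` returns one problem, while a consistent Chrome header set returns an empty list.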
Learn more about HTTP headers for web scraping here.
Implement CAPTCHA-Solving Services
Often, Cloudflare issues a CAPTCHA to a suspicious client if all the other methods of detection fail to produce a definite result. Cloudflare also provides Turnstile as a lightweight method of running challenges anywhere on the website without having to use Cloudflare CDN. Turnstile detects bots using non-interactive challenges issued to the browser but can resort to a simple interactive CAPTCHA (such as a checkbox), which can pose a challenge to web scrapers.
Many providers offer CAPTCHA-solving services. These services usually send the CAPTCHAs to real humans, who solve them and return the results.
Use a Fortified Headless Browser
To handle Cloudflare’s JavaScript challenges, your web scraper needs to mimic real browser behaviors, such as JavaScript execution and cookies.
To emulate a real browser, you can drive a headless browser with an automation tool like Selenium. Since Cloudflare checks for typical browsing behavior, such as scrolling, mouse movements, and clicks, tools like Selenium can simulate these behaviors so that the request appears to come from a legitimate user. Because it runs a real browser engine, a headless browser can also help with checks like Cloudflare’s canvas fingerprinting.
However, Selenium and tools like it were mainly created for automation testing and not for scraping. That means they intentionally expose certain traits that Cloudflare can use to identify an automated browser. For example, a browser controlled by Selenium sets the JavaScript property navigator.webdriver to true.
To prevent this, a few packages help fortify headless browsers by patching these telltale traits. Common examples are undetected_chromedriver for Selenium and puppeteer-extra-plugin-stealth for Puppeteer (which can also be used with Playwright through playwright-extra).
Following is a code snippet that shows how you can use the undetected_chromedriver plugin:
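import undetected_chromedriver as uc

# Launch a patched Chrome instance
driver = uc.Chrome()
try:
    driver.get('https://example.com')
finally:
    # Always close the browser, even if navigation fails
    driver.quit()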
To make the headless browser more resilient against Cloudflare, you can pair it with a high-quality proxy service like this:
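# seleniumwire_options comes from the selenium-wire package, which wraps undetected_chromedriver
import seleniumwire.undetected_chromedriver as uc

chrome_options = uc.ChromeOptions()
proxy_options = {
    'proxy': {
        'http': 'HTTP_PROXY_URL',
        'https': 'HTTPS_PROXY_URL'
    }
}
driver = uc.Chrome(
    options=chrome_options,
    seleniumwire_options=proxy_options
)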
Keep in mind that browsers are updated frequently, and each update can introduce new traits that give away headless browsers. Cloudflare also regularly updates its detection to exploit them. As a result, these plugins need ongoing maintenance and may stop working if they fall behind.
Use Cloudflare Solvers
Using a dedicated Cloudflare solver service can often produce good results against basic Cloudflare protection (at least for a time). Many Cloudflare-solving tools employ various methods to bypass Cloudflare.
For example, cloudscraper is a Python module that uses a JavaScript engine to solve Cloudflare’s JavaScript challenges, so the client appears to support JavaScript. However, at the time of writing, it hasn’t been updated in over a year, and it might not work against the latest Cloudflare updates.
Advanced Techniques
As mentioned previously, Cloudflare uses a variety of methods to detect bots. More often than not, if used on their own, the methods described earlier will perform poorly against Cloudflare. For increased chances of evading Cloudflare, it’s recommended that you use a mix of different methods to emulate a real user as much as possible.
For example, you can use a fortified headless browser to evade headless browser detection. Then, you can also simulate mouse movement in a B-spline curve to mimic the mouse movements of a human. You can pair the whole setup with a residential proxy with rotation to combat IP bans and avoid suspicion. For extra security, you can throw in a tool like Hazetunnel that can mimic the fingerprint of a real browser based on the passed user agent. Together with a CAPTCHA solver service, this setup can help you bypass most of the Cloudflare detections.
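The mouse movement part of such a setup can be sketched in plain Python. The example below generates points along a uniform cubic B-spline between two screen positions, using a few randomly offset waypoints as control points. The function names, the jitter size, and the number of waypoints are arbitrary choices for illustration, and a curved path alone is not guaranteed to pass behavior analysis.

```python
import random


def _bspline_point(p0, p1, p2, p3, t):
    """Evaluate one uniform cubic B-spline segment at t in [0, 1]."""
    b0 = (1 - t) ** 3 / 6
    b1 = (3 * t**3 - 6 * t**2 + 4) / 6
    b2 = (-3 * t**3 + 3 * t**2 + 3 * t + 1) / 6
    b3 = t**3 / 6
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def mouse_path(start, end, waypoints=3, samples_per_segment=8, jitter=40.0, seed=None):
    """Return a list of (x, y) points forming a curved path from start to end."""
    rng = random.Random(seed)

    # Place waypoints along the straight line, then nudge each one randomly.
    middle = []
    for i in range(1, waypoints + 1):
        fraction = i / (waypoints + 1)
        x = start[0] + fraction * (end[0] - start[0]) + rng.uniform(-jitter, jitter)
        y = start[1] + fraction * (end[1] - start[1]) + rng.uniform(-jitter, jitter)
        middle.append((x, y))

    # Repeating the endpoints three times makes the curve start and end on them.
    control = [start] * 3 + middle + [end] * 3

    path = []
    for i in range(len(control) - 3):
        segment = control[i : i + 4]
        for step in range(samples_per_segment):
            path.append(_bspline_point(*segment, step / samples_per_segment))
    path.append(_bspline_point(*control[-4:], 1.0))
    return path
```

Each point in the returned list can then be passed to your browser automation tool's mouse API, for example Playwright's `page.mouse.move(x, y)`, with a short random pause between moves.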
Incorporate Bright Data Solutions
As you may have noticed, avoiding Cloudflare’s bot detection is a complicated task that isn’t guaranteed. Bypassing Cloudflare forces you to spend a lot of time tinkering with tools rather than focusing on writing the scraper.
Bright Data provides a collection of tools to help bypass Cloudflare easily. One of these tools is Web Unlocker, which is Bright Data’s solution to bypass various anti-bot detections, including Cloudflare. With a 99.99 percent success rate, Web Unlocker uses AI to automatically detect and solve website-blocking techniques in real time. It uses techniques like browser fingerprinting, CAPTCHA solving, IP rotations, and request retries to unlock Cloudflare-protected websites.
Web Unlocker automatically uses the best proxy services for your request and manages them for a seamless experience.
The developer experience with Web Unlocker is easy. When you sign up for Web Unlocker, you’ll get access to the proxy details and credentials. Simply use it like any other proxy server:
import requests

host = 'brd.superproxy.io'
port = 22225
username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

proxy_url = f'http://{username}:{password}@{host}:{port}'
proxies = {
    'http': proxy_url,
    'https': proxy_url
}

url = "http://lumtest.com/myip.json"
response = requests.get(url, proxies=proxies)
print(response.json())
Another tool to help you bypass Cloudflare is the Bright Data Scraping Browser, which is another of Bright Data’s proxy-unlocking solutions. The Scraping Browser runs your code on a remote browser hosted by Bright Data, and it utilizes multiple proxy networks and seamlessly handles unlocking Cloudflare-protected sites.
The Scraping Browser can be integrated with Puppeteer, Selenium, and Playwright, and it offers a full headless browser experience.
Conclusion
Cloudflare is a critical part of the internet with its WAF offering. On the one hand, it protects websites against legitimate threats, but on the other hand, it also blocks harmless web scrapers. In this article, you learned how Cloudflare’s bot detection works and how to bypass it.
Evading Cloudflare can be complicated, and it comes with varying degrees of success. Instead of duct-taping a solution, consider using Bright Data’s offerings, such as the Web Unlocker, Scraping Browser, and Web Scraper API. With only a few lines of code, you get a higher rate of success without needing to worry about managing complex solutions.