cloudscraper in Python: Step-by-Step Guide

Learn how to use cloudscraper in Python to bypass Cloudflare’s protection, handle errors, and explore alternative solutions for anti-bot systems.

In this tutorial, you’ll learn how to use the cloudscraper Python library to bypass Cloudflare’s bot detection, handle common errors, and explore alternative scraping solutions for the most robust anti-bot protections.

How to Use cloudscraper in Python

In this tutorial, you’ll attempt to scrape data from a website protected by Cloudflare, both with and without the cloudscraper library. To do so, you’ll use the Beautiful Soup and Requests packages. If you’re not familiar with these packages, check out this Python web scraping guide to learn more.

To start, install the required packages by running the following pip command:

pip install tqdm==4.66.5 requests==2.32.3 beautifulsoup4==4.12.3

To make this tutorial easier to follow, the following web scraper has been created to scrape metadata from news articles published on a particular day on the ChannelsTV website:

import requests
from bs4 import BeautifulSoup
from datetime import datetime
from tqdm.auto import tqdm

def extract_article_data(article_source, headers):
    # Fetch a single article and extract its metadata into a dictionary
    response = requests.get(article_source, headers=headers)
    if response.status_code != 200:
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    title = soup.find(class_="post-title display-3").text.strip()

    date = soup.find(class_="post-meta_time").text.strip()
    date_object = datetime.strptime(date, 'Updated %B %d, %Y').date()

    categories = [category.text.strip() for category in soup.find('nav', {"aria-label": "breadcrumb"}).find_all('li')]

    tags = [tag.text.strip() for tag in soup.find("div", class_="tags").find_all("a")]

    article_data = {
        'date': date_object,
        'title': title,
        'link': article_source,
        'tags': tags,
        'categories': categories
    }

    return article_data

def process_page(articles, headers):
    # Extract metadata from every article linked on a listing page
    page_data = []
    for article in tqdm(articles):
        url = article.find('a', href=True).get('href')
        if "https://" not in url:
            continue
        article_data = extract_article_data(url, headers)
        if article_data:
            page_data.append(article_data)
    return page_data

def scrape_articles_per_day(base_url, headers):
    # Walk through the day's paginated listing until a page has no articles
    day_data = []
    page = 1

    while True:
        page_url = f"{base_url}/page/{page}"
        response = requests.get(page_url, headers=headers)

        if not response or response.status_code != 200:
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        articles = soup.find_all('article')

        if not articles:
            break
        page_data = process_page(articles, headers)
        day_data.extend(page_data)

        page += 1

    return day_data

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

URL = "https://www.channelstv.com/2024/08/01/"

scraped_articles = scrape_articles_per_day(URL, headers)
print(f"{len(scraped_articles)} articles were scraped.")
print("Samples:")
print(scraped_articles[:2])

In this code, three functions are defined to facilitate the scraping process. The first one, called extract_article_data, loads data from an individual article and extracts metadata, like its publishing date, title, tags, and categories, into a Python dictionary, which is then returned. The loading and extraction steps are implemented using the Requests and Beautiful Soup Python libraries.

The second function, process_page, gets the links to all the articles on a particular page, calls the extract_article_data function to extract their data, and stores each metadata dictionary in a list, which is then returned. The last function, named scrape_articles_per_day, uses a while loop to increment the page number and scrape the article data on each page until it reaches a page that does not exist.

Next, you define the URL to be scraped, which filters articles to those published on August 1, 2024, and a headers variable containing a sample user agent. You call the scrape_articles_per_day function and pass in the URL and headers variables. Then, you print the number of scraped articles and the first two results.

Ideally, this scraper would work, but it doesn’t because the ChannelsTV website uses Cloudflare to prevent you from accessing the content of its web pages via the direct requests implemented in the extract_article_data and scrape_articles_per_day functions.

When you try to run this script, your output looks like this:

0 articles were scraped.
Samples:
[]

Incorporate cloudscraper

Earlier, when you tried to scrape metadata from the articles, nothing was returned because of Cloudflare’s protection. In this section, you’ll install and use cloudscraper to get around it.

To get started, begin by installing the cloudscraper library by running the following pip command:

pip install cloudscraper==1.2.71

Then, import the package and define a fetch_html_content function like this:

import cloudscraper

def fetch_html_content(url, headers):
    try:
        scraper = cloudscraper.create_scraper()
        response = scraper.get(url, headers=headers)

        if response.status_code == 200:
            return response
        else:
            print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
            return None
    except Exception as e:
        print(f"An error occurred while fetching URL: {url}. Error: {str(e)}")
        return None

This function takes the URL to be scraped and the request headers as input parameters and returns either a response object or None. Within the function, you define a try-except block. In the try block, you create a scraper using the cloudscraper.create_scraper method. Next, you call the scraper.get method and pass in the url and headers variables. If the status code of your response is 200, you return the response. Otherwise, you print an error message and return None. Similarly, if an error occurs in the try block, the except block is triggered, an appropriate message is printed, and None is returned.
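
Before wiring the function into the rest of the scraper, you can run a quick sanity check against the listing page, reusing the URL and headers variables defined earlier:

response = fetch_html_content(URL, headers)
if response:
    print(f"Fetched {URL} with status code {response.status_code}")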

Following this, you replace every requests.get call in your script with this fetch_html_content function. First, make the replacement in your extract_article_data function like this:

def extract_article_data(article_source, headers):
    response = fetch_html_content(article_source, headers)

Then, replace the requests.get call in your scrape_articles_per_day function like this:

def scrape_articles_per_day(base_url, headers):
    day_data = []
    page = 1

    while True:
        page_url = f"{base_url}/page/{page}" 
        response = fetch_html_content(page_url, headers)

With this function in place, the cloudscraper library handles Cloudflare’s checks for you, and the rest of the scraper stays the same.

When you run the code, your output looks like this:

Failed to fetch URL: https://www.channelstv.com/2024/08/01//page/5. Status code: 404
55 articles were scraped.
Samples:
[{'date': datetime.date(2024, 8, 1),
  'title': 'Resilience, Tear Gas, Looting, Curfew As #EndBadGovernance Protests Hold',
  'link': 'https://www.channelstv.com/2024/08/01/tear-gas-resilience-looting-curfew-as-endbadgovernance-protests-hold/',
  'tags': ['Eagle Square', 'Hunger', 'Looting', 'MKO Abiola Park', 'violence'],
  'categories': ['Headlines']},
 {'date': datetime.date(2024, 8, 1),
  'title': "Mother Of Russian Artist Freed In Prisoner Swap Waiting To 'Hug' Her",
  'link': 'https://www.channelstv.com/2024/08/01/mother-of-russian-artist-freed-in-prisoner-swap-waiting-to-hug-her/',
  'tags': ['Prisoner Swap', 'Russia'],
  'categories': ['World News']}]

Additional cloudscraper Features

As you can see, cloudscraper can help you get past Cloudflare’s I’m Under Attack Mode (IUAM) protection, but it also has other features worth highlighting.

Using Proxies

Proxies serve as intermediary servers between your computer and target sites, enabling you to be more anonymous as you explore the internet. Your requests are routed through them so that target websites, like Cloudflare-protected sites, see the proxy server as the source of the traffic and not your device.

With cloudscraper, you can define proxies and pass them to your already created cloudscraper object like this:

scraper = cloudscraper.create_scraper()

proxy = {
    'http': 'http://your-proxy-ip:port',
    'https': 'https://your-proxy-ip:port'
}

response = scraper.get(URL, proxies=proxy)

Here, you define a scraper object with default values. Then, you define a proxy dictionary with http and https proxies. Finally, you pass the proxy dictionary to the scraper.get method as you would with a regular requests.get call.
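
If your proxy requires authentication, you can typically embed the credentials in the proxy URL using the same user:password@host:port format that Requests supports (the values below are placeholders):

proxy = {
    'http': 'http://your-username:your-password@your-proxy-ip:port',
    'https': 'http://your-username:your-password@your-proxy-ip:port'
}

response = scraper.get(URL, proxies=proxy)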

Changing the User Agent and JavaScript Interpreter

While you directly specified a user agent in the previous script, the cloudscraper library can also autogenerate user agents. This reduces the manual configuration necessary during scripting and allows you to mimic real users with different browser identities. The selection is random by default, but you can control the kind of user agents it samples from by passing a browser parameter to the cloudscraper.create_scraper method. This parameter takes a dictionary with string values for the browser and platform and boolean values for desktop and mobile.
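
For example, because cloudscraper extends requests.Session, you can inspect the user agent it generated for a session by reading the session headers (a minimal sketch):

import cloudscraper

scraper = cloudscraper.create_scraper(
    browser={
        "browser": "firefox",
        "platform": "windows",
        "mobile": False,
    }
)

# The autogenerated user agent is stored on the session headers
print(scraper.headers["User-Agent"])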

cloudscraper also lets you specify the JavaScript interpreter and engine you use with your scraper. The default is a native solver created by the cloudscraper team. Other available options are Node.js, Js2Py, ChakraCore, and v8eval.

Here’s a sample snippet showing the specifications of an interpreter and browser:

scraper = cloudscraper.create_scraper(
    interpreter="nodejs",
    browser={
        "browser": "chrome",
        "platform": "ios",
        "desktop": False,
    }
)

Here, you set the interpreter to "nodejs" and pass a dictionary to the browser parameter. Within this dictionary, the browser is set to "chrome" and the platform is set to "ios". The desktop parameter is set to False, implying that the browser runs on mobile, since the mobile and desktop values are True by default. In this case, cloudscraper samples mobile iOS user agents for the Chrome browser.

Handling CAPTCHAs

CAPTCHAs are designed to tell humans and bots apart, and they can often prevent your target web page from loading when you’re scraping. One of the benefits of cloudscraper is that it supports several third-party CAPTCHA solvers that handle reCAPTCHA, hCaptcha, and more. If there are other third-party CAPTCHA solvers you’re interested in, you can suggest them to the cloudscraper team via GitHub support tickets.

The following snippet shows you how to modify your scraper to handle CAPTCHA:

scraper = cloudscraper.create_scraper(
  captcha={
    'provider': 'capsolver',
    'api_key': 'your_capsolver_api_key'
  }
)

In this code, you specify your CAPTCHA provider as Capsolver along with your Capsolver API key. Both values are stored in a dictionary and passed to the captcha parameter of the cloudscraper.create_scraper method.

Common cloudscraper Errors

While cloudscraper is an easy way to work around Cloudflare restrictions, you may encounter a few errors as you begin to use it. Following are some of the most common errors (and solutions) you may run into.

ModuleNotFoundError

ModuleNotFoundError is a common Python error that occurs when you try to import or use a library that does not exist in your Python environment.

When working in Python, you operate within an environment, and only the libraries installed in that active environment are accessible to your script or notebook. This error implies that you have either not activated the relevant (virtual) environment or not installed the package in it.
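
If you haven’t created a virtual environment yet, you can create one first with Python’s built-in venv module:

python -m venv <venv-name>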

To activate your virtual environment in Windows, run the following command:

<venv-name>\Scripts\activate.bat

If you’re working with Linux or macOS, you can use the following command:

source <venv-name>/bin/activate

If the package is not installed at all, install it by running the following command:

pip install cloudscraper

cloudscraper can’t bypass the latest Cloudflare version

The cloudscraper can't bypass the latest Cloudflare version error occurs when the version of cloudscraper you’re using was designed to bypass an older version of Cloudflare. Newer Cloudflare releases may include changes that older versions of cloudscraper can’t handle until the Python library itself is updated.

If you’re running an older version of cloudscraper, it’s best to upgrade your package with the following command:

pip install -U cloudscraper
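
To confirm which version you have installed before and after upgrading, you can check with pip:

pip show cloudscraper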

In cases where you’re already using the most recent version of cloudscraper, you may need to wait for an update or find an alternative solution that works.

An Alternative to cloudscraper

If, after implementing what you’ve learned here, you’re still having trouble bypassing Cloudflare’s protection, you should consider using Bright Data.

Bright Data has one of the largest proxy networks, including data center, ISP, mobile, and residential proxies. With these proxies serving as the intermediary, you can avoid IP blocking, boost performance, get around geographical restrictions, and protect your privacy.

To bypass Cloudflare protection using Bright Data, all you have to do is create an account, configure it, and get your API credentials. Then, you can use those credentials to access the data at your target URL like this:

import requests

host = 'brd.superproxy.io'
port = 22225

username = 'brd-customer-<customer_id>-zone-<zone_name>'
password = '<zone_password>'

proxy_url = f'http://{username}:{password}@{host}:{port}'

proxies = {
    'http': proxy_url,
    'https': proxy_url
}

response = requests.get(URL, proxies=proxies)

Here, you make a GET request with the Python Requests library and pass in proxies via the proxies parameter. The proxies created use your Bright Data username, password, host, and port number. Your username, in particular, is defined based on your Bright Data customer ID and zone name—all of which can be retrieved from your account.
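
If you want to slot this into the earlier scraper, one option is to adapt the fetch_html_content function so it routes requests through the proxy instead of cloudscraper. The following is a sketch that reuses the proxies dictionary defined above, with error handling mirroring the earlier version:

def fetch_html_content(url, headers, proxies):
    try:
        # Route the request through the proxy instead of cloudscraper
        response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
        if response.status_code == 200:
            return response
        print(f"Failed to fetch URL: {url}. Status code: {response.status_code}")
        return None
    except requests.RequestException as e:
        print(f"An error occurred while fetching URL: {url}. Error: {e}")
        return None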

Conclusion

In this tutorial, you learned how to use the cloudscraper library in Python to scrape Cloudflare-protected websites. You also learned about some common errors you may encounter and how to resolve them. While cloudscraper can be a great way to circumvent Cloudflare’s IUAM, as with any free technology, it has its limits. That’s why you also learned how to use the Bright Data proxy network to access Cloudflare-protected sites.

Bright Data provides you with automated tools that allow you to access data on the internet without any restrictions. You can also use its large proxy network to reduce the number of failed requests if automation is not your goal.

Ready to take your web scraping to the next level? Discover how our premium proxies and expert web data collection services can easily bypass even the toughest bot protections. Start with a free trial today!
