Using curl_cffi for Web Scraping in Python

Discover how curl_cffi empowers stealthy and efficient Python web scraping by mimicking real browser TLS fingerprints.

In this guide, you will learn:

  • What curl_cffi is and the features it offers
  • How it minimizes TLS fingerprint-based bot detection
  • How to use it with Python for web scraping
  • Advanced usage and methods
  • A comparison with similar HTTP clients

Let’s dive in!

What Is curl_cffi?

curl_cffi is a library that provides Python bindings for the curl-impersonate fork via CFFI. In other words, it is an HTTP client capable of impersonating browser TLS/JA3/HTTP2 fingerprints. This makes the library an excellent solution for bypassing anti-bot blocks based on TLS fingerprinting.

⚙️ Features

  • Supports JA3/TLS and HTTP2 fingerprint impersonation, including recent browsers and custom fingerprints
  • Much faster than requests and httpx, on par with aiohttp
  • Mimics the requests API
  • Supports asyncio for asynchronous HTTP requests
  • Supports proxy rotation on each request
  • Supports HTTP/2.0
  • Supports WebSockets

How It Works

curl_cffi is built on cURL Impersonate, a library that generates TLS fingerprints matching real-world browsers.

When you send an HTTPS request, a TLS handshake occurs, producing a unique TLS fingerprint. Since HTTP clients differ from browsers, their fingerprints can expose automation, triggering anti-bot defenses.

cURL Impersonate modifies cURL to match real browser TLS fingerprints:

  • TLS library tweaks: Rely on the TLS libraries used by real browsers instead of cURL’s default one.
  • Configuration changes: Adjust TLS extensions and SSL options to mimic browsers.
  • HTTP/2 customization: Match browser handshake settings.
  • Non-default cURL flags: Set --ciphers, --curves, and custom headers for accuracy.

This makes the requests appear browser-like, helping bypass bot detection. For more information, refer to our guide on cURL Impersonate.
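
To see the difference in practice, you can send the same request with and without impersonation to a TLS fingerprinting service and compare the resulting JA3 hashes. The snippet below is a minimal sketch that assumes https://tls.browserleaks.com/json as the fingerprint-echo endpoint and a ja3_hash field in its JSON response; any similar service works (it also assumes curl_cffi is already installed, as covered in the steps below):

from curl_cffi import requests

# Assumed fingerprint-echo endpoint that returns your TLS (JA3) fingerprint as JSON
FINGERPRINT_URL = "https://tls.browserleaks.com/json"

# Fingerprint produced by the default curl/TLS stack
plain = requests.get(FINGERPRINT_URL)
# Fingerprint produced while impersonating Chrome
impersonated = requests.get(FINGERPRINT_URL, impersonate="chrome")

print("Without impersonation:", plain.json().get("ja3_hash"))
print("With impersonation:", impersonated.json().get("ja3_hash"))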

How to Use curl_cffi for Web Scraping: Step-By-Step Guide

Suppose your goal is to scrape the “Keyboard” page from Walmart:

The Walmart “Keyboard” product page

If you try to access this page using any HTTP client, you will receive the following error page:

Note the response from the server

Do not be misled by the 200 OK response status. The page returned by Walmart’s server is actually a bot detection page. It specifically asks you to verify whether you are human with a CAPTCHA challenge.

You might wonder: how is this possible even if you set the User-Agent to simulate a real browser? The answer is TLS fingerprinting!

Now, let’s see how to use curl_cffi to avoid anti-bot measures and perform web scraping with ease.

Step #1: Project Setup

First, make sure that you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the installation instructions.

Then, create a directory for your curl_cffi scraping project using this command:

mkdir curl-cffi-scraper

Navigate into that directory and set up a virtual environment inside it:

cd curl-cffi-scraper
python -m venv env

Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both valid choices.

Now, create a scraper.py file inside the project folder. It will be empty at first, but you will soon add the scraping logic to it.

In your IDE’s terminal, activate the virtual environment. On Linux or macOS, use:

source env/bin/activate

Equivalently, on Windows, launch:

env\Scripts\activate

Amazing! You are all set up and ready to go.

Step #2: Install curl_cffi

In an activated virtual environment, install the HTTP client via the curl-cffi pip package:

pip install curl-cffi

Behind the scenes, this library automatically downloads the curl impersonation binaries for Windows, macOS, and Linux.

Step #3: Connect to the Target Page

Import requests from curl_cffi:

from curl_cffi import requests

This object exposes a high-level API that is similar to that of the Python Requests library.

You can use it to perform a GET HTTP request to the target page as follows:

response = requests.get("https://www.walmart.com/search?q=keyboard", impersonate="chrome")

The impersonate="chrome" argument tells curl_cffi to make the HTTP request look like it is coming from the latest version of Chrome. As a result, Walmart will treat the automated request as a regular browser request, returning the standard web page instead of an anti-bot page.
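
Note that impersonation composes with the other requests-style options. As a hedged example, the snippet below adds an Accept-Language header and a timeout alongside impersonate="chrome"; both values are arbitrary and only for illustration:

response = requests.get(
    "https://www.walmart.com/search?q=keyboard",
    impersonate="chrome",
    # Example header, adjust to your needs
    headers={"Accept-Language": "en-US,en;q=0.9"},
    # Give up if the server takes longer than 30 seconds to respond
    timeout=30,
)
print(response.status_code)  # 200 on success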

You can access the HTML content of the target page with:

html = response.text

If you print html, you will see:

<!DOCTYPE html>
<html lang="en-US">
   <head>
      <meta charSet="utf-8"/>
      <meta property="fb:app_id" content="105223049547814"/>
      <meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1, interactive-widget=resizes-content"/>
      <link rel="dns-prefetch" href="https://tap.walmart.com "/>
      <link rel="preload" fetchpriority="high" crossorigin="anonymous" href="https://i5.walmartimages.com/dfw/63fd9f59-a78c/fcfae9b6-2f69-4f89-beed-f0eeb4237946/v1/BogleWeb_subset-Bold.woff2" as="font" type="font/woff2"/>
      <link rel="preload" fetchpriority="high" crossorigin="anonymous" href="https://i5.walmartimages.com/dfw/63fd9f59-a78c/fcfae9b6-2f69-4f89-beed-f0eeb4237946/v1/BogleWeb_subset-Regular.woff2" as="font" type="font/woff2"/>
      <link rel="preconnect" href="https://beacon.walmart.com"/>
      <link rel="preconnect" href="https://b.wal.co"/>
      <title>Electronics - Walmart.com</title>
      <!-- omitted for brevity ... -->

Great! That is the HTML of the regular Walmart “keyboard” product page.

Step #4: Add the Data Scraping Logic

curl_cffi is just an HTTP client that helps you retrieve the HTML of a page. If you want to perform web scraping, you will also need a library for HTML parsing like BeautifulSoup. For more guidance, refer to our guide on BeautifulSoup web scraping.

In the activated virtual environment, install BeautifulSoup:

pip install beautifulsoup4

Import it in scraper.py:

from bs4 import BeautifulSoup

Then, use it to parse the HTML of the page:

soup = BeautifulSoup(response.text, "html.parser")

"html.parser" is the default HTML parser from Python’s standard library used by BeautifulSoup for parsing the HTML string. Now, soup contains all the methods you need to select HTML elements on the page and extract data from them.

In this example, since data parsing is not the main focus, we will scrape only the page title. You can select it by tag name using the find() method and then access its text through the text attribute:

title_element = soup.find("title")
title = title_element.text

For more advanced scraping logic, refer to our guide on how to scrape Walmart.

Finally, print the page title:

print(title)

Awesome! You implemented basic web scraping logic.

Step #5: Put It All Together

This is your final curl_cffi web scraping script:

from curl_cffi import requests
from bs4 import BeautifulSoup

# Send a GET request to the Walmart search page for "keyboard"
response = requests.get("https://www.walmart.com/search?q=keyboard", impersonate="chrome")

# Extract the HTML from the page
html = response.text

# Parse the HTML content with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Find the title tag
title_element = soup.find("title")
# Extract its text
title = title_element.text

# More complex scraping logic...

# Print the scraped data
print(title)

Launch it with the following command:

python3 scraper.py

Or, equivalently, on Windows:

python scraper.py

The result will be:

Electronics - Walmart.com

If you remove the impersonate="chrome" argument, you will get instead:

Robot or human?

This demonstrates how browser impersonation makes all the difference when it comes to avoiding anti-scraping measures.
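
If you want the script to fail loudly instead of silently printing the bot page title, you can add a simple guard based on the title observed above. This is just a minimal sketch:

# Minimal check: abort if Walmart served its bot-detection page instead of the results
if "Robot or human" in title:
    raise RuntimeError("Blocked by bot detection: try another impersonation profile or a proxy")

print(title)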

Mission complete!

curl_cffi: Advanced Usage

Now that you know how the library works, you are ready to explore some more advanced scenarios.

Browser Impersonation Selection

curl_cffi supports impersonating several browsers. Each browser is associated with a unique label that you can pass to the impersonate argument as below:

response = requests.get("<YOUR_URL>", impersonate="<BROWSER_LABEL>")

Here are the labels for the supported browsers:

  • chrome99, chrome100, chrome101, chrome104, chrome107, chrome110, chrome116, chrome119, chrome120, chrome123, chrome124, chrome131
  • chrome99_android, chrome131_android
  • edge99, edge101
  • safari15_3, safari15_5, safari17_0, safari17_2_ios, safari18_0, safari18_0_ios

Notes:

  1. To always impersonate the latest browser versions, you can simply use chrome, safari, and safari_ios.
  2. Firefox is currently not available, as only Chromium-based browsers and Safari are supported.
  3. Browser versions are added only when their fingerprints change. If a version, such as chrome122, is skipped, you can still impersonate it by using the headers of the previous version.
  4. For non-browser targets, use the ja3, akamai, and similar arguments to specify your own custom TLS fingerprints. For details, refer to the documentation on impersonation.
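
In practice, you may want to try more than one profile. The sketch below attempts a couple of labels from the list above and stops at the first one that returns a successful response; the 200 status check is just an illustrative heuristic:

# Try impersonation profiles until one returns a successful response
for label in ["chrome124", "safari18_0"]:
    response = requests.get("<YOUR_URL>", impersonate=label)
    if response.status_code == 200:
        print(f"Success with {label}")
        break
else:
    print("All impersonation profiles were blocked")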

Session Management

Just like the requests library, curl_cffi supports sessions. Session objects allow you to persist certain parameters across multiple requests, such as cookies, headers, or other session-specific data.

This is how you can define a session using the Python bindings for the cURL Impersonate library:

# Create a new session
session = requests.Session()

# This endpoint sets a cookie in the response
session.get("https://httpbin.org/cookies/set/userId/5", impersonate="chrome")

# Print the session's cookies to confirm they are being stored
print(session.cookies)

The output of the above script will be:

<Cookies[<Cookie userId=5 for httpbin.org />]>

The result proves that the session is maintaining state across requests, such as storing cookies defined by the server.
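
As a quick follow-up, you can confirm that the stored cookie is automatically attached to later requests made through the same session. This sketch reuses httpbin’s /cookies endpoint, which echoes the cookies it receives:

# The cookie set above is sent automatically on subsequent requests in the same session
response = session.get("https://httpbin.org/cookies", impersonate="chrome")
print(response.json())  # expected: {'cookies': {'userId': '5'}}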

Proxy Integration

Just like the requests library, curl_cffi supports proxy integration through a proxies object:

# Define your proxy URL
proxy = "YOUR_PROXY_URL"

# Create a dictionary of proxies for HTTP and HTTPS
proxies = {"http": proxy, "https": proxy}

# Make a request using a proxy and browser impersonation
response = requests.get("<YOUR_URL>", impersonate="chrome", proxies=proxies)

Since the underlying API is very similar to that of requests, refer to our guide on how to use a proxy in Requests.
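
Because the proxies dictionary is passed per request, you can also rotate through a pool of proxies so that each request exits from a different IP. Below is a minimal sketch in which the proxy URLs are placeholders:

import random

# Placeholder pool of proxy URLs: replace with your own proxies
proxy_pool = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

# Pick a different proxy for each request
proxy = random.choice(proxy_pool)
proxies = {"http": proxy, "https": proxy}

response = requests.get("<YOUR_URL>", impersonate="chrome", proxies=proxies)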

Async API

curl_cffi supports asynchronous requests through asyncio via the AsyncSession object:

from curl_cffi.requests import AsyncSession
import asyncio

# Define an async function to execute the asynchronous code
async def fetch_data():
    async with AsyncSession() as session:
        # Perform the asynchronous GET request
        response = await session.get("https://httpbin.org/anything", impersonate="chrome")
        # Print the response text
        print(response.text)

# Run the async function
asyncio.run(fetch_data())

Using AsyncSession makes it easier to handle multiple asynchronous requests efficiently, which is vital for speeding up web scraping.
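
For example, the sketch below fires several requests concurrently with asyncio.gather; the URLs are placeholders pointing at httpbin:

from curl_cffi.requests import AsyncSession
import asyncio

# Placeholder list of pages to fetch concurrently
urls = [
    "https://httpbin.org/anything?page=1",
    "https://httpbin.org/anything?page=2",
    "https://httpbin.org/anything?page=3",
]

async def fetch_all():
    async with AsyncSession() as session:
        # Schedule all GET requests and await them concurrently
        tasks = [session.get(url, impersonate="chrome") for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            print(response.status_code)

asyncio.run(fetch_all())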

WebSockets Connection

curl_cffi also supports WebSockets through the WebSocket class:

from curl_cffi.requests import WebSocket


# Define a callback function to handle incoming messages
def on_message(ws, message):
    print(message)

# Initialize the WebSocket connection with the callback
ws = WebSocket(on_message=on_message)

# Connect to a sample WebSocket server and listen for messages
ws.run_forever("wss://api.gemini.com/v1/marketdata/BTCUSD")

This is particularly useful for scraping real-time data from sites or APIs that use WebSocket to populate data dynamically. Some examples are sites with financial market data, live sports scores, or live chats.

Instead of scraping rendered pages, you can directly target the WebSocket channel for efficient data retrieval.

Note: You can use WebSockets asynchronously thanks to the AsyncWebSocket class.

curl_cffi vs Requests vs AIOHTTP vs HTTPX for Web Scraping

Below is a summary table to compare curl_cffi with other popular Python HTTP clients for web scraping:

| Feature | curl_cffi | Requests | AIOHTTP | HTTPX |
| --- | --- | --- | --- | --- |
| Sync API | ✔️ | ✔️ | ❌ | ✔️ |
| Async API | ✔️ | ❌ | ✔️ | ✔️ |
| Support for WebSockets | ✔️ | ❌ | ✔️ | ❌ |
| Connection pooling | ✔️ | ✔️ | ✔️ | ✔️ |
| Support for HTTP/2 | ✔️ | ❌ | ❌ | ✔️ |
| User-Agent customization | ✔️ | ✔️ | ✔️ | ✔️ |
| TLS fingerprint spoofing | ✔️ | ❌ | ❌ | ❌ |
| Speed | High | Medium | High | Medium |
| Retry mechanism | ❌ | Available via HTTPAdapters | Available only via a third-party library | Available via built-in Transports |
| Proxy integration | ✔️ | ✔️ | ✔️ | ✔️ |
| Cookie handling | ✔️ | ✔️ | ✔️ | ✔️ |

curl_cffi Alternatives for Web Scraping

curl_cffi involves a manual approach to web scraping, where you need to write most of the code yourself. While that works for simple static websites, it quickly becomes challenging when targeting dynamic or more heavily protected sites.

Bright Data provides a range of curl_cffi alternatives for web scraping:

  • Scraping Browser API: Fully managed cloud browser instances integrated with Puppeteer, Selenium, and Playwright. These browsers offer built-in CAPTCHA solving and automated proxy rotation, bypassing anti-bot defenses while interacting with websites like real users.
  • Web Scraper APIs: Pre-configured endpoints for retrieving fresh, structured data from over 100 popular domains. These APIs are ethical and compliant, allowing easy data extraction using HTTPX or any other HTTP client.
  • No-Code Scraper: An intuitive, on-demand data collection service that eliminates coding. It offers control, scalability, and flexibility without dealing with infrastructure, proxies, or anti-scraping hurdles.
  • Datasets: Access pre-built datasets from various websites or customize data collections to fit your requirements.

These solutions simplify scraping by offering robust, scalable, and compliant data extraction tools that reduce manual effort.

Conclusion

In this article, you discovered how to use the curl_cffi library for web scraping. You explored its purpose, key features, and advantages. This HTTP client excels as a fast and dependable option for making requests that mimic real browsers.

However, automated HTTP requests can expose your public IP address, potentially revealing your identity and location, which poses a privacy risk. To protect your security and anonymity, one of the most effective solutions is to use a proxy server to hide your IP address.

Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types.

Create a free Bright Data account today to test our proxies and scraping solutions!

No credit card required