Web Scraping With HTTPX in Python

Explore HTTPX, a powerful Python HTTP client for web scraping. Learn setup, features, advanced techniques, and how it compares with Requests.
13 min read

In this article, you will learn:

  • What HTTPX is and the features it offers
  • How to use HTTPX for web scraping in a guided section
  • Advanced HTTPX features for web scraping
  • A comparison of HTTPX vs. Requests for automated requests

Let’s dive in!

What Is HTTPX?

HTTPX is a fully featured HTTP client for Python 3, built on top of the httpcore library. It exposes a broadly Requests-compatible API and is designed for reliability and performance, even when making many requests. HTTPX provides both synchronous and asynchronous APIs, with support for HTTP/1.1 and HTTP/2 protocols.

⚙️ Features

  • Broadly Requests-compatible API, making it easy to adopt.
  • Standard synchronous interface, with optional async support.
  • Support for both HTTP/1.1 and HTTP/2.
  • Ability to make requests directly to WSGI and ASGI applications.
  • Strict timeouts everywhere and full type annotations.
  • Supports proxies, custom HTTP headers, custom timeouts, basic authentication, and more.

👍 Pros

  • Available from the command line using httpx[cli].
  • Packed with features, including support for HTTP/2 and an asynchronous API.
  • This project is actively developed…

👎 Cons

  • …with frequent updates that may introduce breaking changes with new releases.
  • Less popular than the requests library.

Scraping with HTTPX: Step-By-Step Guide

HTTPX is an HTTP client, meaning it helps you retrieve the raw HTML content of a page. To then parse and extract data from the HTML, you will need an HTML parser like BeautifulSoup.

Actually, HTTPX is not just any HTTP client, but one of the best Python HTTP clients for web scraping.

Follow this tutorial to learn how to use HTTPX for web scraping with BeautifulSoup!

Warning: While HTTPX is only used in the early stages of the process, we will walk you through a complete workflow. If you are interested in more advanced HTTPX scraping techniques, you can skip ahead to the next chapter after Step 3.

Step #1: Project Setup

Make sure you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the installation instructions.

Now, use the following command to create a directory for your HTTPX scraping project:

mkdir httpx-scraper

Navigate into it and initialize a virtual environment inside it:

cd httpx-scraper
python -m venv env

Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition will do.

Next, create a scraper.py file inside the project folder. Currently, scraper.py is an empty Python script, but it will soon contain the scraping logic.

In your IDE’s terminal, activate the virtual environment. On Linux or macOS, run:

source ./env/bin/activate

Equivalently, on Windows, run:

env\Scripts\activate

Amazing! You are now fully set up.

Step #2: Install the Scraping Libraries

In an activated virtual environment, install HTTPX and BeautifulSoup with the following command:

pip install httpx beautifulsoup4

This will add both the httpx and beautifulsoup4 libraries to your project’s dependencies.

Import them into your scraper.py script:

import httpx
from bs4 import BeautifulSoup

Great! You are ready to move on to the next step in your scraping workflow.

Step #3: Retrieve the HTML of the Target Page

In this example, the target page will be the “Quotes to Scrape” site:

The Quotes To Scrape homepage

Use HTTPX to retrieve the HTML of the homepage with the get() method:

# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")

Behind the scenes, HTTPX will make an HTTP GET request to the server, which will respond with the HTML of the page. You can access the HTML content using the response.text attribute:

html = response.text
print(html)

This will print the raw HTML content of the page:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <!-- omitted for brevity... -->
</body>
</html>

Terrific! Time to parse this content and extract the data you need.

Step #4: Parse the HTML

Feed the HTML content to the BeautifulSoup constructor to parse it:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

html.parser is the standard Python HTML parser that will be used to parse the content.

The soup variable now holds the parsed HTML and exposes the methods to extract the data you need.

HTTPX has done its job of retrieving the HTML, and now you are moving into the traditional data parsing phase with BeautifulSoup. For more information, refer to our tutorial on BeautifulSoup web scraping.
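Note that html.parser is not your only option: BeautifulSoup can also rely on third-party parsers. Below is a minimal sketch, assuming you have installed the faster lxml parser with pip install lxml:

import httpx
from bs4 import BeautifulSoup

# Retrieve the page and parse it with the third-party "lxml" parser
html = httpx.get("http://quotes.toscrape.com").text
soup = BeautifulSoup(html, "lxml")

# Quick sanity check on the parsed document
print(soup.title.get_text())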

Step #5: Scrape Data From It

You can scrape quotes data from the page with the following lines of code:

# Where to store the scraped data
quotes = []

# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")

# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

This snippet defines a list named quotes to store the scraped data. It then selects all quote HTML elements and iterates over them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary within the quotes list, organizing the data for further use or export.

Yes! Scraping logic implemented.

Step #6: Export the Scraped Data

Use the following logic to export the scraped data to a CSV file:

# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

    # Write the header row
    writer.writeheader()

    # Write the scraped quotes data
    writer.writerows(quotes)

This snippet opens a file named quotes.csv in write mode, defines the column headers (text, author, tags), writes the headers to the file, and then writes each dictionary from the quotes list to the CSV file. The csv.DictWriter handles the formatting, making it easy to store structured data.

Do not forget to import csv from the Python Standard Library:

import csv

Step #7: Put It All Together

Your final HTTPX web scraping script will contain:

import httpx
from bs4 import BeautifulSoup
import csv

# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")

# Access the HTML of the target page
html = response.text

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Where to store the scraped data
quotes = []

# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")

# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

    # Write the header row
    writer.writeheader()

    # Write the scraped quotes data
    writer.writerows(quotes)

Execute it with:

python scraper.py

Or, on Linux/macOS:

python3 scraper.py

A quotes.csv file will appear in the root folder of your project. Open it, and you will see:

The CSV containing the scraped data

Et voilà! You just learned how to perform web scraping with HTTPX and BeautifulSoup.

HTTPX Web Scraping Advanced Features and Techniques

Now that you know how to use HTTPX for web scraping in a basic scenario, you are ready to see it in action with more complex use cases.

In the examples below, the target site will be the HTTPBin.io /anything endpoint. This is a special API that returns the IP address, headers, and other information sent by the caller.
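As a quick warm-up, the snippet below calls the endpoint and prints what the server received. It is a minimal sketch that assumes the standard HTTPBin response fields (headers and origin):

import httpx

# Call the echo endpoint and inspect what the server saw in our request
response = httpx.get("https://httpbin.io/anything")
data = response.json()

# "headers" and "origin" are standard HTTPBin response fields
print(data["headers"])
print(data["origin"])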

Master HTTPX for web scraping!

Set Custom Headers

HTTPX allows you to specify custom headers thanks to the headers argument:

import httpx

# Custom headers for the request
headers = {
    "accept": "application/json",
    "accept-language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3"
}

# Make a GET request with custom headers
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...

Set a Custom User Agent

User-Agent is one of the most important HTTP headers for web scraping. By default, HTTPX uses the following User-Agent:

python-httpx/<VERSION>

This value can easily reveal that your requests are automated, which could lead to blocking by the target site.

To avoid that, you can set a custom User-Agent to mimic a real browser, like so:

import httpx

# Define a custom User-Agent
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
}

# Make a GET request with the custom User-Agent
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...

Discover the best user agents for web scraping!
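To make repeated requests look less uniform, you can also rotate the User-Agent on each call. Here is a minimal sketch with a small, hard-coded pool of example User-Agent strings (keep them up to date in a real project):

import random
import httpx

# Example pool of User-Agent strings to rotate through
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

# Pick a random User-Agent for this request
headers = {"user-agent": random.choice(user_agents)}
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...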

Set Cookies

Just like HTTP headers, you can set cookies in HTTPX using the cookies argument:

import httpx

# Define cookies as a dictionary
cookies = {
    "session_id": "3126hdsab161hdabg47adgb",
    "user_preferences": "dark_mode=true"
}

# Make a GET request with custom cookies
response = httpx.get("https://httpbin.io/anything", cookies=cookies)
# Handle the response...

This gives you the ability to include session data required for your web scraping requests.

Proxy Integration

You can route your HTTPX requests through a proxy to protect your identity and avoid IP bans while performing web scraping. That is possible by using the proxy argument:

import httpx

# Replace with the URL of your proxy server
proxy = "<YOUR_PROXY_URL>"

# Make a GET request through a proxy server
response = httpx.get("https://httpbin.io/anything", proxy=proxy)
# Handle the response...
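
If your proxy requires authentication, the credentials can usually be embedded directly in the proxy URL. A minimal sketch with hypothetical placeholder values:

import httpx

# Hypothetical authenticated proxy URL (replace credentials, host, and port)
proxy = "http://<YOUR_USERNAME>:<YOUR_PASSWORD>@<PROXY_HOST>:<PROXY_PORT>"

# Make a GET request through the authenticated proxy server
response = httpx.get("https://httpbin.io/anything", proxy=proxy)
# Handle the response...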

Find out more in our guide on how to use HTTPX with a proxy.

Error Handling

By default, HTTPX raises errors only for connection or network issues. To raise exceptions also for HTTP responses with 4xx and 5xx status codes, use the raise_for_status() method as below:

import httpx

try:
    response = httpx.get("https://httpbin.io/anything")
    # Raise an exception for 4xx and 5xx responses
    response.raise_for_status()
    # Handle the response...
except httpx.HTTPStatusError as e:
    # Handle HTTP status errors
    print(f"HTTP error occurred: {e}")
except httpx.RequestError as e:
    # Handle connection or network errors
    print(f"Request error occurred: {e}")

Session Handling

When using the top-level API in HTTPX, a new connection is established for every single request. In other words, TCP connections are not reused. As the number of requests to a host increases, that approach becomes inefficient.

In contrast, using a httpx.Client instance enables HTTP connection pooling. This means that multiple requests to the same host can reuse an existing TCP connection instead of creating a new one for each request.

The benefits of using a Client over the top-level API are:

  • Reduced latency across requests (avoiding repeated handshaking)
  • Lower CPU usage and fewer round-trips
  • Reduced network congestion

Additionally, Client instances support session handling with features unavailable in the top-level API, including:

  • Cookie persistence across requests.
  • Applying configuration across all outgoing requests.
  • Sending requests through HTTP proxies.

The recommended way to use a Client in HTTPX is with a context manager (with statement):

import httpx

with httpx.Client() as client:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)

Alternatively, you can manually manage the client and close the connection pool explicitly with client.close():

import httpx

client = httpx.Client()
try:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)
except httpx.HTTPError:
    # Handle the error...
    pass
finally:
    # Close the client connections and release resources
    client.close()

Note: If you are familiar with the requests library, httpx.Client() serves a similar purpose to requests.Session().
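
For instance, a Client lets you set a base URL, default headers, and a timeout once and have them applied to every outgoing request, while cookies persist across calls. A minimal sketch reusing the HTTPBin.io endpoint:

import httpx

# Shared configuration applied to every request made with this client
with httpx.Client(
    base_url="https://httpbin.io",
    headers={"accept": "application/json"},
    timeout=10.0,
) as client:
    # Both requests reuse the same connection pool, headers, and cookie jar
    first_response = client.get("/anything")
    second_response = client.get("/anything", params={"page": 2})

    print(first_response.json())
    print(second_response.json())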

Async API

By default, HTTPX exposes a standard synchronous API, but it also offers an asynchronous client for cases where one is needed. If you are working with asyncio, using an async client is essential for sending outgoing HTTP requests efficiently.

Asynchronous programming is a concurrency model that can be significantly more efficient than multi-threading for I/O-bound workloads like HTTP requests. It offers notable performance improvements and supports long-lived network connections like WebSockets. That makes it a key factor in speeding up web scraping.

To make asynchronous requests in HTTPX, you’ll need an AsyncClient. Initialize it and use it to make a GET request as shown below:

import httpx
import asyncio

async def fetch_data():
    async with httpx.AsyncClient() as client:
        # Make an async HTTP request
        response = await client.get("https://httpbin.io/anything")

        # Extract the JSON response data and print it
        response_data = response.json()
        print(response_data)

# Run the async function
asyncio.run(fetch_data())

The with statement ensures the client is automatically closed when the block ends. Alternatively, if you manage the client manually, you can close it explicitly with await client.aclose().

Remember, all HTTPX request methods (get(), post(), etc.) are asynchronous when using an AsyncClient. Therefore, you must add await before calling them to get a response.
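
Where the async API really pays off is in making several requests concurrently. The sketch below uses asyncio.gather() to fetch a few example URLs in parallel over a single AsyncClient:

import httpx
import asyncio

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # Schedule all GET requests concurrently and wait for every response
        responses = await asyncio.gather(*(client.get(url) for url in urls))
        # Extract the JSON body of each response
        return [response.json() for response in responses]

# Example URLs to fetch concurrently
urls = [
    "https://httpbin.io/anything?page=1",
    "https://httpbin.io/anything?page=2",
    "https://httpbin.io/anything?page=3",
]

# Run the async function and print how many responses came back
results = asyncio.run(fetch_all(urls))
print(len(results))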

Retry Failed Requests

Network instability during web scraping can lead to connection failures or timeouts. HTTPX simplifies handling such issues via its HTTPTransport interface. This mechanism retries requests when an httpx.ConnectError or httpx.ConnectTimeout occurs.

The following example demonstrates how to configure a transport to retry requests up to 3 times:

import httpx

# Configure transport with retry capability on connection errors or timeouts
transport = httpx.HTTPTransport(retries=3)

# Use the transport with an HTTPX client
with httpx.Client(transport=transport) as client:
    # Make a GET request
    response = client.get("https://httpbin.io/anything")
    # Handle the response...

Note that only connection-related errors trigger a retry. To handle read/write errors or specific HTTP status codes, you need to implement custom retry logic with libraries like tenacity.
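
For reference, here is a minimal sketch of such custom retry logic built with tenacity (installed via pip install tenacity). It retries on timeouts and on 4xx/5xx responses with exponential backoff, and is meant as a starting point rather than a definitive implementation:

import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry up to 3 times, with exponential backoff, on timeouts and HTTP status errors
@retry(
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.HTTPStatusError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, max=10),
)
def fetch(url):
    response = httpx.get(url)
    # Raise httpx.HTTPStatusError on 4xx/5xx responses so they can be retried
    response.raise_for_status()
    return response

response = fetch("https://httpbin.io/anything")
# Handle the response...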

HTTPX vs Requests for Web Scraping

Here is a summary table to compare HTTPX and Requests for web scraping:

| Feature | HTTPX | Requests |
| --- | --- | --- |
| GitHub stars | 8k | 52.4k |
| Async support | ✔️ | ❌ |
| Connection pooling | ✔️ | ✔️ |
| HTTP/2 support | ✔️ | ❌ |
| User-agent customization | ✔️ | ✔️ |
| Proxy support | ✔️ | ✔️ |
| Cookie handling | ✔️ | ✔️ |
| Timeouts | Customizable for connection, read, write, and pool | Customizable for connection and read |
| Retry mechanism | Available via transports | Available via HTTPAdapters |
| Performance | High | Medium |
| Community support and popularity | Growing | Large |
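
In day-to-day usage, the two libraries feel very similar: for simple requests, switching often just means changing the import. A minimal sketch, assuming requests is also installed:

import httpx
import requests

# The same GET request with both libraries
httpx_response = httpx.get("https://httpbin.io/anything", timeout=10)
requests_response = requests.get("https://httpbin.io/anything", timeout=10)

print(httpx_response.status_code, requests_response.status_code)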

Conclusion

In this article, you explored the httpx library for web scraping. You gained an understanding of what it is, what it offers, and its advantages. HTTPX is a fast and reliable option for making HTTP requests when collecting online data.

The problem is that automated HTTP requests reveal your public IP address, which can expose your identity and location. That compromises your privacy. To enhance your security and privacy, one of the most effective methods is to use a proxy server to hide your IP address.

Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types.

Create a free Bright Data account today to test our scraping solutions and proxies!

No credit card required