In this article, you will learn:
- What HTTPX is and the features it offers
- How to use HTTPX for web scraping in a guided section
- Advanced HTTPX features for web scraping
- A comparison of HTTPX vs. Requests for automated requests
Let’s dive in!
What Is HTTPX?
HTTPX is a fully featured HTTP client for Python 3, built on top of the httpcore library. It provides both synchronous and asynchronous APIs, with support for the HTTP/1.1 and HTTP/2 protocols.
⚙️ Features
- Simple and modular codebase, making it easy to contribute.
- A broadly Requests-compatible API, making migration straightforward.
- A standard synchronous API, plus an optional async client.
- Support for both HTTP/1.1 and HTTP/2.
- Strict timeouts applied everywhere by default.
- Supports proxies, custom HTTP headers, custom timeouts, basic authentication, and more.
👍 Pros
- Available from the command line using httpx[cli].
- Packed with features, including support for HTTP/2 and an asynchronous API.
- This project is actively developed…
👎 Cons
- …with frequent updates that may introduce breaking changes with new releases.
- Less popular than the requests library.
Scraping with HTTPX: Step-By-Step Guide
HTTPX is an HTTP client, meaning it helps you retrieve the raw HTML content of a page. To then parse and extract data from the HTML, you will need an HTML parser like BeautifulSoup.
Actually, HTTPX is not just any HTTP client, but one of the best Python HTTP clients for web scraping.
Follow this tutorial to learn how to use HTTPX for web scraping with BeautifulSoup!
Warning: While HTTPX is only used in the early stages of the process, we will walk you through a complete workflow. If you are interested in more advanced HTTPX scraping techniques, you can skip ahead to the next chapter after Step 3.
Step #1: Project Setup
Make sure you have Python 3+ installed on your machine. Otherwise, download it from the official site and follow the installation instructions.
Now, use the following command to create a directory for your HTTPX scraping project:
mkdir httpx-scraper
Navigate into it and initialize a virtual environment inside it:
cd httpx-scraper
python -m venv env
Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition will do.
Next, create a scraper.py
file inside the project folder. Currently, scraper.py
is an empty Python script but it will soon contain the scraping logic.
In your IDE’s terminal, activate the virtual environment. On Linux or macOS, run:
source ./env/bin/activate
Equivalently, on Windows, run:
env\Scripts\activate
Amazing! You are now fully set up.
Step #2: Install the Scraping Libraries
In an activated virtual environment, install HTTPX and BeautifulSoup with the following command:
pip install httpx beautifulsoup4
This will add both httpx and beautifulsoup4 to your project’s dependencies.
Import them into your scraper.py
script:
import httpx
from bs4 import BeautifulSoup
Great! You are ready to move on to the next step in your scraping workflow.
Step #3: Retrieve the HTML of the Target Page
In this example, the target page will be the “Quotes to Scrape” site:
Use HTTPX to retrieve the HTML of the homepage with the get()
method:
# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")
Behind the scenes, HTTPX will make an HTTP GET request to the server, which will respond with the HTML of the page. You can access the HTML content using the response.text
attribute:
html = response.text
print(html)
This will print the raw HTML content of the page:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<!-- omitted for brevity... -->
</body>
</html>
Terrific! Time to parse this content and extract the data you need.
Step #4: Parse the HTML
Feed the HTML content to the BeautifulSoup constructor to parse it:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
html.parser
is the standard Python HTML parser that will be used to parse the content.
The soup
variable now holds the parsed HTML and exposes the methods to extract the data you need.
HTTPX has done its job of retrieving the HTML, and now you are moving into the traditional data parsing phase with BeautifulSoup. For more information, refer to our tutorial on BeautifulSoup web scraping.
Step #5: Scrape Data From It
You can scrape quotes data from the page with the following lines of code:
# Where to store the scraped data
quotes = []
# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")
# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })
This snippet defines a list named quotes
to store the scraped data. It then selects all quote HTML elements and iterates over them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary within the quotes
list, organizing the data for further use or export.
Yes! Scraping logic implemented.
Step #6: Export the Scraped Data
Use the following logic to export the scraped data to a CSV file:
# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
# Write the header row
writer.writeheader()
# Write the scraped quotes data
writer.writerows(quotes)
This snippet opens a file named quotes.csv
in write mode, defines column headers (text
, author
, tags
), writes the headers to the file, and then writes each dictionary from the quotes
list to the CSV file. The csv.DictWriter
handles the formatting, making it easy to store structured data.
Do not forget to import csv
from the Python Standard Library:
import csv
Step #7: Put It All Together
Your final HTTPX web scraping script will contain:
import httpx
from bs4 import BeautifulSoup
import csv
# Make an HTTP GET request to the target page
response = httpx.get("http://quotes.toscrape.com")
# Access the HTML of the target page
html = response.text
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Where to store the scraped data
quotes = []
# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")
# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

# Specify the file name for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

    # Write the header row
    writer.writeheader()

    # Write the scraped quotes data
    writer.writerows(quotes)
Execute it with:
python scraper.py
Or, on Linux/macOS:
python3 scraper.py
A quotes.csv
file will appear in the root folder of your project. Open it and you will see:
Et voilà! You just learned how to perform web scraping with HTTPX and BeautifulSoup.
HTTPX Web Scraping Advanced Features and Techniques
Now that you know how to use HTTPX for web scraping in a basic scenario, you are ready to see it in action with more complex use cases.
In the examples below, the target site will be the HTTPBin.io /anything
endpoint. This is a special API that returns the IP address, headers, and other information sent by the caller.
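For example, a quick request to the endpoint makes it easy to inspect exactly what your scraper is sending (assuming the response follows the usual HTTPBin schema, with origin and headers fields):
import httpx

# Call the echo endpoint and inspect what the server received
response = httpx.get("https://httpbin.io/anything")
data = response.json()

# The public IP address seen by the server
print(data["origin"])
# The headers HTTPX sent with the request
print(data["headers"])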
Set Custom Headers
HTTPX allows you to specify custom headers thanks to the headers
argument:
import httpx
# Custom headers for the request
headers = {
"accept": "application/json",
"accept-language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3"
}
# Make a GET request with custom headers
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...
Set a Custom User Agent
User-Agent
is one of the most important HTTP headers for web scraping. By default, HTTPX uses the following User-Agent
:
python-httpx/<VERSION>
This value can easily reveal that your requests are automated, which could lead to blocking by the target site.
To avoid that, you can set a custom User-Agent
to mimic a real browser, like so:
import httpx
# Define a custom User-Agent
headers = {
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
}
# Make a GET request with the custom User-Agent
response = httpx.get("https://httpbin.io/anything", headers=headers)
# Handle the response...
Discover the best user agents for web scraping!
Set Cookies
Just like HTTP headers, you can set cookies in HTTPX using the cookies
argument:
import httpx
# Define cookies as a dictionary
cookies = {
"session_id": "3126hdsab161hdabg47adgb",
"user_preferences": "dark_mode=true"
}
# Make a GET request with custom cookies
response = httpx.get("https://httpbin.io/anything", cookies=cookies)
# Handle the response...
This gives you the ability to include session data required for your web scraping requests.
Proxy Integration
You can route your HTTPX requests through a proxy to protect your identity and avoid IP bans while performing web scraping. That is possible by using the proxy argument:
import httpx
# Replace with the URL of your proxy server
proxy = "<YOUR_PROXY_URL>"
# Make a GET request through a proxy server
response = httpx.get("https://httpbin.io/anything", proxy=proxy)
# Handle the response...
Find out more in our guide on how to use HTTPX with a proxy.
Error Handling
By default, HTTPX raises errors only for connection or network issues. To raise exceptions also for HTTP responses with 4xx
and 5xx
status codes, use the raise_for_status() method, as shown below:
import httpx
try:
    response = httpx.get("https://httpbin.io/anything")

    # Raise an exception for 4xx and 5xx responses
    response.raise_for_status()

    # Handle the response...
except httpx.HTTPStatusError as e:
    # Handle HTTP status errors
    print(f"HTTP error occurred: {e}")
except httpx.RequestError as e:
    # Handle connection or network errors
    print(f"Request error occurred: {e}")
Session Handling
When using the top-level API in HTTPX, a new connection is established for every single request. In other words, TCP connections are not reused. As the number of requests to a host increases, that approach becomes inefficient.
In contrast, using a httpx.Client
instance enables HTTP connection pooling. This means that multiple requests to the same host can reuse an existing TCP connection instead of creating a new one for each request.
The benefits of using a Client
over the top-level API are:
- Reduced latency across requests (avoiding repeated handshaking)
- Lower CPU usage and fewer round-trips
- Reduced network congestion
Additionally, Client
instances support session handling with features unavailable in the top-level API, including:
- Cookie persistence across requests.
- Applying configuration across all outgoing requests.
- Sending requests through HTTP proxies.
The recommended way to use a Client
in HTTPX is with a context manager (with
statement):
import httpx
with httpx.Client() as client:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)
Alternatively, you can manually manage the client and close the connection pool explicitly with client.close()
:
import httpx
client = httpx.Client()
try:
    # Make an HTTP request using the client
    response = client.get("https://httpbin.io/anything")

    # Extract the JSON response data and print it
    response_data = response.json()
    print(response_data)
except httpx.HTTPError as e:
    # Handle connection or HTTP status errors
    print(f"An error occurred: {e}")
finally:
    # Close the client connections and release resources
    client.close()
Note: If you are familiar with the requests
library, httpx.Client()
serves a similar purpose to requests.Session()
.
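For example, here is a minimal sketch (still targeting the HTTPBin.io endpoint used in this chapter) of how a Client can hold shared configuration, such as a base URL, default headers, and cookies, that every request automatically reuses:
import httpx

# Shared configuration applied to every request made with this client
with httpx.Client(
    base_url="https://httpbin.io",
    headers={"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
    cookies={"session_id": "3126hdsab161hdabg47adgb"},
) as client:
    # Both requests reuse the same connection pool, headers, and cookies
    first_response = client.get("/anything")
    second_response = client.get("/anything")

    # Verify that the default headers were attached to the request
    print(first_response.json()["headers"])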
Async API
By default, HTTPX exposes a standard synchronous API. At the same time, it also offers an asynchronous client for cases where it is needed. If you are working with asyncio
, using an async client is essential for sending outgoing HTTP requests efficiently.
Asynchronous programming is a concurrency model that is significantly more efficient than multi-threading. It offers notable performance improvements and supports long-lived network connections like WebSockets. That makes it a key factor in speeding up web scraping.
To make asynchronous requests in HTTPX, you’ll need an AsyncClient
. Initialize it and use it to make a GET request as shown below:
import httpx
import asyncio
async def fetch_data():
    async with httpx.AsyncClient() as client:
        # Make an async HTTP request
        response = await client.get("https://httpbin.io/anything")

        # Extract the JSON response data and print it
        response_data = response.json()
        print(response_data)

# Run the async function
asyncio.run(fetch_data())
The with
statement ensures the client is automatically closed when the block ends. Alternatively, if you manage the client manually, you can close it explicitly with await client.aclose().
Remember, all HTTPX request methods (get()
, post()
, etc.) are asynchronous when using an AsyncClient
. Therefore, you must add await
before calling them to get a response.
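The real payoff for scraping comes from sending several requests concurrently. Below is a minimal sketch of that pattern (again using the HTTPBin.io endpoint) based on asyncio.gather():
import httpx
import asyncio

async def fetch_all(urls):
    async with httpx.AsyncClient() as client:
        # Schedule all requests concurrently and wait for every response
        responses = await asyncio.gather(*[client.get(url) for url in urls])
        return [response.json() for response in responses]

# Three concurrent requests to the same endpoint
urls = ["https://httpbin.io/anything"] * 3
results = asyncio.run(fetch_all(urls))
print(len(results))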
Retry Failed Requests
Network instability during web scraping can lead to connection failures or timeouts. HTTPX simplifies handling such issues via its HTTPTransport
interface. This mechanism retries requests when an httpx.ConnectError
or httpx.ConnectTimeout
occurs.
The following example demonstrates how to configure a transport to retry requests up to 3 times:
import httpx
# Configure transport with retry capability on connection errors or timeouts
transport = httpx.HTTPTransport(retries=3)
# Use the transport with an HTTPX client
with httpx.Client(transport=transport) as client:
    # Make a GET request
    response = client.get("https://httpbin.io/anything")

    # Handle the response...
Note that only connection-related errors trigger a retry. To handle read/write errors or specific HTTP status codes, you need to implement custom retry logic with libraries like tenacity
.
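As an illustration, here is a minimal sketch (assuming tenacity is installed with pip install tenacity) that retries requests failing with a 4xx or 5xx status code:
import httpx
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Retry up to 3 times, with exponential backoff, when the server returns an error status
@retry(
    retry=retry_if_exception_type(httpx.HTTPStatusError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
)
def fetch(url):
    response = httpx.get(url)
    # Turn 4xx and 5xx responses into exceptions so the decorator can retry them
    response.raise_for_status()
    return response

response = fetch("https://httpbin.io/anything")
# Handle the response...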
HTTPX vs Requests for Web Scraping
Here is a summary table to compare HTTPX and Requests for web scraping:
Feature | HTTPX | Requests |
---|---|---|
GitHub stars | 8k | 52.4k |
Async support | ✔️ | ❌ |
Connection pooling | ✔️ | ✔️ |
HTTP/2 support | ✔️ | ❌ |
User-agent customization | ✔️ | ✔️ |
Proxy support | ✔️ | ✔️ |
Cookie handling | ✔️ | ✔️ |
Timeouts | Customizable for connection and read | Customizable for connection and read |
Retry mechanism | Available via transports | Available via HTTPAdapter |
Performance | High | Medium |
Community support and popularity | Growing | Large |
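As the table shows, HTTP/2 support is one of HTTPX’s advantages over Requests. If you want to try it, here is a minimal sketch; note that HTTP/2 support requires installing the optional extra with pip install httpx[http2]:
import httpx

# Enable HTTP/2 (requires the httpx[http2] extra)
with httpx.Client(http2=True) as client:
    response = client.get("https://httpbin.io/anything")

    # Check which protocol was actually negotiated (e.g. "HTTP/2")
    print(response.http_version)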
Conclusion
In this article, you explored the httpx
library for web scraping. You gained an understanding of what it is, what it offers, and its advantages. HTTPX is a fast and reliable option for making HTTP requests when collecting online data.
The problem is that automated HTTP requests reveal your public IP address, which can expose your identity and location. That compromises your privacy. To enhance your security and privacy, one of the most effective methods is to use a proxy server to hide your IP address.
Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types:
- Datacenter proxies – Over 770,000 datacenter IPs.
- Residential proxies – Over 72M residential IPs in more than 195 countries.
- ISP proxies – Over 700,000 ISP IPs.
- Mobile proxies – Over 7M mobile IPs.
Create a free Bright Data account today to test our scraping solutions and proxies!
No credit card required