In this guide, you will explore:
- What AIOHTTP is and the key features it provides
- A step-by-step section on using AIOHTTP for web scraping
- Advanced techniques for web scraping with AIOHTTP
- An AIOHTTP vs Requests comparison for handling automated requests
Let’s dive in!
What Is AIOHTTP?
AIOHTTP is an asynchronous client/server HTTP framework built on top of Python’s asyncio. Unlike traditional HTTP clients, AIOHTTP uses client sessions to maintain connections across multiple requests. That makes it an efficient choice for high-concurrency, session-based tasks.
⚙️ Features
- Supports both the client and server sides of the HTTP protocol.
- Provides native support for WebSockets (both client and server).
- Offers middleware and pluggable routing for web servers.
- Efficiently handles streaming large data (see the sketch right after this list).
- Includes client session persistence, enabling connection reuse and reducing overhead for multiple requests.
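As an example of the streaming support mentioned in the list, a response body can be consumed in chunks instead of being loaded into memory all at once. The snippet below is a minimal sketch; the target URL and the 1 KB chunk size are arbitrary choices for illustration:
import asyncio
import aiohttp

async def stream_response():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.io/anything") as response:
            # Read the body in 1 KB chunks instead of loading it all at once
            async for chunk in response.content.iter_chunked(1024):
                print(f"Received {len(chunk)} bytes")

asyncio.run(stream_response())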
Scraping with AIOHTTP: Step-By-Step Tutorial
In the context of web scraping, AIOHTTP is just an HTTP client to fetch the raw HTML content of a page. To parse and extract data from that HTML, you then need an HTML parser like BeautifulSoup.
Follow this section to learn how to use AIOHTTP for web scraping with BeautifulSoup!
Warning: Although AIOHTTP is used primarily in the initial stages of the process, we will guide you through the entire scraping workflow. If you are interested in more advanced AIOHTTP web scraping techniques, feel free to skip ahead to the next chapter after Step 3.
Step #1: Set Up Your Scraping Project
Ensure that Python 3+ is installed on your machine. If not, download it from the official site and follow the installation instructions.
Next, create a directory for your AIOHTTP scraping project using this command:
mkdir aiohttp-scraper
Navigate into that directory and set up a virtual environment:
cd aiohttp-scraper
python -m venv env
Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both valid choices.
Now, create a scraper.py file inside the project folder. It will be empty at first, but you will soon add the scraping logic to it.
In your IDE’s terminal, activate the virtual environment. On Linux or macOS, use:
source ./env/bin/activate
Equivalently, on Windows, run:
env/Scripts/activate
Great! You are all set up and ready to go.
Step #2: Set Up the Scraping Libraries
With the virtual environment activated, install AIOHTTP and BeautifulSoup using the command below:
pip install aiohttp beautifulsoup4
This will add both aiohttp and beautifulsoup4 to your project’s dependencies.
Import them into your scraper.py script:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
Note that aiohttp requires asyncio to work.
Now, add the following async function workflow to your scraper.py file:
async def scrape_quotes():
    # Scraping logic...
    pass

# Run the asynchronous function
asyncio.run(scrape_quotes())
scrape_quotes() defines an asynchronous function where your scraping logic will run concurrently without blocking. Finally, asyncio.run(scrape_quotes()) starts and runs the asynchronous function.
Awesome! You can proceed to the next step in your scraping workflow.
Step #3: Get the HTML of the Target Page
In this example, you will see how to scrape data from the “Quotes to Scrape” site:
With an HTTP client like Requests, you would simply make a GET request and directly receive the HTML content of the page. However, AIOHTTP follows a different request lifecycle.
The primary component of AIOHTTP is the ClientSession, which manages a pool of connections and supports Keep-Alive by default. Instead of opening a new connection for every request, it reuses connections, improving performance.
When making a request, the process typically involves three steps:
- Opening a session through ClientSession().
- Sending the GET request asynchronously with session.get().
- Accessing the response data with methods like await response.text().
This design lets the event loop work on other operations between these with contexts without blocking, making it ideal for high-concurrency tasks.
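Thanks to that non-blocking design, a single session can also serve several requests at the same time. The snippet below is a minimal sketch, separate from the tutorial flow, that fetches a few pages of the demo site concurrently with asyncio.gather (the /page/N/ URLs follow the site’s pagination):
import asyncio
import aiohttp

async def fetch_page(session, url):
    async with session.get(url) as response:
        return await response.text()

async def fetch_pages_concurrently():
    urls = [f"http://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]

    async with aiohttp.ClientSession() as session:
        # Schedule all requests at once; the event loop interleaves them
        pages = await asyncio.gather(*(fetch_page(session, url) for url in urls))
        print([len(page) for page in pages])

asyncio.run(fetch_pages_concurrently())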
Given that, you can use AIOHTTP to retrieve the HTML of the homepage with this logic:
async with aiohttp.ClientSession() as session:
    async with session.get("http://quotes.toscrape.com") as response:
        # Access the HTML of the target page
        html = await response.text()
Behind the scenes, AIOHTTP sends the request to the server and waits for the response, which contains the HTML of the page. Once the response is received, await response.text() extracts the HTML content as a string.
Print the html variable and you will see:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Quotes to Scrape</title>
<link rel="stylesheet" href="/static/bootstrap.min.css">
<link rel="stylesheet" href="/static/main.css">
</head>
<body>
<!-- omitted for brevity... -->
</body>
</html>
Way to go! You successfully retrieved the HTML content of the target page. Time to parse this content and extract the data you need.
Step #4: Parse the HTML
Pass the HTML content into the BeautifulSoup constructor to parse it:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
html.parser is the default Python HTML parser used to process the content.
The soup object contains the parsed HTML and provides methods to extract the data you need.
AIOHTTP has handled retrieving the HTML, and now you are transitioning into the typical data parsing phase with BeautifulSoup. For more details, read our tutorial on BeautifulSoup web scraping.
Step #5: Write the Data Extraction Logic
You can scrape the quotes data from the page using the following code:
# Where to store the scraped data
quotes = []
# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")
# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })
This snippet initializes a list named quotes to hold the scraped data. It then identifies all quote HTML elements and loops through them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary in the quotes list, organizing the data for later use or export.
Super! Scraping logic is now implemented.
Step #6: Export the Scraped Data
Use these lines of code to export the scraped data into a CSV file:
# Open the file for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

    # Write the header row
    writer.writeheader()

    # Write the scraped quotes data
    writer.writerows(quotes)
The above snippet opens a file named quotes.csv in write mode. Then it sets up the column headers (text, author, tags), writes the header row, and writes each dictionary from the quotes list to the CSV file.
csv.DictWriter simplifies data formatting, making it easier to store structured data. To make it work, remember to import csv from the Python Standard Library:
import csv
Step #7: Put It All Together
This is what your final AIOHTTP web scraping script should look like:
import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv

# Define an asynchronous function to make the HTTP GET request
async def scrape_quotes():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://quotes.toscrape.com") as response:
            # Access the HTML of the target page
            html = await response.text()

            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")

            # List to store the scraped data
            quotes = []

            # Extract all quotes from the page
            quote_elements = soup.find_all("div", class_="quote")

            # Loop through quotes and extract text, author, and tags
            for quote_element in quote_elements:
                text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
                author = quote_element.find("small", class_="author").get_text()
                tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

                # Store the scraped data
                quotes.append({
                    "text": text,
                    "author": author,
                    "tags": tags
                })

    # Open the file for export
    with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

        # Write the header row
        writer.writeheader()

        # Write the scraped quotes data
        writer.writerows(quotes)

# Run the asynchronous function
asyncio.run(scrape_quotes())
You can run it with:
python scraper.py
Or, on Linux/macOS:
python3 scraper.py
A quotes.csv file will appear in the root folder of your project. Open it and you will see:
Et voilà! You just learned how to perform web scraping with AIOHTTP and BeautifulSoup.
AIOHTTP for Web Scraping: Advanced Features and Techniques
Now that you understand how to use AIOHTTP for basic web scraping, it is time to see more advanced scenarios.
In the following examples, the target site will be the HTTPBin.io /anything endpoint. That is a handy API that returns the IP address, headers, and other data sent by the requester.
Get ready to master AIOHTTP for web scraping!
Set Custom Headers
You can specify custom headers in an AIOHTTP request via the headers argument:
import aiohttp
import asyncio
async def fetch_with_custom_headers():
    # Custom headers for the request
    headers = {
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3"
    }

    async with aiohttp.ClientSession() as session:
        # Make a GET request with custom headers
        async with session.get("https://httpbin.io/anything", headers=headers) as response:
            data = await response.json()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_headers())
This way, AIOHTTP will make a GET HTTP request with the Accept and Accept-Language headers set.
Set a Custom User Agent
User-Agent is one of the most critical HTTP headers for web scraping. By default, AIOHTTP uses this User-Agent:
Python/<PYTHON_VERSION> aiohttp/<AIOHTTP_VERSION>
The default value above can easily expose your requests as coming from an automated script. That will increase the risk of being blocked by the target site.
To reduce the chances of getting detected, you can set a custom real-world User-Agent just as you would any other header:
import aiohttp
import asyncio
async def fetch_with_custom_user_agent():
    # Define a Chrome-like custom User-Agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
    }

    async with aiohttp.ClientSession(headers=headers) as session:
        # Make a GET request with the custom User-Agent
        async with session.get("https://httpbin.io/anything") as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_user_agent())
Discover the best user agents for web scraping!
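A common next step is rotating the User-Agent so that consecutive requests do not all share the same fingerprint. The following is a minimal sketch, assuming a small hand-picked pool of real-world User-Agent strings (the two values below are just examples):
import asyncio
import random
import aiohttp

# Example pool of real-world User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

async def fetch_with_rotating_user_agent():
    async with aiohttp.ClientSession() as session:
        for _ in range(3):
            # Pick a different User-Agent for each request
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            async with session.get("https://httpbin.io/anything", headers=headers) as response:
                data = await response.json()
                # Handle the response...
                print(data)

# Run the event loop
asyncio.run(fetch_with_rotating_user_agent())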
Set Cookies
Just like HTTP headers, you can set custom cookies using the cookies argument of ClientSession():
import aiohttp
import asyncio
async def fetch_with_custom_cookies():
    # Define cookies as a dictionary
    cookies = {
        "session_id": "9412d7hdsa16hbda4347dagb",
        "user_preferences": "dark_mode=false"
    }

    async with aiohttp.ClientSession(cookies=cookies) as session:
        # Make a GET request with custom cookies
        async with session.get("https://httpbin.io/anything") as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_cookies())
Cookies help you include session data required in your web scraping requests.
Note that cookies set in ClientSession are shared across all requests made with that session. To access session cookies, refer to ClientSession.cookie_jar.
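For example, after a request you can iterate over the jar to inspect the cookies the session is currently storing. Here is a minimal sketch; it assumes the target site sets at least one cookie in its response:
import asyncio
import aiohttp

async def inspect_session_cookies():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://quotes.toscrape.com") as response:
            await response.text()

        # The cookie jar yields Morsel objects with key/value attributes
        for cookie in session.cookie_jar:
            print(f"{cookie.key}={cookie.value}")

# Run the event loop
asyncio.run(inspect_session_cookies())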
Proxy Integration
In AIOHTTP, you can route your requests through a proxy server to reduce the risk of IP bans. Do that by using the proxy argument in the HTTP method function on session:
import aiohttp
import asyncio
async def fetch_through_proxy():
    # Replace with the URL of your proxy server
    proxy_url = "<YOUR_PROXY_URL>"

    async with aiohttp.ClientSession() as session:
        # Make a GET request through the proxy server
        async with session.get("https://httpbin.io/anything", proxy=proxy_url) as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_through_proxy())
Find out how to perform proxy authentication and rotation in our guide on how to use a proxy in AIOHTTP.
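As a quick preview, if your proxy requires username/password authentication, one option is the proxy_auth argument together with aiohttp.BasicAuth. The sketch below uses placeholder credentials:
import aiohttp
import asyncio

async def fetch_through_authenticated_proxy():
    # Replace with the URL and credentials of your proxy server
    proxy_url = "<YOUR_PROXY_URL>"
    proxy_auth = aiohttp.BasicAuth("<YOUR_PROXY_USERNAME>", "<YOUR_PROXY_PASSWORD>")

    async with aiohttp.ClientSession() as session:
        # Make an authenticated GET request through the proxy server
        async with session.get("https://httpbin.io/anything", proxy=proxy_url, proxy_auth=proxy_auth) as response:
            data = await response.text()
            print(data)

# Run the event loop
asyncio.run(fetch_through_authenticated_proxy())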
Error Handling
By default, AIOHTTP raises errors only for connection or network issues. To raise exceptions for HTTP responses when receiving 4xx and 5xx status codes, you can use any of the following approaches:
- Set raise_for_status=True when creating the ClientSession: Automatically raise exceptions for all requests made through the session if the response status is 4xx or 5xx.
- Pass raise_for_status=True directly to request methods: Enable error raising for individual request methods (like session.get() or session.post()) without affecting others.
- Call response.raise_for_status() manually: Give full control over when to raise exceptions, allowing you to decide on a per-request basis.
Option #1 example:
import aiohttp
import asyncio
async def fetch_with_session_error_handling():
    async with aiohttp.ClientSession(raise_for_status=True) as session:
        try:
            async with session.get("https://httpbin.io/anything") as response:
                # No need to call response.raise_for_status(), as it is automatic
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_session_error_handling())
When raise_for_status=True is set at the session level, all requests made through that session will raise an aiohttp.ClientResponseError for 4xx or 5xx responses.
Option #2 example:
import aiohttp
import asyncio
async def fetch_with_raise_for_status():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get("https://httpbin.io/anything", raise_for_status=True) as response:
                # No need to manually call response.raise_for_status(), it is automatic
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_raise_for_status())
In this case, the raise_for_status=True argument is passed directly to the session.get() call. This ensures that an exception is raised automatically for any 4xx or 5xx status code.
Option #3 example:
import aiohttp
import asyncio
async def fetch_with_manual_error_handling():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get("https://httpbin.io/anything") as response:
                response.raise_for_status()  # Manually raises error for 4xx/5xx
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_manual_error_handling())
If you prefer more control over individual requests, you can call response.raise_for_status() manually after making a request. This approach allows you to decide exactly when to handle errors.
Retry Failed Requests
AIOHTTP does not provide built-in support for retrying requests automatically. To implement that, you must use custom logic or a third-party library like aiohttp-retry. This enables you to configure retry logic for failed requests, helping to handle transient network issues, timeouts, or rate limits.
Install aiohttp-retry with:
pip install aiohttp-retry
Then, you can use it as follows:
import asyncio
from aiohttp_retry import RetryClient, ExponentialRetry

async def main():
    retry_options = ExponentialRetry(attempts=1)
    retry_client = RetryClient(raise_for_status=False, retry_options=retry_options)

    async with retry_client.get("https://httpbin.io/anything") as response:
        print(response.status)

    await retry_client.close()

# Run the event loop
asyncio.run(main())
This configures retry behavior with an exponential backoff strategy. Learn more in the official docs.
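If you prefer not to add a dependency, the custom-logic route mentioned earlier can be a simple loop with exponential backoff. Below is a minimal hand-rolled sketch, not tied to any library beyond AIOHTTP itself; the attempt count, delays, and target URL are arbitrary choices for illustration:
import asyncio
import aiohttp

async def fetch_with_retries(url, max_attempts=3):
    async with aiohttp.ClientSession() as session:
        for attempt in range(1, max_attempts + 1):
            try:
                # raise_for_status=True turns 4xx/5xx responses into exceptions
                async with session.get(url, raise_for_status=True) as response:
                    return await response.text()
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                if attempt == max_attempts:
                    raise
                # Exponential backoff: wait 1s, 2s, 4s, ... between attempts
                delay = 2 ** (attempt - 1)
                print(f"Attempt {attempt} failed ({e}), retrying in {delay}s...")
                await asyncio.sleep(delay)

# Run the event loop
html = asyncio.run(fetch_with_retries("https://httpbin.io/anything"))
print(len(html))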
AIOHTTP vs Requests for Web Scraping
Below is a summary table to compare AIOHTTP and Requests for web scraping:
Feature | AIOHTTP | Requests |
---|---|---|
GitHub stars | 15.3k | 52.4k |
Client support | ✔️ | ✔️ |
Sync support | ❌ | ✔️ |
Async support | ✔️ | ❌ |
Server support | ✔️ | ❌ |
Connection pooling | ✔️ | ✔️ |
HTTP/2 support | ❌ | ❌ |
User-agent customization | ✔️ | ✔️ |
Proxy support | ✔️ | ✔️ |
Cookie handling | ✔️ | ✔️ |
Retry mechanism | Available only via a third-party library | Available via HTTPAdapters |
Performance | High | Medium |
Community support and popularity | Medium | Large |
For a complete comparison, check out our blog post on Requests vs HTTPX vs AIOHTTP.
Learn how to scrape websites with HTTPX.
Conclusion
In this article, you learned how to use the aiohttp library for web scraping. You explored what it is, the features it offers, and the benefits it provides. AIOHTTP stands out as a fast and reliable choice for making HTTP requests when gathering online data.
However, automated HTTP requests expose your public IP address. That can reveal your identity and location, putting your privacy at risk. To safeguard your security and privacy, one of the most effective strategies is to use a proxy server to hide your IP address.
Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types:
- Datacenter proxies – Over 770,000 datacenter IPs.
- Residential proxies – Over 72M residential IPs in more than 195 countries.
- ISP proxies – Over 700,000 ISP IPs.
- Mobile proxies – Over 7M mobile IPs.
Create a free Bright Data account today to test our proxies and scraping solutions!