In this guide, you will explore:
- What AIOHTTP is and the key features it provides
- A step-by-step section on using AIOHTTP for web scraping
- Advanced techniques for web scraping with AIOHTTP
- An AIOHTTP vs Requests comparison for handling automated requests
Let’s dive in!
What Is AIOHTTP?
AIOHTTP is an asynchronous client/server HTTP framework built on top of Python’s asyncio. Unlike traditional HTTP clients, AIOHTTP uses client sessions to maintain connections across multiple requests. That makes it an efficient choice for high-concurrency, session-based tasks.
⚙️ Features
- Supports both the client and server sides of the HTTP protocol.
- Provides native support for WebSockets (both client and server).
- Offers middleware and pluggable routing for web servers.
- Efficiently handles streaming large data.
- Includes client session persistence, enabling connection reuse and reducing overhead for multiple requests.
Scraping with AIOHTTP: Step-By-Step Tutorial
In the context of web scraping, AIOHTTP is just an HTTP client to fetch the raw HTML content of a page. To parse and extract data from that HTML, you then need an HTML parser like BeautifulSoup.
Follow this section to learn how to use AIOHTTP for web scraping with BeautifulSoup!
Warning: Although AIOHTTP is used primarily in the initial stages of the process, we will guide you through the entire scraping workflow. If you are interested in more advanced AIOHTTP web scraping techniques, feel free to skip ahead to the next chapter after Step 3.
Step #1: Set Up Your Scraping Project
Ensure that Python 3+ is installed on your machine. If not, download it from the official site and follow the installation instructions.
Next, create a directory for your AIOHTTP scraping project using this command:
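For example, assuming you call the project folder aiohttp-scraper (the name is just a placeholder):

```bash
mkdir aiohttp-scraper
```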
Navigate into that directory and set up a virtual environment:
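Assuming the same aiohttp-scraper folder and a virtual environment named venv:

```bash
cd aiohttp-scraper
python -m venv venv
```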
Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both valid choices.
Now, create a scraper.py file inside the project folder. It will be empty at first, but you will soon add the scraping logic to it.
In your IDE’s terminal, activate the virtual environment. On Linux or macOS, use:
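Assuming the virtual environment folder is named venv as above:

```bash
source venv/bin/activate
```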
Equivalently, on Windows, run:
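```powershell
venv\Scripts\activate
```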
Great! You are all set up and ready to go.
Step #2: Set Up the Scraping Libraries
With the virtual environment activated, install AIOHTTP and BeautifulSoup using the command below:
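The two packages match the libraries named above:

```bash
pip install aiohttp beautifulsoup4
```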
This will add both aiohttp and beautifulsoup4 to your project’s dependencies.
Import them into your scraper.py script:
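A minimal set of imports covering the steps that follow (asyncio is discussed right below):

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup
```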
Note that aiohttp requires the asyncio module to work.
Now, add the following async function workflow to your scraper.py file:
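A minimal skeleton of that workflow, matching the description below, might look like this:

```python
async def scrape_quotes():
    # The scraping logic will go here...
    pass

# Start and run the asynchronous function
asyncio.run(scrape_quotes())
```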
scrape_quotes() defines an asynchronous function where your scraping logic will run concurrently without blocking. Finally, asyncio.run(scrape_quotes()) starts and runs the asynchronous function.
Awesome! You can proceed to the next step in your scraping workflow.
Step #3: Get the HTML of the Target Page
In this example, you will see how to scrape data from the “Quotes to Scrape” site (http://quotes.toscrape.com):
With a library like Requests, you would simply make a GET request and directly receive the HTML content of the page. AIOHTTP, however, follows a different request lifecycle.
AIOHTTP’s primary component is the ClientSession, which manages a pool of connections and supports Keep-Alive by default. Instead of opening a new connection for every request, it reuses connections, improving performance.
When making a request, the process typically involves three steps:
- Opening a session through ClientSession().
- Sending the GET request asynchronously with session.get().
- Accessing the response data with methods like await response.text().
This design allows the event loop to use different with contexts between operations without blocking, making it ideal for high-concurrency tasks.
Given that, you can use AIOHTTP to retrieve the HTML of the homepage with this logic:
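A sketch of that logic, assuming the target is the public “Quotes to Scrape” homepage at http://quotes.toscrape.com:

```python
async def scrape_quotes():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://quotes.toscrape.com") as response:
            # Extract the HTML content of the page as a string
            html = await response.text()
```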
Behind the scenes, AIOHTTP sends the request to the server and waits for the response, which contains the HTML of the page. Once the response is received, await response.text() extracts the HTML content as a string.
Print the html variable and you will see:
Way to go! You successfully retrieved the HTML content of the target page. Time to parse this content and extract the data you need.
Step #4: Parse the HTML
Pass the HTML content into the BeautifulSoup constructor to parse it:
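Continuing the sketch from the previous step:

```python
soup = BeautifulSoup(html, "html.parser")
```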
html.parser is the default Python HTML parser used to process the content.
The soup object contains the parsed HTML and provides methods to extract the data you need.
AIOHTTP has handled retrieving the HTML, and now you are transitioning into the typical data parsing phase with BeautifulSoup. For more details, read our tutorial on BeautifulSoup web scraping.
Step #5: Write the Data Extraction Logic
You can scrape the quotes data from the page using the following code:
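A possible implementation, assuming the site’s well-known markup (each quote in a div.quote element, with the text in span.text, the author in small.author, and tags in a.tag links):

```python
# Where to store the scraped data
quotes = []

# Select all quote HTML elements on the page
quote_elements = soup.find_all("div", class_="quote")

# Extract text, author, and tags from each quote element
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text()
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })
```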
This snippet initializes a list named quotes to hold the scraped data. It then identifies all quote HTML elements and loops through them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary in the quotes list, organizing the data for later use or export.
Super! Your scraping logic is now implemented.
Step #6: Export the Scraped Data
Use these lines of code to export the scraped data into a CSV file:
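A sketch matching the description below, writing to a quotes.csv file in the project folder:

```python
# Export the scraped data to a CSV file
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    # Write the header row
    writer.writeheader()
    # Write each quote dictionary as a row
    writer.writerows(quotes)
```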
The above snippet opens a file named quotes.csv in write mode. Then it sets up the column headers (text, author, tags), writes the header row, and writes each dictionary from the quotes list to the CSV file.
csv.DictWriter simplifies data formatting, making it easier to store structured data. To make it work, remember to import csv from the Python Standard Library:
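```python
import csv
```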
Step #7: Put It All Together
This is what your final AIOHTTP web scraping script should look like:
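Putting the snippets from the previous steps together (with the same assumed URL and selectors), a complete version might look like this:

```python
import asyncio
import csv

import aiohttp
from bs4 import BeautifulSoup


async def scrape_quotes():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://quotes.toscrape.com") as response:
            # Get the HTML content of the page as a string
            html = await response.text()

            # Parse the HTML
            soup = BeautifulSoup(html, "html.parser")

            # Where to store the scraped data
            quotes = []

            # Extract text, author, and tags from each quote element
            for quote_element in soup.find_all("div", class_="quote"):
                text = quote_element.find("span", class_="text").get_text()
                author = quote_element.find("small", class_="author").get_text()
                tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

                quotes.append({
                    "text": text,
                    "author": author,
                    "tags": tags
                })

            # Export the scraped data to a CSV file
            with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
                writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
                writer.writeheader()
                writer.writerows(quotes)


# Start and run the asynchronous scraping function
asyncio.run(scrape_quotes())
```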
You can run it with:
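```powershell
python scraper.py
```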
Or, on Linux/macOS:
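```bash
python3 scraper.py
```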
A quotes.csv file will appear in the root folder of your project. Open it and you will see:
Et voilà! You just learned how to perform web scraping with AIOHTTP and BeautifulSoup.
AIOHTTP for Web Scraping: Advanced Features and Techniques
Now that you understand how to use AIOHTTP for basic web scraping, it is time to see more advanced scenarios.
In the following examples, the target site will be the HTTPBin.io /anything endpoint. That is a handy API that returns the IP address, headers, and other data sent by the requester.
Get ready to master AIOHTTP for web scraping!
Set Custom Headers
You can specify custom headers in an AIOHTTP request with the headers argument:
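A minimal sketch against the HTTPBin.io endpoint mentioned above (the header values are just examples):

```python
import asyncio

import aiohttp


async def fetch_with_headers():
    headers = {
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9"
    }
    # Headers set on the session apply to every request made through it
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get("https://httpbin.io/anything") as response:
            print(await response.json())


asyncio.run(fetch_with_headers())
```

Note that headers can also be passed per request via the headers argument of session.get().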
This way, AIOHTTP will make a GET HTTP request with the Accept and Accept-Language headers set.
Set a Custom User Agent
User-Agent is one of the most critical HTTP headers for web scraping. By default, AIOHTTP uses this User-Agent:
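The exact string depends on your Python and aiohttp versions, but it follows this pattern (for example, Python/3.12 aiohttp/3.10.5):

```
Python/<PYTHON_VERSION> aiohttp/<AIOHTTP_VERSION>
```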
The default value above can easily expose your requests as coming from an automated script. That will increase the risk of being blocked by the target site.
To reduce the chances of getting detected, you can set a custom, real-world User-Agent as before:
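For instance (the User-Agent string below is just an example of a real-world browser value; use an up-to-date one):

```python
import asyncio

import aiohttp


async def fetch_with_custom_user_agent():
    headers = {
        # Example of a real-world browser User-Agent
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"
    }
    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.get("https://httpbin.io/anything") as response:
            print(await response.json())


asyncio.run(fetch_with_custom_user_agent())
```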
Discover the best user agents for web scraping!
Set Cookies
Just like HTTP headers, you can set custom cookies using the cookies argument in ClientSession():
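A sketch with hypothetical cookie names and values:

```python
import asyncio

import aiohttp


async def fetch_with_cookies():
    # Hypothetical cookie values required by the target site
    cookies = {"session_id": "your-session-id", "user_pref": "dark_mode"}
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async with session.get("https://httpbin.io/anything") as response:
            print(await response.json())


asyncio.run(fetch_with_cookies())
```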
Cookies help you include session data required in your web scraping requests.
Note that cookies set in ClientSession are shared across all requests made with that session. To access session cookies, refer to ClientSession.cookie_jar.
Proxy Integration
In AIOHTTP, you can route your requests through a proxy server to reduce the risk of IP bans. Do that by passing the proxy argument to the HTTP request method called on the session:
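A minimal sketch with a placeholder proxy URL (replace it with the address of your proxy server, including credentials if required):

```python
import asyncio

import aiohttp


async def fetch_via_proxy():
    # Placeholder proxy URL
    proxy_url = "http://<PROXY_HOST>:<PROXY_PORT>"
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.io/anything", proxy=proxy_url) as response:
            print(await response.json())


asyncio.run(fetch_via_proxy())
```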
Find out how to perform proxy authentication and rotation in our guide on how to use a proxy in AIOHTTP.
Error Handling
By default, AIOHTTP raises errors only for connection or network issues. To raise exceptions for HTTP responses when receiving 4xx and 5xx status codes, you can use any of the following approaches:
- Set raise_for_status=True when creating the ClientSession: Automatically raise exceptions for all requests made through the session if the response status is 4xx or 5xx.
- Pass raise_for_status=True directly to request methods: Enable error raising for individual request methods (like session.get() or session.post()) without affecting others.
- Call response.raise_for_status() manually: Get full control over when to raise exceptions, allowing you to decide on a per-request basis.
Option #1 example:
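A minimal sketch, again using the HTTPBin.io endpoint:

```python
import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession(raise_for_status=True) as session:
        try:
            async with session.get("https://httpbin.io/anything") as response:
                # Reaching this point means the status code was not 4xx/5xx
                print(await response.text())
        except aiohttp.ClientResponseError as e:
            print(f"Request failed with status {e.status}: {e.message}")


asyncio.run(main())
```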
When raise_for_status=True is set at the session level, all requests made through that session will raise an aiohttp.ClientResponseError for 4xx or 5xx responses.
Option #2 example:
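A minimal sketch:

```python
import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        try:
            # raise_for_status applies to this request only
            async with session.get("https://httpbin.io/anything", raise_for_status=True) as response:
                print(await response.text())
        except aiohttp.ClientResponseError as e:
            print(f"Request failed with status {e.status}: {e.message}")


asyncio.run(main())
```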
In this case, the raise_for_status=True argument is passed directly to the session.get() call. This ensures that an exception is raised automatically for any 4xx or 5xx status codes.
Option #3 example:
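A minimal sketch:

```python
import asyncio

import aiohttp


async def main():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://httpbin.io/anything") as response:
            try:
                # Manually raise an exception for 4xx/5xx responses
                response.raise_for_status()
                print(await response.text())
            except aiohttp.ClientResponseError as e:
                print(f"Request failed with status {e.status}: {e.message}")


asyncio.run(main())
```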
If you prefer more control over individual requests, you can call response.raise_for_status() manually after making a request. This approach allows you to decide exactly when to handle errors.
Retry Failed Requests
AIOHTTP does not provide built-in support for retrying requests automatically. To implement that, you must use custom logic or a third-party library like aiohttp-retry. This enables you to configure retry logic for failed requests, helping to handle transient network issues, timeouts, or rate limits.
Install aiohttp-retry with:
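```bash
pip install aiohttp-retry
```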
Then, you can use it as follows:
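A sketch based on the library’s RetryClient and ExponentialRetry helpers (check the aiohttp-retry docs for the options that best fit your use case):

```python
import asyncio

from aiohttp_retry import ExponentialRetry, RetryClient


async def main():
    # Retry failed requests up to 3 times, with exponential backoff between attempts
    retry_options = ExponentialRetry(attempts=3)
    async with RetryClient(retry_options=retry_options) as retry_client:
        async with retry_client.get("https://httpbin.io/anything") as response:
            print(await response.text())


asyncio.run(main())
```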
This configures retry behavior with an exponential backoff strategy. Learn more in the official docs.
AIOHTTP vs Requests for Web Scraping
Below is a summary table to compare AIOHTTP and Requests for web scraping:
| Feature | AIOHTTP | Requests |
| --- | --- | --- |
| GitHub stars | 15.3k | 52.4k |
| Client support | ✔️ | ✔️ |
| Sync support | ❌ | ✔️ |
| Async support | ✔️ | ❌ |
| Server support | ✔️ | ❌ |
| Connection pooling | ✔️ | ✔️ |
| HTTP/2 support | ❌ | ❌ |
| User-agent customization | ✔️ | ✔️ |
| Proxy support | ✔️ | ✔️ |
| Cookie handling | ✔️ | ✔️ |
| Retry mechanism | Available only via a third-party library | Available via HTTPAdapters |
| Performance | High | Medium |
| Community support and popularity | Medium | Large |
For a complete comparison, check out our blog post on Requests vs HTTPX vs AIOHTTP.
Learn how to scrape websites with HTTPX.
Conclusion
In this article, you learned how to use the aiohttp library for web scraping. You explored what it is, the features it offers, and the benefits it provides. AIOHTTP stands out as a fast and reliable choice for making HTTP requests when gathering online data.
However, automated HTTP requests expose your public IP address. That can reveal your identity and location, putting your privacy at risk. To safeguard your security and privacy, one of the most effective strategies is to use a proxy server to hide your IP address.
Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offer includes a wide range of proxy types:
- Datacenter proxies – Over 770,000 datacenter IPs.
- Residential proxies – Over 72M residential IPs in more than 195 countries.
- ISP proxies – Over 700,000 ISP IPs.
- Mobile proxies – Over 7M mobile IPs.
Create a free Bright Data account today to test our proxies and scraping solutions!
No credit card required