Web Scraping With AIOHTTP in Python

Discover AIOHTTP for web scraping! Learn setup, features, and advanced techniques, plus a comparison with Requests for efficient data extraction.

In this guide, you will explore:

  • What AIOHTTP is and the key features it provides
  • A step-by-step section on using AIOHTTP for web scraping
  • Advanced techniques for web scraping with AIOHTTP
  • An AIOHTTP vs Requests comparison for handling automated requests

Let’s dive in!

What Is AIOHTTP?

AIOHTTP is an asynchronous client/server HTTP framework built on top of Python’s asyncio. Unlike traditional HTTP clients, AIOHTTP uses client sessions to maintain connections across multiple requests. That makes it an efficient choice for high-concurrency session-based tasks.

⚙️ Features

  • Supports both the client and server sides of the HTTP protocol.
  • Provides native support for WebSockets (both client and server).
  • Offers middleware and pluggable routing for web servers.
  • Efficiently handles streaming of large data (see the sketch after this list).
  • Includes client session persistence, enabling connection reuse and reducing overhead for multiple requests.
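
As an illustration of the streaming support, the minimal sketch below reads a response body in chunks instead of loading it all into memory. The target URL, output file name, and chunk size are arbitrary assumptions used only for this example:

import asyncio
import aiohttp

async def stream_download():
    async with aiohttp.ClientSession() as session:
        # The URL and output file name are placeholders for this sketch
        async with session.get("https://httpbin.io/anything") as response:
            with open("output.bin", "wb") as f:
                # Read the body in 1 KB chunks instead of loading it all at once
                async for chunk in response.content.iter_chunked(1024):
                    f.write(chunk)

asyncio.run(stream_download())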

Scraping with AIOHTTP: Step-By-Step Tutorial

In the context of web scraping, AIOHTTP is just an HTTP client to fetch the raw HTML content of a page. To parse and extract data from that HTML, you then need an HTML parser like BeautifulSoup.

Follow this section to learn how to use AIOHTTP for web scraping with BeautifulSoup!

Warning: Although AIOHTTP is used primarily in the initial stages of the process, we will guide you through the entire scraping workflow. If you are interested in more advanced AIOHTTP web scraping techniques, feel free to skip ahead to the next chapter after Step 3.

Step #1: Set Up Your Scraping Project

Ensure that Python 3+ is installed on your machine. If not, download it from the official site and follow the installation instructions.

Next, create a directory for your AIOHTTP scraping project using this command:

mkdir aiohttp-scraper

Navigate into that directory and set up a virtual environment:

cd aiohttp-scraper
python -m venv env

Open the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both valid choices.

Now, create a scraper.py file inside the project folder. It will be empty at first, but you will soon add the scraping logic to it.

In your IDE’s terminal, activate the virtual environment. On Linux or macOS, use:

source ./env/bin/activate

Equivalently, on Windows, run:

env\Scripts\activate

Great! You are all set up and ready to go.

Step #2: Set Up the Scraping Libraries

With the virtual environment activated, install AIOHTTP and BeautifulSoup using the command below:

pip install aiohttp beautifulsoup4

This will add both aiohttp and beautifulsoup4 to your project’s dependencies.

Import them into your scraper.py script:

import asyncio
import aiohttp 
from bs4 import BeautifulSoup

Note that aiohttp requires asyncio to work.

Now, add the following async function workflow to your scraper.py file:

async def scrape_quotes():
    # Scraping logic...
    pass

# Run the asynchronous function
asyncio.run(scrape_quotes())

scrape_quotes() defines an asynchronous function where your scraping logic will run concurrently without blocking. Finally, asyncio.run(scrape_quotes()) starts and runs the asynchronous function.

Awesome! You can proceed to the next step in your scraping workflow.

Step #3: Get the HTML of the Target Page

In this example, you will see how to scrape data from the “Quotes to Scrape” site.


With a library like Requests, you would simply make a GET request and directly receive the HTML content of the page. However, AIOHTTP follows a different request lifecycle.

AIOHTTP’s primary component is ClientSession, which manages a pool of connections and supports Keep-Alive by default. Instead of opening a new connection for every request, it reuses existing connections, improving performance.

When making a request, the process typically involves three steps:

  1. Opening a session through ClientSession().
  2. Sending the GET request asynchronously with session.get().
  3. Accessing the response data with methods like await response.text().

Since each step uses an async with context, this design lets the event loop handle other operations between them without blocking, making it ideal for high-concurrency tasks.
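
To see why that matters, here is a minimal sketch (separate from the tutorial flow) that reuses one ClientSession to fetch several pages concurrently with asyncio.gather(). The example URLs are placeholders:

import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    # Example URLs, used only to illustrate concurrent fetching
    urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
        "http://quotes.toscrape.com/page/3/",
    ]
    async with aiohttp.ClientSession() as session:
        # All requests share the same session and run concurrently
        pages = await asyncio.gather(*(fetch(session, url) for url in urls))
        print([len(page) for page in pages])

asyncio.run(main())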

Given that, you can use AIOHTTP to retrieve the HTML of the homepage with this logic:

async with aiohttp.ClientSession() as session:
    async with session.get("http://quotes.toscrape.com") as response:
        # Access the HTML of the target page
        html = await response.text()

Behind the scenes, AIOHTTP sends the request to the server and waits for the response, which contains the HTML of the page. Once the response is received, await response.text() extracts the HTML content as a string.

Print the html variable and you will see:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <!-- omitted for brevity... -->
</body>
</html>

Way to go! You successfully retrieved the HTML content of the target page. Time to parse this content and extract the data you need.

Step #4: Parse the HTML

Pass the HTML content into the BeautifulSoup constructor to parse it:

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

html.parser is the default Python HTML parser used to process the content.

The soup object contains the parsed HTML and provides methods to extract the data you need.

AIOHTTP has handled retrieving the HTML, and now you are transitioning into the typical data parsing phase with BeautifulSoup. For more details, read our tutorial on BeautifulSoup web scraping.

Step #5: Write the Data Extraction Logic

You can scrape the quotes data from the page using the following code:

# Where to store the scraped data
quotes = []

# Extract all quotes from the page
quote_elements = soup.find_all("div", class_="quote")

# Loop through quotes and extract text, author, and tags
for quote_element in quote_elements:
    text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
    author = quote_element.find("small", class_="author").get_text()
    tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

    # Store the scraped data
    quotes.append({
        "text": text,
        "author": author,
        "tags": tags
    })

This snippet initializes a list named quotes to hold the scraped data. It then identifies all quote HTML elements and loops through them to extract the quote text, author, and tags. Each extracted quote is stored as a dictionary in the quotes list, organizing the data for later use or export.

Super! Scraping logic is now implemented.

Step #6: Export the Scraped Data

Use these lines of code to export the scraped data into a CSV file:

# Open the file for export
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    
    # Write the header row
    writer.writeheader()
    
    # Write the scraped quotes data
    writer.writerows(quotes)

The above snippet opens a file named quotes.csv in write mode. It then sets up the column headers (text, author, tags), writes the header row, and writes each dictionary from the quotes list to the CSV file.

csv.DictWriter simplifies data formatting, making it easier to store structured data. To make it work, remember to import csv from the Python Standard Library:

import csv

Step #7: Put It All Together

This is what your final AIOHTTP web scraping script should look like:

import asyncio
import aiohttp
from bs4 import BeautifulSoup
import csv

# Define an asynchronous function to make the HTTP GET request
async def scrape_quotes():
    async with aiohttp.ClientSession() as session:
        async with session.get("http://quotes.toscrape.com") as response:
            # Access the HTML of the target page
            html = await response.text()

            # Parse the HTML content using BeautifulSoup
            soup = BeautifulSoup(html, "html.parser")

            # List to store the scraped data
            quotes = []

            # Extract all quotes from the page
            quote_elements = soup.find_all("div", class_="quote")

            # Loop through quotes and extract text, author, and tags
            for quote_element in quote_elements:
                text = quote_element.find("span", class_="text").get_text().replace("“", "").replace("”", "")
                author = quote_element.find("small", class_="author").get_text()
                tags = [tag.get_text() for tag in quote_element.find_all("a", class_="tag")]

                # Store the scraped data
                quotes.append({
                    "text": text,
                    "author": author,
                    "tags": tags
                })

            # Open the file for export
            with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
                writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])

                # Write the header row
                writer.writeheader()

                # Write the scraped quotes data
                writer.writerows(quotes)

# Run the asynchronous function
asyncio.run(scrape_quotes())

You can run it with:

python scraper.py

Or, on Linux/macOS:

python3 scraper.py

A quotes.csv file will appear in the root folder of your project. Open it, and you will see the scraped quotes organized into text, author, and tags columns.

Et voilà! You just learned how to perform web scraping with AIOHTTP and BeautifulSoup.

AIOHTTP for Web Scraping: Advanced Features and Techniques

Now that you understand how to use AIOHTTP for basic web scraping, it is time to see more advanced scenarios.

In the following examples, the target site will be the HTTPBin.io /anything endpoint. That is a handy API that returns the IP address, headers, and other data sent by the requester.

Get ready to master AIOHTTP for web scraping!

Set Custom Headers

You can specify custom headers in an AIOHTTP request via the headers argument:

import aiohttp
import asyncio

async def fetch_with_custom_headers():
    # Custom headers for the request
    headers = {
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.9,fr-FR;q=0.8,fr;q=0.7,es-US;q=0.6,es;q=0.5,it-IT;q=0.4,it;q=0.3"
    }

    async with aiohttp.ClientSession() as session:
        # Make a GET request with custom headers
        async with session.get("https://httpbin.io/anything", headers=headers) as response:
            data = await response.json()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_headers())

This way, AIOHTTP will make a GET HTTP request with the Accept and Accept-Language headers set.

Set a Custom User Agent

User-Agent is one of the most critical HTTP headers for web scraping. By default, AIOHTTP uses this User-Agent:

Python/<PYTHON_VERSION> aiohttp/<AIOHTTP_VERSION>

The default value above can easily expose your requests as coming from an automated script. That will increase the risk of being blocked by the target site.

To reduce the chances of getting detected, you can set a real-world User-Agent just as you would any other custom header:

import aiohttp
import asyncio

async def fetch_with_custom_user_agent():
    # Define a Chrome-like custom User-Agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36"
    }

    async with aiohttp.ClientSession(headers=headers) as session:
        # Make a GET request with the custom User-Agent
        async with session.get("https://httpbin.io/anything") as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_user_agent())

Discover the best user agents for web scraping!

Set Cookies

Just like HTTP headers, you can set custom cookies using the cookies argument in ClientSession():

import aiohttp
import asyncio

async def fetch_with_custom_cookies():
    # Define cookies as a dictionary
    cookies = {
        "session_id": "9412d7hdsa16hbda4347dagb",
        "user_preferences": "dark_mode=false"
    }

    async with aiohttp.ClientSession(cookies=cookies) as session:
        # Make a GET request with custom cookies
        async with session.get("https://httpbin.io/anything") as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_with_custom_cookies())

Cookies help you include session data required in your web scraping requests.

Note that cookies set in ClientSession are shared across all requests made with that session. To access session cookies, refer to ClientSession.cookie_jar.
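
For example, here is a minimal sketch that prints the cookies currently stored in a session’s cookie jar. The cookie values are the same placeholders used above:

import asyncio
import aiohttp

async def inspect_cookie_jar():
    # Placeholder cookies, as in the previous example
    cookies = {
        "session_id": "9412d7hdsa16hbda4347dagb",
        "user_preferences": "dark_mode=false"
    }

    async with aiohttp.ClientSession(cookies=cookies) as session:
        # Iterate over the cookies stored in the session
        for cookie in session.cookie_jar:
            print(cookie.key, "=", cookie.value)

asyncio.run(inspect_cookie_jar())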

Proxy Integration

In AIOHTTP, you can route your requests through a proxy server to reduce the risk of IP bans. Do that by passing the proxy argument to the session’s HTTP request methods:

import aiohttp
import asyncio

async def fetch_through_proxy():
    # Replace with the URL of your proxy server
    proxy_url = "<YOUR_PROXY_URL>"

    async with aiohttp.ClientSession() as session:
        # Make a GET request through the proxy server
        async with session.get("https://httpbin.io/anything", proxy=proxy_url) as response:
            data = await response.text()
            # Handle the response...
            print(data)

# Run the event loop
asyncio.run(fetch_through_proxy())

Find out how to perform proxy authentication and rotation in our guide on how to use a proxy in AIOHTTP.
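
As a preview, if your proxy requires credentials, one common approach is passing a proxy_auth argument built with aiohttp.BasicAuth. The proxy URL, username, and password below are placeholders:

import asyncio
import aiohttp

async def fetch_through_authenticated_proxy():
    # Replace these placeholders with your proxy details
    proxy_url = "<YOUR_PROXY_URL>"
    proxy_auth = aiohttp.BasicAuth("<YOUR_USERNAME>", "<YOUR_PASSWORD>")

    async with aiohttp.ClientSession() as session:
        # Route the request through the authenticated proxy
        async with session.get(
            "https://httpbin.io/anything",
            proxy=proxy_url,
            proxy_auth=proxy_auth
        ) as response:
            print(await response.text())

# Run the event loop
asyncio.run(fetch_through_authenticated_proxy())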

Error Handling

By default, AIOHTTP raises errors only for connection or network issues. To raise exceptions for HTTP responses when receiving 4xx and 5xx status codes, you can use any of the following approaches:

  1. Set raise_for_status=True when creating the ClientSession: Automatically raise exceptions for all requests made through the session if the response status is 4xx or 5xx.
  2. Pass raise_for_status=True directly to request methods: Enable error raising for individual request methods (like session.get() or session.post()) without affecting others.
  3. Call response.raise_for_status() manually: Give full control over when to raise exceptions, allowing you to decide on a per-request basis.

Option #1 example:

import aiohttp
import asyncio

async def fetch_with_session_error_handling():
    async with aiohttp.ClientSession(raise_for_status=True) as session:
        try:
            async with session.get("https://httpbin.io/anything") as response:
                # No need to call response.raise_for_status(), as it is automatic
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_session_error_handling())

When raise_for_status=True is set at the session level, all requests made through that session will raise an aiohttp.ClientResponseError for 4xx or 5xx responses.

Option #2 example:

import aiohttp
import asyncio

async def fetch_with_raise_for_status():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get("https://httpbin.io/anything", raise_for_status=True) as response:
                # No need to manually call response.raise_for_status(), it is automatic
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_raise_for_status())

In this case, the raise_for_status=True argument is passed directly to the session.get() call. This ensures that an exception is raised automatically for any 4xx or 5xx status codes.

Option #3 example:

import aiohttp
import asyncio

async def fetch_with_manual_error_handling():
    async with aiohttp.ClientSession() as session:
        try:
            async with session.get("https://httpbin.io/anything") as response:
                response.raise_for_status()  # Manually raises error for 4xx/5xx
                data = await response.text()
                print(data)
        except aiohttp.ClientResponseError as e:
            print(f"HTTP error occurred: {e.status} - {e.message}")
        except aiohttp.ClientError as e:
            print(f"Request error occurred: {e}")

# Run the event loop
asyncio.run(fetch_with_manual_error_handling())

If you prefer more control over individual requests, you can call response.raise_for_status() manually after making a request. This approach allows you to decide exactly when to handle errors.

Retry Failed Requests

AIOHTTP does not provide built-in support for retrying requests automatically. To implement that, you must use custom logic or a third-party library like aiohttp-retry. This enables you to configure retry logic for failed requests, helping to handle transient network issues, timeouts, or rate limits.
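
If you prefer custom logic, a simple retry loop can be written by hand, as in the sketch below. It retries a request a fixed number of times with exponential backoff; the attempt count and delay values are arbitrary choices for illustration:

import asyncio
import aiohttp

async def fetch_with_retries(url, max_attempts=3):
    async with aiohttp.ClientSession() as session:
        for attempt in range(1, max_attempts + 1):
            try:
                async with session.get(url, raise_for_status=True) as response:
                    return await response.text()
            except aiohttp.ClientError:
                # Give up after the last attempt
                if attempt == max_attempts:
                    raise
                # Simple exponential backoff: 1s, 2s, 4s, ...
                await asyncio.sleep(2 ** (attempt - 1))

asyncio.run(fetch_with_retries("https://httpbin.io/anything"))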

Install aiohttp-retry with:

pip install aiohttp-retry

Then, you can use it as follows:

import asyncio
from aiohttp_retry import RetryClient, ExponentialRetry

async def main():
    # Retry failed requests up to 3 times with exponential backoff
    retry_options = ExponentialRetry(attempts=3)
    retry_client = RetryClient(raise_for_status=False, retry_options=retry_options)

    async with retry_client.get("https://httpbin.io/anything") as response:
        print(response.status)

    await retry_client.close()

# Run the event loop
asyncio.run(main())

This configures the client to retry failed requests using an exponential backoff strategy. Learn more in the official docs.

AIOHTTP vs Requests for Web Scraping

Below is a summary table to compare AIOHTTP and Requests for web scraping:

| Feature | AIOHTTP | Requests |
| --- | --- | --- |
| GitHub stars | 15.3k | 52.4k |
| Client support | ✔️ | ✔️ |
| Sync support | ❌ | ✔️ |
| Async support | ✔️ | ❌ |
| Server support | ✔️ | ❌ |
| Connection pooling | ✔️ | ✔️ |
| HTTP/2 support | ❌ | ❌ |
| User-agent customization | ✔️ | ✔️ |
| Proxy support | ✔️ | ✔️ |
| Cookie handling | ✔️ | ✔️ |
| Retry mechanism | Available only via a third-party library | Available via HTTPAdapters |
| Performance | High | Medium |
| Community support and popularity | Medium | Large |

For a complete comparison, check out our blog post on Requests vs HTTPX vs AIOHTTP.

Learn how to scrape websites with HTTPX.

Conclusion

In this article, you learned how to use the aiohttp library for web scraping. You explored what it is, the features it offers, and the benefits it provides. AIOHTTP stands out as a fast and reliable choice for making HTTP requests when gathering online data.

However, automated HTTP requests expose your public IP address. That can reveal your identity and location, putting your privacy at risk. To safeguard your security and privacy, one of the most effective strategies is to use a proxy server to hide your IP address.

Bright Data controls the best proxy servers in the world, serving Fortune 500 companies and more than 20,000 customers. Its offering includes a wide range of proxy types.

Create a free Bright Data account today to test our proxies and scraping solutions!

No credit card required