Scrapy vs. Requests: Which One Is Better For Web Scraping?

Compare Scrapy and Requests for web scraping to find the best tool for your needs.

In this Scrapy vs Requests guide, you will see:

  • What Scrapy and Requests are
  • A comparison between Scrapy and Requests for web scraping
  • A comparison between Scrapy and Requests on a pagination scenario
  • Common limitations between Scrapy and Requests in web scraping scenarios

Let’s dive in!

What Is Requests?

Requests is a Python library for sending HTTP requests. It is widely used in web scraping, generally coupled with HTML parsing libraries like BeautifulSoup.

Key features of Requests for web scraping include (see the sketch after this list):

  • Support for HTTP methods: You can use all major HTTP methods like GET, POST, PUT, PATCH, and DELETE, which are essential for interacting with web pages and APIs.
  • Custom headers: Set custom headers (e.g., User-Agent and others) to mimic a real browser or handle basic authentication.
  • Session management: The requests.Session() object allows you to persist cookies and headers across multiple requests. That is useful for scraping websites that require login or maintaining session states.
  • Timeouts and error handling: You can set timeouts to avoid hanging requests and handle exceptions for robust scraping.
  • Proxy support: You can route your requests through proxies, which is helpful for bypassing IP bans and accessing geo-restricted content.
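
The short sketch below ties several of these features together on the Quotes to Scrape demo site used later in this guide. The User-Agent string and the commented-out proxy URL are placeholders, so treat it as a minimal example rather than production code:

import requests

# Reuse a single session so cookies and headers persist across requests
session = requests.Session()
session.headers.update({
    # Custom User-Agent to mimic a real browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
})

# Optional: route traffic through a proxy (placeholder URL)
# session.proxies.update({"http": "<PROXY_URL>", "https": "<PROXY_URL>"})

try:
    # The timeout prevents the request from hanging indefinitely
    response = session.get("https://quotes.toscrape.com", timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    print(response.status_code, len(response.text))
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")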

What Is Scrapy?

Scrapy is an open-source web scraping framework written in Python. It is built for extracting data from websites in a fast, efficient, and scalable way.

Scrapy provides a complete framework for crawling websites, extracting data, and storing it in various formats (e.g., JSON, CSV, etc.). It is particularly useful for large-scale web scraping projects, as it can handle complex crawling tasks and concurrent requests while respecting crawling rules.

Key features of Scrapy for web scraping include (see the minimal spider sketch after this list):

  • Built-in web crawling: Scrapy is designed to be a web crawler. This means that it can follow links on a webpage automatically, allowing you to scrape multiple pages or entire sites with minimal effort.
  • Asynchronous requests: It uses an asynchronous architecture to handle multiple requests concurrently, which makes it much faster than synchronous HTTP clients like Requests when scraping many pages.
  • Selectors for data extraction: Scrapy provides the possibility to extract data from HTML by using XPaths and CSS Selectors.
  • Middleware for customization: It supports middleware to customize how requests and responses are handled.
  • Automatic throttling: It can automatically throttle requests to avoid overloading the target server. This means that it can adjust the crawling speed based on server response times and load.
  • Handling robots.txt: It respects the robots.txt file for web scraping, ensuring that your scraping activities comply with the site’s rules.
  • Proxy and user-agent rotation: Scrapy supports proxy rotation and User-Agent rotation through middlewares, which helps avoid IP bans and detection.
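
To give you a feel for how these pieces fit together, here is a minimal spider sketch. The site, selectors, and field names are hypothetical, so adapt them to your target pages:

import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider: the start URL and selectors are placeholders
    name = "example"
    start_urls = ["https://example.com/articles"]

    def parse(self, response):
        # Extract data with CSS selectors
        for article in response.css("div.article"):
            yield {
                "title": article.css("h2::text").get(),
                "link": article.css("a::attr(href)").get(),
            }
        # Built-in crawling: follow the "next page" link, if any
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

You can run a standalone spider like this with scrapy runspider example_spider.py, without creating a full project.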

Scrapy vs Requests: Feature Comparison for Web Scraping

Now that you know what Requests and Scrapy are, it is time for an in-depth comparison of how they apply to web scraping:

| Feature | Scrapy | Requests |
| --- | --- | --- |
| Use case | Large-scale and complex scraping projects | Simpler web scraping tasks and prototypes |
| Asynchronous requests | Built-in support for asynchronous requests | No built-in support |
| Crawling | Automatically follows links and crawls multiple pages | Requires manual implementation for crawling |
| Data extraction | Built-in support for XPath and CSS selectors | Requires external libraries for data extraction |
| Concurrency | Handles multiple requests concurrently out of the box | Requires external integrations to manage concurrent requests |
| Middleware | Customizable middlewares for handling proxies, retries, and headers | No built-in middleware |
| Throttling | Built-in auto-throttling to avoid overloading servers | No built-in throttling |
| Proxy rotation | Supports proxy rotation via middlewares | Requires manual implementation |
| Error handling | Built-in retry mechanisms for failed requests | Requires manual implementation |
| File downloads | Supports file downloads but requires additional setup | Simple and straightforward file download support |

Use Cases

Scrapy is a full-fledged web scraping framework for large-scale and complex scraping projects. It is ideal for tasks that involve crawling multiple pages, concurrent requests, and data export in structured formats.

Requests, on the other hand, is a library for managing HTTP requests. So, it is better suited for simpler tasks like fetching a single webpage, interacting with APIs, or downloading files.

Asynchronous Requests and Concurrency

Scrapy is built on Twisted, an event-driven networking framework for Python. That means it can send requests asynchronously and handle many of them concurrently, making it much faster for large-scale scraping.

Requests, instead, does not support asynchronous or concurrent requests natively. If you want to make asynchronous HTTP requests, you can integrate it with GRequests or run your calls in a thread pool, as shown below.
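
If you prefer to stay within the standard library, running Requests calls in a thread pool is a common workaround. This is a minimal sketch with an arbitrary page range, not a drop-in replacement for Scrapy's concurrency:

from concurrent.futures import ThreadPoolExecutor

import requests

# Arbitrary example pages from the demo site used later in this guide
urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

def fetch(url):
    # Each call runs in its own worker thread
    response = requests.get(url, timeout=10)
    return url, response.status_code

with ThreadPoolExecutor(max_workers=5) as executor:
    for url, status in executor.map(fetch, urls):
        print(url, status)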

Crawling

Scrapy ships with built-in crawling capabilities: with a CrawlSpider and link extraction rules, it automatically follows links on a page and crawls the discovered pages. In addition, when the ROBOTSTXT_OBEY setting is set to True, it reads the robots.txt file and skips any pages the site disallows.

Requests does not have built-in crawling capabilities, so you need to manually define links and make additional requests.

Data Extraction

Scrapy provides built-in support for extracting data using XPath and CSS selectors, making it easy to parse HTML and XML.

Requests does not include any data extraction capabilities. You need to pair it with external libraries like BeautifulSoup for parsing and extracting data.
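
To make the difference concrete, here is a small sketch that extracts the same value with Scrapy's standalone Selector class and with BeautifulSoup. The HTML snippet is made up for illustration:

from bs4 import BeautifulSoup
from scrapy.selector import Selector

html = '<div class="quote"><span class="text">Hello</span></div>'

# Scrapy: CSS and XPath selectors are built in
sel = Selector(text=html)
print(sel.css("span.text::text").get())                  # CSS selector
print(sel.xpath("//span[@class='text']/text()").get())   # XPath selector

# Requests has no parser of its own, so you pair it with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("span.text").get_text())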

Middleware

Scrapy offers customizable middlewares for handling proxies, retries, headers, and more. This makes it highly extensible for advanced scraping tasks.

Instead, Requests does not provide middleware support, so you need to manually implement features like proxy rotation or retries.
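
To give a rough idea of what a Scrapy middleware looks like, here is a minimal downloader middleware sketch that adds a custom header to every outgoing request. The class name, header, and project path (myproject) are hypothetical:

# middlewares.py
class CustomHeaderMiddleware:
    # Hypothetical middleware: attaches a header to every request
    def process_request(self, request, spider):
        request.headers["X-Example-Header"] = "scrapy-demo"
        return None  # let the request continue through the pipeline

# settings.py
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.CustomHeaderMiddleware": 543,
# }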

Throttling

Scrapy includes a built-in AutoThrottle extension that adjusts the crawling speed based on server response times and load. That way, you avoid flooding the target server with HTTP requests.

Requests does not have a built-in throttling feature. If you want to implement throttling, you need to manually add delays between requests, for example with time.sleep().
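
For reference, here is what both approaches can look like. The Scrapy settings go in settings.py, while the Requests part is a plain script; the delay values and page range are arbitrary examples:

# Scrapy (settings.py): enable the AutoThrottle extension
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10    # maximum delay when the server is slow

# Requests: manual throttling with a fixed delay between calls
import time

import requests

for page in range(1, 4):  # arbitrary page range for illustration
    response = requests.get(f"https://quotes.toscrape.com/page/{page}/", timeout=10)
    print(page, response.status_code)
    time.sleep(2)  # wait two seconds before the next request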

Proxy Rotation

Scrapy supports proxy rotation through middlewares, making it easy to avoid IP bans and scrape sites anonymously.

Requests does not provide a built-in proxy rotation capability. If you want to rotate proxies with requests, you need to configure them manually and write custom rotation logic, as explained in our guide.
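
In practice, manual rotation with Requests usually comes down to picking a proxy from a list for each request. Here is a minimal sketch with placeholder proxy URLs:

import random

import requests

# Placeholder proxy endpoints: replace them with your own
PROXIES = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]

def get_with_rotation(url):
    # Pick a random proxy for every request
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = get_with_rotation("https://quotes.toscrape.com")
print(response.status_code)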

Error Handling

Scrapy includes built-in retry mechanisms for failed requests, making it robust for handling network errors or server issues.

On the contrary, Requests requires you to handle errors and exceptions manually, for example with a try-except block. Libraries like retry-requests can also help.
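
On the Requests side, a common pattern is to mount urllib3's Retry helper on a session so that transient errors are retried automatically. The retry counts and status codes below are example values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                                      # retry up to three times
    backoff_factor=1,                             # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # retry on these status codes
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

try:
    response = session.get("https://quotes.toscrape.com", timeout=10)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Request failed after retries: {e}")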

File Downloads

Scrapy supports file downloads via the FilesPipeline but requires additional setup to handle large files or streaming.

Requests provides simple and straightforward file download support via the stream=True parameter of requests.get().
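
For example, downloading a file with Requests in streaming mode takes just a few lines. The file URL and output name below are placeholders:

import requests

# Placeholder URL: replace it with the file you want to download
file_url = "https://example.com/sample.pdf"

with requests.get(file_url, stream=True, timeout=30) as response:
    response.raise_for_status()
    with open("sample.pdf", "wb") as f:
        # Write the response body in chunks to keep memory usage low
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)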

Scrapy vs Requests: Comparing the Two Libraries on a Pagination Scenario

You now know what Requests and Scrapy are. Get ready to see a step-by-step tutorial comparison for a specific web scraping scenario!

The focus will be on showing a comparison between these two libraries in a pagination scenario. Handling pagination in web scraping requires custom logic for link following and data extraction on multiple pages.

The target site will be Quotes to Scrape, which provides quotes from famous authors on different pages:

The Quotes to Scrape target site

The objective of the tutorial is to show how to use Scrapy and Requests to retrieve the quotes from all pages. We will start with Requests, as it requires more manual work than Scrapy for this task.

Requirements

To replicate the tutorials for Scrapy and Requests, you must have Python 3.7 or higher installed on your machine.

How to Use Requests for Web Scraping

In this chapter, you will learn how to use Requests to scrape all the quotes from the target site.

Bear in mind that you cannot use Requests alone to scrape data from web pages. You will also need an HTML parser like BeautifulSoup.

Step #1: Setting Up the Environment and Installing Dependencies

Suppose you call the main folder of your project requests_scraper/. At the end of this step, the folder will have the following structure:

requests_scraper/
    ├── requests_scraper.py
    └── venv/

Where:

  • requests_scraper.py is the Python file that contains all the code
  • venv/ contains the virtual environment

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS and Linux, execute:

source venv/bin/activate

Now you can install the required libraries with:

pip install requests beautifulsoup4

Step #2: Setting Up the Variables

You are now ready to start writing code into the requests_scraper.py file.

First, set up the variables like so:

base_url = "https://quotes.toscrape.com"
all_quotes = []

Here you defined:

  • base_url as the starting URL of the website to scrape
  • all_quotes as an empty list used to store all the quotes as they are scraped

Step #3: Create the Scraping Logic

You can implement the scraping and crawling logic with the following code:

url = base_url
while url:
    # Send a GET request to the current page
    response = requests.get(url)

    # Parse the HTML code of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all quote blocks
    quotes = soup.select(".quote")
    for quote in quotes:
        text = quote.select_one(".text").get_text(strip=True)
        author = quote.select_one(".author").get_text(strip=True)
        tags = [tag.get_text(strip=True) for tag in quote.select(".tag")]
        all_quotes.append({
            "text": text,
            "author": author,
            "tags": ",".join(tags)
        })

    # Check for the "Next" button
    next_button = soup.select_one("li.next")
    if next_button:
        # Extract the URL from the "Next" button and
        # set it as the next page to scrape
        next_page = next_button.select_one("a")["href"]
        url = base_url + next_page
    else:
        url = None

This code:

  • Instantiates a while loop that keeps running until all the pages are scraped
  • Inside the while loop:
    • soup.select(".quote") selects all quote HTML elements on the page. The HTML of the page is structured so that each quote element has a class called quote.
    • The for loop iterates over the quote elements to extract the text, author, and tags with Beautiful Soup's scraping methods. The tags need custom logic because each quote element can contain more than one tag.
      The ‘quote’ classes in the HTML code of the target web page
  • After scraping the whole page, the script searches for the "Next" button. If the button exists, it extracts the link to the next page and updates url to base_url + next_page. When the process hits the last page, there is no "Next" button, url is set to None, and the loop ends.
The ‘next’ class that defines the ‘next’ button in the HTML code of the target web page

Step #4: Append the Data to a CSV File

Now that you have scraped all the data, you can append it to a CSV file as below:

with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(all_quotes)

This part of the script uses the csv library to:

  • Specify the name of the output CSV file as quotes.csv.
  • Open the CSV in writing mode (mode="w") and:
    • Write the header row to the CSV
    • Write all the scraped quotes to the file

Step #5: Put it All Together

This is the whole code for the Requests part of this Scrapy vs Requests tutorial:

import requests
from bs4 import BeautifulSoup
import csv

# URL of the website
base_url = "https://quotes.toscrape.com"
# List to store all quotes
all_quotes = []

# Start scraping from the first page
url = base_url
while url:
    # Send a GET request to the current page
    response = requests.get(url)

    # Parse the HTML code of the page
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all quote blocks
    quotes = soup.select(".quote")
    for quote in quotes:
        text = quote.select_one(".text").get_text(strip=True)
        author = quote.select_one(".author").get_text(strip=True)
        tags = [tag.get_text(strip=True) for tag in quote.select(".tag")]
        all_quotes.append({
            "text": text,
            "author": author,
            "tags": ",".join(tags)
        })

    # Check for the "Next" button
    next_button = soup.select_one("li.next")
    if next_button:
        # Extract the URL from the "Next" button and
        # set it as the next page to scrape
        next_page = next_button.select_one("a")["href"]
        url = base_url + next_page
    else:
        url = None

# Save the quotes to a CSV file
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.DictWriter(file, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    writer.writerows(all_quotes)

Run the above script:

python requests_scraper.py

A quotes.csv file will appear in the project folder:

The expected CSV file after data extraction with Requests and BeautifulSoup

How to Use Scrapy for Web Scraping

Now that you have learned how to use Requests for web scraping, you are ready to see how to use Scrapy with the same target page and objective.

Step #1: Setting Up the Environment and Installing Dependencies

Suppose you want to call the main folder of your project scrapy_scraper/.

First of all, create and activate a virtual environment as shown before and install Scrapy:

pip install scrapy

Then, initialize a Scrapy project named quotes_scraper, which populates the quotes_scraper/ folder with predefined files:

scrapy startproject quotes_scraper

This is the resulting structure of your project:

scrapy_scraper/
├── quotes_scraper/ # Main Scrapy project folder
│   ├── __init__.py
│   ├── items.py # Defines the data structure for scraped items
│   ├── middlewares.py # Custom middlewares
│   ├── pipelines.py # Handles post-processing of scraped data
│   ├── settings.py # Project settings
│   └── spiders/ # Folder for all spiders
├── venv/
└── scrapy.cfg # Scrapy configuration file

Step #2: Define the Items

The items.py file defines the structure of the data you want to scrape. Since you want to retrieve the quotes, authors, and tags, define it as follows:

import scrapy

class QuotesScraperItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Step #3: Define the Main Spider

Inside the spiders/ folder create the following Python files:

  • __init__.py, which marks the directory as a Python package
  • quotes_spider.py

The quotes_spider.py file contains the actual scraping logic:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import QuotesScraperItem

class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    # Define rules for following pagination links
    rules = (
        Rule(LinkExtractor(restrict_css="li.next a"), callback="parse_item", follow=True),
    )

    def parse_start_url(self, response):
        # The response for the start URL does not go through the rule callback,
        # so parse it here to also scrape the quotes on the first page
        return self.parse_item(response)

    def parse_item(self, response):
        # Extract quotes, authors, and tags
        for quote in response.css("div.quote"):
            item = QuotesScraperItem()
            item["quote"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item

The above snippet defines the QuotesSpider() class, which does the following:

  • Defines the start URL and the allowed domain to scrape.
  • Defines the pagination rule with the Rule() class, allowing the crawler to follow all the "next" page links.
  • Parses the first page through parse_start_url(), since the start URL response is not passed to the rule callback.
  • Extracts the quote, author, and tags with the parse_item() method.

Step #4: Define the Settings

Exporting the scraped data to a CSV file requires a couple of settings in Scrapy. To do so, open the settings.py file and add the following variables:

FEED_FORMAT = "csv"
FEED_URI = "quotes.csv"

Here is what these settings do:

  • FEED_FORMAT: Sets the output format of the exported data (CSV, in this case)
  • FEED_URI: Sets the name (or path) of the output file, quotes.csv
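
Note that in recent Scrapy releases (2.1 and later), FEED_FORMAT and FEED_URI are deprecated in favor of the single FEEDS setting. If you are on a newer version, the equivalent configuration is:

FEEDS = {
    "quotes.csv": {"format": "csv"},
}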

Step #5: Run the Crawler

The Python files not mentioned in the previous steps are not needed for this tutorial, so you can leave them with their default content.

To launch the crawler, go into the quotes_scraper/ folder:

cd quotes_scraper

Then, run the crawler:

scrapy crawl quotes

This command runs the QuotesSpider() class defined in quotes_spider.py, which launches the crawler. The final CSV file you get is identical to the one you got with Requests and BeautifulSoup!

So, this example shows:

  • How Scrapy is better suited for larger projects, thanks to its project structure and built-in crawling.
  • How managing pagination is easier with Scrapy, as you only need to define a rule instead of writing custom link-following logic, as in the previous case.
  • How exporting data to a CSV file is simpler with Scrapy, since two settings replace the custom CSV-writing logic you would otherwise write in a plain Python script.

Common Limitations Between Scrapy and Requests

While Scrapy and Requests are widely used in web scraping projects, they do come with some downsides.

In detail, one limitation that every scraping library or framework is subject to is IP bans. You learned that Scrapy provides throttling, which adjusts the rate at which the target server is requested. Still, that is often not enough to keep your IP from being banned.

The solution is to route your requests through proxies. Let’s see how!

Using Proxy With Requests

If you want to use a single proxy in Requests, use the following logic:

import requests

url = "https://quotes.toscrape.com"  # target page
proxy = {
    "http": "<HTTP_PROXY_URL>",
    "https": "<HTTPS_PROXY_URL>"
}
response = requests.get(url, proxies=proxy)

To learn more about proxies and proxy rotation in Requests, check out the dedicated guides on our blog.

Using Proxy in Scrapy

If you want to route your Scrapy requests through a single proxy, the simplest option is to rely on the built-in HttpProxyMiddleware, which is enabled by default. It reads the proxy either from the http_proxy / https_proxy environment variables or from the proxy key in each request's meta dictionary. For example, in your spider:

# Route every request through a single proxy
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={"proxy": "<PROXY_URL>"})

This will send all the spider's requests through the specified proxy. Learn more in our Scrapy proxy integration guide.

Instead, if you want to implement rotating proxies, you can use the scrapy-rotating-proxies library. Similarly, you can use an auto-rotating residential proxy.
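
If you go the scrapy-rotating-proxies route, the setup looks roughly like the following. The proxy URLs are placeholders, and you should double-check the library's documentation for the exact options in your version:

# settings.py
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}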

If you are seeking reliable proxies, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide.

Conclusion

In this Scrapy vs Requests blog post, you learned about the role of the two libraries in web scraping. You explored their features for page retrieval and data extraction and compared their performance in a real-world pagination scenario.

Requests requires more manual logic but offers greater flexibility for custom use cases, while Scrapy is slightly less adaptable but provides most of the tools needed for structured scraping.

You also discovered their limitations, such as potential IP bans and issues with geo-restricted content. Fortunately, these challenges can be overcome using proxies or dedicated web scraping solutions like Bright Data’s Web Scrapers.

The Web Scrapers seamlessly integrate with both Scrapy and Requests, allowing you to extract public data from major websites without restrictions.

Create a free Bright Data account today to explore our proxy and scraper APIs and start your free trial!

No credit card required