In this Scrapy vs Requests guide, you will see:
- What Scrapy and Requests are
- A comparison between Scrapy and Requests for web scraping
- A comparison between Scrapy and Requests on a pagination scenario
- Common limitations between Scrapy and Requests in web scraping scenarios
Let’s dive in!
What Is Requests?
Requests is a Python library for sending HTTP requests. It is widely used in web scraping, generally coupled with HTML parsing libraries like BeautifulSoup.
Key features of Requests for web scraping include:
- Support for HTTP methods: You can use all major HTTP methods like `GET`, `POST`, `PUT`, `PATCH`, and `DELETE`, which are essential for interacting with web pages and APIs.
- Custom headers: Set custom headers (e.g., `User-Agent` and others) to mimic a real browser or handle basic authentication.
- Session management: The `requests.Session()` object allows you to persist cookies and headers across multiple requests. That is useful for scraping websites that require login or need to maintain session state.
- Timeouts and error handling: You can set timeouts to avoid hanging requests and handle exceptions for robust scraping.
- Proxy support: You can route your requests through proxies, which is helpful for bypassing IP bans and accessing geo-restricted content.
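For instance, here is a minimal sketch combining several of these features (the target URL is just an example):

```python
import requests

# A session persists cookies and headers across multiple requests
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})  # mimic a real browser

try:
    # timeout avoids hanging requests; raise_for_status() surfaces HTTP errors
    response = session.get("https://quotes.toscrape.com/", timeout=10)
    response.raise_for_status()
    print(response.status_code)
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
```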
What Is Scrapy?
Scrapy is an open-source web scraping framework written in Python. It is built for extracting data from websites in a fast, efficient, and scalable way.
Scrapy provides a complete framework for crawling websites, extracting data, and storing it in various formats (e.g., JSON, CSV, etc.). It is particularly useful for large-scale web scraping projects, as it can handle complex crawling tasks and concurrent requests while respecting crawling rules.
Key features of Scrapy for web scraping include:
- Built-in web crawling: Scrapy is designed to be a web crawler. This means that it can follow links on a webpage automatically, allowing you to scrape multiple pages or entire sites with minimal effort.
- Asynchronous requests: It uses an asynchronous architecture to handle multiple requests concurrently. That makes it much faster than synchronous Python HTTP clients like `requests`.
- Selectors for data extraction: Scrapy lets you extract data from HTML using XPath and CSS selectors.
- Middleware for customization: It supports middleware to customize how requests and responses are handled.
- Automatic throttling: It can automatically throttle requests to avoid overloading the target server. This means that it can adjust the crawling speed based on server response times and load.
- Handling `robots.txt`: It respects the `robots.txt` file, ensuring that your scraping activities comply with the site’s rules.
- Proxy and user-agent rotation: Scrapy supports proxy rotation and `User-Agent` rotation through middlewares, which helps avoid IP bans and detection.
Scrapy vs Requests: Feature Comparison for Web Scraping
Now that you know what Requests and Scrapy are, it is time for an in-depth comparison of how they are used for web scraping:
| Feature | Scrapy | Requests |
| --- | --- | --- |
| Use case | Large-scale and complex scraping projects | Simpler web scraping tasks and prototypes |
| Asynchronous requests | Built-in support for asynchronous requests | No built-in support |
| Crawling | Automatically follows links and crawls multiple pages | Requires manual implementation for crawling |
| Data extraction | Built-in support for XPath and CSS selectors | Requires external libraries for data extraction |
| Concurrency | Handles multiple requests concurrently out of the box | Requires external integrations to manage concurrent requests |
| Middleware | Customizable middlewares for handling proxies, retries, and headers | No built-in middleware |
| Throttling | Built-in auto-throttling to avoid overloading servers | No built-in throttling |
| Proxy rotation | Supports proxy rotation via middlewares | Requires manual implementation |
| Error handling | Built-in retry mechanisms for failed requests | Requires manual implementation |
| File downloads | Supports file downloads but requires additional setup | Simple and straightforward file download support |
Use Cases
Scrapy is a full-fledged web scraping framework for large-scale and complex scraping projects. It is ideal for tasks that involve crawling multiple pages, concurrent requests, and data export in structured formats.
Requests, on the other hand, is a library for managing HTTP requests. So, it is better suited for simple tasks like fetching a single webpage, interacting with APIs, or downloading files.
Asynchronous Requests and Concurrency
Scrapy is built on Twisted, an event-driven networking framework for Python. That means it can handle asynchronous and multiple requests concurrently, making it much faster for large-scale scraping.
Requests, instead, does not support asynchronous or concurrent requests natively. If you want to make asynchronous HTTP requests, you can integrate it with GRequests.
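For example, a minimal sketch with GRequests (the URLs are illustrative) looks like this:

```python
import grequests

# Build a batch of unsent requests
urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]
unsent_requests = (grequests.get(url) for url in urls)

# Send them concurrently and collect the responses (None on failure)
responses = grequests.map(unsent_requests)
for response in responses:
    if response is not None:
        print(response.status_code, response.url)
```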
Crawling
When the `ROBOTSTXT_OBEY` setting is set to `True`, Scrapy reads the `robots.txt` file, automatically following allowed links on a webpage and crawling only the allowed pages.
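In a Scrapy project, enabling that behavior is a single line in `settings.py`:

```python
# settings.py: make the crawler respect the target site's robots.txt rules
ROBOTSTXT_OBEY = True
```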
Requests does not have built-in crawling capabilities, so you need to manually define links and make additional requests.
Data Extraction
Scrapy provides built-in support for extracting data using XPath and CSS selectors, making it easy to parse HTML and XML.
Requests does not include any data extraction abilities. You need to use external libraries like BeautifulSoup for parsing and extracting data.
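As a quick sketch of the difference (the selectors are illustrative), extracting the same text with Scrapy selectors and with Requests plus BeautifulSoup looks like this:

```python
# Scrapy: selectors are built into the response object (inside a spider callback)
# quote_text = response.css("span.text::text").get()
# quote_text = response.xpath("//span[@class='text']/text()").get()

# Requests: parsing requires an external library like BeautifulSoup
import requests
from bs4 import BeautifulSoup

html = requests.get("https://quotes.toscrape.com/").text
soup = BeautifulSoup(html, "html.parser")
quote_text = soup.select_one("span.text").get_text()
print(quote_text)
```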
Middleware
Scrapy offers customizable middlewares for handling proxies, retries, headers, and more. This makes it highly extensible for advanced scraping tasks.
Instead, Requests does not provide middleware support, so you need to manually implement features like proxy rotation or retries.
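As a rough sketch (the module path and header value are illustrative), a Scrapy downloader middleware is just a class with a `process_request()` hook, enabled in `settings.py`:

```python
# middlewares.py: add a custom header to every outgoing request
class CustomHeadersMiddleware:
    def process_request(self, request, spider):
        request.headers["Accept-Language"] = "en-US,en;q=0.9"
        return None  # let Scrapy keep processing the request


# settings.py: enable the middleware (illustrative module path and priority)
DOWNLOADER_MIDDLEWARES = {
    "quotes_scraper.middlewares.CustomHeadersMiddleware": 543,
}
```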
Throttling
Scrapy includes a built-in auto-throttling ability used to adjust the crawling speed based on server response times and load. That way, you can avoid flooding the target server with HTTP requests.
Requests does not have a built-in throttling feature. If you want to implement throttling, you need to manually add delays between requests, for example with the `time.sleep()` method.
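For instance, Scrapy’s auto-throttling is a single setting, while with Requests you add the delay yourself (the URLs and delay are illustrative):

```python
# Scrapy (settings.py): built-in auto-throttling
AUTOTHROTTLE_ENABLED = True

# Requests: manual delay between consecutive requests
import time
import requests

for page in range(1, 4):
    requests.get(f"https://quotes.toscrape.com/page/{page}/")
    time.sleep(2)  # wait two seconds before the next request
```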
Proxy Rotation
Scrapy supports proxy rotation through middlewares, making it easy to avoid IP bans and scrape sites anonymously.
Requests does not provide a built-in proxy rotation capability. If you want to manage proxies with `requests`, you need to manually configure proxies and write custom logic, as explained in our guide.
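A minimal rotation sketch with Requests (the proxy URLs are placeholders) picks a random proxy for each request:

```python
import random
import requests

# Placeholder proxy endpoints: replace them with your own proxies
proxy_pool = [
    "http://proxy1_host:port",
    "http://proxy2_host:port",
    "http://proxy3_host:port",
]

for _ in range(3):
    proxy = random.choice(proxy_pool)
    proxies = {"http": proxy, "https": proxy}
    response = requests.get("https://quotes.toscrape.com/", proxies=proxies)
    print(response.status_code, "via", proxy)
```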
Error Handling
Scrapy includes built-in retry mechanisms for failed requests, making it robust for handling network errors or server issues.
On the contrary, Requests requires you to manually handle errors and exceptions, for example with a `try-except` block. Consider also libraries like `retry-requests`.
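With Requests, that typically looks like this:

```python
import requests

try:
    response = requests.get("https://quotes.toscrape.com/", timeout=10)
    response.raise_for_status()  # raise an exception for 4xx/5xx responses
except requests.exceptions.Timeout:
    print("The request timed out")
except requests.exceptions.RequestException as e:
    print(f"The request failed: {e}")
```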
File Downloads
Scrapy supports file downloads via the `FilesPipeline`, but it requires additional setup to handle large files or streaming.

Requests provides simple and straightforward file download support through the `stream=True` parameter of the `requests.get()` method.
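For example, a streaming download with Requests (the file URL and name are illustrative) looks like this:

```python
import requests

# Illustrative URL and output file name
url = "https://example.com/some-large-file.zip"

with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open("some-large-file.zip", "wb") as file:
        # Write the file in chunks to avoid loading it all into memory
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)
```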
Scrapy vs Requests: Comparing the Two Libraries on a Pagination Scenario
You now know what Requests and Scrapy are. Get ready to see a step-by-step tutorial comparison for a specific web scraping scenario!
The focus will be on showing a comparison between these two libraries in a pagination scenario. Handling pagination in web scraping requires custom logic for link following and data extraction on multiple pages.
The target site will be Quotes to Scrape, which lists quotes from famous authors across multiple pages.
The objective of the tutorial is to show how to use Scrapy and Requests to retrieve the quotes from all pages. We will start with Requests, as it may be more complex to use than Scrapy for this scenario.
Requirements
To replicate the tutorials for Scrapy and Requests, you must have Python 3.7 or higher installed on your machine.
How to Use Requests for Web Scraping
In this chapter, you will learn how to use Requests to scrape all the quotes from the target site.
Bear in mind that you cannot use Requests alone to scrape data from web pages. You will also need an HTML parser like BeautifulSoup.
Step #1: Setting Up the Environment and Installing Dependencies
Suppose you call the main folder of your project `requests_scraper/`. At the end of this step, the folder will have the following structure:
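```
requests_scraper/
├── requests_scraper.py
└── venv/
```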
Where:
- `requests_scraper.py` is the Python file that contains all the code
- `venv/` contains the virtual environment
You can create the `venv/` virtual environment directory like so:
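```bash
python -m venv venv
```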
To activate it, on Windows, run:
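```bash
venv\Scripts\activate
```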
Equivalently, on macOS and Linux, execute:
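```bash
source venv/bin/activate
```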
Now you can install the required libraries with:
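In this case, the required libraries are Requests and Beautiful Soup:

```bash
pip install requests beautifulsoup4
```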
Step #2: Setting Up the Variables
You are now ready to start writing code in the `requests_scraper.py` file.
First, set up the variables like so:
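A minimal setup looks like this (the base URL is the address of Quotes to Scrape):

```python
# Starting URL of the website to scrape
base_url = "https://quotes.toscrape.com"

# List used to store all the quotes as they are scraped
all_quotes = []
```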
Here you defined:
- `base_url` as the starting URL of the website to scrape
- `all_quotes` as an empty list used to store all the quotes as they are scraped
Step #3: Create the Scraping Logic
You can implement the scraping and crawling logic with the following code:
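Here is a sketch of that logic. It reuses `base_url` and `all_quotes` from the previous step, and the CSS selectors are assumptions based on the markup of Quotes to Scrape:

```python
import requests
from bs4 import BeautifulSoup

url = base_url

while url:
    # Fetch and parse the current page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Each quote lives in an element with the "quote" class
    for quote in soup.select(".quote"):
        text = quote.select_one(".text").get_text()
        author = quote.select_one(".author").get_text()
        # A quote can have more than one tag, so collect them all
        tags = [tag.get_text() for tag in quote.select(".tag")]
        all_quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})

    # Look for the "next" button to get the link to the next page
    next_button = soup.select_one("li.next > a")
    if next_button:
        next_page = next_button["href"]  # e.g. "/page/2/"
        url = base_url + next_page
    else:
        url = None  # last page reached: stop the loop
```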
This code:
- Instantiates a `while` loop that will continue to run until all the pages are scraped.
- Inside the `while` loop:
  - `soup.select()` intercepts all the quote HTML elements on the page. The HTML of the page is structured so that each quote element has a class called `quote`.
  - The `for` cycle iterates over all the `quote` elements to extract the text, author, and tags with Beautiful Soup’s scraping methods. Here, you need custom logic for the tags because each quote element can contain more than one tag.
  - After scraping the whole page, the script searches for the `next` button. If the button exists, it extracts the link to the next page and updates the URL via `url = base_url + next_page`. When the process hits the last page, the next URL is set to `None`, and the loop ends.
Step #4: Append the Data to a CSV File
Now that you have scraped all the data, you can append it to a CSV file as below:
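A sketch of that logic, using the same keys as the dictionaries built in the previous step:

```python
import csv

# Name of the output CSV file
csv_file = "quotes.csv"

# Open the CSV in writing mode and dump the header plus all rows
with open(csv_file, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["text", "author", "tags"])  # header row
    for quote in all_quotes:
        writer.writerow([quote["text"], quote["author"], quote["tags"]])
```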
This part of the script uses the `csv` library to:
- Specify the name of the output CSV file as `quotes.csv`.
- Open the CSV in writing mode (`mode="w"`) and:
  - Write the header row to the CSV.
  - Write all the scraped quotes to the file.
Step #5: Put It All Together
This is the complete code for the Requests part of this Scrapy vs Requests tutorial:
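Assembling the sketches from the previous steps, the full script looks roughly like this:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Starting URL of the website to scrape
base_url = "https://quotes.toscrape.com"

# List used to store all the quotes as they are scraped
all_quotes = []

url = base_url
while url:
    # Fetch and parse the current page
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    # Each quote lives in an element with the "quote" class
    for quote in soup.select(".quote"):
        text = quote.select_one(".text").get_text()
        author = quote.select_one(".author").get_text()
        tags = [tag.get_text() for tag in quote.select(".tag")]
        all_quotes.append({"text": text, "author": author, "tags": ", ".join(tags)})

    # Follow the "next" button until the last page
    next_button = soup.select_one("li.next > a")
    url = base_url + next_button["href"] if next_button else None

# Export the scraped quotes to a CSV file
with open("quotes.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["text", "author", "tags"])
    for quote in all_quotes:
        writer.writerow([quote["text"], quote["author"], quote["tags"]])
```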
Run the above script:
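```bash
python requests_scraper.py
```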
A `quotes.csv` file will appear in the project folder.
How to Use Scrapy for Web Scraping
Now that you have learned how to use Requests for web scraping, you are ready to see how to use Scrapy with the same target page and objective.
Step #1: Setting Up the Environment and Installing Dependencies
Suppose you call the main folder of your project `scrapy_scraper/`.
First of all, create and activate a virtual environment as shown before and install Scrapy:
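```bash
pip install scrapy
```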
Launch Scrapy to populate the main folder with the predefined files inside `quotes_scraper/` with:
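```bash
scrapy startproject quotes_scraper
```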
This is the resulting structure of your project:
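The layout below reflects the standard files generated by `scrapy startproject`, plus the virtual environment:

```
scrapy_scraper/
├── venv/
└── quotes_scraper/
    ├── scrapy.cfg
    └── quotes_scraper/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
```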
Step #2: Define the Items
The `items.py` file defines the structure of the data you want to scrape. Since you want to retrieve the quotes, authors, and tags, define it as follows:
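A sketch of the item definition (the class name `QuoteItem` is an assumption, as the original is not shown):

```python
import scrapy


class QuoteItem(scrapy.Item):
    # One field per piece of data you want to scrape
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
```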
Step #3: Define the Main Spider
Inside the `spiders/` folder, create the following Python files:
- `__init__.py`, which marks the directory as a Python package
- `quotes_spider.py`
The `quotes_spider.py` file contains the actual scraping logic:
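Below is a sketch of such a spider (the spider name, the item import, and the CSS selectors are assumptions consistent with the description that follows):

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from quotes_scraper.items import QuoteItem


class QuotesSpider(CrawlSpider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    # Follow the "Next" pagination links and parse every page they lead to
    rules = (
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_item", follow=True),
    )

    def parse_start_url(self, response):
        # The rules only apply to followed links, so parse the first page too
        return self.parse_item(response)

    def parse_item(self, response):
        # Extract the quote, author, and tags from each quote element
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item
```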
The above snippet defines the `QuotesSpider` class, which does the following:
- Defines the URL to scrape.
- Defines the rule for pagination with the `Rule()` class, allowing the crawler to follow all the next pages.
- Extracts the quote, author, and tags with the `parse_item()` method.
Step #4: Define the Settings
Appending the data to a CSV requires some special configurations in Scrapy. To do so, open the `settings.py` file and add the following variables:
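For a CSV output file named `quotes.csv`, that means:

```python
# Export the scraped data to a CSV file
FEED_FORMAT = "csv"
FEED_URI = "quotes.csv"
```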
Here is what these settings do:
- `FEED_FORMAT` defines the output format of the file (which can be of different types)
- `FEED_URI` defines the name of the output file
Step #5: Run the Crawler
The Python files not mentioned in the previous steps are not needed for this tutorial, so you can leave them with their default content.
To launch the crawler, go into the `quotes_scraper/` folder:
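```bash
cd quotes_scraper
```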
Then, run the crawler:
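Assuming the spider name used in the sketch above (`quotes`), the command is:

```bash
scrapy crawl quotes
```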
This command instantiates the `QuotesSpider` class defined in `quotes_spider.py` and launches the crawler. The final CSV file you get is identical to the one you got with Requests and BeautifulSoup!
So, this example shows:
- How Scrapy is more suitable for large projects, thanks to its built-in crawling and data export capabilities.
- How managing pagination is easier with Scrapy, as you only need to manage a rule instead of writing custom logic, as in the previous case.
- How appending data to a CSV file is simpler with Scrapy. That is because you only need to add two settings instead of writing the custom export logic required in a standalone Python script.
Common Limitations Between Scrapy and Requests
While Scrapy and Requests are widely used in web scraping projects, they do come with some downsides.
In detail, one of the common limitations that every scraping library or framework is subject to is IP bans. You learned that Scrapy provides throttling, which helps adjust the rate at which requests hit the server. Still, that is often not enough to keep your IP from being banned.
The solution is to integrate proxies into your code. Let’s see how!
Using Proxy With Requests
If you want to use a single proxy in Requests, use the following logic:
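A minimal sketch (the proxy URL is a placeholder to replace with your own endpoint and credentials):

```python
import requests

# Placeholder proxy URL: replace it with your proxy endpoint and credentials
proxy_url = "http://username:password@proxy_host:port"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

# Route the request through the proxy
response = requests.get("https://quotes.toscrape.com/", proxies=proxies)
print(response.status_code)
```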
To learn more about proxies and proxy rotation in `requests`, read these guides from our blog:
Using Proxy in Scrapy
If you want to route your Scrapy requests through a single proxy, you can configure it via the `settings.py` file:
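One common sketch (the module path and proxy URL are placeholders) pairs a small custom downloader middleware, which sets the `proxy` meta key on every request, with an entry in `settings.py`:

```python
# middlewares.py: route every request through a single proxy
class SingleProxyMiddleware:
    def process_request(self, request, spider):
        # Placeholder proxy URL: replace it with your proxy endpoint and credentials
        request.meta["proxy"] = "http://username:password@proxy_host:port"


# settings.py: enable the middleware (illustrative module path and priority)
DOWNLOADER_MIDDLEWARES = {
    "quotes_scraper.middlewares.SingleProxyMiddleware": 350,
}
```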
This configuration routes all requests through the specified proxy. Learn more in our Scrapy proxy integration guide.
Instead, if you want to implement rotating proxies, you can use the `scrapy-rotating-proxies` library. Similarly, you can use an auto-rotating residential proxy.
If you are seeking reliable proxies, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide. This extensive network includes:
- Residential proxies: Over 72M residential IPs in more than 195 countries.
- Datacenter proxies: Over 770,000 datacenter IPs.
- ISP proxies: Over 700,000 ISP IPs.
- Mobile proxies: Over 7M mobile IPs.
Conclusion
In this Scrapy vs Requests blog post, you learned about the role of the two libraries in web scraping. You explored their features for page retrieval and data extraction and compared their performance in a real-world pagination scenario.
Requests requires more manual logic but offers greater flexibility for custom use cases, while Scrapy is slightly less adaptable but provides most of the tools needed for structured scraping.
You also discovered their limitations, such as potential IP bans and issues with geo-restricted content. Fortunately, these challenges can be overcome using proxies or dedicated web scraping solutions like Bright Data’s Web Scrapers.
The Web Scrapers seamlessly integrate with both Scrapy and Requests, allowing you to extract public data from major websites without restrictions.
Create a free Bright Data account today to explore our proxy and scraper APIs and start your free trial!
No credit card required