In this Scrapy vs Pyspider guide, you will learn:
- What Scrapy and Pyspider are
- A comparison between Scrapy and Pyspider for web scraping
- How to use both Scrapy and Pyspider for web scraping
- Common limitations of Scrapy and Pyspider in web scraping scenarios
Let’s dive in!
What Is Scrapy?
Scrapy is an open-source web scraping framework written in Python. Its main goal is to extract data from websites quickly and efficiently. In detail, it allows you to:
- Define how to navigate and gather information from one or more web pages.
- Handle aspects like HTTP requests, link following, and data extraction.
- Avoid bans by adjusting the request speed with throttling and asynchronous requests.
- Manage proxies and proxy rotation via custom middleware or the `scrapy-rotating-proxies` library.
What Is Pyspider?
Pyspider is an open-source web crawling framework written in Python. It is built to extract data from websites with ease and flexibility, and enables you to:
- Define how to navigate and gather information from one or more web pages via either the CLI or a user-friendly web interface.
- Handle aspects like task scheduling, retries, and data storage.
- Limit blocks by supporting distributed crawling and prioritized tasks.
- Manage complex workflows and data processing with built-in support for databases and message queues.
Scrapy vs Pyspider: Features Comparison for Web Scraping
Now that you have learned what Scrapy and Pyspider are, it is time to compare them for web scraping:
| Feature | Scrapy | Pyspider |
|---|---|---|
| Use case | Large-scale and complex scraping projects | Scheduled scraping tasks |
| Scraping management | CLI | CLI and UI |
| Parsing methods | XPath and CSS selectors | CSS selectors |
| Data saving | Can export data to CSV and other file formats | Automatically saves data into a database |
| Retry | Needs manual intervention to retry | Automatically retries failed tasks |
| Task scheduling | Needs external integrations | Natively supported |
| Proxy rotation | Supports proxy rotation via middlewares | Requires manual intervention |
| Community | Huge community, currently with more than 54k GitHub stars, which actively contributes to it | Vast community, currently with more than 16k GitHub stars, but archived since June 11, 2024 |
The above Scrapy vs Pyspider comparison table shows that these two libraries are similar. The major differences at a high level are:
- Scrapy can be used only via the CLI, while Pyspider also provides a UI.
- Scrapy can parse XPath and CSS selectors, while Pyspider only supports CSS selectors.
- Scrapy supports proxy rotation through custom middleware logic, while Pyspider requires manual intervention.
However, what is really important to consider is that Pyspider is no longer maintained, as its GitHub repository has been archived since June 2024.
Scrapy vs Pyspider: Direct Scraping Comparison
After comparing Scrapy vs Pyspider, you learned that these two frameworks offer similar web scraping features. For that reason, the best way to compare them is through an actual coding example.
The next two sections will show you how to use Scrapy and Pyspider to scrape the same site. In detail, the target page will be the “Hockey Teams” page from Scrape This Site, which presents hockey data in tabular form.
The goal of these sections is to retrieve all the data from the table and save it locally. Let’s see how!
How to Use Scrapy for Web Scraping
In this section, you will learn how to use Scrapy to retrieve all the data from the table on the target page.
Requirements
To follow this tutorial, you must have Python 3.7 or higher installed on your machine.
Step #1: Setting Up the Environment and Installing Dependencies
Suppose you call the main folder of your project `hockey_scraper/`. At the end of this step, the folder will have the following structure:
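```
hockey_scraper/
└── venv/
```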
You can create the `venv/` virtual environment directory like so:
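```bash
python -m venv venv
```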
To activate it, on Windows, run:
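```powershell
venv\Scripts\activate
```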
Equivalently, on macOS/Linux, execute:
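```bash
source venv/bin/activate
```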
Now you can install Scrapy with:
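```bash
pip install scrapy
```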
Step #2: Start a New Project
Now you can launch a new Scrapy project. Inside the `hockey_scraper/` main folder, type:
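```bash
scrapy startproject hockey
```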
With that command, Scrapy will create a `hockey/` folder. Inside it, it will automatically generate all the files you need. This is the resulting folder structure:
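```
hockey_scraper/
├── venv/
└── hockey/
    ├── scrapy.cfg
    └── hockey/
        ├── __init__.py
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders/
            └── __init__.py
```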
Step #3: Generate the Spider
To generate a new spider to crawl the target website, first move into the `hockey/` folder:
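```bash
cd hockey
```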
Then, generate a new spider with:
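```bash
scrapy genspider data www.scrapethissite.com
```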
In this command, `data` is the name of the spider. Scrapy will automatically create a `data.py` file inside the `spiders/` folder. That file will contain the scraping logic required to retrieve the Hockey Teams data.
Step #4: Define the Scraping Logic
You are ready to code the scraping logic. First, inspect the table containing the data of interest in your browser. You can see that the data is contained inside a `.table` element.
To get all the data, write the following code in the `data.py` file:
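Below is a minimal sketch of what `data.py` can look like. The CSS classes used for the columns (`.name`, `.year`, `.wins`, and so on) match the markup of the Scrape This Site table at the time of writing, so adjust them if the page changes:

```python
import scrapy


class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["www.scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/"]

    def parse(self, response):
        # each team is a row of the ".table" element marked with the "team" class
        for row in response.css(".table tr.team"):
            yield {
                "name": row.css(".name::text").get(default="").strip(),
                "year": row.css(".year::text").get(default="").strip(),
                "wins": row.css(".wins::text").get(default="").strip(),
                "losses": row.css(".losses::text").get(default="").strip(),
                "ot_losses": row.css(".ot-losses::text").get(default="").strip(),
                "win_pct": row.css(".pct::text").get(default="").strip(),
                "goals_for": row.css(".gf::text").get(default="").strip(),
                "goals_against": row.css(".ga::text").get(default="").strip(),
                "goal_diff": row.css(".diff::text").get(default="").strip(),
            }
```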
Note that the variables `name`, `allowed_domains`, and `start_urls` were automatically created by Scrapy in the previous step, and so was the `parse()` method. The only scraping logic you need to add in this step is the `for` loop.
In detail, the `response.css()` method searches for the table. Then, the code iterates over all of its rows and extracts the data from each one.
Step #5: Run the Crawler and Save the Data into a CSV File
To run the crawler and save the scraped data into a CSV file, type the following:
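```bash
scrapy crawl data -o output.csv
```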
With this command, Scrapy:
- Runs the `data.py` file that contains the scraping logic
- Saves the scraped data into a CSV file called `output.csv`
The resulting `output.csv` file will contain one record per table row, with the fields defined in the spider.
Note that this way of using Scrapy is the shortest, but it is not the only one. Scrapy provides different customizations and settings, and you can learn more about that in our article on Scrapy vs Requests.
How to Use Pyspider for Web Scraping
Now, see how to use Pyspider to scrape the same target website.
Requirements
The latest Python version that Pyspider supports is 3.6. If you have a later Python version installed, read the following steps to learn how to set up and use version 3.6.
Step #1: Setting Up the Environment and Installing Dependencies
Suppose you call the main folder of your project `hockey_scraper/`.
If you have Python 3.7 or later, install `pyenv` to get Python 3.6.
Use `pyenv` to install Python 3.6 with this command:
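```bash
# 3.6.15 is the final 3.6 patch release; any 3.6.x version works
pyenv install 3.6.15
```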
Then make it the local version of Python, so you do not affect the whole system with a different version:
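```bash
pyenv local 3.6.15
```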
To make sure everything went alright, verify the Python version:
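```bash
python --version
```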
The result should be:
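```
Python 3.6.15
```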
Create a virtual environment by selecting the correct Python version:
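```bash
python -m venv venv
```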
Activate the virtual environment as shown in the previous chapter of this guide. Now, you can install Pyspider with:
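```bash
pip install pyspider
```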
To launch the UI, run:
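```bash
pyspider
```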
Note that, since this repository is archived and you are using Python 3.6, you are likely to run into some errors. To fix them, you may need to install a few additional libraries, for example:
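```bash
# indicative pin: Pyspider predates Tornado 5, which changed the IOLoop API,
# so older setups commonly pin Tornado to the 4.x series
pip install tornado==4.5.3
```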
You might also receive other errors regarding the `webdav.py` file. Search for the file and apply the following fixes:
- In the `ScriptProvider()` class, rename the method `getResourceInst()` to `get_resource_inst()`.
- At the bottom of the file, search for the variable `config = DEFAULT_CONFIG.copy()` and replace all the subsequent code with the version sketched below:
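The following sketch is one commonly reported patch, assuming a `wsgidav` 3.x install, where the authenticator is configured through the `http_authenticator` key rather than the old `domaincontroller` one. Treat it as a starting point and adapt it to the exact errors you see:

```python
config = DEFAULT_CONFIG.copy()
config.update({
    "mount_path": "/dav",
    "provider_mapping": {
        "/": ScriptProvider(app)
    },
    # wsgidav 3.x sets the authenticator here instead of "domaincontroller"
    "http_authenticator": {
        "HTTPAuthenticator": NeedAuthController,
    },
    "verbose": 1 if app.debug else 0,
    "dir_browser": {
        "davmount": False,
        "enable": True,
        "msmount": False,
        "response_trailer": "",
    },
})
dav_app = WsgiDAVApp(config)
```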
The Pyspider web UI should now be able to start. Visit `http://localhost:5000/` in your browser to reach the dashboard.
Step #2: Create a New Project
Click on “Create” to create a new project and fill in the fields:
- Choose a project name of your choice, for example `Hockey_scraper`.
- Set `https://www.scrapethissite.com/pages/forms/` in the start URL(s) field.
Then, create the project, and Pyspider will open it in its editor.
Step #3: Define the Scraping Logic
Implement the scraping logic by writing the Python code directly in the editor on the right side of the UI:
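Below is a sketch of what that handler can look like, adapted from Pyspider’s default project template. The CSS classes for the columns mirror the ones used in the Scrapy example and reflect the target table’s markup at the time of writing:

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl("https://www.scrapethissite.com/pages/forms/",
                   callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        return self.detail_page(response)

    @config(priority=2)
    def detail_page(self, response):
        # iterate over all rows of the team table and extract each column
        return [
            {
                "name": row(".name").text(),
                "year": row(".year").text(),
                "wins": row(".wins").text(),
                "losses": row(".losses").text(),
                "ot_losses": row(".ot-losses").text(),
                "win_pct": row(".pct").text(),
                "goals_for": row(".gf").text(),
                "goals_against": row(".ga").text(),
                "goal_diff": row(".diff").text(),
            }
            for row in response.doc(".table tr.team").items()
        ]
```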
Here is what changed from the default code:
- The `response.doc()` method searches for the target table.
- `detail_page()` iterates over the table rows and returns the data extracted from each one.
Click “Save” and “Run” to start the scraping process. The resulting data will be similar to what you got with Scrapy.
Great! You now know how to use both Scrapy and Pyspider for web scraping.
Scrapy vs Pyspider: Which One to Use?
The comparison between Scrapy and Pyspider has shown how to use them, but which one is better? Time to find out!
Choose Scrapy if:
- You work on high-performance projects that need parallel crawling and advanced features, like throttling.
- You need to integrate your scraping with external pipelines or other tools.
- You are confident with the CLI and web scraping scenarios, and prefer an up-to-date framework.
Choose Pyspider if:
- You prefer using a UI instead of the CLI.
- You want to work on a distributed system and prefer simple configurations.
- You want to schedule scraping tasks.
As always, there is no definitive winner—the best scraping tool depends entirely on your specific needs and use case.
Limitations of Scrapy and Pyspider
Scrapy and Pyspider are powerful frameworks for web scraping, but they have their limitations.
First, they struggle with scraping dynamic content sites that use JavaScript for rendering or data retrieval. While both can be extended to scrape JavaScript-powered sites, they are inherently limited in that aspect. This also makes them more susceptible to common anti-scraping measures.
Also, both of these frameworks are subject to IP bans, as they make a lot of automated requests. Those may trigger rate limiters, which can lead to your IP getting blacklisted. A solution to prevent your IP from being banned is to integrate proxies into your code.
For proxy rotation, see our guide on how to use proxies to rotate IP addresses in Python.
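As an illustration, here is a minimal sketch of how a proxy can be plugged into a Scrapy spider through the request `meta`, which Scrapy’s built-in `HttpProxyMiddleware` reads. The proxy URL is a placeholder to replace with your own endpoint and credentials:

```python
import scrapy

# hypothetical proxy endpoint: replace with your provider's host, port,
# and credentials
PROXY_URL = "http://username:password@proxy.example.com:8000"


class ProxySpider(scrapy.Spider):
    name = "proxy_example"
    start_urls = ["https://www.scrapethissite.com/pages/forms/"]

    def start_requests(self):
        for url in self.start_urls:
            # setting meta["proxy"] routes this request through the proxy
            yield scrapy.Request(url, meta={"proxy": PROXY_URL})

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)
```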
Finally, if you are seeking reliable proxy servers, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide. This extensive network includes:
- Datacenter proxies: Over 770,000 datacenter IPs.
- Residential proxies: Over 72M residential IPs in more than 195 countries.
- ISP proxies: Over 700,000 ISP IPs.
- Mobile proxies: Over 7M mobile IPs.
Conclusion
In this Scrapy vs Pyspider blog post, you learned about the role of the two libraries in web scraping. You explored their features for data extraction and compared them on the same real-world scraping task.
Pyspider provides a friendly UI but, unfortunately, is no longer maintained. Scrapy, instead, is useful for large projects, as it provides most of the tools needed for structured scraping, and its underlying technology keeps pace with the latest Python versions.
You also discovered their limitations, such as potential IP bans. Fortunately, these challenges can be overcome using proxies or dedicated web scraping solutions like Bright Data’s Web Scraper API. This scraping-focused API seamlessly integrates with Scrapy, Pyspider, and any other HTTP client or web scraping tool, enabling unrestricted data extraction.
Create a free Bright Data account today to explore our proxy and scraper APIs!
No credit card required