Scrapy vs Pyspider: Which One Is Better for Web Scraping?

Compare Scrapy vs Pyspider for web scraping and choose the best tool for your web scraping needs.

In this Scrapy vs Pyspider guide, you will learn:

  • What Scrapy and Pyspider are
  • A comparison between Scrapy and Pyspider for web scraping
  • How to use both Scrapy and Pyspider for web scraping
  • Common limitations between Scrapy and Pyspider in web scraping scenarios

Let’s dive in!

What Is Scrapy?

Scrapy is an open-source web scraping framework written in Python. Its main goal is to extract data from websites quickly and efficiently. In detail, it allows you to:

  • Define how to navigate and gather information from one or more web pages.
  • Handle aspects like HTTP requests, link following, and data extraction.
  • Avoid bans by adjusting the request rate with throttling and asynchronous requests.
  • Manage proxies and proxy rotation via custom middleware or the scrapy-rotating-proxies library (see the example settings after this list).
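
For example, here is a minimal sketch of what such anti-ban settings could look like in a project's settings.py. The proxy URLs are placeholders, and the last two entries assume the optional scrapy-rotating-proxies package is installed:

# settings.py -- anti-ban sketch (proxy URLs are placeholders)

# Throttle requests automatically based on server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Limit concurrency and add a base delay between requests
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 0.5

# Proxy rotation via the optional scrapy-rotating-proxies package
ROTATING_PROXY_LIST = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8031",
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}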

What Is Pyspider?

Pyspider is an open-source web crawling framework written in Python. It is built to extract data from websites with ease and flexibility, and enables you to:

  • Define how to navigate and gather information from one or more web pages via either the CLI or a user-friendly web interface.
  • Handle aspects like task scheduling, retries, and data storage (see the minimal handler sketch after this list).
  • Reduce blocks thanks to support for distributed crawling and prioritized tasks.
  • Manage complex workflows and data processing with built-in support for databases and message queues.
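
To give a first idea of how this looks in practice, here is a minimal handler sketch (the URL is a placeholder; a complete, real example follows later in this guide). The @every decorator natively schedules the crawl, and self.crawl() queues pages for processing:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    # Natively schedule this crawl to run once a day
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl("https://example.com", callback=self.index_page)

    def index_page(self, response):
        # response.doc() returns a PyQuery object for parsing the page
        return {"title": response.doc("title").text()}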

Scrapy vs Pyspider: Features Comparison for Web Scraping

Now that you have learned what Scrapy and Pyspider are, it is time to compare them for web scraping:

| Feature | Scrapy | Pyspider |
| --- | --- | --- |
| Use case | Large-scale and complex scraping projects | Scheduled scraping tasks |
| Scraping management | CLI | CLI and UI |
| Parsing methods | XPath and CSS selectors | CSS selectors |
| Data saving | Can export data to CSV and other file formats | Automatically saves data into a database |
| Retry | Needs manual intervention to retry | Automatically retries failed tasks |
| Task scheduling | Needs external integrations | Natively supported |
| Proxy rotation | Supports proxy rotation via middlewares | Requires manual intervention |
| Community | Huge community, currently with more than 54k GitHub stars, which actively contributes to it | Vast community, currently with more than 16k GitHub stars, but archived since June 11, 2024 |

The above Scrapy vs Pyspider comparison table shows that these two libraries are similar. The major differences at a high level are:

  • Scrapy can be used only via the CLI, while Pyspider also provides a UI.
  • Scrapy can parse XPath and CSS selectors, while Pyspider only supports CSS selectors.
  • Scrapy supports proxy rotation via middleware, while Pyspider requires manual intervention to handle proxies.

However, what is really important to consider is that Pyspider is no longer supported:

Pyspider archived GitHub repository

Scrapy vs Pyspider: Direct Scraping Comparison

After comparing Scrapy vs Pyspider, you learned that these two frameworks offer similar web scraping features. For that reason, the best way to compare them is through an actual coding example.

The next two sections will show you how to use Scrapy and Pyspider to scrape the same site. In detail, the target page will be the “Hockey Teams” page from Scrape This Site. This page contains hockey data in tabular form:

The tabular data to scrape

The goal of these sections is to retrieve all the data from the table and save them locally. Let’s see how!

How to Use Scrapy for Web Scraping

In this section, you will learn how to use Scrapy to retrieve all the data from the table on the target website.

Requirements

To follow this tutorial, you must have Python 3.7 or higher installed on your machine.

Step #1: Setting Up the Environment and Installing Dependencies

Suppose you call the main folder of your project hockey_scraper/. At the end of this step, the folder will have the following structure:

hockey_scraper/
   └── venv/

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS/Linux, execute:

source venv/bin/activate

Now you can install Scrapy with:

pip install scrapy
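
To confirm that the installation completed successfully, you can print the installed version:

scrapy version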

Step #2: Start a New Project

Now you can launch a new Scrapy project. Inside the hockey_scraper/ main folder, type:

scrapy startproject hockey

With that command, Scrapy will create a hockey/ folder. Inside it, it will automatically generate all the files you need. This is the resulting folder structure:

hockey_scraper/
    ├── hockey/ # Main Scrapy project folder
    │   ├── __init__.py
    │   ├── items.py # Defines the data structure for scraped items
    │   ├── middlewares.py # Custom middlewares
    │   ├── pipelines.py # Handles post-processing of scraped data
    │   ├── settings.py # Project settings
    │   └── spiders/ # Folder for all spiders
    ├── venv/
    └── scrapy.cfg # Scrapy configuration file

Step #3: Generate the Spider

To generate a new spider to crawl the target website, first go into the hockey/ folder:

cd hockey

Then, generate a new spider with:

scrapy genspider data https://www.scrapethissite.com/pages/forms/

In this command, data is the name of the spider. Scrapy will automatically create a data.py file inside the spiders/ folder. That file will contain the scraping logic required to retrieve the Hockey team data.
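
The generated data.py file should look roughly like this (the exact boilerplate can vary slightly across Scrapy versions):

import scrapy

class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["www.scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/"]

    def parse(self, response):
        pass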

Step #4: Define the Scraping Logic

You are ready to code the scraping logic. First, inspect the table containing the data of interest in your browser. You can see that the data is contained inside a .table element:

The table class in the HTML code of the target web page

To get all the data, write the following code in the data.py file:

import scrapy

class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["www.scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/"]

    def parse(self, response):
        for row in response.css("table.table tr"):
            yield {
                "name": row.css("td.name::text").get(),
                "year": row.css("td.year::text").get(),
                "wins": row.css("td.wins::text").get(),
                "losses": row.css("td.losses::text").get(),
                "ot_losses": row.css("td.ot-losses::text").get(),
                "pct": row.css("td.pct::text").get(),
                "gf": row.css("td.gf::text").get(),
                "ga": row.css("td.ga::text").get(),
                "diff": row.css("td.diff::text").get(),
            }

Note that the variables name, allowed_domains, and start_urls were automatically created by Scrapy in the previous step.

The parse() method was also created automatically by Scrapy, so in this step you only need to add the scraping logic, which is the for loop shown above.

In detail, the response.css() method selects the target table. Then, the code iterates over all of its rows and extracts the data from each cell.
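
Keep in mind that the first row of the table is a header row made of <th> cells, so all of its fields will come back as None. If you prefer to skip it, an optional tweak (not part of the original spider) is to add a small guard inside the loop:

def parse(self, response):
    for row in response.css("table.table tr"):
        name = row.css("td.name::text").get()
        # Skip rows without a team name cell (e.g., the header row)
        if name is None:
            continue
        yield {
            "name": name.strip(),
            "year": row.css("td.year::text").get(),
            # ...the remaining fields stay the same as above
        }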

Step #5: Run the Crawler and Save the Data into a CSV File

To run the crawler and save the scraped data into a CSV file, type the following:

scrapy crawl data -o output.csv

With this command, Scrapy:

  • Runs the data spider defined in data.py, which contains the scraping logic
  • Saves the scraped data into a CSV file called output.csv

The expected output.csv file produced by the scraper is:

The expected CSV file

Note that this way of using Scrapy is the shortest, but it is not the only one. Scrapy provides different customizations and settings, and you can learn more about that in our article on Scrapy vs Requests.
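
For instance, instead of passing -o on the command line every time, you could configure the feed export once in settings.py. A minimal sketch, assuming a recent Scrapy version that supports the FEEDS setting:

# settings.py -- export scraped items to JSON on every run
FEEDS = {
    "output.json": {
        "format": "json",
        "overwrite": True,
    },
}

With this in place, running scrapy crawl data writes the items to output.json without any extra flags.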

How to Use Pyspider for Web Scraping

This section shows how to use Pyspider to scrape the same target website.

Requirements

The latest Python version Pyspider supports is Python 3.6. If you have a later Python version installed, read the following step to learn how to set up Python 3.6 alongside it.

Step #1: Setting Up the Environment and Installing Dependencies

Suppose you call the main folder of your project hockey_scraper/.

If you have Python 3.7 or later, install pyenv to get Python 3.6.

Use pyenv to install Python 3.6 with this command:

pyenv install 3.6.15

Then make it the local version of Python, so you do not affect the whole system with a different version:

pyenv local 3.6.15

To make sure everything went alright, verify the Python version:

python --version

The result must be:

Python 3.6.15

Create a virtual environment by selecting the correct Python version:

python3.6 -m venv venv

Activate the virtual environment as shown in the previous section of this guide. Now, you can install Pyspider with:

pip install pyspider

To launch the UI, run:

pyspider

Note that, since this repository is archived and you are using Python 3.6, you will receive some errors. To fix them, you may need to install the following libraries:

pip install tornado==4.5.3 requests==2.25.1

You might also receive other errors regarding the webdav.py file. Search for the file, and fix the following:

  • In the ScriptProvider class, rename the method getResourceInst() to get_resource_inst().
  • At the bottom of the file, find the line config = DEFAULT_CONFIG.copy() and replace all the subsequent code with:
config = DEFAULT_CONFIG.copy()
config.update({
    "mount_path": "/dav",
    "provider_mapping": {
        "/": ScriptProvider(app)
    },
    "domaincontroller": NeedAuthController(app),
    "verbose": 1 if app.debug else 0,
    "dir_browser": {"davmount": False,
                    "enable": True,
                    "msmount": False,
                    "response_trailer": ""},
})
dav_app = WsgiDAVApp(config)

The Pyspider web UI should now be able to start. Visit http://localhost:5000/ in your browser, and this is what you should see:

The Pyspider UI

Step #2: Create a New Project

Click on “Create” to create a new project and fill in the fields:

  • Choose a project name, for example Hockey_scraper.
  • Set https://www.scrapethissite.com/pages/forms/ in the start URL(s) field.

This should be the result:

The result of the project creation in Pyspider

Step #3: Define the Scraping Logic

Implement the scraping logic by writing the Python code directly in the editor on the right side of the UI:

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl("https://www.scrapethissite.com/pages/forms/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # response.doc() returns a PyQuery object to query the page
        rows = []
        for row in response.doc("table.table tr").items():
            # Skip the header row, which has no <td> cells
            if not row.find("td.name"):
                continue
            rows.append({
                "name": row.find("td.name").text(),
                "year": row.find("td.year").text(),
                "wins": row.find("td.wins").text(),
                "losses": row.find("td.losses").text(),
                "ot_losses": row.find("td.ot-losses").text(),
                "pct": row.find("td.pct").text(),
                "gf": row.find("td.gf").text(),
                "ga": row.find("td.ga").text(),
                "diff": row.find("td.diff").text(),
            })
        return {"teams": rows}

Here is what changed from the default code:

  • The response.doc() method selects the target table with a PyQuery selector.
  • index_page() iterates over the table rows, skips the header row, and extracts the text of each cell with find() and text(), returning all the rows as a single result.

Click “Save” and “Run” to start the scraping process. The resulting data will be similar to what you got with Scrapy.
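
By default, Pyspider stores results in its internal result database, which you can inspect and export from the web UI. If you also want a local CSV file like the one produced by Scrapy, one option is to override the handler's on_result() hook. Below is a minimal sketch under that assumption: add import csv at the top of the script, then add this method inside the Handler class (results.csv is a placeholder path):

    def on_result(self, result):
        # Keep Pyspider's default behavior (saving to its internal result database)
        super(Handler, self).on_result(result)
        # Also dump the scraped rows to a local CSV file
        if result and result.get("teams"):
            with open("results.csv", "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(result["teams"][0].keys()))
                writer.writeheader()
                writer.writerows(result["teams"])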

Great! You now know how to use both Scrapy and Pyspider for web scraping.

Scrapy vs Pyspider: Which One to Use?

The comparison between Scrapy and Pyspider has shown how to use them, but which one is better? Time to find out!

Choose Scrapy if:

  • You are working on a high-performance project that needs parallel crawling and advanced features, like throttling.
  • You need to integrate your scraping with external pipelines or other tools.
  • You are comfortable with the CLI, experienced with web scraping scenarios, and prefer an actively maintained framework.

Choose Pyspider if:

  • You prefer using a UI instead of the CLI.
  • You want to work on a distributed system and prefer simple configurations.
  • You want to schedule scraping tasks.

As always, there is no definitive winner—the best scraping tool depends entirely on your specific needs and use case.

Limitations of Scrapy and Pyspider

Scrapy and Pyspider are powerful frameworks for web scraping, but they have their limitations.

First, they struggle with scraping dynamic content sites that use JavaScript for rendering or data retrieval. While both can be extended to scrape JavaScript-powered sites, they are inherently limited in that aspect. This also makes them more susceptible to common anti-scraping measures.
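
For Scrapy, one common way to add JavaScript rendering is the scrapy-playwright plugin. Here is a minimal sketch of the required settings, assuming the package and the Playwright browsers are installed:

# settings.py -- enable scrapy-playwright for JavaScript rendering
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

Then, requests sent with meta={"playwright": True} are rendered in a headless browser before your spider parses the response.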

Also, both of these frameworks are subject to IP bans, as they make a lot of automated requests. These may trigger rate limiters, which leads to your IP getting blacklisted. A solution to prevent your IP from being banned is to integrate proxies into your code.

For proxy rotation, see our guide on how to use proxies to rotate IP addresses in Python.
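
As a minimal illustration, in Scrapy you can route a single request through a proxy with the proxy key in the request meta. The spider name and proxy URL below are placeholders:

import scrapy

class ProxiedSpider(scrapy.Spider):
    name = "proxied"

    def start_requests(self):
        yield scrapy.Request(
            "https://www.scrapethissite.com/pages/forms/",
            # Placeholder proxy endpoint: replace with a real proxy URL
            meta={"proxy": "http://user:pass@proxy.example.com:8000"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)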

Finally, if you are seeking reliable proxy servers, keep in mind that Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide.

Conclusion

In this Scrapy vs Pyspider blog post, you learned about the role of the two libraries in web scraping. You explored their features for data extraction and compared them in a real-world table scraping scenario.

Pyspider provides a friendly UI but is, unfortunately, no longer maintained. Scrapy, instead, is useful for large projects, as it provides most of the tools needed for structured scraping and stays up-to-date with the latest Python versions.

You also discovered their limitations, such as potential IP bans. Fortunately, these challenges can be overcome using proxies or dedicated web scraping solutions like Bright Data’s Web Scraper API. This scraping-focused API seamlessly integrates with Scrapy, Pyspider, and any other HTTP client or web scraping tool, enabling unrestricted data extraction.

Create a free Bright Data account today to explore our proxy and scraper APIs!

No credit card required