Scrapy vs Playwright: A Comparison for Web Scraping

Explore the key differences and benefits of using Scrapy versus Playwright for effective web scraping.

In this guide, you will learn:

  • What Scrapy is
  • What Playwright is
  • The features they offer for web scraping and how these compare
  • An introduction to web scraping with both tools
  • How to build a scraper with Playwright
  • How to build a web scraping script with Scrapy
  • Which tool is better for web scraping
  • Their common limitations and how to overcome them

Let’s dive in!

What Is Scrapy?

Scrapy is an open-source web scraping framework written in Python, developed for efficient data extraction. It offers built-in support for capabilities such as parallel requests, link-following, and data export in formats like JSON and CSV. Also, it features middleware, proxy integration, and automatic request retries. Scrapy handles requests asynchronously, but it only works on static HTML pages, as it cannot execute JavaScript.

What Is Playwright?

Playwright is an open-source automation framework for E2E testing and web scraping in the browser. It supports multiple browsers, such as Chromium, Firefox, and WebKit—each in both headed and headless mode. Also, the browser automation API is available in multiple programming languages, including TypeScript/JavaScript, Python, Java, and C#.

Scrapy vs Playwright: Head-to-Head Features for Web Scraping

Let’s compare Scrapy and Playwright across five different aspects that contribute to making them great web scraping tools.

Now, let's begin the Scrapy vs Playwright comparison!

Ease of Setup and Configuration

Scrapy offers a straightforward setup with minimal configuration required. You can quickly create a project, define spiders, and export data, thanks to its built-in CLI. Conversely, Playwright requires more setup, as it involves installing browser binaries and their dependencies and verifying that everything is configured correctly.

Learning Curve

Scrapy has a steeper learning curve for beginners due to its modular structure, extensive features, and unique configurations. Understanding concepts like spiders, middlewares, and pipelines can take time. Playwright is much easier to get started with, as its API is familiar to those with some browser automation knowledge.
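
To give an idea of what those concepts look like in practice, below is a minimal, illustrative sketch of a Scrapy item pipeline. The class name and the price-cleanup logic are assumptions made for demonstration purposes only:

# pipelines.py -- an illustrative item pipeline (hypothetical example)
class PriceCleanupPipeline:
    def process_item(self, item, spider):
        # normalize the scraped price by stripping the leading currency symbol
        if "price" in item and item["price"]:
            item["price"] = item["price"].lstrip("£")
        return item

A pipeline like this would then be enabled through the ITEM_PIPELINES setting in settings.py.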

Dynamic Content Handling

Scrapy struggles with websites that use JavaScript, as it can only deal with static HTML documents. Handling dynamic content is possible but requires integration with Splash or similar tools. Playwright excels in handling dynamic or JavaScript-rendered content because it natively renders pages in the browser. That means you can use it to scrape pages that rely on client-side frameworks like React, Angular, or Vue.
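
As a reference, here is a rough sketch of what the Splash integration typically involves in a Scrapy project, assuming a Splash instance running locally on port 8050 and the scrapy-splash package installed (refer to the scrapy-splash documentation for the exact setup):

# settings.py -- sketch of a scrapy-splash configuration
SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# in your spider, yield a SplashRequest instead of a regular Request
# to get the JavaScript-rendered HTML back:
#   from scrapy_splash import SplashRequest
#   yield SplashRequest(url, self.parse, args={"wait": 2})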

Customization and Extensibility

Scrapy offers high customization options via support for middlewares, extensions, and pipelines. Also, several plugins and add-ons are available. Playwright, on the other hand, is not natively extensible. Luckily, the community has addressed this limitation with the Playwright Extra project.

Other Scraping Features

Scrapy equips you with built-in functionality like proxy integration, automatic retries, and configurable data export. It also offers integrated methods for IP rotation and other advanced scenarios. Playwright supports proxy integration too, but most other scraping-specific features have to be implemented by hand. So, achieving the same results requires more manual effort compared to Scrapy.
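
To make the comparison concrete, here is a minimal sketch of how some of those built-in Scrapy features are enabled in a project's settings.py. The values below are illustrative, not recommendations:

# settings.py -- illustrative values only
RETRY_ENABLED = True        # retry failed requests automatically
RETRY_TIMES = 3             # maximum number of retries per request
CONCURRENT_REQUESTS = 16    # how many requests Scrapy runs in parallel
DOWNLOAD_DELAY = 0.5        # throttle requests to avoid hammering the server
FEEDS = {
    "books.csv": {"format": "csv"},  # built-in data export, no extra code needed
}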

Playwright vs Scrapy: Scraping Script Comparison

In the following two sections, you will learn how to scrape the same site using Playwright and Scrapy. We will start with Playwright, as that may take a bit longer since it is not specifically optimized for web scraping like Scrapy.

The target site will be the Books to Scrape scraping sandbox:

The target site

The goal of both scrapers is to retrieve all Fantasy books from the site, which requires handling pagination.

Scrapy will treat the pages as static and parse their HTML documents directly. Playwright, in contrast, will render them in a browser and interact with the elements on the pages, simulating user actions.

The Scrapy script will be written in Python, while the Playwright script will be in JavaScript—the primary language of each tool. Still, you can easily convert the Playwright JavaScript script to Python by using the playwright-python library, which exposes the same underlying API.
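
For reference, a minimal Python equivalent of the Playwright browser setup looks like the sketch below, which uses the synchronous playwright-python API:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch Chromium in headed mode, which is handy while debugging
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html")
    # ...the same locator-based scraping logic shown later, translated to Python...
    browser.close()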

In both cases, at the end of the script, you will have a CSV containing all Fantasy book details from Books to Scrape.

Now, let’s jump into the Playwright vs Scrapy scraping comparison!

How to Use Playwright for Web Scraping

Follow the steps below to write a simple web scraping script in JavaScript using Playwright. If you are not familiar with the process, first read our guide on Playwright web scraping.

Step #1: Project Setup

Before getting started, make sure you have the latest version of Node.js installed locally. If not, download it and follow the installation wizard.

Next, create a folder for your Playwright scraper and navigate into it using the terminal:

mkdir playwright-scraper
cd playwright-scraper

Inside the playwright-scraper folder, initialize an npm project by running:

npm init -y

Now, open the playwright-scraper folder in your favorite JavaScript IDE. IntelliJ IDEA or Visual Studio Code are great options. Inside the folder, create a script.js file, which will soon contain the scraping logic:

The Playwright scraping project file structure

Great! You are now fully set up for web scraping in Node.js with Playwright.

Step #2: Install and Configure Playwright

In the project folder, run the following command to install Playwright:

npm install playwright

Next, install the browser and any additional dependencies by running:

npx playwright install

Now, open script.js and add the following code to import Playwright and launch a Chromium browser instance:

const { chromium } = require("playwright");

(async () => {
  // initialize a Chromium browser
  const browser = await chromium.launch({
    headless: false, // comment out in production
  });

  // scraping logic goes here...

  // close the browser and release resources
  await browser.close();
})();

The headless: false option launches the browser in headed mode. That allows you to see what the script is doing—useful for debugging during development.

Step #3: Connect to the Target Page

Initialize a new page in the browser and use the goto() function to navigate to the target page:

const page = await browser.newPage();
await page.goto("https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html");

If you run the script in the debugger with a breakpoint set before the close() function, you will see the browser open and navigate to the target page:

The Chromium browser window opened by Playwright

Amazing! Playwright is controlling the browser as expected.

Step #4: Implement the Data Parsing Logic

Before writing the scraping logic, you need to understand the page structure. To do so, open the target site in an incognito window in your browser. Then, right-click on a book element and select the “Inspect” option.

This is what you should be seeing in the DevTools:

The DevTools section for the book element

Above, you can notice that each book element can be selected using the .product_pod CSS selector.

Since the page contains multiple books, first initialize an array to store the scraped data:

const books = [];

Select them all and iterate over them as below:

const bookElements = await page.locator(".product_pod").all();
for (const bookElement of bookElements) {
  // extract book details...
}

From each book element, as shown in the image above, you can extract:

  • The book URL from the <a> tag
  • The book title from the h3 a node
  • The book image from the .thumbnail element
  • The book rating from the .star-rating element
  • The product price from the .product_price .price_color element
  • The product availability from the .availability element

Now, implement the scraping logic inside the loop:

const urlElement = await bookElement.locator("a").first();
const url = makeAbsoluteURL(
  await urlElement.getAttribute("href"),
  "https://books.toscrape.com/catalogue/"
);

const titleElement = await bookElement.locator("h3 a");
const title = await titleElement.getAttribute("title");

const imageElement = await bookElement.locator(".thumbnail");
const image = makeAbsoluteURL(
  await imageElement.getAttribute("src"),
  "https://books.toscrape.com/"
);

const ratingElement = await bookElement.locator(".star-rating");
const ratingClass = await ratingElement.getAttribute("class");
let rating;
switch (true) {
  case ratingClass.includes("One"):
    rating = 1;
    break;
  case ratingClass.includes("Two"):
    rating = 2;
    break;
  case ratingClass.includes("Three"):
    rating = 3;
    break;
  case ratingClass.includes("Four"):
    rating = 4;
    break;
  case ratingClass.includes("Five"):
    rating = 5;
    break;
  default:
    rating = null;
}

const priceElement = await bookElement.locator(
  ".product_price .price_color"
);
const price = (await priceElement.textContent()).trim();

const availabilityElement = await bookElement.locator(".availability");
const availability = (await availabilityElement.textContent()).trim();

The above snippet uses Playwright's getAttribute() and textContent() functions to extract specific HTML attributes and text from HTML nodes, respectively. Note the custom logic that maps the rating class to a numeric score.

Additionally, since the URLs on the page are relative, they can be converted to absolute URLs using the following custom function:

function makeAbsoluteURL(url, baseURL) {
  // use a regular expression to remove any ../ or ../../ patterns
  const cleanURL = url.replace(/(\.\.\/)+/, "");

  // combine the base URL with the cleaned relative URL
  return baseURL + cleanURL;
}

Next, populate a new object with the scraped data and add it to the books array:

const book = {
  "url": url,
  "title": title,
  "image": image,
  "rating": rating,
  "price": price,
  "availability": availability,
};
books.push(book);

Perfect! The Playwright scraping logic is now complete.

Step #5: Implement the Crawling Logic

If you take a look at the target site, you will notice that some pages have a “next” button at the bottom:

The HTML code of the "next" button

Clicking it loads the next page. Note that the last pagination page does not include it, as there are no further pages to load.

Thus, you can implement the web crawling logic with a while (true) loop that:

  1. Scrapes data from the current page
  2. Clicks the “next” button if it is present and waits for the new page to load
  3. Repeats the process until the “next” button is no longer found

Below is how you can achieve that:

while (true) {
  // select the book elements ...

  // select the "next" button and check if it is on the page
  const nextElement = await page.locator("li.next a");
  if ((await nextElement.count()) !== 0) {
    // click the "next" button and go to the next page
    await nextElement.click();
    // wait for the new page to load
    await page.waitForLoadState("domcontentloaded");
  } else {
    break;
  }
}

Terrific! Crawling logic implemented.

Step #6: Export to CSV

The last step is to export the scraped data to a CSV file. While you could achieve this using vanilla Node.js, it is much easier with a dedicated library like fast-csv.

Install the fast-csv package by running the following command:

npm install fast-csv

At the beginning of your script.js file, import the required module:

const { writeToPath } = require("fast-csv");

Next, use the following snippet to write the scraped data to a CSV file:

writeToPath("books.csv", books, { headers: true });

Et voilà! The Playwright web scraping script is ready.

Step #7: Put It All Together

Your script.js file should contain:

const { chromium } = require("playwright");
const { writeToPath } = require("fast-csv");

(async () => {
  // initialize a Chromium browser
  const browser = await chromium.launch({
    headless: false, // comment out in production
  });

  // initialize a new page in the browser
  const page = await browser.newPage();

  // visit the target page
  await page.goto(
    "https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"
  );

  // where to store the scraped data
  const books = [];

  while (true) {
    // select the book elements
    const bookElements = await page.locator(".product_pod").all();
    // iterate over them to extract data from them
    for (const bookElement of bookElements) {
      // data extraction logic
      const urlElement = await bookElement.locator("a").first();
      const url = makeAbsoluteURL(
        await urlElement.getAttribute("href"),
        "https://books.toscrape.com/catalogue/"
      );

      const titleElement = await bookElement.locator("h3 a");
      const title = await titleElement.getAttribute("title");

      const imageElement = await bookElement.locator(".thumbnail");
      const image = makeAbsoluteURL(
        await imageElement.getAttribute("src"),
        "https://books.toscrape.com/"
      );

      const ratingElement = await bookElement.locator(".star-rating");
      const ratingClass = await ratingElement.getAttribute("class");
      let rating;
      switch (true) {
        case ratingClass.includes("One"):
          rating = 1;
          break;
        case ratingClass.includes("Two"):
          rating = 2;
          break;
        case ratingClass.includes("Three"):
          rating = 3;
          break;
        case ratingClass.includes("Four"):
          rating = 4;
          break;
        case ratingClass.includes("Five"):
          rating = 5;
          break;
        default:
          rating = null;
      }

      const priceElement = await bookElement.locator(
        ".product_price .price_color"
      );
      const price = (await priceElement.textContent()).trim();

      const availabilityElement = await bookElement.locator(".availability");
      const availability = (await availabilityElement.textContent()).trim();

      // populate a new book item with the scraped data and
      // then add it to the array
      const book = {
        "url": url,
        "title": title,
        "image": image,
        "rating": rating,
        "price": price,
        "availability": availability,
      };
      books.push(book);
    }

    // select the "next" button and check if it is on the page
    const nextElement = await page.locator("li.next a");
    if ((await nextElement.count()) !== 0) {
      // click the "next" button and go to the next page
      await nextElement.click();
      // wait for the new page to load
      await page.waitForLoadState("domcontentloaded");
    } else {
      break;
    }
  }

  // export the scraped data to CSV
  writeToPath("books.csv", books, { headers: true });

  // close the browser and release resources
  await browser.close();
})();

function makeAbsoluteURL(url, baseURL) {
  // use a regular expression to remove any ../ or ../../ patterns
  const cleanURL = url.replace(/(\.\.\/)+/, "");

  // combine the base URL with the cleaned relative URL
  return baseURL + cleanURL;
}

Launch it with this Node.js command:

node script.js

The result will be the following books.csv file:

The output CSV file

Mission complete! Now, it is time to see how to get the same result with Scrapy.

How to Use Scrapy for Web Scraping

Follow the steps below and see how to build a simple web scraper with Scrapy. For more guidance, check out our tutorial on Scrapy web scraping.

Step #1: Project Setup

Before getting started, verify that you have Python 3 installed locally. If not, download it from the official site and install it.

Create a folder for your project and initialize a virtual environment inside it:

mkdir scrapy-scraper
cd scrapy-scraper
python -m venv venv

On Windows, run the following command to activate the environment:

venv\Scripts\activate

Equivalently, on Unix or macOS, run:

source venv/bin/activate

In an activated environment, install Scrapy with:

pip install scrapy

Next, launch the command below to create a Scrapy project called “books_scraper”:

scrapy startproject books_scraper

Sweet! You are set up for web scraping with Scrapy.

Step #2: Create the Scrapy Spider

Enter the Scrapy project folder and generate a new spider for the target site:

cd books_scraper
scrapy genspider books books.toscrape.com

Scrapy will automatically create all the required files for you. Specifically, the books_scraper directory should now contain the following file structure:

books_scraper/
   │── scrapy.cfg
   └── books_scraper/
       │── __init__.py
       │── items.py
       │── middlewares.py
       │── pipelines.py
       │── settings.py
       └── spiders/
           │── __init__.py
           └── books.py

To implement the desired scraping logic, replace the contents of books_scraper/spiders/books.py with the following code:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/category/books/fantasy_19/index.html"]

    def parse(self, response):
        # Extract book details
        for book in response.css(".product_pod"):
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "url": response.urljoin(book.css("h3 a::attr(href)").get()),
                "image": response.urljoin(book.css(".thumbnail::attr(src)").get()),
                "rating": book.css(".star-rating::attr(class)").get().split()[-1],
                "price": book.css(".product_price .price_color::text").get(),
                "availability": book.css(".availability::text").get().strip(),
            }

        # Handle pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Step #3: Launch the Spider

In the books_scraper folder, in an activated virtual environment, run the following command to execute your Scrapy spider and export the scraped data to a CSV file:

scrapy crawl books -o books.csv

This will generate a books.csv file containing the scraped data, just like the one produced by the Playwright script. Again, mission complete!

Scrapy vs Playwright: Which One to Use?

The Playwright scraping script required seven lengthy steps, while Scrapy only needed three. This is not surprising since Scrapy is designed for web scraping, whereas Playwright is a general browser automation tool used for both testing and scraping.

In particular, the key difference was in the web crawling logic: Playwright required manual interactions and custom logic for pagination, while Scrapy handled it with just a few lines of code.

In short, choose Scrapy over Playwright in one of these scenarios:

  1. You need large-scale data extraction with built-in crawling support.
  2. Performance and speed are priorities, as Scrapy is optimized for fast, parallel requests.
  3. You prefer a framework that handles pagination, retries, data extraction in many formats, and parallel scraping for you.

On the contrary, prefer Playwright over Scrapy when:

  1. You need to extract data from JavaScript-heavy websites requiring browser rendering.
  2. Dynamic interactions like infinite scrolling are necessary.
  3. You want more control over user interactions (e.g., in complex web scraping navigation patterns).

As the final step in this Scrapy vs Playwright comparison, refer to the summary table below:

Feature | Scrapy | Playwright
Developed by | Zyte + the community | Microsoft + the community
GitHub stars | 54k+ | 69k+
Weekly downloads | 380k+ | 12M+
Programming languages | Python | TypeScript/JavaScript, Python, Java, C#
Main goal | Web scraping and crawling | Browser automation, testing, and web scraping
JavaScript rendering | ❌ (possible with some plugins) | ✔️
Browser interaction | ❌ (possible with some plugins) | ✔️
Automated crawling | ✔️ | ❌ (requires manual handling)
Proxy integration | Supported | Supported
Parallel requests | Efficient and easily configurable | Limited, but possible
Data export | CSV, JSON, XML, etc. | Requires custom logic

Limitations of Both Playwright and Scrapy

Both Scrapy and Playwright are powerful tools for web scraping, but they each have certain limitations.

Scrapy, for instance, struggles with scraping dynamic content from sites that rely on JavaScript for rendering or data retrieval. Since it does not drive a real browser, Scrapy is also more exposed to the anti-scraping measures that many modern websites employ. Playwright can handle JavaScript-heavy sites, but it still faces challenges like IP bans.

When making many requests, you may trigger rate limiters, leading to request refusals or even IP bans. To mitigate that, you can integrate a proxy server to rotate IPs.
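
Both tools accept a proxy out of the box. As an illustration, the sketch below routes a Playwright (Python) session through a proxy; the host, port, and credentials are placeholders to replace with real values from your provider. In Scrapy, the equivalent is setting meta={"proxy": "..."} on each request or configuring a proxy middleware.

from playwright.sync_api import sync_playwright

# placeholder proxy details -- replace with real values from your provider
PROXY = {
    "server": "http://proxy-host:port",
    "username": "proxy-username",
    "password": "proxy-password",
}

with sync_playwright() as p:
    # route all browser traffic through the proxy so the exit IP can be rotated
    browser = p.chromium.launch(proxy=PROXY)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")
    print(page.title())
    browser.close()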

If you need reliable proxy servers, Bright Data’s proxy network is trusted by Fortune 500 companies and over 20,000 customers worldwide.

Another challenge with Playwright is CAPTCHAs, which are designed to block automated scraping bots operating in browsers. To overcome them, you can explore solutions for bypassing CAPTCHAs in Playwright.

Conclusion

In this Playwright vs Scrapy blog post, you learned about the roles of both libraries in web scraping. You explored their features for data extraction and compared their performance in a real-world pagination scenario.

Scrapy provides everything you need for data parsing and crawling websites, while Playwright is more focused on simulating user interactions.

You also discovered their limitations, such as IP bans and CAPTCHAs. Fortunately, these challenges can be overcome using proxies or dedicated anti-bot solutions like Bright Data’s CAPTCHA Solver.

Create a free Bright Data account today to explore our proxy and scraping solutions!
