Scrapy vs. Beautiful Soup

Scrapy vs. Beautiful Soup comparison. Learn about the two popular choices for web scraping.

When it comes to web scraping, Python offers a wide variety of tools to choose from. Selenium, MechanicalSoup, Scrapy, Requests, Beautiful Soup, and lxml are often used in this context. However, these tools are not created equal, as each of them has its own set of use cases in which it shines. Some of them are even complementary, as this article will demonstrate.

In this article, you’ll take a closer look at Scrapy and Beautiful Soup, two popular choices for web scraping.

Beautiful Soup is a parsing library. It lets you navigate and search documents using its Pythonic traversal methods and CSS selectors, which makes it easy to transform data from markup languages (such as HTML and XML) into structured data. In contrast, Scrapy is a complete web scraping framework that crawls and downloads documents and (optionally) stores the extracted data.

Learn more about web scraping with Beautiful Soup.

In this comparison, you’ll consider the following aspects: crawling usability, scraping usability, speed, multistep execution, proxy rotation, and CAPTCHA solving.

Scrapy vs. Beautiful Soup: Quick Comparison

If you’re in a hurry, here’s a quick comparison between Scrapy and Beautiful Soup for web scraping with Python.

Scrapy is a comprehensive web scraping framework that is well suited to large-scale data extraction projects and offers built-in support for crawling, whereas Beautiful Soup is a parsing library best suited to smaller, more straightforward scraping tasks without built-in crawling capabilities.

Scrapy excels in speed and efficiency for extensive scraping operations, while Beautiful Soup shines in simplicity and ease of use for quick tasks. Choose Scrapy for complex projects or Beautiful Soup for simple, direct parsing needs.

Scrapy

Scrapy is an all-in-one suite for crawling the web, downloading documents, processing them, and storing the resulting data in an accessible format. Installing Scrapy is easily done with pip or conda:

pip install scrapy
conda install -c conda-forge scrapy

Web Crawling with Scrapy

Scrapy helps you crawl sets of pages and websites to gather URLs to scrape or to discover if a page contains the specific information you’re looking for. Scrapy works with spiders, which are Python classes in which one can define how to navigate a website, how deep it should go in the website structure, which data it should extract, and how it should be stored. To assemble a list of URLs, Scrapy can navigate HTML, XML, and CSV documents and even load sitemaps.
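For illustration, here’s a minimal sketch of a crawl spider that follows internal links down to a fixed depth; the domain, URL, selector, and depth limit are placeholders for this example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BlogSpider(CrawlSpider):
    name = "blogspider"
    allowed_domains = ["example.com"]   # restrict crawling to a single site
    start_urls = ["https://example.com"]
    custom_settings = {"DEPTH_LIMIT": 2}  # how deep the spider may go
    rules = (
        # Follow every internal link and parse each page that is visited
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

In a Scrapy project, running scrapy crawl blogspider -O pages.json would store each visited page’s URL and title.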

On top of that, Scrapy offers the Scrapy shell, an interactive shell for testing and debugging XPath and CSS expressions on specific pages. Using the shell can save you time when it comes to crawling and scraping since it eliminates the need to restart the spider every time you make changes.
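For instance, you can launch the shell against a page and experiment with expressions interactively before committing them to a spider (the URL and selectors here are only placeholders):

scrapy shell "https://example.com"
>>> response.css("h2::text").getall()        # try a CSS selector
>>> response.xpath("//h2/text()").getall()   # or the equivalent XPath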

Web Scraping with Scrapy

When it comes to scraping, you usually need a lot of flexibility. Scrapy offers two ways of selecting items in a document: XPath and CSS expressions. XPath expressions work on both HTML and XML documents, while CSS selectors are geared toward HTML.

A unique Scrapy feature is the ability to define pipelines. When an item is scraped, it can be sent to a pipeline in which a sequence of actions is performed on it: cleaning, validation, hashing, deduplication, and enrichment.
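As an illustration, here’s a minimal sketch of a deduplication pipeline; the url field name is just an assumption for the example:

from scrapy.exceptions import DropItem

class DeduplicationPipeline:
    """Drop any item whose 'url' field has already been seen."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"Duplicate item found: {url}")
        self.seen_urls.add(url)
        return item

A pipeline like this takes effect once it’s registered in the project’s ITEM_PIPELINES setting.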

Speed

Another important aspect of scraping web documents is the time it takes. Assessing Scrapy’s speed isn’t straightforward because it carries significant startup overhead. To account for this, the overhead is incurred only once, while the crawling and extraction are repeated ten times.

In the following example, the h2 element of a simple (i.e., nondynamic) web page is extracted. All code runs in a Jupyter notebook.

First, load the required Scrapy libraries:

import scrapy
from scrapy.crawler import CrawlerProcess

Second, establish the MySpider class that describes the scraping job:

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        'https://edition.cnn.com' # Or repeat this URL ten times to measure the marginal time
    ]

    def parse(self, response):
        # Extract the text of the lead headline via a CSS selector
        yield {'output': response.css('h2.container_lead-package__title_url-text::text').extract()}

process = CrawlerProcess(
    settings={
        "FEEDS": {
            "scrapy_output.json": {"format": "json", "overwrite": True}
        }
    }
)
process.crawl(MySpider)

Third, run the script and time it:

%%timeit -n 1 -r 1
process.start()

The sequence of crawling, scraping, and storing a single web document took approximately 400 milliseconds. However, repeating the same process ten times took only 1,200 milliseconds. Once the one-time overhead is excluded, this implies that each additional sequence takes around 80 milliseconds, which is impressive. Given that the overhead is paid only once, Scrapy should be your first choice for intensive jobs.

Multistep Scraping with Scrapy

Many websites, if not the most popular websites, like X/Twitter, Substack, and LinkedIn, are dynamic. This means that large swaths of information are hidden behind login screens, search queries, pop-ups, scrolls, or mouse overs. Consequently, having your spider simply visit a page is often not enough to extract data from it.

Scrapy offers various approaches for handling these jobs as a stand-alone tool. One could produce the necessary HTTP requests or execute the relevant JavaScript snippets. However, using a headless browser offers the most flexibility. For example, there are Playwright and Selenium integrations for Scrapy that can be used for interfacing with dynamic elements.
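As a rough sketch, this is how a spider might request browser-rendered pages through the scrapy-playwright plugin (assuming the plugin and its Playwright browsers are installed; the URL is a placeholder):

import scrapy

class DynamicSpider(scrapy.Spider):
    name = "dynamic"
    custom_settings = {
        # Route requests through the Playwright download handler
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        # The 'playwright' meta key asks for the page to be rendered
        # in a headless browser before the response is returned
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}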

Proxy Rotation and CAPTCHA Prevention with Scrapy

The arrival of large language models has motivated many companies to fine-tune models, but this requires specific (often scraped) data. Additionally, many organizations don’t want bots straining their website’s servers and have no commercial interest in sharing their data. This is why many websites are not only set up as dynamic but also introduce antiscraping technologies, such as automatic IP blocking and CAPTCHA.

Scrapy doesn’t offer out-of-the-box tools for rotating proxies (and IP addresses) to prevent getting locked out. However, Scrapy can be extended through its middleware framework, a set of hooks that modify Scrapy’s request and response processing. To rotate proxies, one can attach a Python module made specifically for this purpose, such as scrapy-rotating-proxies. Through the same mechanism, one can attach the DeCAPTCHA module.
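For example, wiring scrapy-rotating-proxies into a project’s settings.py typically looks like the following sketch (the proxy addresses are placeholders):

# settings.py
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}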

Beautiful Soup

Unlike Scrapy, Beautiful Soup does not offer a full-suite solution for extracting and processing data from web documents; it only covers the scraping part. You just need to feed it a downloaded document, and Beautiful Soup can turn it into structured data through CSS selectors and its own search and navigation methods.

Installing Beautiful Soup can be done via pip and conda:

pip install beautifulsoup4
conda install -c anaconda beautifulsoup4

Web Crawling with Beautiful Soup

While Scrapy deploys spiders to navigate a website, Beautiful Soup does not offer such capabilities. However, with some Python creativity, using both Beautiful Soup and the Requests library, one can write a script to navigate a website to a certain depth. Nevertheless, it’s certainly not as easy as with Scrapy.
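As a sketch, a depth-limited crawler built from Requests and Beautiful Soup could look like this (the start URL is a placeholder, and there is no rate limiting or politeness handling here):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(start_url, max_depth=2):
    seen = set()
    queue = [(start_url, 0)]  # breadth-first queue of (url, depth) pairs
    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        print(depth, url, soup.title.string if soup.title else "")
        # Queue every link on the page for the next depth level
        for link in soup.find_all("a", href=True):
            queue.append((urljoin(url, link["href"]), depth + 1))

crawl("https://example.com")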

Web Scraping with Beautiful Soup

Web scraping is what makes Beautiful Soup 4 tick. Not only does it offer CSS selectors, but it also comes with a multitude of methods to traverse documents. When documents have a complex structure, methods like .parent and .next_sibling can extract elements that are otherwise hard to reach. Additionally, through find_all() and similar methods, you can specify text filters, regular expressions, and even custom functions to find the required elements.
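The following sketch shows these methods on a small hand-written HTML snippet:

import re
from bs4 import BeautifulSoup

html = "<div><h2>Price</h2><span class='amount'>42 USD</span></div>"
soup = BeautifulSoup(html, "html.parser")

amount = soup.find("span", class_="amount")
print(amount.parent.name)                 # 'div'    -- walk up the tree
print(soup.find("h2").next_sibling.text)  # '42 USD' -- walk sideways

# find_all() accepts regular expressions and custom functions as filters
print(soup.find_all(string=re.compile(r"\d+ USD")))
print(soup.find_all(lambda tag: tag.has_attr("class")))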

Finally, Beautiful Soup has various output formatters to pretty-print output, encode it, remove Microsoft’s smart quotes, and even parse and validate HTML.

Speed

Unlike Scrapy, Requests and Beautiful Soup have no comparable startup overhead, so the code can simply be run ten times to assess its speed.

First, load the required libraries:

import requests, json
from bs4 import BeautifulSoup

Second, time the code by wrapping it in a timeit magic command:

%%timeit -n 10 -r 1
# Download the page, parse it, extract the lead headline, and store it as JSON
page = requests.get('https://edition.cnn.com')
soup = BeautifulSoup(page.text, 'html.parser')
headline = soup.select_one('h2.container_lead-package__title_url-text').text
json_object = json.dumps({'output': headline})
with open("bs4_output.json", "w") as output_file:
    output_file.write(json_object)

Running it once takes approximately 300 milliseconds, and running it ten times takes about 3,000 milliseconds: with no one-time overhead to amortize, the per-page cost stays roughly constant, which makes it considerably slower than Scrapy for larger jobs. However, it requires a lot less configuration and relatively little knowledge of a particular framework.

Multistep Scraping with Beautiful Soup

Since Beautiful Soup only parses documents that are handed to it, it cannot handle dynamic web pages on its own. However, like Scrapy, it works perfectly well together with automation tools, such as Playwright, Puppeteer, and Selenium. Pairing automation tools with Beautiful Soup always works the same way: the headless browser handles the dynamic elements, while Beautiful Soup parses the data from the rendered HTML.
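For example, a sketch pairing Selenium’s headless Chrome with Beautiful Soup might look like this (assuming Selenium 4 and Chrome are installed; the URL is a placeholder):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")     # run Chrome without a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")          # the browser renders the dynamic page

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.string)                   # Beautiful Soup parses the rendered HTML
driver.quit()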

Proxy Rotation and CAPTCHA Prevention with Beautiful Soup

Since Beautiful Soup is a scraping tool and not a crawling tool, it doesn’t offer any features for avoiding blocks from a website’s servers. If you need them, they should come from the crawling or downloading tool you pair it with.

Conclusion

This article outlined how Beautiful Soup and Scrapy compare in terms of crawling and scraping usability, speed, handling of dynamic web documents, and circumvention of anti-scraping measures.

As an end-to-end tool, Scrapy is a clear favorite for day-to-day scraping jobs. However, it does require some middleware to scrape dynamic websites and to ensure one does not get blocked.

Although Beautiful Soup (together with the Requests package) is quite slow, it offers a familiar and simple approach for ad hoc scraping jobs. Like Scrapy, it requires extra tools for scraping dynamic websites and for avoiding blocks.

If you’re looking for a one-stop shop for scraping websites, consider Bright Data. Bright Data offers numerous products, such as proxy services and Web Unlocker, to assist with all your web scraping needs, no matter which option you decide to use.

Talk to one of Bright Data’s data experts to find the right solution for you.

Interested in learning how to integrate Bright Data proxies? Read our Scrapy proxies integration and BeautifulSoup proxies guides.