When choosing a Node.js web scraping tool, there are several options. Two of the most common are Cheerio and Puppeteer.
Cheerio was created as a fast, lean, server-side implementation of core jQuery, intended for parsing and manipulating HTML documents. In comparison, Puppeteer was made for automating tests of web pages and applications.
Regardless, both tools can be useful in web scraping: Cheerio enables you to parse the HTML of a web page to find the information you need, and Puppeteer enables you to automate a web browser to scrape dynamic sites that use JavaScript.
This article will look at both of these tools and compare them in regard to their functionality, performance, and ease of use.
Cheerio vs. Puppeteer
Cheerio and Puppeteer have one primary difference: Cheerio is an HTML parser, while Puppeteer is a browser automation tool. This means that the two tools operate very differently.
Cheerio enables you to take an HTML document and find the HTML elements that you’re looking for via CSS selectors.
For example, take a look at the following selector, which searches for elements with the h1 tag:
const title = $('h1');
You can run this selector on HTML code, such as the following (taken from example.com):
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
Running this selector results in Cheerio returning the h1 elements, from which you can extract information, such as the heading text:
<h1>Example Domain</h1>
To work with Cheerio, you need to use a library like axios to fetch the HTML code of a web page. Then you can parse the HTML and find the information you need.
For example, the following code sample downloads the HTML of example.com, parses it, and then extracts the text of the h1 element:
const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
    const url = 'https://example.com/';
    const response = await axios.get(url); // fetch the HTML
    const $ = cheerio.load(response.data); // parse the HTML with Cheerio
    const title = $('h1'); // use selectors to find the data you need
    console.log(title.text());
})();
In comparison, Puppeteer opens a dedicated browser instance and works with what the browser instance can provide. This means that it can interact with JavaScript elements that are not present in the HTML of the page. For instance, it can click buttons for navigation, scroll the page, or even execute JavaScript in the context of the page.
Here’s an example of a script that launches a browser, opens a page, and extracts the text of the h1 element. Unlike Cheerio, Puppeteer can also find clickable elements, such as the More information... link, and use them for navigation:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null
    });
    const page = await browser.newPage(); // open a browser tab
    await page.goto('https://example.com/'); // go to the page
    const title = await page.evaluate(() => { // use selectors to find the data you need
        const h1 = document.querySelector('h1');
        return h1.textContent;
    });
    console.log(title);
    const moreInformation = await page.waitForSelector('a'); // can also click on elements!
    await moreInformation.click();
    await new Promise(r => setTimeout(r, 2000)); // give the navigation time to finish
    await browser.close(); // close the browser
})();
Because of their operational differences, Puppeteer is better suited for scraping modern websites that use JavaScript to make the website interact like an application. In comparison, Cheerio is better suited for static websites such as blogs.
Features
Once a page with all the necessary information is open, both libraries operate similarly—they use CSS selectors to locate the necessary information and extract it.
Cheerio locates information with a built-in jQuery-like syntax, which makes it convenient for most JavaScript developers:
const title = $('h1');
Puppeteer usually locates information by evaluating JavaScript (in particular, the querySelector and querySelectorAll methods) on the page and returning the result:
const title = await page.evaluate(() => {
    const h1 = document.querySelector('h1');
    return h1.textContent;
});
However, since Puppeteer runs in a browser, it has additional features. For instance, while on the page, Puppeteer can do any action a user can do:
await button.click(); // clicking
await form.type('User'); // typing
This enables it to complete any kind of user flow, such as registration and authentication. You can also use it to search for information through the website’s UI and find information that is otherwise hidden.
For example, if there is a website that you need to log into, doing that with Puppeteer is as simple as it should be—you just need to click the necessary fields and type your username and password. The browser takes care of the rest. Meanwhile, you technically can’t log into an account with Cheerio since it only concerns itself with parsing a single page, not managing a web session.
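As a minimal sketch of such a login flow, assume a hypothetical login page whose URL and field selectors (#username, #password, and the submit button) stand in for whatever the target site actually uses:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Hypothetical login page; replace the URL and selectors with the
    // target site's own markup
    await page.goto('https://example.com/login');

    await page.type('#username', 'my-user');     // type into the username field
    await page.type('#password', 'my-password'); // type into the password field

    await Promise.all([
        page.waitForNavigation(),            // wait for the post-login redirect
        page.click('button[type="submit"]'), // submit the form
    ]);

    // The browser keeps the session cookies, so subsequent page.goto()
    // calls are authenticated
    await browser.close();
})();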
Puppeteer can even execute arbitrary JavaScript to manipulate the contents of the page. This is commonly used to scroll the page, so you can use Puppeteer to scrape websites with infinite scroll, while Cheerio can’t load anything beyond the initial page. It also makes it easier to bypass custom anti-scraping measures.
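For instance, here’s a sketch of an infinite-scroll loop. The feed URL and the .item selector are hypothetical placeholders; the loop scrolls to the bottom, waits for new content, and stops once the page height stops growing:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com/feed'); // hypothetical infinite-scroll page

    let previousHeight = 0;
    while (true) {
        // Scroll to the bottom of the page and report its current height
        const currentHeight = await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
            return document.body.scrollHeight;
        });
        if (currentHeight === previousHeight) break; // no new content appeared
        previousHeight = currentHeight;
        await new Promise(r => setTimeout(r, 1000)); // wait for the next batch to load
    }

    // Extract the text of every loaded item (.item is a placeholder selector)
    const items = await page.$$eval('.item', els => els.map(el => el.textContent));
    console.log(items.length);
    await browser.close();
})();
```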
The fact that the scraping happens in a real browser with natural actions is also great for debugging. You can disable headless mode to watch the script execute and spot issues with the website as they happen.
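For example, combining headless: false with Puppeteer’s built-in slowMo launch option delays each operation by a fixed number of milliseconds, which makes the script’s actions easy to follow by eye:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false, // show the browser window
        slowMo: 250      // pause 250 ms before each operation
    });
    const page = await browser.newPage();
    await page.goto('https://example.com/');
    await browser.close();
})();
```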
Performance
Because Puppeteer has to start and run a browser to implement web scraping, it’s significantly slower to start up and execute the script, and it takes more computing resources than Cheerio.
For example, the following is a quick speed check on how long it takes the libraries to scrape a basic web page.
Using the following scripts, you can open the Bright Data blog and extract the links of the blog posts from the first page:
Cheerio
const axios = require('axios');
const cheerio = require('cheerio');

async function cheerio_scrape() {
    const url = 'https://brightdata.com/blog';
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    const h5s = $('h5');
    let titles = [];
    h5s.each((i, el) => titles.push($(el).text().trim()));
    console.log(titles);
}
Puppeteer
const puppeteer = require('puppeteer');

async function puppeteer_scrape() {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null
    });
    const page = await browser.newPage();
    await page.goto('https://brightdata.com/blog');
    await page.waitForSelector('h5');
    const titles = await page.evaluate(() => {
        let titles = [];
        const h5s = document.querySelectorAll('h5');
        h5s.forEach(el => titles.push(el.textContent.trim()));
        return titles;
    });
    console.log(titles);
    await browser.close();
}
Then you can time the execution of both functions.
The following code times the execution of the Cheerio script, which takes around 500 milliseconds (but your mileage may vary):
let start = Date.now();
cheerio_scrape().then(() => {
let end = Date.now();
console.log(`Execution time: ${end - start} ms`);
});
The following code times the execution of the Puppeteer script:
let start = Date.now();
puppeteer_scrape().then(() => {
let end = Date.now();
console.log(`Execution time: ${end - start} ms`);
});
With Puppeteer, it takes around 4,000 milliseconds for the script to complete, which is significantly longer than the 500 milliseconds that Cheerio takes.
Ease of Use
If you’re new to web scraping, Cheerio may be the better fit because it works only with the HTML code of the page. You don’t need to interact with web elements or adjust the script for their loading times, which means you can focus on the essentials of web scraping, such as crafting the right selectors.
In addition, with Cheerio, the HTML code of a web page doesn’t change once you download it. In contrast, on a website that runs JavaScript and responds to interaction, the HTML changes constantly, and the timing of those changes is somewhat unpredictable.
Because of this, browser automation tools use waits. Puppeteer, in particular, has a waitForSelector function that waits until a condition is fulfilled, such as an element being present on the page. If the element is not available within a given timeout (thirty seconds by default), the script throws an error:
await page.waitForSelector('h1')
If you don’t set these waits up properly, they can make your scripts much less reliable.
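One way to make waits more robust is to pass an explicit per-call timeout and handle the failure, as in this sketch (the five-second timeout is an arbitrary choice for illustration):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com/');
    try {
        // The per-call timeout option overrides the thirty-second default
        await page.waitForSelector('h1', { timeout: 5000 });
        console.log(await page.$eval('h1', el => el.textContent));
    } catch (err) {
        // TimeoutError: the element never appeared within five seconds
        console.log('h1 not found; the page structure may have changed');
    }
    await browser.close();
})();
```

Handling the timeout explicitly lets the script log a useful message or fall back to another strategy instead of crashing partway through a run.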
In addition, the syntax of Cheerio should feel simpler and more natural for JavaScript developers. Puppeteer, while powerful, is not really made with web scraping in mind, and that shows when you try to use it.
Conclusion
This article looked at two commonly used web scraping libraries in the JavaScript ecosystem: Cheerio and Puppeteer. Due to the differences in their modes of operation, they each have strengths and weaknesses. Cheerio is much better suited for simple web scraping scripts that target static pages, while Puppeteer is useful for scraping information from modern JavaScript-rich web pages.
It’s important to note that neither of these tools is purpose-built for web scraping. Developers have adopted them for scraping because they offer capabilities for working with HTML and automating browsers, which means the interfaces they provide are not tuned to the wants and needs of web scrapers.
If you’re searching for a powerful and easy-to-use solution, you should check out Bright Data, an all-encompassing web scraping service. In addition to tools for scraping websites and automating browsers, Bright Data is the largest proxy services provider, serving dozens of Fortune 500 companies and over 20,000 customers. Its worldwide proxy network includes:
- Datacenter proxies – Over 770,000 IPs from datacenters.
- Residential proxies – Over 72M IPs from residential devices in more than 195 countries.
- ISP proxies – Over 700,000 IPs from ISP-registered devices.
- Mobile proxies – Over 7M IPs from mobile networks.
Start your free trial today.