Cheerio vs. Puppeteer for Web Scraping

A look at the differences between Puppeteer and Cheerio, by building a web scraper with both.
11 min read
Sunil Sandhu
Founder of In Plain English

Cheerio and Puppeteer are two Node.js libraries that let you work with web pages programmatically. Because of this, they are both popular choices for anyone who wants to build a Node.js web scraper from scratch.

In order to compare Cheerio and Puppeteer, we will be building a simple web scraper with Cheerio and a web scraper with Puppeteer. We will use both tools to scrape all the blog links from In Plain English, a popular programming platform.

But before we begin building anything, let’s look at how the two libraries differ.

Differences between Cheerio and Puppeteer

There are many differences between these two libraries, and each comes with its own features that you can leverage for web scraping.

Cheerio

  • Cheerio is a DOM parser, able to parse HTML and XML files.
  • It is a fast and lean implementation of core jQuery designed specifically for the server.
  • If you plan to use this to scrape a website, you will need to use Cheerio in conjunction with a Node.js http client library such as Axios.
  • Cheerio doesn’t render the website like a browser (it doesn’t apply CSS or load external resources).
  • Because of this, you will have a hard time trying to scrape SPAs built with frontend technologies such as React.
  • Cheerio cannot interact with a site (e.g. it cannot click on buttons) or access content behind scripts.
  • It has an easy learning curve thanks to its simple syntax; users of jQuery will feel at home here (see the short sketch after this list).
  • Cheerio is fast in comparison to Puppeteer.
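
To get a feel for the jQuery-like syntax mentioned above, here is a minimal sketch (the HTML fragment and class name are made up for illustration) of loading some markup and reading text with Cheerio:

const cheerio = require("cheerio");

// Load a small HTML fragment (in a real scraper this would come from an HTTP client such as Axios)
const $ = cheerio.load('<ul><li class="post">First post</li><li class="post">Second post</li></ul>');

// Use familiar jQuery-style selectors and traversal
$(".post").each(function () {
  console.log($(this).text()); // "First post", then "Second post"
});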

Puppeteer

  • Puppeteer is a browser automation tool. You get access to the entire browser engine (usually Chromium).
  • This makes it a more versatile option, compared to Cheerio.
  • It can execute JavaScript, making it able to scrape dynamic pages like single-page applications (SPAs).
  • Puppeteer can interact with websites, meaning it can be used to click buttons, type into login forms, etc.
  • It has a steeper learning curve, as it offers more functionality and generally requires asynchronous code (i.e. promises and async/await); a short sketch follows this list.
  • Puppeteer is slow in comparison to Cheerio.
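
To get a feel for that asynchronous, browser-driven style, here is a minimal sketch (the URL is just an example) of launching a browser, visiting a page, and reading its title with Puppeteer:

const puppeteer = require("puppeteer");

(async () => {
  // Launch a headless browser and open a new tab
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Navigate to a page and read its title
  await page.goto("https://example.com");
  console.log(await page.title());

  // Always close the browser when done
  await browser.close();
})();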

Building a Web Scraper with Cheerio

First, let’s create a folder called scraper for our code. Inside of scraper, run npm init -y or yarn init -y, depending on whether you have opted to use npm or yarn.

Now that we have our folder ready and package.json initialized, let’s install our packages.

Note: You can refer to our main Node.js web scraping guide, which covers the use of Cheerio and Axios for web scraping in more detail.

Step 1 – Installing Cheerio

To install Cheerio, run the following command in your terminal:

// using npm
npm install cheerio

// or using yarn
yarn add cheerio

Step 2 – Installing Axios

Axios is a popular library for making HTTP requests in Node.js. It can be used to make API calls, fetch data from websites, and more.

To install, run the following command in your terminal:

// using npm
npm install axios

// or using yarn
yarn add axios

We use Axios to make HTTP requests to the website we want to scrape. The response we get from the website is in the form of HTML, which we can then parse and extract the information we need using Cheerio.

Step 3 – Preparing Our Scraper

Let’s go into our scraper folder and create a file called cheerio.js.

Here is the basic code structure to get you started with web scraping using Cheerio and Axios:

const axios = require('axios');
const cheerio = require('cheerio');

axios
  .get("https://plainenglish.io/blog")
  .then((response) => {
    // Initialize links array which we will push the links to later
    let links = [];

    // HTML Markup
    const body = response.data;

    // Load HTML data and initialize cheerio
    const $ = cheerio.load(body);

    // CSS selector for the target element
    const element = ".PostPreview_container__82q9E";

    // Loop through each matching element and get the href attribute of its <a> tag
    $(element).each(function () {
      const _link = $(this).find("a").prop("href");

      // We check for undefined because cheerio returns undefined if the <a> tag doesn't exist
      if (_link !== undefined) {
        // Add the link to the links array
        links.push(`https://plainenglish.io` + _link);
      }
    });

    return links;
  })
  .then((response) => {
    console.log(response);
  });

In the above code, we first require the Axios and Cheerio libraries.

Step 4 – Requesting The Data

Next, we make a get() request to “https://plainenglish.io/blog”. Because Axios is asynchronous, we chain our get() function with then().
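
As an aside, if you prefer async/await over chained then() calls, the fetch step could be written roughly like this (the getBlogHtml helper name is just for illustration):

// Sketch: the same request using async/await instead of then()
const getBlogHtml = async () => {
  const response = await axios.get("https://plainenglish.io/blog");
  return response.data; // the raw HTML markup
};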

We initialize an empty links array to capture the links we plan to scrape.

We then pass the response.data from Axios to Cheerio with:

// HTML Markup
const body = response.data;

// Load HTML data and initialize cheerio
const $ = cheerio.load(body);

We choose which selector we plan to target, in our case:

// CSS selector for the target element
const element = ".PostPreview_container__82q9E";
Step 5 – Processing The Data

Then, we loop through each matching element, find the <a> tag, and grab the value from the href property. For each match, we push it to our links array:

// Loop through each matching element and get the href attribute of its <a> tag
$(element).each(function () {
  const _link = $(this).find("a").prop("href");

  // We check for undefined because cheerio returns undefined if the <a> tag doesn't exist
  if (_link !== undefined) {
    // Add the link to the links array
    links.push(`https://plainenglish.io` + _link);
  }
});

We then return links, chain another then(), and console.log the response.

Step 6 – End Results

Finally, if we open a terminal from inside of our scraper folder, we can run node cheerio.js. This will execute all of the code from our cheerio.js file. You should see the URLs from our links array being output to the console. It will look something like this:

 'https://plainenglish.io/blog/how-to-implement-a-search-bar-in-react-js',
 'https://plainenglish.io/blog/how-to-build-data-driven-surveys-with-react-rest-api-surveyjs',
 'https://plainenglish.io/blog/handle-errors-in-angular-with-httpclient-and-rxjs',
 'https://plainenglish.io/blog/deploying-a-localhost-server-with-node-js-and-express-js',
 'https://plainenglish.io/blog/complete-guide-to-data-center-migration',
 'https://plainenglish.io/blog/build-a-stripe-app-with-typescript-and-node-js-part-2',
 ... 861 more items

And just like that, we’ve managed to scrape the In Plain English website!

From here, we can go one step further and save the data to a file, rather than simply outputting it to the console.
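
For example, inside the final then() block we could swap the console.log for a write to disk using Node’s built-in fs module (the links.json filename is just an example):

const fs = require("fs");

// ...inside the final .then((response) => { ... }) block:
fs.writeFileSync("links.json", JSON.stringify(response, null, 2));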

Cheerio and Axios make it easy to perform web scraping in Node.js. With just a few lines of code, you can extract data from websites and use it for various purposes.

Building a Web Scraper with Puppeteer

Let’s go into our scraper folder and create a file called puppeteer.js. We have already initialized our package.json, but if you’ve skipped ahead to this section, go ahead and initialize that file now.

Once initialized, let’s go ahead and install Puppeteer.

Step 1 – Installing Puppeteer

To install Puppeteer, run either of the following commands:

// using npm
npm install puppeteer

// or using yarn
yarn add puppeteer

Step 2 – Preparing Our Scraper

Open the puppeteer.js file we created a moment ago.

Here is the basic code structure to get you started with web scraping using Puppeteer:

const puppeteer = require("puppeteer");

// Because everything in Puppeteer is asynchronous,
// we wrap all of our code inside of an async IIFE
(async () => {
 // Initialize links array which we will push the links to later
 let links = [];

 // Launch Puppeteer
 const browser = await puppeteer.launch();

 // Create a new page
 const page = await browser.newPage();

 // Go to URL
 await page.goto("https://plainenglish.io/blog");

 // Set screen size
 await page.setViewport({ width: 1080, height: 1024 });

 // CSS selector for the target element
 const element = ".PostPreview_container__82q9E";

 // Get all matching elements
 const elements = await page.$$(element);

 // Wrapped with Promise.all to wait for all promises to resolve before continuing
 const _links = await Promise.all(
   // Get the href attribute of each element
   elements.map(async (el) => el.evaluate((el) => el.children[0].href))
 );

 if (_links.length) {
   // If there are any links
   _links.forEach((url) => {
     // Loop through each link
     links.push(url); // Add the link to the links array
   });
 }

 console.log(links);

 await browser.close();
})();

In the above code, we first require the Puppeteer library.

Step 3 – Creating an IIFE

Next, we create an immediately invoked function expression (IIFE). Because everything in Puppeteer is asynchronous, we put async at the beginning. In other words, we have this:

(async () => {
  // ...code goes here
})();

Inside of our async IIFE, we create an empty links array, which we will use to capture the links from the blog we are scraping.

// Initialize links array which we will push the links to later
let links = []

Next, we launch Puppeteer, open a new page, browse to a URL, and set the page viewport (the screen size).

 // Launch Puppeteer
 const browser = await puppeteer.launch();

 // Create a new page
 const page = await browser.newPage();

 // Go to URL
 await page.goto("https://plainenglish.io/blog");

 // Set screen size
 await page.setViewport({ width: 1080, height: 1024 });

By default, Puppeteer runs in ‘headless mode’. This means that it doesn’t open a browser that you can visually see. Nevertheless, we still set a viewport size as we want Puppeteer to browse the site at a certain width and height.

Note: If you decide you would like to watch what Puppeteer is doing in real time, you can pass in the headless: false option as a parameter, like so:

// Launch Puppeteer
const browser = await puppeteer.launch({ headless: false });

Step 4 – Requesting The Data

From here, we choose which selector we plan to target, in our case:

// CSS selector for the target element
const element = ".PostPreview_container__82q9E";

And run what is roughly the equivalent of querySelectorAll() for our target element:

// Get all matching elements
const elements = await page.$$(element);

Note: $$ is not the same as querySelectorAll, so don’t expect to have access to all of the same things.
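
If you would rather run a real querySelectorAll() inside the page and pull the hrefs out in one pass, page.$$eval() is one possible alternative (this sketch assumes the <a> tag we care about is the one matched by the combined selector):

// Possible alternative: run querySelectorAll inside the page and map the hrefs in one call
const _links = await page.$$eval(".PostPreview_container__82q9E a", (anchors) =>
  anchors.map((a) => a.href)
);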

Step 5 – Processing The Data

Now that we have our elements stored inside of elements, we map over each element to pull out the href property:

// Wrapped with Promise.all to wait for all promises to resolve before continuing
const _links = await Promise.all(
 // Get the href attribute of each element
 elements.map(async (el) => el.evaluate((el) => el.children[0].href))
);

In our specific use case, we use el.children[0] because we know that the first child of our target element is an <a> tag, and it’s that <a> tag whose href we want.
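
If we would rather not rely on the <a> tag being the first child, one alternative sketch is to query for it inside each element handle with $eval():

// Alternative sketch: find the <a> tag inside each element handle explicitly,
// instead of relying on it being the first child
const _links = await Promise.all(
  elements.map((el) => el.$eval("a", (a) => a.href))
);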

Next, we loop through each mapped element and push the value into our links array, like so:

if (_links.length) {
 // If there are any links
 _links.forEach((url) => {
   // Loop through each link
   links.push(url); // Add the link to the links array
 });
}

Lastly, we console.log the links, and then close the browser:

console.log(links);

await browser.close();

Note: If you do not close the browser, it will stay open and your terminal will hang.
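
One way to guarantee the browser is closed, even if something in the scraping code throws, is to wrap the work in a try/finally block. A minimal sketch:

(async () => {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto("https://plainenglish.io/blog");
    // ...scraping logic goes here
  } finally {
    // Runs even if the code above throws, so the browser never stays open
    await browser.close();
  }
})();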

Step 6 – End Results

Now, if we open a terminal from inside of our scraper folder, we can run node puppeteer.js. This will execute all of the code from our puppeteer.js file. You should see the URLs from our links array being output to the console. It will look something like this:

'https://plainenglish.io/blog/how-to-implement-a-search-bar-in-react-js',
 'https://plainenglish.io/blog/how-to-build-data-driven-surveys-with-react-rest-api-surveyjs',
 'https://plainenglish.io/blog/handle-errors-in-angular-with-httpclient-and-rxjs',
 'https://plainenglish.io/blog/deploying-a-localhost-server-with-node-js-and-express-js',
 'https://plainenglish.io/blog/complete-guide-to-data-center-migration',
 'https://plainenglish.io/blog/build-a-stripe-app-with-typescript-and-node-js-part-2',
 ... 861 more items

And just like that, we’ve managed to scrape the website with Puppeteer!

Puppeteer is a powerful tool for web scraping and browser automation. Its rich API lets you extract information from websites, generate screenshots and PDFs, and automate many other browser tasks.

If you want to use Puppeteer to scrape major websites, you should consider integrating it with a proxy to avoid getting blocked.
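
As a rough sketch, a proxy can be passed to Puppeteer through Chromium’s --proxy-server launch argument; the host, port, and credentials below are placeholders, and page.authenticate() is only needed if the proxy requires a username and password:

// Launch Chromium behind a proxy (placeholder address)
const browser = await puppeteer.launch({
  args: ["--proxy-server=http://proxy.example.com:8080"],
});

const page = await browser.newPage();

// Only needed if the proxy requires authentication (placeholder credentials)
await page.authenticate({ username: "YOUR_USERNAME", password: "YOUR_PASSWORD" });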

Note: There are other Puppeteer alternatives, such as Selenium or the Web Scraper IDE. Or, if you want to save time, you can skip the web scraping process entirely by looking at ready-made datasets.

Conclusion

If you are looking to scrape static pages without the need for interactions such as clicks and form submissions (or any sort of event handling for that matter), Cheerio is the optimal choice.

However, if the website relies on JavaScript for content injection, or you need to handle events, Puppeteer is necessary.

Whichever approach you decide on, it is worth noting that this specific use case was fairly simple. If you try to scrape something more complex, like a dynamic website (YouTube, Twitter, or Facebook, for example), you might find yourself in fairly deep waters.

If you’re looking to scrape websites and don’t want to waste weeks trying to piece together a solution, you might be better placed reaching for an off-the-shelf solution such as Bright Data’s Web Scraper IDE.

Bright Data’s IDE includes pre-made scraping functions, built-in sophisticated unblocking proxy infrastructure, browser scripting in JavaScript, debugging, and several ready-to-use scraping templates for popular websites.
