When choosing a Node.js web scraping tool, there are several options. Two of the most common are Cheerio and Puppeteer.
Cheerio was created as a fast, lean, server-side implementation of core jQuery, intended for parsing and manipulating HTML documents. In comparison, Puppeteer was made for automating tests of web pages and applications.
Regardless, both tools can be useful in web scraping: Cheerio enables you to parse the HTML of a web page to find the information you need, and Puppeteer enables you to automate a web browser to scrape dynamic sites that use JavaScript.
This article will look at both of these tools and compare them in regard to their functionality, performance, and ease of use.
Cheerio vs. Puppeteer
Cheerio and Puppeteer have one primary difference: Cheerio is an HTML parser, while Puppeteer is a browser automation tool. This means that the two tools operate very differently.
Cheerio enables you to take an HTML document and find the HTML elements that you’re looking for via CSS selectors.
For example, take a look at the following selector, which searches for elements with the h1 tag:
const title = $('h1');
You can run this selector on HTML code, such as the following (taken from example.com):
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
Running this selector results in Cheerio returning the h1 elements, from which you can extract information, such as the heading text:
<h1>Example Domain</h1>
To work with Cheerio, you need to use a library like axios to fetch the HTML code of a web page. Then you can parse the HTML and find the information you need.
For example, the following code sample downloads the HTML of example.com, parses it, and then extracts the text of the h1 element:
const axios = require('axios');
const cheerio = require('cheerio');

(async () => {
    const url = 'https://example.com/';
    const response = await axios.get(url); // fetch the HTML
    const $ = cheerio.load(response.data); // parse the HTML with Cheerio
    const title = $('h1'); // use selectors to find the data you need
    console.log(title.text());
})();
In comparison, Puppeteer opens a dedicated browser instance and works with what the browser instance can provide. This means that it can interact with JavaScript elements that are not present in the HTML of the page. For instance, it can click buttons for navigation, scroll the page, or even execute JavaScript in the context of the page.
Here’s an example of a script that launches a browser, opens a page, and extracts the text of the h1 element. Unlike Cheerio, Puppeteer can also find clickable elements, such as the More information... link, and use them for navigation:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null
    });
    const page = await browser.newPage(); // open a browser tab
    await page.goto('https://example.com/'); // go to the page
    const title = await page.evaluate(() => { // use selectors to find the data you need
        const h1 = document.querySelector('h1');
        return h1.textContent;
    });
    console.log(title);
    const moreInformation = await page.waitForSelector('a'); // can also click on elements!
    await moreInformation.click();
    await new Promise(r => setTimeout(r, 2000)); // give the navigation time to finish
    await browser.close(); // close the browser
})();
Because of their operational differences, Puppeteer is better suited for scraping modern websites that use JavaScript to make the website interact like an application. In comparison, Cheerio is better suited for static websites such as blogs.
Features
Once a page with all the necessary information is open, both libraries operate similarly—they use CSS selectors to locate the necessary information and extract it.
Cheerio locates information with a built-in jQuery-like syntax, which makes it convenient for most JavaScript developers:
const title = $('h1');
Puppeteer usually locates information by evaluating JavaScript (in particular, the querySelector and querySelectorAll methods) on the page and returning the result:
const title = await page.evaluate(() => {
    const h1 = document.querySelector('h1');
    return h1.textContent;
});
However, since Puppeteer runs in a browser, it has additional features. For instance, while on the page, Puppeteer can do any action a user can do:
await button.click(); // clicking
await form.type('User'); // typing
This enables it to complete any kind of user flow, such as registration and authentication. You can also use it to search for information through the website’s UI and find information that is otherwise hidden.
For example, if there is a website that you need to log into, doing that with Puppeteer is as simple as it should be—you just need to click the necessary fields and type your username and password. The browser takes care of the rest. Meanwhile, you technically can’t log into an account with Cheerio since it only concerns itself with parsing a single page, not managing a web session.
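As a minimal sketch of such a login flow, assume a hypothetical login page whose URL and field selectors (#username, #password, and the submit button) stand in for whatever the target site actually uses:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Hypothetical login page; replace the URL and selectors with the
    // target site's own markup
    await page.goto('https://example.com/login');

    await page.type('#username', 'my-user');     // type into the username field
    await page.type('#password', 'my-password'); // type into the password field

    await Promise.all([
        page.waitForNavigation(),            // wait for the post-login redirect
        page.click('button[type="submit"]'), // submit the form
    ]);

    // The browser keeps the session cookies, so subsequent page.goto()
    // calls are authenticated
    await browser.close();
})();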
Puppeteer can even execute arbitrary JavaScript to manipulate the contents of the page. This is commonly used to scroll the page, so you can use Puppeteer to scrape websites with infinite scroll, while Cheerio can’t load anything beyond the initial page. It also makes it easier to bypass custom anti-scraping measures.
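For instance, here’s a sketch of an infinite-scroll loop. The feed URL and the .item selector are hypothetical placeholders; the loop scrolls to the bottom, waits for new content, and stops once the page height stops growing:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com/feed'); // hypothetical infinite-scroll page

    let previousHeight = 0;
    while (true) {
        // Scroll to the bottom of the page and report its current height
        const currentHeight = await page.evaluate(() => {
            window.scrollTo(0, document.body.scrollHeight);
            return document.body.scrollHeight;
        });
        if (currentHeight === previousHeight) break; // no new content appeared
        previousHeight = currentHeight;
        await new Promise(r => setTimeout(r, 1000)); // wait for the next batch to load
    }

    // Extract the text of every loaded item (.item is a placeholder selector)
    const items = await page.$$eval('.item', els => els.map(el => el.textContent));
    console.log(items.length);
    await browser.close();
})();
```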
The fact that the scraping happens in a real browser with natural actions is also great for debugging. You can disable headless mode to watch the script execute and spot issues with the website as they happen.
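For example, combining headless: false with Puppeteer’s built-in slowMo launch option delays each operation by a fixed number of milliseconds, which makes the script’s actions easy to follow by eye:

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: false, // show the browser window
        slowMo: 250      // pause 250 ms before each operation
    });
    const page = await browser.newPage();
    await page.goto('https://example.com/');
    await browser.close();
})();
```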
Performance
Because Puppeteer has to start and run a browser to implement web scraping, it’s significantly slower to start up and execute the script, and it takes more computing resources than Cheerio.
For example, the following is a quick speed check on how long it takes the libraries to scrape a basic web page.
Using the following scripts, you can open the Bright Data blog and extract the links of the blog posts from the first page:
Cheerio
const axios = require('axios');
const cheerio = require('cheerio');

async function cheerio_scrape() {
    const url = 'https://brightdata.com/blog';
    const response = await axios.get(url);
    const $ = cheerio.load(response.data);
    const h5s = $('h5');
    let titles = [];
    h5s.each((i, el) => titles.push($(el).text().trim()));
    console.log(titles);
}
Puppeteer
const puppeteer = require('puppeteer');

async function puppeteer_scrape() {
    const browser = await puppeteer.launch({
        headless: false,
        defaultViewport: null
    });
    const page = await browser.newPage();
    await page.goto('https://brightdata.com/blog');
    await page.waitForSelector('h5');
    const titles = await page.evaluate(() => {
        let titles = [];
        const h5s = document.querySelectorAll('h5');
        h5s.forEach(el => titles.push(el.textContent.trim()));
        return titles;
    });
    console.log(titles);
    await browser.close();
}
Then you can time the execution of both functions.
The following code times the execution of the Cheerio script, which takes around 500 milliseconds (but your mileage may vary):
let start = Date.now();
cheerio_scrape().then(() => {
let end = Date.now();
console.log(`Execution time: ${end - start} ms`);
});
The following code times the execution of the Puppeteer script:
let start = Date.now();
puppeteer_scrape().then(() => {
let end = Date.now();
console.log(`Execution time: ${end - start} ms`);
});
With Puppeteer, it takes around 4,000 milliseconds for the script to complete, which is significantly longer than the 500 milliseconds that Cheerio takes.
Ease of Use
If you’re new to web scraping, Cheerio may be the better fit because it works only with the HTML code of the page. You don’t need to interact with web elements or adjust the script for their loading times, which means you can focus on the essentials of web scraping, such as crafting the right selectors.
In addition, with Cheerio, the HTML code of a web page doesn’t change once you download it. In contrast, on a website that runs JavaScript and responds to interaction, the HTML changes constantly, and the timing of those changes is somewhat unpredictable.
Because of this, browser automation tools use waits. Puppeteer, in particular, has a waitForSelector function that waits until a condition is fulfilled, such as an element being present on the page. If the element is not available within a given timeout (thirty seconds by default), the script throws an error:
await page.waitForSelector('h1')
If you don’t set these waits up properly, they can make your scripts much less reliable.
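One way to make waits more robust is to pass an explicit per-call timeout and handle the failure, as in this sketch (the five-second timeout is an arbitrary choice for illustration):

```javascript
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://example.com/');
    try {
        // The per-call timeout option overrides the thirty-second default
        await page.waitForSelector('h1', { timeout: 5000 });
        console.log(await page.$eval('h1', el => el.textContent));
    } catch (err) {
        // TimeoutError: the element never appeared within five seconds
        console.log('h1 not found; the page structure may have changed');
    }
    await browser.close();
})();
```

Handling the timeout explicitly lets the script log a useful message or fall back to another strategy instead of crashing partway through a run.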
In addition, the syntax of Cheerio should feel simpler and more natural for JavaScript developers. Puppeteer, while powerful, is not really made with web scraping in mind, and that shows when you try to use it.
Conclusion
This article looked at two commonly used web scraping libraries in the JavaScript ecosystem: Cheerio and Puppeteer. Due to the differences in their modes of operation, they each have strengths and weaknesses. Cheerio is much better suited for simple web scraping scripts that target static pages, while Puppeteer is useful for scraping information from modern JavaScript-rich web pages.
It’s important to note that neither of these tools is purpose-built for web scraping. Developers have adopted them for scraping because they offer capabilities for working with HTML and automating browsers, which means the interfaces they provide are not tuned to the wants and needs of web scrapers.
If you’re searching for a powerful and easy-to-use solution, you should check out Bright Data, an all-encompassing web scraping service. In addition to tools for scraping websites and automating browsers, Bright Data is the largest proxy services provider, serving dozens of Fortune 500 companies and over 20,000 customers. Its worldwide proxy network includes:
- Datacenter proxies – Over 770,000 IPs from datacenters.
- Residential proxies – Over 72M IPs from residential devices in more than 195 countries.
- ISP proxies – Over 700,000 IPs from ISP-registered devices.
- Mobile proxies – Over 7M IPs from mobile networks.
Start your free trial today.