Web Scraping with Puppeteer

Learn how to scrape static and dynamic websites using Puppeteer in this step-by-step guide.

Puppeteer is a browser testing and automation library that also works well for web scraping. Compared to simpler tools like Axios and cheerio, Puppeteer lets developers scrape dynamic content (i.e., content that changes based on user actions).

This means you can use it to scrape web applications (single-page apps) that load their content using JavaScript, which is exactly what you’ll do here.

Web Scraping Using Puppeteer

In this tutorial, you’ll learn how to use Puppeteer to scrape both static and dynamic data (specifically, post titles and links from Bright Data’s blog).

Setting Up

Before you begin the tutorial, make sure you have Node.js installed on your computer. You can download it from the official Node.js Downloads page.

Then create a new directory for the project, navigate to it, and initialize a Node.js project with the following commands:

mkdir puppeteer_tutorial 
cd puppeteer_tutorial 
npm init -y 

Next, install Puppeteer with this command:

npm i puppeteer --save

This command also downloads a dedicated browser that the library will use.

Scraping a Static Site

Like all web scraping tools, Puppeteer lets you scrape the HTML code of web pages.

Following are the steps you can take to use Puppeteer to scrape the first page of posts from Bright Data’s blog:

Create an index.js file and import Puppeteer:

const puppeteer = require('puppeteer');

Then insert the boilerplate necessary for running Puppeteer:

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null
  });

  const page = await browser.newPage();
  await page.goto('https://brightdata.com/blog');

  // all the web scraping will happen here  

  await browser.close();

})();

This function launches a browser (with headless: false, so the window stays visible, which is handy for debugging), navigates to the page to scrape, and closes the browser when done.

All that’s left to do is to scrape the data from the page.

In Puppeteer, the easiest way to access HTML data is through the page.evaluate method. While Puppeteer has $ and $$ methods, which are wrappers useful for fetching elements, it’s simpler to just get all the data from page.evaluate.

On the Bright Data blog, all blog post data is wrapped in an <a> tag with the class of brd_post_entry. The title of the post is in an <h3> element with the class of brd_post_title. The link to the post is the href value of the brd_post_entry.

Here’s what a page.evaluate function that extracts those values looks like:

  const data = await page.evaluate(() => {
    const results = [];
    const entries = document.querySelectorAll('.brd_post_entry');

    for (const entry of entries) {
      const titleText = entry.querySelector('.brd_post_title').textContent;
      const titleLink = entry.href;

      const article = { title: titleText, link: titleLink };
      results.push(article);
    }

    return results;
  });

Finally, you can print out the data in the console:

  console.log(data);
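As an aside, if you prefer the $$-style helpers mentioned earlier, the same extraction can also be sketched with page.$$eval, which runs a callback in the page context over every element matching a selector:

  // Sketch: the same extraction written with page.$$eval,
  // using the same selectors as the page.evaluate version above.
  const posts = await page.$$eval('.brd_post_entry', (entries) =>
    entries.map((entry) => ({
      title: entry.querySelector('.brd_post_title').textContent,
      link: entry.href,
    }))
  );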

The full script looks like this:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null
  });

  const page = await browser.newPage();
  await page.goto('https://brightdata.com/blog');

  const data = await page.evaluate(() => {
    const results = [];
    const entries = document.querySelectorAll('.brd_post_entry');

    for (const entry of entries) {
      const titleText = entry.querySelector('.brd_post_title').textContent;
      const titleLink = entry.href;

      const article = { title: titleText, link: titleLink };
      results.push(article);
    }

    return results;
  });

  console.log(data);

  await browser.close();

})();

Run it by calling node index.js in the terminal. The script should return a list of post titles and links:

[
  {
    title: 'APIs for Dummies: Learning About APIs',
    link: 'https://brightdata.com/blog/web-data/apis-for-dummies'
  },
  {
    title: 'Guide to Using cURL with Python',
    link: 'https://brightdata.com/blog/how-tos/curl-with-python'
  },
  {
    title: 'Guide to Scraping Walmart',
    link: 'https://brightdata.com/blog/how-tos/guide-to-scraping-walmart'
  },
…
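If you’d rather save the results than just print them, you can write them to a JSON file with Node’s built-in fs module. A minimal sketch, placed before the browser.close() call:

  // Persist the scraped posts to a JSON file in the project directory.
  const fs = require('fs');
  fs.writeFileSync('posts.json', JSON.stringify(data, null, 2));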

Scraping Dynamic Content

Scraping static content is a simple task that can easily be done with uncomplicated tools. Thankfully, Puppeteer can be used to accomplish a wide range of actions, such as clicking, typing, and scrolling. You can use all these to interact with dynamic pages and simulate user actions.
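For reference, here’s a quick sketch of a few of those interactions (the selectors are placeholders, not real elements on the Bright Data blog):

  await page.click('#some-button');                   // click an element
  await page.type('#some-input', 'hello');            // type into a field
  await page.keyboard.press('Enter');                 // press a key
  await page.evaluate(() => window.scrollBy(0, 500)); // scroll the page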

A common web scraping task with a library like this would be to search for a certain data set on the page. For example, you might want to use Puppeteer to search for all the Bright Data posts about Puppeteer.

Here’s how you can do that:

Step 1: Accept Cookies

When a person visits the Bright Data blog, a cookie banner sometimes appears.

To deal with that, you need to click on the Accept all button with the following code:

  await page.waitForSelector('#brd_cookies_bar_accept', {timeout: 5000})
    .then(element => element.click())
    .catch(error => console.log(error));

The first line waits up to 5 seconds for an element with the ID brd_cookies_bar_accept to appear. The second line clicks that element. The third line makes sure the script doesn’t crash if the cookie bar never appears.

Note that waiting in Puppeteer is done by specifying a condition that you want to wait for, not by pausing the script for a fixed amount of time. Waiting on a condition like this is commonly called an explicit wait; pausing for a fixed duration is a hard-coded (or static) wait.

Hard-coded waits are heavily discouraged in Puppeteer, as they lead to execution issues: a fixed duration is bound to be either too long (which is inefficient) or too short (which means the script won’t execute correctly).
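To make the difference concrete, here’s a minimal sketch comparing the two approaches (the selector is a placeholder):

  // Condition-based wait (preferred): resolves as soon as the element
  // appears, and fails after the timeout if it never does.
  await page.waitForSelector('.some_element', { timeout: 5000 });

  // Hard-coded wait (discouraged): always pauses the full two seconds,
  // whether the content needed 200 ms or longer than that to load.
  await new Promise((resolve) => setTimeout(resolve, 2000));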

Step 2: Search for Posts

After that, the script needs to click on the search icon, type in “puppeteer”, and click the search icon again to trigger the search.

This can be done with the following code:

  await page.click('.search_icon');

  await page.waitForSelector('.search_container.active');
  const search_form = await page.waitForSelector('#blog_search');
  await search_form.type('puppeteer');

  await page.click('.search_icon');

  await new Promise(r => setTimeout(r, 2000));

This example works similarly to the cookie banner example. After clicking the button, you need to wait until the search container appears, which is why the code waits for an element that matches the CSS selector of .search_container.active.

Additionally, at the end, you need to add a fixed two-second pause so the search results have time to load. While hard-coded waits are discouraged in Puppeteer, there’s no better option here.

On most websites, if the URL changes, you can use the waitForNavigation method, and if a new element appears, you can use the waitForSelector method. Figuring out whether existing elements have been refreshed is a bit more difficult and out of scope for this article.

If you want to try it on your own, this Stack Overflow answer can be of help.
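For the navigation case, the usual idiom is to start waiting before triggering the action so the navigation isn’t missed. A minimal sketch (illustrative only, since the search on this page doesn’t navigate):

  // Start waiting for the navigation, then trigger it, then await both.
  await Promise.all([
    page.waitForNavigation(),
    page.click('.some_link'), // placeholder selector
  ]);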

Step 3: Collect the Posts

After searching for posts, you can use the code you already used for the static page scraping to get the titles of the blog posts:

  const data = await page.evaluate(() => {
    const results = [];
    const entries = document.querySelectorAll('.brd_post_entry');

    for (const entry of entries) {
      const titleText = entry.querySelector('.brd_post_title').textContent;
      const titleLink = entry.href;

      const article = { title: titleText, link: titleLink };
      results.push(article);
    }

    return results;
  });

  console.log(data);

Here’s the full code for the script:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: null
  });

  const page = await browser.newPage();
  await page.goto('https://brightdata.com/blog');

  // Accept the cookie banner if it appears (see Step 1), without crashing if it doesn't.
  await page.waitForSelector('#brd_cookies_bar_accept', {timeout: 5000})
    .then(element => element.click())
    .catch(error => console.log(error));
  await new Promise(r => setTimeout(r, 500));

  await page.click('.search_icon');

  await page.waitForSelector('.search_container.active');
  const search_form = await page.waitForSelector('#blog_search');
  await search_form.type('puppeteer');

  await page.click('.search_icon');

  await new Promise(r => setTimeout(r, 2000));

  const data = await page.evaluate(() => {
    const results = [];
    const entries = document.querySelectorAll('.brd_post_entry');

    for (const entry of entries) {
      const titleText = entry.querySelector('.brd_post_title').textContent;
      const titleLink = entry.href;

      const article = { title: titleText, link: titleLink };
      results.push(article);
    }

    return results;
  });

  console.log(data);

  await browser.close();

})();

Can You Do Better?

While writing web scraping scripts with Puppeteer is possible, it’s not ideal. Puppeteer is made for test automation, which makes it a bit awkward for web scraping.

For example, if you want to achieve scale and efficiency in your scripts, it’s important to be able to scrape without being blocked. To do this, you can use proxies, which act as gateways between you and the website you scrape. While Puppeteer supports the use of proxies, you need to find and contract with a proxy network on your own (learn more about Puppeteer proxy integration with Bright Data).
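For illustration, here’s a minimal sketch of routing Puppeteer through a proxy. The address and credentials are placeholders; the exact values depend on your provider:

  const browser = await puppeteer.launch({
    // Chromium flag that routes all traffic through a proxy (placeholder address).
    args: ['--proxy-server=http://proxy.example.com:8080'],
  });

  const page = await browser.newPage();

  // If the proxy requires authentication (placeholder credentials).
  await page.authenticate({ username: 'user', password: 'pass' });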

Additionally, optimizing Puppeteer for parallel use isn’t easy. If you want to scrape a lot of data, you’ll need to work hard to get optimal performance.
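For example, a basic approach is to open several pages in one browser and scrape them concurrently with Promise.all. A minimal sketch with placeholder URLs; in practice you’d also need to cap concurrency and handle failures, which is where the real work lies:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  // Placeholder URLs; each one is scraped in its own page, concurrently.
  const urls = ['https://example.com/a', 'https://example.com/b'];

  const results = await Promise.all(
    urls.map(async (url) => {
      const page = await browser.newPage();
      await page.goto(url);
      const title = await page.evaluate(() => document.title);
      await page.close();
      return { url, title };
    })
  );

  console.log(results);
  await browser.close();
})();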

These downsides mean that Puppeteer is a good choice for small hobby scripts, but scaling up your operations with it takes considerable time and effort.

If you want something that’s easier to use, you can pick a web data platform like Bright Data. It enables companies to gather massive amounts of structured data from the web using easy-to-use tools, like the Scraping Browser (Puppeteer/Playwright compatible), which is purpose-built for scraping.

Conclusion

In this article, you learned how to use Puppeteer for scraping static and dynamic web pages.

Puppeteer can execute most of the actions a browser can, including clicking items, typing text, and executing JavaScript. And thanks to condition-based waits, scripts written in Puppeteer are fast and easy to write.

But there are also some problems: Puppeteer isn’t the most efficient tool for web scraping, and its documentation is not well-suited for beginners. It’s also hard to scale scraping operations with Puppeteer unless you’re already well-versed in it.

Tired of scraping data yourself? Get precollected or custom datasets.