Web Scraping with Playwright Guide


The web contains vast amounts of data that can be invaluable for research and business decisions. That’s why knowing how to use tools like Playwright is so important.

Playwright is a powerful Node.js library developed by Microsoft that can scrape data from websites. In this post, you’ll see practical, detailed examples using Playwright to scrape data from the Bright Data home page. You can then apply these examples to any other website you want to scrape using Playwright.

Why Use Playwright

Web scraping is not a new concept. In the JavaScript ecosystem, tools such as Cheerio, Selenium, Puppeteer, and Playwright all help simplify web scraping.

As a newer web scraping library, Playwright is particularly appealing due to the following features:

Powerful Locators

Playwright uses locators, which have built-in auto-waiting and retry logic, to select elements on a web page. The auto-waiting logic simplifies your web scraping code because you don’t have to wait for a web page to load manually.

The retry logic also makes Playwright a suitable library for scraping modern, single-page applications (SPA) that load data dynamically after the initial page has loaded.
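As a minimal sketch of what auto-waiting means in practice, the helper below reads an element's text straight away and lets Playwright retry until the element appears or a timeout expires. The helper name textOrNull and the 5-second default are illustrative choices, and the check on the error name assumes Playwright reports an expired wait as an error named "TimeoutError".

```javascript
// Return an element's text, or null if it never appears within the timeout.
// `page` is a Playwright Page object, created as shown later in this guide.
async function textOrNull(page, selector, timeoutMs = 5000) {
  try {
    // innerText() auto-waits: Playwright keeps retrying the locator
    // until the element shows up or the timeout expires.
    return await page.locator(selector).innerText({ timeout: timeoutMs });
  } catch (error) {
    // Assumption: an expired wait surfaces as an error named "TimeoutError".
    if (error.name === "TimeoutError") return null;
    // Anything else is unexpected, so pass it on to the caller.
    throw error;
  }
}
```

Returning null instead of throwing can be convenient on single-page applications, where some elements only render for certain users or regions.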

Multiple Locator Methods

When using locators, Playwright lets you specify what elements to locate on the web page using several different syntaxes, including the CSS selector syntax, XPath syntax, and element text content. You can also apply filters to locators to further refine the locator.
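To illustrate how a filter refines a locator, here is a small sketch that starts from a CSS selector and keeps only the elements containing a given piece of text. The function name linkTextsContaining is hypothetical, and the choice of the a selector is just an example.

```javascript
// Collect the text of every link whose text contains `text`.
// `page` is a Playwright Page object.
async function linkTextsContaining(page, text) {
  return page
    .locator("a") // CSS selector: every <a> element on the page
    .filter({ hasText: text }) // keep only the links containing the text
    .allInnerTexts(); // resolve to an array of strings
}
```

Calling linkTextsContaining(page, "Pricing") would return the text of each matching link, or an empty array if none match.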

Web Scraping with Playwright

In this section, you’ll create a Node.js project, install Playwright, and learn how to locate, interact with, and extract data from a web page using Playwright.

Prerequisites

The code snippets in this article run on the latest long-term support (LTS) version of Node.js, which, at the time of writing, is v18.15.0. Make sure you have Node.js installed before you begin.

A code editor capable of JavaScript syntax highlighting and autocomplete, such as Visual Studio Code, is also highly recommended.

Creating a New Project

Open a new terminal window, create a new folder for your Node.js project, and navigate into it using the following commands:

mkdir playwright-demo
cd playwright-demo

Next, create your Node.js project by running the following npm command:

npm init -y

Installing Playwright

Once you’ve created a Node.js project, install the Playwright library using the following command in your terminal window:

npm install playwright

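The library might take a while to install because Playwright downloads the browsers it needs as part of its installation. If the browsers are not downloaded automatically, which is the case in more recent Playwright releases, run npx playwright install to fetch them.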

Opening the Bright Data Home Page

With the Playwright library installed, create a new file in your project folder named index.js. Then copy the following code into it:

// Import the Playwright library to use it
const playwright = require("playwright");

(async () => {
  // Launch a new instance of a Chromium browser
  const browser = await playwright.chromium.launch({
    // Set headless to false so you can see how Playwright is
    // interacting with the browser
    headless: false,
  });
  // Create a new Playwright context
  const context = await browser.newContext();
  // Create a new page/tab in the context.
  const page = await context.newPage();

  // Navigate to the Bright Data home page.
  await page.goto("https://brightdata.com/");

  // Wait 10 seconds (or 10,000 milliseconds)
  await page.waitForTimeout(10000);

  // Close the browser
  await browser.close();
})();

Run the snippet using the following command in your terminal:

node index.js

A Chromium browser window should open and load the Bright Data home page.
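If a scraping step throws an error, the script above would exit before reaching browser.close(). One possible way to guard against that is a small wrapper that always closes the browser. This is a sketch rather than an official pattern: the name withPage is hypothetical, and it takes the browser type (for example, playwright.chromium) as a parameter.

```javascript
// Open `url` in a fresh browser, run `task(page)`, and always close the browser.
// `browserType` is a Playwright browser type such as playwright.chromium.
async function withPage(browserType, url, task) {
  const browser = await browserType.launch({ headless: true });
  try {
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto(url);
    // Hand the loaded page to the caller and return whatever it produces.
    return await task(page);
  } finally {
    // Runs whether the task succeeded or threw.
    await browser.close();
  }
}

// Example usage (requires the playwright package and network access):
// const { chromium } = require("playwright");
// withPage(chromium, "https://brightdata.com/", (page) => page.title())
//   .then((title) => console.log(title));
```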

Locating Elements

Now that you’ve navigated to Bright Data’s home page using Playwright, you can use locators to select specific elements on the web page. Playwright has several locators, and the following sections will demonstrate how each locator works.

Locating Elements with CSS Selectors

Playwright lets you locate elements on a web page using CSS selectors, a concise yet powerful syntax used in CSS to apply styles to particular HTML elements on the web page.

For example, the Bright Data logo is an <svg> element in the page header, with the page_header_logo_svg class attached:

Using this information, you can locate the SVG element using a CSS selector:

const logoSvg = page.locator(".page_header_logo_svg");

The locator is stored in the logoSvg variable and can later be used to interact with or extract information from the element.

Locating Elements with XPath Queries

XPath is another selector syntax, originally designed to locate elements within an XML document. HTML is not strictly XML, but the browser exposes the page as a similar tree of nodes, so you can use the same syntax to find HTML elements on a web page.

For instance, you can select the same SVG logo as seen in the previous section with the following XPath query:

const logoSvg = page.locator("//*[@class='page_header_logo_svg']");

The query matches any element whose class attribute is exactly page_header_logo_svg, and the resulting locator is stored in the logoSvg variable.
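An XPath comparison such as @class='page_header_logo_svg' matches only when the class attribute holds exactly that value, so an element carrying additional classes would be missed. A common XPath idiom for matching a single class among several is sketched below. The helper name xpathForClass is hypothetical.

```javascript
// Build an XPath query that matches elements having `className`
// as one of possibly several space-separated classes.
function xpathForClass(className) {
  // Pad the normalized class list with spaces so that " name " only
  // matches a whole class token, not part of a longer class name.
  return `//*[contains(concat(' ', normalize-space(@class), ' '), ' ${className} ')]`;
}

// Example usage with a Playwright page:
// const logoSvg = page.locator(xpathForClass("page_header_logo_svg"));
```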

Locating Elements by Role

HTML elements can have different roles attached to them. These roles provide semantic meaning to a web page, making it easier for screen readers and other tools to support the page. You can read more about roles in the WAI-ARIA specification.

The following code snippet shows you how you can find the Sign up button using the role and name attached to it:

const signupButton = page.getByRole("button", {
  name: "Start free trial",
});

This snippet will locate the Start free trial button on the home page:

Locating Elements by Text

If an HTML element has no meaningful identifier attribute, such as an id or class attribute, you can select the element by its text using the getByText method.

For example, the Bright Data home page has a title in the hero section with the words “structured data” in blue:

You can select the <span> element containing those words using the following Playwright snippet:

const structuredData = page.getByText("structured data");

Locating Elements by Label

In an HTML form, input elements often have labels. Playwright can use these labels to identify the input element associated with that label using the getByLabel method.

For example, the Bright Data Log in page has an input element with a label containing the words “Work email”:

You can locate the input element on the page and store it in a variable to use later using the following code snippet:

// Navigate to the Bright Data login page.
await page.goto("https://brightdata.com/cp/start");

// Locate the <input> using the label
const emailInput = page.getByLabel("Work email");

Locating Elements by Placeholder

You can also locate an input element based on the placeholder value displayed using the getByPlaceholder method.

You’ll notice that the email field on the Bright Data Log in page has placeholder text to give the user context about what information to enter.

The following snippet will locate this element based on the placeholder value shown by the input:

// Navigate to the Bright Data login page.
await page.goto("https://brightdata.com/cp/start");

// Locate the <input> using the placeholder
const emailInput = page.getByPlaceholder("[email protected]");

Locating Elements by Alt Text

HTML lets you add a text description to images using the alt attribute, which is shown if the picture doesn’t load and read aloud by screen readers to describe the image. Playwright’s getByAltText method lets you locate an img element using its alt attribute.

For instance, Bright Data lists industries that use their data. You can retrieve the image used for the healthcare industry using its alt value, "healthcare use case":

The following code snippet will locate the image element:

const healthcareImage = page.getByAltText("healthcare use case");

Locating Elements by Title

The final Playwright locator method covered here is getByTitle, which locates an HTML element by its title attribute. You’ll see the title value when hovering over the HTML element with your pointer.

For example, the Bright Data help desk website contains a sign-in link with a title attribute:

You can use the following Playwright snippet to locate the link using its title attribute:

// Navigate to the Bright Data helpdesk webpage.
await page.goto("https://help.brightdata.com/hc/en-us");

// Locate the Sign in link using its title attribute
const signInLink = page.getByTitle("Opens a dialog");

Now that you’ve seen several methods you can use to locate elements on a web page using Playwright, let’s learn how to interact with and extract data from those elements.

Interacting with Elements

After locating an element on a web page, you can interact with it. For example, you may need to log into a website before scraping protected pages.

The following code snippet demonstrates different Playwright methods for interacting with elements on a web page. Each method is explained in the sections after the code:

// Import the Playwright library to use it
const playwright = require("playwright");

(async () => {
  // Launch a new instance of a Chromium browser
  const browser = await playwright.chromium.launch({
    // Set headless to false so you can see how Playwright is
    // interacting with the browser
    headless: false,
  });
  // Create a new Playwright context
  const context = await browser.newContext();
  // Create a new page/tab in the context.
  const page = await context.newPage();

  // Navigate to the Bright Data home page.
  await page.goto("https://brightdata.com/");

  // Locate and click on the signup button
  await page
    .locator("#hero_new")
    .getByRole("button", {
      name: "Start free trial",
    })
    .click();

  // Locate the first name field and FILL in a first name
  await page.locator(".hs_firstname input").fill("John");

  // Locate the last name field and FILL in a last name
  await page.locator(".hs_lastname input").fill("Smith");

  // Locate the email field and TYPE in an email address
  await page.locator(".hs_email input").type("[email protected]");

  // Locate the company size field and SELECT an option
  await page.locator(".hs_numemployees select").selectOption("1-9 employees");

  // Locate the terms and conditions checkbox and CHECK it.
  await page.locator(".legal-consent-container input").check();

  // Wait 10 seconds so you can see the result.
  await page.waitForTimeout(10000);

  // Close the browser
  await browser.close();
})();

Paste this snippet into your index.js file and rerun it using the following command:

node index.js

The Bright Data home page will appear briefly before displaying a sign-up dialog. Next, you’ll see how Playwright populates the Sign up form using the different methods in this snippet:

Clicking Elements

In the previous snippet, Playwright first clicked on the Sign up button so the dialog would appear:

// Locate and click on the signup button
await page
  .locator("#hero_new")
  .getByRole("button", {
    name: "Start free trial",
  })
  .click();

Playwright has two methods to click on elements:

  1. The click method simulates single-clicking an element.
  2. The dblclick method simulates double-clicking an element.

In this example, you only needed to single-click the Sign up button, which is why the snippet uses the click method.

Populating Text Fields

In this example, the snippet used two methods to fill text fields on the Sign up form:

// Locate the first name field and FILL in a first name
await page.locator(".hs_firstname input").fill("John");

// Locate the last name field and FILL in a last name
await page.locator(".hs_lastname input").fill("Smith");

// Locate the email field and TYPE in an email address
await page.locator(".hs_email input").type("[email protected]");

The snippet uses the fill and type methods on different fields. Both functions populate a text field, but they do so slightly differently:

  • The fill method inserts the specified value into the text field. While this works for most forms, some websites might block you from inserting an entire value.
  • The type method helps mitigate against this by simulating each keystroke to enter the specified value.

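You’ll probably use the fill method in most cases and fall back to the type method only where a site requires simulated typing. Note that newer Playwright releases mark the type method as deprecated in favor of pressSequentially, which simulates keystrokes in the same way.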
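The choice between the two approaches can be wrapped in a small helper, sketched below. The name enterText, the simulateTyping option, and the 50-millisecond delay are all illustrative. The helper prefers pressSequentially when the installed Playwright version provides it and otherwise falls back to type.

```javascript
// Enter `value` into the element behind `locator`.
// By default the value is inserted in one go; with simulateTyping,
// it is entered one keystroke at a time.
async function enterText(locator, value, { simulateTyping = false } = {}) {
  if (!simulateTyping) {
    await locator.fill(value);
    return;
  }
  if (typeof locator.pressSequentially === "function") {
    // Newer Playwright versions: keystroke-by-keystroke input.
    await locator.pressSequentially(value, { delay: 50 });
  } else {
    // Older Playwright versions: the type method used in this guide.
    await locator.type(value, { delay: 50 });
  }
}

// Example usage with a Playwright page:
// await enterText(page.locator(".hs_firstname input"), "John");
// await enterText(page.locator(".hs_lastname input"), "Smith", { simulateTyping: true });
```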

Selecting a Drop-Down Option

The Sign up form has a drop-down field to select the company size, which Playwright populated with “1-9 employees”:

// Locate the company size field and SELECT an option
await page.locator(".hs_numemployees select").selectOption("1-9 employees");

Playwright lets you use the selectOption method to populate drop-down fields on a form. The function lets you select a drop-down item based on value or label and choose multiple options in a multiselect.
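The sketch below shows two of the forms that selectOption accepts: an object with a label, and an array of values for a multiselect. The helper names are hypothetical, and any selector or option values passed in are placeholders for whatever the target form actually uses.

```javascript
// Select a single option by the label shown to the user.
// Resolves to the values of the options that ended up selected.
async function selectByLabel(page, selector, label) {
  return page.locator(selector).selectOption({ label });
}

// Select several options at once in a <select multiple> element,
// identifying each option by its value attribute.
async function selectMany(page, selector, values) {
  return page.locator(selector).selectOption(values);
}

// Example usage with a Playwright page:
// await selectByLabel(page, ".hs_numemployees select", "1-9 employees");
```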

Checking Radio Buttons and Checkboxes

Before submitting the form, you need to accept the terms and conditions. The following snippet checks the appropriate checkbox:

// Locate the terms and conditions checkbox and CHECK it.
await page.locator(".legal-consent-container input").check();

To modify a checkbox or radio button, you can use the check and uncheck methods:

  • The check method ensures the checkbox is checked.
  • The uncheck method ensures the checkbox is unchecked.
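As a brief sketch of how the two methods might be combined, the hypothetical helper below puts a checkbox into the requested state and then reads the state back with isChecked to confirm it.

```javascript
// Put the checkbox matched by `selector` into the requested state
// and report the state it actually ended up in.
async function setCheckbox(page, selector, shouldBeChecked) {
  const box = page.locator(selector);
  if (shouldBeChecked) {
    await box.check(); // no effect if it is already checked
  } else {
    await box.uncheck(); // no effect if it is already unchecked
  }
  return box.isChecked();
}

// Example usage with a Playwright page:
// const accepted = await setCheckbox(page, ".legal-consent-container input", true);
```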

Now that you’ve seen how Playwright lets you interact with HTML elements on a page, the next section will show you how to extract data from the page.

Extracting Data from Elements

Extracting data is essential for web scraping. Playwright lets you use several methods to retrieve different data types from the elements you’ve located. The following sections go through some of these methods.

Extracting Inner Text

The innerText method lets you extract the text inside an element. For example, the Bright Data home page has a hero element at the top:

You can extract the title of the hero on the Bright Data home page using the following snippet:

const headerText = await page.locator(".brd_hero__title.h1").innerText();
// headerText = "Turn websites\ninto structured data"

If your locator points to more than one element, you can retrieve the text in all the elements as an array of strings using the allInnerTexts method. For example, the Bright Data home page has a list of use cases for their data:

You can extract a list of all the Bright Data use cases using the following snippet:

const useCases = await page
  .locator(".section_cases_row_col .elementor-image-box-title")
  .allInnerTexts();
// useCases = [
//   'E-commerce',
//   'Social Media for Marketing',
//   'SERP & SEO',
//   'Ad Tech',
//   'Market Research',
//   'Travel',
//   'Financial Services',
//   'Healthcare',
//   'Real Estate',
//   'Data for Good'
// ]
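Extracted text often contains line breaks and stray whitespace, as the hero title above shows. A small post-processing step, sketched here with a hypothetical cleanTexts helper, can normalize the strings before you store them.

```javascript
// Collapse runs of whitespace (including line breaks) into single spaces,
// trim each string, and drop any strings that end up empty.
function cleanTexts(texts) {
  return texts
    .map((text) => text.replace(/\s+/g, " ").trim())
    .filter((text) => text.length > 0);
}

const cleaned = cleanTexts(["Turn websites\ninto structured data", "  Travel ", ""]);
console.log(cleaned);
// prints [ 'Turn websites into structured data', 'Travel' ]
```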

Extracting Inner HTML

Playwright also lets you extract the inner HTML of an element using the innerHTML method. For instance, you can get the HTML for the footer on the Bright Data home page using the following snippet:

const footerHtml = await page.locator("#footer").innerHTML();
// footerHtml = '<div class="container"><div class="footer__logo">...'

Extracting Attribute Values

You might need to extract data from attributes on an HTML element, such as the href attribute on a link. The following Playwright snippet demonstrates how you can scrape the href attribute of the Log in link:

const loginHref = await page.getByText("Log in").getAttribute("href");
// loginHref = '/cp/start'
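The extracted value is relative to the site, so you may want to turn it into a full URL before following it. One way to do that, sketched below with a hypothetical toAbsoluteUrl helper, is Node.js's built-in URL class.

```javascript
// Resolve a scraped href against the URL of the page it came from.
// getAttribute resolves to null when the attribute is missing,
// so null is passed through unchanged.
function toAbsoluteUrl(href, baseUrl) {
  if (href === null) return null;
  return new URL(href, baseUrl).href;
}

console.log(toAbsoluteUrl("/cp/start", "https://brightdata.com/"));
// prints https://brightdata.com/cp/start

// Example usage with a Playwright page:
// const href = await page.getByText("Log in").getAttribute("href");
// const loginUrl = toAbsoluteUrl(href, page.url());
```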

Screenshotting Pages

When scraping data, you may need to take screenshots for auditing purposes. You can use the screenshot method to do this. The function lets you configure several options, such as where to save the screenshot file and whether to take a full-page screenshot.

The following snippet takes a full-page screenshot of the Bright Data home page and saves it:

await page.screenshot({
  // Save the screenshot to the "homepage.png" file
  path: "homepage.png",
  // Take a screenshot of the entire page
  fullPage: true,
});
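For auditing, it can help if each screenshot gets a unique, sortable file name instead of overwriting the previous one. The following is a minimal sketch of one naming scheme. The helper auditScreenshotPath and the timestamp format are illustrative choices, not part of Playwright.

```javascript
// Build a file name such as "homepage-2023-03-01T12-00-00-000Z.png".
// Colons and dots in the ISO timestamp are replaced with dashes because
// some file systems do not allow colons in file names.
function auditScreenshotPath(name, takenAt = new Date()) {
  const stamp = takenAt.toISOString().replace(/[:.]/g, "-");
  return `${name}-${stamp}.png`;
}

// Example usage with a Playwright page:
// await page.screenshot({
//   path: auditScreenshotPath("homepage"),
//   fullPage: true,
// });
```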

Using Automated Scraping Services

The previous snippets detail how to locate, interact with, and extract data from a web page. These methods will let you scrape almost any data from a web page. However, they require effort, as you must identify appropriate elements before locating them. You also need to be aware of CAPTCHAs and rate limits when scraping multiple pages on a single website.
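One simple way to stay under a site's rate limits is to visit pages one at a time with a pause in between. The sketch below assumes a visit callback that you supply (for example, one that calls page.goto and extracts data). The names visitSequentially and sleep and the one-second default delay are illustrative.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call `visit(url)` for each URL in order, pausing `delayMs` between visits,
// and collect the results in the same order as the URLs.
async function visitSequentially(urls, visit, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // No pause is needed before the very first visit.
    if (i > 0) await sleep(delayMs);
    results.push(await visit(urls[i]));
  }
  return results;
}

// Example usage with a Playwright page:
// const titles = await visitSequentially(urls, async (url) => {
//   await page.goto(url);
//   return page.title();
// });
```

A fixed delay is the simplest option. The delay a particular site tolerates is something you would need to find out for that site.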

Bright Data offers several solutions that will let you focus on extracting data. Bright Data provides a Web Scraper IDE with ready-made JavaScript functions and templates to help you scrape from popular websites. You can also bypass CAPTCHAs using Web Unlocker and avoid rate limits and geolocation blocks using Bright Data’s proxy services. These services remove many of the hurdles you would otherwise handle yourself in Playwright, helping you scrape data faster and more easily.

Conclusion

In this article, you learned about Playwright, a library developed by Microsoft that helps scrape data from websites, and you learned how to use Playwright to locate, interact with, and extract data from elements on a web page. Finally, you saw how an automated scraping service such as Bright Data can simplify your web scraping processes.