Web Scraping With JavaScript and Node.js

We will cover why frontend JavaScript isn’t the best option for web scraping and will teach you how to build a Node.js scraper from scratch.
Antonello Zanini
28-Dec-2022


Web Scraping With Frontend JavaScript

When it comes to web scraping, frontend JavaScript is a limited solution. First, you would have to run your JavaScript web scraping script directly from the browser console, which is not something you can do programmatically.

In particular, you can scrape data from a page in the console as follows:

Running a JS web scraping script in the frontend
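As a concrete sketch of that idea, the snippet below collects the text of every `<h2>` heading on a page. Note that this is a hypothetical example (the function name and target elements are made up for illustration), not the exact script from the screenshot above:

```javascript
// a hypothetical console snippet: extract the text of all <h2> headings
function scrapeHeadings(doc) {
  const titles = []
  doc.querySelectorAll("h2").forEach((headingElement) => {
    titles.push(headingElement.textContent.trim())
  })
  return titles
}

// in the browser console, you would call it on the live page:
// scrapeHeadings(document)
```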

Second, if you wanted to scrape data from other web pages, you would have to download them via AJAX. But do not forget that web browsers apply a Same-Origin Policy to AJAX requests. So, with frontend JavaScript, you can only access web pages within the same origin.

Let’s understand what this means with a simple example. Let’s assume you are visiting a page from brightdata.com. Then, your frontend JavaScript web scraping script could only download web pages under the brightdata.com domain.

Note that this does not mean JavaScript is a bad technology for web crawling. Actually, Node.js allows you to run JavaScript on servers and avoid the two limitations presented above.

Let’s now understand how you can build a JavaScript web scraper with Node.js.

Prerequisites

Before you start working on the Node.js web scraping app, you need to meet the following prerequisites:

  • Node.js 18+ with npm 8+: Any LTS (Long Term Support) version of Node.js 18+ including npm will be fine. This tutorial is based on Node.js 18.12 with npm 8.19, which represents the latest LTS version of Node.js at the time of writing.
  • An IDE supporting JavaScript: The Community Edition of IntelliJ IDEA is the IDE chosen for this tutorial, but any other IDE with JavaScript and Node.js support will do.

Click the links above and follow the installation wizards to set up everything you need. You can verify that Node.js was installed correctly by launching the command below in your terminal:

node -v

This should return something like:

v18.12.1

Similarly, verify that npm was installed correctly with

npm -v 

This should return a string like:

8.19.2

The two commands above indicate the version of Node.js and npm available globally on your machine, respectively.

Fantastic! You are now ready to see how to perform JavaScript web scraping in Node.js!

Best JavaScript Web Scraping Libraries for Node.js

Let’s explore the best JavaScript libraries for web scraping in Node.js:

  • Axios: An easy-to-use library that helps you make HTTP requests in JavaScript. You can use Axios both in the browser and in Node.js, and it represents one of the most popular JavaScript HTTP clients available.
  • Cheerio: A lightweight library that provides a jQuery-like API to explore HTML and XML documents. You can use Cheerio to parse an HTML document, select HTML elements, and extract data from them. In other words, Cheerio offers an advanced web scraping API.
  • Selenium: A library that supports several programming languages and that you can use to build automated tests for web applications. You can also use its headless browser capabilities for web scraping purposes.
  • Playwright: A tool for creating automated test scripts for web applications developed by Microsoft. It offers a way to instruct the browser to perform specific actions. So, you can use Playwright for web scraping as a headless browser solution.
  • Puppeteer: A tool for automating the testing of web applications developed by Google. Puppeteer is built on top of the Chrome DevTools protocol. Just like Selenium and Playwright, it allows you to interact programmatically with the browser as a human user would. Learn more about the differences between Selenium and Puppeteer.

Building a JavaScript Web Scraper in Node.js

Here, you will learn how to build a JavaScript web scraper in Node.js that is able to automatically extract data from a website. In detail, the target webpage will be the Bright Data home page. The goal of the Node.js web scraping process will be to select the HTML elements of interest from the page, retrieve data from them, and convert the scraped data into a more useful format.

At the time of writing, this is what the Bright Data home page looks like:

Bright Data home page gif
A general view of the Bright Data home page

As you can see, the Bright Data home page contains a lot of data and information in different formats, from text descriptions to images. It also features many useful links. You will learn how to retrieve all this data.

Let’s now take a look at how to scrape data with Node.js in a step-by-step tutorial!

Step 1: Set up a Node.js Project

First, create the folder that will contain your Node.js web scraping project with:

mkdir web-scraper-nodejs

You should now have an empty web-scraper-nodejs directory. Note that you can give the project folder whatever name you want. Enter the folder with:

cd web-scraper-nodejs

Now, initialize an npm project with:

npm init -y

This command will set up a new npm project for you. Note that the -y flag is required to make npm initialize a default project without going through an interactive process. If you omit the -y flag, you will be asked some questions in the terminal.

web-scraper-nodejs should now contain a package.json that looks as follows:

{
  "name": "web-scraper-nodejs",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}

Now, create an index.js file in the root folder of your project and initialize it as below:

// index.js

console.log("Hello, World!")

This JavaScript file will contain the Node.js web scraping logic.

Open your package.json file and add the following script in the scripts section:

"start": "node index.js"

You can now run the command below in your terminal to launch your Node.js script:

npm run start

This should return:

Hello, World!

This means that your Node.js app is working correctly. Now, open the project in your IDE and get ready to write some scraping logic in Node.js!

If you are an IntelliJ IDEA user, you should be seeing the following:

Step 2: Install Axios and Cheerio

It is time to install the dependencies required to implement the web scraper in Node.js. To figure out which JavaScript web scraping libraries you should adopt, visit the target web page, right-click on a blank section, and select the “Inspect” option. This should open the DevTools window of your browser. In the Network tab, take a look at the Fetch/XHR section.

Note that the Fetch/XHR section is almost empty

Above, you can see the AJAX requests performed by your target web page. If you open the three XHR requests executed by the website, you will see they do not return interesting data. In other words, the desired data is directly embedded in the source code of the web page. This is what usually happens with server-side rendered websites.

The target web page does not rely on JavaScript to retrieve data or for rendering purposes. Thus, you do not need a tool that is able to run JavaScript in the browser. In other words, you do not have to use a headless browser library to extract data from the target web page. You can use such a library, but it is not necessary.

Libraries that provide headless browser capabilities open web pages in a real browser, which introduces overhead because browsers are heavy applications. You can easily avoid this overhead by opting for Cheerio along with Axios.

So, install cheerio and axios with:

npm install cheerio axios

Then, import cheerio and axios by adding the following two lines of code to index.js:

// index.js

const cheerio = require("cheerio")
const axios = require("axios")

Let’s now code a Node.js web scraping script that performs web scraping with Cheerio and Axios!

Step 3: Download your target website

Use Axios to connect to your target website with the following lines of code:

// downloading the target web page 
// by performing an HTTP GET request in Axios
const axiosResponse = await axios.request({
    method: "GET",
    url: "https://brightdata.com",
})

Thanks to the Axios request() method, you can execute any HTTP request. In detail, if you want to download the source code of a web page, you have to perform an HTTP GET request to its URL. Normally, Axios returns a Promise immediately. You can wait for that Promise to resolve and get its value with the await keyword.

Note that if request() fails, an Error will be thrown. This can happen for several reasons, from an invalid URL to a temporarily unavailable server. Also, do not forget that several websites implement anti-scraping measures. One of the most popular ones involves blocking requests that do not have a valid User-Agent HTTP header. Learn more about User-Agents for web scraping.
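Regarding failures: a common pattern is to wrap the request in a try...catch block. The sketch below uses fakeRequest as a stand-in for axios.request (a hypothetical helper, used here only to keep the example self-contained):

```javascript
// fakeRequest stands in for axios.request: it throws an Error on failure
async function fakeRequest(url) {
  if (!url.startsWith("http")) {
    throw new Error("Invalid URL")
  }
  return { data: "<html></html>" }
}

async function downloadPage(url) {
  try {
    const response = await fakeRequest(url)
    return response.data
  } catch (error) {
    // the request failed: invalid URL, network error, or anti-bot block
    return null
  }
}
```

The same try...catch structure works unchanged around a real axios.request() call.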

By default, Axios will use the following User-Agent:

axios <axios_version>

This is not what the User-Agent used by a browser looks like. So, anti-scraping technologies may detect and block your Node.js web scraper.

Set a valid User-Agent header in Axios by adding the following attribute to the object passed to request():

headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

The headers attribute allows you to set any HTTP header in Axios.

Your index.js file should now look as follows:

// index.js

const cheerio = require("cheerio")
const axios = require("axios")

async function performScraping() {
    // downloading the target web page
    // by performing an HTTP GET request in Axios
    const axiosResponse = await axios.request({
        method: "GET",
        url: "https://brightdata.com",
        headers: {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        }
    })
}

performScraping()

Note that you can use await only in functions marked with async. This is why you have to embed your JavaScript web scraping logic in the async performScraping() function.
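To see this rule in isolation, consider the minimal sketch below, where Promise.resolve() stands in for any asynchronous operation:

```javascript
// await is only valid inside a function marked async
async function getMessage() {
  const message = await Promise.resolve("Hello")
  return message
}

// async functions return a Promise, so you consume the result with then()
getMessage().then((message) => console.log(message))
```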

Let’s now spend some time analyzing the target web page to define a web scraping strategy.

Step 4: Inspect the HTML page

If you take a look at the Bright Data home page, you will see a list of industries where you can use Bright Data. This is interesting data to scrape.

Right-click on one of these HTML elements and select “Inspect”:

The DevTools window related to a target HTML element

By analyzing the HTML code of the selected node, you will see that the card is an <a> HTML element. Specifically, this <a> contains:

  1. A <figure> HTML element containing the image associated to the industry field
  2. A <div> HTML element containing the name of the industry field

Now, notice the CSS classes that characterize those HTML elements. Using them, you will be able to define the CSS selectors required to select those HTML elements from the DOM. In detail, note that the .e-container cards are contained in the .elementor-element-7a85e3a8 <div>. Then, given a card, you can extract all its relevant data with the following CSS selectors:

  1. .elementor-image-box-img img
  2. .elementor-image-box-content .elementor-image-box-title

Similarly, you can apply the same logic to define the CSS selectors required to:

  • Extract the reasons why Bright Data is the industry leader.
  • Select the reasons that make the customer experience offered by Bright Data the best in the market.

In other words, the target web page has three scraping goals:

  1. Data on the industries where you can take advantage of Bright Data.
  2. Data on the reasons why Bright Data is the industry leader.
  3. Data about why Bright Data offers the best customer experience in the industry.

Step 5: Select HTML elements with Cheerio

Cheerio offers several ways to select HTML elements from a web page. But first, you have to initialize Cheerio with:

// parsing the HTML source of the target web page with Cheerio
const $ = cheerio.load(axiosResponse.data)

The Cheerio load() method accepts HTML content in string form. Note that the Axios response object contains the data returned by the HTTP request in the data attribute. In this case, data will store the HTML source code of the web page returned by the server. So, you pass axiosResponse.data to load() to initialize Cheerio.

You should call the Cheerio variable $ because Cheerio shares basically the same syntax as jQuery. This way, you will be able to copy jQuery snippets from the Internet.

You can select an HTML element with Cheerio by using its class with:

const htmlElement = $(".elementClass")

Similarly, you can retrieve an HTML element by ID with:

const htmlElement = $("#elementId")

In detail, you can select HTML elements by passing any valid CSS selector to $, just as you would do in jQuery. You can also chain selection logic with the find() method:

// retrieving the list of industry cards
const industryCards = $(".elementor-element-7a85e3a8").find(".e-container")

find() gives you access to the descendants of the current HTML element filtered by a CSS selector. You can then iterate on a list of Cheerio nodes with the each() method, as follows:

// iterating over the list of industry cards
$(".elementor-element-7a85e3a8")
    .find(".e-container")
    .each((index, element) => {
         // scraping logic...
    })

Let’s now learn how to use Cheerio to extract data from the HTML elements of interest.

Step 6: Scrape data from a target webpage with Cheerio

You can expand the logic shown previously to extract the desired data from the selected HTML elements as below:

// initializing the data structure
// that will contain the scraped data
const industries = []

// scraping the "Learn how web data is used in your market" section
$(".elementor-element-7a85e3a8")
    .find(".e-container")
    .each((index, element) => {
        // extracting the data of interest
        const pageUrl = $(element).attr("href")
        const image = $(element).find(".elementor-image-box-img img").attr("data-lazy-src")
        const name = $(element).find(".elementor-image-box-content .elementor-image-box-title").text()

        // filtering out uninteresting data
        if (name && pageUrl) {
            // converting the data extracted into a more
            // readable object
            const industry = {
                url: pageUrl,
                image: image,
                name: name
            }

            // adding the object containing the scraped data
            // to the industries array
            industries.push(industry)
        }
    })

This web scraping Node.js snippet selects all industry cards from the Bright Data home page. Then, it iterates over all the HTML card elements. For each card, it scrapes the URL of the web page associated with the card, the image, and the name of the industry. Thanks to the attr() and text() methods from Cheerio, you can retrieve the HTML attribute value and text, respectively. Finally, it stores the scraped data in an object and adds it to the industries array.

At the end of the each() loop, industries will contain all data of interest related to the first scraping goal. Let’s now see how to achieve the other two goals as well.

Similarly, you can scrape the data backing why Bright Data is the industry leader as follows:

const marketLeaderReasons = []

// scraping the "What makes Bright Data
// the undisputed industry leader" section
$(".elementor-element-ef3e47e")
    .find(".elementor-widget")
    .each((index, element) => {
        const image = $(element).find(".elementor-image-box-img img").attr("data-lazy-src")
        const title = $(element).find(".elementor-image-box-title").text()
        const description = $(element).find(".elementor-image-box-description").text()

        const marketLeaderReason = {
            title: title,
            image: image,
            description: description,
        }

        marketLeaderReasons.push(marketLeaderReason)
    })

Lastly, you can scrape the data on why Bright Data offers a great customer experience with:

const customerExperienceReasons = []
// scraping the "The best customer experience in the industry" section
$(".elementor-element-288b23cd .elementor-text-editor")
    .find("li")
    .each((index, element) => {
        const title = $(element).find("strong").text()
        // since the title is part of the text, you have
        // to remove it to get only the description
        const description = $(element).text().replace(title, "").trim()

        const customerExperienceReason = {
            title: title,
            description: description,
        }

        customerExperienceReasons.push(customerExperienceReason)
    })
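The title-removal trick above is easy to verify on plain strings. With text like that on the page, the replace() and trim() calls isolate the description:

```javascript
// the <li> text contains both the title and the description
const fullText = "You ask, we develop New feature releases every day"
const title = "You ask, we develop"

// removing the title and trimming whitespace leaves only the description
const description = fullText.replace(title, "").trim()

console.log(description) // "New feature releases every day"
```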

Congrats! You just learned how to achieve all your three Node.js web scraping goals!

Keep in mind that you can scrape data from other web pages by following the links you discovered in the current page. This is what web crawling is about. Therefore, you can define web scraping logic to extract data from those pages as well.
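As a minimal sketch of that idea, a crawler could iterate over the scraped URLs and download each page in turn. Here, downloadPage is a placeholder for the Axios download logic shown earlier:

```javascript
// visiting each discovered URL in sequence
async function crawl(urls, downloadPage) {
  const pages = []
  for (const url of urls) {
    // downloadPage stands in for the Axios GET request logic
    pages.push(await downloadPage(url))
  }
  return pages
}

// usage sketch (industries comes from the scraping step above):
// crawl(industries.map((industry) => industry.url), downloadPage)
```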

industries, marketLeaderReasons, and customerExperienceReasons will store all the scraped data in JavaScript objects. Let’s learn how to convert it to a more useful format.

Step 7: Convert the extracted data to JSON

JSON is one of the best data formats when it comes to JavaScript. This is because JSON derives from JavaScript and is the format generally used by APIs to accept or return data. So, chances are that you will have to convert the scraped data to JSON. You can easily achieve this with the logic below:

// transforming the scraped data into a general object
const scrapedData = {
    industries: industries,
    marketLeader: marketLeaderReasons,
    customerExperience: customerExperienceReasons,
}

// converting the scraped data object to JSON
const scrapedDataJSON = JSON.stringify(scrapedData)

First, you have to create a JavaScript object containing all the scraped data. Then, you can transform that JavaScript object into JSON with JSON.stringify().

scrapedDataJSON will contain the following JSON data:

{
  "industries": [
    {
      "url": "https://brightdata.com/use-cases/ecommerce",
      "image": "https://brightdata.com/wp-content/uploads/2022/07/E_commerce.svg",
      "name": "E-commerce"
    },

    // ...

    {
      "url": "https://brightdata.com/use-cases/data-for-good",
      "image": "https://brightdata.com/wp-content/uploads/2022/07/Data_for_Good_N.svg",
      "name": "Data for Good"
    }
  ],
  "marketLeader": [
    {
      "title": "Most reliable",
      "image": "https://brightdata.com/wp-content/uploads/2022/01/reliable.svg",
      "description": "Highest quality data, best network uptime, fastest output "
    },

    // ...

    {
      "title": "Most efficient",
      "image": "https://brightdata.com/wp-content/uploads/2022/01/efficient.svg",
      "description": "Minimum in-house resources needed"
    }
  ],
  "customerExperience": [
    {
      "title": "You ask, we develop",
      "description": "New feature releases every day"
    },

    // ...

    {
      "title": "Tailored solutions",
      "description": "To meet your data collection goals"
    }
  ]
}

Congrats! You started by connecting to a website, and now you can scrape its data and convert it to JSON. You are now ready to have a look at the complete web scraping Node.js script.

Putting it all together

This is what the Node.js web scraper looks like:

// index.js

const cheerio = require("cheerio")
const axios = require("axios")

async function performScraping() {
    // downloading the target web page
    // by performing an HTTP GET request in Axios
    const axiosResponse = await axios.request({
        method: "GET",
        url: "https://brightdata.com/",
        headers: {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
        }
    })

    // parsing the HTML source of the target web page with Cheerio
    const $ = cheerio.load(axiosResponse.data)

    // initializing the data structures
    // that will contain the scraped data
    const industries = []
    const marketLeaderReasons = []
    const customerExperienceReasons = []

    // scraping the "Learn how web data is used in your market" section
    $(".elementor-element-7a85e3a8")
        .find(".e-container")
        .each((index, element) => {
            // extracting the data of interest
            const pageUrl = $(element).attr("href")
            const image = $(element).find(".elementor-image-box-img img").attr("data-lazy-src")
            const name = $(element).find(".elementor-image-box-content .elementor-image-box-title").text()

            // filtering out uninteresting data
            if (name && pageUrl) {
                // converting the data extracted into a more
                // readable object
                const industry = {
                    url: pageUrl,
                    image: image,
                    name: name
                }

                // adding the object containing the scraped data
                // to the industries array
                industries.push(industry)
            }
        })

    // scraping the "What makes Bright Data
    // the undisputed industry leader" section
    $(".elementor-element-ef3e47e")
        .find(".elementor-widget")
        .each((index, element) => {
            // extracting the data of interest
            const image = $(element).find(".elementor-image-box-img img").attr("data-lazy-src")
            const title = $(element).find(".elementor-image-box-title").text()
            const description = $(element).find(".elementor-image-box-description").text()

            // converting the data extracted into a more
            // readable object
            const marketLeaderReason = {
                title: title,
                image: image,
                description: description,
            }

            // adding the object containing the scraped data
            // to the marketLeaderReasons array
            marketLeaderReasons.push(marketLeaderReason)
        })

    // scraping the "The best customer experience in the industry" section
    $(".elementor-element-288b23cd .elementor-text-editor")
        .find("li")
        .each((index, element) => {
            // extracting the data of interest
            const title = $(element).find("strong").text()
            // since the title is part of the text, you have
            // to remove it to get only the description
            const description = $(element).text().replace(title, "").trim()

            // converting the data extracted into a more
            // readable object
            const customerExperienceReason = {
                title: title,
                description: description,
            }

            // adding the object containing the scraped data
            // to the customerExperienceReasons array
            customerExperienceReasons.push(customerExperienceReason)
        })

    // transforming the scraped data into a general object
    const scrapedData = {
        industries: industries,
        marketLeader: marketLeaderReasons,
        customerExperience: customerExperienceReasons,
    }

    // converting the scraped data object to JSON
    const scrapedDataJSON = JSON.stringify(scrapedData)

    // storing scrapedDataJSON in a database via an API call...
}

performScraping()

As shown here, you can build a web scraper in Node.js in less than 100 lines of code. With Cheerio and Axios, you can download an HTML web page, parse it, and automatically retrieve all its data. Then, you can easily convert the scraped data to JSON. This is what Node.js web scraping is about.

Launch your web scraper in Node.js with:

npm run start

Et voilà! You just learned how to perform JavaScript web scraping in Node.js!

Conclusion

In this tutorial, you saw why web scraping in the frontend with JavaScript is a limited solution, and why Node.js is a better option. Also, you had a look at what you need to create a Node.js web scraping script, and how you can scrape data from the web in JavaScript. Specifically, you learned how to use Cheerio and Axios to create a JavaScript web scraping application in Node.js based on a real-world example. As you learned, web scraping with Node.js takes only a few lines of code.

But keep in mind that web scraping might not be that easy, because there are many challenges you may have to address. In particular, anti-scraping and anti-bot solutions are becoming increasingly common. Luckily, you can easily avoid all this with the next-generation web scraping tools provided by Bright Data. Don’t want to deal with web scraping? Explore our datasets.

If you want to find out more about how to avoid being blocked, adopt a web proxy from one of the several proxy services available in Bright Data or start using the advanced Web Unlocker.

Antonello Zanini

Antonello is a software engineer, but he prefers to call himself a technology bishop. Spreading knowledge through writing is his mission.
