Using Node Unblocker for Web Scraping

In this tutorial, you’ll learn what node-unblocker is, what advantages it offers in web scraping projects, and how to use it.

When web scraping with Node.js, you might encounter obstacles such as internet censorship and slow proxies. Thankfully, there’s a solution called node unblocker that can help.

Unblocker is a web proxy that enables developers to bypass internet censorship and access geo-restricted content. It’s an open source solution with benefits including fast data relay, easy customization options, and support for multiple protocols. With unblocker, you can overcome internet restrictions and efficiently scrape data from websites that would otherwise be inaccessible.

In this article, you’ll learn all about unblocker, including its advantages in web scraping projects. You’ll also learn how to use it to create a proxy that can be used to scrape geo-restricted content.

Advantages of Using Node Unblocker

Node Unblocker offers a wide array of benefits and functionalities that make it a valuable tool for internet users seeking unrestricted access to web content. In addition to being an open source solution, its other advantages include the following:

  • Bypasses internet censorship: One of unblocker’s key features is that it acts as an intermediary between the client and the target website. This feature is especially valuable in web scraping situations, enabling you to extract data from websites that may otherwise be inaccessible due to geo-restrictions or censorship.
  • Relays data fast and efficiently: Unblocker processes data and streams it to the client as it arrives instead of buffering entire responses first. This keeps the overhead added by the proxy low.
  • Is easy to use: Unblocker provides a user-friendly interface that’s great for users of all skill levels. If you want to integrate the solution into your project, unblocker offers an accessible API that’s easy to implement.
  • Can be highly customizable: With unblocker, developers have the flexibility to customize the proxy according to their specific scraping requirements. For instance, you can configure parameters like request headers and response handling, providing a personalized and efficient scraping process.
  • Supports multiple protocols: Unblocker supports various protocols such as HTTP, HTTPS, and WebSockets. This versatility enables seamless integration with different scraping scenarios, offering developers the flexibility and convenience to interact with a wide range of data sources.
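To make the customization point above more concrete, here is a minimal sketch of a request middleware that adjusts outgoing headers. It assumes the requestMiddleware option and the data.headers property described in the unblocker README; the function name setScraperHeaders and the header values are hypothetical and only serve as an illustration.

```javascript
// Hypothetical request middleware. unblocker calls each request middleware
// with a `data` object; `data.headers` is assumed to hold the headers that
// will be sent to the target website.
function setScraperHeaders(data) {
  // Illustrative values only; adjust them to your own project.
  data.headers["accept-language"] = "en-US,en;q=0.9";
  data.headers["user-agent"] = "my-scraper/1.0 (+https://example.com/contact)";
}

// Wiring sketch (requires `npm install unblocker`):
// const Unblocker = require("unblocker");
// const unblocker = new Unblocker({
//   prefix: "/proxy/",
//   requestMiddleware: [setScraperHeaders],
// });
```

Keeping the middleware a plain function that only touches the data object makes it easy to try out on its own before plugging it into the proxy.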

How to Get Started with Unblocker

Now that you know the benefits that unblocker offers, it’s time to get started using it. Before you begin, make sure that you have Node.js and npm installed on your system. You also need a web browser to test the project, a GitHub account to store the proxy code, and a free Render account to host the solution.

Once you’ve completed these prerequisites, it’s time to create the web proxy. To do so, create a folder named node-unblocker-proxy, open it in your terminal, and execute the following command to initialize a new Node.js project:

npm init -y

Then execute the following command to install the dependencies that you need:

npm install express unblocker

express is the web application framework that you use to set up a web server. unblocker is the npm package (the project itself is known as node-unblocker) that helps you create the web proxy.

Write the Script to Create the Proxy

Once all your dependencies are set up, it’s time to implement the web proxy script.

Create an index.js file in the project root folder and paste the following code into it:

// import required dependencies
const express = require("express");
const Unblocker = require("unblocker");

// create an express app instance
const app = express();
// create a new Unblocker instance
const unblocker = new Unblocker({ prefix: "/proxy/" });

// set the port
const port = 3000;

// add the unblocker middleware to the Express application
app.use(unblocker);

// listen on specified port
app.listen(port).on("upgrade", unblocker.onUpgrade);
console.log(`proxy running on http://localhost:${port}/proxy/`);

In this code, you import the required dependencies and create an instance of the Express app. Additionally, you create a new Unblocker instance, which allows for a wide range of configuration options. Here, you only set the prefix option, specifying the path that the proxied URLs should begin with.

Because unblocker exports an Express-compatible API, integrating it into an Express application is easy. All you have to do is call the use() method of the Express app instance and pass the Unblocker instance. Then you start the Express application using the listen() method. .on("upgrade", unblocker.onUpgrade) ensures that the WebSocket connections are correctly handled by unblocker.

Test the Proxy Locally

To test the proxy implementation locally, execute the following command in your terminal:

node index.js

You can also use the command DEBUG=unblocker:* node index.js if you want to see detailed information on each request made via the proxy.

Next, take any URL and prefix it with localhost:3000/proxy/ (e.g., localhost:3000/proxy/https://brightdata.com/), and open it in your web browser. The port must match the port value set in index.js.

You should see the Bright Data home page. Quickly inspect your browser’s Network tab, and you’ll see that all the requests are going through the proxy (you can see this by viewing the Domain column in the Network tab):

Local proxy test commands and browser check
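The prefixing step described above can also be done in code. The following is a minimal sketch, not part of unblocker itself: toProxiedUrl is a hypothetical helper, and the default "/proxy/" prefix simply mirrors the prefix option set in index.js.

```javascript
// Hypothetical helper: builds the URL to request when a target URL
// should be loaded through the proxy.
function toProxiedUrl(proxyBase, targetUrl, prefix = "/proxy/") {
  // new URL() throws a TypeError if targetUrl is not an absolute URL
  const target = new URL(targetUrl);
  // drop trailing slashes so the base and the prefix join cleanly
  const base = proxyBase.replace(/\/+$/, "");
  return `${base}${prefix}${target.href}`;
}

console.log(toProxiedUrl("http://localhost:3000", "https://brightdata.com/"));
```

Validating the target with new URL() up front means a malformed address fails early instead of producing a confusing response from the proxy.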

Deploy the Proxy to Render

Now that you’ve tested the proxy, it’s time to deploy it. Before you do that, open the package.json file in the project root folder and modify the scripts key-value pair with the following:

"scripts": {
   "start": "node index"
}

This provides a command to start the Express web server once it’s hosted on Render.
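The index.js script hard-codes port 3000. Hosting platforms commonly provide the port to listen on through a PORT environment variable, and Render’s documentation describes the same convention, so you may want to read the port from the environment and fall back to 3000 locally. The sketch below assumes that convention; resolvePort is a hypothetical helper name.

```javascript
// Hypothetical helper: picks the port from the environment when it is set
// to a valid value, and otherwise falls back to the local default.
function resolvePort(env = process.env, fallback = 3000) {
  const parsed = Number.parseInt(env.PORT, 10);
  const isValid = Number.isInteger(parsed) && parsed > 0 && parsed < 65536;
  return isValid ? parsed : fallback;
}

// In index.js, this could replace the hard-coded value:
// const port = resolvePort();
console.log(resolvePort({ PORT: "10000" }), resolvePort({}));
```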

To deploy the web proxy, upload the proxy code to a GitHub repository. Then sign in to your Render account:

Setting up and deploying proxy on Render.

Click the New + button and select Web Service:

Connect your web proxy repository by selecting the Connect button. You might need to configure your account for Render to be able to access the repository. This is only necessary if you haven’t configured Render to access the specific GitHub repository:

Fill out the required details for your web service and select Create Web Service at the bottom of the page:

You can leave the start command as is if you prefer using Yarn, or you can change it to npm run start if you want to use npm.

After the web proxy has been successfully deployed, it’s time to test it. Take any URL and prefix it with <DEPLOYED-APP-URL>/proxy/ (e.g., https://node-web-proxy-gvn6.onrender.com/proxy/https://brightdata.com/). Then open it in your web browser.

Inspect your browser’s Network tab, and you should see that all the requests are going through the deployed proxy.
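If you’d rather script this check than open a browser, a small function can request a proxied URL and report the status code. This is only a sketch: it assumes Node.js 18 or later (which ships a global fetch), checkProxy is a hypothetical name, and the fetchImpl parameter exists solely so the function can be exercised without network access.

```javascript
// Hypothetical health check: requests a proxied URL and reports the outcome.
// fetchImpl defaults to the global fetch available in Node.js 18+.
async function checkProxy(proxiedUrl, fetchImpl = fetch) {
  try {
    const response = await fetchImpl(proxiedUrl, { redirect: "follow" });
    return { ok: response.ok, status: response.status };
  } catch (error) {
    // network failures (DNS errors, refused connections) end up here
    return { ok: false, status: null, error: error.message };
  }
}

// Usage (replace <DEPLOYED-APP-URL> with your Render URL):
// checkProxy("<DEPLOYED-APP-URL>/proxy/https://brightdata.com/").then(console.log);
```

A successful status only shows that the proxy answered; the browser’s Network tab is still the better way to confirm that follow-up requests for assets also go through the proxy.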

Use the Proxy to Make Scraping Requests

Once you’ve verified that all the requests are going through the deployed proxy, it’s time to make a scraping request. In this tutorial, you’ll use the Puppeteer library, but other scraping tools, such as Cheerio (paired with an HTTP client) or Nightmare, work too.

If you don’t already have Puppeteer installed, do so now by running npm i puppeteer. Then create a scrape.js file in the project root folder and add the following code:

// import puppeteer
const puppeteer = require("puppeteer");

const scrapeData = async () => {
   // launch the browser
   const browser = await puppeteer.launch({
    headless: false,
   });

   // open a new page and navigate to the defined URL
   const page = await browser.newPage();
   await page.goto("<DEPLOYED-APP-URL>/proxy/https://brightdata.com/blog");

   // get the content of the webpage
   const data = await page.evaluate(() => {
    // variable to hold all the posts data
    let blogData = [];

    // extract all elements with the specified class
    const posts = document.querySelectorAll(".post_item");

    // loop through the posts object, extract required data and push it to the blogData array
    for (const post of posts) {
        const title = post.querySelector("h5").textContent;
        const link = post.href;
        const author = post
            .querySelector(".author_box")
            .querySelector(".author_box__details")
            .querySelector("div").textContent;

        const article = { title, link, author };

        blogData.push(article);
    }

    return blogData;
   });

   // log the data to the console
   console.log(data);

   // close the browser instance
   await browser.close();
};

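// call the scrapeData function and report any failure
scrapeData().catch((error) => {
   console.error("Scraping failed:", error);
   process.exitCode = 1;
});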

Remember to replace <DEPLOYED-APP-URL> with the URL of the app that you deployed on Render.

This code snippet launches Puppeteer and scrapes blog post data from the Bright Data blog through the proxy. All the blog post cards on the Bright Data website have the class name post_item, so the script selects them with the .post_item selector. It then loops through the resulting NodeList; extracts the title, link, and author of each post; pushes this data into the blogData array; and finally logs the array to the console.
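One detail to watch for: because the page is loaded through the proxy, the link values you extract may point at the proxy (for example, <DEPLOYED-APP-URL>/proxy/https://brightdata.com/...) rather than at the original site, since unblocker rewrites URLs in the HTML it relays. If you want to store the original addresses, a helper along these lines can strip the prefix again. stripProxyPrefix is a hypothetical name, and the default "/proxy/" prefix mirrors the value configured in index.js.

```javascript
// Hypothetical helper: recovers the original URL from a link that was
// rewritten to pass through the proxy. Links that do not start with the
// proxy prefix are returned unchanged.
function stripProxyPrefix(link, prefix = "/proxy/") {
  const url = new URL(link);
  if (!url.pathname.startsWith(prefix)) {
    return link;
  }
  // everything after the prefix is the original absolute URL
  return (url.pathname + url.search + url.hash).slice(prefix.length);
}

console.log(
  stripProxyPrefix("https://node-web-proxy-gvn6.onrender.com/proxy/https://brightdata.com/blog")
);
```

You could apply it to each scraped item after page.evaluate() returns, for example with data.map((post) => ({ ...post, link: stripProxyPrefix(post.link) })).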

Conclusion

Node Unblocker provides a robust solution for web scraping in Node.js, offering developers the ability to bypass internet censorship and access geo-restricted content. Its user-friendly interface, extensive customization options, and support for multiple protocols make it a valuable tool for efficiently scraping data from websites. In this guide, you learned all about unblocker, its advantages in web scraping projects, and how you can use it.

In today’s data-driven world, web scraping has become an indispensable tool for gathering valuable insights and information. However, web scraping comes with its fair share of challenges, such as IP blocking, rate limiting, and geo-restrictions, which can impede data collection efforts and hinder the acquisition of crucial data.

Bright Data offers an all-encompassing platform that addresses these challenges. With a vast network of residential, ISP, datacenter, and mobile IPs, Bright Data allows users to route their scraping requests through a diverse range of IP addresses from around the world. This not only ensures anonymity but also provides the ability to access geo-restricted content and overcome obstacles that may hinder data collection efforts.

Not sure which Bright Data proxy you need? Talk to one of our data experts and find the best solution for your needs.