Web scraping often requires you to navigate anti-bot mechanisms, load dynamic content using browser automation tools like Puppeteer, use proxy rotation to avoid IP blocks, and solve CAPTCHAs. Even with these strategies, scaling and maintaining stable sessions remains challenging.
This article teaches you how to transition from a traditional, proxy-based setup to a more streamlined solution using the Bright Data Scraping Browser for dynamic scraping. Both methods will be compared, covering configuration, performance, scalability, and complexity.
Note: The examples in this article are for educational purposes. Always consult the target website’s terms of service and comply with relevant laws and regulations before scraping any data.
Prerequisites
Before starting the tutorial, make sure you have the following prerequisites:
- Node.js and npm installed on your machine
- A basic understanding of JavaScript and the command line
- A text editor or IDE, like Visual Studio Code or WebStorm, for writing the code
- A free Bright Data account so that you can use their Scraping Browser
Start by creating a new Node.js project folder where you can store your code. Open your terminal or shell and create the directory using the following commands:
mkdir scraping-tutorial
cd scraping-tutorial
Initialize a new Node.js project:
npm init -y
The -y flag automatically answers yes to all questions, creating a package.json file with default settings.
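For reference, the generated package.json typically looks something like this (the exact fields can vary slightly between npm versions):
{
  "name": "scraping-tutorial",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}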
Proxy-Based Web Scraping
In a typical proxy-based approach, you use a browser automation tool like Puppeteer to interact with your target domain, load dynamic content, and extract data. While doing so, you integrate proxies to avoid IP bans and maintain anonymity.
Let’s quickly build a Puppeteer script that scrapes data from an e-commerce website through proxies.
Create a Web Scraping Script Using Puppeteer
Start by installing Puppeteer:
npm install puppeteer
Then, create a file called proxy-scraper.js (you can name it whatever you like) in the scraping-tutorial folder and add the following code:
const puppeteer = require("puppeteer");

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(books); // Print the collected data
  await browser.close();
})();
This script uses Puppeteer to scrape book titles and prices from the first five pages of the Books to Scrape website. It launches a headless browser, opens a new page, and navigates through each catalog page.
For each page, the script uses DOM selectors within page.evaluate() to extract book titles and prices, storing the results in an array. Once all pages are processed, the data is printed to the console and the browser is closed. This approach efficiently extracts data from a paginated website.
Test and run the code using the following command:
node proxy-scraper.js
Your output should look like this:
Navigating to: https://books.toscrape.com/catalogue/page-1.html
Navigating to: https://books.toscrape.com/catalogue/page-2.html
Navigating to: https://books.toscrape.com/catalogue/page-3.html
Navigating to: https://books.toscrape.com/catalogue/page-4.html
Navigating to: https://books.toscrape.com/catalogue/page-5.html
[
{ title: 'A Light in the Attic', price: '£51.77' },
{ title: 'Tipping the Velvet', price: '£53.74' },
{ title: 'Soumission', price: '£50.10' },
{ title: 'Sharp Objects', price: '£47.82' },
{ title: 'Sapiens: A Brief History of Humankind', price: '£54.23' },
{ title: 'The Requiem Red', price: '£22.65' },
…output omitted…
{
title: 'In the Country We Love: My Family Divided',
price: '£22.00'
}
]
Set Up Proxies
Proxies are commonly used in scraping setups to spread requests across multiple IP addresses and make them harder to trace back to a single source. A common approach is to maintain a pool of proxies and rotate them dynamically.
Place your proxies in an array or store them in a separate file if you like:
const proxies = [
"proxy1.example.com:port",
"proxy2.example.com:port"
// Add more proxies here
];
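If you prefer to keep the list out of your source code, you could also store it in a separate file and load it at startup; a minimal sketch, assuming a hypothetical proxies.json in the project root:
// proxies.json (hypothetical file): ["proxy1.example.com:port", "proxy2.example.com:port"]
const fs = require("fs");

// Read and parse the proxy list from disk
const proxies = JSON.parse(fs.readFileSync("./proxies.json", "utf8"));
console.log(`Loaded ${proxies.length} proxies`);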
Add Proxy Rotation Logic
Let’s enhance the code with logic that rotates through the proxy array each time you launch the browser. Update proxy-scraper.js to include the following code:
const puppeteer = require("puppeteer");

const proxies = [
  "proxy1.example.com:port",
  "proxy2.example.com:port"
  // Add more proxies here
];

(async () => {
  // Choose a random proxy
  const randomProxy =
    proxies[Math.floor(Math.random() * proxies.length)];

  // Launch Puppeteer with the proxy
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      `--proxy-server=${randomProxy}`,
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--ignore-certificate-errors",
    ],
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(`Using proxy: ${randomProxy}`);
  console.log(books); // Print the collected data
  await browser.close();
})();
Note: Instead of rotating the proxies manually, you can use a library like luminati-proxy to automate the process.
In this code, a random proxy is selected from the proxies list and applied to Puppeteer using the --proxy-server=${randomProxy} argument. To further reduce the chance of detection, you could also assign a random user agent string to each session (a brief sketch follows). The scraping logic then runs as before, and the proxy that was used to scrape the product data is logged.
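This user-agent rotation isn’t part of the script above; a minimal sketch using Puppeteer’s page.setUserAgent(), placed right after browser.newPage(), might look like this (the user-agent strings are only illustrative examples):
// Illustrative user-agent strings; in practice, use current, realistic values
const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

// Pick a random user agent and apply it to the page before navigating
const randomUserAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
await page.setUserAgent(randomUserAgent);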
When you run the code again, you should see output like before, but this time, the proxy that was used is also printed:
Navigating to: https://books.toscrape.com/catalogue/page-1.html
Navigating to: https://books.toscrape.com/catalogue/page-2.html
Navigating to: https://books.toscrape.com/catalogue/page-3.html
Navigating to: https://books.toscrape.com/catalogue/page-4.html
Navigating to: https://books.toscrape.com/catalogue/page-5.html
Using proxy: 115.147.63.59:8081
…output omitted…
Challenges with Proxy-Based Scraping
Although a proxy-based approach can work for many use cases, you may face some of the following challenges:
- Frequent blocks: Proxies might get blocked if the site has stringent anti-bot detection.
- Performance overheads: Rotating proxies and retrying requests slow down your data collection pipeline.
- Complex scalability: Managing and rotating a large proxy pool for optimal performance and availability is complex. It requires load balancing, preventing overuse of individual proxies, cool-down periods, and handling failures in real time (a simplified sketch of this kind of bookkeeping follows this list). The challenge grows with concurrent requests, as the system must evade detection while continuously monitoring and replacing blacklisted or underperforming IPs.
- Browser maintenance: Browser maintenance can be both technically challenging and resource-intensive. You need to continuously update and handle the browser’s fingerprint (cookies, headers, and other identifying attributes) to mimic real user behavior and evade advanced anti-bot controls.
- Cloud browser overhead: Cloud-based browsers add operational overhead through higher resource requirements and more complex infrastructure management, which drives up operating costs. Scaling browser instances for consistent performance further complicates the process.
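To illustrate the kind of bookkeeping the scalability point above refers to, here is a simplified, hypothetical sketch of a proxy pool with a cool-down window and failure tracking (the ProxyPool class and its thresholds are illustrative, not part of the tutorial code):
// Simplified, hypothetical proxy pool with cool-down and failure tracking
class ProxyPool {
  constructor(proxies, cooldownMs = 60000, maxFailures = 3) {
    this.cooldownMs = cooldownMs;
    this.maxFailures = maxFailures;
    // Track when each proxy was last used and how many times it has failed
    this.state = proxies.map((proxy) => ({ proxy, lastUsed: 0, failures: 0 }));
  }

  // Return a proxy that is outside its cool-down window and not blacklisted
  acquire() {
    const now = Date.now();
    const candidate = this.state.find(
      (s) => s.failures < this.maxFailures && now - s.lastUsed > this.cooldownMs
    );
    if (!candidate) throw new Error("No healthy proxy available");
    candidate.lastUsed = now;
    return candidate.proxy;
  }

  // Record a failed request so the proxy eventually drops out of rotation
  reportFailure(proxy) {
    const entry = this.state.find((s) => s.proxy === proxy);
    if (entry) entry.failures += 1;
  }
}
A real implementation would also need retry logic, health checks, and concurrency control, which is exactly the maintenance burden a managed solution removes.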
Dynamic Scraping with the Bright Data Scraping Browser
To help with these challenges, you can use a single API solution like the Bright Data Scraping Browser. It simplifies your operation, eliminates the need for manual proxy rotation and complex browser setups, and often leads to a higher success rate in data retrieval.
Set Up Your Bright Data Account
To start, log in to your Bright Data account and navigate to Proxies & Scraping, scroll down to Scraping Browser, and click Get Started:
Keep the default configuration and click Add to create a new Scraping Browser instance:
After you’ve created a Scraping Browser instance, take note of the Puppeteer URL as you’ll need it soon:
Adjust the Code to Use the Bright Data Scraping Browser
Now, let’s adjust the code so that instead of using rotating proxies, you connect directly to the Bright Data Scraping Browser endpoint.
Create a new file called brightdata-scraper.js and add the following code:
const puppeteer = require("puppeteer");

(async () => {
  // Bright Data Scraping Browser WebSocket endpoint
  const SBR_WS_ENDPOINT = "YOUR_BRIGHT_DATA_WS_ENDPOINT";

  // Connect Puppeteer to the Scraping Browser
  const browser = await puppeteer.connect({
    browserWSEndpoint: SBR_WS_ENDPOINT,
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(books); // Print the collected data
  await browser.close();
})();
Make sure you replace YOUR_BRIGHT_DATA_WS_ENDPOINT with the URL that you retrieved in the previous step.
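Rather than hardcoding the endpoint, you could also read it from an environment variable; a small sketch, assuming a variable named BRIGHT_DATA_WS_ENDPOINT (the name is just an example):
// Read the Scraping Browser endpoint from an environment variable (example name)
const SBR_WS_ENDPOINT = process.env.BRIGHT_DATA_WS_ENDPOINT;
if (!SBR_WS_ENDPOINT) {
  throw new Error("Set the BRIGHT_DATA_WS_ENDPOINT environment variable first");
}
You would then run the script with something like BRIGHT_DATA_WS_ENDPOINT="wss://..." node brightdata-scraper.js, which keeps credentials out of your source code.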
This code is similar to the previous script, but instead of maintaining a list of proxies and juggling between them, you connect directly to the Bright Data endpoint.
Run the script:
node brightdata-scraper.js
Your output should be the same as before, but now, you won’t need to manually rotate proxies or configure user agents. The Bright Data Scraping Browser handles everything—from proxy rotation to bypassing CAPTCHAs—ensuring uninterrupted data scraping.
Turn the Code into an Express Endpoint
If you want to integrate the Bright Data Scraping Browser into a larger application, consider exposing it as an Express endpoint.
Start by installing Express:
npm install express
Create a file called server.js and add the following code:
const express = require("express");
const puppeteer = require("puppeteer");

const app = express();
const PORT = 3000;

// Needed to parse JSON bodies:
app.use(express.json());

// Your Bright Data Scraping Browser WebSocket endpoint
const SBR_WS_ENDPOINT = "YOUR_BRIGHT_DATA_WS_ENDPOINT";

/**
 * POST /scrape
 * Body example:
 * {
 *   "baseUrl": "https://books.toscrape.com/catalogue/page-"
 * }
 */
app.post("/scrape", async (req, res) => {
  const { baseUrl } = req.body;

  if (!baseUrl) {
    return res.status(400).json({
      success: false,
      error: 'Missing "baseUrl" in request body.',
    });
  }

  try {
    // Connect to the Bright Data Scraping Browser
    const browser = await puppeteer.connect({
      browserWSEndpoint: SBR_WS_ENDPOINT,
    });
    const page = await browser.newPage();
    const books = [];

    // Example: scrape 5 pages of the base URL
    for (let i = 1; i <= 5; i++) {
      const url = `${baseUrl}${i}.html`;
      console.log(`Navigating to: ${url}`);

      await page.goto(url, { waitUntil: "networkidle0" });

      const pageBooks = await page.evaluate(() => {
        const data = [];
        document.querySelectorAll(".product_pod").forEach((item) => {
          const title = item.querySelector("h3 a")?.getAttribute("title") || "";
          const price = item.querySelector(".price_color")?.innerText || "";
          data.push({ title, price });
        });
        return data;
      });

      books.push(...pageBooks);
    }

    // Close the browser connection
    await browser.close();

    // Return JSON with the scraped data
    return res.json({
      success: true,
      books,
    });
  } catch (error) {
    console.error("Scraping error:", error);
    return res.status(500).json({
      success: false,
      error: error.message,
    });
  }
});

// Start the Express server
app.listen(PORT, () => {
  console.log(`Server is listening on http://localhost:${PORT}`);
});
In this code, you initialize an Express app, accept JSON payloads, and define a POST /scrape route. Clients send a JSON body containing the baseUrl, which the server uses to build the page URLs that the Bright Data Scraping Browser then navigates to and scrapes. As before, replace YOUR_BRIGHT_DATA_WS_ENDPOINT with the URL of your Scraping Browser instance.
Run your new Express server:
node server.js
To test the endpoint, you can use a tool like Postman (or any other REST client of your choice), or you can use curl from your terminal or shell like this:
curl -X POST http://localhost:3000/scrape \
-H 'Content-Type: application/json' \
-d '{"baseUrl": "https://books.toscrape.com/catalogue/page-"}'
Your output should look like this:
{"success":true,"books":[{"title":"A Light in the Attic","price":"£51.77"},{"title":"Tipping the Velvet","price":"£53.74"},{"title":"Soumission","price":"£50.10"},{"title":"Sharp Objects","price":"£47.82"},{"title":"Sapiens: A Brief History of Humankind","price":"£54.23"},{"title":"The Requiem Red","price":"£22.65"},{"title":"The Dirty Little Secrets of Getting Your Dream Job","price":"£33.34"},{"title":"The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull","price":"£17.93"},
… output omitted…
{"title":"Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)","price":"£53.90"},{"title":"Join","price":"£35.67"},{"title":"In the Country We Love: My Family Divided","price":"£22.00"}]}
Following is a diagram showing the contrast between the manual setup (rotating proxy) and the Bright Data Scraping Browser approach:
Manually managing rotating proxies requires constant attention and tuning, and it often results in frequent blocks and limited scalability.
Using the Bright Data Scraping Browser streamlines the process by eliminating the need to manage proxies or headers, while also delivering faster response times through optimized infrastructure. Its integrated anti-bot strategies boost success rates, making it less likely for you to get blocked or flagged.
All the code for this tutorial is available in this GitHub repository.
Calculate ROI
Switching from a manual proxy-based scraping setup to the Bright Data Scraping Browser can significantly cut development time and costs.
Traditional Setup
Scraping news websites daily requires the following:
- Initial development: ~50 hours ($5,000 USD at $100 USD/hour)
- Ongoing maintenance: ~10 hours/month ($1,000 USD) for code updates, infrastructure, scaling, and proxy management
- Proxy/IP costs: ~$250 USD/month (varies based on IP needs)
Total estimated monthly cost: ~$1,250 USD
Bright Data Scraping Browser Setup
- Development time: 5–10 hours ($1,000 USD)
- Maintenance: ~2–4 hours/month ($200 USD)
- No proxy or infrastructure management is needed
- Bright Data service costs:
- Traffic usage: $8.40 USD/GB (e.g., 30 GB/month = $252 USD)
Total estimated monthly cost: ~$450 USD
Automating proxy management and scaling the Bright Data Scraping Browser reduces both upfront development costs and ongoing maintenance, making large-scale data scraping more efficient and cost-effective.
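If you want to sanity-check these numbers yourself, here is a small sketch that reproduces the arithmetic above (the figures are the estimates from this section, not fixed prices):
// Rough monthly cost comparison using the estimates from this section
const traditional = { maintenance: 1000, proxies: 250 };
const brightData = { maintenance: 200, traffic: 8.4 * 30 }; // $8.40/GB * 30 GB

const traditionalTotal = traditional.maintenance + traditional.proxies; // ~$1,250
const brightDataTotal = brightData.maintenance + brightData.traffic; // ~$452

console.log(`Traditional setup: ~$${traditionalTotal}/month`);
console.log(`Scraping Browser setup: ~$${brightDataTotal}/month`);
console.log(`Estimated monthly savings: ~$${traditionalTotal - brightDataTotal}`);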
Conclusion
Transitioning from a traditional proxy-based web scraping configuration to the Bright Data Scraping Browser removes the hassle of proxy rotation and manual anti-bot handling.
Beyond fetching HTML, Bright Data also offers additional tools to streamline data extraction:
- Web Scrapers to help tidy up your data extraction
- Web Unlocker API to scrape tougher sites
- Datasets so you can access pre-collected, structured data
These solutions can simplify your scraping process, reduce workloads, and improve scalability.