Web scraping often requires you to navigate anti-bot mechanisms, load dynamic content using browser automation tools like Puppeteer, use proxy rotation to avoid IP blocks, and solve CAPTCHAs. Even with these strategies, scaling and maintaining stable sessions remains challenging.
This article teaches you how to transition from a traditional, proxy-based setup to a more streamlined solution using the Bright Data Scraping Browser for dynamic scraping. Both methods will be compared, covering configuration, performance, scalability, and complexity.
Note: The examples in this article are for educational purposes. Always consult the target website’s terms of service and comply with relevant laws and regulations before scraping any data.
Prerequisites
Before starting the tutorial, make sure you have the following prerequisites:
- Node.js and npm installed on your machine
- A basic understanding of JavaScript and the command line
- A text editor or IDE, like Visual Studio Code or WebStorm, for writing the code
- A free Bright Data account so that you can use their Scraping Browser
Start by creating a new Node.js project folder where you can store your code. Open your terminal or shell and create the directory using the following commands:
mkdir scraping-tutorial
cd scraping-tutorial
Initialize a new Node.js project:
npm init -y
The -y flag automatically answers yes to all questions, creating a package.json file with default settings.
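For reference, the generated package.json typically looks something like this (the exact fields can vary slightly between npm versions):
{
  "name": "scraping-tutorial",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}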
Proxy-Based Web Scraping
In a typical proxy-based approach, you use a browser automation tool like Puppeteer to interact with your target domain, load dynamic content, and extract data. While doing so, you integrate proxies to avoid IP bans and maintain anonymity.
Let’s quickly build a Puppeteer script that scrapes data from an e-commerce website through proxies.
Create a Web Scraping Script Using Puppeteer
Start by installing Puppeteer:
npm install puppeteer
Then, create a file called proxy-scraper.js (you can name it whatever you like) in the scraping-tutorial folder and add the following code:
const puppeteer = require("puppeteer");

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(books); // Print the collected data
  await browser.close();
})();
This script uses Puppeteer to scrape book titles and prices from the first five pages of the Books to Scrape website. It launches a headless browser, opens a new page, and navigates through each catalog page.
For each page, the script uses DOM selectors within page.evaluate() to extract book titles and prices, storing the results in an array. Once all pages are processed, the data is printed to the console and the browser is closed. This approach efficiently extracts data from a paginated website.
Test and run the code using the following command:
node proxy-scraper.js
Your output should look like this:
Navigating to: https://books.toscrape.com/catalogue/page-1.html
Navigating to: https://books.toscrape.com/catalogue/page-2.html
Navigating to: https://books.toscrape.com/catalogue/page-3.html
Navigating to: https://books.toscrape.com/catalogue/page-4.html
Navigating to: https://books.toscrape.com/catalogue/page-5.html
[
{ title: 'A Light in the Attic', price: '£51.77' },
{ title: 'Tipping the Velvet', price: '£53.74' },
{ title: 'Soumission', price: '£50.10' },
{ title: 'Sharp Objects', price: '£47.82' },
{ title: 'Sapiens: A Brief History of Humankind', price: '£54.23' },
{ title: 'The Requiem Red', price: '£22.65' },
…output omitted…
{
title: 'In the Country We Love: My Family Divided',
price: '£22.00'
}
]
Set Up Proxies
Proxies are commonly used in scraping setups to spread requests across multiple IP addresses and make them harder to trace back to a single source. A common approach is to maintain a pool of proxies and rotate them dynamically.
Place your proxies in an array or store them in a separate file if you like:
const proxies = [
"proxy1.example.com:port",
"proxy2.example.com:port"
// Add more proxies here
];
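If you prefer to keep the list out of your source code, you could also store it in a separate file and load it at startup; a minimal sketch, assuming a hypothetical proxies.json in the project root:
// proxies.json (hypothetical file): ["proxy1.example.com:port", "proxy2.example.com:port"]
const fs = require("fs");

// Read and parse the proxy list from disk
const proxies = JSON.parse(fs.readFileSync("./proxies.json", "utf8"));
console.log(`Loaded ${proxies.length} proxies`);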
Add Proxy Rotation Logic
Let’s enhance the code with logic that rotates through the proxy array each time you launch the browser. Update proxy-scraper.js to include the following code:
const puppeteer = require("puppeteer");

const proxies = [
  "proxy1.example.com:port",
  "proxy2.example.com:port"
  // Add more proxies here
];

(async () => {
  // Choose a random proxy
  const randomProxy =
    proxies[Math.floor(Math.random() * proxies.length)];

  // Launch Puppeteer with the proxy
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      `--proxy-server=${randomProxy}`,
      "--no-sandbox",
      "--disable-setuid-sandbox",
      "--ignore-certificate-errors",
    ],
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(`Using proxy: ${randomProxy}`);
  console.log(books); // Print the collected data
  await browser.close();
})();
Note: Instead of rotating the proxies manually, you can use a library like luminati-proxy to automate the process.
In this code, a random proxy is selected from the proxies list and applied to Puppeteer using the --proxy-server=${randomProxy} argument. To further reduce the chance of detection, you could also assign a random user agent string to each session (a brief sketch follows). The scraping logic then runs as before, and the proxy that was used to scrape the product data is logged.
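This user-agent rotation isn’t part of the script above; a minimal sketch using Puppeteer’s page.setUserAgent(), placed right after browser.newPage(), might look like this (the user-agent strings are only illustrative examples):
// Illustrative user-agent strings; in practice, use current, realistic values
const userAgents = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

// Pick a random user agent and apply it to the page before navigating
const randomUserAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
await page.setUserAgent(randomUserAgent);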
When you run the code again, you should see output like before, but this time, the proxy that was used is also printed:
Navigating to: https://books.toscrape.com/catalogue/page-1.html
Navigating to: https://books.toscrape.com/catalogue/page-2.html
Navigating to: https://books.toscrape.com/catalogue/page-3.html
Navigating to: https://books.toscrape.com/catalogue/page-4.html
Navigating to: https://books.toscrape.com/catalogue/page-5.html
Using proxy: 115.147.63.59:8081
…output omitted…
Challenges with Proxy-Based Scraping
Although a proxy-based approach can work for many use cases, you may face some of the following challenges:
- Frequent blocks: Proxies might get blocked if the site has stringent anti-bot detection.
- Performance overheads: Rotating proxies and retrying requests slow down your data collection pipeline.
- Complex scalability: Managing and rotating a large proxy pool for optimal performance and availability is complex. It requires load balancing, preventing overuse of individual proxies, cool-down periods, and handling failures in real time (a simplified sketch of this kind of bookkeeping follows this list). The challenge grows with concurrent requests, as the system must evade detection while continuously monitoring and replacing blacklisted or underperforming IPs.
- Browser maintenance: Browser maintenance can be both technically challenging and resource-intensive. You need to continuously update and handle the browser’s fingerprint (cookies, headers, and other identifying attributes) to mimic real user behavior and evade advanced anti-bot controls.
- Cloud browser overhead: Cloud-based browsers add operational overhead through higher resource requirements and more complex infrastructure management, which drives up operating costs. Scaling browser instances for consistent performance further complicates the process.
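To illustrate the kind of bookkeeping the scalability point above refers to, here is a simplified, hypothetical sketch of a proxy pool with a cool-down window and failure tracking (the ProxyPool class and its thresholds are illustrative, not part of the tutorial code):
// Simplified, hypothetical proxy pool with cool-down and failure tracking
class ProxyPool {
  constructor(proxies, cooldownMs = 60000, maxFailures = 3) {
    this.cooldownMs = cooldownMs;
    this.maxFailures = maxFailures;
    // Track when each proxy was last used and how many times it has failed
    this.state = proxies.map((proxy) => ({ proxy, lastUsed: 0, failures: 0 }));
  }

  // Return a proxy that is outside its cool-down window and not blacklisted
  acquire() {
    const now = Date.now();
    const candidate = this.state.find(
      (s) => s.failures < this.maxFailures && now - s.lastUsed > this.cooldownMs
    );
    if (!candidate) throw new Error("No healthy proxy available");
    candidate.lastUsed = now;
    return candidate.proxy;
  }

  // Record a failed request so the proxy eventually drops out of rotation
  reportFailure(proxy) {
    const entry = this.state.find((s) => s.proxy === proxy);
    if (entry) entry.failures += 1;
  }
}
A real implementation would also need retry logic, health checks, and concurrency control, which is exactly the maintenance burden a managed solution removes.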
Dynamic Scraping with the Bright Data Scraping Browser
To help with these challenges, you can use a single API solution like the Bright Data Scraping Browser. It simplifies your operation, eliminates the need for manual proxy rotation and complex browser setups, and often leads to a higher success rate in data retrieval.
Set Up Your Bright Data Account
To start, log in to your Bright Data account and navigate to Proxies & Scraping, scroll down to Scraping Browser, and click Get Started:
Keep the default configuration and click Add to create a new Scraping Browser instance:
After you’ve created a Scraping Browser instance, take note of the Puppeteer URL as you’ll need it soon:
Adjust the Code to Use the Bright Data Scraping Browser
Now, let’s adjust the code so that instead of using rotating proxies, you connect directly to the Bright Data Scraping Browser endpoint.
Create a new file called brightdata-scraper.js and add the following code:
const puppeteer = require("puppeteer");

(async () => {
  // Bright Data Scraping Browser WebSocket endpoint
  const SBR_WS_ENDPOINT = "YOUR_BRIGHT_DATA_WS_ENDPOINT";

  // Connect Puppeteer to the Scraping Browser
  const browser = await puppeteer.connect({
    browserWSEndpoint: SBR_WS_ENDPOINT,
  });
  const page = await browser.newPage();

  const baseUrl = "https://books.toscrape.com/catalogue/page-";
  const books = [];

  for (let i = 1; i <= 5; i++) { // Loop through the first 5 pages
    const url = `${baseUrl}${i}.html`;
    console.log(`Navigating to: ${url}`);

    // Navigate to the page
    await page.goto(url, { waitUntil: "networkidle0" });

    // Extract book data from the current page
    const pageBooks = await page.evaluate(() => {
      let books = [];
      document.querySelectorAll(".product_pod").forEach((item) => {
        let title = item.querySelector("h3 a")?.getAttribute("title") || "";
        let price = item.querySelector(".price_color")?.innerText || "";
        books.push({ title, price });
      });
      return books;
    });

    books.push(...pageBooks); // Append books from this page to the main list
  }

  console.log(books); // Print the collected data
  await browser.close();
})();
Make sure you replace YOUR_BRIGHT_DATA_WS_ENDPOINT with the URL that you retrieved in the previous step.
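Rather than hardcoding the endpoint, you could also read it from an environment variable; a small sketch, assuming a variable named BRIGHT_DATA_WS_ENDPOINT (the name is just an example):
// Read the Scraping Browser endpoint from an environment variable (example name)
const SBR_WS_ENDPOINT = process.env.BRIGHT_DATA_WS_ENDPOINT;
if (!SBR_WS_ENDPOINT) {
  throw new Error("Set the BRIGHT_DATA_WS_ENDPOINT environment variable first");
}
You would then run the script with something like BRIGHT_DATA_WS_ENDPOINT="wss://..." node brightdata-scraper.js, which keeps credentials out of your source code.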
This code is similar to the previous script, but instead of maintaining a list of proxies and juggling between them, you connect directly to the Bright Data endpoint.
Run the script:
node brightdata-scraper.js
Your output should be the same as before, but now, you won’t need to manually rotate proxies or configure user agents. The Bright Data Scraping Browser handles everything—from proxy rotation to bypassing CAPTCHAs—ensuring uninterrupted data scraping.
Turn the Code into an Express Endpoint
If you want to integrate the Bright Data Scraping Browser into a larger application, consider exposing it as an Express endpoint.
Start by installing Express:
npm install express
Create a file called server.js and add the following code:
const express = require("express");
const puppeteer = require("puppeteer");

const app = express();
const PORT = 3000;

// Needed to parse JSON bodies:
app.use(express.json());

// Your Bright Data Scraping Browser WebSocket endpoint
const SBR_WS_ENDPOINT = "YOUR_BRIGHT_DATA_WS_ENDPOINT";

/**
 * POST /scrape
 * Body example:
 * {
 *   "baseUrl": "https://books.toscrape.com/catalogue/page-"
 * }
 */
app.post("/scrape", async (req, res) => {
  const { baseUrl } = req.body;

  if (!baseUrl) {
    return res.status(400).json({
      success: false,
      error: 'Missing "baseUrl" in request body.',
    });
  }

  try {
    // Connect to the Bright Data Scraping Browser
    const browser = await puppeteer.connect({
      browserWSEndpoint: SBR_WS_ENDPOINT,
    });
    const page = await browser.newPage();
    const books = [];

    // Example: scrape 5 pages of the base URL
    for (let i = 1; i <= 5; i++) {
      const url = `${baseUrl}${i}.html`;
      console.log(`Navigating to: ${url}`);

      await page.goto(url, { waitUntil: "networkidle0" });

      const pageBooks = await page.evaluate(() => {
        const data = [];
        document.querySelectorAll(".product_pod").forEach((item) => {
          const title = item.querySelector("h3 a")?.getAttribute("title") || "";
          const price = item.querySelector(".price_color")?.innerText || "";
          data.push({ title, price });
        });
        return data;
      });

      books.push(...pageBooks);
    }

    // Close the browser connection
    await browser.close();

    // Return JSON with the scraped data
    return res.json({
      success: true,
      books,
    });
  } catch (error) {
    console.error("Scraping error:", error);
    return res.status(500).json({
      success: false,
      error: error.message,
    });
  }
});

// Start the Express server
app.listen(PORT, () => {
  console.log(`Server is listening on http://localhost:${PORT}`);
});
In this code, you initialize an Express app, accept JSON payloads, and define a POST /scrape route. Clients send a JSON body containing the baseUrl, which the server uses to build the page URLs that the Bright Data Scraping Browser then navigates to and scrapes. As before, replace YOUR_BRIGHT_DATA_WS_ENDPOINT with the URL of your Scraping Browser instance.
Run your new Express server:
node server.js
To test the endpoint, you can use a tool like Postman (or any other REST client of your choice), or you can use curl from your terminal or shell like this:
curl -X POST http://localhost:3000/scrape \
-H 'Content-Type: application/json' \
-d '{"baseUrl": "https://books.toscrape.com/catalogue/page-"}'
Your output should look like this:
{"success":true,"books":[{"title":"A Light in the Attic","price":"£51.77"},{"title":"Tipping the Velvet","price":"£53.74"},{"title":"Soumission","price":"£50.10"},{"title":"Sharp Objects","price":"£47.82"},{"title":"Sapiens: A Brief History of Humankind","price":"£54.23"},{"title":"The Requiem Red","price":"£22.65"},{"title":"The Dirty Little Secrets of Getting Your Dream Job","price":"£33.34"},{"title":"The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull","price":"£17.93"},
… output omitted…
{"title":"Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)","price":"£53.90"},{"title":"Join","price":"£35.67"},{"title":"In the Country We Love: My Family Divided","price":"£22.00"}]}
Following is a diagram showing the contrast between the manual setup (rotating proxy) and the Bright Data Scraping Browser approach:
Manually managing rotating proxies requires constant attention and tuning, and it often results in frequent blocks and limited scalability.
Using the Bright Data Scraping Browser streamlines the process by eliminating the need to manage proxies or headers, while also delivering faster response times through optimized infrastructure. Its integrated anti-bot strategies boost success rates, making it less likely for you to get blocked or flagged.
All the code for this tutorial is available in this GitHub repository.
Calculate ROI
Switching from a manual proxy-based scraping setup to the Bright Data Scraping Browser can significantly cut development time and costs.
Traditional Setup
Scraping news websites daily requires the following:
- Initial development: ~50 hours ($5,000 USD at $100 USD/hour)
- Ongoing maintenance: ~10 hours/month ($1,000 USD) for code updates, infrastructure, scaling, and proxy management
- Proxy/IP costs: ~$250 USD/month (varies based on IP needs)
Total estimated monthly cost: ~$1,250 USD
Bright Data Scraping Browser Setup
- Development time: 5–10 hours ($1,000 USD)
- Maintenance: ~2–4 hours/month ($200 USD)
- No proxy or infrastructure management is needed
- Bright Data service costs:
- Traffic usage: $8.40 USD/GB (e.g., 30 GB/month = $252 USD)
Total estimated monthly cost: ~$450 USD
Automating proxy management and scaling the Bright Data Scraping Browser reduces both upfront development costs and ongoing maintenance, making large-scale data scraping more efficient and cost-effective.
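If you want to sanity-check these numbers yourself, here is a small sketch that reproduces the arithmetic above (the figures are the estimates from this section, not fixed prices):
// Rough monthly cost comparison using the estimates from this section
const traditional = { maintenance: 1000, proxies: 250 };
const brightData = { maintenance: 200, traffic: 8.4 * 30 }; // $8.40/GB * 30 GB

const traditionalTotal = traditional.maintenance + traditional.proxies; // ~$1,250
const brightDataTotal = brightData.maintenance + brightData.traffic; // ~$452

console.log(`Traditional setup: ~$${traditionalTotal}/month`);
console.log(`Scraping Browser setup: ~$${brightDataTotal}/month`);
console.log(`Estimated monthly savings: ~$${traditionalTotal - brightDataTotal}`);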
Conclusion
Transitioning from a traditional proxy-based web scraping configuration to the Bright Data Scraping Browser removes the hassle of proxy rotation and manual anti-bot handling.
Beyond fetching HTML, Bright Data also offers additional tools to streamline data extraction:
- Web Scrapers to help tidy up your data extraction
- Web Unlocker API to scrape tougher sites
- Datasets so you can access pre-collected, structured data
These solutions can simplify your scraping process, reduce workloads, and improve scalability.