In the fast-paced world of eCommerce, staying ahead of the curve means keeping an eye on the competition. One way to do this is through web scraping, a technique for extracting data from websites. Whether you’re a seasoned developer or a newbie dipping your toes into the world of data extraction, this article is designed to help you understand the ins and outs of web scraping eCommerce websites.
There are all kinds of reasons you may be interested in scraping eCommerce websites, including competitive analysis, market research, price monitoring, lead generation, or data-driven decision-making.
In this tutorial, you’ll learn about some of the common challenges you’ll face when scraping eCommerce websites and how to scrape them using Playwright, a browser automation library with a Python API, and Bright Data’s Scraping Browser.
Problems with Web Scraping Using Local Browsers
Being able to write scrapers that extract vast amounts of data is powerful, but doing so for the first time can be challenging. For instance, when using local browsers, developers often run into issues that hinder their efficiency and effectiveness. Some of the most common problems include the following:
- IP blocking: Websites often track the IP addresses making requests. If they detect an abnormal number of requests from a single IP (typical in web scraping or brute-forcing), they may block that IP. When using a local browser, all requests come from a single IP, making this a significant issue.
- Rate limiting: Many websites implement rate limiting, allowing only a certain number of requests from an IP address within a given time period. If this limit is exceeded, further requests may be blocked or slowed down, hindering the scraping process.
- Lack of proxies: Without a pool of proxies, all requests in a scraping operation come from the same IP address. This makes it easier for websites to detect and block scraping activity. In contrast, using a pool of proxies allows the requests to come from different IP addresses, reducing the risk of detection.
- CAPTCHA challenges: Websites may use CAPTCHA challenges to verify that the user is a human and not a bot. Local browsers often lack the functionality to automatically solve these challenges, making them a substantial roadblock in scraping efforts.
- Dynamic website content: Many modern websites use JavaScript to load content dynamically. A local browser might struggle to scrape these websites accurately because the content might not be fully loaded before the scraping begins.
In the context of scraping with a local browser, these issues compound to make web scraping challenging. The lack of advanced features, such as IP rotation and automatic CAPTCHA solving, can slow down scraping processes and decrease the quality and quantity of data collected. It’s crucial for developers to be aware of these common problems and seek out tools and techniques to circumvent them effectively.
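For example, a common first attempt at working around rate limits with a local setup is to simply throttle your own requests. A minimal sketch of that approach might look like the following (the target URLs and delay values are illustrative assumptions, and the requests library must be installed); note that every request still comes from your single IP, so this only delays detection rather than preventing it:

import random
import time

import requests

# Hypothetical product listing pages to fetch
urls = [f'https://example.com/products?page={i}' for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    # Wait a randomized interval between requests to stay under a presumed rate limit
    time.sleep(random.uniform(2.0, 5.0))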
The next section will discuss how the Bright Data Scraping Browser can help solve these problems, making web scraping a much smoother and more productive experience.
How to Scrape with the Bright Data Scraping Browser
In the world of web scraping, Bright Data stands out as a cutting-edge provider, and at the heart of Bright Data’s offerings is its web scraping browser, a tool designed specifically to address the challenges faced in data extraction.
The Scraping Browser tackles IP blocking because it has access to a vast pool of residential and mobile IPs. This means you can rotate IPs and emulate organic user behavior, significantly reducing the risk of being blocked.
Similarly, by leveraging Bright Data’s extensive IP pool, the Scraping Browser can distribute requests across multiple IPs, effectively mitigating the rate-limiting issue. Moreover, with the Scraping Browser, you get automatic proxy management. This means Scraping Browser handles proxy rotation, ensuring that your scraping activities continue without manual intervention.
The Scraping Browser also offers advanced browser fingerprinting protection, allowing you to mimic a real user. This makes it harder for websites to detect and block your scraping efforts.
With these features in mind, dive into the tutorial and learn how to use the Bright Data Scraping Browser to scrape an eCommerce website. Here, you’ll be using Python as the programming language of choice.
Step 1: Set Up a New Python Project
The first step of this tutorial is to set up a new Python project. This is your workspace for the scraping task. You can use any text editor or integrated development environment (IDE) of your choice.
Additionally, make sure Python is installed on your machine. You can confirm this by typing python --version in your terminal. If Python is installed, this command displays its version; if not, you need to install it.
Once you’ve made sure Python is installed, it’s time to create your project directory. Open your terminal and navigate to where you want your project to reside. Then enter the following commands:
mkdir ecommerce-scraping # Create a new directory named ecommerce-scraping
cd ecommerce-scraping # Navigate into the newly created directory
python -m venv env # Create a new virtual environment in the project directory
source env/bin/activate # Activate the virtual environment (on Windows, run env\Scripts\activate instead)
Creating a virtual environment is a good practice, as it isolates your project and its dependencies from other Python projects, preventing any conflicts between different versions of libraries.
Step 2: Import Playwright in the Project
Playwright is a Python library for automating and testing web browsers. You’ll use it to control your Scraping Browser.
To install Playwright, use pip, which is a package installer for Python:
pip install playwright
After installing Playwright, you need to run the playwright install command. This downloads the browser binaries that Playwright needs to automate browsers:
playwright install
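If you want to verify the installation before connecting to the Scraping Browser, you can run a quick local smoke test. The following snippet is just a sanity check (it launches a local headless Chromium, not the Scraping Browser) and isn’t part of the tutorial’s main script:

from playwright.sync_api import sync_playwright

# Launch a local headless Chromium and fetch a page title
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')
    print(page.title())  # Prints "Example Domain" if everything is set up correctly
    browser.close()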
Step 3: Set Up a New Bright Data Account
Next, you need a Bright Data account. If you don’t have one, navigate to the Bright Data website and sign up. Once you have an account, you can create and manage your Scraping Browser instances and access your unique credentials.
Step 4: Create a New Scraping Browser Instance
Once you have access to a Bright Data account, log in and navigate to the Scraping Browser section, where you can create a new Scraping Browser instance.
Make a note of your Host ID, as you’ll need it when connecting to the Scraping Browser.
Step 5: Connect to the Scraping Browser Instance
Now it’s time to connect Playwright to your Scraping Browser instance. Bright Data provides an example script in its documentation that you can use as a starting point. Remember to replace YOUR_ZONE_USERNAME, YOUR_ZONE_PASSWORD, and YOUR_ZONE_HOST with your actual Bright Data credentials and the ID of the Scraping Browser instance you created:
import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://example.com', timeout=120000)
        print('done, evaluating')
        print(await page.evaluate('()=>document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())
Save this file as main.py in your project directory. Then run the script with the following:
python main.py
This script connects to a Chromium instance running in the Scraping Browser and navigates to the URL you specified. It then prints the content of the web page and closes the browser.
At this point, you’ve validated that you can connect to the Scraping Browser instance. Since this is your baseline script, quickly go over the code:
- import asyncio and from playwright.async_api import async_playwright are the required imports for the script. asyncio is a library for writing single-threaded concurrent code using coroutines, and async_playwright is the asynchronous API of the Playwright library.
- auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD' sets up the authentication for the Bright Data Scraping Browser using the username and password of your zone.
- browser_url = f'wss://{auth}@YOUR_ZONE_HOST' constructs the WebSocket URL that connects to the Bright Data Scraping Browser.
- browser = await pw.chromium.connect_over_cdp(browser_url) connects to the Bright Data Scraping Browser using the Chromium browser. The await keyword pauses the execution of the function until the connection is established.
- await page.goto('https://example.com', timeout=120000) navigates the page to the specified URL. The timeout parameter specifies how long, in milliseconds, to wait for the navigation to complete before throwing an error.
- print(await page.evaluate('()=>document.documentElement.outerHTML')) evaluates the JavaScript code in the context of the page and prints the result. In this case, the JavaScript code returns the entire page’s HTML content.
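In practice, it’s also worth guarding the connection step, since a typo in your credentials or host surfaces as a connection error at connect_over_cdp(). Here’s a minimal variation of the baseline script with that guard added (the error handling is an illustrative pattern, not part of Bright Data’s example):

import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        try:
            browser = await pw.chromium.connect_over_cdp(browser_url)
        except Exception as exc:
            # Usually means bad credentials or an unreachable host
            print(f'Could not connect to the Scraping Browser: {exc}')
            return
        page = await browser.new_page()
        await page.goto('https://example.com', timeout=120000)
        print(await page.title())
        await browser.close()

asyncio.run(main())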
Step 6: Scrape an eCommerce Website
Once you’ve connected to the Scraping Browser instance, you’re ready to start scraping. In this tutorial, you’ll scrape Books to Scrape, a sandbox eCommerce website that allows scraping.
Open up your main.py file and replace the contents with the following code; then run the script in your terminal:
import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')
        print(await page.evaluate('()=>document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())
You’ll see the content of the Books to Scrape home page printed out. At this point, the script doesn’t return anything particularly useful; you just get the entire HTML content of the target website.
Step 7: Extract Structured Information
To make this tutorial a little more useful, extract some structured data. This process varies depending on the specific data you’re interested in, but for this example, extract the names and prices of books on the home page.
Start by inspecting the books.toscrape.com home page and identifying the HTML elements that contain the book names and prices. The book names are in the <h3> tags inside <article class="product_pod">, and the prices are in the <p class="price_color"> tags inside the same <article> tags.
Here’s how to modify the script to extract this information:
import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')

        # Find all the books in the article elements
        books = await page.query_selector_all('article.product_pod')

        # Extract and print each book's details
        async def get_book_details(book):
            # Extract the book name from the title attribute of the anchor tag
            book_name_element = await book.query_selector('h3 > a')
            book_name = await book_name_element.get_attribute('title')
            # Extract the displayed price text
            book_price_element = await book.query_selector('div.product_price > p.price_color')
            book_price = await book_price_element.inner_text()
            print(f'{book_name}: {book_price}')

        # Use asyncio.gather() to execute all async calls concurrently
        await asyncio.gather(*(get_book_details(book) for book in books))

        await browser.close()

asyncio.run(main())
When you run this script, you see a list of book names and their prices printed in your terminal, one book per line (for example, A Light in the Attic: £51.77).
This is a very simple example; however, it demonstrates how you can extract structured data from a website using Playwright and Bright Data. You can adapt this script to scrape different types of data from other pages or websites.
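As one example of adapting it, Books to Scrape paginates its catalog, and each page exposes a next link you can follow. Here’s a sketch of a multipage version of the script; the li.next > a selector and the five-page cap are assumptions based on the site’s current markup:

import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.connect_over_cdp(browser_url)
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com', timeout=120000)

        async def get_book_details(book):
            name_element = await book.query_selector('h3 > a')
            name = await name_element.get_attribute('title')
            price_element = await book.query_selector('div.product_price > p.price_color')
            price = await price_element.inner_text()
            print(f'{name}: {price}')

        # Walk the catalog by following the "next" link on each page
        for _ in range(5):  # Cap at five pages for this example
            books = await page.query_selector_all('article.product_pod')
            await asyncio.gather(*(get_book_details(book) for book in books))
            next_link = await page.query_selector('li.next > a')
            if next_link is None:
                break  # Reached the last page
            await next_link.click()
            await page.wait_for_load_state()

        await browser.close()

asyncio.run(main())

Each iteration scrapes the current page, then clicks through to the next one until the cap is reached or the next link disappears.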
Now, take things a step further and generate a CSV file containing the scraped data.
Step 8: Save the Scraped Data to a CSV File
In order to save the scraped data to a CSV file, you need to import the csv module and create a new CSV file in the main() function. Then you can return the scraped data from the get_book_details() function and write it to the CSV file. Open your main.py file and update it with the following code:
import asyncio
import csv
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')

        # Find all the books in the article elements
        books = await page.query_selector_all('article.product_pod')

        async def get_book_details(book):
            # Extract the book name and price
            book_name_element = await book.query_selector('h3 > a')
            book_name = await book_name_element.get_attribute('title')
            book_price_element = await book.query_selector('div.product_price > p.price_color')
            book_price = await book_price_element.inner_text()
            return book_name, book_price

        # Use asyncio.gather() to execute all async calls concurrently
        book_details = await asyncio.gather(*(get_book_details(book) for book in books))

        # Write book details to a CSV file
        with open('books.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['Book Name', 'Price'])  # Write header
            writer.writerows(book_details)  # Write book details

        await browser.close()

asyncio.run(main())
If you run this script, you’ll see a new file called books.csv in your project directory. Open it, and you’ll find the scraped data in CSV format: a header row followed by one book name and price per line.
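If you’d rather verify the output programmatically than open the file by hand, a few lines with Python’s built-in csv module will do. This check is an optional addition, not part of the scraper itself:

import csv

# Read books.csv back and print each row as a dictionary lookup
with open('books.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['Book Name'], row['Price'])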
Conclusion
In this tutorial, you learned how to use Playwright and Bright Data to scrape data from an eCommerce website. This tutorial only scratched the surface of what you can do with Playwright and the Bright Data Scraping Browser, a proxy browser solution focused on unlocking data collection from websites that use advanced antibot detection techniques. The fundamentals discussed in this article can be combined into more advanced workflows to automate things such as price matching, market analysis, and lead generation.
Behind the scenes, Bright Data uses a complete proxy infrastructure to route your requests through a pool of millions of IPs. This allows you to scrape data from websites without getting blocked or banned. Sign up for a free trial and start experimenting with the Scraping Browser today!
Want to skip scraping eCommerce websites and just get the data? Purchase ready-made eCommerce datasets.