Guide to Scraping eCommerce Websites

In this step-by-step guide, learn how to scrape eCommerce websites using Playwright and Bright Data’s Scraping Browser.

In the fast-paced world of eCommerce, staying ahead of the curve means keeping an eye on the competition. One way to do this is through web scraping, a technique for extracting data from websites. Whether you’re a seasoned developer or a newbie dipping your toes into the world of data extraction, this article is designed to help you understand the ins and outs of web scraping eCommerce websites.

There are all kinds of reasons you may be interested in scraping eCommerce websites, including competitive analysis, market research, price monitoring, lead generation, or data-driven decision-making.

In this tutorial, you’ll learn about some of the common challenges you’ll face when scraping eCommerce websites and how to scrape them using Playwright, a Python library, and Bright Data’s Scraping Browser.

Problems with Web Scraping Using Local Browsers

While writing scrapers that can extract vast amounts of data is powerful, doing so for the first time can be challenging. For instance, when using local browsers, developers often run into various issues that hinder their efficiency and effectiveness. Some of the most common problems include the following:

  1. IP blocking: Websites often track the IP addresses making requests. If they detect an abnormal number of requests from a single IP (typical in web scraping or brute-forcing), they may block that IP. When using a local browser, all requests come from a single IP, making this a significant issue.
  2. Rate limiting: Many websites implement rate limiting, allowing only a certain number of requests from an IP address within a given time period. If this limit is exceeded, further requests may be blocked or slowed down, hindering the scraping process (see the sketch after this list).
  3. Lack of proxies: Without a pool of proxies, all requests in a scraping operation come from the same IP address. This makes it easier for websites to detect and block scraping activity. In contrast, using a pool of proxies allows the requests to come from different IP addresses, reducing the risk of detection.
  4. CAPTCHA challenges: Websites may use CAPTCHA challenges to verify that the user is a human and not a bot. Local browsers often lack the functionality to automatically solve these challenges, making them a substantial roadblock in scraping efforts.
  5. Dynamic website content: Many modern websites use JavaScript to load content dynamically. A local browser might struggle to scrape these websites accurately because the content might not be fully loaded before the scraping begins.
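
To make the rate-limiting problem concrete, the sketch below shows the kind of backoff logic a scraper running from a single local IP often ends up needing. It uses the requests library, and the retry counts and error handling are illustrative only:

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    # Retry a GET request with exponential backoff when rate limited
    delay = 1  # initial wait in seconds
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        # Honor the server's Retry-After header if present (assumes a seconds value)
        wait = int(response.headers.get('Retry-After', delay))
        time.sleep(wait)
        delay *= 2  # double the wait before the next attempt
    raise RuntimeError(f'Still rate limited after {max_retries} attempts')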

In the context of scraping with a local browser, these issues compound to make web scraping challenging. The lack of advanced features, such as IP rotation and automatic CAPTCHA solving, can slow down scraping processes and decrease the quality and quantity of data collected. It’s crucial for developers to be aware of these common problems and seek out tools and techniques to circumvent them effectively.

The next section will discuss how the Bright Data Scraping Browser can help solve these problems, making web scraping a much smoother and more productive experience.

How to Scrape with the Bright Data Scraping Browser

In the world of web scraping, Bright Data stands out as a cutting-edge provider, and at the heart of its offerings is the Scraping Browser, a tool designed specifically to address the challenges of data extraction.

The Scraping Browser easily tackles IP blocking because it has access to a vast pool of residential and mobile IPs. This means you can rotate IPs and emulate organic user behavior, significantly reducing the risk of being blocked.

Similarly, by leveraging Bright Data’s extensive IP pool, the Scraping Browser can distribute requests across multiple IPs, effectively mitigating the rate-limiting issue. Moreover, with the Scraping Browser, you get automatic proxy management. This means Scraping Browser handles proxy rotation, ensuring that your scraping activities continue without manual intervention.

The Scraping Browser also offers advanced browser fingerprinting protection, allowing you to mimic a real user. This makes it harder for websites to detect and block your scraping efforts.


With these features in mind, dive into the tutorial and learn how to use the Bright Data Scraping Browser to scrape an eCommerce website. Here, you’ll be using Python as the programming language of choice.

Step 1: Set Up a New Python Project

The first step of this tutorial is to set up a new Python project. This is your workspace for the scraping task. You can use any text editor or integrated development environment (IDE) of your choice.

Additionally, make sure Python is installed on your machine. You can confirm this by typing python --version in your terminal. If Python is installed, this command displays its version. If not, you need to install it.

Once you’ve made sure Python is installed, it’s time to create your project directory. Open your terminal and navigate to where you want your project to reside. Then enter the following commands:

mkdir ecommerce-scraping   # This command creates a new directory named ecommerce-scraping.
cd ecommerce-scraping      # This command navigates into the newly created directory.
python -m venv env         # This command creates a new virtual environment in your project directory.
source env/bin/activate    # This command activates the virtual environment.

Creating a virtual environment is a good practice, as it isolates your project and its dependencies from other Python projects, preventing any conflicts between different versions of libraries. (On Windows, activate the environment with env\Scripts\activate instead of the source command.)

Step 2: Import Playwright in the Project

Playwright is a Python library for automating and testing web browsers. You’ll use it to control your Scraping Browser.

To install Playwright, use pip, which is a package installer for Python:

pip install playwright

After installing Playwright, you need to run the playwright install command. This downloads the browser binaries that Playwright needs to automate browsers:

playwright install

Step 3: Set Up a New Bright Data Account

Next, you need a Bright Data account. If you don’t have one, navigate to the Bright Data website and sign up. Once you have an account, you can create and manage your Scraping Browser instances and access your unique credentials.

Step 4: Create a New Scraping Browser Instance

Once you have access to a Bright Data account, log in and navigate to the Scraping Browser section, where you can create a new Scraping Browser instance.

Make a note of your Host ID, as you’ll need it when connecting to the Scraping Browser.

Step 5: Connect to the Scraping Browser Instance

Now it’s time to connect Playwright to your Scraping Browser instance. Bright Data provides an example script in their documentation that you can use as a starting point. Remember to replace YOUR_ZONE_USERNAME, YOUR_ZONE_PASSWORD, and YOUR_ZONE_HOST with your actual Bright Data credentials and the ID of the Scraping Browser instance you created:

import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://example.com', timeout=120000)
        print('done, evaluating')
        print(await page.evaluate('()=>document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())

Save this file as main.py in your project directory. Then run the script with the following:

python main.py

This script connects to a remote Chromium instance over CDP, opens a new page, and navigates to the URL you specified. It then prints the page’s HTML content and closes the browser.

At this point, you’ve validated that you can connect to the Scraping Browser instance. Since this is your baseline script, quickly go over the code:

  • import asyncio and from playwright.async_api import async_playwright pull in the script’s required dependencies. asyncio is a library for writing single-threaded concurrent code using coroutines, and async_playwright is the asynchronous API of the Playwright library.
  • auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD' sets up the authentication for the Bright Data Scraping Browser by utilizing the username and password of your zone.
  • browser_url = f'wss://{auth}@YOUR_ZONE_HOST' constructs the WebSocket URL used to connect to the Bright Data Scraping Browser.
  • browser = await pw.chromium.connect_over_cdp(browser_url) connects to the Bright Data Scraping Browser using the Chromium browser. The await keyword pauses the execution of the function until the connection is established.
  • await page.goto('https://example.com', timeout=120000) navigates the page to the specified URL. The timeout parameter specifies how long to wait for the navigation to complete before throwing an error.
  • print(await page.evaluate('()=>document.documentElement.outerHTML')) evaluates the JavaScript code in the context of the page and prints the result. In this case, the JavaScript code returns the entire page’s HTML content.
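
One thing the example script doesn’t handle is failure: if goto() or evaluate() throws, the script exits without closing the remote browser session. Here’s a minimal variation of main(), assuming the same imports and browser_url as above, that wraps the work in try/finally so the session is always released. Treat it as a sketch of defensive hygiene, not part of Bright Data’s official example:

async def main():
    async with async_playwright() as pw:
        # connect_over_cdp raises if the endpoint or credentials are wrong
        browser = await pw.chromium.connect_over_cdp(browser_url)
        try:
            page = await browser.new_page()
            await page.goto('https://example.com', timeout=120000)
            print(await page.evaluate('() => document.documentElement.outerHTML'))
        finally:
            # Always release the remote browser session, even on errors
            await browser.close()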

Step 6: Scrape an eCommerce Website

Once you’ve connected to the Scraping Browser instance, you’re ready to start scraping. In this tutorial, you’ll scrape Books to Scrape, a sandbox eCommerce website that allows scraping.

Open up your main.py file and replace the contents with the following code; then run the script in your terminal:

import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')
        print(await page.evaluate('()=>document.documentElement.outerHTML'))
        await browser.close()

asyncio.run(main())

You’ll see the HTML content of the Books to Scrape home page printed out. At this point, the script doesn’t return anything particularly useful; you just get the entire HTML of the target website.

Step 7: Extract Structured Information

To make this tutorial a little more useful, extract some structured data. This process varies depending on the specific data you’re interested in, but for this example, extract the names and prices of books on the home page.

Start by inspecting the books.toscrape.com home page and identifying the HTML elements that contain the book names and prices. The book names are in the <h3> tags inside <article class="product_pod">, and the prices are in the <p class="price_color"> tags inside the same <article> tags.
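
If you want to sanity-check those selectors before wiring up the full extraction, a quick count from inside main() is enough. This two-line aside assumes the page object from the previous script and isn’t part of the final code:

count = await page.locator('article.product_pod').count()
print(f'{count} product cards found')  # the home page lists 20 books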

Here’s how to modify the script to extract this information:

import asyncio
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')
        
        # Find all the books in the article elements
        books = await page.query_selector_all('article.product_pod')

        # Coroutine that extracts and prints one book's details
        async def get_book_details(book):
            # Extract and print book name and price
            book_name_element = await book.query_selector('h3 > a')
            book_name = await book_name_element.get_attribute('title')
            book_price_element = await book.query_selector('div.product_price > p.price_color')
            book_price = await book_price_element.inner_text()
            print(f'{book_name}: {book_price}')

        # Use asyncio.gather() to execute all async calls concurrently
        await asyncio.gather(*(get_book_details(book) for book in books))

        await browser.close()

asyncio.run(main())

When you run this script, you see a list of book names and their prices printed in your terminal. Assuming the sandbox catalog hasn’t changed, the first few lines look something like this:

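A Light in the Attic: £51.77
Tipping the Velvet: £53.74
Soumission: £50.10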

This is a very simple example; however, it demonstrates how you can extract structured data from a website using Playwright and Bright Data. You can adapt this script to scrape different types of data from other pages or websites.
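
For example, Books to Scrape paginates its catalog, and each page links to the next one. Here’s a hedged sketch of a coroutine that walks every page and collects the same fields; it assumes the page object and imports from the script above, and the li.next > a selector reflects the sandbox site’s markup at the time of writing, so verify it before relying on it:

from urllib.parse import urljoin

async def scrape_all_pages(page):
    # Collect (name, price) tuples from every catalog page
    all_books = []
    while True:
        for book in await page.query_selector_all('article.product_pod'):
            name_element = await book.query_selector('h3 > a')
            price_element = await book.query_selector('div.product_price > p.price_color')
            all_books.append((await name_element.get_attribute('title'),
                              await price_element.inner_text()))
        # The sandbox site marks its pagination link with li.next > a
        next_link = await page.query_selector('li.next > a')
        if next_link is None:
            break  # no "next" button means this was the last page
        href = await next_link.get_attribute('href')
        # hrefs are relative (e.g., catalogue/page-2.html), so resolve them
        await page.goto(urljoin(page.url, href), timeout=120000)
    return all_books

To use it, you’d replace the single-page extraction in main() with all_books = await scrape_all_pages(page) and print or store the results from there.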

Now, take things a step further and generate a CSV file containing the scraped data.

Step 8: Save the Scraped Data to a CSV File

To save the scraped data to a CSV file, you need to import the csv module and have get_book_details() return each book’s details instead of printing them. You can then gather the results and write them to a CSV file in the main() function.

Open your main.py file and replace its contents with the following code:

import asyncio
import csv
from playwright.async_api import async_playwright

auth = 'YOUR_ZONE_USERNAME:YOUR_ZONE_PASSWORD'
browser_url = f'wss://{auth}@YOUR_ZONE_HOST'

async def main():
    async with async_playwright() as pw:
        print('connecting')
        browser = await pw.chromium.connect_over_cdp(browser_url)
        print('connected')
        page = await browser.new_page()
        print('goto')
        await page.goto('https://books.toscrape.com', timeout=120000)
        print('done, evaluating')

        # Find all the books in the article elements
        books = await page.query_selector_all('article.product_pod')

        async def get_book_details(book):
            # Extract book name and price
            book_name_element = await book.query_selector('h3 > a')
            book_name = await book_name_element.get_attribute('title')
            book_price_element = await book.query_selector('div.product_price > p.price_color')
            book_price = await book_price_element.inner_text()

            return book_name, book_price

        # Use asyncio.gather() to execute all async calls concurrently
        book_details = await asyncio.gather(*(get_book_details(book) for book in books))

        # Write book details to a CSV file
        with open('books.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['Book Name', 'Price'])  # Write header
            writer.writerows(book_details)  # Write book details

        await browser.close()

asyncio.run(main())

If you run this script, you’ll see a new file called books.csv in your project directory. Open it, and assuming the sandbox catalog hasn’t changed, the first few rows look something like this:
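
Book Name,Price
A Light in the Attic,£51.77
Tipping the Velvet,£53.74
Soumission,£50.10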

Conclusion

In this tutorial, you learned how to use Playwright and Bright Data to scrape data from an eCommerce website. This tutorial only scratched the surface of what you can do with Playwright and the Bright Data Scraping Browser, a proxy browser solution focused on unlocking data collection from websites that use advanced anti-bot detection techniques. The fundamentals covered in this article can be combined into more advanced workflows that automate tasks such as price matching, market analysis, and lead generation.

Behind the scenes, Bright Data uses a complete proxy infrastructure to route your requests through a pool of millions of IPs. This allows you to scrape data from websites without getting blocked or banned. Sign up for a free trial and start experimenting with the Scraping Browser today!

Want to skip scraping eCommerce websites and just get the data? Purchase eCommerce datasets.