In this guide, you will see:
- Why ChatGPT is an ideal choice for AI-powered web scraping.
- How to scrape a website using ChatGPT in Python.
- The main limitations of this approach, and how to work around them.
Let’s dive in!
Why Use ChatGPT for Web Scraping?
ChatGPT (or more precisely, GPT models) introduces a powerful new way to approach web scraping.
Instead of writing complex parsing logic to extract data from raw HTML, you can let the model handle the process. OpenAI’s APIs even offer dedicated endpoints for data parsing, making GPT models especially well-suited for web scraping tasks.
The biggest advantage is that you no longer need to write manual data parsing logic. Forget about dealing with brittle CSS selectors or XPath expressions. All you need is a solid prompt for data extraction. With just a few lines of code, you can instruct the AI to extract structured data from HTML pages.
Altogether, this AI-powered approach makes web scraping faster, more flexible, and much easier to maintain. For more details, refer to our guide on using AI for web scraping.
Scraping Scenarios
Some practical scenarios where ChatGPT can improve or replace traditional scrapers are:
- E-commerce sites with dynamic layouts: Automatically extract product details (price, description, images) even when pages vary in structure. This is useful for price monitoring, for retrieving data from sites like Amazon, and for scraping Shopify-based stores.
- Content aggregation: Scrape blog posts, reviews, or news articles from multiple sources and summarize or standardize the output using GPT.
- AI-assisted web crawling: Combine GPT with a crawler to not only fetch pages but also determine which links are worth visiting or what content to prioritize scraping.
- Social media and fast-changing platforms: Extract relevant information from platforms with frequently updated content or difficult-to-parse UIs.
Supported Workflows
Below are some advanced workflows where ChatGPT scraping really shines:
- Retrieval-Augmented Generation (RAG): Add scraped web data directly to the ChatGPT context to generate smarter, more accurate, context-aware answers. See our guide on building a RAG chatbot powered by SERP data.
- Real-time data enrichment: Optimize internal tools, dashboards, or AI agents with scraped product, pricing, or trend data gathered on-the-fly.
- Rapid prototyping for market research: Use ChatGPT to quickly gather data from multiple platforms, without having to manually build custom scraping bots.
How to Perform Web Scraping with ChatGPT in Python
In this guided section, you will see how to build a ChatGPT scraping script. The target will be a specific product page from “Ecommerce Test Site to Learn Web Scraping”.
This page is an ideal target for testing because e-commerce product pages often vary in structure. They display different types of data depending on the product. That variability is what makes e-commerce web scraping so challenging, and where AI can make a real difference.
The scraper will use ChatGPT (GPT models) to extract these product details from the page without writing any manual parsing logic:
- SKU
- Name
- Images
- Price
- Description
- Sizes
- Colors
- Category
Note: The following example is written in Python for simplicity and because the OpenAI Python SDK is widely used. Still, you can achieve the same results using the JavaScript OpenAI SDK or any other supported language. We also suggest going over the Python web scraping guide.
Follow the steps below to learn how to scrape web data using ChatGPT!
Prerequisites
Before getting started, make sure you have the following:
- Python 3.8 or higher installed on your machine.
- An OpenAI API key to access GPT models.
Follow the official guide to retrieve your OpenAI API key.
Step #1: Create Your Python Project
Run the following command in your terminal to create a new folder for your web scraping project:
mkdir chatgpt-scraper
This `chatgpt-scraper` directory will serve as the project folder for web scraping with ChatGPT.
Navigate to the folder and create a Python virtual environment inside it:
cd chatgpt-scraper
python -m venv venv
Open the project folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both excellent options.
In the project folder, create a `scraper.py` file:
chatgpt-scraper
├─── venv/
└─── scraper.py # <------------
Right now, `scraper.py` is just an empty Python script. It will soon contain the logic for LLM web scraping via ChatGPT.
Next, activate the virtual environment in your terminal. On Linux or macOS, run:
source venv/bin/activate
Equivalently, on Windows, execute:
venv\Scripts\activate
Great! Your Python environment is now set up for web scraping with ChatGPT.
Note: In the following steps, you will be guided through installing all the required dependencies one by one. If you prefer to install them all at once, run the following command:
pip install requests openai markdownify beautifulsoup4
Step #2: Configure OpenAI in Python
To connect to ChatGPT (i.e., GPT models) for web scraping, you need to use the OpenAI Python SDK. In your activated virtual environment, install the library via the `openai` package:
pip install openai
Then, import the OpenAI client in your code:
from openai import OpenAI
To connect to ChatGPT for web scraping, create a client instance:
client = OpenAI()
By default, the `OpenAI()` constructor will look for your API key in the `OPENAI_API_KEY` environment variable. This is the recommended way to configure authentication securely.
For development or testing purposes, you can alternatively add the key directly in the code:
OPENAI_API_KEY = "<YOUR_OPENAI_API_KEY>"
client = OpenAI(api_key=OPENAI_API_KEY)
Replace the `<YOUR_OPENAI_API_KEY>` placeholder with your actual OpenAI API key.
Amazing! Your OpenAI setup is now complete, and you are ready to use ChatGPT for web scraping.
Step #3: Retrieve the HTML of the Target Page
To scrape data from a web page, you first need to get its HTML. You can achieve that by making a `GET` request to the target web server with a popular Python HTTP client like Requests.
In an activated virtual environment, install Requests with:
pip install requests
Next, import the library in `scraper.py`:
import requests
Use the `get()` method to send a `GET` request to the target URL:
url = "https://www.scrapingcourse.com/ecommerce/product/mach-street-sweatshirt"
response = requests.get(url)
The target server will respond with the HTML document associated with the web page.
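Note that `requests` does not raise an exception for HTTP error responses on its own. If you want the script to fail fast on 4xx/5xx status codes, you can add this optional check right after the request:

```python
# Raise requests.exceptions.HTTPError for 4xx/5xx responses (e.g., 403 Forbidden)
response.raise_for_status()
```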
If you print `response.text`, you will see the full HTML document of the page:
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link rel="profile" href="https://gmpg.org/xfn/11" />
<link rel="pingback" href="https://www.scrapingcourse.com/ecommerce/xmlrpc.php" />
<!-- omitted for brevity... -->
<title>Mach Street Sweatshirt – Ecommerce Test Site to Learn Web Scraping</title>
<!-- omitted for brevity... -->
</head>
<body>
<!-- omitted for brevity... -->
</body>
</html>
You now have the HTML of the target page. Time to use ChatGPT to parse it and extract structured data!
Step #4: Convert the Page HTML to Markdown [Optional]
Note: This step is not technically required, but it can save you significant time and money. Thus, it is worth implementing.
Before diving into ChatGPT scraping, take a moment to examine how other AI web scraping tools handle raw HTML. You will notice that many of them convert HTML to Markdown before passing it to the LLM.
Why do they do that? Our Kaggle benchmarks highlight two major reasons:
- Cost efficiency: Markdown uses fewer tokens than HTML, which reduces API usage and cuts costs.
- Faster processing: Fewer tokens also lower compute overhead, which results in faster responses.
For more details, read our guide on why AI agents prefer markdown over HTML.
Time to implement the HTML-to-Markdown optimization!
Open the target webpage in incognito mode (to ensure a fresh session), then right-click anywhere on the page. Select “Inspect” to open the browser’s Developer Tools. Explore the DOM, and you will find that all relevant product information is inside an element with the CSS selector `#main`.
As a first approach, you might consider sending the raw HTML of the entire page to ChatGPT for data parsing. However, this would include a lot of irrelevant content (such as headers, footers, and navigation elements). That adds unnecessary noise and could open the door to hallucinations.
Instead, you should use only the content within the `#main` element. That guarantees you are sending only the most relevant part of the page to the LLM, reducing the risk of inaccurate results. So, even if you choose not to convert the HTML to Markdown, scoping the input to just the key HTML elements is a best practice when working with LLMs.
To target only the `#main` element, you need a Python HTML parsing library like Beautiful Soup. With your Python virtual environment activated, install it using this command:
pip install beautifulsoup4
If you are not familiar with its API, take a look at our tutorial on Beautiful Soup web scraping.
Then, import it in `scraper.py`:
from bs4 import BeautifulSoup
Use Beautiful Soup to:
- Parse the raw HTML fetched with Requests.
- Select the `#main` element.
- Get its HTML content.
Achieve that with the following snippet:
# Parse the HTML of the page with Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")
# Select the #main element and get its outer HTML
main_element = soup.select_one("#main")
main_html = str(main_element)
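Keep in mind that `select_one()` returns `None` when the selector matches nothing. A minimal defensive check you may want to add before converting the element to a string:

```python
# Fail early if the #main element is not found on the page
if main_element is None:
    raise RuntimeError("Could not find the #main element on the page")
```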
If you print `main_html`, you should see:
<main id="main" class="site-main" role="main" data-testid="main-content" data-content="main-area">
<div class="woocommerce-notices-wrapper"
id="notices-wrapper"
data-testid="notices-wrapper"
data-sorting="notices">
</div>
<div id="product-267"
class="product type-product post-267 status-publish first outofstock
product_cat-hoodies-sweatshirts has-post-thumbnail
shipping-taxable purchasable product-type-variable">
<!-- omitted for brevity... -->
</div>
</main>
Now, use the official OpenAI Tokenizer tool to check how many tokens the selected HTML string corresponds to. Then, estimate the cost of sending those tokens to OpenAI with a tool like the LLM API Pricing Calculator.
As you can see, this approach results in over 20,000 tokens, which translates to approximately $0.10 per request. That may not seem like much at first, but in a large-scale web scraping project involving thousands of pages, the costs can add up quickly.
To reduce token consumption, convert the extracted HTML into Markdown using a package like `markdownify`. Install it in your ChatGPT scraping project with:
pip install markdownify
Do not forget to import `markdownify` in `scraper.py`:
from markdownify import markdownify
Then, use it to convert the HTML from `#main` to Markdown with this line of code:
main_markdown = markdownify(main_html)
You can preview the conversion using any of the many HTML-to-Markdown converters available online. Comparing the two versions, you will notice that the Markdown input is much more compact than the original `#main` HTML. Yet, if you inspect it, you will see that it preserves all the essential data needed for scraping.
Use the OpenAI Tokenizer once again to see how many tokens the new Markdown input consumes.
With this simple trick, you reduced 21,504 tokens down to 956 tokens. That is a 95%+ token reduction!
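If you prefer to check these numbers programmatically instead of pasting text into the online tool, you can approximate them locally with OpenAI's `tiktoken` library. A minimal sketch, assuming `tiktoken` is installed (`pip install tiktoken`):

```python
import tiktoken

# gpt-4o uses the o200k_base encoding
encoding = tiktoken.encoding_for_model("gpt-4o")

# Compare the token footprint of the HTML and Markdown inputs
print("HTML tokens:", len(encoding.encode(main_html)))
print("Markdown tokens:", len(encoding.encode(main_markdown)))
```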
This also translates into a huge reduction in OpenAI API costs per request: the price dropped from roughly $0.10 to just about $0.006 per request!
Step #5: Use ChatGPT for Data Parsing
The OpenAI SDK provides a specialized method called `parse()` for structured data extraction. That method accepts:
- The model to use for parsing (e.g., `"gpt-4o"`).
- The output format, defined as a Pydantic model.
- A standard completions input (your system and user messages).
The `parse()` method will return an instance of the specified Pydantic class, or `None` if the model fails to produce a valid response.
Start by defining a `Product` class representing the structure of the product data you want to extract:
class Product(BaseModel):
sku: Optional[str] = None
name: Optional[str] = None
images: Optional[List[str]] = None
price: Optional[str] = None
description: Optional[str] = None
sizes: Optional[List[str]] = None
colors: Optional[List[str]] = None
category: Optional[str] = None
This Pydantic class maps the expected product structure found on the page. All fields are marked as optional, which is important because:
- Not all product pages may have all the data fields.
- If a required field is missing on the page, the model will not be forced to fill it out with potentially hallucinated data just to complete the structure.
Do not forget to add these two imports in `scraper.py`:
from pydantic import BaseModel
from typing import List, Optional
Then, define your scraping task with:
input = [
{
"role": "system",
"content": (
"You are a scraping agent that extracts structured product data in the specified format."
),
},
{
"role": "user",
"content": (
f"""
Extract product data from the given content.
CONTENT:\n
{main_markdown}
"""
),
},
]
The above prompt instructs ChatGPT to extract structured data from the `main_markdown` content using the format defined in your `Product` class.
Tip: Test and refine the prompt in ChatGPT to ensure it returns the structure you expect before integrating it into your code.
Now call the `parse()` method with your configured parameters:
response = client.responses.parse(
model="gpt-4o",
input=input,
text_format=Product,
)
You can now access the resulting parsed object using:
product = response.output_parsed
`product` will be an instance of the `Product` class (or `None`, if the data parsing process fails).
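As a quick sanity check, you can print a couple of the parsed fields (the sample values below assume the target product page from this tutorial):

```python
if product is not None:
    print(product.name)   # e.g., "Mach Street Sweatshirt"
    print(product.price)  # e.g., "$62.00"
```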
Fantastic! You just relied on GPT models to extract structured product data from a webpage.
Step #6: Export the Scraped Data
At this point, the scraped data is stored in a Python object that is an instance of the `Product` class. Note that this object could also be `None` if the parsing failed.
To export that object to a JSON file, use the following code:
if product is not None:
with open("product.json", "w", encoding="utf-8") as json_file:
json.dump(product.model_dump(), json_file, indent=4)
else:
print("Extracted product was None!")
This will create a `product.json` file containing the scraped data in structured JSON format.
For the above snippet to work, remember to import `json` from the Python Standard Library:
import json
Well done! Your ChatGPT-powered web scraper is now complete.
Step #7: Put It All Together
Below is the complete code for `scraper.py`, a script that uses ChatGPT for web scraping:
from openai import OpenAI
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
from pydantic import BaseModel
from typing import List, Optional
import json
# The Pydantic class representing the structure of the object to scrape
class Product(BaseModel):
sku: Optional[str] = None
name: Optional[str] = None
images: Optional[List[str]] = None
price: Optional[str] = None
description: Optional[str] = None
sizes: Optional[List[str]] = None
colors: Optional[List[str]] = None
category: Optional[str] = None
# Your OpenAI API key
OPENAI_API_KEY = "<YOUR_OPENAI_API_KEY>"
# Initialize the OpenAI SDK client
client = OpenAI(api_key=OPENAI_API_KEY) # In production, read the OpenAI API key from the environment
# Retrieve the HTML content of the target page
url = "https://www.scrapingcourse.com/ecommerce/product/mach-street-sweatshirt/"
response = requests.get(url)
# Parse the HTML of the page with Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")
# Select the #main element, get its HTML, and convert it to Markdown
main_element = soup.select_one("#main")
main_html = str(main_element)
main_markdown = markdownify(main_html)
# Define the input for the scraping task
input = [
{
"role": "system",
"content": (
"You are a scraping agent that extracts structured product data in the specified format."
),
},
{
"role": "user",
"content": (
f"""
Extract product data from the given content.
CONTENT:\n
{main_markdown}
"""
),
},
]
# Perform the data parsing request with OpenAI
response = client.responses.parse(
model="gpt-4o",
input=input,
text_format=Product,
)
# Get the parsed product data
product = response.output_parsed
# If OpenAI returned the desired content
if product is not None:
# Export the scraped data to JSON
with open("product.json", "w", encoding="utf-8") as json_file:
json.dump(product.model_dump(), json_file, indent=4)
else:
print("Extracted product was None!")
Execute the script with the following command:
python scraper.py
The scraper will take a little while and then produce a `product.json` file in your project’s folder. Open it, and you should see:
{
"sku": "MH10",
"name": "Mach Street Sweatshirt",
"images": [
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh10-blue_main.jpg",
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh10-blue_alt1.jpg",
"https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh10-blue_back.jpg"
],
"price": "$62.00",
"description": "From hard streets to asphalt track, the Mach Street Sweatshirt holds up to wear and wind and rain. An infusion of performance and stylish comfort, with moisture-wicking LumaTech\u2122 fabric, it\u2019s bound to become an everyday part of your active lifestyle.\n\n\u2022 Navy heather crewneck sweatshirt. \n\u2022 LumaTech\u2122 moisture-wicking fabric. \n\u2022 Antimicrobial, odor-resistant. \n\u2022 Zip hand pockets. \n\u2022 Chafe-resistant flatlock seams. \n\u2022 Rib-knit cuffs and hem.",
"sizes": [
"XS",
"S",
"M",
"L",
"XL"
],
"colors": [
"Black",
"Blue",
"Red"
],
"category": "Hoodies & Sweatshirts"
}
Et voilà! The ChatGPT scraper transformed unstructured data from an HTML page into a neatly organized JSON file.
Next Steps
To improve your ChatGPT scraper, consider the following enhancements:
- Make it reusable: Refactor the script to accept the target URL and model name as command-line arguments. This adds flexibility to your scraping logic.
- Secure your API key: Instead of hardcoding your OpenAI API key in the script, store it in a `.env` file and use `python-dotenv` to load it safely. Alternatively, set it as a global environment variable named `OPENAI_API_KEY`. Both approaches improve security by keeping sensitive credentials out of your codebase. See the sketch below for both ideas.
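Below is a minimal sketch combining both ideas. It assumes a `.env` file containing `OPENAI_API_KEY=<your_key>` in the project folder and that `python-dotenv` is installed (`pip install python-dotenv`); the argument names are illustrative:

```python
import argparse

from dotenv import load_dotenv
from openai import OpenAI

# Load OPENAI_API_KEY (and any other secrets) from the local .env file
load_dotenv()

# Accept the target URL and model name from the command line
parser = argparse.ArgumentParser(description="ChatGPT-powered web scraper")
parser.add_argument("url", help="URL of the page to scrape")
parser.add_argument("--model", default="gpt-4o", help="OpenAI model to use")
args = parser.parse_args()

# The SDK automatically reads OPENAI_API_KEY from the environment
client = OpenAI()
```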
For integration of OpenAI models in other AI scraping tools, refer to the following guides:
- LLM Web Scraping with ScrapeGraphAI
- AI-Powered Web Scraping with `llm-scraper`
- AI-Powered Web Scraping in Dify via a No-Code Workflow
Overcoming the Biggest Limitation of This AI-Powered Scraping Approach
The main bottleneck is the HTTP request made by `requests`. The above example works because the target site welcomes web scraping (remember that the name of the target site is “Ecommerce Test Site to Learn Web Scraping”). In the real world, things are different. Most sites protect their content with anti-scraping measures that can block your automated requests.
When this happens, your script will fail with a `403 Forbidden` error (or similar) like:
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: <YOUR_TARGET_URL>
Additionally, the scraping method presented here does not work on websites that rely heavily on JavaScript. If you are targeting a dynamic site, `requests` is not enough, and you need a browser automation solution like Playwright or Selenium, as shown in the sketch below. So, even without strong anti-bot protection, many sites can break your ChatGPT scraper.
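For example, here is a minimal sketch of how the `requests` call could be replaced with Playwright to get the fully rendered HTML of a dynamic page (assuming you ran `pip install playwright` and then `playwright install`):

```python
from playwright.sync_api import sync_playwright

url = "https://www.scrapingcourse.com/ecommerce/product/mach-street-sweatshirt"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Load the page and let its JavaScript execute
    page.goto(url)
    # content() returns the rendered HTML, which you can feed to Beautiful Soup as before
    html = page.content()
    browser.close()
```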
The best way to overcome these problems is by using a dedicated Web Unlocking API!
Bright Data’s Web Unlocker API is a powerful scraping endpoint that you can call from any HTTP client. It returns the fully unlocked HTML of any URL, which means it automatically bypasses anti-scraping blocks for you.
This is possible because it is supported by:
- A proxy network of over 150 million IPs.
- Fingerprint spoofing.
- CAPTCHA-solving capabilities.
- Many other advanced features.
Even better, Web Unlocker can return AI-optimized Markdown instead of raw HTML. This lets you skip the HTML-to-Markdown conversion step entirely (Step #4). That way, you can build an effective AI-powered scraper for any website with just a few lines of code.
To get started, follow the official Web Unlocker documentation and retrieve your API key. Then, replace the code from “Step #3” and “Step #4” with these lines:
WEB_UNLOCKER_API_KEY = "<YOUR_WEB_UNLOCKER_API_KEY>"
# Set up authentication headers for Web Unlocker
headers = {
"Content-Type": "application/json",
"Authorization": f"Bearer {WEB_UNLOCKER_API_KEY}"
}
# Define the request payload
payload = {
"zone": "unblocker",
"url": "https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/", # Replace with your target URL
"format": "raw",
"data_format": "markdown" # To get a response in Markdown"
}
# Fetch the unlocked HTML of the target page
response = requests.post("https://api.brightdata.com/request", json=payload, headers=headers)
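Since the payload above requests Markdown directly, the response body is already LLM-ready. Assuming the request succeeds, you can feed it straight into the ChatGPT parsing step from earlier:

```python
# Fail on anti-bot or API errors
response.raise_for_status()

# The Markdown version of the unlocked page, ready for the parse() call
main_markdown = response.text
```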
And just like that, no more blocks, no more limitations. You now have a fully functional AI web scraper powered by ChatGPT and Web Unlocker.
See the OpenAI SDK and Web Unlocker working together in a more complex scraping scenario.
Conclusion
In this tutorial, you learned how to combine ChatGPT (GPT models) with Requests and other tools to build an AI-powered web scraper. You also saw how to address one of the biggest challenges, getting blocked, by using Bright Data’s Web Unlocker API.
As discussed, integrating ChatGPT with the Web Unlocker API lets you extract prompt-ready data from any site, all without writing custom parsing code. This is just one of the many scenarios covered by Bright Data’s AI products and services.
Sign up now to Bright Data for free and experiment with our scraping solutions!