LLM Web Scraping with ScrapeGraphAI

Learn how ScrapeGraphAI uses large language models to simplify web scraping and automate data extraction.

Traditional web scraping often involves writing intricate, time-consuming code tailored to specific website layouts, which can easily break when sites change. ScrapeGraphAI utilizes large language models (LLMs) to extract information and interpret it like a human would, allowing you to focus on the data instead of the layout. Integrating LLMs with ScrapeGraphAI enhances data extraction, automates content aggregation, and enables real-time analysis.

In this article, you’ll learn how to use ScrapeGraphAI for web scraping. But before that, let us introduce Bright Data’s solutions that can save you both time and money.

Bright Data’s Web Scraping Solutions

Bright Data offers a comprehensive suite of web scraping solutions, optimized for efficient, scalable, and compliant data extraction:

  • Web Scraper API: A robust API that automates structured data extraction from dozens of popular websites, even those with complex elements like dynamic content, JavaScript, and CAPTCHAs. It ensures scalability with built-in proxy management and rotation, avoiding IP bans while maintaining high request success rates. Ideal for large-scale, automated data collection.
  • Ready-to-Use Datasets: Instantly access a vast selection of pre-collected datasets from leading websites, continuously refreshed to ensure accuracy and relevance. These datasets save significant time and resources by offering high-quality data without the need for custom scraping efforts.
  • Custom Datasets: For specialized needs, Bright Data provides customizable data collection services. Whether you need real-time data from niche websites or historical records, these tailored datasets offer precision and flexibility, allowing you to target specific data points. Custom datasets can be managed by you or by Bright Data.

These solutions enable fast, accurate, and scalable data collection, perfect for projects of any scope, from small-scale applications to enterprise-level needs.

Implementing LLM Web Scraping with ScrapeGraphAI

Before starting this tutorial, you need the following prerequisites:

  • Python installed on your machine
  • An OpenAI API key, as this tutorial uses OpenAI’s gpt-4o-mini model

If you’re unfamiliar with web scraping in Python, check out this web scraping tutorial to get started.

Set Up Your Environment

The first thing you need to do is create a virtual environment. Open your terminal, navigate to your project directory, and run the following command:

python -m venv venv

Then, activate the virtual environment. On macOS and Linux, you can do so with the following command:

source venv/bin/activate

On Windows, you can use this command:

venv\Scripts\activate

Once you’ve activated the virtual environment, you need to install ScrapeGraphAI and its dependencies:

pip install scrapegraphai
playwright install

The playwright install command downloads the browser binaries that ScrapeGraphAI needs: Chromium, Firefox, and WebKit.
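
If you only need a single browser, Playwright also lets you install one selectively, for example:

playwright install chromium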

To manage environment variables securely, install python-dotenv:

pip install python-dotenv

It’s important to protect sensitive information like API keys. Storing environment variables in a .env file keeps them separate from your code files.

Create a new file named .env in the project directory and add the following line specifying your OpenAI key:

OPENAI_API_KEY="your-openai-api-key"

This file should not be committed to version control systems like Git. To prevent this, add .env to your .gitignore file.
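
For example, your .gitignore might contain these entries, which also exclude the virtual environment you created earlier:

.env
venv/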

Scrape Data with ScrapeGraphAI

In this tutorial, you’ll start by scraping product data from Books to Scrape, a demo website specifically for practicing web scraping techniques. This website mimics an online bookstore, offering a variety of books across different genres, complete with prices, ratings, and availability status:

Books to Scrape website

In traditional HTML web scraping, you would need to analyze the page’s HTML, manually inspecting elements and tags to locate the data that you want. This process is time-consuming and requires a solid understanding of web structures. With ScrapeGraphAI, you only need to specify the data you want using a prompt, and the LLM is intelligent enough to extract it.

ScrapeGraphAI offers different types of graphs for different scraping needs. These graphs define how the scraping process is structured and what it aims to accomplish. Here’s a brief overview of some of the different graphs:

  • SmartScraperGraph is a single-page scraper where you provide a prompt and a URL or local file. It uses an LLM to extract the information you specify.
  • SearchGraph is a multi-page scraper that extracts information from search engine results based on your prompt.
  • SpeechGraph extends SmartScraperGraph by adding a text-to-speech feature, generating an audio file of the extracted content.
  • ScriptCreatorGraph takes a prompt and URL and, instead of generating results, outputs a Python script capable of scraping the given URL.

If these graphs don’t fit your exact needs, ScrapeGraphAI also lets you create custom graphs by combining different nodes and tailoring the scraping process to your specific requirements.
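
For example, a SearchGraph is constructed much like the SmartScraperGraph you’ll build below, except it takes no source URL because it finds pages through a search engine. Here’s a minimal sketch, assuming the same graph_config you’ll define later in this tutorial:

from scrapegraphai.graphs import SearchGraph

# SearchGraph only needs a prompt and a config; it locates pages itself
search_graph = SearchGraph(
    prompt="List the best-rated science fiction books with their authors.",
    config=graph_config
)

result = search_graph.run()
print(result)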

Along with choosing the right type of graph, you also need to properly configure your scraper, especially when it comes to the prompt and model selection. The prompt guides the LLM in understanding exactly what data to extract, so make sure it is clear and explicit. The LLM model you choose determines how well the scraper processes and interprets the website’s content. You can also configure other options, such as proxies for accessing georestricted content and headless mode to keep the process efficient and fast. Proper configuration ultimately determines how precise and relevant the scraped data will be.
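
For instance, verbose logging and headless mode are set at the top level of the configuration dictionary, alongside the llm settings. Here’s a minimal sketch based on ScrapeGraphAI’s configuration options:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,   # log each step of the scraping pipeline
    "headless": True,  # run the browser without a visible window
}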

Write the Scraper Code

To write the scraper code, create a new file named app.py and add the following lines of code to it:

from dotenv import load_dotenv
import os
from scrapegraphai.graphs import SmartScraperGraph

# Load environment variables from .env file
load_dotenv()

# Access the OpenAI API key
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

# Configuration for ScrapeGraphAI
graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    }
}

# Define the prompt and source
prompt = "Extract the title, price and availability of all books on this page."
source = "http://books.toscrape.com/"

# Create the scraper graph
smart_scraper_graph = SmartScraperGraph(
    prompt=prompt,
    source=source,
    config=graph_config
)

# Run the scraper
result = smart_scraper_graph.run()

# Output the results
print(result)

This code imports essential modules like os and dotenv for handling environment variables, as well as the SmartScraperGraph class from ScrapeGraphAI, which performs the scraping. It then loads the environment variables via dotenv to keep sensitive data like API keys secure. Next, the code creates an LLM configuration for scraping, specifying which model to use and the API key. This configuration, along with the site URL and scraping prompt, defines your SmartScraperGraph, which is executed using the run() method, triggering the process that gathers the specified data.

To run this code, open your terminal and run python app.py. Your output should look like this:

{
    "books": [
        {
            "title": "A Light in the Attic",
            "price": "£51.77",
            "availability": "In stock"
        },
        {
            "title": "Tipping the Velvet",
            "price": "£53.74",
            "availability": "In stock"
        },
        ...
    ]
}

Note: If you encounter issues running the code, you might need to manually install the grpcio package, which is an underlying dependency of ScrapeGraphAI. You can do this with the following command:

pip install grpcio

While ScrapeGraphAI makes the data extraction part of web scraping easy, there are still some common challenges, like CAPTCHAs and IP blocks, that you need to know how to handle.

To mimic browsing behavior, you can implement timed delays in your code. You can also utilize rotating proxies to avoid detection. Additionally, CAPTCHA-solving services like Bright Data’s CAPTCHA solver or Anti Captcha can be integrated into your scraper to automatically solve CAPTCHAs for you.
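
For example, if you extend the scraper to walk through multiple catalogue pages, you can add a randomized pause between requests. Here’s a minimal sketch that reuses the graph_config defined earlier and follows Books to Scrape’s catalogue URL pattern:

import random
import time

from scrapegraphai.graphs import SmartScraperGraph

# Books to Scrape paginates its catalogue as page-1.html, page-2.html, ...
urls = [f"http://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 4)]

results = []
for url in urls:
    graph = SmartScraperGraph(
        prompt="Extract the title, price and availability of all books on this page.",
        source=url,
        config=graph_config  # reuse the configuration defined earlier
    )
    results.append(graph.run())
    # Pause for a random 2-5 seconds between pages to mimic human browsing
    time.sleep(random.uniform(2, 5))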

Please note: Always ensure that you’re compliant with a website’s terms of service. Scraping for personal use is often acceptable, but redistributing data can have legal implications.

Using Proxies with ScrapeGraphAI

ScrapeGraphAI lets you set up a proxy service to avoid IP blocking and access georestricted content. You can use a free proxy service or configure your own custom proxy server.

To use a free proxy service, add the following to your graph_config:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "loader_kwargs": {
        "proxy": {
            "server": "broker",
            "criteria": {
                "anonymous": True,
                "secure": True,
                "countryset": {"US"},
                "timeout": 10.0,
                "max_tries": 3
            },
        },
    }
}

This configuration tells ScrapeGraphAI to use a free proxy service that matches your criteria.

To use a custom proxy server from a provider like Bright Data, alter your graph_config as follows, inserting your server URL, username, and password:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "loader_kwargs": {
        "proxy": {
            "server": "http://your_proxy_server:port",
            "username": "your_username",
            "password": "your_password",
        },
    }
}

Using a custom proxy server offers several advantages, especially for large-scale web scraping. It gives you control over the proxy’s location, allowing you to scrape georestricted content. Additionally, custom proxies are more reliable and secure than free proxies, reducing the chances of your IP being blocked or rate-limited.

Clean and Prepare Data

After scraping data, you need to clean and preprocess it, especially if you plan on feeding it into an AI model. Clean data ensures that your models learn from accurate and consistent information, which directly impacts their performance and reliability. Data cleaning typically involves handling missing values, correcting data types, normalizing text, and removing duplicates.

Here’s an example of how you can clean the data you scraped earlier using pandas:

import pandas as pd

# Convert the result to a DataFrame
df = pd.DataFrame(result["books"])

# Remove currency symbols and convert prices to float
df['price'] = df['price'].str.replace('£', '').astype(float)

# Standardize availability text
df['availability'] = df['availability'].str.strip().str.lower()

# Handle missing values if any
df.dropna(inplace=True)
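
# Remove duplicate rows, in case the scraper returned repeated entries
df.drop_duplicates(inplace=True)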

# Preview the cleaned data
print(df.head())

This code converts the scraped result into a DataFrame, removes the currency symbol from the book prices and casts them to floats, standardizes the availability text to lowercase, drops any rows with missing values, and removes duplicate rows.

Before running the code, you need to install the pandas library for data manipulation:

pip install pandas

To run the code, open your terminal and run python app.py. Your output should look like this:

                                   title  price availability
0                   A Light in the Attic  51.77     in stock
1                     Tipping the Velvet  53.74     in stock
2                             Soumission  50.10     in stock
3                          Sharp Objects  47.82     in stock
4  Sapiens: A Brief History of Humankind  54.23     in stock

This is only an example of how you can clean scraped data; the exact process varies based on your data and the use case you’re training for. When you clean your data, you ensure that your language models receive well-structured and meaningful input. If you’re looking to learn more about how to utilize data for AI projects, consider exploring data for AI.

All the code for this tutorial is available in this GitHub repo.

Conclusion

ScrapeGraphAI uses LLMs to provide an adaptive approach to web scraping, adjusting to changes in website structures and extracting data intelligently. However, scaling web scraping comes with challenges, like IP blocks, CAPTCHAs, and legal compliance requirements.

To help overcome these challenges, Bright Data provides a comprehensive suite of web scraping solutions tailored for AI and machine learning projects. This includes the Bright Data Web Scraper APIs, proxy services, and Serverless Scraping. Beyond this, Bright Data also offers ready-to-use datasets filled with data from over a hundred popular websites.

Start your free trial today! No credit card required.