Web Scraping With LangChain and Bright Data

Discover how to combine web scraping with LangChain for real-world LLM data enrichment in this detailed step-by-step guide.

In this guide, you will learn:

  • Why web scraping is an excellent method for enriching LLMs with real-world data
  • The benefits and challenges of using scraped data in LangChain workflows
  • How to create a complete LangChain web scraping integration in a step-by-step tutorial

Let’s dive in!

Using Web Scraping to Power Your LLM Applications

Web scraping involves retrieving data from web pages. That data can then be used to fuel RAG (Retrieval-Augmented Generation) applications and get more value out of LLMs (Large Language Models).

RAG applications require access to real-time, domain-specific, or expansive datasets that may not be available in static databases. Web scraping bridges this gap by extracting structured and unstructured data from diverse web sources like articles, product listings, or social media.

Learn more in our article on collecting LLM training data.

Benefits and Challenges of Using Scraped Data in LangChain

LangChain is a powerful framework for building AI-driven workflows, enabling seamless integration of LLMs with diverse data sources. It excels at data analysis, summarization, and question-answering by combining LLMs with real-time, domain-specific knowledge. However, acquiring high-quality data is always a problem.

Web scraping can tackle that problem, but it comes with several challenges, including anti-bot measures, CAPTCHAs, and dynamic websites. Maintaining compliant and efficient scrapers can also be time-consuming and technically complex. For more details, check out our guide on anti-scraping measures.

These hurdles can slow the development of AI-powered applications that depend on real-time data. The solution? Bright Data’s Web Scraper API—a ready-to-use tool offering scraping endpoints for hundreds of websites.

With advanced features like IP rotation, CAPTCHA solving, and JavaScript rendering, Bright Data automates data extraction seamlessly. That ensures reliable, efficient, and hassle-free data collection, all accessible through simple API calls.

LangChain Web Scraping Powered By Bright Data: Step-by-Step Guide

In this section, you will learn how to build a LangChain web scraping script. The goal will be to retrieve content from a CNN article using the Bright Data Web Scraper API and send it to OpenAI for summarization via LangChain.

We will use the following CNN article as the target:

An article on CNN about Christmas

The example we are going to build here is a simple starting point, but it is easy to extend with additional features and analyses using LangChain. For instance, you could even create a RAG chatbot based on SERP data.

Follow the steps below to get started!

Prerequisites

To get through this tutorial, you will need the following:

  • Python 3+ installed on your machine
  • An OpenAI API key
  • A Bright Data account

Do not worry if you are missing any of these. We will guide you through the entire process, from installing Python to obtaining your OpenAI and Bright Data credentials.

Step #1: Project Setup

First of all, check if Python 3 is installed on your machine. If not, download and install it.

Run this command in the terminal to create a folder for your project:

mkdir langchain_scraping

The langchain_scraping folder will contain your Python LangChain scraping project.

Then, navigate to the project folder and initialize a Python virtual environment inside it:

cd langchain_scraping
python3 -m venv env

Note: On Windows, use python instead of python3.

Now, open the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.

Inside langchain_scraping, add a script.py file. This is an empty Python script, but it will soon contain your LangChain web scraping logic.

In the IDE’s terminal, activate the virtual environment with the command below:

source ./env/bin/activate

Or, on Windows, run:

env/Scripts/activate

Awesome! You are now fully set up.

Step #2: Install the Required Libraries

The Python LangChain scraping project relies on the following libraries:

  • python-dotenv: To load environment variables from a .env file. It will be used to manage sensitive information like your Bright Data and OpenAI credentials.
  • requests: To perform HTTP requests to interact with Bright Data’s Web Scraper API.
  • langchain_openai: The LangChain integrations for OpenAI, built on top of its openai SDK.

In an activated virtual environment, install all the dependencies with this command:

pip install python-dotenv requests langchain-openai

Amazing! You are ready to write some scraping logic.

Step #3: Prepare Your Project

In script.py, add the following imports:

from dotenv import load_dotenv
import os

These two imports allow you to load environment variables from a file and read them in your script.

Note: os comes from the Python Standard Library, so you do not have to install it.

Then, create a .env file in your project folder to store all your credentials. Here is what your current project file structure should look like:

The new .env file and how your file structure should look

Instruct python-dotenv to load the environment variables from .env with this line in script.py:

load_dotenv()

You can now read environment variables from .env files or the system with:

os.environ.get("<ENV_NAME>")
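
If you want the script to fail fast when a required credential is missing, you can wrap that call in a small check. Below is a minimal sketch using a hypothetical variable name; the actual credentials are added in the next steps:

# A minimal sketch: stop early if a required variable is missing
# ("SOME_API_KEY" is a hypothetical placeholder, not a real credential name)
api_key = os.environ.get("SOME_API_KEY")
if not api_key:
    raise RuntimeError("SOME_API_KEY is not set. Check your .env file.")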

Cool! Time to configure Bright Data’s Web Scraper API solution.

Step #4: Configure Web Scraper API

As mentioned at the beginning of this article, web scraping comes with several challenges. Fortunately, it becomes significantly easier with an all-in-one solution like Bright Data’s Web Scraper APIs. These APIs allow you to retrieve parsed content from over 100 websites effortlessly.

As an alternative approach, see our tutorial on how to scrape news articles.

To set up Web Scraper API, refer to the official documentation or follow the instructions below.

If you have not already, create a Bright Data account. After logging in, go to your account dashboard. Here, click on the “Web Scraper API” button on the left:

Choosing Web Scraper API from the menu on the left

Since the target site is CNN.com, type “cnn” in the search input and select the “CNN news — Collect by URL” scraper:

Searching for the CNN Scraper API

On the current page, click on the “Create token” button to generate a Bright Data API token:

Creating a new token for the API

This should open the following modal, where you can configure the details of your token:

Configuring the details of the new token

Once done, click “Save” and copy the value of your Bright Data API token.

Once you click “Save”, the new token is shown

In your .env file, store this information as below:

BRIGHT_DATA_API_TOKEN="<YOUR_BRIGHT_DATA_API_TOKEN>"

Replace <YOUR_BRIGHT_DATA_API_TOKEN> with the value you copied from the modal.

Your CNN news Web Scraper API page should now look similar to the example below:

The CNN Scraper API page

Here we go! Time to configure your Web Scraper API request and start playing with it.

Step #5: Use Bright Data for Web Scraping

The Web Scraper API launches a web scraping task configured according to your needs on the page seen earlier. Then, this process generates a snapshot containing the scraped data.

Below is an overview of how the Web Scraper API scraping process works:

  1. You make a request to the Web Scraper API, providing the pages to scrape via URLs.
  2. A web scraping task is launched to retrieve and parse data from those URLs.
  3. You repeatedly query a snapshot retrieval API to fetch the resulting data once the task is complete.

The POST endpoint for the CNN Web Scraper API is:

"https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"

That endpoint accepts an array of objects containing url fields and returns a response like this:

{"snapshot_id":"<YOUR_SNAPSHOT_ID>"}

Using the snapshot_id from this response, you then need to query the following endpoint to retrieve your data:

https://api.brightdata.com/datasets/v3/snapshot/<YOUR_SNAPSHOT_ID>?format=json

This endpoint returns HTTP status code 202 if the task is still in progress and 200 when the task is complete and the data is ready. The recommended approach is to poll this endpoint every 10 seconds until the task is finished.

Once the task is complete, the endpoint will return data in the following format:

[
    {
        "input": {
            "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/",
            "keyword": ""
        },
        "id": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/index.html",
        "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/index.html",
        "author": "Mary Gilbert",
        "headline": "White Christmas forecast: Will you be left dreaming of snow or reveling in it?",
        "topics": [
            "weather"
        ],
        "publication_date": "2024-12-16T13:20:52.800Z",
        "updated_last": "2024-12-16T13:20:52.800Z",
        "content": "Christmas is approaching nearly as fast as Santa’s sleigh, but almost anyone in the United States fantasizing about a movie-worthy white Christmas might need to keep dreaming. Early forecasts indicate temperatures could max out around 10 to 15 degrees above normal for much of the country on Christmas Day. [omitted for brevity...]",
        "videos": null,
        "images": [
                "omitted for brevity..."
        ],
        "related_articles": [],
        "keyword": null,
        "timestamp": "2024-12-16T14:18:14.101Z"
    }
]

The content attribute contains the parsed article data, representing the information you want to access.

To implement this, first read the API token from the .env file and define the trigger endpoint URL constant:

BRIGHT_DATA_API_TOKEN = os.environ.get("BRIGHT_DATA_API_TOKEN")
BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"

Next, you can turn the above process into a reusable function using the following code:

def get_scraped_data(url):
    # Authorization headers
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}"
    }
    # Web Scraper API payload
    data = [{
        "url": url
    }]

    # Making the POST request to the Bright Data Web Scraper API
    response = requests.post(BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL, headers=headers, json=data)

    if response.status_code == 200:
        response_data = response.json()
        snapshot_id = response_data.get("snapshot_id")
        if snapshot_id:
            # Iterate until the snapshot is ready
            snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}?format=json"

            while True:
                snapshot_response = requests.get(snapshot_url, headers=headers)

                if snapshot_response.status_code == 200:
                    # Parse and return the snapshot data
                    snapshot_response_data = snapshot_response.json()
                    return snapshot_response_data[0].get("content")
                elif snapshot_response.status_code == 202:
                    print("Snapshot not ready yet. Retrying in 10 seconds...")
                    time.sleep(10)  # Wait for 10 seconds before retrying
                else:
                    print(f"Failed to retrieve snapshot. Status code: {snapshot_response.status_code}")
                    print(snapshot_response.text)
                    break
        else:
            print("Snapshot ID not found in the response")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)

To make it work, add these two imports:

import requests
import time
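
To sanity-check the function in isolation, you could temporarily call it with the target article URL and print part of the returned content. This is just a throwaway test snippet, not part of the final script:

if __name__ == "__main__":
    test_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
    content = get_scraped_data(test_url)
    if content:
        # Print only the first 300 characters to keep the output readable
        print(content[:300])
    else:
        print("No content returned")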

Incredible! You just learned how to use Bright Data’s Web Scraper API for web scraping.

Step #6: Get Ready to Use OpenAI Models

This example relies on OpenAI models for LLM integration within LangChain. To use those models, you must configure an OpenAI API key in your environment variables.

By default, langchain_openai automatically reads the OpenAI API key from the OPENAI_API_KEY environment variable. To set this up, add the following line to your .env file:

OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"

Replace <YOUR_OPENAI_API_KEY> with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
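
If you prefer not to rely on the variable being picked up implicitly, ChatOpenAI also accepts the key as an explicit parameter. A minimal sketch, assuming the key has already been loaded from .env:

from langchain_openai import ChatOpenAI
import os

# Pass the key explicitly instead of relying on OPENAI_API_KEY being auto-detected
model = ChatOpenAI(model="gpt-4o-mini", api_key=os.environ.get("OPENAI_API_KEY"))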

Great! Time to use OpenAI models in your LangChain scraping script.

Step #7: Generate the LLM Prompt

Define a function that takes the scraped data and produces a prompt to get a summary of the article:

def create_summary_prompt(content, words=100):
    return f"""Summarize the following content in less than {words} words.

           CONTENT:
           '{content}'
           """

In the current example, the complete prompt will be:

Summarize the following content in less than 100 words.

CONTENT:
'Christmas is approaching nearly as fast as Santa’s sleigh, but almost anyone in the United States fantasizing about a movie-worthy white Christmas might need to keep dreaming. Early forecasts indicate temperatures could max out around 10 to 15 degrees above normal for much of the country on Christmas Day. It’s a forecast reminiscent of last Christmas for many, which came amid the warmest winter on record in the US. But the country could be split in two by warmth and cold in the run up to the big day. [omitted for brevity...]'

If you pass it to ChatGPT, you should get the desired result:

Passing the task of summarizing the content in less than 100 words

That is enough to confirm the prompt works like a charm!

Step #8: Integrate OpenAI

First, call the get_scraped_data() function to retrieve the content from the article page:

article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
scraped_data = get_scraped_data(article_url)

If scraped_data is not None, generate the prompt:

if scraped_data is not None:
    prompt = create_summary_prompt(scraped_data)

Finally, pass it to a LangChain ChatOpenAI object configured to use the GPT-4o mini model:

model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(prompt)

Do not forget to import ChatOpenAI from langchain_openai:

from langchain_openai import ChatOpenAI

At the end of the process, summary should contain something similar to the summary produced by ChatGPT in the previous step:

summary = response.content

Wow! The LangChain web scraping logic is complete.

Step #9: Export the AI-Processed Data

Now, you just need to export the data generated by the selected AI model via LangChain to a human-readable format, such as a JSON file.

To do this, initialize a dictionary with the data you want to export. Then, save it as a JSON file, as shown below:

export_data = {
    "url": article_url,
    "summary": summary
}

file_name = "summary.json"
with open(file_name, "w") as file:
    json.dump(export_data, file, indent=4)

Import json from the Python Standard Library:

import json

Congrats! Your script is ready.

Step #10: Add Some Logs

The scraping process via the Web Scraper API and the ChatGPT analysis may take some time. So, it is a good practice to include logs to track the script’s progress.

You can achieve this by adding print() statements at key steps in the script, as follows:

article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)

if scraped_data is not None:
    print("Data successfully scraped, creating summary prompt")
    prompt = create_summary_prompt(scraped_data)

    # Ask ChatGPT to perform the task specified in the prompt
    print("Sending prompt to ChatGPT for summarization")
    model = ChatOpenAI(model="gpt-4o-mini")
    response = model.invoke(prompt)

    # Get the AI result
    summary = response.content
    print("Received summary from ChatGPT")

    # Export the produced data to JSON
    export_data = {
        "url": article_url,
        "summary": summary
    }

    print("Exporting data to JSON")
    # Write the output dictionary to JSON file
    file_name = "summary.json"
    with open(file_name, "w") as file:
        json.dump(export_data, file, indent=4)
    print(f"Data exported to '${file_name}'")
else:
    print("Scraping failed")

Step #11: Put It All Together

Your final script.py file should contain:

from dotenv import load_dotenv
import os
import requests
import time
from langchain_openai import ChatOpenAI
import json

load_dotenv()

BRIGHT_DATA_API_TOKEN = os.environ.get("BRIGHT_DATA_API_TOKEN")
BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"

def get_scraped_data(url):
    # Authorization headers
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}"
    }

    # Web Scraper API payload
    data = [{
        "url": url
    }]

    # Making the POST request to the Bright Data Web Scraper API
    response = requests.post(BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL, headers=headers, json=data)

    if response.status_code == 200:
        response_data = response.json()
        snapshot_id = response_data.get("snapshot_id")
        if snapshot_id:
            # Iterate until the snapshot is ready
            snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}?format=json"

            while True:
                snapshot_response = requests.get(snapshot_url, headers=headers)

                if snapshot_response.status_code == 200:
                    # Parse and return the snapshot data
                    snapshot_response_data = snapshot_response.json()
                    return snapshot_response_data[0].get("content")
                elif snapshot_response.status_code == 202:
                    print("Snapshot not ready yet. Retrying in 10 seconds...")
                    time.sleep(10)  # Wait for 10 seconds before retrying
                else:
                    print(f"Failed to retrieve snapshot. Status code: {snapshot_response.status_code}")
                    print(snapshot_response.text)
                    break
        else:
            print("Snapshot ID not found in the response")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)


def create_summary_prompt(content, words=100):
    return f"""Summarize the following content in less than {words} words.

           CONTENT:
           '{content}'
           """

# Retrieve the content from the given web page
article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)

if scraped_data is not None:
    print("Data successfully scraped, creating summary prompt")
    prompt = create_summary_prompt(scraped_data)

    # Ask ChatGPT to perform the task specified in the prompt
    print("Sending prompt to ChatGPT for summarization")
    model = ChatOpenAI(model="gpt-4o-mini")
    response = model.invoke(prompt)

    # Get the AI result
    summary = response.content
    print("Received summary from ChatGPT")

    # Export the produced data to JSON
    export_data = {
        "url": article_url,
        "summary": summary
    }

    print("Exporting data to JSON")
    # Write the output dictionary to a JSON file
    file_name = "summary.json"
    with open(file_name, "w") as file:
        json.dump(export_data, file, indent=4)
    print(f"Data exported to '{file_name}'")
else:
    print("Scraping failed")

Can you believe it? In less than 100 lines of code, you just built an AI-based LangChain web scraping script.

Verify that it works with this command:

python3 script.py

Or, on Windows:

python script.py

The output in the terminal should be close to this one:

Scraping data from 'https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/'...
Snapshot not ready yet. Retrying in 10 seconds...
Data successfully scraped, creating summary prompt
Sending prompt to ChatGPT for summarization
Received summary from ChatGPT
Exporting data to JSON
Data exported to 'summary.json'

Open the summary.json file that appeared in the project’s directory, and you should see something like this:

{
    "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/",
    "summary": "As Christmas approaches, forecasts indicate temperatures in the US may be 10 to 15 degrees above normal, continuing a trend from last year\u2019s warm winter. The western US will likely remain warm, while the East experiences colder conditions leading up to Christmas. Some areas may see a mix of rain and snow, but a true \"white Christmas\" requires at least an inch of snow on the ground. Historically, cities like Minneapolis and Burlington have the best chances for snow, while places like New York City and Atlanta have significantly lower probabilities."
}

Et voilà! Mission complete.

Conclusion

In this tutorial, you discovered why web scraping is an excellent method for gathering data for your AI workflows and how to analyze it using LangChain. Specifically, you learned how to create a Python-based LangChain web scraping script to extract data from a CNN news article and process it with OpenAI APIs.

The main challenges with this approach include:

  1. Online sites frequently change their page structures.
  2. Many sites implement advanced anti-bot measures.
  3. Retrieving large volumes of data simultaneously can be complex and expensive.

Bright Data’s Web Scraper API offers a seamless solution for extracting data from major websites, overcoming these challenges effortlessly. This makes it an invaluable tool for supporting RAG applications and other LangChain-powered solutions.

Also, be sure to explore our additional offerings for AI and LLMs.

Sign up now to discover which of Bright Data’s proxy services or scraping products best suit your needs. Start with a free trial!

No credit card required