In this guide, you will learn:
- Why web scraping is an excellent method for enriching LLMs with real-world data
- The benefits and challenges of using scraped data in LangChain workflows
- How to create a complete LangChain web scraping integration in a step-by-step tutorial
Let’s dive in!
Using Web Scraping to Power Your LLM Applications
Web scraping involves retrieving data from web pages. That data can then be used to fuel RAG (Retrieval-Augmented Generation) applications that leverage LLMs (Large Language Models).
RAG applications require access to real-time, domain-specific, or expansive datasets that may not be available in static databases. Web scraping bridges this gap by extracting structured and unstructured data from diverse web sources like articles, product listings, or social media.
Learn more in our article on collecting LLM training data.
Benefits and Challenges of Using Scraped Data in LangChain
LangChain is a powerful framework for building AI-driven workflows, enabling seamless integration of LLMs with diverse data sources. It excels at data analysis, summarization, and question-answering by combining LLMs with real-time, domain-specific knowledge. However, acquiring high-quality data is always a problem.
Web scraping can tackle that problem, but it comes with several challenges, including anti-bot measures, CAPTCHAs, and dynamic websites. Maintaining compliant and efficient scrapers can also be time-consuming and technically complex. For more details, check out our guide on anti-scraping measures.
These hurdles can slow the development of AI-powered applications that depend on real-time data. The solution? Bright Data’s Web Scraper API—a ready-to-use tool offering scraping endpoints for hundreds of websites.
With advanced features like IP rotation, CAPTCHA solving, and JavaScript rendering, Bright Data automates data extraction seamlessly. That ensures reliable, efficient, and hassle-free data collection, all accessible through simple API calls.
LangChain Web Scraping Powered By Bright Data: Step-by-Step Guide
In this section, you will learn how to build a LangChain web scraping script. The goal will be to retrieve content from a CNN article using the Bright Data Web Scraper API and send it to OpenAI for summarization via LangChain.
We will use the following CNN article as the target:
https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/
The example we are going to build here is a simple starting point, but it is easy to extend with additional features and analyses using LangChain. For instance, you could even create a RAG chatbot based on SERP data.
Follow the steps below to get started!
Prerequisites
To get through this tutorial, you will need the following:
- Python 3+ installed on your machine
- An OpenAI API key
- A Bright Data account
Do not worry if you are missing any of these. We will guide you through the entire process, from installing Python to obtaining your OpenAI and Bright Data credentials.
Step #1: Project Setup
First of all, check if Python 3 is installed on your machine. If not, download and install it.
Run this command in the terminal to create a folder for your project:
mkdir langchain_scraping
The langchain_scraping directory will contain your Python LangChain scraping project.
Then, navigate to the project folder and initialize a Python virtual environment inside it:
cd langchain_scraping
python3 -m venv env
Note: On Windows, use python instead of python3.
Now, open the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Inside langchain_scraping, add a script.py file. This is an empty Python script for now, but it will soon contain your LangChain web scraping logic.
In the IDE’s terminal, activate the virtual environment with the command below:
source ./env/bin/activate
Or, on Windows, run:
env/Scripts/activate
Awesome! You are now fully set up.
Step #2: Install the Required Libraries
The Python LangChain scraping project relies on the following libraries:
- python-dotenv: To load environment variables from a .env file. It will be used to manage sensitive information like Bright Data and OpenAI credentials.
- requests: To perform HTTP requests to interact with Bright Data’s Web Scraper API.
- langchain_openai: LangChain integrations for OpenAI through its openai SDK.
In an activated virtual environment, install all the dependencies with this command:
pip install python-dotenv requests langchain-openai
Amazing! You are ready to write some scraping logic.
Step #3: Prepare Your Project
In script.py, add the following imports:
from dotenv import load_dotenv
import os
These two lines give you what you need to read environment variables from files.
Note: os comes from the Python Standard Library, so you do not have to install it.
Then, create a .env file in your project folder to store all your credentials. Here is what your current project file structure should look like:
Instruct python-dotenv to load the environment variables from .env with this line in script.py:
load_dotenv()
You can now read environment variables from .env files or the system with:
os.environ.get("<ENV_NAME>")
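For instance, here is a minimal, self-contained sketch of that pattern. Note that MY_SECRET is a placeholder variable name used only for illustration:

from dotenv import load_dotenv
import os

# Load the variables defined in .env into the process environment
load_dotenv()

# "MY_SECRET" is a placeholder: replace it with your own variable name
my_secret = os.environ.get("MY_SECRET")
if my_secret is None:
    raise RuntimeError("MY_SECRET is not set in .env or the system environment")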
Cool! Time to configure Bright Data’s Web Scraper API solution.
Step #4: Configure Web Scraper API
As mentioned at the beginning of this article, web scraping comes with several challenges. Fortunately, it becomes significantly easier with an all-in-one solution like Bright Data’s Web Scraper APIs. These APIs allow you to retrieve parsed content from over 100 websites effortlessly.
As an alternative approach, see our tutorial on how to scrape news articles.
To set up Web Scraper API, refer to the official documentation or follow the instructions below.
If you have not already, create a Bright Data account. After logging in, go to your account dashboard. Here, click on the “Web Scraper API” button on the left:
Since the target site is CNN.com, type “cnn” in the search input and select the “CNN news — Collect by URL” scraper:
On the current page, click on the “Create token” button to generate a Bright Data API token:
This should open the following modal, where you can configure the details of your token:
Once done, click “Save” and copy the value of your Bright Data API token.
In your .env file, store this information as below:
BRIGHT_DATA_API_TOKEN="<YOUR_BRIGHT_DATA_API_TOKEN>"
Replace <YOUR_BRIGHT_DATA_API_TOKEN> with the value you copied from the modal.
Your CNN news Web Scraper API page should now look similar to the example below:
Here we go! It is now time to configure your Web Scraper API request and play with it.
Step #5: Use Bright Data for Web Scraping
The Web Scraper API launches a web scraping task configured according to your needs on the page seen earlier. This process then generates a snapshot containing the scraped data.
Below is an overview of how the Web Scraper API scraping process works:
- You make a request to the Web Scraper API, providing the pages to scrape via URLs.
- A web scraping task is launched to retrieve and parse data from those URLs.
- You repeatedly query a snapshot retrieval API to fetch the resulting data once the task is complete.
The POST endpoint for the CNN Web Scraper API is:
"https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"
That endpoint accepts an array of objects containing url fields and returns a response like this:
{"snapshot_id":"<YOUR_SNAPSHOT_ID>"}
Using the snapshot_id from this response, you then need to query the following endpoint to retrieve your data:
https://api.brightdata.com/datasets/v3/snapshot/<YOUR_SNAPSHOT_ID>?format=json
This endpoint returns HTTP status code 202 if the task is still in progress and 200 when the task is complete and the data is ready. The recommended approach is to poll this endpoint every 10 seconds until the task is finished.
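If you want to see that polling pattern in isolation before building the full integration, below is a minimal sketch. The max_attempts cap is an illustrative safeguard against endless polling, not part of the Bright Data API:

import time
import requests

def poll_snapshot(snapshot_url, headers, max_attempts=30):
    # Query the snapshot endpoint until the data is ready (200),
    # waiting 10 seconds between attempts while it is in progress (202)
    for _ in range(max_attempts):
        response = requests.get(snapshot_url, headers=headers)
        if response.status_code == 200:
            # Task complete: return the scraped data
            return response.json()
        elif response.status_code == 202:
            # Task still in progress: wait before polling again
            time.sleep(10)
        else:
            # Unexpected status: stop polling
            raise RuntimeError(f"Snapshot request failed with status {response.status_code}")
    raise TimeoutError("Snapshot was not ready after the maximum number of attempts")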
Once the task is complete, the endpoint will return data in the following format:
[
    {
        "input": {
            "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/",
            "keyword": ""
        },
        "id": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/index.html",
        "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/index.html",
        "author": "Mary Gilbert",
        "headline": "White Christmas forecast: Will you be left dreaming of snow or reveling in it?",
        "topics": [
            "weather"
        ],
        "publication_date": "2024-12-16T13:20:52.800Z",
        "updated_last": "2024-12-16T13:20:52.800Z",
        "content": "Christmas is approaching nearly as fast as Santa’s sleigh, but almost anyone in the United States fantasizing about a movie-worthy white Christmas might need to keep dreaming. Early forecasts indicate temperatures could max out around 10 to 15 degrees above normal for much of the country on Christmas Day. [omitted for brevity...]",
        "videos": null,
        "images": [
            "omitted for brevity..."
        ],
        "related_articles": [],
        "keyword": null,
        "timestamp": "2024-12-16T14:18:14.101Z"
    }
]
The content attribute contains the parsed article data, representing the information you want to access.
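Assuming the snapshot structure shown above, the other fields are just as easy to read. For example, with snapshot_data as a hypothetical variable holding the parsed JSON array:

# "snapshot_data" is assumed to contain the parsed JSON array shown above
record = snapshot_data[0]

headline = record.get("headline")  # the article title
author = record.get("author")      # e.g., "Mary Gilbert"
content = record.get("content")    # the full parsed article text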
To implement this, first read the API token from the environment and initialize the endpoint URL constant:
BRIGHT_DATA_API_TOKEN = os.environ.get("BRIGHT_DATA_API_TOKEN")
BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"
Next, you can turn the above process into a reusable function using the following code:
def get_scraped_data(url):
    # Authorization headers
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}"
    }

    # Web Scraper API payload
    data = [{
        "url": url
    }]

    # Make the POST request to the Bright Data Web Scraper API
    response = requests.post(BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL, headers=headers, json=data)

    if response.status_code == 200:
        response_data = response.json()
        snapshot_id = response_data.get("snapshot_id")

        if snapshot_id:
            # Poll until the snapshot is ready
            snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}?format=json"

            while True:
                snapshot_response = requests.get(snapshot_url, headers=headers)

                if snapshot_response.status_code == 200:
                    # Parse and return the snapshot data
                    snapshot_response_data = snapshot_response.json()
                    return snapshot_response_data[0].get("content")
                elif snapshot_response.status_code == 202:
                    print("Snapshot not ready yet. Retrying in 10 seconds...")
                    time.sleep(10)  # Wait for 10 seconds before retrying
                else:
                    print(f"Failed to retrieve snapshot. Status code: {snapshot_response.status_code}")
                    print(snapshot_response.text)
                    break
        else:
            print("Snapshot ID not found in the response")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
To make it work, add these two imports:
import requests
import time
Incredible! You just learned how to use Bright Data’s Web Scraper API for web scraping.
Step #6: Get Ready to Use OpenAI Models
This example relies on OpenAI models for LLM integration within LangChain. To use those models, you must configure an OpenAI API key in your environment variables.
By default, langchain_openai automatically reads the OpenAI API key from the OPENAI_API_KEY environment variable. To set this up, add the following line to your .env file:
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Replace <YOUR_OPENAI_API_KEY> with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
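Optionally, you can fail fast when the key is missing with a quick sanity check like the one below. This is an extra safeguard suggested here, not something langchain_openai requires:

import os

# Optional safeguard: stop early if the OpenAI API key was not loaded
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set: check your .env file")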
Great! Time to use OpenAI models in your LangChain scraping script.
Step #7: Generate the LLM Prompt
Define a function that takes the scraped data and produces a prompt to get a summary of the article:
def create_summary_prompt(content, words=100):
    return f"""Summarize the following content in less than {words} words.
CONTENT:
'{content}'
"""
In the current example, the complete prompt will be:
Summarize the following content in less than 100 words.
CONTENT:
'Christmas is approaching nearly as fast as Santa’s sleigh, but almost anyone in the United States fantasizing about a movie-worthy white Christmas might need to keep dreaming. Early forecasts indicate temperatures could max out around 10 to 15 degrees above normal for much of the country on Christmas Day. It’s a forecast reminiscent of last Christmas for many, which came amid the warmest winter on record in the US. But the country could be split in two by warmth and cold in the run up to the big day. [omitted for brevity...]'
If you pass it to ChatGPT, you should get the desired result:
That is enough to confirm that the prompt works like a charm!
Step #8: Integrate OpenAI
First, call the get_scraped_data() function to retrieve the content from the article page:
article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
scraped_data = get_scraped_data(article_url)
If scraped_data is not None, generate the prompt:
if scraped_data is not None:
    prompt = create_summary_prompt(scraped_data)
Finally, pass it to a ChatOpenAI LangChain object configured to use the GPT-4o mini model:
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(prompt)
Do not forget to import ChatOpenAI from langchain_openai:
from langchain_openai import ChatOpenAI
At the end of the process, summary should contain something similar to the summary produced by ChatGPT in the previous step:
summary = response.content
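As a side note, if you want summaries that are more consistent across runs, ChatOpenAI also accepts a temperature argument:

# A temperature of 0 makes the model output more deterministic across runs
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)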
Wow! The LangChain web scraping logic is complete.
Step #9: Export the AI-Processed Data
Now, you just need to export the data generated by the selected AI model via LangChain to a human-readable format, such as a JSON file.
To do this, initialize a dictionary with the data you want to export. Then, save it as a JSON file, as shown below:
export_data = {
    "url": article_url,
    "summary": summary
}

file_name = "summary.json"
with open(file_name, "w") as file:
    json.dump(export_data, file, indent=4)
Import json from the Python Standard Library:
import json
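Note that json.dump escapes non-ASCII characters by default, which is why characters like the curly apostrophe appear as \u2019 in the output. If you prefer human-readable characters, you can pass ensure_ascii=False:

# ensure_ascii=False keeps characters like ’ readable in the JSON file
with open(file_name, "w", encoding="utf-8") as file:
    json.dump(export_data, file, indent=4, ensure_ascii=False)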
Congrats! Your script is ready.
Step #10: Add Some Logs
The scraping process via the Web Scraper API and the ChatGPT analysis may take some time. So, it is good practice to include logs to track the script’s progress.
You can achieve this by adding print() statements at key steps in the script, as follows:
article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)

if scraped_data is not None:
    print("Data successfully scraped, creating summary prompt")
    prompt = create_summary_prompt(scraped_data)

    # Ask ChatGPT to perform the task specified in the prompt
    print("Sending prompt to ChatGPT for summarization")
    model = ChatOpenAI(model="gpt-4o-mini")
    response = model.invoke(prompt)

    # Get the AI result
    summary = response.content
    print("Received summary from ChatGPT")

    # Export the produced data to JSON
    export_data = {
        "url": article_url,
        "summary": summary
    }

    print("Exporting data to JSON")
    # Write the output dictionary to a JSON file
    file_name = "summary.json"
    with open(file_name, "w") as file:
        json.dump(export_data, file, indent=4)

    print(f"Data exported to '{file_name}'")
else:
    print("Scraping failed")
Step #11: Put It All Together
Your final script.py file should contain:
from dotenv import load_dotenv
import os
import requests
import time
from langchain_openai import ChatOpenAI
import json

load_dotenv()
BRIGHT_DATA_API_TOKEN = os.environ.get("BRIGHT_DATA_API_TOKEN")
BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=gd_lycz8783197ch4wvwg&include_errors=true"

def get_scraped_data(url):
    # Authorization headers
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}"
    }

    # Web Scraper API payload
    data = [{
        "url": url
    }]

    # Make the POST request to the Bright Data Web Scraper API
    response = requests.post(BRIGHT_DATA_CNN_WEB_SCRAPER_API_URL, headers=headers, json=data)

    if response.status_code == 200:
        response_data = response.json()
        snapshot_id = response_data.get("snapshot_id")

        if snapshot_id:
            # Poll until the snapshot is ready
            snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}?format=json"

            while True:
                snapshot_response = requests.get(snapshot_url, headers=headers)

                if snapshot_response.status_code == 200:
                    # Parse and return the snapshot data
                    snapshot_response_data = snapshot_response.json()
                    return snapshot_response_data[0].get("content")
                elif snapshot_response.status_code == 202:
                    print("Snapshot not ready yet. Retrying in 10 seconds...")
                    time.sleep(10)  # Wait for 10 seconds before retrying
                else:
                    print(f"Failed to retrieve snapshot. Status code: {snapshot_response.status_code}")
                    print(snapshot_response.text)
                    break
        else:
            print("Snapshot ID not found in the response")
    else:
        print(f"Error: {response.status_code}")
        print(response.text)

def create_summary_prompt(content, words=100):
    return f"""Summarize the following content in less than {words} words.
CONTENT:
'{content}'
"""

# Retrieve the content from the given web page
article_url = "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/"
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)

if scraped_data is not None:
    print("Data successfully scraped, creating summary prompt")
    prompt = create_summary_prompt(scraped_data)

    # Ask ChatGPT to perform the task specified in the prompt
    print("Sending prompt to ChatGPT for summarization")
    model = ChatOpenAI(model="gpt-4o-mini")
    response = model.invoke(prompt)

    # Get the AI result
    summary = response.content
    print("Received summary from ChatGPT")

    # Export the produced data to JSON
    export_data = {
        "url": article_url,
        "summary": summary
    }

    print("Exporting data to JSON")
    # Write the output dictionary to a JSON file
    file_name = "summary.json"
    with open(file_name, "w") as file:
        json.dump(export_data, file, indent=4)

    print(f"Data exported to '{file_name}'")
else:
    print("Scraping failed")
Can you believe it? In less than 100 lines of code, you just built an AI-based LangChain web scraping script.
Verify that it works with this command:
python3 script.py
Or, on Windows:
python script.py
The output in the terminal should be close to this one:
Scraping data from 'https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/'...
Snapshot not ready yet. Retrying in 10 seconds...
Data successfully scraped, creating summary prompt
Sending prompt to ChatGPT for summarization
Received summary from ChatGPT
Exporting data to JSON
Data exported to 'summary.json'
Open the summary.json file that appeared in the project’s directory, and you should see something like this:
{
    "url": "https://www.cnn.com/2024/12/16/weather/white-christmas-forecast-climate/",
    "summary": "As Christmas approaches, forecasts indicate temperatures in the US may be 10 to 15 degrees above normal, continuing a trend from last year\u2019s warm winter. The western US will likely remain warm, while the East experiences colder conditions leading up to Christmas. Some areas may see a mix of rain and snow, but a true \"white Christmas\" requires at least an inch of snow on the ground. Historically, cities like Minneapolis and Burlington have the best chances for snow, while places like New York City and Atlanta have significantly lower probabilities."
}
Et voilà! Mission complete.
Conclusion
In this tutorial, you discovered why web scraping is an excellent method for gathering data for your AI workflows and how to analyze it using LangChain. Specifically, you learned how to create a Python-based LangChain web scraping script to extract data from a CNN news article and process it with OpenAI APIs.
The main challenges with this approach include:
- Online sites frequently change their page structures.
- Many sites implement advanced anti-bot measures.
- Retrieving large volumes of data simultaneously can be complex and expensive.
Bright Data’s Web Scraper API offers a seamless solution for extracting data from major websites, overcoming these challenges effortlessly. This makes it an invaluable tool for supporting RAG applications and other LangChain-powered solutions.
Also, be sure to explore our additional offerings for AI and LLMs.
Sign up now to discover which of Bright Data’s proxy services or scraping products best suit your needs. Start with a free trial!
No credit card required