In this guide, you will learn:
- Why web scraping is an excellent method for enriching LLMs with real-world data
- The benefits and challenges of using scraped data in LangChain workflows
- How to create a complete LangChain web scraping integration in a step-by-step tutorial
Let’s dive in!
Using Web Scraping to Power Your LLM Applications
Web scraping involves retrieving data from web pages. That data can then be used to fuel RAG (Retrieval-Augmented Generation) applications and power LLMs (Large Language Models).
RAG applications require access to real-time, domain-specific, or expansive datasets that may not be available in static databases. Web scraping bridges this gap by extracting structured and unstructured data from diverse web sources like articles, product listings, or social media.
Learn more in our article on collecting LLM training data.
Benefits and Challenges of Using Scraped Data in LangChain
LangChain is a powerful framework for building AI-driven workflows, enabling seamless integration of LLMs with diverse data sources. It excels at data analysis, summarization, and question-answering by combining LLMs with real-time, domain-specific knowledge. However, acquiring high-quality data is always a problem.
Web scraping can tackle that problem, but it comes with several challenges, including anti-bot measures, CAPTCHAs, and dynamic websites. Maintaining compliant and efficient scrapers can also be time-consuming and technically complex. For more details, check out our guide on anti-scraping measures.
These hurdles can slow the development of AI-powered applications that depend on real-time data. The solution? Bright Data’s Web Scraper API—a ready-to-use tool offering scraping endpoints for hundreds of websites.
With advanced features like IP rotation, CAPTCHA solving, and JavaScript rendering, Bright Data automates data extraction seamlessly. That ensures reliable, efficient, and hassle-free data collection, all accessible through simple API calls.
LangChain Web Scraping Powered By Bright Data: Step-by-Step Guide
In this section, you will learn how to build a LangChain web scraping script. The goal will be to retrieve content from a CNN article using the Bright Data Web Scraper API and send it to OpenAI for summarization via LangChain.
We will use the following CNN article as the target:
The example we are going to build here is a simple starting point, but it is easy to extend with additional features and analyses using LangChain. For instance, you could even create a RAG chatbot based on SERP data.
Follow the steps below to get started!
Prerequisites
To get through this tutorial, you will need the following:
- Python 3+ installed on your machine
- An OpenAI API key
- A Bright Data account
Do not worry if you are missing any of these. We will guide you through the entire process, from installing Python to obtaining your OpenAI and Bright Data credentials.
Step #1: Project Setup
First of all, check if Python 3 is installed on your machine. If not, download and install it.
Run this command in the terminal to create a folder for your project:
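```bash
mkdir langchain_scraping
```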
The `langchain_scraping` folder will contain your Python LangChain scraping project.
Then, navigate to the project folder and initialize a Python virtual environment inside it:
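```bash
cd langchain_scraping
python3 -m venv env
```

Here, `env` is simply the name chosen for the virtual environment folder.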
Note: On Windows, use `python` instead of `python3`.
Now, open the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Inside `langchain_scraping`, add a `script.py` file. This is an empty Python script, but it will soon contain your LangChain web scraping logic.
In the IDE’s terminal, activate the virtual environment with the command below:
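```bash
source env/bin/activate
```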
Or, on Windows, run:
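```powershell
env\Scripts\activate
```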
Awesome! You are now fully set up.
Step #2: Install the Required Libraries
The Python LangChain scraping project relies on the following libraries:
- `python-dotenv`: To load environment variables from a `.env` file. It will be used to manage sensitive information like Bright Data and OpenAI credentials.
- `requests`: To perform HTTP requests to interact with Bright Data’s Web Scraper API.
- `langchain_openai`: LangChain integrations for OpenAI through its `openai` SDK.
In an activated virtual environment, install all the dependencies with this command:
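```bash
pip install python-dotenv requests langchain_openai
```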
Amazing! You are ready to write some scraping logic.
Step #3: Prepare Your Project
In `script.py`, add the following imports:
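```python
from dotenv import load_dotenv
import os
```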
These two lines allow you to read environment variable files.
Note: os
comes from the Python Standard Library, so you do not have to install it.
Then, create a .env
file in your project folder to store all your credentials. Here is what your current project file structure should look like:
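```
langchain_scraping/
├── .env
├── env/
└── script.py
```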
Instruct `python-dotenv` to load the environment variables from `.env` with this line in `script.py`:
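```python
load_dotenv()
```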
You can now read environment variables from `.env` files or the system with:
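```python
os.environ["<ENV_NAME>"]
```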
Cool! Time to configure Bright Data’s Web Scraper API solution.
Step #4: Configure Web Scraper API
As mentioned at the beginning of this article, web scraping comes with several challenges. Fortunately, it becomes significantly easier with an all-in-one solution like Bright Data’s Web Scraper APIs. These APIs allow you to retrieve parsed content from over 100 websites effortlessly.
As an alternative approach, see our tutorial on how to scrape news articles.
To set up Web Scraper API, refer to the official documentation or follow the instructions below.
If you have not already, create a Bright Data account. After logging in, go to your account dashboard. Here, click on the “Web Scraper API” button on the left:
Since the target site is CNN.com, type “cnn” in the search input and select the “CNN news — Collect by URL” scraper:
On the current page, click on the “Create token” button to generate a Bright Data API token:
This should open the following modal, where you can configure the details of your token:
Once done, click “Save” and copy the value of your Bright Data API token.
In your `.env` file, store this information as below:
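```
BRIGHT_DATA_API_TOKEN="<YOUR_BRIGHT_DATA_API_TOKEN>"
```

Note that the variable name `BRIGHT_DATA_API_TOKEN` is an arbitrary choice; it just has to match the name read in `script.py` later on.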
Replace `<YOUR_BRIGHT_DATA_API_TOKEN>` with the value you copied from the modal.
Your CNN news Web Scraper API page should now look similar to the example below:
Here we go! Time to configure your Web Scraper API request and experiment with it.
Step #5: Use Bright Data for Web Scraping
The Web Scraper API launches a web scraping task, configured according to your needs on the page seen earlier. That process then generates a snapshot containing the scraped data.
Below is an overview of how the Web Scraper API scraping process works:
- You make a request to the Web Scraper API, providing the pages to scrape via URLs.
- A web scraping task is launched to retrieve and parse data from those URLs.
- You repeatedly query a snapshot retrieval API to fetch the resulting data once the task is complete.
The POST endpoint for the CNN Web Scraper API is:
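```
https://api.brightdata.com/datasets/v3/trigger?dataset_id=<CNN_DATASET_ID>&include_errors=true
```

Here, `<CNN_DATASET_ID>` is a placeholder for the ID of the “CNN news — Collect by URL” dataset, which you can copy from your Web Scraper API page.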
That endpoint accepts an array of objects containing `url` fields and returns a response like this:
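```json
{
  "snapshot_id": "<YOUR_SNAPSHOT_ID>"
}
```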
Using the `snapshot_id` from this response, you then need to query the following endpoint to retrieve your data:
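```
https://api.brightdata.com/datasets/v3/snapshot/<YOUR_SNAPSHOT_ID>?format=json
```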
This endpoint returns HTTP status code `202` if the task is still in progress and `200` when the task is complete and the data is ready. The recommended approach is to poll this endpoint every 10 seconds until the task is finished.
Once the task is complete, the endpoint will return data in the following format:
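For the CNN scraper, the snapshot is an array of records like the sketch below (field names other than `content` are illustrative and may vary):

```json
[
  {
    "url": "<ARTICLE_URL>",
    "headline": "<ARTICLE_HEADLINE>",
    "content": "<PARSED_ARTICLE_CONTENT>"
  }
]
```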
The `content` attribute contains the parsed article data, representing the information you want to access.
To implement this, first read the Bright Data API token from the environment and initialize the endpoint URL constants:
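A minimal sketch, assuming the token was stored under `BRIGHT_DATA_API_TOKEN` in `.env`:

```python
# Read the Bright Data API token from the environment
BRIGHT_DATA_API_TOKEN = os.environ["BRIGHT_DATA_API_TOKEN"]

# Web Scraper API endpoints ("<CNN_DATASET_ID>" is a placeholder)
SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<CNN_DATASET_ID>&include_errors=true"
SNAPSHOT_API_URL = "https://api.brightdata.com/datasets/v3/snapshot"
```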
Next, you can turn the above process into a reusable function using the following code:
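Below is a minimal sketch of that function (the name `get_scraped_data()` matches the call used later in Step #8; production code would need more robust error handling):

```python
def get_scraped_data(url):
    # Authenticate requests with your Bright Data API token
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}",
        "Content-Type": "application/json",
    }

    # Launch a scraping task for the given URL
    response = requests.post(SCRAPER_API_URL, headers=headers, json=[{"url": url}])
    snapshot_id = response.json()["snapshot_id"]

    # Poll the snapshot endpoint every 10 seconds until the data is ready
    snapshot_url = f"{SNAPSHOT_API_URL}/{snapshot_id}?format=json"
    while True:
        snapshot_response = requests.get(snapshot_url, headers=headers)
        if snapshot_response.status_code == 200:
            # Task complete: return the parsed article content
            return snapshot_response.json()[0]["content"]
        elif snapshot_response.status_code == 202:
            # Task still in progress: wait and try again
            time.sleep(10)
        else:
            # Unexpected error
            return None
```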
To make it work, add these two imports:
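```python
import requests
import time
```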
Incredible! You just learned how to use Bright Data’s Web Scraper API for web scraping.
Step #6: Get Ready to Use OpenAI Models
This example relies on OpenAI models for LLM integration within LangChain. To use those models, you must configure an OpenAI API key in your environment variables.
By default, `langchain_openai` automatically reads the OpenAI API key from the `OPENAI_API_KEY` environment variable. To set this up, add the following line to your `.env` file:
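```
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
```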
Replace `<YOUR_OPENAI_API_KEY>` with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
Great! Time to use OpenAI models in your LangChain scraping script.
Step #7: Generate the LLM Prompt
Define a function that takes the scraped data and produces a prompt to get a summary of the article:
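A minimal sketch (the function name and prompt wording are just one possible choice):

```python
def create_summary_prompt(content, words=140):
    # Ask the model for a summary of the given content
    return f"""Summarize the following content in about {words} words:

CONTENT:
'{content}'
"""
```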
In the current example, the complete prompt will be:
If you pass it to ChatGPT, you should get the desired result:
That is enough to confirm that the prompt works like a charm!
Step #8: Integrate OpenAI
First, call the `get_scraped_data()` function to retrieve the content from the article page:
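For example (with a placeholder for the target article URL):

```python
article_url = "<CNN_ARTICLE_URL>"  # the URL of the target CNN article
scraped_data = get_scraped_data(article_url)
```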
If `scraped_data` is not `None`, generate the prompt:
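Reusing the helper sketched in Step #7:

```python
if scraped_data is not None:
    prompt = create_summary_prompt(scraped_data)
```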
Finally, pass it to a `ChatOpenAI` LangChain object configured to use the GPT-4o mini AI model:
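```python
model = ChatOpenAI(model="gpt-4o-mini")
response = model.invoke(prompt)
summary = response.content
```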
Do not forget to import `ChatOpenAI` from `langchain_openai`:
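```python
from langchain_openai import ChatOpenAI
```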
At the end of the process, `summary` should contain something similar to the summary produced by ChatGPT in the previous step:
Wow! The LangChain web scraping logic is complete.
Step #9: Export the AI-Processed Data
Now, you just need to export the data generated by the selected AI model via LangChain to a human-readable format, such as a JSON file.
To do this, initialize a dictionary with the data you want, and then export it to a JSON file, as shown below:
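A minimal sketch (the `output.json` file name is an arbitrary choice):

```python
export_data = {
    "url": article_url,
    "summary": summary,
}

# Export the AI-processed data to a JSON file
with open("output.json", "w") as file:
    json.dump(export_data, file, indent=4)
```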
Import `json` from the Python Standard Library:
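```python
import json
```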
Congrats! Your script is ready.
Step #10: Add Some Logs
The scraping process using the Web Scraper API and the ChatGPT analysis may take some time. So, it is a good practice to include logs to track the script’s progress.
You can achieve this by adding `print()` statements at key steps in the script, as follows:
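```python
# The log messages here are just examples
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)
print("Data successfully scraped!")
```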
Step #11: Put It All Together
Your final `script.py` file should contain:
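Assembled from the snippets above, a complete sketch might look like this (the dataset ID and article URL placeholders, the helper names, and the log messages are the assumptions noted in the previous steps):

```python
from dotenv import load_dotenv
import os
import requests
import time
import json
from langchain_openai import ChatOpenAI

# Load the environment variables from the .env file
load_dotenv()

# Read the Bright Data API token from the environment
BRIGHT_DATA_API_TOKEN = os.environ["BRIGHT_DATA_API_TOKEN"]

# Web Scraper API endpoints ("<CNN_DATASET_ID>" is a placeholder)
SCRAPER_API_URL = "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<CNN_DATASET_ID>&include_errors=true"
SNAPSHOT_API_URL = "https://api.brightdata.com/datasets/v3/snapshot"


def get_scraped_data(url):
    # Authenticate requests with your Bright Data API token
    headers = {
        "Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}",
        "Content-Type": "application/json",
    }

    # Launch a scraping task for the given URL
    response = requests.post(SCRAPER_API_URL, headers=headers, json=[{"url": url}])
    snapshot_id = response.json()["snapshot_id"]

    # Poll the snapshot endpoint every 10 seconds until the data is ready
    snapshot_url = f"{SNAPSHOT_API_URL}/{snapshot_id}?format=json"
    while True:
        snapshot_response = requests.get(snapshot_url, headers=headers)
        if snapshot_response.status_code == 200:
            # Task complete: return the parsed article content
            return snapshot_response.json()[0]["content"]
        elif snapshot_response.status_code == 202:
            # Task still in progress: wait and try again
            time.sleep(10)
        else:
            # Unexpected error
            return None


def create_summary_prompt(content, words=140):
    # Ask the model for a summary of the given content
    return f"""Summarize the following content in about {words} words:

CONTENT:
'{content}'
"""


article_url = "<CNN_ARTICLE_URL>"  # the URL of the target CNN article
print(f"Scraping data from '{article_url}'...")
scraped_data = get_scraped_data(article_url)
print("Data successfully scraped!")

if scraped_data is not None:
    # Generate the summarization prompt from the scraped content
    print("Generating the summarization prompt...")
    prompt = create_summary_prompt(scraped_data)

    # Ask the GPT-4o mini model for a summary via LangChain
    print("Summarizing the article with GPT-4o mini...")
    model = ChatOpenAI(model="gpt-4o-mini")
    response = model.invoke(prompt)
    summary = response.content

    # Export the AI-processed data to a JSON file
    print("Exporting the summary to 'output.json'...")
    export_data = {
        "url": article_url,
        "summary": summary,
    }
    with open("output.json", "w") as file:
        json.dump(export_data, file, indent=4)
    print("Done!")
else:
    print("Scraping failed!")
```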
Can you believe it? In less than 100 lines of code, you just built an AI-based LangChain web scraping script.
Verify that it works with this command:
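```bash
python3 script.py
```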
Or, on Windows:
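```powershell
python script.py
```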
The output in the terminal should be close to this one:
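Assuming the example log messages from Step #10:

```
Scraping data from '<CNN_ARTICLE_URL>'...
Data successfully scraped!
Generating the summarization prompt...
Summarizing the article with GPT-4o mini...
Exporting the summary to 'output.json'...
Done!
```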
Open the `output.json` file that appeared in the project’s directory, and you should see something like this:
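```json
{
  "url": "<CNN_ARTICLE_URL>",
  "summary": "<AI_GENERATED_SUMMARY>"
}
```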
Et voilà! Mission complete.
Conclusion
In this tutorial, you discovered why web scraping is an excellent method for gathering data for your AI workflows and how to analyze it using LangChain. Specifically, you learned how to create a Python-based LangChain web scraping script to extract data from a CNN news article and process it with OpenAI APIs.
The main challenges with this approach include:
- Online sites frequently change their page structures.
- Many sites implement advanced anti-bot measures.
- Retrieving large volumes of data simultaneously can be complex and expensive.
Bright Data’s Web Scraper API offers a seamless solution for extracting data from major websites, overcoming these challenges effortlessly. This makes it an invaluable tool for supporting RAG applications and other LangChain-powered solutions.
Also, be sure to explore our additional offerings for AI and LLMs.
Sign up now to discover which of Bright Data’s proxy services or scraping products best suit your needs. Start with a free trial!
No credit card required