In this guided tutorial, you will discover:
- An overview of RAG and its mechanisms
- The advantages of integrating SERP data into GPT-4o through RAG
- How to implement a Python RAG chatbot using OpenAI GPT models and SERP data
Let’s dive in!
What Is RAG?
RAG, short for Retrieval-Augmented Generation, is an AI approach that combines information retrieval with text generation. In a RAG workflow, the application first retrieves relevant data from external sources—such as documents, web pages, or databases. Then, it passes that data to the AI model so that it can generate more contextually relevant responses.
RAG enhances large language models (LLMs) like GPT by enabling them to access and reference up-to-date information beyond their original training data. This approach is key in scenarios where precise and context-specific information is needed, as it improves both the quality and accuracy of AI-generated responses.
Why Feed AI Models With SERP Data
The knowledge cutoff date for GPT-4o is October 2023, meaning it lacks access to events or information that came out after that time. However, GPT-4o models can pull in data from the Internet in real-time using Bing search integration. That helps them offer more up-to-date information.
But what if you want the AI model to employ specific data sources or prefer more reliable search engines? This is where RAG comes into play!
In particular, feeding SERP (Search Engine Results Page) data to AI models via RAG is a great way to get better responses. This approach is especially beneficial for tasks that require current information or specialized insights.
In short, passing data from high-ranking search results to GPT-4o or GPT-4o mini results in replies that are detailed, precise, and contextually rich.
RAG With SERP Data and GPT Models Using Python: Step-By-Step Tutorial
In this tutorial, you will learn how to build a RAG chatbot using OpenAI’s GPT models. The idea is to gather text from the top-performing pages on Google for a specific search query and use it as the context for a GPT request.
Now, the biggest challenge is scraping SERP data. The reason is that most search engines come with advanced anti-bot solutions to prevent automated access to their pages. For detailed guidance, refer to our guide on how to scrape Google in Python.
To simplify the scraping process, we will use Bright Data’s SERP API:
This premium SERP scraper allows you to easily retrieve SERPs from Google, DuckDuckGo, Bing, Yandex, Baidu, and other search engines using simple HTTP requests.
We will then extract text data from the returned URLs using a headless browser. Then, we will use that information as the context for the GPT model in a RAG workflow. If you instead want to retrieve online data directly using AI, read our article on web scraping with ChatGPT.
If you are eager to explore the code or want to keep it on hand as you follow the steps below, clone the GitHub repository that supports this article:
Follow the instructions in the README.md file to install the project’s dependencies and launch the project.
Keep in mind that the approach presented in this blog post can easily be adapted to any other search engine or LLM.
Note: This guide refers to Unix and macOS. If you are a Windows user, you can still follow the tutorial by using the Windows Subsystem for Linux (WSL).
Step #1: Initialize a Python Project
Make sure you have Python 3 installed on your machine. Otherwise, download and install it.
Create a folder for your project and enter it in the terminal:
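For example, on a Unix-like shell:

```bash
mkdir rag_gpt_serp_scraping
cd rag_gpt_serp_scraping
```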
The rag_gpt_serp_scraping folder will contain your Python RAG project.
Then, load the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Inside rag_gpt_serp_scraping, add an empty app.py file. This will contain your scraping and RAG logic.
Next, initialize a Python virtual environment in the project directory:
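A minimal way to do that, assuming you call the environment env:

```bash
python3 -m venv env
```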
Activate the virtual environment with the command below:
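On Unix or macOS, that is:

```bash
source ./env/bin/activate
```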
Awesome! You are now fully set up.
Step #2: Install the Required Libraries
This GPT-based Python RAG project relies on the following dependencies:
- python-dotenv: To load environment variables from a .env file. It will be used to securely manage sensitive credentials, such as Bright Data credentials and OpenAI API keys.
- requests: To perform HTTP requests to Bright Data’s SERP API. For more information, refer to our guide on how to use a proxy with Requests.
- langchain-community: Part of the LangChain framework, a set of tools to build with LLMs by chaining interoperable components. It will be used to retrieve text from the Google SERP pages and clean it to generate relevant content for RAG.
- openai: The official Python client library for the OpenAI API. It will be employed to interface with GPT models to generate natural language responses based on the given inputs and RAG context.
- streamlit: A framework for building interactive web applications in Python. It will come in handy for creating a UI where users can input their Google search query and AI prompt, and view the results dynamically.
In an activated virtual environment, launch the command below to install all the dependencies:
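A command along these lines installs everything listed above:

```bash
pip install python-dotenv requests langchain-community openai streamlit
```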
In detail, we will use AsyncChromiumLoader from langchain-community, which requires the following dependencies:
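AsyncChromiumLoader drives a headless Chromium instance through Playwright and hands the HTML to BeautifulSoup-based transformers, so the following extra packages should cover it (exact requirements may vary with your langchain-community version):

```bash
pip install playwright beautifulsoup4
```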
To function properly, Playwright also requires you to install the browsers with:
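```bash
playwright install
```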
Installing all these libraries will take a while, so be patient.
Fantastic! You are ready to write your Python logic.
Step #3: Prepare Your Project
In app.py, add the following imports:
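Based on the libraries installed in Step #2 and the way they are used in the following steps, the import block should look roughly like this:

```python
from dotenv import load_dotenv
import os
import requests
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from openai import OpenAI
import streamlit as st
```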
Then, create a .env file in your project folder to store all your credentials. Your project structure will now look like this:
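At this point, it should be something like the following (env/ is the virtual environment created in Step #1):

```
rag_gpt_serp_scraping/
├── app.py
├── .env
└── env/
```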
Use the function below in app.py to instruct python-dotenv to load the environment variables from .env:
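A single call is enough:

```python
load_dotenv()
```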
You can now import environment variables from .env or the system with:
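For example, where the variable name is whatever you defined in .env:

```python
os.environ.get("<ENV_NAME>")
```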
This is also why we imported the os Python standard library.
Step #4: Configure SERP API
As mentioned in the introduction, we will rely on Bright Data’s SERP API to retrieve content from search engine results pages and use that in our Python RAG workflow. Specifically, we will extract text from the URLs of the web pages returned by the SERP API.
To set up SERP API, refer to the official documentation. Alternatively, follow the instructions below.
If you have not already created an account, sign up for Bright Data. Once logged in, navigate to your account dashboard:
There, click the “Get proxy products” button.
That will bring you to the page below, where you have to click on the “SERP API” row:
On the SERP API product page, toggle “Activate zone” to enable the product:
Now, copy the SERP API host, port, username, and password in the “Access parameters” section and add them to your .env file:
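The variable names below are just a readable convention; feel free to use different ones as long as they match what the code reads later:

```
BRIGHT_DATA_SERP_API_HOST="<YOUR_HOST>"
BRIGHT_DATA_SERP_API_PORT="<YOUR_PORT>"
BRIGHT_DATA_SERP_API_USERNAME="<YOUR_USERNAME>"
BRIGHT_DATA_SERP_API_PASSWORD="<YOUR_PASSWORD>"
```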
Replace the <YOUR_XXXX> placeholders with the values provided by Bright Data on the SERP API page.
Note that the host in “Access parameters” has a format like this:
You must split it as below:
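In other words, the “Access parameters” panel shows a single <HOST>:<PORT> string, and you have to store the two parts in separate environment variables:

```
BRIGHT_DATA_SERP_API_HOST="<HOST>"
BRIGHT_DATA_SERP_API_PORT="<PORT>"
```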
Terrific! You can now use SERP API in Python.
Step #5: Implement the SERP Scraping Logic
In app.py, add the following function to retrieve the first number_of_urls URLs from a Google SERP page:
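Below is a minimal sketch of that function. It assumes the imports from Step #3 and the environment variable names chosen in Step #4; the original project may use slightly different names. Bright Data’s SERP API is exposed as a proxy endpoint, so the Google request is routed through it:

```python
def get_google_serp_urls(query, number_of_urls=5):
    # Build the SERP API proxy URL from the credentials stored in .env
    host = os.environ.get("BRIGHT_DATA_SERP_API_HOST")
    port = os.environ.get("BRIGHT_DATA_SERP_API_PORT")
    username = os.environ.get("BRIGHT_DATA_SERP_API_USERNAME")
    password = os.environ.get("BRIGHT_DATA_SERP_API_PASSWORD")
    proxy_url = f"http://{username}:{password}@{host}:{port}"

    # Perform the Google search through the SERP API proxy;
    # brd_json=1 asks Bright Data to return the SERP parsed as JSON
    response = requests.get(
        "https://www.google.com/search",
        params={"q": query, "brd_json": 1},
        proxies={"http": proxy_url, "https": proxy_url},
        verify=False,
    )
    response.raise_for_status()
    response_data = response.json()

    # Keep only the links of the first number_of_urls organic results
    return [
        result["link"]
        for result in response_data.get("organic", [])[:number_of_urls]
    ]
```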
This makes an HTTP GET request to SERP API with the search query specified in the query argument. The brd_json=1 query parameter ensures that SERP API parses the results into JSON for you, in the format below:
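The exact payload depends on the query, but the parsed SERP roughly follows this shape, with the top results listed under an organic array:

```json
{
  "general": {
    "search_engine": "google",
    "query": "..."
  },
  "organic": [
    {
      "link": "https://example.com/some-page",
      "title": "...",
      "description": "..."
    }
  ]
}
```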
The last few lines of the function retrieve each SERP URL from the resulting JSON data, select only the first number_of_urls URLs, and return them in a list.
Time to extract text from these URLs!
Step #6: Extract Text from the SERP URLs
Define a function that extracts text from each of the SERP URLs:
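A possible implementation is sketched below. It relies on AsyncChromiumLoader to render each page in headless Chromium and on BeautifulSoupTransformer to keep only the text of content-bearing tags; the exact keyword arguments may differ slightly across langchain-community versions:

```python
def extract_text_from_urls(urls, number_of_words=600):
    # Load each page with a headless Chromium instance (Playwright under the hood)
    loader = AsyncChromiumLoader(urls)
    html_documents = loader.load()

    # Keep text from content tags only, dropping links and HTML comments
    bs_transformer = BeautifulSoupTransformer()
    docs_transformed = bs_transformer.transform_documents(
        html_documents,
        tags_to_extract=["p", "em", "li", "strong", "h1", "h2"],
        unwanted_tags=["a"],
        remove_comments=True,
    )

    # Truncate each document to the first number_of_words words
    extracted_text_list = []
    for doc in docs_transformed:
        text = " ".join(doc.page_content.split()[:number_of_words])
        if text:
            extracted_text_list.append(text)
    return extracted_text_list
```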
This function:
- Loads web pages from the URLs passed as an argument using a headless Chrome browser instance.
- Utilizes BeautifulSoupTransformer to process the HTML of each page and extract text from specific tags (like <p>, <h1>, <strong>, etc.), omitting unwanted tags (like <a>) and comments.
- Limits the extracted text for each webpage to the number of words specified by the number_of_words argument.
- Returns a list of the extracted text from each URL.
Keep in mind that the ["p", "em", "li", "strong", "h1", "h2"] tags are enough to extract text from most web pages. However, in some specific scenarios, you may need to customize this list of HTML tags. Also, you might have to increase or decrease the target number of words for each text item.
For example, consider the web page below:
Applying that function to that page will result in this text array:
Incredible! Even though it is not perfect, it is still high quality by AI model standards.
The list of text items returned by extract_text_from_urls() represents the RAG context to feed to the OpenAI model.
Step #7: Generate the RAG Prompt
Define a function that transforms the AI prompt request and text context into the final RAG prompt:
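A simple version could look like this; the exact wording of the instruction given to the model is an arbitrary choice:

```python
def get_openai_prompt(request, text_context=[]):
    # Without a context, just forward the user's request
    prompt = request
    # With a RAG context, prepend it and tell the model to rely on it
    if len(text_context) != 0:
        context_string = "\n--------\n".join(text_context)
        prompt = (
            "Answer the request using only the context below.\n\n"
            f"Context:\n{context_string}\n\n"
            f"Request: {request}"
        )
    return prompt
```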
Prompts returned by the previous function when a RAG context is specified have this format:
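With the sketch above, the final prompt would look roughly like this:

```
Answer the request using only the context below.

Context:
<text extracted from SERP URL 1>
--------
<text extracted from SERP URL 2>
--------
...

Request: <your AI prompt>
```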
Step #8: Perform the GPT Request
First, initialize the OpenAI client at the top of the app.py file:
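Since the client reads OPENAI_API_KEY from the environment by default, a bare initialization is enough:

```python
client = OpenAI()
```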
This relies on the OPENAI_API_KEY environment variable, which you can define directly in your system’s environment or in the .env file:
OPENAI_API_KEY="<YOUR_API_KEY>"
Replace <YOUR_API_KEY> with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
Next, write a function that uses the OpenAI official client to perform a request to the GPT-4o mini AI model:
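A minimal sketch, reusing the client initialized above:

```python
def interrogate_openai(prompt, max_tokens=800):
    # Send the (possibly RAG-augmented) prompt to GPT-4o mini
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return completion.choices[0].message.content
```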
Note that you can configure any of the other GPT models supported by the OpenAI API.
If called with a prompt returned by get_openai_prompt() that includes a specified text context, interrogate_openai() will perform retrieval-augmented generation as intended.
Step #9: Create the Application UI
Use Streamlit to define a simple form UI where users can specify:
- The Google search query to pass to the SERP API
- The AI prompt to send to GPT-4o mini
Achieve that with these lines of code:
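A minimal Streamlit form wiring together the functions defined in the previous steps could look like this (the widget labels match the ones mentioned in Step #11):

```python
with st.form("prompt_form"):
    # User inputs: the Google search query and the AI prompt
    google_search_query = st.text_area("Google Search:", None)
    request = st.text_area("AI Prompt:", None)
    submitted = st.form_submit_button("Send")

    if submitted:
        # 1. Retrieve the top URLs from the Google SERP via the SERP API
        google_serp_urls = get_google_serp_urls(google_search_query)
        # 2. Extract text from those URLs to build the RAG context
        extracted_text_list = extract_text_from_urls(google_serp_urls)
        # 3. Build the final RAG prompt and query GPT-4o mini
        final_prompt = get_openai_prompt(request, extracted_text_list)
        answer = interrogate_openai(final_prompt)

        # Show the complete prompt used for RAG in an expandable section
        with st.expander("AI Final Prompt"):
            st.write(final_prompt)
        st.write(answer)
```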
Here we go! The Python RAG script is ready.
Step #10: Put It All Together
Your app.py file should contain the following code:
Can you believe it? In less than 150 lines of code, you can achieve RAG using Python!
Step #11: Test the Application
Launch your Python RAG application with:
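From the activated virtual environment:

```bash
streamlit run app.py
```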
In the terminal, you should see the following output:
Follow the instructions and visit http://localhost:8501 in your browser. This is what you should see:
As you can see, the form contains the “Google Search:” and “AI Prompt:” text area inputs defined in the code, as well as the “Send” button and the “AI Final Prompt” dropdown.
Test the application by using a Google search query as below:
And an AI prompt as follows:
Click “Send,” and wait while your application processes the request. After a few seconds, you should get a result like this:
Wow! Not a bad review…
If you expand the “AI Final Prompt” dropdown, you will see the complete prompt used by the application for RAG.
Et voilà! You just implemented a Python RAG chatbot with GPT-4o mini using SERP data.
Conclusion
In this tutorial, you explored what RAG is and how it can be achieved by feeding AI models with SERP data. Specifically, you learned to build a Python RAG chatbot that scrapes SERP data and uses it in GPT models for improved accuracy in results.
The major challenge with this approach is scraping search engines like Google, as:
- They frequently alter the structure of their SERP pages.
- They are protected by some of the most sophisticated anti-bot measures available.
- Retrieving large volumes of SERP data concurrently is complex and can cost a lot of money.
As shown here, Bright Data’s SERP API helps you retrieve real-time SERP data from all major search engines with no effort. That supports RAG and many other applications. Get your free trial now!