In this guided tutorial, you will discover:
- An overview of RAG and its mechanisms
- The advantages of integrating SERP data into GPT-4o through RAG
- How to implement a Python RAG chatbot using OpenAI GPT models and SERP data
Let’s dive in!
What Is RAG?
RAG, short for Retrieval-Augmented Generation, is an AI approach that combines information retrieval with text generation. In a RAG workflow, the application first retrieves relevant data from external sources, such as documents, web pages, or databases. Then, it passes that data to the AI model so that it can generate more contextually relevant responses.
RAG enhances large language models (LLMs) like GPT by enabling them to access and reference up-to-date information beyond their original training data. This approach is key in scenarios where precise and context-specific information is needed, as it improves both the quality and accuracy of AI-generated responses.
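To make the flow concrete, here is a minimal Python sketch of the retrieve-augment-generate loop. The retrieve() and answer_with_rag() helpers are purely illustrative and are not part of the chatbot you will build below:
def retrieve(query, documents, k=3):
    # naive keyword retrieval: rank documents by word overlap with the query
    query_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )[:k]

def answer_with_rag(query, documents, generate):
    # 1. retrieve the most relevant documents
    context = "\n\n".join(retrieve(query, documents))
    # 2. augment the prompt with the retrieved context
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    # 3. generate the response; generate() wraps any LLM call, such as a request to the OpenAI API
    return generate(prompt)
Real-world RAG systems typically replace the keyword matching above with vector search, but the three-step structure stays the same.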
Why Feed AI Models With SERP Data
The knowledge cutoff date for GPT-4o is October 2023, meaning it lacks access to events or information published after that date. However, GPT-4o models can pull in real-time data from the Internet through Bing search integration, which helps them offer more up-to-date information.
But what if you want the AI model to employ specific data sources or prefer more reliable search engines? This is where RAG comes into play!
In particular, feeding SERP (Search Engine Results Page) data to AI models via RAG is a great way to get better responses. This approach is especially beneficial for tasks that require current information or specialized insights.
In short, passing data from high-ranking search results to GPT-4o or GPT-4o mini results in replies that are detailed, precise, and contextually rich.
RAG With SERP Data in GPT Models Using Python: Step-By-Step Tutorial
In this tutorial, you will learn how to build a RAG chatbot using OpenAI’s GPT models. The idea is to gather text from the top-performing pages on Google for a specific search query and use it as the context for a GPT request.
Now, the biggest challenge is scraping SERP data. The reason is that most search engines come with advanced anti-bot solutions to prevent automated access to their pages. For detailed guidance, refer to our guide on how to scrape Google in Python.
To simplify the scraping process, we will use Bright Data’s SERP API:
This premium SERP scraper allows you to easily retrieve SERPs from Google, DuckDuckGo, Bing, Yandex, Baidu, and other search engines using simple HTTP requests.
We will then extract text data from the returned URLs using a headless browser and use that information as the context for the GPT model in a RAG workflow. If you instead want to retrieve online data directly using AI, read our article on web scraping with ChatGPT.
If you are eager to explore the code or want to keep it on hand as you follow the steps below, clone the GitHub repository that supports this article:
git clone https://github.com/Tonel/rag_gpt_serp_scraping
Follow the instructions in the README.md file to install the project’s dependencies and launch the project.
Keep in mind that the approach presented in this blog post can easily be adapted to any other search engine or LLM.
Note: This guide refers to Unix and macOS. If you are a Windows user, you can still follow the tutorial by using the Windows Subsystem for Linux (WSL).
Step #1: Initialize a Python Project
Make sure you have Python 3 installed on your machine. Otherwise, download and install it.
Create a folder for your project and enter it in the terminal:
mkdir rag_gpt_serp_scraping
cd rag_gpt_serp_scraping
The rag_gpt_serp_scraping folder will contain your Python RAG project.
Then, load the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Inside rag_gpt_serp_scraping, add an empty app.py file. This will contain your scraping and RAG logic.
Next, initialize a Python virtual environment in the project directory:
python3 -m venv env
Activate the virtual environment with the command below:
source ./env/bin/activate
Awesome! You are now fully set up.
Step #2: Install the Required Libraries
The dependencies used by this Python RAG project based on GPT models are:
- python-dotenv: To load environment variables from a .env file. It will be used to securely manage sensitive credentials, such as Bright Data credentials and OpenAI API keys.
- requests: To perform HTTP requests to Bright Data’s SERP API. For more information, refer to our guide on how to use a proxy with Requests.
- langchain-community: Part of the LangChain framework, a set of tools to build with LLMs by chaining interoperable components. It will be used for retrieving text from the Google SERP pages and cleaning it to generate relevant content for RAG.
- openai: The official Python client library for the OpenAI API. It will be employed to interface with GPT models to generate natural language responses based on the given inputs and RAG context.
- streamlit: A framework for building interactive web applications in Python. It will come in handy for creating a UI where users can input their Google search query and AI prompt, and view the results dynamically.
In an activated virtual environment, launch the command below to install all the dependencies:
pip install python-dotenv requests langchain-community openai streamlit
In detail, we will use AsyncChromiumLoader from langchain-community, which requires the following dependencies:
pip install --upgrade --quiet playwright beautifulsoup4 html2text
To function properly, Playwright also requires you to install the browsers with:
playwright install
Installing all these libraries will take a while, so be patient.
Fantastic! You are ready to write your Python logic.
Step #3: Prepare Your Project
In app.py, add the following imports:
from dotenv import load_dotenv
import os
import requests
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from openai import OpenAI
import streamlit as st
Then, create a .env file in your project folder to store all your credentials. Your project structure will now look like this:
Use the function below in app.py to instruct python-dotenv to load the environment variables from .env:
load_dotenv()
You can now import environment variables from .env or the system with:
os.environ.get("<ENV_NAME>")
That is also why we imported the os Python standard library.
Step #4: Configure SERP API
As mentioned in the introduction, we will rely on Bright Data’s SERP API to retrieve content from search engine results pages and use that in our Python RAG workflow. Specifically, we will extract text from the URLs of the web pages returned by the SERP API.
To set up SERP API, refer to the official documentation. Alternatively, follow the instructions below.
If you have not already created an account, sign up for Bright Data. Once logged in, navigate to your account dashboard:
There, click the “Get proxy products” button.
That will bring you to the page below, where you have to click on the “SERP API” row:
On the SERP API product page, toggle “Activate zone” to enable the product:
Now, copy the SERP API host, port, username, and password in the “Access parameters” section and add them to your .env file:
BRIGHT_DATA_SERP_API_HOST="<YOUR_HOST>"
BRIGHT_DATA_SERP_API_PORT=<YOUR_PORT>
BRIGHT_DATA_SERP_API_USERNAME="<YOUR_USERNAME>"
BRIGHT_DATA_SERP_API_PASSWORD="<YOUR_PASSWORD>"
Replace the <YOUR_XXXX> placeholders with the values provided by Bright Data on the SERP API page.
Note that the host in “Access parameters” has a format like this:
brd.superproxy.io:33335
You must split it as below:
BRIGHT_DATA_SERP_API_HOST="brd.superproxy.io"
BRIGHT_DATA_SERP_API_PORT=33335
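If you prefer to derive those two values programmatically, a quick sketch (the host string below is only an example; use the one from your dashboard):
# split the combined "host:port" string from "Access parameters" into its two parts
raw_host = "brd.superproxy.io:33335"  # illustrative value
host, port = raw_host.rsplit(":", 1)
print(host)       # brd.superproxy.io
print(int(port))  # 33335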
Terrific! You can now use SERP API in Python.
Step #5: Implement the SERP Scraping Logic
In app.py, add the following function to retrieve the first number_of_urls URLs from a Google SERP page:
def get_google_serp_urls(query, number_of_urls=5):
# perform a Bright Data's SERP API request
# with JSON autoparsing
host = os.environ.get("BRIGHT_DATA_SERP_API_HOST")
port = os.environ.get("BRIGHT_DATA_SERP_API_PORT")
username = os.environ.get("BRIGHT_DATA_SERP_API_USERNAME")
password = os.environ.get("BRIGHT_DATA_SERP_API_PASSWORD")
proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {"http": proxy_url, "https": proxy_url}
url = f"https://www.google.com/search?q={query}&brd_json=1"
response = requests.get(url, proxies=proxies, verify=False)
# retrieve the parsed JSON response
response_data = response.json()
# extract a "number_of_urls" number of
# Google SERP URLs from the response
google_serp_urls = []
if "organic" in response_data:
for item in response_data["organic"]:
if "link" in item:
google_serp_urls.append(item["link"])
return google_serp_urls[:number_of_urls]
This makes an HTTP GET request to SERP API with the search query specified in the query argument. The brd_json=1 query parameter ensures that SERP API parses the results into JSON for you, in the format below:
{
"general": {
"search_engine": "google",
"results_cnt": 1980000000,
"search_time": 0.57,
"language": "en",
"mobile": false,
"basic_view": false,
"search_type": "text",
"page_title": "pizza - Google Search",
"code_version": "1.90",
"timestamp": "2023-06-30T08:58:41.786Z"
},
"input": {
"original_url": "https://www.google.com/search?q=pizza&brd_json=1",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/608.2.11 (KHTML, like Gecko) Version/13.0.3 Safari/608.2.11",
"request_id": "hl_1a1be908_i00lwqqxt1"
},
"organic": [
{
"link": "https://www.pizzahut.com/",
"display_link": "https://www.pizzahut.com",
"title": "Pizza Hut | Delivery & Carryout - No One OutPizzas The Hut!",
"image": "omitted for brevity...",
"image_alt": "pizza from www.pizzahut.com",
"image_base64": "omitted for brevity...",
"rank": 1,
"global_rank": 1
},
{
"link": "https://www.dominos.com/en/",
"display_link": "https://www.dominos.com › ...",
"title": "Domino's: Pizza Delivery & Carryout, Pasta, Chicken & More",
"description": "Order pizza, pasta, sandwiches & more online for carryout or delivery from Domino's. View menu, find locations, track orders. Sign up for Domino's email ...",
"image": "omitted for brevity...",
"image_alt": "pizza from www.dominos.com",
"image_base64": "omitted for brevity...",
"rank": 2,
"global_rank": 3
},
// omitted for brevity...
],
// omitted for brevity...
}
The last few lines of the function retrieve each SERP URL from the resulting JSON data, select only the first number_of_urls URLs, and return them in a list.
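To quickly verify the function, you could call it on its own. This assumes the .env variables from the previous step are set, and the query string is just an example:
# sample call: print the top 3 organic result URLs for a query
urls = get_google_serp_urls("transformers one review", number_of_urls=3)
for url in urls:
    print(url)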
Time to extract text from these URLs!
Step #6: Extract Text from the SERP URLs
Define a function that extracts text from each of the SERP URLs:
def extract_text_from_urls(urls, number_of_words=600):
# instruct a headless Chrome instance to visit the provided URLs
# with the specified user-agent
loader = AsyncChromiumLoader(
urls,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
)
html_documents = loader.load()
# process the extracted HTML documents to extract text from them
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
html_documents,
tags_to_extract=["p", "em", "li", "strong", "h1", "h2"],
unwanted_tags=["a"],
remove_comments=True,
)
# make sure each HTML text document contains only a number
# number_of_words words
extracted_text_list = []
for doc_transformed in docs_transformed:
# split the text into words and join the first number_of_words
words = doc_transformed.page_content.split()[:number_of_words]
extracted_text = " ".join(words)
# ignore empty text documents
if len(extracted_text) != 0:
extracted_text_list.append(extracted_text)
return extracted_text_list
This function:
- Loads web pages from the URLs passed as an argument using a headless Chrome browser instance.
- Utilizes BeautifulSoupTransformer to process the HTML of each page and extract text from specific tags (like <p>, <h1>, <strong>, etc.), omitting unwanted tags (like <a>) and comments.
- Limits the extracted text for each webpage to a number of words specified by the number_of_words argument.
- Returns a list of the extracted text from each URL.
Keep in mind that the ["p", "em", "li", "strong", "h1", "h2"] tags are enough to extract text from most web pages. However, in some specific scenarios, you may need to customize this list of HTML tags. Also, you might have to increase or decrease the target number of words for each text item.
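If you need that flexibility, one option is a variant of the function that exposes the tags and word budget as parameters. The sketch below is illustrative and is not part of the final script:
def extract_text_from_urls_custom(urls, number_of_words=1000,
                                  tags_to_extract=("p", "em", "li", "strong", "h1", "h2", "h3")):
    # same logic as extract_text_from_urls(), with configurable tags and word budget
    loader = AsyncChromiumLoader(
        urls,
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
    )
    html_documents = loader.load()
    bs_transformer = BeautifulSoupTransformer()
    docs_transformed = bs_transformer.transform_documents(
        html_documents,
        tags_to_extract=list(tags_to_extract),
        unwanted_tags=["a"],
        remove_comments=True,
    )
    # keep only non-empty documents, truncated to the word budget
    return [
        " ".join(doc.page_content.split()[:number_of_words])
        for doc in docs_transformed
        if doc.page_content.strip()
    ]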
For example, consider the web page below:
Applying extract_text_from_urls() to that page results in this text array:
["Lisa Johnson Mandell’s Transformers One review reveals the heretofore inconceivable: It’s one of the best animated films of the year! I never thought I’d see myself write this about a Transformers movie, but Transformers One is actually an exceptional film! ..."]
Incredible! Even though it is not perfect, it is still high quality by AI model standards.
The list of text items returned by extract_text_from_urls() represents the RAG context to feed to the OpenAI model.
Step #7: Generate the RAG Prompt
Define a function that transforms the AI prompt request and text context into the final RAG prompt:
def get_openai_prompt(request, text_context=[]):
# default prompt
prompt = request
# add the context to the prompt, if present
if len(text_context) != 0:
context_string = "\n\n--------\n\n".join(text_context)
prompt = f"Answer the request using only the context below.\n\nContext:\n{context_string}\n\nRequest: {request}"
return prompt
When a RAG context is specified, the prompts returned by the function above have this format:
Answer the request using only the context below.
Context:
Bla bla bla...
--------
Bla bla bla...
--------
Bla bla bla...
Request: <YOUR_REQUEST>
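For instance, calling the function with a small illustrative context produces a prompt in exactly that format:
# illustrative usage of get_openai_prompt() with a two-item context
context = [
    "First chunk of text scraped from a SERP result...",
    "Second chunk of text scraped from another result...",
]
print(get_openai_prompt("Write a review for the movie Transformers One", context))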
Step #8: Perform the GPT Request
First, initialize the OpenAI client at the top of the app.py file:
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
This relies on the OPENAI_API_KEY environment variable, which you can define directly in your system’s environment or in the .env file:
OPENAI_API_KEY="<YOUR_API_KEY>"
Replace <YOUR_API_KEY> with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
Next, write a function that uses the OpenAI official client to perform a request to the GPT-4o mini AI model:
def interrogate_openai(prompt, max_tokens=800):
# interrogate the OpenAI model with the given prompt
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
)
return response.choices[0].message.content
Note that you can use any other GPT model supported by the OpenAI API.
If called with a prompt returned by get_openai_prompt() that includes a specified text context, interrogate_openai() will successfully perform retrieval-augmented generation as intended.
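Before wiring up the UI, you can sanity-check the whole pipeline from a Python shell or a temporary script. This assumes the environment variables from the previous steps are set; the query and prompt below are just examples:
# end-to-end check: SERP URLs -> page text -> RAG prompt -> GPT answer
urls = get_google_serp_urls("transformers one review")
context = extract_text_from_urls(urls)
prompt = get_openai_prompt("Write a short review of Transformers One", context)
print(interrogate_openai(prompt))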
Step #9: Create the Application UI
Use Streamlit to define a simple form UI where users can specify:
- The Google search query to pass to the SERP API
- The AI prompt to send to GPT-4o mini
Achieve that with these lines of code:
with st.form("prompt_form"):
# initialize the output results
result = ""
final_prompt = ""
# textarea for user to input their Google search query
google_search_query = st.text_area("Google Search:", None)
# textarea for user to input their AI prompt
request = st.text_area("AI Prompt:", None)
# button to submit the form
submitted = st.form_submit_button("Send")
# if the form is submitted
if submitted:
# retrieve the Google SERP URLs from the given search query
google_serp_urls = get_google_serp_urls(google_search_query)
# extract the text from the respective HTML pages
extracted_text_list = extract_text_from_urls(google_serp_urls)
# generate the AI prompt using the extracted text as context
final_prompt = get_openai_prompt(request, extracted_text_list)
# interrogate an OpenAI model with the generated prompt
result = interrogate_openai(final_prompt)
# dropdown containing the generated prompt
final_prompt_expander = st.expander("AI Final Prompt:")
final_prompt_expander.write(final_prompt)
# write the result from the OpenAI model
st.write(result)
Here we go! The Python RAG script is ready.
Step #10: Put It All Together
Your app.py file should contain the following code:
from dotenv import load_dotenv
import os
import requests
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from openai import OpenAI
import streamlit as st
# load the environment variables from the .env file
load_dotenv()
# initialize the OpenAI API client with your API key
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def get_google_serp_urls(query, number_of_urls=5):
# perform a Bright Data's SERP API request
# with JSON autoparsing
host = os.environ.get("BRIGHT_DATA_SERP_API_HOST")
port = os.environ.get("BRIGHT_DATA_SERP_API_PORT")
username = os.environ.get("BRIGHT_DATA_SERP_API_USERNAME")
password = os.environ.get("BRIGHT_DATA_SERP_API_PASSWORD")
proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {"http": proxy_url, "https": proxy_url}
url = f"https://www.google.com/search?q={query}&brd_json=1"
response = requests.get(url, proxies=proxies, verify=False)
# retrieve the parsed JSON response
response_data = response.json()
# extract a "number_of_urls" number of
# Google SERP URLs from the response
google_serp_urls = []
if "organic" in response_data:
for item in response_data["organic"]:
if "link" in item:
google_serp_urls.append(item["link"])
return google_serp_urls[:number_of_urls]
def extract_text_from_urls(urls, number_of_words=600):
# instruct a headless Chrome instance to visit the provided URLs
# with the specified user-agent
loader = AsyncChromiumLoader(
urls,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
)
html_documents = loader.load()
# process the extracted HTML documents to extract text from them
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
html_documents,
tags_to_extract=["p", "em", "li", "strong", "h1", "h2"],
unwanted_tags=["a"],
remove_comments=True,
)
# make sure each HTML text document contains only a number
# number_of_words words
extracted_text_list = []
for doc_transformed in docs_transformed:
# split the text into words and join the first number_of_words
words = doc_transformed.page_content.split()[:number_of_words]
extracted_text = " ".join(words)
# ignore empty text documents
if len(extracted_text) != 0:
extracted_text_list.append(extracted_text)
return extracted_text_list
def get_openai_prompt(request, text_context=[]):
# default prompt
prompt = request
# add the context to the prompt, if present
if len(text_context) != 0:
context_string = "\n\n--------\n\n".join(text_context)
prompt = f"Answer the request using only the context below.\n\nContext:\n{context_string}\n\nRequest: {request}"
return prompt
def interrogate_openai(prompt, max_tokens=800):
# interrogate the OpenAI model with the given prompt
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
)
return response.choices[0].message.content
# create a form in the Streamlit app for user input
with st.form("prompt_form"):
# initialize the output results
result = ""
final_prompt = ""
# textarea for user to input their Google search query
google_search_query = st.text_area("Google Search:", None)
# textarea for user to input their AI prompt
request = st.text_area("AI Prompt:", None)
# button to submit the form
submitted = st.form_submit_button("Send")
# if the form is submitted
if submitted:
# retrieve the Google SERP URLs from the given search query
google_serp_urls = get_google_serp_urls(google_search_query)
# extract the text from the respective HTML pages
extracted_text_list = extract_text_from_urls(google_serp_urls)
# generate the AI prompt using the extracted text as context
final_prompt = get_openai_prompt(request, extracted_text_list)
# interrogate an OpenAI model with the generated prompt
result = interrogate_openai(final_prompt)
# dropdown containing the generated prompt
final_prompt_expander = st.expander("AI Final Prompt")
final_prompt_expander.write(final_prompt)
# write the result from the OpenAI model
st.write(result)
Can you believe it? In less than 150 lines of code, you can achieve RAG using Python!
Step #11: Test the Application
Launch your Python RAG application with:
streamlit run app.py
In the terminal, you should see the following output:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://172.27.134.248:8501
Follow the instructions and visit http://localhost:8501 in your browser. This is what you should see:
As you can see, the form contains the “Google Search:” and “AI Prompt:” text area inputs defined in the code, as well as the “Send” button and the “AI Final Prompt” dropdown.
Test the application by using a Google search query as below:
Transformers One review
And an AI prompt as follows:
Write a review for the movie Transformers One
Click “Send,” and wait while your application processes the request. After a few seconds, you should get a result like this:
Wow! Not a bad review…
If you expand the “AI Final Prompt” dropdown, you will see the complete prompt used by the application for RAG.
Et voilà! You just implemented a Python RAG chatbot with GPT-4o mini using SERP data.
Conclusion
In this tutorial, you explored what RAG is and how it can be achieved by feeding AI models with SERP data. Specifically, you learned to build a Python RAG chatbot that scrapes SERP data and uses it in GPT models for improved accuracy in results.
The major challenge with this approach is scraping search engines like Google, as:
- They frequently alter the structure of their SERP pages.
- They are protected by some of the most sophisticated anti-bot measures available.
- Retrieving large volumes of SERP data concurrently is complex and can cost a lot of money.
As shown here, Bright Data’s SERP API helps you retrieve real-time SERP data from all major search engines with no effort, supporting RAG and many other applications. Get your free trial now!