
Using LlamaIndex and Bright Data for Web Search

A step-by-step guide to building LlamaIndex AI agents that can search the web in real time using Bright Data’s powerful SERP integration.
5 min read

In this guide, you will learn:

  • What LlamaIndex is.
  • Why AI agents built with LlamaIndex should be able to perform web searches.
  • How to create a LlamaIndex AI agent with web search capabilities.

Let’s dive in!

What Is LlamaIndex?

LlamaIndex is an open-source Python framework for building applications fueled by LLMs. It serves as a bridge between unstructured data and LLMs. In particular, it makes it easy to orchestrate LLM workflows across a variety of data sources.

With LlamaIndex, you can craft production-ready AI workflows and agents. These can search for and retrieve relevant information, synthesize insights, generate detailed reports, take automated actions, and much more.

As of this writing, it is one of the fastest-growing libraries in the AI ecosystem, with over 42k stars on GitHub.

Why Integrate Web Searching Data into Your LlamaIndex AI Agent

LlamaIndex, like other AI agent frameworks, was created to address one of the biggest limitations of LLMs: their lack of up-to-date, real-world knowledge.

To address that issue, LlamaIndex provides integrations with several data connectors that let you ingest content from multiple sources. Now, you might wonder: which is the most valuable data source for an AI agent?

To answer that question, it helps to consider what data sources are used to train LLMs. Successful LLMs received most of their training data from the web, the largest and most diverse source of public data.

If you want your LlamaIndex AI agent to break past its static training data, the key capability it needs is the ability to search the web and learn from what it finds. Concretely, your agent should be able to extract structured information from the resulting search engine results pages (“SERPs”) and then meaningfully process and learn from them.

The challenge is that SERP scraping has become much more difficult due to Google’s recent crackdowns on simple scraping scripts. This is why you need a tool that integrates with LlamaIndex and simplifies the process. That is where the LlamaIndex Bright Data integration comes in!

Bright Data handles the complex work of SERP scraping. Through its search_engine tool, it lets your LlamaIndex agent perform search queries and receive structured results in Markdown or JSON format.

This is what your AI agent needs to stay prepared to answer questions, both now and in the future. See how this integration works in the next chapter!

Build a LlamaIndex Agent That Can Search the Web Using Bright Data Tools

In this step-by-step guide, you will see how to build a Python AI agent with LlamaIndex that can search the web.

By integrating with Bright Data, you will enable your agent to access fresh, contextually rich web search data. For more details, refer to our official documentation.

Follow the steps below to create your Bright Data-powered AI SERP agent using LlamaIndex!

Prerequisites

To follow along with this tutorial, you need the following:

  • Python 3.9 or higher installed on your machine (we recommend using the latest version).
  • A Bright Data API key to integrate with Bright Data’s SERP APIs.
  • An API key from a supported LLM provider. (In this guide, we use Gemini, which offers free API access. That said, you can use any LLM provider supported by LlamaIndex.)

Do not worry if you do not have a Gemini or Bright Data API key yet. We will show you how to create both in the next steps.

Step #1: Initialize Your Python Project

Start by launching your terminal and creating a new folder for your LlamaIndex AI agent project:

mkdir llamaindex-bright-data-serp-agent

llamaindex-bright-data-serp-agent/ will hold all the code for your AI agent with web searching capabilities powered by Bright Data.

Next, navigate into the project directory and create a Python virtual environment inside it:

cd llamaindex-bright-data-serp-agent
python -m venv venv

Now, open the project folder in your favorite Python IDE. We recommend Visual Studio Code with the Python extension or PyCharm Community Edition.

Create a new file named agent.py in the root of your project directory. Your project structure should look like this:

llamaindex-bright-data-serp-agent/
├── venv/
└── agent.py

In the terminal, activate the virtual environment. On Linux or macOS, run:

source venv/bin/activate

On Windows, run instead:

venv\Scripts\activate

In the next steps, you will be guided through installing the required packages. However, if you would like to install everything upfront, run:

pip install python-dotenv llama-index-tools-brightdata llama-index-llms-google-genai llama-index

Note: We are installing llama-index-llms-google-genai because this tutorial uses Gemini as the LlamaIndex LLM provider. If you plan to use a different provider, be sure to install the corresponding LLM integration instead.

Good job! Your Python development environment is ready to build an AI agent with Bright Data’s SERP integration using LlamaIndex.

Step #2: Integrate Environment Variables Reading

Your LlamaIndex agent will connect to external services like Gemini and Bright Data via API. For security, you should never hardcode API keys directly into your Python code. Instead, use environment variables to keep them private.

Install the python-dotenv library to make managing environment variables easier. In your activated virtual environment, launch:

pip install python-dotenv

Next, open your agent.py file and add the following lines at the top to load envs from a .env file:

from dotenv import load_dotenv

load_dotenv()

load_dotenv() looks for a .env file in your project’s root directory and loads its values into the environment.

Now, create a .env file alongside your agent.py file. Your new project file structure should look like this:

llamaindex-bright-data-serp-agent/
├── venv/
├── .env # <-------------
└── agent.py

Awesome! You just set up a secure way to manage sensitive API credentials for third-party services.
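Optionally, you can add a fail-fast sanity check right after load_dotenv(). The sketch below is not part of the original tutorial; it only assumes the two variable names used later in this guide:

```python
import os

# Optional sanity check: after load_dotenv() has populated the environment,
# warn immediately if a required secret is missing instead of failing later
# at the first API call
REQUIRED_VARS = ["BRIGHT_DATA_API_KEY", "GEMINI_API_KEY"]
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    print(f"Warning: missing environment variables: {', '.join(missing)}")
```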

Continue the initial setup by populating your .env file with the required environment variables!

Step #3: Configure Bright Data

To connect to the Bright Data SERP APIs in LlamaIndex via the official integration package, you first have to:

  1. Enable the Web Unlocker solution in your Bright Data dashboard.
  2. Retrieve your Bright Data API token.

Follow the steps below to complete the setup!

If you do not already have a Bright Data account, create one. If you already have an account, log in. In the dashboard, click the “Get proxy products” button:

Clicking the “Get proxy products” button in your Bright Data dashboard

You will be taken to the “Proxies & Scraping Infrastructure” page:

Note the configured Web Unlocker API zone

If you already see an active Web Unlocker API zone (like in the image above), you are all set. Make note of the zone name (for example, unlocker), as you will use it in your code later.

If you do not have a Web Unlocker zone yet, scroll down to the “Web Unlocker API” section and press the “Create zone” button:

Clicking the “Create zone” on the “Web Unlocker API” card

Why use the Web Unlocker API instead of the dedicated SERP API?
Bright Data’s LlamaIndex SERP integration operates through the Web Unlocker API. Specifically, when configured properly, Web Unlocker functions the same way as the dedicated SERP APIs. In short, by setting up a Web Unlocker API zone with the LlamaIndex Bright Data integration, you automatically gain access to the SERP APIs as well.

Give your new zone a name, such as unlocker, enable any advanced features for better performance, and click “Add”:

Configuring your new Web Unlocker API zone

Once created, you will be redirected to the zone’s configuration page:

The “unlocker” Web Unlocker API zone page

Make sure the activation toggle is set to the “Active” status. This confirms that your zone is ready for use.

Next, follow the official Bright Data guide to generate your API key. Once you have your key, store it securely in your .env file like this:

BRIGHT_DATA_API_KEY="<YOUR_BRIGHT_DATA_API_KEY>"

Replace the <YOUR_BRIGHT_DATA_API_KEY> placeholder with your actual API key value.

Awesome! Next, configure the Bright Data SERP tool in your LlamaIndex agent script.

Step #4: Access the Bright Data LlamaIndex SERP Tool

In agent.py, start by loading your Bright Data API key from the environment:

BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")

Make sure to import os from the Python standard library:

import os

In your activated virtual environment, install the LlamaIndex Bright Data tools package:

pip install llama-index-tools-brightdata

Next, import the BrightDataToolSpec class in your agent.py file:

from llama_index.tools.brightdata import BrightDataToolSpec

Create an instance of BrightDataToolSpec, providing your API key and the name of the Web Unlocker zone:

brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker", # Replace with the name of your Web Unlocker zone
    verbose=True,
)

Replace the zone value with the name of the Web Unlocker API zone you set up earlier (in this case, it is unlocker).

Note that setting verbose=True is useful while developing. That way, the library will print helpful logs when your LlamaIndex agent makes requests through Bright Data.

Now, BrightDataToolSpec provides several tools, but here we are focusing on the search_engine tool. This can query Google, Bing, Yandex, and more, returning results in Markdown or JSON.

To extract just that tool, write:

brightdata_serp_tools = brightdata_tool_spec.to_tool_list(["search_engine"])

The array passed to to_tool_list() acts as a filter, including only the tool named search_engine.

Note: By default, LlamaIndex will pick the most appropriate tool for a given user request. Thus, tool filtering is not strictly required. Since this tutorial is specifically about integrating Bright Data’s SERP capabilities, it makes sense to limit it to the search_engine tool for clarity.

Terrific! Bright Data is now integrated and ready to power your LlamaIndex agent with web searching capabilities.

Step #5: Connect an LLM Model

The instructions in this step use Gemini as the LLM provider for this integration. A good reason for choosing Gemini is that it offers free API access to some of its models.

To get started with Gemini in LlamaIndex, install the required integration package:

pip install llama-index-llms-google-genai

Next, import the GoogleGenAI class in agent.py:

from llama_index.llms.google_genai import GoogleGenAI

Now, initialize the Gemini LLM like this:

llm = GoogleGenAI(
    model="models/gemini-2.5-flash",
)

In this example, we are using the gemini-2.5-flash model. Feel free to choose any other supported Gemini model.

Behind the scenes, the GoogleGenAI class automatically looks for an environment variable named GEMINI_API_KEY. It uses that API key to connect to the Gemini APIs.

Configure it by opening your .env file and adding:

GEMINI_API_KEY="<YOUR_GEMINI_API_KEY>"

Replace the <YOUR_GEMINI_API_KEY> placeholder with your actual Gemini API key. If you do not have one yet, you can get it for free by following the official Gemini API retrieval guide.

Note: If you want to use a different LLM provider, LlamaIndex supports many options. Just refer to the official LlamaIndex docs for setup instructions.

Well done! You now have all the core pieces in place to build a LlamaIndex AI agent that can search the web.

Step #6: Define the LlamaIndex Agent

First, install the main LlamaIndex package:

pip install llama-index

Next, in your agent.py file, import the FunctionAgent class:

from llama_index.core.agent.workflow import FunctionAgent

FunctionAgent is a specialized LlamaIndex AI agent that can interact with external tools, such as the Bright Data SERP tool you set up earlier.

Initialize the agent with your LLM and Bright Data SERP tool like this:

agent = FunctionAgent(
    tools=brightdata_serp_tools,
    llm=llm,
    verbose=True, # Useful while developing
    system_prompt="""
        You are a helpful assistant that can retrieve SERP results in JSON format.
    """
)

This creates an AI agent that processes user input through your LLM and can call the Bright Data SERP tools to perform real-time web searches when needed. Note the system_prompt argument, which defines the agent’s role and behavior. Again, the verbose=True flag is useful for inspecting internal activity.

Wonderful! The LlamaIndex + Bright Data SERP integration is complete. The next step is to implement the REPL for interactive use.

Step #7: Build the REPL

REPL, short for “Read-Eval-Print Loop,” is an interactive programming pattern where you enter commands, have them evaluated, and see the results.

In this context, the REPL works as follows:

  1. You describe the task you want the AI agent to handle.
  2. The AI agent performs the task, making online searches if required.
  3. You see the response printed in the terminal.

This loop continues indefinitely until you type "exit".
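Stripped of any AI specifics, the pattern can be sketched as a plain Python function (a hypothetical helper shown only to illustrate the loop, not part of the agent code):

```python
def repl(read, evaluate, write):
    """Read-Eval-Print Loop: serve requests until the user enters 'exit'."""
    while True:
        request = read()
        if request.strip().lower() == "exit":
            write("Agent terminated")
            return
        write(evaluate(request))

# Demo with scripted input instead of the real input()/print() pair
scripted = iter(["hello", "exit"])
repl(lambda: next(scripted), str.upper, print)
```

In the actual agent, input() plays the role of read, the agent run call plays evaluate, and print() plays write.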

In agent.py, add this asynchronous function to handle the REPL logic:

async def main():
    print("Gemini-based agent with web searching capabilities powered by Bright Data. Type 'exit' to quit.\n")

    while True:
        # Read the user request for the AI agent from the CLI
        request = input("Request -> ")

        # Terminate the execution if the user types "exit"
        if request.strip().lower() == "exit":
            print("\nAgent terminated")
            break

        try:
            # Execute the request
            response = await agent.run(request)
            print(f"\nResponse ->\n{response}\n")
        except Exception as e:
            print(f"\nError: {str(e)}\n")

This REPL function:

  1. Accepts user input from the command line via input().
  2. Processes the input using the LlamaIndex agent powered by Gemini and Bright Data through agent.run().
  3. Displays the response back to the console.

Because agent.run() is asynchronous, the REPL logic must be inside an async function. Run it like this at the bottom of your file:

if __name__ == "__main__":
    asyncio.run(main())

Do not forget to import asyncio:

import asyncio

Here we go! The LlamaIndex AI agent with SERP scraping tools is ready.

Step #8: Put It All Together and Run the AI Agent

This is what your agent.py file should contain:

from dotenv import load_dotenv
import os
from llama_index.tools.brightdata import BrightDataToolSpec
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.core.agent.workflow import FunctionAgent
import asyncio

# Load environment variables from the .env file
load_dotenv()

# Read the Bright Data API key from the envs
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")

# Set up the Bright Data Tools
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker", # Replace with the name of your Web Unlocker zone
    verbose=True, # Useful while developing
)

# Get only the "search_engine" (SERP scraping) tool
brightdata_serp_tools = brightdata_tool_spec.to_tool_list(["search_engine"])

# Configure the connection to Gemini
llm = GoogleGenAI(
    model="models/gemini-2.5-flash",
)

# Create the LlamaIndex agent powered by Gemini and connected to Bright Data tools
agent = FunctionAgent(
    tools=brightdata_serp_tools,
    llm=llm,
    verbose=True, # Useful while developing
    system_prompt="""
        You are a helpful assistant that can retrieve SERP results in JSON format.
    """
)

# Async REPL loop
async def main():
    print("Gemini-based agent with web searching capabilities powered by Bright Data. Type 'exit' to quit.\n")

    while True:
        # Read the user request for the AI agent from the CLI
        request = input("Request -> ")

        # Terminate the execution if the user types "exit"
        if request.strip().lower() == "exit":
            print("\nAgent terminated")
            break

        try:
            # Execute the request
            response = await agent.run(request)
            print(f"\nResponse ->\n{response}\n")
        except Exception as e:
            print(f"\nError: {str(e)}\n")

if __name__ == "__main__":
    asyncio.run(main())

Run your LlamaIndex SERP agent with:

python agent.py

When the script starts, you will see a prompt like this in your terminal:

The REPL of your AI SERP agent printed in the terminal

Try asking your agent for something that requires fresh information, for example:

Write a short Markdown report on the new AI protocols, including some real-world links for further reading.

To perform this task effectively, the AI agent needs to search the web for up-to-date information.

The result will be:

How the LlamaIndex + Bright Data SERP AI agent addresses the prompt

That was quite fast, so let’s break down what happened:

  1. The agent detects the need to search for “new AI protocols” and calls the Bright Data SERP API via the search_engine tool using this input URL: https://www.google.com/search?q=new%20AI%20protocols&num=10&brd_json=1.
  2. The tool asynchronously fetches SERP data in JSON format from Bright Data’s Google Search API.
  3. The agent passes the JSON response to the Gemini LLM.
  4. Gemini processes the fresh data and generates a clear, accurate Markdown report with relevant links.
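As a side note, the query URL from step 1 is easy to reproduce with the Python standard library. This is only a sketch of what the integration builds internally; the exact construction is an implementation detail of the Bright Data tool:

```python
from urllib.parse import urlencode, quote

# Rebuild the Google SERP request URL from its query parameters
# (brd_json=1 is the flag in the URL above that requests JSON output)
params = {"q": "new AI protocols", "num": 10, "brd_json": 1}
url = f"https://www.google.com/search?{urlencode(params, quote_via=quote)}"
print(url)
# https://www.google.com/search?q=new%20AI%20protocols&num=10&brd_json=1
```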

In this case, the AI agent returned:

## New AI Protocols: A Brief Report

The rapid advancement of Artificial Intelligence has led to the emergence of new protocols designed to enhance interoperability, communication, and data handling among AI systems and with external data sources. These protocols aim to standardize how AI agents interact, leading to more scalable and integrated AI deployments.

Here are some of the key new AI protocols:

### 1. Model Context Protocol (MCP)

The Model Context Protocol (MCP) is an open standard that facilitates secure, two-way connections between AI-powered tools and various data sources. It fundamentally changes how AI assistants interact with the digital world by allowing them to access and utilize external information more effectively. This protocol is crucial for enabling AI models to communicate with external data sources and for building more capable and context-aware AI applications.

**Further Reading:**
*   **Introducing the Model Context Protocol:** [https://www.anthropic.com/news/model-context-protocol](https://www.anthropic.com/news/model-context-protocol)
*   **How A Simple Protocol Is Changing Everything About AI:** [https://www.forbes.com/sites/craigsmith/2025/04/07/how-a-simple-protocol-is-changing-everything-about-ai/](https://www.forbes.com/sites/craigsmith/2025/04/07/how-a-simple-protocol-is-changing-everything-about-ai/)
*   **The New Model Context Protocol for AI Agents:** [https://evergreen.insightglobal.com/the-new-model-context-protocol-for-ai-agents/](https://evergreen.insightglobal.com/the-new-model-context-protocol-for-ai-agents/)
*   **Model Context Protocol: The New Standard for AI Interoperability:** [https://techstrong.ai/aiops/model-context-protocol-the-new-standard-for-ai-interoperability/](https://techstrong.ai/aiops/model-context-protocol-the-new-standard-for-ai-interoperability/)
*   **Hot new protocol glues together AI and apps:** [https://www.axios.com/2025/04/17/model-context-protocol-anthropic-open-source](https://www.axios.com/2025/04/17/model-context-protocol-anthropic-open-source)

### 2. Agent2Agent Protocol (A2A)

The Agent2Agent Protocol (A2A) is a cross-platform specification designed to enable AI agents to communicate with each other, securely exchange information, and coordinate actions. This protocol is vital for fostering collaboration among different AI agents, allowing them to work together on complex tasks and delegate responsibilities across various enterprise systems.

**Further Reading:**
*   **Announcing the Agent2Agent Protocol (A2A):** [https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/](https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/)
*   **What Every AI Engineer Should Know About A2A, MCP & ACP:** [https://medium.com/@elisowski/what-every-ai-engineer-should-know-about-a2a-mcp-acp-8335a210a742](https://medium.com/@elisowski/what-every-ai-engineer-should-know-about-a2a-mcp-acp-8335a210a742)
*   **What a new AI protocol means for journalists:** [https://www.dw.com/en/what-coding-agents-and-a-new-ai-protocol-mean-for-journalists/a-72976193](https://www.dw.com/en/what-coding-agents-and-a-new-ai-protocol-mean-for-journalists/a-72976193)

### 3. Agent Communication Protocol (ACP)

The Agent Communication Protocol (ACP) is an open standard specifically for agent-to-agent communication. Its purpose is to transform the current landscape of siloed AI agents into interoperable agentic systems, promoting easier integration and collaboration between them. ACP provides a standardized messaging framework for structured communication.

**Further Reading:**
*   **MCP, ACP, and Agent2Agent set standards for scalable AI:** [https://www.cio.com/article/3991302/ai-protocols-set-standards-for-scalable-results.html](https://www.cio.com/article/3991302/ai-protocols-set-standards-for-scalable-results.html)
*   **What is Agent Communication Protocol (ACP)?** [https://www.ibm.com/think/topics/agent-communication-protocol](https://www.ibm.com/think/topics/agent-communication-protocol)
*   **MCP vs A2A vs ACP: AI Protocols Explained:** [https://www.bluebash.co/blog/mcp-vs-a2a-vs-acp-agent-communication-protocols/](https://www.bluebash.co/blog/mcp-vs-a2a-vs-acp-agent-communication-protocols/)

These emerging protocols are crucial steps towards a more interconnected and efficient AI ecosystem, enabling more sophisticated and collaborative AI applications across various industries.

Notice that the AI agent’s response includes recent protocols and up-to-date links published after Gemini’s last training update. This highlights the value of integrating live web search capabilities.

More specifically, the response includes contextual links that closely match what you would find by searching “new ai protocols” on Google (at the time of writing, at least):

The “new ai protocols” Google SERP

Et voilà! You now have a LlamaIndex AI agent with search engine scraping capabilities, powered by Bright Data.

Step #9: Next Steps

The current LlamaIndex SERP AI agent is just a simple example that uses only the search_engine tool from Bright Data.

In more advanced scenarios, you probably do not want to restrict your agent to a single tool. Instead, it is better to give your agent access to all available tools and write a clear system prompt that helps the LLM decide which ones to use for each goal.

For example, you could extend your agent to go a step further and:

  1. Perform multiple search queries.
  2. Select the top N links from the SERP results.
  3. Visit those pages and scrape their content in Markdown.
  4. Learn from that info to produce a richer, more detailed output.
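One way to steer a multi-tool agent toward that workflow is through its system prompt. Below is a hypothetical prompt sketch (the wording and the three-link limit are illustrative choices, not from the official docs); you would pass it as the system_prompt argument of FunctionAgent along with the unfiltered tool list:

```python
# Hypothetical system prompt encoding the four-step workflow above
EXTENDED_SYSTEM_PROMPT = """
You are a research assistant with access to Bright Data web tools.
For each user goal:
1. Run one or more queries with the search_engine tool.
2. Select the top 3 most relevant links from the SERP results.
3. Retrieve those pages as Markdown using the scraping tools.
4. Synthesize a detailed, well-sourced answer from that content.
"""
print(EXTENDED_SYSTEM_PROMPT)
```

With all tools available (calling to_tool_list() with no filter), the LLM then picks the appropriate tool for each step on its own.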

For more guidance on integrating with all available tools, see our tutorial on building AI agents with LlamaIndex and Bright Data.

Conclusion

In this article, you learned how to use LlamaIndex to build an AI agent capable of searching the web via Bright Data. This integration allows your agent to run search queries on major search engines, including Google, Bing, Yandex, and many others.

Keep in mind that the example covered here is just a starting point. If you plan to develop more advanced agents, you’ll need robust tools for retrieving, validating, and transforming live web data. That is exactly what Bright Data’s AI infrastructure for agents provides.

Create a free Bright Data account and start exploring our agentic AI data tools today!

Antonello Zanini

Technical Writer

5.5 years experience

Antonello Zanini is a technical writer, editor, and software engineer with 5M+ views. Expert in technical content strategy, web development, and project management.

Expertise: Web Development, Web Scraping, AI Integration