In this guide, you will learn:
- What LlamaIndex is and why it is so widely used.
- What makes it unique for AI agent development, especially its built-in support for data integrations.
- How to use LlamaIndex to build an AI agent with data retrieval capabilities from both general sites and specific search engines.
Let’s dive in!
What Is LlamaIndex?
LlamaIndex is an open-source Python data framework for building LLM-powered applications.
It helps you create production-ready AI workflows and agents capable of finding and retrieving relevant information, synthesizing insights, generating detailed reports, taking automated actions, and more.
LlamaIndex is one of the fastest-growing libraries for building AI agents, with over 42k GitHub stars:
Integrate Data into Your LlamaIndex AI Agent
Compared to other AI agent-building technologies, LlamaIndex focuses on data. That is why the project’s GitHub repository defines LlamaIndex as a “data framework.”
Specifically, LlamaIndex addresses one of the biggest limitations of LLMs: their lack of knowledge about current or real-time events. This limitation arises because LLMs are trained on static datasets and have no built-in access to up-to-date information.
To solve that issue, LlamaIndex introduces support for tools that:
- Provide data connectors to ingest data from APIs, PDFs, Word docs, SQL databases, web pages, and more.
- Structure your data using indices, graphs, and other formats optimized for LLM consumption.
- Enable advanced retrieval so you can input an LLM prompt and receive a knowledge-augmented response, grounded in relevant context.
- Support seamless integration with external frameworks such as LangChain, Flask, Docker, and ChatGPT.
In other words, building with LlamaIndex typically means combining the core library with a set of plugins/integrations tailored to your use case. For example, explore a LlamaIndex web scraping scenario.
Now, the Web is the largest and most comprehensive source of data on the planet. Thus, an AI agent should ideally have access to it in order to ground its responses and perform tasks more effectively. This is where LlamaIndex Bright Data tools come into play!
With the Bright Data tools, your LlamaIndex AI agent gains:
- Real-time web scraping functionality from any webpage.
- Structured product and platform data from sites like Amazon, LinkedIn, Zillow, Facebook, and many others.
- The ability to retrieve search engine results for any search query.
- Visual data capture via full-page screenshots, useful for summarization or visual analysis.
See how this integration works in the next chapter!
Build a LlamaIndex Agent That Can Source the Web Using Bright Data Tools
In this step-by-step section, you will learn how to use LlamaIndex to build a Python AI agent that connects to Bright Data tools.
This integration will give your agent powerful web data access features. Specifically, the AI agent will gain the ability to extract content from any web page, fetch real-time search engine results, and more. For more information, refer to our official documentation.
Follow the steps below to build your Bright Data–powered AI agent using LlamaIndex!
Prerequisites
To follow this tutorial, you will need the following:
- Python 3.9 or higher installed on your machine (the latest version is recommended).
- A Bright Data API key for integration with `BrightDataToolSpec`.
- An API key from a supported LLM provider (in this guide, we will use Gemini, which is free to use via API, but feel free to use any provider supported by LlamaIndex).
Do not worry if you do not have a Gemini or Bright Data API key yet. We will walk you through how to create both in the next steps.
Step #1: Create Your Python Project
Start by opening a terminal and creating a new folder for your LlamaIndex AI agent project:
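```bash
mkdir llamaindex-bright-data-agent
```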
The `llamaindex-bright-data-agent/` folder will contain the code for your AI agent with web data retrieval capabilities powered by Bright Data.
Next, move into the project directory and create a virtual environment inside it:
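```bash
cd llamaindex-bright-data-agent
# create a virtual environment in a folder named "venv" (folder name assumed)
python -m venv venv
```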
Now, open the project folder in your favorite Python IDE. We recommend Visual Studio Code (with the Python extension) or PyCharm Community Edition.
Create a new file called `agent.py` in the root of the folder. Your project structure should now look like this:
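```
llamaindex-bright-data-agent/
├── venv/
└── agent.py
```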
In your terminal, activate the virtual environment. On Linux or macOS, run this command:
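```bash
source venv/bin/activate
```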
Equivalently, on Windows, execute:
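```bash
venv\Scripts\activate
```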
In the next steps, we will walk you through installing the required packages. Still, if you prefer to install them all now, run:
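```bash
# package names assumed from the steps that follow
pip install python-dotenv llama-index llama-index-tools-brightdata llama-index-llms-google-genai
```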
Note: We are installing `llama-index-llms-google-genai` because this tutorial uses Gemini as the LLM provider. If you are planning to use a different provider, be sure to install the corresponding LlamaIndex integration for it instead.
You are all set! You now have a Python development environment ready to build an AI agent using LlamaIndex and Bright Data tools.
Step #2: Set Up Environment Variable Reading
Your LlamaIndex agent will connect to external services like Gemini and Bright Data via API keys. For security reasons, never hardcode API keys directly into your Python code. Instead, use environment variables to keep them private.
To make working with environment variables easier, install the `python-dotenv` library. With your virtual environment activated, run:
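```bash
pip install python-dotenv
```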
Then, open your `agent.py` file and add the following lines at the top to load variables from a `.env` file:
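```python
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
```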
The `load_dotenv()` function looks for a `.env` file in the project's root directory and automatically loads its values into the environment.
Now, create a `.env` file alongside your `agent.py` file, like so:
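```
llamaindex-bright-data-agent/
├── venv/
├── .env      # <- add this file
└── agent.py
```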
Perfect! You have now set up a secure way to manage sensitive API credentials for third-party services. Time to continue the initial setup by populating the `.env` file with the required environment variables.
Step #3: Get Started with Bright Data
As of this writing, the `BrightDataToolSpec` exposes the following tools within LlamaIndex:
- `scrape_as_markdown`: Scrapes the raw content of any webpage and returns it in Markdown format.
- `get_screenshot`: Captures a full-page screenshot of a webpage and saves it locally.
- `search_engine`: Performs a search query on search engines like Google, Bing, Yandex, and more. It returns the entire SERP or a JSON-structured version of that data.
- `web_data_feed`: Retrieves structured JSON data from well-known platforms.
The first three tools (`scrape_as_markdown`, `get_screenshot`, and `search_engine`) use Bright Data's Web Unlocker API. That solution opens the door to web scraping and screenshotting from any site, even those with strict anti-bot protection. Plus, it supports SERP data access from all major search engines.
In contrast, `web_data_feed` leverages Bright Data's Web Scraper API. That endpoint returns pre-structured data from a predefined list of supported platforms such as Amazon, Instagram, LinkedIn, ZoomInfo, and more.
To integrate these tools, you will need to:
- Enable the Web Unlocker solution in your Bright Data dashboard.
- Retrieve your Bright Data API token, which grants access to both the Web Unlocker and Web Scraper API.
Follow the steps below to complete the setup!
First, if you do not already have a Bright Data account, go ahead and create one. If you already have an account, log in and open your dashboard. Click the “Get proxy products” button:
You will be redirected to the “Proxies & Scraping Infrastructure” page:
If you already see an active Web Unlocker API zone (as above), you are good to go. The zone name (`unlocker`, in this case) is important, as you will need it later in your code.
If you do not have one yet, scroll down to the “Web Unlocker API” section and click “Create zone”:
Give your new zone a name, such as `unlocker`, enable advanced features for better performance, and click “Add”:
Once the zone is created, you will be redirected to the zone’s configuration page:
Make sure that the activation toggle is set to “Active.” This confirms the zone is properly configured and ready for use.
Next, follow the official Bright Data guide to generate your API key. Once you have it, store it securely in your `.env` file like this:
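```
# variable name assumed; it must match the name read in agent.py (Step #4)
BRIGHT_DATA_API_KEY="<YOUR_BRIGHT_DATA_API_KEY>"
```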
Replace the `<YOUR_BRIGHT_DATA_API_KEY>` placeholder with your actual API key value.
Amazing! Time to integrate Bright Data tools into your LlamaIndex agent script.
Step #4: Install and Configure the LlamaIndex Bright Data Tools
In `agent.py`, start by loading your Bright Data API key from the environment:
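```python
# read the Bright Data API key from the environment
# (the BRIGHT_DATA_API_KEY name is assumed; it must match your .env file)
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")
```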
Do not forget to import `os` from the Python standard library:
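```python
import os
```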
With your virtual environment activated, install the LlamaIndex Bright Data tools package:
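```bash
pip install llama-index-tools-brightdata
```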
In your `agent.py` file, import the `BrightDataToolSpec` class:
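```python
# import path assumed from the llama-index-tools-brightdata package
from llama_index.tools.brightdata import BrightDataToolSpec
```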
Then, create an instance of `BrightDataToolSpec` using your API key and the zone name:
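```python
# the api_key and zone parameter names are assumed from the BrightDataToolSpec API
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="<BRIGHT_DATA_WEB_UNLOCKER_API_ZONE_NAME>",
    verbose=True,
)
```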
Replace the `<BRIGHT_DATA_WEB_UNLOCKER_API_ZONE_NAME>` placeholder with the name of the Web Unlocker API zone you set up earlier. In this case, it is `unlocker`:
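```python
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker",  # the Web Unlocker API zone created earlier
    verbose=True,
)
```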
Note that the `verbose` option has been set to `True`. That is helpful while developing, as it prints useful information about what is happening when the LlamaIndex agent makes requests via Bright Data.
Then, convert the tool spec to a list of usable tools in your agent:
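```python
# convert the tool spec into a list of tools the agent can call
tools = brightdata_tool_spec.to_tool_list()
```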
Fantastic! Bright Data tools are now integrated and ready to power your LlamaIndex agent. The next step is to connect your LLM.
Step #5: Prepare the LLM
To use Gemini (the chosen LLM provider), start by installing the required integration package:
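```bash
pip install llama-index-llms-google-genai
```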
Next, import the `GoogleGenAI` class from the installed package:
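```python
from llama_index.llms.google_genai import GoogleGenAI
```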
Now, initialize the Gemini LLM like this:
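```python
# model name from this tutorial; swap in any supported Gemini model as needed
llm = GoogleGenAI(model="gemini-2.5-flash")
```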
In this example, we are using the `gemini-2.5-flash` model. You can swap this out for any other supported Gemini model as needed.
Behind the scenes, `GoogleGenAI` automatically looks for an environment variable named `GEMINI_API_KEY`. To set it, open your `.env` file and add the following line:
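```
GEMINI_API_KEY="<YOUR_GEMINI_API_KEY>"
```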
Replace the `<YOUR_GEMINI_API_KEY>` placeholder with your actual Gemini API key. If you do not have one, retrieve it for free by following the official guide.
Note: If you prefer using a different LLM provider, LlamaIndex supports many options. Just refer to the official LlamaIndex docs for setup instructions.
Great job! You now have all the core components wired together to build a LlamaIndex agent with web data retrieval capabilities.
Step #6: Create the LlamaIndex Agent
First, install the main LlamaIndex package:
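```bash
pip install llama-index
```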
Then, in your `agent.py` file, import the `FunctionAgent` class:
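```python
# import path assumed for recent LlamaIndex releases
from llama_index.core.agent.workflow import FunctionAgent
```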
`FunctionAgent` is a special type of LlamaIndex AI agent that can interact with external tools, such as the Bright Data tools you configured earlier.
Initialize the agent with your LLM and Bright Data tools like this:
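```python
# a minimal sketch; the exact constructor arguments may vary by LlamaIndex version
agent = FunctionAgent(
    tools=tools,
    llm=llm,
    verbose=True,  # log which tools the agent calls for each request
)
```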
This sets up an AI agent that processes user inputs using your LLM and can call Bright Data tools to retrieve information as needed. The `verbose=True` flag is handy during development because it shows which tools the agent is using for each request.
Well done! The LlamaIndex + Bright Data integration is complete. The next step is to build the REPL for interactive use.
Step #7: Implement the REPL
REPL stands for “Read-Eval-Print Loop” and is an interactive programming pattern where you can enter commands, have them evaluated, and see the results immediately. In this context, you:
- Enter a command or task.
- Let the AI agent evaluate and handle it.
- See the response.
This loop continues indefinitely, until you type `"exit"`.
When dealing with AI agents, the REPL tends to be more practical than sending isolated prompts. The reason is that it enables your LlamaIndex agent to maintain session context, improving its responses by learning from previous interactions.
Now, implement the REPL logic in `agent.py` as below:
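```python
# a minimal REPL sketch; the prompt strings are illustrative
async def main():
    print("Agent ready! Type 'exit' to quit.\n")
    while True:
        # 1. Read the user's request from the command line
        user_input = input("> ")
        if user_input.strip().lower() == "exit":
            break
        # 2. Evaluate it with the LlamaIndex agent (may trigger Bright Data tools)
        response = await agent.run(user_input)
        # 3. Print the agent's response
        print(f"\n{response}\n")

if __name__ == "__main__":
    asyncio.run(main())
```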
This REPL:
- Reads the user’s input from the command line with `input()`.
- Evaluates it using the LlamaIndex agent powered by Gemini and Bright Data with `agent.run()`.
- Prints the response back to the console.
Do not forget to import `asyncio`:
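```python
import asyncio
```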
Terrific! The LlamaIndex AI agent is ready.
Step #8: Put It All Together and Run the Agent
This is what your `agent.py` file should now contain:
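```python
# agent.py -- assembled from the previous steps
# (package names, parameter names, and REPL strings as assumed earlier)
import asyncio
import os

from dotenv import load_dotenv
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.tools.brightdata import BrightDataToolSpec

# Load environment variables from the .env file
load_dotenv()

# Read the Bright Data API key from the environment
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")

# Configure the Bright Data tools with your Web Unlocker API zone
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker",
    verbose=True,
)
tools = brightdata_tool_spec.to_tool_list()

# Initialize the Gemini LLM (GEMINI_API_KEY is read from the environment)
llm = GoogleGenAI(model="gemini-2.5-flash")

# Create the agent with the LLM and the Bright Data tools
agent = FunctionAgent(
    tools=tools,
    llm=llm,
    verbose=True,
)

async def main():
    print("Agent ready! Type 'exit' to quit.\n")
    while True:
        # Read the user's request from the command line
        user_input = input("> ")
        if user_input.strip().lower() == "exit":
            break
        # Evaluate the request with the agent
        response = await agent.run(user_input)
        # Print the agent's response
        print(f"\n{response}\n")

if __name__ == "__main__":
    asyncio.run(main())
```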
Run the agent script using the following command:
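```bash
python agent.py
```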
When the script starts, you will see something like this:
Enter a prompt like the following into the terminal (the exact wording is illustrative; it uses the Amazon product URL from the tool call shown below):
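```
Summarize the product on this page: "https://www.amazon.com/Death-Stranding-2-Beach-PlayStation-5/dp/B0F19GPDW3/"
```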
The result will be:
That was quite fast, so let’s break down what happened:
- The agent identifies that the task requires Amazon product data, so it calls the `web_data_feed` tool with this input: `{"source_type": "amazon_product", "url": "https://www.amazon.com/Death-Stranding-2-Beach-PlayStation-5/dp/B0F19GPDW3/"}`
- That tool asynchronously queries Bright Data’s Amazon Web Scraper API to fetch structured product data.
- Once the JSON response is returned, the agent feeds it to the Gemini LLM.
- Gemini processes the fresh data and generates a clear, accurate summary.
In other words, given the prompt, the agent smartly selects the best tool. In this case, that is `web_data_feed`, which retrieves real-time product data from the given Amazon page asynchronously. The LLM then uses that data to generate a meaningful summary.
In this case, the AI agent returned:
Notice how the AI agent would not be able to achieve such an outcome without the Bright Data tools. That is because:
- The chosen Amazon product is a recent release, and LLMs are not trained on such fresh data.
- LLMs may not be able to scrape or access real-time web pages on their own.
- Scraping Amazon products is notoriously difficult due to strict anti-bot systems like the Amazon CAPTCHA.
Important: If you try other prompts, you will see that the agent automatically selects and uses the appropriate configured tools to retrieve the data it needs to generate grounded responses.
Et voilà! You now have a LlamaIndex AI agent with top-notch web data access features, powered by integration with Bright Data.
Conclusion
In this article, you learned how to use LlamaIndex to build an AI agent with real-time access to web data, thanks to Bright Data tools.
That integration gives your agent the ability to retrieve public web content in Markdown format, as structured JSON, and even as screenshots. That applies to both regular websites and search engines.
Keep in mind that the integration seen here was just a basic example. If you are aiming to build more advanced agents, you will need reliable tools for retrieving, validating, and transforming live web data. That is exactly what Bright Data’s AI infrastructure is built for.
Create a free Bright Data account and start exploring our AI-ready data tools today!