In this guide, you will learn:
- What LlamaIndex is and why it is so widely used.
- What makes it unique for AI agent development, especially its built-in support for data integrations.
- How to use LlamaIndex to build an AI agent with data retrieval capabilities from both general sites and specific search engines.
Let’s dive in!
What Is LlamaIndex?
LlamaIndex is an open-source Python data framework for building LLM-powered applications.
It helps you create production-ready AI workflows and agents capable of finding and retrieving relevant information, synthesizing insights, generating detailed reports, taking automated actions, and more.
LlamaIndex is one of the fastest-growing libraries for building AI agents, with over 42k GitHub stars:
Integrate Data into Your LlamaIndex AI Agent
Compared to other AI agent-building technologies, LlamaIndex focuses on data. That is why the project’s GitHub repository defines LlamaIndex as a “data framework.”
Specifically, LlamaIndex addresses one of the biggest limitations of LLMs: their lack of knowledge about current or real-time events. This limitation arises because LLMs are trained on static datasets and have no built-in access to up-to-date information.
To solve that issue, LlamaIndex introduces support for tools that:
- Provide data connectors to ingest data from APIs, PDFs, Word docs, SQL databases, web pages, and more.
- Structure your data using indices, graphs, and other formats optimized for LLM consumption.
- Enable advanced retrieval so you can input an LLM prompt and receive a knowledge-augmented response, grounded in relevant context.
- Support seamless integration with external frameworks such as LangChain, Flask, Docker, and ChatGPT.
In other words, building with LlamaIndex typically means combining the core library with a set of plugins/integrations tailored to your use case. For example, explore a LlamaIndex web scraping scenario.
Now, the Web is the largest and most comprehensive source of data on the planet. Thus, an AI agent should ideally have access to it in order to ground its responses and perform tasks more effectively. This is where LlamaIndex Bright Data tools come into play!
With the Bright Data tools, your LlamaIndex AI agent gains:
- Real-time web scraping functionality from any webpage.
- Structured product and platform data from sites like Amazon, LinkedIn, Zillow, Facebook, and many others.
- The ability to retrieve search engine results for any search query.
- Visual data capture via full-page screenshots, useful for summarization or visual analysis.
See how this integration works in the next chapter!
Build a LlamaIndex Agent That Can Source the Web Using Bright Data Tools
In this step-by-step section, you will learn how to use LlamaIndex to build a Python AI agent that connects to Bright Data tools.
This integration will give your agent powerful web data access features. Specifically, the AI agent will gain the ability to extract content from any web page, fetch real-time search engine results, and more. For more information, refer to our official documentation.
Follow the steps below to build your Bright Data–powered AI agent using LlamaIndex!
Prerequisites
To follow this tutorial, you will need the following:
- Python 3.9 or higher installed on your machine (the latest version is recommended).
- A Bright Data API key for integration with `BrightDataToolSpec`.
- An API key from a supported LLM provider (in this guide, we will use Gemini, which is free to use via API, but feel free to use any provider supported by LlamaIndex).
Do not worry if you do not have a Gemini or Bright Data API key yet. We will walk you through how to create both in the next steps.
Step #1: Create Your Python Project
Start by opening a terminal and creating a new folder for your LlamaIndex AI agent project:
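```bash
mkdir llamaindex-bright-data-agent
```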
The `llamaindex-bright-data-agent/` folder will contain the code for your AI agent with web data retrieval capabilities powered by Bright Data.
Next, move into the project directory and create a virtual environment inside it:
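```bash
cd llamaindex-bright-data-agent
# create a virtual environment in a folder named "venv" (folder name assumed)
python -m venv venv
```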
Now, open the project folder in your favorite Python IDE. We recommend Visual Studio Code (with the Python extension) or PyCharm Community Edition.
Create a new file called `agent.py` in the root of the folder. Your project structure should now look like this:
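```
llamaindex-bright-data-agent/
├── venv/
└── agent.py
```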
In your terminal, activate the virtual environment. On Linux or macOS, run this command:
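```bash
source venv/bin/activate
```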
Equivalently, on Windows, execute:
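```bash
venv\Scripts\activate
```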
In the next steps, we will walk you through installing the required packages. Still, if you prefer to install them all now, run:
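```bash
# package names assumed from the steps that follow
pip install python-dotenv llama-index llama-index-tools-brightdata llama-index-llms-google-genai
```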
Note: We are installing `llama-index-llms-google-genai` because this tutorial uses Gemini as the LLM provider. If you are planning to use a different provider, be sure to install the corresponding LlamaIndex integration for it instead.
You are all set! You now have a Python development environment ready to build an AI agent using LlamaIndex and Bright Data tools.
Step #2: Set Up Environment Variable Reading
Your LlamaIndex agent will connect to external services like Gemini and Bright Data via API keys. For security reasons, never hardcode API keys directly into your Python code. Instead, use environment variables to keep them private.
To make working with environment variables easier, install the `python-dotenv` library. With your virtual environment activated, run:
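```bash
pip install python-dotenv
```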
Then, open your `agent.py` file and add the following lines at the top to load variables from a `.env` file:
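```python
from dotenv import load_dotenv

# Load environment variables from the .env file
load_dotenv()
```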
The `load_dotenv()` function looks for a `.env` file in the project's root directory and automatically loads its values into the environment.
Now, create a `.env` file alongside your `agent.py` file, like so:
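```
llamaindex-bright-data-agent/
├── venv/
├── .env      # <- add this file
└── agent.py
```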
Perfect! You have now set up a secure way to manage sensitive API credentials for third-party services. Time to continue the initial setup by populating the `.env` file with the required environment variables.
Step #3: Get Started with Bright Data
As of this writing, the `BrightDataToolSpec` exposes the following tools within LlamaIndex:
- `scrape_as_markdown`: Scrapes the raw content of any webpage and returns it in Markdown format.
- `get_screenshot`: Captures a full-page screenshot of a webpage and saves it locally.
- `search_engine`: Performs a search query on search engines like Google, Bing, Yandex, and more. It returns the entire SERP or a JSON-structured version of that data.
- `web_data_feed`: Retrieves structured JSON data from well-known platforms.
The first three tools (`scrape_as_markdown`, `get_screenshot`, and `search_engine`) use Bright Data's Web Unlocker API. That solution opens the door to web scraping and screenshotting from any site, even those with strict anti-bot protection. Plus, it supports SERP data access from all major search engines.
In contrast, `web_data_feed` leverages Bright Data's Web Scraper API. That endpoint returns pre-structured data from a predefined list of supported platforms such as Amazon, Instagram, LinkedIn, ZoomInfo, and more.
To integrate these tools, you will need to:
- Enable the Web Unlocker solution in your Bright Data dashboard.
- Retrieve your Bright Data API token, which grants access to both the Web Unlocker and Web Scraper API.
Follow the steps below to complete the setup!
First, if you do not already have a Bright Data account, go ahead and create one. If you already have an account, log in and open your dashboard. Click the “Get proxy products” button:
You will be redirected to the “Proxies & Scraping Infrastructure” page:
If you already see an active Web Unlocker API zone (as above), you are good to go. The zone name (`unlocker`, in this case) is important, as you will need it later in your code.
If you do not have one yet, scroll down to the “Web Unlocker API” section and click “Create zone”:
Give your new zone a name, such as `unlocker`, enable advanced features for better performance, and click “Add”:
Once the zone is created, you will be redirected to the zone’s configuration page:
Make sure that the activation toggle is set to “Active.” This confirms the zone is properly configured and ready for use.
Next, follow the official Bright Data guide to generate your API key. Once you have it, store it securely in your `.env` file like this:
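```
# variable name assumed; it must match the name read in agent.py (Step #4)
BRIGHT_DATA_API_KEY="<YOUR_BRIGHT_DATA_API_KEY>"
```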
Replace the `<YOUR_BRIGHT_DATA_API_KEY>` placeholder with your actual API key value.
Amazing! Time to integrate Bright Data tools into your LlamaIndex agent script.
Step #4: Install and Configure the LlamaIndex Bright Data Tools
In `agent.py`, start by loading your Bright Data API key from the environment:
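```python
# read the Bright Data API key from the environment
# (the BRIGHT_DATA_API_KEY name is assumed; it must match your .env file)
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")
```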
Do not forget to import `os` from the Python standard library:
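```python
import os
```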
With your virtual environment activated, install the LlamaIndex Bright Data tools package:
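```bash
pip install llama-index-tools-brightdata
```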
In your `agent.py` file, import the `BrightDataToolSpec` class:
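```python
# import path assumed from the llama-index-tools-brightdata package
from llama_index.tools.brightdata import BrightDataToolSpec
```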
Then, create an instance of `BrightDataToolSpec` using your API key and the zone name:
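```python
# the api_key and zone parameter names are assumed from the BrightDataToolSpec API
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="<BRIGHT_DATA_WEB_UNLOCKER_API_ZONE_NAME>",
    verbose=True,
)
```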
Replace the `<BRIGHT_DATA_WEB_UNLOCKER_API_ZONE_NAME>` placeholder with the name of the Web Unlocker API zone you set up earlier. In this case, it is `unlocker`:
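```python
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker",  # the Web Unlocker API zone created earlier
    verbose=True,
)
```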
Note that the `verbose` option has been set to `True`. That is helpful while developing, as it prints useful information about what is happening when the LlamaIndex agent makes requests via Bright Data.
Then, convert the tool spec to a list of usable tools in your agent:
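```python
# convert the tool spec into a list of tools the agent can call
tools = brightdata_tool_spec.to_tool_list()
```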
Fantastic! Bright Data tools are now integrated and ready to power your LlamaIndex agent. The next step is to connect your LLM.
Step #5: Prepare the LLM
To use Gemini (the chosen LLM provider), start by installing the required integration package:
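```bash
pip install llama-index-llms-google-genai
```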
Next, import the `GoogleGenAI` class from the installed package:
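```python
from llama_index.llms.google_genai import GoogleGenAI
```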
Now, initialize the Gemini LLM like this:
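```python
# model name from this tutorial; swap in any supported Gemini model as needed
llm = GoogleGenAI(model="gemini-2.5-flash")
```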
In this example, we are using the `gemini-2.5-flash` model. You can swap this out for any other supported Gemini model as needed.
Behind the scenes, `GoogleGenAI` automatically looks for an environment variable named `GEMINI_API_KEY`. To set it, open your `.env` file and add the following line:
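```
GEMINI_API_KEY="<YOUR_GEMINI_API_KEY>"
```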
Replace the `<YOUR_GEMINI_API_KEY>` placeholder with your actual Gemini API key. If you do not have one, retrieve it for free by following the official guide.
Note: If you prefer using a different LLM provider, LlamaIndex supports many options. Just refer to the official LlamaIndex docs for setup instructions.
Great job! You now have all the core components wired together to build a LlamaIndex agent with web data retrieval capabilities.
Step #6: Create the LlamaIndex Agent
First, install the main LlamaIndex package:
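```bash
pip install llama-index
```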
Then, in your `agent.py` file, import the `FunctionAgent` class:
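```python
# import path assumed for recent LlamaIndex releases
from llama_index.core.agent.workflow import FunctionAgent
```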
`FunctionAgent` is a special type of LlamaIndex AI agent that can interact with external tools, such as the Bright Data tools you configured earlier.
Initialize the agent with your LLM and Bright Data tools like this:
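```python
# a minimal sketch; the exact constructor arguments may vary by LlamaIndex version
agent = FunctionAgent(
    tools=tools,
    llm=llm,
    verbose=True,  # log which tools the agent calls for each request
)
```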
This sets up an AI agent that processes user inputs using your LLM and can call Bright Data tools to retrieve information as needed. The `verbose=True` flag is handy during development because it shows which tools the agent is using for each request.
Well done! The LlamaIndex + Bright Data integration is complete. The next step is to build the REPL for interactive use.
Step #7: Implement the REPL
REPL stands for “Read-Eval-Print Loop” and is an interactive programming pattern where you can enter commands, have them evaluated, and see the results immediately. In this context, you:
- Enter a command or task.
- Let the AI agent evaluate and handle it.
- See the response.
This loop continues indefinitely, until you type `"exit"`.
When dealing with AI agents, the REPL tends to be more practical than sending isolated prompts. The reason is that it enables your LlamaIndex agent to maintain session context, improving its responses by learning from previous interactions.
Now, implement the REPL logic in `agent.py` as below:
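```python
# a minimal REPL sketch; the prompt strings are illustrative
async def main():
    print("Agent ready! Type 'exit' to quit.\n")
    while True:
        # 1. Read the user's request from the command line
        user_input = input("> ")
        if user_input.strip().lower() == "exit":
            break
        # 2. Evaluate it with the LlamaIndex agent (may trigger Bright Data tools)
        response = await agent.run(user_input)
        # 3. Print the agent's response
        print(f"\n{response}\n")

if __name__ == "__main__":
    asyncio.run(main())
```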
This REPL:
- Reads the user’s input from the command line with `input()`.
- Evaluates it using the LlamaIndex agent powered by Gemini and Bright Data with `agent.run()`.
- Prints the response back to the console.
Do not forget to import `asyncio`:
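```python
import asyncio
```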
Terrific! The LlamaIndex AI agent is ready.
Step #8: Put It All Together and Run the Agent
This is what your `agent.py` file should now contain:
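```python
# agent.py -- assembled from the previous steps
# (package names, parameter names, and REPL strings as assumed earlier)
import asyncio
import os

from dotenv import load_dotenv
from llama_index.core.agent.workflow import FunctionAgent
from llama_index.llms.google_genai import GoogleGenAI
from llama_index.tools.brightdata import BrightDataToolSpec

# Load environment variables from the .env file
load_dotenv()

# Read the Bright Data API key from the environment
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")

# Configure the Bright Data tools with your Web Unlocker API zone
brightdata_tool_spec = BrightDataToolSpec(
    api_key=BRIGHT_DATA_API_KEY,
    zone="unlocker",
    verbose=True,
)
tools = brightdata_tool_spec.to_tool_list()

# Initialize the Gemini LLM (GEMINI_API_KEY is read from the environment)
llm = GoogleGenAI(model="gemini-2.5-flash")

# Create the agent with the LLM and the Bright Data tools
agent = FunctionAgent(
    tools=tools,
    llm=llm,
    verbose=True,
)

async def main():
    print("Agent ready! Type 'exit' to quit.\n")
    while True:
        # Read the user's request from the command line
        user_input = input("> ")
        if user_input.strip().lower() == "exit":
            break
        # Evaluate the request with the agent
        response = await agent.run(user_input)
        # Print the agent's response
        print(f"\n{response}\n")

if __name__ == "__main__":
    asyncio.run(main())
```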
Run the agent script using the following command:
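```bash
python agent.py
```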
When the script starts, you will see something like this:
Enter a prompt like the following into the terminal (the exact wording is illustrative; it uses the Amazon product URL from the tool call shown below):
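```
Summarize the product on this page: "https://www.amazon.com/Death-Stranding-2-Beach-PlayStation-5/dp/B0F19GPDW3/"
```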
The result will be:
That was quite fast, so let’s break down what happened:
- The agent identifies that the task requires Amazon product data, so it calls the `web_data_feed` tool with this input: `{"source_type": "amazon_product", "url": "https://www.amazon.com/Death-Stranding-2-Beach-PlayStation-5/dp/B0F19GPDW3/"}`
- That tool asynchronously queries Bright Data’s Amazon Web Scraper API to fetch structured product data.
- Once the JSON response is returned, the agent feeds it to the Gemini LLM.
- Gemini processes the fresh data and generates a clear, accurate summary.
In other words, given the prompt, the agent smartly selects the best tool. In this case, that is `web_data_feed`, which retrieves real-time product data from the given Amazon page asynchronously. The LLM then uses that data to generate a meaningful summary.
In this case, the AI agent returned:
Notice how the AI agent would not be able to achieve such an outcome without the Bright Data tools. That is because:
- The chosen Amazon product is a recent release, and LLMs are not trained on such fresh data.
- LLMs may not be able to scrape or access real-time web pages on their own.
- Scraping Amazon products is notoriously difficult due to strict anti-bot systems like the Amazon CAPTCHA.
Important: If you try other prompts, you will see that the agent automatically selects and uses the appropriate configured tools to retrieve the data it needs to generate grounded responses.
Et voilà! You now have a LlamaIndex AI agent with top-notch web data access features, powered by integration with Bright Data.
Conclusion
In this article, you learned how to use LlamaIndex to build an AI agent with real-time access to web data, thanks to Bright Data tools.
That integration gives your agent the ability to retrieve public web content in Markdown format, as structured JSON, and even as screenshots. That applies to both regular websites and search engines.
Keep in mind that the integration seen here was just a basic example. If you are aiming to build more advanced agents, you will need reliable tools for retrieving, validating, and transforming live web data. That is exactly what Bright Data’s AI infrastructure is built for.
Create a free Bright Data account and start exploring our AI-ready data tools today!