Throughout this guide, we’ll use LlamaIndex and its Bright Data tools to extract data. By the end of this tutorial, you’ll be able to do all of the following:
- Extract website data as markdown
- Take screenshots of webpages
- Perform Google searches from inside your application
- Trigger collections on demand using datafeeds and Bright Data’s Web Scraper API
Introduction: What is LlamaIndex?
Before the age of AI, data collection was a brittle, high-maintenance process. A single change to the site layout could break your entire pipeline. Today, that no longer has to be the case, as long as you’re using the right tools.
LlamaIndex connects language models to external tooling and data sources. It comes prepackaged with lightweight components designed to work with these toolsets with minimal setup. In our case, LlamaIndex integrates with Bright Data’s MCP Server.
In the next few sections, we’ll walk through the capabilities of the Bright Data toolset from LlamaIndex. Make sure you’ve got Python installed.
Prerequisites
Our requirements here are surprisingly light. For simple scraping operations, we don’t even need an LLM. You need LlamaIndex and a Bright Data API key, and that’s it!
LlamaIndex
LlamaIndex offers a full suite of tools you can install with the following command. If you’re only looking to scrape the web, this isn’t strictly required.
You can install their Bright Data Tools with the following command via pip.
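Once the installs finish, a quick check like the sketch below can confirm that Python can see the packages. The module path llama_index.tools.brightdata is an assumption about where the Bright Data integration lives; adjust it if your installation differs.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located on the import path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises instead of returning None.
        return False


# Module names below are assumptions, not confirmed by this guide.
for name in ("llama_index", "llama_index.tools.brightdata"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the install step in the same environment you use to run your scripts.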
Bright Data
First, you need an account with Bright Data. You can use this link to sign up for a free trial with Unlocker. Once you’ve got an account, save your API key.
You can find your API key in your Bright Data “proxies” dashboard or in your user settings.
Scraping With LlamaIndex
BrightDataToolSpec: Your Bridge To Bright Data MCP
LlamaIndex gives us access to the BrightDataToolSpec class. The snippet below sets up access to all of the tooling. Remember to replace the API key with your own and the zone name with one of your personal zones.
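One way to keep the key out of your source code is to read it from the environment before building the tool spec. This is a minimal sketch, not the only way to set things up: the environment variable names and the "unlocker" default zone are hypothetical placeholders, and the import path and the api_key/zone keyword arguments are assumptions about the BrightDataToolSpec constructor.

```python
import os
from typing import Mapping


def load_brightdata_config(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the API key and zone name from environment variables.

    BRIGHTDATA_API_KEY and BRIGHTDATA_ZONE are hypothetical variable names.
    """
    api_key = env.get("BRIGHTDATA_API_KEY")
    if not api_key:
        raise RuntimeError("Set BRIGHTDATA_API_KEY before running the scraper.")
    # "unlocker" is a placeholder; use the name of one of your own zones.
    return {"api_key": api_key, "zone": env.get("BRIGHTDATA_ZONE", "unlocker")}


def build_tool_spec(config: dict):
    """Create the tool spec (import path and keyword names are assumptions)."""
    # Imported lazily so the config helper works even without the package.
    from llama_index.tools.brightdata import BrightDataToolSpec

    return BrightDataToolSpec(api_key=config["api_key"], zone=config["zone"])


# Typical usage, once the environment variables are set:
# brightdata_tool = build_tool_spec(load_brightdata_config())
```

Failing early on a missing key tends to be easier to debug than an authentication error halfway through a scrape.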
Scrape As Markdown
The snippet below sets you up to scrape any page and return its content as markdown. The scrape_as_markdown() method does it all for us.
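A call to scrape_as_markdown() could be wrapped roughly as follows. The sketch assumes the method takes the URL as its first argument and returns an object with a text attribute; if your version returns something else, the helper falls back to its string form. The helper name scrape_to_markdown is hypothetical.

```python
def scrape_to_markdown(tool, url: str) -> str:
    """Scrape `url` with a BrightDataToolSpec-like object and return markdown."""
    result = tool.scrape_as_markdown(url)
    # Assumption: the result exposes the markdown through a `.text` attribute.
    text = getattr(result, "text", None)
    return text if isinstance(text, str) else str(result)


# Typical usage with a configured tool spec:
# markdown = scrape_to_markdown(brightdata_tool, "https://www.amazon.com/s?k=laptops")
# print(markdown[:500])
```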
Here’s some sample output from the command. As you can see, we’re successfully scraping Amazon data and converting it to markdown.
Taking Screenshots
Screenshots are another excellent tool when scraping the web. Most modern LLMs can view and interpret pictures. In the snippet below, we take a screenshot of the page with the get_screenshot() method.
The shot below came from BrightDataToolSpec. This might be the easiest screenshot method available in all of Python.
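A screenshot call might look like the sketch below, which also verifies that an image file was actually written. It assumes get_screenshot() accepts the URL plus an output_path keyword argument; the helper name save_screenshot is hypothetical.

```python
from pathlib import Path


def save_screenshot(tool, url: str, output_path: str) -> Path:
    """Capture `url` to `output_path` and confirm the file exists."""
    # Assumption: the method writes the image to `output_path` itself.
    tool.get_screenshot(url, output_path=output_path)
    path = Path(output_path)
    if not path.is_file() or path.stat().st_size == 0:
        raise RuntimeError(f"No screenshot was written to {path}")
    return path


# Typical usage with a configured tool spec:
# save_screenshot(brightdata_tool, "https://www.amazon.com/s?k=laptops", "laptops.png")
```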
Search Engine
Like the tools preceding it, the search engine is called with a single method, search_engine(). It uses Google by default, but you can switch to another supported engine. You can learn more about our SERP query parameters here.
Besides Google, the following search engines are available.
- Bing
- Yandex
- DuckDuckGo
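Choosing an engine could be sketched as follows. The engine keyword argument and the lowercase engine identifiers are assumptions based on the list above, and run_search is a hypothetical helper, so check them against the tool’s documentation before relying on them.

```python
# Lowercase identifiers assumed from the engines named in this guide.
SUPPORTED_ENGINES = {"google", "bing", "yandex", "duckduckgo"}


def run_search(tool, query: str, engine: str = "google"):
    """Run `query` through a BrightDataToolSpec-like object's search_engine()."""
    engine = engine.lower()
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported engine: {engine!r}")
    # Assumption: the engine is selected with an `engine` keyword argument.
    return tool.search_engine(query, engine=engine)


# Typical usage with a configured tool spec:
# results = run_search(brightdata_tool, "best laptops", engine="bing")
```

Validating the engine name locally avoids spending a request on a typo.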
Notice how we’re calling json.loads() before dumping the data to a JSON file. Even when using .json(), LlamaIndex outputs its JSON as a string. If you wish to handle it like a dict, json.loads() parses that string into a regular Python dict for you.
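The string-versus-dict distinction can be seen with the standard library alone. In the sketch below, the raw string is made-up sample data standing in for the tool’s output, and save_json_string is a hypothetical helper name.

```python
import json


def save_json_string(raw: str, path: str) -> dict:
    """Parse a JSON string into a dict, then write it to `path` as real JSON."""
    parsed = json.loads(raw)  # str -> dict
    with open(path, "w", encoding="utf-8") as f:
        # Dumping `parsed` writes an object; dumping `raw` would write
        # one long quoted string instead.
        json.dump(parsed, f, indent=2)
    return parsed


# Made-up sample standing in for the string the search tool returns.
raw = '{"organic": [{"rank": 1, "title": "Example result"}]}'
parsed = json.loads(raw)
print(type(raw).__name__, "->", type(parsed).__name__)
print(parsed["organic"][0]["title"])
```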
Here’s a small portion of the JSON file that our scraper writes.
Web Scraper API
The scraping API allows you to create data feeds that trigger collections on demand. In the code below, we use web_data_feed() to trigger a collection from the Scraper API.
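Triggering a collection might be wrapped like this. The source_type and url keyword arguments are assumptions about the web_data_feed() signature, "amazon_product" is an assumed example of a source type, and trigger_collection is a hypothetical helper name.

```python
def trigger_collection(tool, source_type: str, url: str):
    """Ask a BrightDataToolSpec-like object to start a data feed collection."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"Expected an absolute URL, got {url!r}")
    # Assumption: the feed is selected with `source_type` and targeted with `url`.
    return tool.web_data_feed(source_type=source_type, url=url)


# Typical usage with a configured tool spec (source type is an assumed example):
# data = trigger_collection(brightdata_tool, "amazon_product", "https://www.amazon.com/dp/EXAMPLE")
```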
After a few moments, head on over to your logs page. You should see all of your collections logged and ready to download with the click of a button.
Conclusion
Now, you’ve leveled up your web scraping and cut your workload drastically. With LlamaIndex, Bright Data and just a few lines of Python, you can pull almost any data you want from the web.
Whether you’re extracting markdown, capturing screenshots, running Google searches or triggering full scraping jobs, LlamaIndex and Bright Data give you the power to harvest your valuable data.
Ready to take it to the next level? Hook this power tool combo into a live data pipeline or build an AI agent.
Sign up for a free trial now and level up your data collection today!