In this guide, you will learn:
- Why Gemini is a great solution for AI-powered web scraping
- How to use it to scrape a site in Python through a guided tutorial
- The biggest limitation of this way of scraping the Web and how to overcome it
Let’s dive in!
Why Use Gemini for Web Scraping?
Gemini is a family of multimodal AI models developed by Google that can analyze and interpret text, images, audio, videos, and code. Using Gemini for web scraping simplifies data extraction by automating the interpretation and structuring of unstructured content. That eliminates the need for manual effort—especially when it comes to data parsing.
In detail, these are some of the most common use cases for Gemini in web scraping:
- Pages that frequently change structure: Gemini can handle dynamic pages where the layout or data elements change often, such as in e-commerce sites like Amazon.
- Pages with a lot of unstructured data: It excels at extracting useful information from large volumes of unorganized text.
- Pages where writing custom parsing logic is difficult: For pages with complex or unpredictable structures, Gemini can automate the process without requiring intricate parsing rules.
Common usage scenarios for Gemini in web scraping include:
- RAG (Retrieval-Augmented Generation): Combining real-time scraped data with LLMs to enhance AI insights. For a complete example using a similar AI technology, follow our tutorial on how to create a RAG chatbot using SERP data.
- Social media scraping: Collecting structured data from platforms with dynamic content.
- Content aggregation: Gathering news, articles, or blog posts from multiple sources to create summaries or analytics.
For more information, refer to our guide on using AI for web scraping.
Web Scraping with Gemini in Python: Step-By-Step Guide
As the target site for this section, we will use a specific product page from the “Ecommerce Test Site to Learn Web Scraping” sandbox.
This is a great example because most e-commerce product pages display different types of data or have varying structures. That is what makes e-commerce web scraping so challenging, and where AI can help.
The goal of our Gemini-powered scraper is to leverage AI to extract product details from the page without writing manual parsing logic. The product data retrieved via AI will include:
- SKU
- Name
- Images
- Price
- Description
- Sizes
- Colors
- Category
Follow the steps below to learn how to perform web scraping with Gemini!
Step #1: Project Setup
Before getting started, verify that you have Python 3 installed on your computer. Otherwise, download it and follow the installation wizard.
Now, launch the following command to create a folder for your scraping project:
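```bash
mkdir gemini-scraper
```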
The gemini-scraper directory represents the project folder of your Python Gemini-powered web scraper.
Navigate to it in the terminal, and initialize a virtual environment inside it:
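```bash
cd gemini-scraper
python -m venv venv
```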
Load the project folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are two great options.
Create a scraper.py file in the project’s folder, which should now contain this file structure:
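```
gemini-scraper/
├── venv/
└── scraper.py
```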
Currently, scraper.py is a blank Python script, but it will soon contain the desired LLM scraping logic.
In the IDE’s terminal, activate the virtual environment. In Linux or macOS, execute this command:
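```bash
source venv/bin/activate
```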
Equivalently, on Windows, run:
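```powershell
venv\Scripts\activate
```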
Wonderful! You now have a Python environment for web scraping with Gemini.
Step #2: Configure Gemini
Gemini provides an API that you can call using any HTTP client, including requests. Still, it is best to connect through the official Google AI Python SDK for the Gemini API. To install it, run the following command in the activated virtual environment:
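```bash
pip install google-generativeai
```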
Then, import it in your scraper.py file:
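```python
import google.generativeai as genai
```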
To make the SDK work, you need a Gemini API key. If you have not retrieved your API key yet, follow the official Google documentation. Specifically, log in to your Google account, join Google AI Studio, and navigate to the “Get API Key” page. There, click the “Get API key” button and then press “Create API key” to generate your Gemini API key. Copy the key and store it in a safe place.
Note: The Gemini free tier is enough for this example. The paid tier is only necessary if you need higher rate limits or want to ensure that your prompts and responses are not used to improve Google products. For more details, refer to the Gemini billing page.
To use the Gemini API key in Python, you can either set it as an environment variable:
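```bash
export GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>
```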
Or, alternatively, store it directly in your Python script as a constant:
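```python
GEMINI_API_KEY = "<YOUR_GEMINI_API_KEY>"
```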
And pass it to genai as a configuration, as follows:
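```python
genai.configure(api_key=GEMINI_API_KEY)
```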
In this case, we will follow the second approach. However, keep in mind that both methods work, as google-generativeai automatically tries to read the API key from GEMINI_API_KEY if you do not pass it manually.
Amazing! You can now use the Gemini SDK to make API requests to the LLM in Python.
Step #3: Get the HTML of the Target Page
To connect to the target server and retrieve the HTML of its web pages, we will use Requests—the most popular HTTP client in Python. In an activated virtual environment, install it with:
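```bash
pip install requests
```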
Then, import it in scraper.py:
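```python
import requests
```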
Use it to send a GET request to the target page and retrieve its HTML document:
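For example (replace the placeholder with the URL of the product page you want to scrape):

```python
# URL of the target product page from the e-commerce sandbox
url = "<TARGET_PRODUCT_PAGE_URL>"

# send a GET request to the target page
response = requests.get(url)
```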
response.content will now hold the raw HTML of the page. Time to parse it and extract the data of interest!
Step #4: Convert the HTML to Markdown
If you explore other AI scraping technologies such as Crawl4AI, you will notice that they allow you to use CSS selectors to target HTML elements. These libraries then convert the HTML of the selected elements into Markdown text. Finally, they process that text with an LLM.
Ever wondered why? There are two key reasons for that behavior:
- To reduce the number of tokens sent to the AI, helping you save money (since not all LLM providers offer a free tier like Gemini does).
- To make AI processing faster, as less input data means lower computational costs and quicker responses.
For a complete walkthrough, see our guide on web scraping using Crawl4AI and DeepSeek.
Let’s try to replicate that logic and see if it actually makes sense. Start by opening the target page in an incognito window (to ensure a fresh session). Then, right-click anywhere on the page and select the “Inspect” option.
Examine the page structure. You will see that all relevant data is contained within the HTML element identified by the CSS selector #main.
You could send the entire raw HTML to Gemini, but that would introduce a lot of unnecessary information (such as headers and footers). Instead, by passing only the #main content, you reduce noise and prevent AI hallucinations.
To select only #main, you need a Python HTML parsing tool, such as Beautiful Soup. So, install it with:
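```bash
pip install beautifulsoup4
```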
If you are unfamiliar with its syntax, check out our guide on Beautiful Soup web scraping.
Now, import it in scraper.py:
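```python
from bs4 import BeautifulSoup
```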
Use Beautiful Soup to parse the raw HTML retrieved via Requests, select the #main element, and extract its HTML:
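```python
# parse the HTML of the page retrieved via Requests
soup = BeautifulSoup(response.content, "html.parser")

# select the #main element on the page
main_element = soup.select_one("#main")

# extract the raw HTML of the #main element
main_html = str(main_element)
```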
If you print main_html, you will see the raw HTML of the #main element.
Now, verify how many tokens this HTML would generate and estimate the cost if you were using Gemini’s paid tier. To do so, use a tool like Token Calculator.
In this case, the extracted HTML equates to nearly 20,000 tokens, costing around $0.25 per request on Gemini 1.5 Pro. On a large-scale scraping project, that can easily become a problem!
Try to convert the extracted HTML into Markdown, similar to what Crawl4AI does. First, install an HTML-to-Markdown library like markdownify:
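```bash
pip install markdownify
```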
Import markdownify in scraper.py:
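```python
from markdownify import markdownify
```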
Next, use markdownify to convert the extracted HTML into Markdown:
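```python
# convert the #main HTML to Markdown
main_markdown = markdownify(main_html)
```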
The resulting main_markdown string contains a Markdown version of the input data. It is a lot smaller than the original #main HTML while still containing all the key data needed for scraping.
Use Token Calculator again to verify how many tokens the new input would consume.
Wow, we reduced 19,858 tokens down to 765 tokens. That is a reduction of over 96%!
Step #5: Use the LLM to Extract Data
To perform web scraping with Gemini, follow these steps:
- Write a well-structured prompt to extract the desired data from the Markdown input. Make sure to define the attributes you want the result to have.
- Send a request to a Gemini LLM model using genai, configuring it so that the request returns JSON-formatted data.
- Parse the returned JSON.
Implement the above logic with these lines of code:
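Here is a sketch of that logic (the prompt wording below is just an example; adapt it to your needs):

```python
# prompt asking Gemini to extract structured product data from the Markdown content
prompt = f"""
Extract the following data from the content below and return it as JSON:
sku, name, images, price, description, sizes, colors, category.

CONTENT:
{main_markdown}
"""

# set up the Gemini model, configuring it to return JSON data
model = genai.GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config={"response_mime_type": "application/json"},
)

# perform the LLM request
llm_response = model.generate_content(prompt)

# parse the raw JSON response string into a Python dictionary
product_data = json.loads(llm_response.text)
```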
The prompt variable instructs Gemini to extract structured data from the main_markdown content. Then, genai.GenerativeModel() sets up the "gemini-2.0-flash-lite" model to perform the LLM request. Finally, the raw response string in JSON format is converted into a usable Python dictionary with json.loads().
Note the "application/json" configuration, which tells Gemini to return JSON data.
Do not forget to import json from the Python Standard Library:
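```python
import json
```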
Now that you have the scraped data in a product_data dictionary, you could access its fields for further data processing, as in the example below:
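Note that the exact key names depend on how your prompt defines them:

```python
# access some of the scraped fields (key names assume the prompt above)
print(product_data["name"])
print(product_data["price"])
```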
Fantastic! You just utilized Gemini for web scraping. It only remains to export the scraped data.
Step #6: Export the Scraped Data
Currently, you have the scraped data stored in a Python dictionary. To export it to a JSON file, use the following code:
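```python
# export the scraped data to a JSON file
with open("product.json", "w") as file:
    json.dump(product_data, file, indent=4)
```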
This will create a product.json file containing the scraped data in JSON format.
Congratulations! The Gemini-powered web scraper is complete.
Step #7: Put It All Together
Below is the complete code of your Gemini scraping script (remember to replace the placeholders with your Gemini API key and the URL of the target product page):
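```python
import google.generativeai as genai
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
import json

# configure the Gemini SDK with your API key
GEMINI_API_KEY = "<YOUR_GEMINI_API_KEY>"
genai.configure(api_key=GEMINI_API_KEY)

# retrieve the HTML of the target product page
url = "<TARGET_PRODUCT_PAGE_URL>"
response = requests.get(url)

# parse the HTML and select the #main element
soup = BeautifulSoup(response.content, "html.parser")
main_element = soup.select_one("#main")
main_html = str(main_element)

# convert the #main HTML to Markdown to reduce tokens
main_markdown = markdownify(main_html)

# prompt Gemini to extract structured product data as JSON
prompt = f"""
Extract the following data from the content below and return it as JSON:
sku, name, images, price, description, sizes, colors, category.

CONTENT:
{main_markdown}
"""
model = genai.GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config={"response_mime_type": "application/json"},
)
llm_response = model.generate_content(prompt)
product_data = json.loads(llm_response.text)

# export the scraped data to a JSON file
with open("product.json", "w") as file:
    json.dump(product_data, file, indent=4)
```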
Launch the script with:
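```bash
python scraper.py
```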
Once executed, a product.json file will appear in your project folder. Open it, and you will see the scraped product data in structured JSON format.
Et voilà! You started from unstructured data in an HTML page and you now have it in a structured JSON file, thanks to Gemini-powered web scraping.
Next Steps
To take your Gemini-powered scraper to the next level, consider these improvements:
- Make it reusable: Modify the script to accept the prompt and target URL as command-line arguments. That will make it general-purpose and adaptable for different usage scenarios.
- Implement web crawling: Extend the scraper to handle multi-page websites by adding logic for crawling and pagination.
- Secure API credentials: Store your Gemini API key in a .env file and use python-dotenv to load it, as sketched below. This prevents exposing your API key in the code.
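For example, the last improvement could look like this (assuming your .env file defines a GEMINI_API_KEY entry):

```python
# pip install python-dotenv
import os
from dotenv import load_dotenv
import google.generativeai as genai

# load the environment variables defined in the .env file
load_dotenv()

# read the Gemini API key and configure the SDK with it
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
```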
Overcoming the Main Limitation of This Web Scraping Approach
What is the biggest limitation of this approach to web scraping? The HTTP request made by requests!
Sure, in the example above, it worked perfectly—but that is because the target site is just a web scraping playground. In reality, companies and website owners know how valuable their data is, even when it is publicly accessible. To protect it, they implement anti-scraping measures that can easily block your automated HTTP requests.
Also, the approach above will not work on dynamic sites that rely on JavaScript for rendering or fetching data asynchronously. Thus, sites do not even need advanced anti-scraping frameworks to stop your scraper. Using JavaScript-based content loading is enough.
The solution to all those issues? A Web Unlocker API!
A Web Unlocker API is an HTTP endpoint that you can call from any HTTP client. The key difference? It returns the fully unlocked HTML of any URL you pass to it—bypassing any anti-scraping block. No matter how many protections a target site has, a simple request to Web Unlocker will fetch the page’s HTML for you.
To get started with that tool and retrieve your API key, follow the official Web Unlocker documentation. Then, replace your existing request code from “Step #3” with these lines:
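Here is a sketch based on the Web Unlocker API request format (double-check the official documentation for the exact endpoint and parameters for your setup):

```python
# call the Web Unlocker API to retrieve the unlocked HTML of the target page
response = requests.post(
    "https://api.brightdata.com/request",
    headers={"Authorization": "Bearer <YOUR_BRIGHT_DATA_API_KEY>"},
    json={
        "zone": "<YOUR_WEB_UNLOCKER_ZONE>",
        "url": "<TARGET_PRODUCT_PAGE_URL>",
        # return the raw HTML of the page
        "format": "raw",
    },
)
```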
And just like that—no more blocks, no more limitations! You can now scrape the Web using Gemini without worrying about getting stopped.
Conclusion
In this blog post, you learned how to use Gemini in combination with Requests and other tools to build an AI-powered scraper. One of the major challenges in web scraping is the risk of being blocked, but this was solved using Bright Data’s Web Unlocker API.
As explained here, by combining Gemini and the Web Unlocker API, you can extract data from any site without needing custom parsing logic. This is just one of many scenarios that Bright Data’s products and services support, helping you implement effective AI-driven web scraping.
Explore our other web scraping tools:
- Proxy Services: Four different types of proxies to bypass location restrictions, including 72 million+ residential IPs.
- Web Scraper APIs: Dedicated endpoints for extracting fresh, structured web data from over 100 popular domains.
- SERP API: An API that handles all the ongoing unlocking management for search engine results pages and extracts SERP data.
- Scraping Browser: A Puppeteer-, Selenium-, and Playwright-compatible browser with built-in unlocking capabilities.
Sign up now to Bright Data and test our proxy services and scraping products for free!
No credit card required