Web Scraping With Gemini in 2025: Complete Tutorial

Discover how to leverage Gemini AI for web scraping in Python, automate data extraction, and overcome scraping challenges with this guide.

In this guide, you will learn:

  • Why Gemini is a great solution for AI-powered web scraping
  • How to use it to scrape a site in Python through a guided tutorial
  • The biggest limitation of this way of scraping the Web and how to overcome it

Let’s dive in!

Why Use Gemini for Web Scraping?

Gemini is a family of multimodal AI models developed by Google that can analyze and interpret text, images, audio, videos, and code. Using Gemini for web scraping simplifies data extraction by automating the interpretation and structuring of unstructured content. That eliminates the need for manual effort—especially when it comes to data parsing.

In detail, these are some of the most common use cases for Gemini in web scraping:

  • Pages that frequently change structure: Gemini can handle dynamic pages where the layout or data elements change often, such as in e-commerce sites like Amazon.
  • Pages with a lot of unstructured data: It excels at extracting useful information from large volumes of unorganized text.
  • Pages where writing custom parsing logic is difficult: For pages with complex or unpredictable structures, Gemini can automate the process without requiring intricate parsing rules.

Common usage scenarios for Gemini in web scraping include:

  • RAG (Retrieval-Augmented Generation): Combining real-time data scraping to enhance AI insights. For a complete example using a similar AI technology, follow our tutorial on how to create a RAG chatbot using SERP data.
  • Social media scraping: Collecting structured data from platforms with dynamic content.
  • Content aggregation: Gathering news, articles, or blog posts from multiple sources to create summaries or analytics.

For more information, refer to our guide on using AI for web scraping.

Web Scraping with Gemini in Python: Step-By-Step Guide

As the target site for this section, we will use a specific product page from the “Ecommerce Test Site to Learn Web Scraping” sandbox:

The target web page

This is a great example because e-commerce product pages often display different types of data and vary in structure from one product to another. That is what makes e-commerce web scraping so challenging, and it is where AI can help.

The goal of our Gemini-powered scraper is to leverage AI to extract product details from the page without writing manual parsing logic. The product data retrieved via AI will include:

  • SKU
  • Name
  • Images
  • Price
  • Description
  • Sizes
  • Colors
  • Category

Follow the steps below to learn how to perform web scraping with Gemini!

Step #1: Project Setup

Before getting started, verify that you have Python 3 installed on your computer. Otherwise, download it and follow the installation wizard.
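
You can quickly check which version is installed by running:

python --version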

Now, launch the following command to create a folder for your scraping project:

mkdir gemini-scraper

gemini-scraper represents the project folder of your Python Gemini-powered web scraper.

Navigate to it in the terminal, and initialize a virtual environment inside it:

cd gemini-scraper
python -m venv venv

Load the project folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are two great options.

Create a scraper.py file in the project folder. Your project should now contain this file structure:

The file structure of the Gemini-Powered Scraper

Currently, scraper.py is a blank Python script, but it will soon contain the desired LLM scraping logic.

In the IDE’s terminal, activate the virtual environment. In Linux or macOS, execute this command:

source ./venv/bin/activate

Equivalently, on Windows, run:

venv\Scripts\activate

Wonderful! You now have a Python environment for web scraping with Gemini.

Step #2: Configure Gemini

Gemini provides an API that you can call using any HTTP client—including requests. Still, it is best to connect through the official Google AI Python SDK for the Gemini API. To install it, run the following command in the activated virtual environment:

pip install google-generativeai

Then, import it in your scraper.py file:

import google.generativeai as genai

To make the SDK work, you need a Gemini API key. If you have not retrieved your API key yet, follow the official Google documentation. Specifically, log in to your Google account and join Google AI Studio. Navigate to the “Get API Key” page, and you will see the following modal:

Note the “Get API key” button

Click the “Get API key” button, and the following section will appear:

Note the “Create API key” button

Now, press “Create API key” to generate your Gemini API key:

Your new Gemini API key

Copy the key and store it in a safe place.

Note: The Gemini free tier is enough for this example. The paid tier is only necessary if you need higher rate limits or want to ensure that your prompts and responses are not used to improve Google products. For more details, refer to the Gemini billing page.

To use the Gemini API key in Python, you can either set it as an environment variable:

export GEMINI_API_KEY=<YOUR_GEMINI_API_KEY>

Or, alternatively, store it directly in your Python script as a constant:

GEMINI_API_KEY="<YOUR_GEMINI_API_KEY>"

And pass it to genai as a configuration, as follows:

genai.configure(api_key=GEMINI_API_KEY)

In this case, we will follow the second approach. However, keep in mind that both methods work, as google-generativeai automatically tries to read the API key from GEMINI_API_KEY if you do not pass it manually.
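
If you go with the environment variable approach instead, you can read the key explicitly with the standard library to keep the behavior obvious. Here is a minimal sketch:

import os
import google.generativeai as genai

# Read the API key from the GEMINI_API_KEY environment variable
genai.configure(api_key=os.environ.get("GEMINI_API_KEY"))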

Amazing! You can now use the Gemini SDK to make API requests to the LLM in Python.

Step #3: Get the HTML of the Target Page

To connect to the target server and retrieve the HTML of its web pages, we will use Requests—the most popular HTTP client in Python. In an activated virtual environment, install it with:

pip install requests

Then, import it in scraper.py:

import requests

Use it to send a GET request to the target page and retrieve its HTML document:

url = "https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/"
response = requests.get(url)

response.content will now hold the raw HTML of the page. Time to parse it and get ready to extract data from it!
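
Keep in mind that requests does not raise an exception for 4xx or 5xx status codes on its own. To avoid accidentally parsing an error page, you can fail fast with a standard defensive check:

# Raise an exception for 4xx/5xx responses instead of parsing an error page
response.raise_for_status()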

Step #4: Convert the HTML to Markdown

If you look at other AI scraping technologies like Crawl4AI, you will notice that they allow you to use CSS selectors to target HTML elements. These libraries then convert the HTML of the selected elements into Markdown text. Finally, they process that text with an LLM.

Ever wondered why? There are two key reasons for that behavior:

  1. To reduce the number of tokens sent to the AI, helping you save money (after all, not every LLM provider offers a free tier like Gemini does).
  2. To make AI processing faster, as less input data means lower computational costs and quicker responses.

For a complete walkthrough, see our guide on web scraping using Crawl4AI and DeepSeek.

Let’s replicate that logic and see if it actually makes sense. Start by opening the target page in an incognito window to get a fresh session. Then, right-click anywhere on the page and select the “Inspect” option.

Examine the page structure. You will see that all relevant data is contained within the HTML element identified by the CSS selector #main:

The #main HTML element contains all the data of interest

You could send the entire raw HTML to Gemini, but that would introduce a lot of unnecessary information (such as headers and footers). Instead, by passing only the #main content, you reduce noise and prevent AI hallucinations.

To select only #main, you need a Python HTML parsing tool, such as Beautiful Soup. So, install it with:

pip install beautifulsoup4

If you are unfamiliar with its syntax, check out our guide on Beautiful Soup web scraping.

Now, import it in scraper.py:

from bs4 import BeautifulSoup

Use Beautiful Soup to parse the raw HTML retrieved via Requests, select the #main element, and extract its HTML:

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Extract the #main element
main_element = soup.select_one("#main")

# Get its outer HTML
main_html = str(main_element)

If you print main_html, you will see something like this:

<main id="main" class="site-main" role="main" data-testid="main-content" data-content="main-area">
    <!-- omitted for brevity... -->
</main>
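
One caveat: select_one() returns None when nothing matches the selector, so it is wise to fail fast before converting the element to a string:

# Stop early if the page structure changed and #main is missing
if main_element is None:
    raise RuntimeError("#main element not found")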

Now, verify how many tokens this HTML would generate and estimate the cost if you were using Gemini’s paid tier. To do so, use a tool like Token Calculator:

The token usage and price prediction from using the raw #main HTML

As you can tell, this approach equates to nearly 20,000 tokens, costing around $0.25 per request for Gemini 1.5 Pro. On a large-scale scraping project, that can easily become a problem!
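
If you would rather measure token usage programmatically, the google-generativeai SDK exposes a count_tokens() helper. Note that it calls the Gemini API, so it requires the API key configured in Step #2:

# Count the tokens the raw #main HTML would consume
model = genai.GenerativeModel("gemini-2.0-flash-lite")
print(model.count_tokens(main_html).total_tokens)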

Now, try converting the extracted HTML into Markdown, similar to what Crawl4AI does. First, install an HTML-to-Markdown library like markdownify:

pip install markdownify

Import markdownify in scraper.py:

from markdownify import markdownify

Next, use markdownify to convert the extracted HTML into Markdown:

main_markdown = markdownify(main_html)

The resulting main_markdown string will contain something like this:

[![](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-416x516.jpg "wj08-gray_main.jpg")](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main.jpg)

[![](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alt1-416x516.jpg "wj08-gray_alt1.jpg")](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alt1.jpg)

[![](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alternate-416x516.jpg "wj08-gray_alternate.jpg")](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alternate.jpg)

[![](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_back-416x516.jpg "wj08-gray_back.jpg")](https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_back.jpg)

Adrienne Trek Jacket
====================

$57.00

This is a variable product called a Adrienne Trek Jacket

|  |  |
| --- | --- |
| Size | Choose an optionXSSMLXL |
| Color | Choose an optionGrayOrangePurple[Clear](#) |

Adrienne Trek Jacket quantity

Add to cart

SKU: WJ08
Category: [Erin Recommends|Clothing](https://www.scrapingcourse.com/ecommerce/product-category/clothing/women/tops-women/jacketsclothing-tops-women/promotions-jacketsclothing-tops-women/women-saleclothing-promotions-jacketsclothing-tops-women/collections-women-saleclothing-promotions-jacketsclothing-tops-women/erin-recommendsclothing-collections-women-saleclothing-promotions-jacketsclothing-tops-women/)

* [Description](#tab-description)
* [Additional information](#tab-additional_information)

Description
-----------

You’re ready for a cross-country jog or a coffee on the patio in the Adrienne Trek Jacket. Its style is unique with stand collar and drawstrings, and it fits like a jacket should.

* gray 1/4 zip pullover.
* Comfortable, relaxed fit.
* Front zip for venting.
* Spacious, kangaroo pockets.
* 27″ body length.
* 95% Organic Cotton / 5% Spandex.

Additional information
----------------------

|  |  |
| --- | --- |
| Size | XS, S, M, L, XL |
| Color | Gray, Orange, Purple |

This Markdown version of the input data is a lot smaller than the original #main HTML while containing all the key data needed for scraping.
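
You can check the size difference directly in Python, even before opening a token calculator:

print(f"HTML length: {len(main_html)} characters")
print(f"Markdown length: {len(main_markdown)} characters")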

Use Token Calculator again to verify how many tokens the new input would consume:

Wow, we went from 19,858 tokens down to just 765 tokens, a reduction of roughly 96%!

Step #5: Use the LLM to Extract Data

To perform web scraping with Gemini, follow these steps:

  1. Write a well-structured prompt to extract the desired data from the Markdown input. Make sure to define the attributes you want the result to have.
  2. Send a request to a Gemini LLM model using genai, configuring it so that the request will return JSON-formatted data.
  3. Parse the returned JSON.

Implement the above logic with these lines of code:

# Extract structured data using Gemini
prompt = f"""Extract data from the content below. Respond with a raw string in JSON format containing the scraped data in the specified attributes:\n\n
JSON ATTRIBUTES: \n
sku, name, images, price, description, sizes, colors, category

CONTENT:\n
{main_markdown}
"""
model = genai.GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config={"response_mime_type": "application/json"},
)
response = model.generate_content(prompt)

# Get the response and parse it from JSON
product_raw_string = response.text
product_data = json.loads(product_raw_string)

The prompt variable instructs Gemini to extract structured data from the main_markdown content. Then, genai.GenerativeModel() sets up the "gemini-2.0-flash-lite" model to perform the LLM request. Finally, the raw response string in JSON format is converted into a usable Python dictionary with json.loads().

Note the "application/json" response_mime_type setting, which tells Gemini to return JSON data.
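
For stricter control over the output shape, recent SDK versions also accept a response_schema in the generation config, constraining the model to your exact fields. Below is a minimal sketch using a TypedDict, assuming your SDK and model versions support structured output:

import typing_extensions as typing

class Product(typing.TypedDict):
    sku: str
    name: str
    images: list[str]
    price: str
    description: str
    sizes: list[str]
    colors: list[str]
    category: str

model = genai.GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=Product,
    ),
)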

Do not forget to import json from the Python Standard Library:

import json

Now that you have the scraped data in a product_data dictionary, you could access its fields for further data processing, as in the example below:

USD_EUR = 0.92  # hypothetical exchange rate

# price is a string like "$57.00", so parse it before doing math
price = float(product_data["price"].replace("$", ""))
price_eur = price * USD_EUR
# ...

Fantastic! You just utilized Gemini for web scraping. All that remains is to export the scraped data.

Step #6: Export the Scraped Data

Currently, you have the scraped data stored in a Python dictionary. To export it to a JSON file, use the following code:

with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product_data, json_file, indent=4)

This will create a product.json file containing the scraped data in JSON format.
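
If you prefer a flat format like CSV, the standard library covers that as well. Here is a quick sketch that flattens the list fields first (the product.csv file name is an arbitrary choice):

import csv

# Join list fields (images, sizes, colors) so each fits in a single CSV cell
flat_data = {
    key: ", ".join(value) if isinstance(value, list) else value
    for key, value in product_data.items()
}

with open("product.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=flat_data.keys())
    writer.writeheader()
    writer.writerow(flat_data)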

Congratulations! The Gemini-powered web scraper is complete.

Step #7: Put It All Together

Below is the complete code of your Gemini scraping script:

import google.generativeai as genai
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
import json

# Your Gemini API key
GEMINI_API_KEY = "<YOUR_GEMINI_API_KEY>"

# Set up the Google Gemini API
genai.configure(api_key=GEMINI_API_KEY)

# Fetch the HTML content of the target page
url = "https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/"
response = requests.get(url)

# Parse the HTML of the target page with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Select the #main element
main_element = soup.select_one("#main")

# Get its outer HTML and convert it to Markdown
main_html = str(main_element)
main_markdown = markdownify(main_html)

# Extract structured data using Gemini
prompt = f"""Extract data from the content below. Respond with a raw string in JSON format containing the scraped data in the specified attributes:\n\n
JSON ATTRIBUTES: \n
sku, name, images, price, description, sizes, colors, category

CONTENT:\n
{main_markdown}
"""
model = genai.GenerativeModel(
    "gemini-2.0-flash-lite",
    generation_config={"response_mime_type": "application/json"},
)
response = model.generate_content(prompt)

# Get the response and parse it from JSON
product_raw_string = response.text
product_data = json.loads(product_raw_string)

# Further data processing... (optional)

# Export the scraped data to JSON
with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product_data, json_file, indent=4)

Launch the script with:

python scraper.py

Once executed, a product.json file will appear in your project folder. Open it, and you will see structured data like this:

{
    "sku": "WJ08",
    "name": "Adrienne Trek Jacket",
    "images": [
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_main-416x516.jpg",
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alt1-416x516.jpg",
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_alternate-416x516.jpg",
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/wj08-gray_back-416x516.jpg"
    ],
    "price": "$57.00",
    "description": "You\u2019re ready for a cross-country jog or a coffee on the patio in the Adrienne Trek Jacket. Its style is unique with stand collar and drawstrings, and it fits like a jacket should.\n\n\u2022 gray 1/4 zip pullover.  \n\u2022 Comfortable, relaxed fit.  \n\u2022 Front zip for venting.  \n\u2022 Spacious, kangaroo pockets.  \n\u2022 27\u2033 body length.  \n\u2022 95% Organic Cotton / 5% Spandex.",
    "sizes": [
        "XS",
        "S",
        "M",
        "L",
        "XL"
    ],
    "colors": [
        "Gray",
        "Orange",
        "Purple"
    ],
    "category": "Erin Recommends|Clothing"
}

Et voilà! You started with unstructured data in an HTML page, and you now have it in a structured JSON file, thanks to Gemini-powered web scraping.

Next Steps

To take your Gemini-powered scraper to the next level, consider these improvements:

  • Make it reusable: Modify the script to accept the prompt and target URL as command-line arguments, as shown in the sketch after this list. That will make it general-purpose and adaptable to different usage scenarios.
  • Implement web crawling: Extend the scraper to handle multi-page websites by adding logic for crawling and pagination.
  • Secure API credentials: Store your Gemini API key in a .env file and use python-dotenv to load it, as also shown below. This prevents exposing your API key in the code.
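
As a starting point for the first and third suggestions, here is a minimal sketch combining argparse with python-dotenv (installed via pip install python-dotenv). The flag names and defaults are just illustrative choices:

import argparse
import os

from dotenv import load_dotenv

# Load GEMINI_API_KEY from a local .env file instead of hardcoding it
load_dotenv()
GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]

# Accept the target URL and the attributes to extract from the command line
parser = argparse.ArgumentParser(description="Gemini-powered web scraper")
parser.add_argument("url", help="URL of the product page to scrape")
parser.add_argument(
    "--attributes",
    default="sku, name, images, price, description, sizes, colors, category",
    help="Comma-separated list of JSON attributes to extract",
)
args = parser.parse_args()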

Overcoming the Main Limitation of This Web Scraping Approach

What is the biggest limitation of this approach to web scraping? The HTTP request made by requests!

Sure, in the example above, it worked perfectly—but that is because the target site is just a web scraping playground. In reality, companies and website owners know how valuable their data is, even when it is publicly accessible. To protect it, they implement anti-scraping measures that can easily block your automated HTTP requests.

Also, the approach above will not work on dynamic sites that rely on JavaScript for rendering or fetching data asynchronously. Thus, sites do not even need advanced anti-scraping frameworks to stop your scraper. Using JavaScript-based content loading is enough.

The solution to all those issues? A Web Unlocking API!

A Web Unlocker API is an HTTP endpoint that you can call from any HTTP client. The key difference? It returns the fully unlocked HTML of any URL you pass to it—bypassing any anti-scraping block. No matter how many protections a target site has, a simple request to Web Unlocker will fetch the page’s HTML for you.

To get started with that tool and retrieve your API key, follow the official Web Unlocker documentation. Then, replace your existing request code from “Step #3” with these lines:

WEB_UNLOCKER_API_KEY = "<YOUR_WEB_UNLOCKER_API_KEY>"

# Set up authentication headers for Web Unlocker
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {WEB_UNLOCKER_API_KEY}"
}

# Define the request payload
payload = {
    "zone": "unblocker",
    "url": "https://www.scrapingcourse.com/ecommerce/product/adrienne-trek-jacket/",  # Replace with your target URL
    "format": "raw"
}

# Fetch the unlocked HTML of the target page
response = requests.post("https://api.brightdata.com/request", json=payload, headers=headers)

And just like that—no more blocks, no more limitations! You can now scrape the Web using Gemini without worrying about getting stopped.

Conclusion

In this blog post, you learned how to use Gemini in combination with Requests and other tools to build an AI-powered scraper. One of the major challenges in web scraping is the risk of being blocked, but this was solved using Bright Data’s Web Unlocker API.

As explained here, by combining Gemini and the Web Unlocker API, you can extract data from any site without needing custom parsing logic. This is just one of many scenarios that Bright Data’s products and services support, helping you implement effective AI-driven web scraping.

Explore our other web scraping tools:

  • Proxy Services: Four different types of proxies to bypass location restrictions, including 72 million+ residential IPs.
  • Web Scraper APIs: Dedicated endpoints for extracting fresh, structured web data from over 100 popular domains.
  • SERP API: A dedicated endpoint that handles all the unlocking management for search engine result pages and extracts structured SERP data.
  • Scraping Browser: A Puppeteer-, Selenium-, and Playwright-compatible browser with built-in unlocking capabilities.

Sign up now to Bright Data and test our proxy services and scraping products for free!
