How to Build an AI Scraper With Crawl4AI and DeepSeek

Learn to build an AI-powered web scraper with Crawl4AI and DeepSeek through our detailed, step-by-step tutorial.
22 min read
Web Scraping with Crawl4AI, DeepSeek, and Web Unlocker blog image

In this tutorial, you will learn:

  • What Crawl4AI is and what it offers for web scraping
  • The ideal scenarios for using Crawl4AI with an LLM like DeepSeek
  • How to build a DeepSeek-powered Crawl4AI scraper in a guided section

Let’s dive in!

What Is Crawl4AI?

Crawl4AI is an open-source, AI-ready web crawler and scraper designed for seamless integration with large language models (LLMs), AI agents, and data pipelines. It delivers high-speed, real-time data extraction while being flexible and easy to deploy.

The features it offers for AI web scraping are:

  • Built for LLMs: Generates structured Markdown optimized for retrieval-augmented generation (RAG) and fine-tuning.
  • Flexible browser control: Supports session management, proxies, and custom hooks.
  • Heuristic intelligence: Uses smart algorithms to optimize data parsing.
  • Fully open source: No API keys required; deployable via Docker and cloud platforms.

Discover more on the official documentation.

When To Use Crawl4AI and DeepSeek for Web Scraping

DeepSeek offers powerful, open-source, free LLM models that have made waves in the AI community due to their efficiency and effectiveness. Plus, these models integrate smoothly with Crawl4AI.

By leveraging DeepSeek in Crawl4AI, you can extract structured data from even the most complex and inconsistent web pages. All that without the need for predefined parsing logic.

Below are key scenarios where the DeepSeek + Crawl4AI combination is especially useful:

  • Frequent site structure changes: Traditional scrapers break when websites update their HTML structure, but AI dynamically adapts.
  • Inconsistent page layouts: Platforms like Amazon have varying product page designs. An LLM can intelligently extract data regardless of layout differences.
  • Unstructured content parsing: Extracting insights from free-text reviews, blog posts, or forum discussions becomes easy with LLM-powered processing.

Web Scraping With Crawl4AI and DeepSeek: Step-By-Step Guide

In this guided tutorial, you will learn how to build an AI-powered web scraper using Crawl4AI. As the LLM engine, we will use DeepSeek.

Specifically, you will see how to create an AI scraper to extract data from the G2 page for Bright Data:

The G2 target page

Follow the steps below and learn how to perform web scraping with Crawl4AI and DeepSeek!

Prerequisites

To follow this tutorial, ensure you meet the following prerequisites:

  • Python 3 installed on your machine
  • A GroqCloud account
  • A Bright Data account

Do not worry if you do not have a GroqCloud or Bright Data account yet. You will be guided through their setup in the next steps.

Step #1: Project Setup

Run the following command to create a folder for your Crawl4AI DeepSeek scraping project:

mkdir crawl4ai-deepseek-scraper

Navigate into the project folder and create a virtual environment:

cd crawl4ai-deepseek-scraper
python -m venv venv

Now, load the crawl4ai-deepseek-scraper folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are two great options.

Inside the project folder, create:

  • scraper.py: The file that will contain the AI-powered scraping logic.
  • models/: A directory to store Pydantic-based Crawl4AI LLM data models.
  • .env: A file to store environment variables securely.

After creating these files and folders, your project structure should look like this:

The file structure of the Crawl4AI DeepSeek scraper

Next, activate the virtual environment in your IDE’s terminal.

On Linux or macOS, run this command:

source ./venv/bin/activate

Equivalently, on Windows, execute:

venv/Scripts/activate

Great! You now have a Python environment for Crawl4AI web scraping with DeepSeek.

Step #2: Install Crawl4AI

With your virtual environment activated, install Crawl4AI via the crawl4ai pip package:

pip install crawl4ai

Note that the library has several dependencies, so the installation might take a while.

Once installed, run the following command in your terminal:

crawl4ai-setup

This command:

  1. Installs or updates the required Playwright browsers (Chromium, Firefox, etc.).
  2. Performs OS-level checks (e.g., ensuring required system libraries are installed on Linux).
  3. Confirms your environment is properly set up for web crawling.

After running the command, you should see an output similar to this:

[INIT].... → Running post-installation setup...
[INIT].... → Installing Playwright browsers...
[COMPLETE] ● Playwright installation completed successfully.
[INIT].... → Starting database initialization...
[COMPLETE] ● Database backup created at: C:\Users\antoz\.crawl4ai\crawl4ai.db.backup_20250219_092341
[INIT].... → Starting database migration...
[COMPLETE] ● Migration completed. 0 records processed.
[COMPLETE] ● Database initialization completed successfully.
[COMPLETE] ● Post-installation setup completed!

Amazing! Crawl4AI is now installed and ready to use.

Step #3: Initialize scraper.py

Since Crawl4AI requires asynchronous code, start by creating a basic asyncio script:

import asyncio

async def main():
    # Scraping logic...
    pass

if __name__ == "__main__":
    asyncio.run(main())

Now, remember that the project involves integrations with third-party services like DeepSeek. To implement that, you will need to rely on API keys and other secrets. We will store them in a .env file.

Install python-dotenv to load environment variables:

pip install python-dotenv

Before defining main(), load the environment variables from the .env file with load_dotenv():

load_dotenv()

Import load_dotenv from the python-dotenv library:

from dotenv import load_dotenv

Perfect! scraper.py is ready to host some AI-powered scraping logic.

Step #4: Create Your First AI Scraper

Inside the main() function in scraper.py, add the following logic using a basic Crawl4AI crawler:

# Browser configuration
browser_config = BrowserConfig(
    headless=True
)

# Crawler configuration
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
)

# Run the AI-powered crawler
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(
        url="https://www.g2.com/products/bright-data/reviews",
        config=crawler_config
    )

    # print the first 1000 characters
    print(f"Parsed Markdown data:\n{result.markdown[:1000]}")

In the above snippet, the key points are:

  • BrowserConfig: Controls how the browser is launched and behaves, including settings like headless mode and custom user agents for web scraping.
  • CrawlerRunConfig: Defines the crawling behavior, such as caching strategy, data selection rules, timeouts, and more.
  • headless=True: Configures the browser to run in headless mode—without the GUI—to save resources.
  • CacheMode.BYPASS: This configuration guarantees that the crawler fetches fresh content directly from the website instead of relying on cached data.
  • crawler.arun(): This method launches the asynchronous crawler to extract data from the specified URL.
  • result.markdown: The extracted content is converted into Markdown format, making it easier to parse and analyze.

Do not forget to add the following imports:

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

Right now, scraper.py should contain:

import asyncio
from dotenv import load_dotenv
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

# Load secrets from .env file
load_dotenv()

async def main():
    # Browser configuration
    browser_config = BrowserConfig(
      headless=True
    )

    # Crawler configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    # Run the AI-powered crawler
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.g2.com/products/bright-data/reviews",
            config=crawler_config
        )

        # print the first 1000 characters
        print(f"Parsed Markdown data:\n{result.markdown[:1000]}")

if __name__ == "__main__":
    asyncio.run(main())

If you execute the script, you should see output similar to the following:

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://www.g2.com/products/bright-data/reviews... | Status: True | Time: 0.83s
[SCRAPE].. ◆ Processed https://www.g2.com/products/bright-data/reviews... | Time: 1ms
[COMPLETE] ● https://www.g2.com/products/bright-data/reviews... | Status: True | Total: 0.83s
Parsed Markdown data:

That is suspicious, as the parsed Markdown content is empty. To investigate further, print the response status:

print(f"Response status code: {result.status_code}")

This time, the output will include:

Response status code: 403

The Markdown-parsed result is empty because the Crawl4AI request was blocked by G2’s bot detection systems. That is made clear by the 403 Forbidden status code returned by the server.

That should not be surprising, as G2 has strict anti-bot measures in place. In particular, it often displays CAPTCHAs—even when accessed through a regular browser:

The G2 CAPTCHA page

In this case, since no valid content was received, Crawl4AI could not convert it to Markdown. In the next step, we will explore how to bypass this restriction. For further reading, take a look at our guide on how to bypass CAPTCHAs in Python.

Step #5: Configure Web Unlocker API

Crawl4AI is a powerful tool with built-in bot bypassing mechanisms. However, it cannot bypass heavily protected websites like G2, which employ strict, state-of-the-art anti-bot and anti-scraping measures.

Against such sites, the best solution is to use a dedicated tool designed to unblock any webpage, regardless of its protection level. The ideal scraping product for this task is Bright Data’s Web Unlocker, a scraping API that:

  • Simulates real user behavior to bypass anti-bot detection
  • Handles proxy management and CAPTCHA solving automatically
  • Scales seamlessly without requiring infrastructure management

Follow the next instructions to integrate Web Unlocker API into your Crawl4AI DeepSeek scraper.
Alternatively, take a look at the official documentation.

First, log in to your Bright Data account or create one if you have not already. Fund your account or take advantage of the free trial available for all products.

Next, navigate to “Proxies & Scraping” in the dashboard and select the “unblocker” option in the table:

Selecting the "unblocker" option in the Bright Data dashboard

This will take you to the Web Unlocker API setup page shown below:

The Web Unlocker API setup page

Here, enable Web Unlocker API by clicking on the toggle:

The toggle is now set to "On"

G2 is protected by advanced anti-bot defenses, including CAPTCHAs. Thus, verify that the following two toggles are enabled on the “Configuration” page:

Enabling the "Premium domains" and "CAPTCHA Solver" options

Crawl4AI operates by navigating pages in a controlled browser. Under the hood, it relies on Playwright’s goto() function, which sends an HTTP GET request to the target webpage. In contrast, Web Unlocker API works through POST requests.

That is not a problem, as you can still use Web Unlocker API with Crawl4AI by configuring it as a proxy. This allows Crawl4AI’s browser to send requests through Bright Data’s product, receiving unblocked HTML pages in return.

To access your Web Unlocker API proxy credentials, open the “Native proxy-based access” tab on the “Overview” page:

native proxy-based access on the overview page

Copy the following credentials from the page:

  • <HOST>
  • <PORT>
  • <USERNAME>
  • <PASSWORD>

Then, use them to populate your .env file with these environment variables:

PROXY_SERVER=https://<HOST>:<PORT>
PROXY_USERNAME=<USERNAME>
PROXY_PASSWORD=<PASSWORD>
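
Optionally, before wiring these credentials into Crawl4AI, you can run a quick standalone sanity check with the requests library. The snippet below is a rough, illustrative sketch (not part of the final scraper): it assumes requests is installed and disables SSL verification because Web Unlocker re-signs HTTPS traffic with its own certificate:

import os

import requests
from dotenv import load_dotenv

load_dotenv()

# Build an "http://<USERNAME>:<PASSWORD>@<HOST>:<PORT>" proxy URL from the .env values
host_and_port = os.getenv("PROXY_SERVER").removeprefix("https://")
proxy_url = f"http://{os.getenv('PROXY_USERNAME')}:{os.getenv('PROXY_PASSWORD')}@{host_and_port}"

response = requests.get(
    "https://www.g2.com/products/bright-data/reviews",
    proxies={"http": proxy_url, "https": proxy_url},
    verify=False,  # Web Unlocker serves its own SSL certificate
    timeout=180,
)
print(response.status_code)  # 200 means the proxy unblocked the page

A 200 status code confirms the credentials are correct and that the proxy can unblock the target page.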

Fantastic! Web Unlocker is now ready for integration with Crawl4AI.

Step #6: Integrate Web Unlocker API

BrowserConfig supports proxy integration through the proxy_config object. To integrate Web Unlocker API with Crawl4AI, populate that object with the environment variables from your .env file and pass it to the BrowserConfig constructor:

# Bright Data's Web Unlocker API proxy configuration
proxy_config = {
    "server": os.getenv("PROXY_SERVER"),
    "username": os.getenv("PROXY_USERNAME"),
    "password": os.getenv("PROXY_PASSWORD")
}

# Browser configuration
browser_config = BrowserConfig(
    headless=True,
    proxy_config=proxy_config,
)

Remember to import os from the Python Standard Library:

import os

Keep in mind that Web Unlocker API introduces some time overhead due to IP rotation via the proxy and possible CAPTCHA solving. To account for that, you should:

  1. Increase the page load timeout to 3 minutes
  2. Instruct the crawler to wait for the DOM to be fully loaded before parsing it

Achieve that with the following CrawlerRunConfig configuration:

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    wait_until="domcontentloaded", # wait until the DOM of the page has been loaded
    page_timeout=180000, # wait up to 3 mins for page load
)

Note that even Web Unlocker API is not flawless when dealing with complex sites like G2. On rare occasions, the scraping API may fail to retrieve the unblocked page, causing the script to terminate with the following error:

Error: Failed on navigating ACS-GOTO:
Page.goto: net::ERR_HTTP_RESPONSE_CODE_FAILURE at https://www.g2.com/products/bright-data/reviews

Rest assured, you are only charged for successful requests, so there is no need to worry about relaunching the script until it works. In a production script, consider implementing automatic retry logic, as sketched below.
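
As a reference, here is a minimal sketch of what such retry logic might look like. The helper name, number of attempts, and delay are illustrative choices rather than part of the original tutorial:

import asyncio

async def crawl_with_retries(crawler, url, config, max_retries=3, delay_s=5):
    # Retry the crawl a few times, since Web Unlocker may occasionally fail
    # (max_retries and delay_s are arbitrary illustrative values)
    for attempt in range(1, max_retries + 1):
        try:
            result = await crawler.arun(url=url, config=config)
            if result.success:
                return result
            print(f"Attempt {attempt} returned status {result.status_code}, retrying...")
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
        await asyncio.sleep(delay_s)
    raise RuntimeError(f"Could not crawl {url} after {max_retries} attempts")

You would then call crawl_with_retries(crawler, url, crawler_config) inside the async with block instead of calling crawler.arun() directly.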

When the request is successful, you will receive an output like this:

Response status code: 200
Parsed Markdown data:
  * [Home](https://www.g2.com/products/bright-data/</>)
  * [Write a Review](https://www.g2.com/products/bright-data/</wizard/new-review>)
  * Browse
  * [Top Categories](https://www.g2.com/products/bright-data/<#>)
Top Categories
    * [AI Chatbots Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/ai-chatbots>)
    * [CRM Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/crm>)
    * [Project Management Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/project-management>)
    * [Expense Management Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/expense-management>)
    * [Video Conferencing Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/video-conferencing>)
    * [Online Backup Software](https://www.g2.com/products/bright-data/<https:/www.g2.com/categories/online-backup>)
    * [E-Commerce Platforms](https://www.g2.com/products/brig

Terrific! This time, G2 responded with a 200 OK status code. That means the request was not blocked, and Crawl4AI was able to successfully parse the HTML into Markdown as intended.

Step #7: Groq Setup

GroqCloud is one of the few providers that supports DeepSeek AI models via OpenAI-compatible APIs—even on a free plan. So, it will be the platform used for the LLM integration in Crawl4AI.

If you do not already have a Groq account, create one. Otherwise, just log in. In your user dashboard, navigate to “API Keys” in the left menu and click the “Create API Key” button:

The "Create API Key" button

A popup will appear:

The "Create API Key" popup

Give your API key a name (e.g., “Crawl4AI Scraping”) and wait for the anti-bot verification by Cloudflare. Then, click “Submit” to generate your API key:

Your Groq API key

Copy the API key and add it to your .env file as below:

LLM_API_TOKEN=<YOUR_GROQ_API_KEY>

Replace <YOUR_GROQ_API_KEY> with the actual API key provided by Groq.
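
If you want to confirm the key works before moving on, you can send a minimal test request to Groq’s OpenAI-compatible endpoint. This is an optional, illustrative check (it assumes the openai package is installed with pip install openai) and is not part of the final scraper:

import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Groq exposes an OpenAI-compatible API, so the standard OpenAI client works
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.getenv("LLM_API_TOKEN"),
)

completion = client.chat.completions.create(
    model="deepseek-r1-distill-llama-70b",  # the same model used later in the tutorial
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(completion.choices[0].message.content)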

Beautiful! You are ready to use DeepSeek for LLM scraping with Crawl4AI.

Step #8: Define a Schema for Your Scraped Data

Crawl4AI performs LLM scraping following a schema-based approach. In this context, a schema is a JSON data structure that defines:

  1. A base selector that identifies the “container” element on the page (e.g., a product row, a blog post card).
  2. Fields specifying the CSS/XPath selectors to capture each piece of data (e.g., text, attribute, HTML block).
  3. Nested or list types for repeated or hierarchical structures.

To define the schema, you must first identify the data you want to extract from the target page. To do that, open the target page in incognito mode in your browser:

The target page on G2

In this case, assume you are interested in the following fields:

  • name: The name of the product/company.
  • image_url: The URL of the product/company image.
  • description: A brief description of the product/company.
  • review_score: The average review score of the product/company.
  • number_of_reviews: The total number of reviews.
  • claimed: A boolean indicating if the company profile is claimed by the owner.

Now, in the models folder, create a g2_product.py file and populate it with a Pydantic-based schema class called G2Product as follows:

# ./models/g2_product.py

from pydantic import BaseModel

class G2Product(BaseModel):
    """
    Represents the data structure of a G2 product/company page.
    """

    name: str
    image_url: str
    description: str
    review_score: str
    number_of_reviews: str
    claimed: bool

Yes! The LLM scraping process performed by DeepSeek will return objects following the above schema.

Step #9: Prepare to Integrate DeepSeek

Before completing the integration of DeepSeek with Crawl4AI, review the “Settings > Limits” page in your GroqCloud account:

The limitations on DeepSeek models from GroqCloud

There, you can see that the two available DeepSeek models have the following limitations on the free plan:

  1. Up to 30 requests per minute
  2. Up to 1,000 requests per day
  3. No more than 6,000 tokens per minute

While the first two restrictions are not a problem for this example, the last one presents a challenge. A typical web page can contain millions of characters, translating to hundreds of thousands of tokens.
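
For a rough sense of scale, consider this back-of-the-envelope estimate. The ~4 characters-per-token ratio and the page size are assumptions used only for illustration:

# Why a full HTML page blows past Groq's 6,000 tokens-per-minute free-tier limit
page_chars = 1_000_000              # a large, script-heavy HTML page can easily reach this size
chars_per_token = 4                 # common rule of thumb, not an exact tokenizer count
estimated_tokens = page_chars // chars_per_token
print(estimated_tokens)             # 250000: over 40x the 6,000 TPM limit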

In other words, you cannot feed the entire G2 page directly into DeepSeek models via Groq due to token limits. To tackle the issue, Crawl4AI allows you to select only specific sections of the page. Those sections (and not the entire page) will be converted to Markdown and passed to the LLM. The section selection process relies on CSS selectors.

To determine the sections to select, open the target page in your browser. Right-click on the elements containing the data of interest and select the “Inspect” option:

The G2 product/company data header

Here, you can see that the .product-head__title element contains the product/company name, review score, number of reviews, and claimed status.

Now, inspect the logo section:

The G2 product and company logo section

You can retrieve that information using the .product-head__logo CSS selector.

Finally, inspect the description section:

The G2 product/company description element

The description is available using the [itemprop="description"] selector.

Configure these CSS selectors in CrawlerRunConfig as follows:

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    wait_until="domcontentloaded",
    page_timeout=180000,
    css_selector=".product-head__title, .product-head__logo, [itemprop=\"description\"]", # the CSS selectors of the elements to extract data from
)

If you execute scraper.py again, you will now get something like:

Response status code: 200
Parsed Markdown data:
[![Bright Data Reviews](https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_9d7645872b9abb68923fb7e2c7c9d834/bright-data.png)![G2 recognized Bright Data](https://images.g2crowd.com/uploads/report_medal_translation/image/3436/medal.svg)](https:/www.g2.com/products/bright-data/reviews)
[Editedit](https:/my.g2.com/bright-data/product_information)
[Bright Data](https:/www.g2.com/products/bright-data/reviews)
By [bright data](https:/www.g2.com/sellers/bright-data)
Show rating breakdown
4.7 out of 5 stars
[5 star78%](https:/www.g2.com/products/bright-data/reviews?filters%5Bnps_score%5D%5B%5D=5#reviews)
[4 star19%](https:/www.g2.c

The output now includes only the relevant sections instead of the entire HTML page. This approach significantly reduces token usage, allowing you to stay within Groq’s free-tier limits while effectively extracting the data of interest!

Step #10: Define the DeepSeek-Based LLM Extraction Strategy

Crawl4AI supports LLM-based data extraction through the LLMExtractionStrategy object. You can define one for DeepSeek integration as follows:

extraction_strategy = LLMExtractionStrategy(
    provider=os.getenv("LLM_MODEL"),
    api_token=os.getenv("LLM_API_TOKEN"),
    schema=G2Product.model_json_schema(),
    extraction_type="schema",
    instruction=(
        "Extract the 'name', 'description', 'image_url', 'review_score', and 'number_of_reviews' "
        "from the content below. "
        "'review_score' must be in \"x/5\" format. Get the entire description, not just the first few sentences."
    ),
    input_format="markdown",
    verbose=True
)

To specify the LLM model, add the following environment variable to .env:

LLM_MODEL=groq/deepseek-r1-distill-llama-70b

This tells Crawl4AI to use the deepseek-r1-distill-llama-70b model from GroqCloud for LLM-based data extraction.

In scraper.py, import LLMExtractionStrategy and G2Product:

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from models.g2_product import G2Product

Then, pass the extraction_strategy object to crawler_config:

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    wait_until="domcontentloaded",
    page_timeout=180000, # 3 mins
    css_selector=".product-head__title, .product-head__logo, [itemprop=\"description\"]",
    extraction_strategy=extraction_strategy
)

When you run the script, Crawl4AI will:

  1. Connect to the target web page via the Web Unlocker API proxy.
  2. Retrieve the HTML content of the page and filter elements using the specified CSS selectors.
  3. Convert the selected HTML elements to Markdown format.
  4. Send the formatted Markdown to DeepSeek for data extraction.
  5. Tell DeepSeek to process the input according to the provided prompt (instruction) and return the extracted data.

After running crawler.arun(), you can check token usage with:

print(extraction_strategy.show_usage())

Then, you can access and print the extracted data with:

result_raw_data = result.extracted_content
print(result_raw_data)

If you execute the script and print the results, you should see an output like this:

=== Token Usage Summary ===
Type                   Count
------------------------------
Completion               525
Prompt                 2,002
Total                  2,527

=== Usage History ===
Request #    Completion       Prompt        Total
------------------------------------------------
1                   525        2,002        2,527
None
[
    {
        "name": "Bright Data",
        "image_url": "https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_9d7645872b9abb68923fb7e2c07c9d834/bright-data.png",
        "description": "Bright Data is the world's #1 web data, proxies, & data scraping solutions platform. Fortune 500 companies, academic institutions and small businesses all rely on Bright Data's products, network and solutions to retrieve crucial public web data in the most efficient, reliable and flexible manner, so they can research, monitor, analyze data and make better informed decisions. Bright Data is used worldwide by 20,000+ customers in nearly every industry. Its products range from no-code data solutions utilized by business owners, to a robust proxy and scraping infrastructure used by developers and IT professionals. Bright Data products stand out because they provide a cost-effective way to perform fast and stable public web data collection at scale, effortless conversion of unstructured data into structured data and superior customer experience, while being fully transparent and compliant.",
        "review_score": "4.7/5",
        "number_of_reviews": "221",
        "claimed": true
    }
]

The first part of the output (the token usage summary) comes from show_usage(), confirming we are well below the 6,000-token limit. The stray None line appears because show_usage() prints the summary itself and returns None, which print() then outputs. The data that follows is a JSON string matching the G2Product schema.

Simply incredible!

Step #11: Handle the Result Data

As you can see from the output in the previous step, DeepSeek typically returns an array instead of a single object. To handle that, parse the returned data as JSON and extract the first element from the array:

# Parse the extracted data from JSON
result_data = json.loads(result.extracted_content)

# If the returned data is an array, access its first element
if result_data:
    result_data = result_data[0]

Remember to import json from the Python Standard Library:

import json

At this point, result_data should be a dictionary matching the G2Product schema. The final step is to export this data to a JSON file.
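
If you want to enforce that guarantee, you can optionally validate the dictionary against the Pydantic model. This is a small, optional addition rather than part of the original flow; model_validate() raises a ValidationError if a field is missing or has the wrong type:

from models.g2_product import G2Product

# Optional: turn the raw dictionary into a validated G2Product instance
product = G2Product.model_validate(result_data)
print(product.name, product.review_score)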

Step #12: Export the Scraped Data to JSON

Use json to export result_data to a g2.json file:

with open("g2.json", "w", encoding="utf-8") as f:
    json.dump(result_data, f, indent=4)

Mission complete!

Step #13: Put It All Together

Your final scraper.py file should contain:

import asyncio
from dotenv import load_dotenv
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from models.g2_product import G2Product
import json

# Load secrets from .env file
load_dotenv()

async def main():
    # Bright Data's Web Unlocker API proxy configuration
    proxy_config = {
        "server": os.getenv("PROXY_SERVER"),
        "username": os.getenv("PROXY_USERNAME"),
        "password": os.getenv("PROXY_PASSWORD")
    }

    # Browser configuration
    browser_config = BrowserConfig(
        headless=True,
        proxy_config=proxy_config,
    )

    # LLM extraction strategy for data extraction using DeepSeek
    extraction_strategy = LLMExtractionStrategy(
        provider=os.getenv("LLM_MODEL"),
        api_token=os.getenv("LLM_API_TOKEN"),
        schema=G2Product.model_json_schema(),
        extraction_type="schema",
        instruction=(
            "Extract the 'name', 'description', 'image_url', 'review_score', and 'number_of_reviews' "
            "from the content below. "
            "'review_score' must be in \"x/5\" format. Get the entire description, not just the first few sentences."
        ),
        input_format="markdown",
        verbose=True
    )

    # Crawler configuration
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        wait_until="domcontentloaded",
        page_timeout=180000, # 3 mins
        css_selector=".product-head__title, .product-head__logo, [itemprop=\"description\"]",
        extraction_strategy=extraction_strategy
    )

    # Run the AI-powered crawler
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.g2.com/products/bright-data/reviews",
            config=crawler_config
        )

        # Log the AI model usage info
        print(extraction_strategy.show_usage())

        # Parse the extracted data from JSON
        result_data = json.loads(result.extracted_content)

        # If the returned data is an array, access its first element
        if result_data:
            result_data = result_data[0]

    # Export the scraped data to JSON
    with open("g2.json", "w", encoding="utf-8") as f:
        json.dump(result_data, f, indent=4)

if __name__ == "__main__":
    asyncio.run(main())

Then, models/g2_product.py will contain:

from pydantic import BaseModel

class G2Product(BaseModel):
    """
    Represents the data structure of a G2 product/company page.
    """

    name: str
    image_url: str
    description: str
    review_score: str
    number_of_reviews: str
    claimed: bool

And .env will have:

PROXY_SERVER=https://<WEB_UNLOCKER_API_HOST>:<WEB_UNLOCKER_API_PORT>
PROXY_USERNAME=<WEB_UNLOCKER_API_USERNAME>
PROXY_PASSWORD=<WEB_UNLOCKER_API_PASSWORD>
LLM_API_TOKEN=<GROQ_API_KEY>
LLM_MODEL=groq/deepseek-r1-distill-llama-70b

Launch your DeepSeek Crawl4AI scraper with:

python scraper.py

The output in the terminal will be something like this:

[INIT].... → Crawl4AI 0.4.248
[FETCH]... ↓ https://www.g2.com/products/bright-data/reviews... | Status: True | Time: 56.13s
[SCRAPE].. ◆ Processed https://www.g2.com/products/bright-data/reviews... | Time: 397ms
[LOG] Call LLM for https://www.g2.com/products/bright-data/reviews - block index: 0
[LOG] Extracted 1 blocks from URL: https://www.g2.com/products/bright-data/reviews block index: 0
[EXTRACT]. ■ Completed for https://www.g2.com/products/bright-data/reviews... | Time: 12.273853100006818s
[COMPLETE] ● https://www.g2.com/products/bright-data/reviews... | Status: True | Total: 68.81s

=== Token Usage Summary ===
Type                   Count
------------------------------
Completion               524
Prompt                 2,002
Total                  2,526

=== Usage History ===
Request #    Completion       Prompt        Total
------------------------------------------------
1                   524        2,002        2,526
None

Also, a g2.json file will appear in your project’s folder. Open it, and you will see:

{
    "name": "Bright Data",
    "image_url": "https://images.g2crowd.com/uploads/product/image/large_detail/large_detail_9d7645872b9abb68923fb7e2c7c9d834/bright-data.png",
    "description": "Bright Data is the world's #1 web data, proxies, & data scraping solutions platform. Fortune 500 companies, academic institutions and small businesses all rely on Bright Data's products, network and solutions to retrieve crucial public web data in the most efficient, reliable and flexible manner, so they can research, monitor, analyze data and make better informed decisions. Bright Data is used worldwide by 20,000+ customers in nearly every industry. Its products range from no-code data solutions utilized by business owners, to a robust proxy and scraping infrastructure used by developers and IT professionals. Bright Data products stand out because they provide a cost-effective way to perform fast and stable public web data collection at scale, effortless conversion of unstructured data into structured data and superior customer experience, while being fully transparent and compliant.",
    "review_score": "4.7/5",
    "number_of_reviews": "221",
    "claimed": true
}

Congratulations! You started with a bot-protected G2 page and used Crawl4AI, DeepSeek, and Web Unlocker API to extract structured data from it—without writing a single line of parsing logic.

Conclusion

In this tutorial, you explored what Crawl4AI is and how to use it in combination with DeepSeek to build an AI-powered scraper. One of the major challenges when scraping is the risk of being blocked, but this was overcome with Bright Data’s Web Unlocker API.

As demonstrated in this tutorial, with the combination of Crawl4AI, DeepSeek, and the Web Unlocker API, you can extract data from any site—even those that are more protected, like G2—without the need for specific parsing logic. This is just one of many scenarios supported by Bright Data’s products and services, which help you implement effective AI-driven web scraping.

Explore our other web scraping tools that integrate with Crawl4AI:

  • Proxy Services: 4 different types of proxies to bypass location restrictions, including 72 million+ residential IPs
  • Web Scraper APIs: Dedicated endpoints for extracting fresh, structured web data from over 100 popular domains.
  • SERP API: An API that handles all the unlocking management for search engine results pages and extracts SERP data
  • Scraping Browser: A Puppeteer-, Selenium-, and Playwright-compatible browser with built-in unlocking capabilities

Sign up for Bright Data now and test our proxy services and scraping products for free!

No credit card required