In this tutorial, you will learn:
- What Crawl4AI is and what it offers for web scraping
- The ideal scenarios for using Crawl4AI with an LLM like DeepSeek
- How to build a DeepSeek-powered Crawl4AI scraper in a guided section.
Let’s dive in!
What Is Crawl4AI?
Crawl4AI is an open-source, AI-ready web crawler and scraper designed for seamless integration with large language models (LLMs), AI agents, and data pipelines. It delivers high-speed, real-time data extraction while being flexible and easy to deploy.
The features it offers for AI web scraping are:
- Built for LLMs: Generates structured Markdown optimized for retrieval-augmented generation (RAG) and fine-tuning.
- Flexible browser control: Supports session management, proxies, and custom hooks.
- Heuristic intelligence: Uses smart algorithms to optimize data parsing.
- Fully open source: No API keys required; deployable via Docker and cloud platforms.
Discover more on the official documentation.
When To Use Crawl4AI and DeepSeek for Web Scraping
DeepSeek offers powerful, open-source, free LLM models that have made waves in the AI community due to their efficiency and effectiveness. Plus, these models integrate smoothly with Crawl4AI.
By leveraging DeepSeek in Crawl4AI, you can extract structured data from even the most complex and inconsistent web pages, all without the need for predefined parsing logic.
Below are key scenarios where the DeepSeek + Crawl4AI combination is especially useful:
- Frequent site structure changes: Traditional scrapers break when websites update their HTML structure, but AI dynamically adapts.
- Inconsistent page layouts: Platforms like Amazon have varying product page designs. An LLM can intelligently extract data regardless of layout differences.
- Unstructured content parsing: Extracting insights from free-text reviews, blog posts, or forum discussions becomes easy with LLM-powered processing.
Web Scraping With Crawl4AI and DeepSeek: Step-By-Step Guide
In this guided tutorial, you will learn how to build an AI-powered web scraper using Crawl4AI. As the LLM engine, we will use DeepSeek.
Specifically, you will see how to create an AI scraper to extract data from the G2 page for Bright Data:
Follow the steps below and learn how to perform web scraping with Crawl4AI and DeepSeek!
Prerequisites
To follow this tutorial, ensure you meet the following prerequisites:
- Python 3+ installed on your machine
- A GroqCloud account
- A Bright Data account
Do not worry if you do not have a GroqCloud or Bright Data account yet. You will be guided through their setup during the next steps.
Step #1: Project Setup
Run the following command to create a folder for your Crawl4AI DeepSeek scraping project:
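```bash
mkdir crawl4ai-deepseek-scraper
```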
Navigate into the project folder and create a virtual environment:
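Assuming you name the virtual environment folder venv:

```bash
cd crawl4ai-deepseek-scraper
python -m venv venv
```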
Now, load the crawl4ai-deepseek-scraper
folder in your favorite Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are two great options.
Inside the project folder, create:
- scraper.py: The file that will contain the AI-powered scraping logic.
- models/: A directory to store Pydantic-based Crawl4AI LLM data models.
- .env: A file to store environment variables securely.
After creating these files and folders, your project structure should look like this:
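```
crawl4ai-deepseek-scraper/
├── models/
├── venv/
├── .env
└── scraper.py
```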
Next, activate the virtual environment in your IDE’s terminal.
In Linux or macOS, launch this command:
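```bash
# activate the virtual environment (assuming it lives in ./venv)
source venv/bin/activate
```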
Equivalently, on Windows, execute:
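```powershell
# activate the virtual environment on Windows
venv\Scripts\activate
```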
Great! You now have a Python environment for Crawl4AI web scraping with DeepSeek.
Step #2: Install Crawl4AI
With your virtual environment activated, install Crawl4AI via the crawl4ai
pip package:
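```bash
pip install crawl4ai
```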
Note that the library has several dependencies, so the installation might take a while.
Step #3: Set Up Crawl4AI
Once installed, run the following command in your terminal:
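```bash
crawl4ai-setup
```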
The process:
- Installs or updates the required Playwright browsers (Chromium, Firefox, etc.).
- Performs OS-level checks (e.g., ensuring required system libraries are installed on Linux).
- Confirms your environment is properly set up for web crawling.
After running the command, you should see an output similar to this:
Amazing! Crawl4AI is now installed and ready to use.
Step #4: Initialize scraper.py
Since Crawl4AI requires asynchronous code, start by creating a basic asyncio
script:
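```python
import asyncio

async def main():
    # the scraping logic will go here...
    pass

if __name__ == "__main__":
    asyncio.run(main())
```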
Now, remember that the project involves integrations with third-party services like DeepSeek. To implement that, you will need to rely on API keys and other secrets. We will store them in a .env
file.
Install python-dotenv
to load environment variables:
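```bash
pip install python-dotenv
```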
Before defining main()
, load the environment variables from the .env
file with load_dotenv()
:
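```python
# load the environment variables from the .env file
load_dotenv()
```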
Import load_dotenv
from the python-dotenv
library:
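```python
from dotenv import load_dotenv
```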
Perfect! scraper.py
is ready to host some AI-powered scraping logic.
Step #5: Create Your First AI Scraper
Inside the main()
function in scraper.py
, add the following logic using a basic Crawl4AI crawler:
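Below is a minimal sketch. The G2 URL used here is an assumption, so replace it with the exact page you want to scrape:

```python
async def main():
    # browser configuration: headless Chromium with default settings
    browser_config = BrowserConfig(headless=True)

    # crawler configuration: always fetch fresh content, never the cache
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # navigate to the target page and scrape it
        result = await crawler.arun(
            url="https://www.g2.com/products/bright-data/reviews",  # assumed G2 URL
            config=crawler_config,
        )
        # print the Markdown-converted page content
        print(result.markdown)
```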
In the above snippet, the key points are:
- BrowserConfig: Controls how the browser is launched and behaves, including settings like headless mode and custom user agents for web scraping.
- CrawlerRunConfig: Defines the crawling behavior, such as caching strategy, data selection rules, timeouts, and more.
- headless=True: Configures the browser to run in headless mode—without the GUI—to save resources.
- CacheMode.BYPASS: This configuration guarantees that the crawler fetches fresh content directly from the website instead of relying on cached data.
- crawler.arun(): This method launches the asynchronous crawler to extract data from the specified URL.
- result.markdown: The extracted content is converted into Markdown format, making it easier to parse and analyze.
Do not forget to add the following imports:
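With recent Crawl4AI releases, these classes are exposed at the package root:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
```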
Right now, scraper.py
should contain:
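Putting the previous snippets together, the file should look roughly like this (again, the G2 URL is an assumption):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from dotenv import load_dotenv

# load the environment variables from the .env file
load_dotenv()

async def main():
    # browser configuration: headless Chromium with default settings
    browser_config = BrowserConfig(headless=True)

    # crawler configuration: always fetch fresh content, never the cache
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.g2.com/products/bright-data/reviews",  # assumed G2 URL
            config=crawler_config,
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```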
If you execute the script, you should see an output as below:
That is suspicious, as the parsed Markdown content is empty. To investigate further, print the response status:
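For instance, you can log the HTTP status code the crawler received:

```python
# print the HTTP status code returned by the target page
print(f"Response status code: {result.status_code}")
```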
This time, the output will include:
The Markdown-parsed result is empty because the Crawl4AI request was blocked by G2’s bot detection systems. That is made clear by the 403 Forbidden status code returned by the server.
That should not be surprising, as G2 has strict anti-bot measures in place. In particular, it often displays CAPTCHAs—even when accessed through a regular browser:
In this case, since no valid content was received, Crawl4AI could not convert it to Markdown. In the next step, we will explore how to bypass this restriction. For further reading, take a look at our guide on how to bypass CAPTCHAs in Python.
Step #6: Configure Web Unlocker API
Crawl4AI is a powerful tool with built-in bot bypassing mechanisms. However, it cannot bypass highly protected websites like G2, which employ strict and top-notch anti-bot and anti-scraping measures.
Against such sites, the best solution is to use a dedicated tool designed to unblock any webpage, regardless of its protection level. The ideal scraping product for this task is Bright Data’s Web Unlocker, a scraping API that:
- Simulates real user behavior to bypass anti-bot detection
- Handles proxy management and CAPTCHA solving automatically
- Scales seamlessly without requiring infrastructure management
Follow the next instructions to integrate Web Unlocker API into your Crawl4AI DeepSeek scraper.
Alternatively, take a look at the official documentation.
First, log in to your Bright Data account or create one if you have not already. Fund your account or take advantage of the free trial available for all products.
Next, navigate to “Proxies & Scraping” in the dashboard and select the “unblocker” option in the table:
This will take you to the Web Unlocker API setup page shown below:
Here, enable Web Unlocker API by clicking on the toggle:
G2 is protected by advanced anti-bot defenses, including CAPTCHAs. Thus, verify that the following two toggles are enabled on the “Configuration” page:
Crawl4AI operates by navigating pages in a controlled browser. Under the hood, it relies on Playwright’s goto()
function, which sends an HTTP GET
request to the target webpage. In contrast, Web Unlocker API works through POST
requests.
That is not a problem as you can still use Web Unlocker API with Crawl4AI by configuring it as a proxy. This allows Crawl4AI’s browser to send requests through Bright Data’s product, receiving back unblocked HTML pages.
To access your Web Unlocker API proxy credentials, reach the “Native proxy-based access” tab on the “Overview” page:
Copy the following credentials from the page:
- <HOST>
- <PORT>
- <USERNAME>
- <PASSWORD>
Then, use them to populate your .env
file with these environment variables:
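For example, with variable names of your choosing (the names below are the ones assumed in the rest of this tutorial, so they just need to match what scraper.py reads later):

```
PROXY_SERVER=https://<HOST>:<PORT>
PROXY_USERNAME=<USERNAME>
PROXY_PASSWORD=<PASSWORD>
```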
Fantastic! Web Unlocker is now ready for integration with Crawl4AI.
Step #7: Integrate Web Unlocker API
BrowserConfig
supports proxy integration through the proxy_config
object. To integrate Web Unlocker API with Crawl4AI, populate that object with the environment variables from your .env
file and pass it to the BrowserConfig
constructor:
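Assuming the .env variable names from the previous step, the integration might look like this:

```python
# read the Web Unlocker API proxy credentials from the environment
proxy_config = {
    "server": os.getenv("PROXY_SERVER"),
    "username": os.getenv("PROXY_USERNAME"),
    "password": os.getenv("PROXY_PASSWORD"),
}

# route all browser traffic through the Web Unlocker API proxy
browser_config = BrowserConfig(
    headless=True,
    proxy_config=proxy_config,
)
```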
Remember to import os
from the Python Standard Library:
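```python
import os
```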
Keep in mind that Web Unlocker API introduces some time overhead due to IP rotation via the proxy and eventual CAPTCHA solving. To account for that, you should:
- Increase the page load timeout to 3 minutes
- Instruct the crawler to wait for the DOM to be fully loaded before parsing it
Achieve that with the following CrawlerRunConfig
configuration:
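A configuration along these lines should do (note that page_timeout is expressed in milliseconds):

```python
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=3 * 60 * 1000,       # wait up to 3 minutes for the page to load
    wait_until="domcontentloaded",    # parse the page only after the DOM is fully loaded
)
```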
Note that even Web Unlocker API is not flawless when dealing with complex sites like G2. Rarely, the scraping API may fail to retrieve the unblocked page, causing the script to terminate with the following error:
Rest assured, you are only charged for successful requests, so there is no need to worry about relaunching the script until it works. In a production script, consider implementing automatic retry logic.
When the request is successful, you will receive an output like this:
Terrific! This time, G2 responded with a 200 OK
status code. That means the request was not blocked, and Crawl4AI was able to successfully parse the HTML into Markdown as intended.
Step #8: Groq Setup
GroqCloud is one of the few providers that supports DeepSeek AI models via OpenAI-compatible APIs—even on a free plan. So, it will be the platform used for the LLM integration in Crawl4AI.
If you do not already have a Groq account, create one. Otherwise, just log in. In your user dashboard, navigate to “API Keys” in the left menu and click the “Create API Key” button:
A popup will appear:
Give your API key a name (e.g., “Crawl4AI Scraping”) and wait for the anti-bot verification by Cloudflare. Then, click “Submit” to generate your API key:
Copy the API key and add it to your .env
file as below:
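The GROQ_API_KEY variable name is a convention assumed throughout this tutorial:

```
GROQ_API_KEY=<YOUR_GROQ_API_KEY>
```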
Replace <YOUR_GROQ_API_KEY>
with the actual API key provided by Groq.
Beautiful! You are ready to use DeepSeek for LLM scraping with Crawl4AI.
Step #9: Define a Schema for Your Scraped Data
Crawl4AI performs LLM scraping following a schema-based approach. In this context, a schema is a JSON data structure that defines:
- A base selector that identifies the “container” element on the page (e.g., a product row, a blog post card).
- Fields specifying the CSS/XPath selectors to capture each piece of data (e.g., text, attribute, HTML block).
- Nested or list types for repeated or hierarchical structures.
To define the schema, you must first identify the data you want to extract from the target page. To do that, open the target page in incognito mode in your browser:
In this case, assume you are interested in the following fields:
- name: The name of the product/company.
- image_url: The URL of the product/company image.
- description: A brief description of the product/company.
- review_score: The average review score of the product/company.
- number_of_reviews: The total number of reviews.
- claimed: A boolean indicating if the company profile is claimed by the owner.
Now, in the models
folder, create a g2_product.py
file and populate it with a Pydantic-based schema class called G2Product
as follows:
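Here is a minimal sketch of the model, mirroring the fields listed above (the exact field types are an assumption):

```python
from pydantic import BaseModel

class G2Product(BaseModel):
    """
    Represents the structure of a G2 product/company page.
    """
    name: str
    image_url: str
    description: str
    review_score: float
    number_of_reviews: int
    claimed: bool
```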
Yes! The LLM scraping process performed by DeepSeek will return objects following the above schema.
Step #10: Prepare to Integrate DeepSeek
Before completing the integration of DeepSeek with Crawl4AI, review the “Settings > Limits” page in your GroqCloud account:
There, you can see that the two available DeepSeek models have the following limitations on the free plan:
- Up to 30 requests per minute
- Up to 1,000 requests per day
- No more than 6,000 tokens per minute
While the first two restrictions are not a problem for this example, the last one presents a challenge. A typical web page can contain millions of characters, translating to hundreds of thousands of tokens.
In other words, you cannot feed the entire G2 page directly into DeepSeek models via Groq due to token limits. To tackle the issue, Crawl4AI allows you to select only specific sections of the page. Those sections—and not the entire page—will be converted to Markdown and passed to the LLM. The section selection process relies on CSS selectors.
To determine the sections to select, open the target page in your browser. Right-click on the elements containing the data of interest and select the “Inspect” option:
Here, you can notice that the .product-head__title
element contains the product/company name, review score, number of reviews, and claimed status.
Now, inspect the logo section:
You can retrieve that information using the .product-head__logo
CSS selector.
Finally, inspect the description section:
The description is available using the [itemprop="description"]
selector.
Configure these CSS selectors in CrawlerRunConfig
as follows:
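One way to do that is to join the three selectors into a single comma-separated string for the css_selector option:

```python
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=3 * 60 * 1000,
    wait_until="domcontentloaded",
    # only keep the page sections that contain the data of interest
    css_selector=", ".join([
        ".product-head__title",
        ".product-head__logo",
        "[itemprop=\"description\"]",
    ]),
)
```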
If you execute scraper.py
again, you will now get something like:
The output only includes the relevant sections instead of the entire HTML page. This approach significantly reduces token usage, allowing you to stay within Groq’s free-tier limits while effectively extracting the data of interest!
Step #11: Define the DeepSeek-Based LLM Extraction Strategy
Crawl4AI supports LLM-based data extraction through the LLMExtractionStrategy
object. You can define one for DeepSeek integration as below:
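Here is a sketch of the strategy, assuming the GROQ_API_KEY and LLM_MODEL environment variables used in this tutorial. Note that parameter names vary across Crawl4AI versions; newer releases move the provider and API token into an LLMConfig object:

```python
extraction_strategy = LLMExtractionStrategy(
    provider=os.getenv("LLM_MODEL"),        # e.g. "groq/deepseek-r1-distill-llama-70b"
    api_token=os.getenv("GROQ_API_KEY"),    # GroqCloud API key from the .env file
    schema=G2Product.model_json_schema(),   # JSON schema describing the expected output
    extraction_type="schema",
    instruction=(
        "Extract the product/company name, image URL, description, review score, "
        "number of reviews, and claimed status from the content."
    ),
    input_format="markdown",                # feed the LLM the Markdown-converted sections
)
```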
To specify the LLM model, add the following environment variable to .env
:
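For example, with an LLM_MODEL variable (the name is arbitrary, as long as it matches what scraper.py reads) pointing to the Groq-hosted model:

```
LLM_MODEL=groq/deepseek-r1-distill-llama-70b
```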
This tells Crawl4AI to use the deepseek-r1-distill-llama-70b
model from GroqCloud for LLM-based data extraction.
In scraper.py
, import LLMExtractionStrategy
and G2Product
:
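Depending on your Crawl4AI version, LLMExtractionStrategy may also be importable directly from the package root:

```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

from models.g2_product import G2Product
```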
Then, pass the extraction_strategy
object to crawler_config
:
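The updated configuration then becomes:

```python
crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    page_timeout=3 * 60 * 1000,
    wait_until="domcontentloaded",
    css_selector=", ".join([
        ".product-head__title",
        ".product-head__logo",
        "[itemprop=\"description\"]",
    ]),
    # delegate data extraction from the selected sections to DeepSeek
    extraction_strategy=extraction_strategy,
)
```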
When you run the script, Crawl4AI will:
- Connect to the target web page via the Web Unlocker API proxy.
- Retrieve the HTML content of the page and filter elements using the specified CSS selectors.
- Convert the selected HTML elements to Markdown format.
- Send the formatted Markdown to DeepSeek for data extraction.
- Tell DeepSeek to process the input according to the provided prompt (instruction) and return the extracted data.
After running crawler.arun()
, you can check token usage with:
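```python
# print a report of the tokens consumed by the LLM extraction
extraction_strategy.show_usage()
```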
Then, you can access and print the extracted data with:
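```python
# the extracted data is returned as a JSON string
print(result.extracted_content)
```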
If you execute the script and print the results, you should see an output like this:
The first part of the output (token usage) comes from show_usage(), confirming that we are well below the 6,000-token limit. The data that follows is a JSON string matching the G2Product schema.
Simply incredible!
Step #12: Handle the Result Data
As you can see from the output in the previous step, DeepSeek typically returns an array instead of a single object. To handle that, parse the returned data as JSON and extract the first element from the array:
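One way to do that, validating the element against the Pydantic model so that result_data really is a G2Product instance:

```python
# parse the JSON array returned by DeepSeek and validate its first element
first_entry = json.loads(result.extracted_content)[0]
result_data = G2Product.model_validate(first_entry)
```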
Remember to import json
from the Python Standard Library:
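```python
import json
```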
At this point, result_data
should be an instance of G2Product
. The final step is to export this data to a JSON file.
Step #13: Export the Scraped Data to JSON
Use json
to export result_data
to a g2.json
file:
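A minimal export, serializing the validated model back to plain JSON:

```python
# export the scraped data to a JSON file
with open("g2.json", "w", encoding="utf-8") as json_file:
    json.dump(result_data.model_dump(), json_file, indent=4)
```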
Mission complete!
Step #14: Put It All Together
Your final scraper.py
file should contain:
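Assembling all the previous snippets (the G2 URL, the environment variable names, and some parameter names are the assumptions flagged throughout this tutorial):

```python
import asyncio
import json
import os

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from dotenv import load_dotenv

from models.g2_product import G2Product

# load the environment variables from the .env file
load_dotenv()

async def main():
    # route the browser traffic through the Web Unlocker API proxy
    proxy_config = {
        "server": os.getenv("PROXY_SERVER"),
        "username": os.getenv("PROXY_USERNAME"),
        "password": os.getenv("PROXY_PASSWORD"),
    }
    browser_config = BrowserConfig(headless=True, proxy_config=proxy_config)

    # DeepSeek-powered extraction strategy served via GroqCloud
    extraction_strategy = LLMExtractionStrategy(
        provider=os.getenv("LLM_MODEL"),
        api_token=os.getenv("GROQ_API_KEY"),
        schema=G2Product.model_json_schema(),
        extraction_type="schema",
        instruction=(
            "Extract the product/company name, image URL, description, review score, "
            "number of reviews, and claimed status from the content."
        ),
        input_format="markdown",
    )

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        page_timeout=3 * 60 * 1000,       # up to 3 minutes, in milliseconds
        wait_until="domcontentloaded",    # wait for the DOM to be fully loaded
        # only keep the page sections containing the data of interest
        css_selector=", ".join([
            ".product-head__title",
            ".product-head__logo",
            "[itemprop=\"description\"]",
        ]),
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # navigate to the target G2 page (assumed URL) and scrape it
        result = await crawler.arun(
            url="https://www.g2.com/products/bright-data/reviews",
            config=crawler_config,
        )

        # print the token usage report
        extraction_strategy.show_usage()

        # parse the JSON array returned by DeepSeek and validate its first element
        result_data = G2Product.model_validate(json.loads(result.extracted_content)[0])

        # export the scraped data to a JSON file
        with open("g2.json", "w", encoding="utf-8") as json_file:
            json.dump(result_data.model_dump(), json_file, indent=4)

if __name__ == "__main__":
    asyncio.run(main())
```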
Then, models/g2_product.py
will store:
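That is, the G2Product model defined in Step #9:

```python
from pydantic import BaseModel

class G2Product(BaseModel):
    """
    Represents the structure of a G2 product/company page.
    """
    name: str
    image_url: str
    description: str
    review_score: float
    number_of_reviews: int
    claimed: bool
```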
And .env
will have:
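Using the variable names assumed throughout this tutorial:

```
PROXY_SERVER=https://<HOST>:<PORT>
PROXY_USERNAME=<USERNAME>
PROXY_PASSWORD=<PASSWORD>
GROQ_API_KEY=<YOUR_GROQ_API_KEY>
LLM_MODEL=groq/deepseek-r1-distill-llama-70b
```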
Launch your DeepSeek Crawl4AI scraper with:
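```bash
python scraper.py
```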
The output in the terminal will be something like this:
Also, a g2.json
file will appear in your project’s folder. Open it, and you will see:
Congratulations! You started with a bot-protected G2 page and used Crawl4AI, DeepSeek, and Web Unlocker API to extract structured data from it—without writing a single line of parsing logic.
Conclusion
In this tutorial, you explored what Crawl4AI is and how to use it in combination with DeepSeek to build an AI-powered scraper. One of the major challenges when scraping is the risk of being blocked, but this was overcome with Bright Data’s Web Unlocker API.
As demonstrated in this tutorial, with the combination of Crawl4AI, DeepSeek, and the Web Unlocker API, you can extract data from any site—even those that are more protected, like G2—without the need for specific parsing logic. This is just one of many scenarios supported by Bright Data’s products and services, which help you implement effective AI-driven web scraping.
Explore our other web scraping tools that integrate with Crawl4AI:
- Proxy Services: 4 different types of proxies to bypass location restrictions, including 72 million+ residential IPs
- Web Scraper APIs: Dedicated endpoints for extracting fresh, structured web data from over 100 popular domains.
- SERP API: An API that handles all the ongoing unlocking management needed to extract search engine results pages.
- Scraping Browser: A Puppeteer-, Selenium-, and Playwright-compatible browser with built-in unlocking capabilities.
Sign up for Bright Data now and test our proxy services and scraping products for free!
No credit card required