
Web Scraping With Qwen3 in 2025: Complete Tutorial

Master web scraping with Qwen3 LLM, from setup to advanced usage, and efficiently extract data using AI-driven techniques in 2025.
15 min read
Real-Time Web Scraping with Qwen3 and Bright Data

In this tutorial, you will see:

  • What Qwen3 is and what makes it stand out as an LLM
  • Why it is well-suited for web scraping tasks
  • How to use Qwen3 locally for web scraping with Hugging Face
  • Its main limitations and how to work around them
  • A few alternatives to Qwen3 for AI-powered scraping

Let’s dive in!

What Is Qwen3?

Qwen3 is the latest generation of LLMs developed by Alibaba Cloud’s Qwen team. The model is open source and freely explorable on GitHub, available under the Apache 2.0 license. That makes it great for research and development.

The main features of Qwen3 include:

  • Hybrid reasoning: It can switch between a “thinking mode” for complex logical reasoning (like math or coding) and a “non-thinking mode” for faster, general-purpose responses. This allows you to control the depth of reasoning for optimal performance and cost efficiency.
  • Diverse models: Qwen3 offers a comprehensive suite of models, including dense models (ranging from 0.6B to 32B parameters) and Mixture-of-Experts (MoE) models (like the 30B and 235B variants).
  • Enhanced capabilities: It shows significant advancements in reasoning, instruction following, agent capabilities, and multilingual support (covering over 100 languages and dialects).
  • Training Data: Qwen3 was trained on a massive dataset of approximately 36 trillion tokens, nearly double that of its predecessor, Qwen2.5.

Why Use Qwen3 for Web Scraping?

Qwen3 makes web scraping easier by automating the interpretation and structuring of unstructured content in HTML pages. That eliminates the need for manual data parsing. Instead of writing complex logic to extract data, the model understands the structure of the page for you.

Relying on Qwen3 for web data parsing is especially useful when dealing with common web scraping challenges like:

  • Frequently changing page layouts: A popular scenario is Amazon, where each product page can show different data.
  • Unstructured data: Qwen3 can extract valuable information from messy, free-form text without requiring hardcoded selectors or regex logic.
  • Difficult-to-parse content: For pages with inconsistent or complex structure, an LLM like Qwen3 removes the need for custom parsing logic.

For a deeper dive, read our guide on using AI for web scraping.

Another major advantage is that Qwen3 is open-source. That means you can run it locally on your own machine for free, without relying on third-party APIs or paying for premium LLMs like OpenAI’s. This gives you full control over your scraping architecture.

How to Perform Web Scraping with Qwen3 in Python

In this section, the target page will be the “Affirm Water Bottle” product page from the “Ecommerce Test Site to Learn Web Scraping” sandbox:

The target page

This page serves as a solid example because e-commerce product pages usually have inconsistent structures, displaying varying types of data. That variability is what makes e-commerce web scraping particularly challenging—and also where AI can make a big difference.

Here, we will use a Qwen3-powered scraper to intelligently extract product information without writing manual parsing rules.

Note: This tutorial will show you how to use Hugging Face to run Qwen3 models locally and for free. Still, other viable options exist, such as connecting to an LLM provider that hosts Qwen3 models or using solutions like Ollama.

Follow the steps below to start scraping web data using Qwen3!

Step #1: Set Up Your Project

Before getting started, make sure that you have Python 3.10+ installed on your machine. Otherwise, download it and follow the installation instructions.

Next, execute the command below to create a folder for your scraping project:

mkdir qwen3-scraper

The qwen3-scraper directory will serve as the project folder for web scraping using Qwen3.

Navigate to the folder in your terminal and initialize a Python virtual environment inside it:

cd qwen3-scraper
python -m venv venv

Load the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both excellent options.

Create a scraper.py file in the project’s folder, which should now contain:

The project file structure for web scraping with Qwen3

Right now, scraper.py is just an empty Python script, but it will soon contain the logic for LLM web scraping.

Then, activate the virtual environment. On Linux or macOS, run:

source venv/bin/activate

Equivalently, on Windows, use:

venv\Scripts\activate

Note: The following steps will guide you through installing all the required libraries. If you prefer to install everything at once, you can use the command below now:

pip install transformers torch accelerate requests beautifulsoup4 markdownify

Awesome! Your Python environment is fully set up for web scraping with Qwen3.

Step #2: Configure Qwen3 in Hugging Face

As mentioned at the beginning of this section, we will use Hugging Face to run a Qwen3 model locally. This is now possible because Hugging Face recently added support for Qwen3 models.

First, make sure you are in an activated virtual environment. Then, install the necessary Hugging Face dependencies by running:

pip install transformers torch accelerate

Next, in your scraper.py file, import the required classes from Hugging Face’s transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer

Now, use those classes to load a tokenizer and the Qwen3 model:

model_name = "Qwen/Qwen3-0.6B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

In this case, we are using the Qwen/Qwen3-0.6B model, but you can choose from over 40 other Qwen3 models available on Hugging Face.

Awesome! You now have everything in place to utilize Qwen3 in your Python script.

Step #3: Get the HTML of the Target Page

Now, it is time to retrieve the HTML content of the target page. You can achieve that using a powerful Python HTTP client like Requests.

In your activated virtual environment, install the Requests library:

pip install requests

Then, in your scraper.py file, import the library:

import requests

Use the get() method to send an HTTP GET request to the page URL:

url = "https://www.scrapingcourse.com/ecommerce/product/ajax-full-zip-sweatshirt/"
response = requests.get(url)

The server will respond with the raw HTML of the page. To see the full HTML content, you can print response.content:

print(response.content)

The result should be this HTML string:

<!DOCTYPE html>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="profile" href="https://gmpg.org/xfn/11">
    <link rel="pingback" href="https://www.scrapingcourse.com/ecommerce/xmlrpc.php">

    <!-- omitted for brevity... -->

    <title>Affirm Water Bottle &#8211; Ecommerce Test Site to Learn Web Scraping</title>
    <!-- omitted for brevity... -->
</head>
<body>
    <!-- omitted for brevity... -->
</body>
</html>

You now have the complete HTML of the target page available in Python. Let’s move on to parsing it and extracting the data we need using Qwen3!
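Note that requests.get() as used above sets no timeout and never checks the HTTP status code. For real-world runs, you may want a slightly hardened fetch helper like the sketch below (the User-Agent string is purely illustrative):

```python
import requests


def fetch_html(url: str, timeout: float = 30) -> str:
    """Fetch a page's HTML, failing fast on HTTP errors and slow servers."""
    headers = {
        # Many sites reject the default python-requests User-Agent
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

This way, a blocked or failed request raises an exception immediately instead of silently feeding an error page to the rest of the pipeline.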

Step #4: Convert the Page HTML to Markdown (optional, but recommended)

Note: This step is not strictly required. However, it can save you significant time locally (and money if you are using paid Qwen3 providers). So, it is definitely worth considering.

Take a moment to explore how other AI-powered web scraping tools like Crawl4AI and ScrapeGraphAI handle raw HTML. You will notice they both offer options to convert HTML into Markdown before passing the content to the configured LLM.

Why do they do that? There are two main reasons:

  • Cost efficiency: Markdown conversion reduces the number of tokens sent to the AI, helping you save money.
  • Faster processing: Less input data means lower computational costs and quicker responses.

For more information, read our guide on why the new AI agents choose Markdown over HTML.

In this case, since Qwen3 runs locally, cost efficiency is not a concern, as you are not connected to a third-party LLM provider. What really matters here is faster processing. Why? Because asking the chosen Qwen3 model (one of the smaller available models) to process the entire HTML page can easily push an i7 CPU to 100% usage for several minutes.

That is too much, as you do not want to overheat or freeze your laptop or PC. So, reducing the input size by converting to Markdown makes perfect sense.
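If you want a rough sense of how much input you are about to feed the model, a common rule of thumb is that one token corresponds to roughly four characters of English text. The helper below is only that heuristic, not the tokenizer's exact count:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)
```

For a precise count, tokenize the string with the loaded Qwen3 tokenizer instead; the heuristic is just handy for quick before/after comparisons.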

Time to replicate the HTML-to-Markdown conversion logic and reduce token usage!

First, open the target webpage in incognito mode to ensure a fresh session. Then, right-click anywhere on the page and select “Inspect” to open the DevTools. Now, examine the page structure. You will see that all relevant data is contained within the HTML element identified by the CSS selector #main:

The #main HTML element with the product data

By focusing on the content inside #main in the HTML-to-Markdown conversion process, you extract only the part of the page with relevant data. This avoids including headers, footers, and other sections you are not interested in. That way, the final Markdown output will be much shorter.

To select just the HTML in the #main element, you need a Python HTML parsing library like Beautiful Soup. In your activated virtual environment, install it with this command:

pip install beautifulsoup4

If you are not familiar with its API, follow our guide on Beautiful Soup web scraping.

Then, import it in scraper.py:

from bs4 import BeautifulSoup

Now, use Beautiful Soup to:

  1. Parse the raw HTML fetched with Requests
  2. Select the #main element
  3. Extract its HTML content

Implement the three micro-steps above with this snippet:

# Parse the HTML of the page with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Select the #main element
main_element = soup.select_one("#main")

# Get the outer HTML of the selected element
main_html = str(main_element)

If you print main_html, you will see something like this:

<main id="main" class="site-main" role="main" data-testid="main-content" data-content="main-area">
    <!-- omitted for brevity... -->
    <div id="product-2765" class="product type-product post-2765 status-publish first instock product_cat-fitness-equipment has-post-thumbnail shipping-taxable purchasable product-type-simple">
        <!-- omitted for brevity... -->
    </div>
</main>

This string is much smaller than the full HTML page, but it still contains around 13,400 characters.

To reduce the size even more without losing important data, convert the extracted HTML to Markdown. First, install the markdownify library:

pip install markdownify

Import markdownify in scraper.py:

from markdownify import markdownify

Then, employ it to convert the HTML from #main to Markdown:

main_markdown = markdownify(main_html)

The data conversion process should produce an output as below:

HTML to Markdown comparison

The Markdown version is about 2.53 KB, compared to 13.61 KB for the original #main HTML. That is an 81% reduction in size! On top of that, what matters is that the Markdown version retains all the key data you need to scrape.

With this simple trick, you reduced a bulky HTML snippet into a compact Markdown string. This will speed up local LLM data parsing via Qwen3 a lot!
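If you want to verify the savings on your own pages, a tiny helper makes the comparison explicit:

```python
def size_reduction(original: str, converted: str) -> float:
    """Percentage size reduction going from the original to the converted text."""
    return round((1 - len(converted) / len(original)) * 100, 1)
```

Calling size_reduction(main_html, main_markdown) on the strings from this step should report a reduction in the ballpark of the 81% figure mentioned above.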

Step #5: Use Qwen3 for Data Parsing

To get Qwen3 to scrape data correctly, you need to write an effective prompt. Start by analyzing the structure of the target page:

The structure of the target page

The top section of the page is consistent across all products. On the other hand, the “Additional information” table changes depending on the product. Since you may want your prompt to work across all product pages on the platform, you could describe your task in general terms like so:

Extract main product data from the HTML content below. Respond with a raw string in JSON format containing the scraped data in product attributes as below:\n\n
SAMPLE JSON ATTRIBUTES: \n
sku, name, images, price, description, category + fields extracted from the "Additional information" section

CONTENT:\n
<MARKDOWN_PRODUCT_CONTENT>

This prompt instructs Qwen3 to extract structured data from the main_markdown content. To get reliable results, it is a good idea to make your prompt as clear and specific as possible. That helps the model understand exactly what you expect.
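If you plan to reuse the same instructions across many product pages, you can wrap the template in a small helper so it lives in one place. This is just an organizational sketch of the prompt shown above:

```python
def build_extraction_prompt(markdown_content: str) -> str:
    """Build the Qwen3 product-extraction prompt for a page's Markdown content."""
    return (
        "Extract main product data from the HTML content below. "
        "Respond with a raw string in JSON format containing the scraped data "
        "in product attributes as below:\n\n"
        "SAMPLE JSON ATTRIBUTES:\n"
        "sku, name, images, price, description, category + fields extracted "
        'from the "Additional information" section\n\n'
        "CONTENT:\n"
        f"{markdown_content}"
    )
```

That way, tweaking the instructions later means editing a single function rather than every script that embeds the template.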

Now, use Hugging Face to run the prompt, as explained in the official documentation:

# Define the data extraction prompt
prompt = f"""Extract main product data from the HTML content below. Respond with a raw string in JSON format containing the scraped data in product attributes as below:\n\n
SAMPLE JSON ATTRIBUTES: \n
sku, name, images, price, description, category + fields extracted from the "Additional information" section

CONTENT:\n
{main_markdown}
"""

# Execute the prompt
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Retrieve the output in text format
product_raw_string = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

The above code uses apply_chat_template() to format the input message and generates a response from the configured Qwen3 model.

Note: A key detail is setting enable_thinking=False in apply_chat_template(). By default, that option is set to True, which activates the model’s internal “reasoning” mode. That mode is useful for complex problem-solving but unnecessary and potentially counterproductive for straightforward tasks like web scraping. Disabling it ensures the model focuses purely on extraction without adding explanations or assumptions.

Fantastic! You just instructed Qwen3 to perform web scraping on the target page.
Now, all that remains is tweaking the output and exporting it to JSON.

Step #6: Convert the Qwen3 Output

The output produced by the Qwen3-0.6B model can vary slightly between runs. This is typical behavior for LLMs, especially smaller models like the one used here.

Thus, sometimes the variable product_raw_string will contain the desired data as a plain JSON string. Other times, it may wrap the JSON inside a Markdown code block, like this:

```json\n{\n  "sku": "24-UG06",\n  "name": "Affirm Water Bottle",\n  "images": ["https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ug06-lb-0.jpg"],\n  "price": "$7.00",\n  "description": "You’ll stay hydrated with ease with the Affirm Water Bottle by your side or in hand. Measurements on the outside help you keep track of how much you’re drinking, while the screw-top lid prevents spills. A metal carabiner clip allows you to attach it to the outside of a backpack or bag for easy access.",\n  "category": "Fitness Equipment",\n  "additional_information": {\n    "Activity": "Yoga, Recreation, Sports, Gym",\n    "Gender": "Men, Women, Boys, Girls, Unisex",\n    "Material": "Plastic"\n  }\n}\n```

To handle both cases, you can use a regular expression to extract the JSON content when it appears inside a Markdown block. Otherwise, treat the string as raw JSON. Then, you can parse the resulting JSON string into a Python dictionary with json.loads():

# Check if the string contains "```json" and extract the raw JSON if present
match = re.search(r'```json\n(.*?)\n```', product_raw_string, re.DOTALL)

if match:
    # Extract the JSON string from the matched group
    json_string = match.group(1)
else:
    # Assume the returned data is already in JSON format
    json_string = product_raw_string

# Parse the extracted JSON string into a Python dictionary
product_data = json.loads(json_string)

Here we go! At this point, you parsed the scraped data into a usable Python object. The last step is to export the scraped data to a more user-friendly format.
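As an extra safety net beyond the regex (this guards against a possible model behavior, not something the original script strictly needs), you can fall back to slicing the outermost braces when the response contains stray text around the JSON:

```python
import json


def parse_llm_json(raw: str) -> dict:
    """Parse LLM output as JSON, falling back to the outermost {...} slice."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end == -1:
            raise  # no JSON object found at all
        return json.loads(raw[start:end + 1])
```

This catches cases where a small model prepends a phrase like “Here is the data:” before the JSON object.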

Step #7: Export the Scraped Data

Now that you have the product data in a Python dictionary, you can save it to a JSON file like this:

with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product_data, json_file, indent=4)

This will create a file named product.json containing your structured product data.

Well done! Your Qwen3 web scraper is now complete.
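JSON is not the only useful target format. If you prefer a spreadsheet-friendly file, the sketch below (an optional extra, not part of the tutorial’s script) flattens the nested additional_information object and writes a one-row CSV:

```python
import csv


def export_to_csv(product: dict, path: str) -> None:
    """Flatten the nested 'additional_information' object and write a one-row CSV."""
    # Keep only scalar top-level fields
    flat = {k: v for k, v in product.items() if not isinstance(v, (dict, list))}
    # Promote nested attributes with a prefix
    for key, value in product.get("additional_information", {}).items():
        flat[f"additional_{key}"] = value
    # Join the image list into a single cell
    flat["images"] = "; ".join(product.get("images", []))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=flat.keys())
        writer.writeheader()
        writer.writerow(flat)
```

Because LLM output fields can vary between runs, the helper derives the CSV header from whatever keys are actually present rather than hardcoding them.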

Step #8: Put It All Together

Here is the final code of your scraper.py Qwen3 scraping script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
import json
import re

# The Qwen3 model to use for web scraping
model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and the Qwen3 model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Fetch the HTML content of the target page
url = "https://www.scrapingcourse.com/ecommerce/product/affirm-water-bottle/"
response = requests.get(url)

# Parse the HTML of the target page with BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# Select the #main element
main_element = soup.select_one("#main")

# Get the outer HTML of the selected element and convert it to Markdown
main_html = str(main_element)
main_markdown = markdownify(main_html)

# Define the data extraction prompt
prompt = f"""Extract main product data from the HTML content below. Respond with a raw string in JSON format containing the scraped data in product attributes as below:\n\n
SAMPLE JSON ATTRIBUTES: \n
sku, name, images, price, description, category + fields extracted from the "Additional information" section

CONTENT:\n
{main_markdown}
"""

# Execute the prompt
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Retrieve the output in text format
product_raw_string = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")

# Check if the string contains "```json" and extract the raw JSON if present
match = re.search(r'```json\n(.*?)\n```', product_raw_string, re.DOTALL)

if match:
    # Extract the JSON string from the matched group
    json_string = match.group(1)
else:
    # Assume the returned data is already in JSON format
    json_string = product_raw_string

# Parse the extracted JSON string into a Python dictionary
product_data = json.loads(json_string)

# Export the scraped data to JSON
with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product_data, json_file, indent=4)

Run the script with:

python scraper.py

The first time you run the script, Hugging Face will automatically download the selected Qwen3 model. This model is about 1.5GB, so the download may take some time depending on your internet speed. In the terminal, you will see output like:

model.safetensors: 100%|██████████████████████████████████████████████████████████| 1.50G/1.50G [00:49<00:00, 30.2MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 239/239 [00:00<?, ?B/s]

The script may take a while to complete, as PyTorch will stress your CPU to load and run the model.

Once the script finishes, it will create a file named product.json in your project folder. Open this file, and you should see structured product data like this:

{
    "sku": "24-UG06",
    "name": "Affirm Water Bottle",
    "images": [
        "https://www.scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/ug06-lb-0.jpg"
    ],
    "price": "$7.00",
    "description": "You’ll stay hydrated with ease with the Affirm Water Bottle by your side or in hand. Measurements on the outside help you keep track of how much you\u2019re drinking, while the screw-top lid prevents spills. A metal carabiner clip allows you to attach it to the outside of a backpack or bag for easy access.",
    "category": "Fitness Equipment",
    "additional_information": {
        "Activity": "Yoga, Recreation, Sports, Gym",
        "Gender": "Men, Women, Boys, Girls, Unisex",
        "Material": "Plastic"
    }
}

Note: The exact output may vary slightly due to the nature of LLMs, which can structure the scraped content in different ways.
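Because of that variability, it can help to guard the export step with a quick sanity check before writing the file. The required keys below are illustrative; adjust them to the fields your pipeline actually depends on:

```python
REQUIRED_KEYS = {"sku", "name", "price"}


def validate_product(product: dict) -> bool:
    """Return True only if the LLM output contains the minimum expected fields."""
    return REQUIRED_KEYS.issubset(product.keys())
```

If validation fails, you can re-run the generation step or log the raw output for inspection instead of exporting incomplete data.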

Et voilà! Your script just transformed raw HTML content into clean, structured JSON. All thanks to Qwen3 web scraping.

Overcoming the Main Limitation of This Approach to Web Scraping

Sure, in our example, everything worked smoothly. But that is only because we were scraping a demo site built specifically for that purpose.

In the real world, most websites are well aware of the value of their public-facing data. Thus, they often implement anti-scraping techniques that can quickly block automated HTTP requests made using tools like requests.

Plus, this approach will not work on JavaScript-heavy sites. That is because the combination of requests and BeautifulSoup works well for static pages, but cannot handle dynamic content. If you are unfamiliar with the difference, take a look at our article on static vs dynamic content.

Other potential blockers include IP bans, rate limiters, TLS fingerprinting, CAPTCHAs, and more. In short, web scraping is not easy—especially now that most websites are equipped to detect and block AI crawlers and bots.

The solution is to utilize a Web Unlocker API built for modern web scraping. Such a service takes care of all the hard stuff for you, including rotating IPs, solving CAPTCHAs, rendering JavaScript, and bypassing bot protection.

All you have to do is pass the URL of the target page to the Web Unlocker API endpoint. The API will return fully unlocked HTML, even if the page relies on JavaScript or is protected by advanced anti-bot systems.

To integrate it into your script, just replace the requests.get() line from Step #3 with the following code:

WEB_UNLOCKER_API_KEY = "<YOUR_WEB_UNLOCKER_API_KEY>"

# Set up authentication headers
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {WEB_UNLOCKER_API_KEY}"
}

# Define the payload with the target URL
payload = {
    "zone": "unblocker",
    "url": "https://www.scrapingcourse.com/ecommerce/product/affirm-water-bottle/", # Replace this with your target URL on a different scraping scenario
    "format": "raw"
}

# Send the request
response = requests.post("https://api.brightdata.com/request", json=payload, headers=headers)

# Get the unlocked HTML
html_content = response.text

For more details, refer to the official Web Unlocker documentation.

With a Web Unlocker in place, you can confidently use Qwen3 to extract structured data from any website—no more blocks, rendering issues, or missing content.

Alternatives to Qwen3 for Web Scraping

Qwen3 is not the only LLM you can use for automated web data parsing. Explore some alternative approaches in the following guides:

Conclusion

In this tutorial, you learned how to run Qwen3 locally using Hugging Face to build an AI-powered web scraper. One of the biggest hurdles in web scraping is getting blocked, but that was addressed using Bright Data’s Web Unlocker API.

As covered earlier, combining Qwen3 with the Web Unlocker API allows you to extract data from virtually any website. All that with no custom parsing logic required. This setup showcases just one of the many powerful use cases made possible by Bright Data’s infrastructure, helping you build scalable, AI-driven web data pipelines.

So, why stop here? Consider exploring Web Scraper APIs—dedicated endpoints for extracting fresh, structured, and fully compliant web data from over 120 popular websites.

Sign up for a free Bright Data account today and start building with AI-ready scraping solutions!

Antonello Zanini

Technical Writer

5.5 years experience

Antonello Zanini is a technical writer, editor, and software engineer with 5M+ views. Expert in technical content strategy, web development, and project management.

Expertise
Web Development Web Scraping AI Integration