In this tutorial, you will see:
- What Qwen3 is and what makes it stand out as an LLM
- Why it is well-suited for web scraping tasks
- How to use Qwen3 locally for web scraping with Hugging Face
- Its main limitations and how to work around them
- A few alternatives to Qwen3 for AI-powered scraping
Let’s dive in!
What Is Qwen3?
Qwen3 is the latest generation of LLMs developed by Alibaba Cloud's Qwen team. The model is open-source and freely explorable on GitHub, available under the Apache 2.0 license. That makes it great for research and development.
The main features of Qwen3 include:
- Hybrid reasoning: It can switch between a “thinking mode” for complex logical reasoning (like math or coding) and a “non-thinking mode” for faster, general-purpose responses. This allows you to control the depth of reasoning for optimal performance and cost efficiency.
- Diverse models: Qwen3 offers a comprehensive suite of models, including dense models (ranging from 0.6B to 32B parameters) and Mixture-of-Experts (MoE) models (like the 30B and 235B variants).
- Enhanced capabilities: It shows significant advancements in reasoning, instruction following, agent capabilities, and multilingual support (covering over 100 languages and dialects).
- Training Data: Qwen3 was trained on a massive dataset of approximately 36 trillion tokens, nearly double that of its predecessor, Qwen2.5.
Why Use Qwen3 for Web Scraping?
Qwen3 makes web scraping easier by automating the interpretation and structuring of unstructured content in HTML pages. That eliminates the need for manual data parsing. Instead of writing complex logic to extract data, the model understands the structure of the page for you.
Relying on Qwen3 for web data parsing is especially useful when dealing with common web scraping challenges like:
- Frequently changing page layouts: A popular example is Amazon, where each product page can present different fields and layouts.
- Unstructured data: Qwen3 can extract valuable information from messy, free-form text without requiring hardcoded selectors or regex logic.
- Difficult-to-parse content: For pages with inconsistent or complex structure, an LLM like Qwen3 removes the need for custom parsing logic.
For a deeper dive, read our guide on using AI for web scraping.
Another major advantage is that Qwen3 is open-source. That means you can run it locally on your own machine for free, without relying on third-party APIs or paying for premium LLMs like OpenAI’s. This gives you full control over your scraping architecture.
How to Perform Web Scraping with Qwen3 in Python
In this section, the target page will be the “Affirm Water Bottle” product page from the “Ecommerce Test Site to Learn Web Scraping” sandbox:
This page serves as a solid example because e-commerce product pages usually have inconsistent structures, displaying varying types of data. That variability is what makes e-commerce web scraping particularly challenging—and also where AI can make a big difference.
Here, we will use a Qwen3-powered scraper to intelligently extract product information without writing manual parsing rules.
Note: This tutorial will show you how to use Hugging Face to run Qwen3 models locally and for free. However, other viable options exist, such as connecting to an LLM provider that hosts Qwen3 models or using a solution like Ollama.
Follow the steps below to start scraping web data using Qwen3!
Step #1: Set Up Your Project
Before getting started, make sure that you have Python 3.10+ installed on your machine. Otherwise, download it and follow the installation instructions.
Next, execute the command below to create a folder for your scraping project:
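Using the `qwen3-scraper` name adopted throughout this tutorial, that command is:

```shell
mkdir qwen3-scraper
```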
The `qwen3-scraper` directory will serve as the project folder for web scraping using Qwen3.
Navigate to the folder in your terminal and initialize a Python virtual environment inside it:
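For example, using Python's standard `venv` module (the environment name `venv` is a common convention):

```shell
cd qwen3-scraper
python -m venv venv
```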
Load the project folder in your preferred Python IDE. Visual Studio Code with the Python extension or PyCharm Community Edition are both excellent options.
Create a `scraper.py` file in the project's folder, which should now contain:
Right now, `scraper.py` is just an empty Python script, but it will soon contain the logic for LLM web scraping.
Then, activate the virtual environment. On Linux or macOS, run:
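Assuming the virtual environment is named `venv`:

```shell
source venv/bin/activate
```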
Equivalently, on Windows, use:
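Again assuming a `venv`-named environment:

```shell
venv\Scripts\activate
```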
Note: The following steps will guide you through installing all the required libraries. If you prefer to install everything at once, you can use the command below now:
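That single command covers every library used later in this tutorial (Transformers, a PyTorch backend, Requests, Beautiful Soup, and markdownify):

```shell
pip install transformers torch requests beautifulsoup4 markdownify
```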
Awesome! Your Python environment is fully set up for web scraping with Qwen3.
Step #2: Configure Qwen3 in Hugging Face
As mentioned at the beginning of this section, we will use Hugging Face to run a Qwen3 model locally. This is now possible because Hugging Face recently added support for Qwen3 models.
First, make sure you are in an activated virtual environment. Then, install the necessary Hugging Face dependencies by running:
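In practice, that means installing `transformers` along with a backend such as PyTorch (the `torch` package here is an assumption; other backends also work):

```shell
pip install transformers torch
```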
Next, in your `scraper.py` file, import the required classes from Hugging Face's `transformers` library:
Now, use those classes to load a tokenizer and the Qwen3 model:
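Bundling the import with the loading step, a sketch using the `Qwen/Qwen3-0.6B` checkpoint might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"

# Load the tokenizer and the model weights (downloaded on first run)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # pick the best dtype for your hardware
    device_map="auto",   # place the model on GPU if available
)
```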
In this case, we are using the `Qwen/Qwen3-0.6B` model, but you can choose from over 40 other Qwen3 models available on Hugging Face.
Awesome! You now have everything in place to utilize Qwen3 in your Python script.
Step #3: Get the HTML of the Target Page
Now, it is time to retrieve the HTML content of the target page. You can achieve that using a powerful Python HTTP client like Requests.
In your activated virtual environment, install the Requests library:
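The command is:

```shell
pip install requests
```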
Then, in your `scraper.py` file, import the library:
Use the `get()` method to send an HTTP GET request to the page URL:
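A sketch of that request (the URL below is an assumption inferred from the "Affirm Water Bottle" page name; adjust it to your actual target):

```python
import requests

# URL of the target product page (assumed from the page name above)
url = "https://www.scrapingcourse.com/ecommerce/product/affirm-water-bottle"
response = requests.get(url)
```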
The server will respond with the raw HTML of the page. To see the full HTML content, you can print `response.content`:
The result should be this HTML string:
You now have the complete HTML of the target page available in Python. Let’s move on to parsing it and extracting the data we need using Qwen3!
Step #4: Convert the Page HTML to Markdown (optional, but recommended)
Note: This step is not strictly required. However, it can save you significant time locally (and money if you are using paid Qwen3 providers). So, it is definitely worth considering.
Take a moment to explore how other AI-powered web scraping tools like Crawl4AI and ScrapeGraphAI handle raw HTML. You will notice they both offer options to convert HTML into Markdown before passing the content to the configured LLM.
Why do they do that? There are two main reasons:
- Cost efficiency: Markdown conversion reduces the number of tokens sent to the AI, helping you save money.
- Faster processing: Less input data means lower computational costs and quicker responses.
For more information, read our guide on why the new AI agents choose Markdown over HTML.
In this case, since Qwen3 runs locally, cost efficiency does not matter, because you are not connected to a third-party LLM provider. What really matters here is faster processing. Why? Because asking the chosen Qwen3 model (one of the smaller models available, by the way) to process the entire HTML page can easily push an i7 CPU to 100% usage for several minutes.
That is too much, as you do not want to overheat or freeze your laptop or PC. So, reducing input size by converting to Markdown makes perfect sense.
Time to replicate the HTML-to-Markdown conversion logic and reduce token usage!
First, open the target webpage in incognito mode to ensure a fresh session. Then, right-click anywhere on the page and select "Inspect" to open the DevTools. Now, examine the page structure. You will see that all relevant data is contained within the HTML element identified by the CSS selector `#main`:
By focusing on the content inside `#main` in the HTML-to-Markdown conversion process, you extract only the part of the page with relevant data. This avoids including headers, footers, and other sections you are not interested in. That way, the final Markdown output will be much shorter.
To select just the HTML in the `#main` element, you need a Python HTML parsing library like Beautiful Soup. In your activated virtual environment, install it with this command:
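The command is:

```shell
pip install beautifulsoup4
```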
If you are not familiar with its API, follow our guide on Beautiful Soup web scraping.
Then, import it in `scraper.py`:
Now, use Beautiful Soup to:
- Parse the raw HTML fetched with Requests
- Select the `#main` element
- Extract its HTML content

Implement the three micro-steps above with this snippet:
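A sketch of those steps, reusing the `response` object from the previous step (the variable names here are illustrative):

```python
from bs4 import BeautifulSoup

# 1. Parse the raw HTML fetched with Requests
soup = BeautifulSoup(response.content, "html.parser")

# 2. Select the #main element via a CSS selector
main_element = soup.select_one("#main")

# 3. Extract its HTML content as a string
main_html = str(main_element)
```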
If you print `main_html`, you will see something like this:
This string is much smaller than the full HTML page, but it still contains around 13,400 characters.
To reduce the size even more without losing important data, convert the extracted HTML to Markdown. First, install the `markdownify` library:
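The command is:

```shell
pip install markdownify
```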
Import `markdownify` in `scraper.py`:
Then, employ it to convert the HTML from `#main` to Markdown:
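Assuming the `main_html` string from the previous step, that looks like:

```python
from markdownify import markdownify

# Convert the #main HTML to a more compact Markdown representation
main_markdown = markdownify(main_html)
```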
The data conversion process should produce an output as below:
The Markdown version is about 2.53 KB, compared to 13.61 KB for the original `#main` HTML. That is an 81% reduction in size! Most importantly, the Markdown version retains all the key data you need to scrape.
With this simple trick, you reduced a bulky HTML snippet into a compact Markdown string. This will speed up local LLM data parsing via Qwen3 a lot!
Step #5: Use Qwen3 for Data Parsing
To get Qwen3 to scrape data correctly, you need to write an effective prompt. Start by analyzing the structure of the target page:
The top section of the page is consistent across all products. On the other hand, the “Additional information” table changes depending on the product. Since you may want your prompt to work across all product pages on the platform, you could describe your task in general terms like so:
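For illustration, such a general-purpose prompt could be defined as a Python string like the one below (the exact wording is an assumption, not a fixed requirement; feel free to adapt it):

```python
# Illustrative prompt: asks for pure JSON and keeps field names generic
# so it works across different product pages
prompt = f"""
Extract the product data from the Markdown content below and return it
as pure JSON, with no extra commentary. Include fields such as name,
price, sku, category, description, and any "Additional information"
attributes you find.

CONTENT:
{main_markdown}
"""
```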
This prompt instructs Qwen3 to extract structured data from the `main_markdown` content. To get reliable results, it is a good idea to make your prompt as clear and specific as possible. That helps the model understand exactly what you expect.
Now, use Hugging Face to run the prompt, as explained in the official documentation:
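Following the usage pattern from the Qwen3 model card on Hugging Face, the generation code might look like this (it assumes the `tokenizer`, `model`, and `prompt` objects from the previous steps):

```python
messages = [{"role": "user", "content": prompt}]

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable the "thinking" mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the response and keep only the newly generated tokens
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

product_raw_string = tokenizer.decode(output_ids, skip_special_tokens=True).strip()
```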
The above code uses `apply_chat_template()` to format the input message and generates a response from the configured Qwen3 model.
Note: A key detail is setting `enable_thinking=False` in `apply_chat_template()`. By default, that option is set to `True`, which activates the model's internal "reasoning" mode. That mode is useful for complex problem-solving but unnecessary and potentially counterproductive for straightforward tasks like web scraping. Disabling it ensures the model focuses purely on extraction without adding explanations or assumptions.
Fantastic! You just instructed Qwen3 to perform web scraping on the target page.
Now, all that remains is tweaking the output and exporting it to JSON.
Step #6: Convert the Qwen3 Output
The output produced by the Qwen3-0.6B model can vary slightly between runs. This is typical behavior for LLMs, especially smaller models like the one used here.
Thus, sometimes the variable `product_raw_string` will contain the desired data as a plain JSON string. Other times, it may wrap the JSON inside a Markdown code block, like this:
To handle both cases, you can use a regular expression to extract the JSON content when it appears inside a Markdown block. Otherwise, treat the string as raw JSON. Then, you can parse the resulting JSON data into a Python dictionary with `json.loads()`:
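A sketch of that logic, wrapped in a small helper (the function name is illustrative):

```python
import json
import re

def parse_llm_json(raw_string):
    # If the output is wrapped in a Markdown code block (```json ... ```),
    # extract its content; otherwise, use the string as-is
    match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", raw_string, re.DOTALL)
    json_string = match.group(1) if match else raw_string

    # Parse the JSON string into a Python dictionary
    return json.loads(json_string)
```

You can then call `product = parse_llm_json(product_raw_string)` to get the final dictionary.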
Here we go! At this point, you parsed the scraped data into a usable Python object. The last step is to export the scraped data to a more user-friendly format.
Step #7: Export the Scraped Data
Now that you have the product data in a Python dictionary, you can save it to a JSON file like this:
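Assuming `product` is the dictionary obtained in the previous step, the export might look like:

```python
import json

# Export the product dictionary to a JSON file
# (`product` is the dictionary parsed in Step #6)
with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product, json_file, indent=4)
```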
This will create a file named `product.json` containing your structured product data.
Well done! Your Qwen3 web scraper is now complete.
Step #8: Put It All Together
Here is the final code of your `scraper.py` Qwen3 scraping script:
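As an illustrative assembly of the sketches from the previous steps (the URL and prompt wording are assumptions, as noted earlier), the complete script might look like this:

```python
import json
import re

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify
from transformers import AutoModelForCausalLM, AutoTokenizer

# Step #2: Load the Qwen3 model and tokenizer
model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Step #3: Fetch the target page (URL assumed from the product page name)
url = "https://www.scrapingcourse.com/ecommerce/product/affirm-water-bottle"
response = requests.get(url)

# Step #4: Extract #main and convert it to Markdown
soup = BeautifulSoup(response.content, "html.parser")
main_element = soup.select_one("#main")
main_markdown = markdownify(str(main_element))

# Step #5: Ask Qwen3 to parse the product data (illustrative prompt)
prompt = f"""
Extract the product data from the Markdown content below and return it
as pure JSON, with no extra commentary.

CONTENT:
{main_markdown}
"""
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
product_raw_string = tokenizer.decode(output_ids, skip_special_tokens=True).strip()

# Step #6: Handle both raw JSON and Markdown-fenced JSON outputs
match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", product_raw_string, re.DOTALL)
json_string = match.group(1) if match else product_raw_string
product = json.loads(json_string)

# Step #7: Export the scraped data to JSON
with open("product.json", "w", encoding="utf-8") as json_file:
    json.dump(product, json_file, indent=4)
```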
Run the script with:
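From the project folder, that is:

```shell
python scraper.py
```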
The first time you run the script, Hugging Face will automatically download the selected Qwen3 model. This model is about 1.5GB, so the download may take some time depending on your internet speed. In the terminal, you will see output like:
The script may take a bit to complete, as PyTorch will stress your CPU to load and run the model.
Once the script finishes, it will create a file named `product.json` in your project folder. Open this file, and you should see structured product data like this:
Note: The exact output may vary slightly due to the nature of LLMs, which can structure the scraped content in different ways.
Et voilà! Your script just transformed raw HTML content into clean, structured JSON. All thanks to Qwen3 web scraping.
Overcoming the Main Limitation of This Approach to Web Scraping
Sure, in our example, everything worked smoothly. But that is only because we were scraping a demo site built specifically for that purpose.
In the real world, most websites are well aware of the value of their public-facing data. Thus, they often implement anti-scraping techniques that can quickly block automated HTTP requests made using tools like `requests`.
Plus, this approach will not work on JavaScript-heavy sites. That is because the combination of `requests` and Beautiful Soup works well for static pages but cannot handle dynamic content. If you are unfamiliar with the difference, take a look at our article on static vs dynamic content.
Other potential blockers include IP bans, rate limiters, TLS fingerprinting, CAPTCHAs, and more. In short, web scraping is not easy—especially now that most websites are equipped to detect and block AI crawlers and bots.
The solution is to use a Web Unlocker API built for modern web scraping with `requests`. Such a service takes care of all the hard stuff for you, including rotating IPs, solving CAPTCHAs, rendering JavaScript, and bypassing bot protection.
All you have to do is pass the URL of the target page to the Web Unlocker API endpoint. The API will return fully unlocked HTML, even if the page relies on JavaScript or is protected by advanced anti-bot systems.
To integrate it into your script, just replace the `requests.get()` line from Step #3 with the following code:
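The exact integration depends on your Bright Data account setup; a sketch based on Bright Data's request API (the endpoint, zone name, and token placeholder below are assumptions to adapt to your configuration):

```python
import requests

# Placeholders: replace with your Bright Data API token and zone name
BRIGHT_DATA_API_TOKEN = "<YOUR_API_TOKEN>"

response = requests.post(
    "https://api.brightdata.com/request",
    headers={"Authorization": f"Bearer {BRIGHT_DATA_API_TOKEN}"},
    json={
        "zone": "<YOUR_WEB_UNLOCKER_ZONE>",
        "url": "https://www.scrapingcourse.com/ecommerce/product/affirm-water-bottle",
        "format": "raw",  # return the unlocked HTML as-is
    },
)
```

Since the API returns the fully unlocked HTML in `response`, the rest of the script works unchanged.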
For more details, refer to the official Web Unlocker documentation.
With a Web Unlocker in place, you can confidently use Qwen3 to extract structured data from any website—no more blocks, rendering issues, or missing content.
Alternatives to Qwen3 for Web Scraping
Qwen3 is not the only LLM you can use for automated web data parsing. Explore some alternative approaches in the following guides:
- Web Scraping With Gemini: Complete Tutorial
- Web Scraping Using Perplexity: Step-By-Step Guide
- LLM Web Scraping with ScrapeGraphAI
- How to Build an AI Scraper With Crawl4AI and DeepSeek
- Web Scraping with LLaMA 3: Turn Any Website into Structured JSON
Conclusion
In this tutorial, you learned how to run Qwen3 locally using Hugging Face to build an AI-powered web scraper. One of the biggest hurdles in web scraping is getting blocked, but that was addressed using Bright Data’s Web Unlocker API.
As covered earlier, combining Qwen3 with the Web Unlocker API allows you to extract data from virtually any website. All that with no custom parsing logic required. This setup showcases just one of the many powerful use cases made possible by Bright Data’s infrastructure, helping you build scalable, AI-driven web data pipelines.
So, why stop here? Consider exploring Web Scraper APIs—dedicated endpoints for extracting fresh, structured, and fully compliant web data from over 120 popular websites.
Sign up for a free Bright Data account today and start building with AI-ready scraping solutions!