In this guide, you will learn:
- What llm-scraper is
- How to use it in a step-by-step walkthrough
- How to use it for code generation
- What the main alternatives are for LLM-based scraping
- Its key limitations and how to overcome them
Let’s dive in!
What Is llm-scraper?
llm-scraper is a TypeScript library that uses LLMs to extract structured data from any webpage.
Instead of writing custom parsing logic for each site, you simply define a schema, and the package employs an LLM to intelligently fill it by analyzing the page’s content.
The library was first released in mid-2024, so it is still quite new. However, it has already gained over 4,000 stars on GitHub, showing how quickly it is becoming popular:
It is ideal for dynamic or inconsistent websites (like e-commerce) where traditional scrapers often break. In those scenarios, LLM scraping shines.
In detail, the main features supported by llm-scraper are:
- Integration with multiple LLM providers: Works with local models (like Ollama or GGUF) and cloud providers (like OpenAI), via the Vercel AI SDK.
- Schema-based data extraction: Define what you want to extract using Zod schemas for strong structure and validation.
- Full type-safety: Designed for TypeScript, so you get complete type-checking at compile time.
- Built on Playwright: Uses Playwright under the hood to control the browser and fetch page content.
- Streaming support: Can stream objects during scraping instead of waiting for the full extraction to complete.
- Code generation: Can generate scraping code dynamically based on your schema and target.
- Multiple page content formatting options: You can choose how the page content is sent to the LLM (as HTML, text, Markdown, a cleaned-up version, or a screenshot).
How To Use llm-scraper for Web Scraping
In this tutorial section, you will learn how to use the llm-scraper library to build an AI-powered scraper. The target site will be an e-commerce product page from the ToScrape website:
This is a great example because scraping e-commerce sites can be tricky. Their page structures often change, and different products can have different layouts and information. Because of that, writing static parsing logic is difficult and not very effective. Thus, AI can definitely help.
Follow the steps below to build an LLM-powered scraper!
Prerequisites
To follow this tutorial, you will need:
- Node.js installed on your machine
- An OpenAI API key (or a key from a similar provider like Groq)
- Basic understanding of TypeScript and asynchronous programming
Step #1: Project Setup
Before getting started, make sure you have the latest LTS version of Node.js installed locally. If not, download it from the official site and install it.
Now, create a project for your scraper and navigate to it in the terminal:
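For example, using the llm-scraper-project folder name referenced below:

```bash
mkdir llm-scraper-project
cd llm-scraper-project
```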
Inside your project, run the following command to initialize a blank Node.js project:
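The standard command for that is:

```bash
npm init -y
```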
Open the package.json file created by the command above and make sure it contains:
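Assuming the ES-module setup used in the rest of this tutorial, the key field to check for is the "type" entry (the other fields generated by npm init can stay as they are):

```json
{
  "type": "module"
}
```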
Load the project’s folder llm-scraper-project in your favorite TypeScript IDE, such as Visual Studio Code.
Next, install TypeScript as a dev dependency:
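For example (here, @types/node is also added so that Node.js APIs such as fs are typed):

```bash
npm install --save-dev typescript @types/node
```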
With TypeScript installed, initialize your TypeScript project by running:
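The standard initialization command is:

```bash
npx tsc --init
```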
A tsconfig.json file will appear in your project’s folder. Open it, and replace it with:
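The exact options are up to you. A minimal configuration that works for the ES-module setup assumed in this tutorial is:

```json
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "esModuleInterop": true,
    "strict": true,
    "skipLibCheck": true
  }
}
```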
Now, add a scraper.ts file to your project:
This file will soon contain your llm-scraper data extraction logic. Since the script will use asynchronous TypeScript logic, initialize an async function inside it:
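A minimal skeleton looks like this (the llmScraping() name matches the function referenced in the following steps):

```typescript
async function llmScraping() {
  // the scraping logic will be added here step by step...
}

llmScraping();
```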
Wonderful! You are fully set up and ready to start building your AI-powered scraper.
Step #2: Install the Scraping Libraries
To work, llm-scraper relies on the following two additional libraries:
- Zod: A TypeScript-first schema declaration and validation library.
- Playwright: A library to automate Chromium, Firefox, and WebKit browsers with a single API.
Install them together with llm-scraper using the following command:
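In a single npm command, that is:

```bash
npm install zod playwright llm-scraper
```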
Playwright requires some extra dependencies (such as the browser binaries). Install them with:
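The Playwright CLI takes care of that:

```bash
npx playwright install
```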
Now, import Zod and Playwright in scraper.ts:
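Assuming the ES-module import style used throughout this tutorial:

```typescript
import { z } from "zod";
import { chromium } from "playwright";
```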
Great! You now have all the required libraries to get started with LLM web scraping in TypeScript.
Step #3: Set Up OpenAI
llm-scraper supports several LLM providers, including OpenAI, Groq, Ollama, and GGUF. In this case, we are going to use OpenAI. If you have not done so already, make sure to get your OpenAI API key.
First, install the OpenAI JavaScript client:
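Based on the @ai-sdk/openai package referenced later in this step:

```bash
npm install @ai-sdk/openai
```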
Then, import it into your code and use it to initialize your LLM model inside the llmScraping() function:
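A minimal sketch, following the library's OpenAI integration (the gpt-4o-mini model name is just an example; any chat model you have access to works):

```typescript
import { openai } from "@ai-sdk/openai";

async function llmScraping() {
  // initialize the LLM that llm-scraper will use for data extraction
  const llm = openai.chat("gpt-4o-mini");

  // ...
}
```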
For a different integration, refer to the official llm-scraper documentation.
To avoid hard-coding the OpenAI key in your code, install dotenv:
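As usual, via npm:

```bash
npm install dotenv
```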
Import dotenv in your scraper.ts file and call dotenv.config() to load the environment variables:
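For example:

```typescript
import dotenv from "dotenv";

// load the environment variables defined in the .env file
dotenv.config();
```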
This enables you to load environment variables, like your OpenAI API key, from a .env file. Thus, add a .env file to your project:
Open it and add your OpenAI API key like so:
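For example (note that the Vercel AI SDK's OpenAI provider looks for the OPENAI_API_KEY variable by default):

```
OPENAI_API_KEY="<YOUR_OPENAI_KEY>"
```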
Replace <YOUR_OPENAI_KEY> with the value of your OpenAI key.
Note that you do not need to manually read the variable in your code. That is because @ai-sdk/openai automatically reads the OPENAI_API_KEY environment variable.
Amazing! LLM integration completed.
Step #4: Connect to the Target Page
llm-scraper relies on Playwright as the browser automation engine to extract the HTML content of web pages. To get started, add the following lines of code inside llmScraping() to:
- Initialize a Chromium browser
- Open a new page
- Instruct Playwright to visit the target page
Achieve that with:
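A minimal sketch of those three steps (the product URL below is just an example page from books.toscrape.com; point it to whichever product you want to scrape):

```typescript
// initialize a Chromium browser instance
const browser = await chromium.launch();

// open a new page in the controlled browser
const page = await browser.newPage();

// navigate to the target e-commerce product page
await page.goto(
  "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
);
```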
At the end, do not forget to close the browser and release its resources:
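In code:

```typescript
// close the page and the browser, releasing their resources
await page.close();
await browser.close();
```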
If you are not familiar with this process, read our guide on Playwright web scraping.
Step #5: Define the Data Schema
Now, llm-scraper works by feeding the underlying LLM a prompt built from the page content retrieved via Playwright, asking it to extract structured data as defined in a specific data model.
This is where Zod comes in, helping you specify that data model in TypeScript. To understand how the schema of your scraped data should look, open the target site in your browser and start by analyzing the top level of the page:
From here, you should focus on extracting the following data:
- Title
- Price
- Stock status
- Quantity
- Description
Next, move to the last section of the page:
Here, you will be interested in:
- UPC (Universal Product Code)
- Product type
- Tax
- Number of reviews
Put it all together, and you will have the following product schema:
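A possible Zod schema covering those fields (field names and descriptions are illustrative and can be adapted to your needs):

```typescript
const productSchema = z
  .object({
    title: z.string().describe("The name of the product"),
    price: z.string().describe("The price of the product, including the currency symbol"),
    stock_status: z.string().describe("The availability status of the product"),
    quantity: z.number().describe("The number of items available in stock"),
    description: z.string().describe("The description of the product"),
    upc: z.string().describe("The UPC (Universal Product Code) of the product"),
    product_type: z.string().describe("The type or category of the product"),
    tax: z.string().describe("The tax applied to the product"),
    number_of_reviews: z.number().describe("The number of reviews of the product"),
  })
  .describe("Structured information about a single e-commerce product");
```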
Tip: Do not forget to describe your schema, as this helps the LLM model understand what it should do with the data.
Fantastic! You are ready to launch the AI-powered scraping task in llm-scraper.
Step #6: Run the Scraping Task
Use the LLM integration you defined in Step #3 to create an LLMScraper object:
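Based on the library's documented usage, this is as simple as (note the default import from llm-scraper):

```typescript
import LLMScraper from "llm-scraper";

// inside llmScraping(), after initializing llm:
const scraper = new LLMScraper(llm);
```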
This is the main object exposed by the llm-scraper library, and it is responsible for performing the AI-powered scraping tasks.
Then, launch the scraper as follows:
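A sketch of the call, using the Markdown format discussed right below:

```typescript
// run the AI-powered extraction task on the current page
const { data } = await scraper.run(page, productSchema, {
  format: "markdown",
});
```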
The format parameter defines how the page content is passed to the LLM. The possible values are:
- "html": The raw HTML of the page.
- "text": All the text content extracted from the page.
- "markdown": The HTML content converted to Markdown.
- "cleanup": A cleaned-up version of the text extracted from the page.
- "image": A screenshot of the page.
Note: You can also provide a custom function to control the content formatting if needed.
As discussed in the “Why Are the New AI Agents Choosing Markdown Over HTML?” guide, using the Markdown format is a smart choice because it helps save tokens and speed up the scraping process.
Finally, the scraper.run() function returns an object that matches your expected Zod schema.
Perfect! Your AI-powered scraping task is complete.
Step #7: Export the Scraped Data
Currently, the scraped data is stored in a JavaScript object. To make the data easier to access, analyze, and share, export it to a JSON file as below:
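For example, with Node.js's built-in fs module:

```typescript
// serialize the scraped data and write it to a JSON file
fs.writeFileSync("product.json", JSON.stringify(data, null, 2));
```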
Note that you do not need any external libraries for this. Just make sure to add the following import at the top of your scraper.ts file:
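That import is:

```typescript
import * as fs from "fs";
```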
Step #8: Put It All Together
scraper.ts should now contain:
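Assembling the snippets from the previous steps, a complete sketch of the script looks roughly like this (model name and product URL are examples, as noted earlier):

```typescript
import dotenv from "dotenv";
import { chromium } from "playwright";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";
import LLMScraper from "llm-scraper";
import * as fs from "fs";

// load the environment variables from the .env file
dotenv.config();

async function llmScraping() {
  // initialize the LLM that will power the scraper
  const llm = openai.chat("gpt-4o-mini");

  // launch a Chromium browser and open the target product page
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
  );

  // the data model describing the product data to extract
  const productSchema = z
    .object({
      title: z.string().describe("The name of the product"),
      price: z.string().describe("The price of the product, including the currency symbol"),
      stock_status: z.string().describe("The availability status of the product"),
      quantity: z.number().describe("The number of items available in stock"),
      description: z.string().describe("The description of the product"),
      upc: z.string().describe("The UPC (Universal Product Code) of the product"),
      product_type: z.string().describe("The type or category of the product"),
      tax: z.string().describe("The tax applied to the product"),
      number_of_reviews: z.number().describe("The number of reviews of the product"),
    })
    .describe("Structured information about a single e-commerce product");

  // create the AI-powered scraper and run the extraction task
  const scraper = new LLMScraper(llm);
  const { data } = await scraper.run(page, productSchema, {
    format: "markdown",
  });

  // export the scraped data to a JSON file
  fs.writeFileSync("product.json", JSON.stringify(data, null, 2));

  // close the browser and release its resources
  await page.close();
  await browser.close();
}

llmScraping();
```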
As you can see, llm-scraper allows you to define an AI-powered scraping script in a handful of lines of code.
Compile your script from TypeScript to JavaScript with this command:
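From the project folder, run:

```bash
npx tsc
```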
A scraper.js file will appear in your project’s folder. Execute it with:
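As a plain Node.js script:

```bash
node scraper.js
```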
When the script finishes running, a file called product.json will appear in your project folder.
Open it, and you will see something like this:
This file contains exactly the information displayed on the product page you targeted. As you can see, the data was extracted without needing any custom parsing logic, thanks to the power of LLMs. Well done!
Extra: Code Generation With llm-scraper
llm-scraper also has the ability to generate the underlying Playwright data parsing logic, given the schema. This is made possible by the generate() function.
See an example in the snippet below:
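Here is a sketch based on the library's code-generation API, reusing the page and productSchema objects defined earlier (exact behavior may vary by version):

```typescript
// ask the LLM to generate Playwright parsing code for the given schema
const { code } = await scraper.generate(page, productSchema);

console.log(code);
```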
As you can see, it takes the Playwright page object and the Zod schema, then returns a string containing the generated Playwright code. In this case, the output is:
You can then execute this generated JavaScript code programmatically and parse the result with:
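For example, by evaluating the generated code in the page context and validating the output against the Zod schema (if your version of the library returns a JSON string, apply JSON.parse() before validating):

```typescript
// run the generated parsing code inside the page
const result = await page.evaluate(code);

// validate the raw result against the schema
const data = productSchema.parse(result);

console.log(data);
```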
The data object will contain the same result as the data produced in Step #6 of the previous chapter.
llm-scraper Alternatives for LLM Scraping
llm-scraper is not the only library available for LLM-powered scraping. Some other noteworthy alternatives include:
- Crawl4AI: A Python library to build blazing-fast, AI-ready web crawling agents and data pipelines. It is highly flexible and optimized for developers to deploy with speed and precision. You can see it in action in our tutorial on Crawl4AI scraping.
- ScrapeGraphAI: A Python web scraping library that combines LLMs and direct graph logic to build scraping pipelines for websites and local documents (like XML, HTML, JSON, and Markdown). Check it out in our guide on scraping with ScrapeGraphAI.
Limitations of This Approach to Web Scraping
ToScrape, the target site we used in this article, is (as the name suggests) just a scraping sandbox that welcomes scraping scripts. Unfortunately, when using llm-scraper against real-world websites, things are likely to get much more challenging…
Why? Because e-commerce companies and online businesses know how valuable their data is, and they go to great lengths to protect it. That is true even if that data is publicly available on their product pages.
As a result, most e-commerce platforms implement anti-bot and anti-scraping measures to block automated crawlers. These techniques can stop even scrapers built on browser automation tools like Playwright, which is exactly what llm-scraper relies on.
We are talking about defenses like the infamous Amazon CAPTCHA, which is enough to stop most bots:
Now, even if you manage to bypass CAPTCHAs with Playwright, other challenges like IP bans caused by too many automated requests can shut down your scraping operation.
At this point, the solution is not about endlessly tweaking your script to make it more complex. The idea is to use the right tools.
By integrating Playwright with a browser specifically designed for web scraping—like Scraping Browser—everything becomes much easier. This solution is a cloud-based browser optimized for scraping. It handles IP rotation, automatic retries, advanced anti-bot bypass mechanisms, and even built-in CAPTCHA solving, all without the need to manage infrastructure yourself.
Integrate Scraping Browser with Playwright in llm-scraper just like any other browser, as explained in our docs.
Conclusion
In this blog post, you learned what llm-scraper has to offer and how to use it to build an AI-powered scraping script in TypeScript. Thanks to its integration with LLMs, you can scrape sites with complex or dynamic page structures.
As we discussed, the most effective way to avoid getting blocked is by utilizing it together with Bright Data’s Scraping Browser, which comes with a built-in CAPTCHA solver and many other anti-bot bypass capabilities.
If you are interested in building an AI agent directly based on that solution, check out Agent Browser. This solution executes agent-driven workflows on remote browsers that never get blocked. It is infinitely scalable and is powered by the world’s most reliable proxy network.
Create a free Bright Data account today and explore our data and scraping solutions to power your AI journey!
No credit card required