
AI-Powered Web Scraping With llm-scraper

Discover how to build an AI-powered web scraper with llm-scraper, extract structured data, and generate scraping code using LLMs and TypeScript.
Web Scraping With llm-scraper

In this guide, you will learn:

  • What llm-scraper is
  • How to use it in a step-by-step walkthrough
  • How to use it for code generation
  • What the main alternatives are for LLM-based scraping
  • Its key limitations and how to overcome them

Let’s dive in!

What Is llm-scraper?

llm-scraper is a TypeScript library that uses LLMs to extract structured data from any webpage.
Instead of writing custom parsing logic for each site, you simply define a schema, and the package employs an LLM to intelligently fill it by analyzing the page’s content.

The library was first released in mid-2024, so it is still quite new. However, it has already gained over 4,000 stars on GitHub, showing how quickly it is becoming popular:

The GitHub star evolution of llm-scraper

It is ideal for dynamic or inconsistent websites (like e-commerce) where traditional scrapers often break. In those scenarios, LLM scraping shines.

In detail, the main features supported by llm-scraper are:

  • Integration with multiple LLM providers: Works with local models (like Ollama or GGUF) and cloud providers (like OpenAI and Groq) through the Vercel AI SDK.
  • Schema-based data extraction: Define what you want to extract using Zod schemas for strong structure and validation.
  • Full type-safety: Designed for TypeScript, so you get complete type-checking at compile time.
  • Built on Playwright: Uses Playwright under the hood to control the browser and fetch page content.
  • Streaming support: Can stream objects during scraping instead of waiting for the full extraction to complete (see the sketch after this list).
  • Code generation: Can generate scraping code dynamically based on your schema and target.
  • Multiple page content formatting options: You can choose how the page content is sent to the LLM (more on this in Step #6).
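
As an example of the streaming feature, the library's documentation shows that you can replace run() with stream() to consume partial objects as the LLM produces them. A minimal sketch, assuming a scraper instance, a page, and a Zod schema like the ones built later in this tutorial:

// replace run() with stream() to get partial objects as they arrive
const { stream } = await scraper.stream(page, schema)

for await (const partialData of stream) {
  console.log(partialData) // prints the object as it is progressively filled
}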

How To Use llm-scraper for Web Scraping

In this tutorial section, you will learn how to use the llm-scraper library to build an AI-powered scraper. The target site will be an e-commerce product page from the ToScrape website:

The target site

This is a great example because scraping e-commerce sites can be tricky. Their page structures often change, and different products can have different layouts and information. Because of that, writing static parsing logic is difficult and not very effective. Thus, AI can definitely help.

Follow the steps below to build an LLM-powered scraper!

Prerequisites

To follow this tutorial, you will need:

  • Node.js installed on your machine
  • An OpenAI API key (or a key from a similar provider like Groq)
  • Basic understanding of TypeScript and asynchronous programming

Step #1: Project Setup

Before getting started, make sure you have the latest LTS version of Node.js installed locally. If not, download it from the official site and install it.

Now, create a project for your scraper and navigate to it in the terminal:

mkdir llm-scraper-project
cd llm-scraper-project

Inside your project, run the following command to initialize a blank Node.js project:

npm init -y

Open the package.json file created by the command above and make sure it contains:

"type": "module"

Load the project’s folder llm-scraper-project in your favorite TypeScript IDE, such as Visual Studio Code.

Next, install TypeScript as a dev dependency:

npm install typescript --save-dev

With TypeScript installed, initialize your TypeScript project by running:

npx tsc --init

A tsconfig.json file will appear in your project’s folder. Open it, and replace its contents with:

{
  "compilerOptions": {
    "module": "ESNext",
    "target": "ESNext",
    "moduleResolution": "node",
    "esModuleInterop": true,
    "allowSyntheticDefaultImports": true,
    "strict": true,
    "skipLibCheck": true
  }
}

Now, add a scraper.ts file to your project:

The llm-scraper project file structure

This file will soon contain your llm-scraper data extraction logic. Since the script will use asynchronous TypeScript logic, initialize an async function inside it:

async function llmScraping() {
  // your LLM scraping logic...
}

llmScraping()

Wonderful! You are fully set up and ready to start building your AI-powered scraper.

Step #2: Install the Scraping Libraries

Under the hood, llm-scraper relies on the following two additional libraries:

  • Zod: A TypeScript-first schema declaration and validation library.
  • Playwright: A library to automate Chromium, Firefox, and WebKit browsers with a single API.

Install them together with llm-scraper using the following command:

npm install zod playwright llm-scraper

Playwright requires some extra dependencies (such as the browser binaries). Install them with:

npx playwright install
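
Tip: If you only need Chromium, as in this tutorial, you can limit the download to that single browser:

npx playwright install chromium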

Now, import Zod, Playwright, and llm-scraper in scraper.ts:

import { z } from "zod"
import { chromium } from "playwright"
import LLMScraper from "llm-scraper"

Great! You now have all the required libraries to get started with LLM web scraping in TypeScript.

Step #3: Set Up OpenAI

llm-scraper supports several LLM providers, including OpenAI, Groq, Ollama, and GGUF. In this case, we are going to use OpenAI. If you have not done so already, make sure to get your OpenAI API key.

First, install the OpenAI provider package for the Vercel AI SDK:

npm install @ai-sdk/openai

Then, import it into your code and use it to initialize your LLM model inside the llmScraping() function:

// other imports...
import { openai } from "@ai-sdk/openai"

// ...

const llm = openai.chat("gpt-4o")

For a different integration, refer to the official llm-scraper documentation.
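
For example, a Groq-hosted model can be plugged in through the same AI SDK interface, since Groq exposes an OpenAI-compatible API. A minimal sketch, assuming a GROQ_API_KEY environment variable and using an example model name (check Groq’s documentation for currently available models):

// other imports...
import { createOpenAI } from "@ai-sdk/openai"

// OpenAI-compatible client pointed at Groq's endpoint
const groqClient = createOpenAI({
  baseURL: "https://api.groq.com/openai/v1",
  apiKey: process.env.GROQ_API_KEY,
})

const llm = groqClient.chat("llama-3.3-70b-versatile")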

To avoid hard-coding the OpenAI key in your code, install dotenv:

npm install dotenv

Import dotenv in your scraper.ts file and call dotenv.config() to load the environment variables:

// other imports...
import * as dotenv from "dotenv"

// ...

dotenv.config()

This enables you to load environment variables, like your OpenAI API key, from a .env file. Now, add a .env file to your project:

Adding the .env file to your project

Open it and add your OpenAI API key like so:

OPENAI_API_KEY="<YOUR_OPENAI_KEY>"

Replace <YOUR_OPENAI_KEY> with the value of your OpenAI key.
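
If your project is under version control, also add .env to your .gitignore file so the key is never committed.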

Note that you do not need to manually read the variable in your code. That is because @ai-sdk/openai automatically attempts to read the OPENAI_API_KEY environment variable.

Amazing! LLM integration completed.

Step #4: Connect to the Target Page

llm-scraper relies on Playwright as the browser automation engine to extract the HTML content of web pages. To get started, add the following lines of code inside llmScraping() to:

  1. Initialize a Chromium browser
  2. Open a new page
  3. Instruct Playwright to visit the target page

Achieve that with:

const browser = await chromium.launch()
const page = await browser.newPage()

await page.goto("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

At the end, do not forget to close the browser and release its resources:

await page.close()
await browser.close()

If you are not familiar with this process, read our guide on Playwright web scraping.
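
One practical note: if anything throws mid-scrape, the cleanup lines above will never run and the browser process will stay alive. A common pattern is to wrap the scraping logic in try/finally so the browser is always closed:

const browser = await chromium.launch()
try {
  const page = await browser.newPage()
  await page.goto("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")
  // ...scraping logic...
} finally {
  // closing the browser also closes any pages it owns
  await browser.close()
}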

Step #5: Define the Data Schema

llm-scraper works by feeding the underlying LLM a prompt built from the page content retrieved via Playwright, instructing the model to extract structured data that matches a specific data model.

This is where Zod comes in, helping you specify that data model in TypeScript. To understand how the schema of your scraped data should look, open the target site in your browser and start by analyzing the top level of the page:

The top level of the target page

From here, you should focus on extracting the following data:

  • Title
  • Price
  • Stock status
  • Quantity
  • Description

Next, move to the last section of the page:

The last section of the target page

Here, you will be interested in:

  • UPC (Universal Product Code)
  • Product type
  • Tax
  • Number of reviews

Put it all together, and you will have the following product schema:

const productSchema = z.object({
  title: z.string().describe("The name of the product"),
  price: z.string().describe("The price of the product, typically formatted as a string like '£19.99'"),
  stock: z.string().describe("The availability status of the product, such as 'In Stock' or 'Out of Stock'"),
  quantity: z.string().describe("The specific quantity of products available in stock"),
  description: z.string().describe("A detailed description of the product, including features and specifications"),
  upc: z.string().describe("The Universal Product Code (UPC) to uniquely identify the product"),
  productType: z.string().describe("The category or type of the product, such as 'Books', 'Clothing', etc."),
  tax: z.string().describe("Information about the applicable tax amount for the product"),
  reviews: z.number().describe("The number of reviews the product has received"),
})

Tip: Do not forget to describe your schema, as this helps the LLM model understand what it should do with the data.
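
Another benefit of using Zod is full type-safety: you can derive a static TypeScript type directly from the schema, so the scraped object is checked at compile time:

// derive the static type of the scraped object from the schema
type Product = z.infer<typeof productSchema>
// -> { title: string; price: string; ...; reviews: number }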

Fantastic! You are ready to launch the AI-powered scraping task in llm-scraper.

Step #6: Run the Scraping Task

Use the LLM integration you defined in Step 3 to create an LLMScraper object:

const scraper = new LLMScraper(llm)

This is the main object exposed by the llm-scraper library, and it is responsible for performing the AI-powered scraping tasks.

Then, launch the scraper as follows:

const { data } = await scraper.run(page, productSchema, {
  format: "markdown",
})

The format parameter defines how the page content is passed to the LLM. The possible values are:

  • "html": The raw HTML of the page.
  • "text": All the text content extracted from the page.
  • "markdown": The HTML content converted to Markdown.
  • "cleanup": A cleaned-up version of the text extracted from the page.
  • "image": A screenshot of the page.

Note: You can also provide a custom function to control the content formatting if needed.
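
As a rough sketch of that custom option, based on the formatFunction parameter described in the library’s documentation (verify the exact option names and signature for your version), you could send the LLM only the content of a specific page region:

const { data } = await scraper.run(page, productSchema, {
  format: "custom",
  // hypothetical sketch: receives the Playwright page and returns the string sent to the LLM
  formatFunction: async (page) => {
    return await page.innerText("#content_inner")
  },
})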

As discussed in the “Why Are the New AI Agents Choosing Markdown Over HTML?” guide, using the Markdown format is a smart choice because it helps save tokens and speed up the scraping process.

Finally, the scraper.run() function returns an object that matches your expected Zod schema.
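
Since the returned object is typed against the schema, you also get autocomplete and compile-time checks when accessing its fields:

// `data` is typed according to productSchema
console.log(data.title) // e.g. "A Light in the Attic"
console.log(data.reviews) // e.g. 0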

Perfect! Your AI-powered scraping task is complete.

Step #7: Export the Scraped Data

Currently, the scraped data is stored in a JavaScript object. To make the data easier to access, analyze, and share, export it to a JSON file as below:

const jsonData = JSON.stringify(data, null, 4)
await fs.writeFile("product.json", jsonData, "utf8")

Note that you do not need any external libraries for this. Just make sure to add the following import at the top of your scraper.ts file:

import { promises as fs } from "fs"

Step #8: Put It All Together

scraper.ts should now contain:

import { z } from "zod"
import { chromium } from "playwright"
import LLMScraper from "llm-scraper"
import { openai } from "@ai-sdk/openai"
import * as dotenv from "dotenv"
import { promises as fs } from "fs"

// load the environment variables from the local .env file
dotenv.config()

async function llmScraping() {
  // initialize the LLM engine
  const llm = openai.chat("gpt-4o")

  // launch a browser instance and open a new page
  const browser = await chromium.launch()
  const page = await browser.newPage()

  // navigate to the target e-commerce product page
  await page.goto("https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html")

  // define the product schema
  const productSchema = z.object({
    title: z.string().describe("The name of the product"),
    price: z.string().describe("The price of the product, typically formatted as a string like '£19.99'"),
    stock: z.string().describe("The availability status of the product, such as 'In Stock' or 'Out of Stock'"),
    quantity: z.string().describe("The specific quantity of products available in stock"),
    description: z.string().describe("A detailed description of the product, including features and specifications"),
    upc: z.string().describe("The Universal Product Code (UPC) to uniquely identify the product"),
    productType: z.string().describe("The category or type of the product, such as 'Books', 'Clothing', etc."),
    tax: z.string().describe("Information about the applicable tax amount for the product"),
    reviews: z.number().describe("The number of reviews the product has received"),
  })

  // create a new LLMScraper instance
  const scraper = new LLMScraper(llm)

  // run the LLM scraper
  const { data } = await scraper.run(page, productSchema, {
    format: "markdown", // or "html", "text", etc.
  })

  // convert the scraped data to a JSON string
  const jsonData = JSON.stringify(data, null, 4)
  // populate an output file with the JSON string
  await fs.writeFile("product.json", jsonData, "utf8")

  // close the page and the browser and release their resources
  await page.close()
  await browser.close()
}

llmScraping()

As you can see, llm-scraper allows you to define a JavaScript-based scraping script in a handful of lines of code.

Compile your script from TypeScript to JavaScript with this command:

npx tsc

A scraper.js file will appear in your project’s folder. Execute it with:

node scraper.js
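
Alternatively, during development you can skip the separate compilation step and run the TypeScript file directly with a runner such as tsx:

npx tsx scraper.ts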

When the script finishes running, a file called product.json will appear in your project folder.

Open it, and you will see something like this:

{
  "title": "A Light in the Attic",
  "price": "£51.77",
  "stock": "In Stock",
  "quantity": "22",
  "description": "It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? Rockabye Rockabye baby, in the treetop Don't you know a treetop Is no safe place to rock? And who put you up there, And your cradle, too? Baby, I think someone down here's Got it in for you. Shel, you never sounded so good.",
  "upc": "a897fe39b1053632",
  "productType": "Books",
  "tax": "£0.00",
  "reviews": 0
}

This file contains exactly the information displayed on the product page you targeted. As you can see, the data was extracted without needing any custom parsing logic, thanks to the power of LLMs. Well done!

Extra: Code Generation With llm-scraper

llm-scraper also has the ability to generate the underlying Playwright data parsing logic, given the schema. This is made possible by the generate() function.

See an example in the snippet below:

const { code } = await scraper.generate(page, productSchema)

As you can see, it takes the Playwright page object and the Zod schema, then returns a string containing the generated Playwright code. In this case, the output is:

(function () {
  function extractData() {
    const title = document.querySelector('h1').innerText;
    const price = document.querySelector('.price_color').innerText;
    const stockText = document.querySelector('.instock.availability').innerText.trim();
    const stock = stockText.includes('In stock') ? 'In Stock' : 'Out of Stock';
    const quantityMatch = stockText.match(/\d+/);
    const quantity = quantityMatch ? quantityMatch[0] : '0';
    const description = document.querySelector('#product_description ~ p').innerText;
    const upc = document.querySelector('th:contains("UPC") + td').innerText;
    const productType = document.querySelector('th:contains("Product Type") + td').innerText;
    const tax = document.querySelector('th:contains("Tax") + td').innerText;
    const reviews = parseInt(document.querySelector('th:contains("Number of reviews") + td').innerText, 10);

    return {
      title,
      price,
      stock,
      quantity,
      description,
      upc,
      productType,
      tax,
      reviews
    };
  }

  const data = extractData();
  console.log(data);
})()

You can then execute this generated JavaScript code programmatically and validate the result against your Zod schema:

const result = await page.evaluate(code)
const data = productSchema.parse(result)

The data object will contain the same result as the data produced in Step #6 of the previous chapter. Keep in mind, though, that LLM-generated code is not guaranteed to run as-is. For instance, the snippet above uses jQuery-style :contains() selectors, which document.querySelector() does not support, so always review the generated logic before relying on it.

llm-scraper Alternatives for LLM Scraping

llm-scraper is not the only library available for LLM-powered scraping. Some other noteworthy alternatives include:

  • Crawl4AI: A Python library to build blazing-fast, AI-ready web crawling agents and data pipelines. It is highly flexible and optimized for developers to deploy with speed and precision. You can see it in action in our tutorial on Crawl4AI scraping.
  • ScrapeGraphAI: A Python web scraping library that combines LLMs and direct graph logic to build scraping pipelines for websites and local documents (like XML, HTML, JSON, and Markdown). Check it out in our guide on scraping with ScrapeGraphAI.

Limitations to This Approach to Web Scraping

ToScrape, the target site we used in this article, is—as the name suggests—just a scraping sandbox that welcomes scraping scripts. Unfortunately, when using llm-scraper against real-world websites, things are likely to get much more challenging…

Why? Because e-commerce companies and online businesses know how valuable their data is, and they go to great lengths to protect it. That is true even if that data is publicly available on their product pages.

As a result, most e-commerce platforms implement anti-bot and anti-scraping measures to block automated crawlers. These techniques can stop even scrapers built on browser automation tools like Playwright, which is exactly what llm-scraper relies on.

We are talking about defenses like the infamous Amazon CAPTCHA, which is enough to stop most bots:

A CAPTCHA verification screen from Amazon asking users to enter characters displayed in a distorted text image. It includes instructions and options to continue or try a different image.

Now, even if you manage to bypass CAPTCHAs with Playwright, other challenges like IP bans caused by too many automated requests can shut down your scraping operation.

At this point, the solution is not about endlessly tweaking your script to make it more complex. The idea is to use the right tools.

By integrating Playwright with a browser specifically designed for web scraping—like Scraping Browser—everything becomes much easier. This solution is a cloud-based browser optimized for scraping. It handles IP rotation, automatic retries, advanced anti-bot bypass mechanisms, and even built-in CAPTCHA solving, all without the need to manage infrastructure yourself.

Integrate Scraping Browser with Playwright in llm-scraper just like any other browser as explained in our docs.
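
At the Playwright level, the integration boils down to connecting to the remote browser over CDP instead of launching a local Chromium instance. A minimal sketch, where SBR_CDP_URL is a hypothetical environment variable holding the connection string from your provider’s dashboard:

// connect to the remote scraping browser instead of launching Chromium locally
// (SBR_CDP_URL is a hypothetical variable name; use your real connection string)
const browser = await chromium.connectOverCDP(process.env.SBR_CDP_URL!)
const page = await browser.newPage()
// ...then pass `page` to scraper.run() exactly as before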

Conclusion

In this blog post, you learned what llm-scraper has to offer and how to use it to build an AI-powered scraping script in TypeScript. Thanks to its integration with LLMs, you can scrape sites with complex or dynamic page structures.

As we discussed, the most effective way to avoid getting blocked is by utilizing it together with Bright Data’s Scraping Browser, which comes with a built-in CAPTCHA solver and many other anti-bot bypass capabilities.

If you are interested in building an AI agent directly based on that solution, check out Agent Browser. This solution executes agent-driven workflows on remote browsers that never get blocked. It is infinitely scalable and is powered by the world’s most reliable proxy network.

Create a free Bright Data account today and explore our data and scraping solutions to power your AI journey!
