
Building AI-Ready Vector Datasets for LLMs: A Guide with Bright Data, Google Gemini, and Pinecone

This article shows how to build an automated pipeline for creating high-quality, AI-ready vector datasets for domain-specific Large Language Models.

Large Language Models (LLMs) are transforming how we access information and build intelligent applications. To harness their full potential, especially with domain-specific knowledge or proprietary data, it’s critical to create high-quality, structured vector datasets. An LLM’s performance and accuracy are directly tied to the quality of its input data. Poorly prepared datasets can lead to subpar results, while well-curated ones can turn an LLM into a true domain expert.

In this guide, we will walk through how to build an automated pipeline for generating AI-ready vector datasets, step by step.

The Challenge: Sourcing and Preparing Data for LLMs

While LLMs are trained on vast general-purpose text corpora, they often fall short when applied to specific tasks or domains, such as answering product-related queries, analyzing industry news, or interpreting customer feedback. To make them truly useful, you need high-quality data that’s tailored to your use case.

This data is typically spread across the web, hidden behind complex site structures, or protected by anti-bot measures.

Our automated workflow solves this with a streamlined pipeline that handles the toughest parts of dataset creation:

  • Web Data Extraction. Uses Bright Data to extract data at scale, tapping into their AI-focused infrastructure to bypass challenges like CAPTCHAs and IP blocks.
  • Data Structuring. Uses Google Gemini to parse, clean, and convert raw content into well-structured JSON.
  • Semantic Embedding. Transforms text into vector embeddings that capture rich contextual meaning.
  • Storage & Retrieval. Indexes vectors in Pinecone, a fast and scalable semantic search database.
  • AI-Ready Output. Produces high-quality datasets ready for fine-tuning, RAG, or other domain-specific AI applications.

Core Technologies Overview

Before building the pipeline, let’s take a quick look at the core technologies involved and how each one supports the workflow.

Bright Data: Scalable Web Data Collection

The first step in creating an AI-ready vector dataset is collecting relevant and high-quality source data. While some of it may come from internal systems like knowledge bases or documentation, a large portion is often sourced from the public web.

However, modern websites use sophisticated anti-bot mechanisms, such as CAPTCHAs, IP rate limiting, and browser fingerprinting, that make scraping at scale difficult.

Bright Data solves this challenge with its Web Unlocker API, which abstracts away the complexity of data collection. It automatically handles proxy rotation, CAPTCHA solving, and browser emulation, letting you focus entirely on the data rather than how to access it.

Google Gemini: Intelligent Content Transformation

Gemini is a family of powerful multimodal AI models developed by Google that excel at understanding and processing various types of content. In our data extraction pipeline, Gemini serves three key functions:

  1. Content Parsing: Processes raw HTML or, preferably, cleaned Markdown content.
  2. Information Extraction: Identifies and extracts specific data points based on a predefined schema.
  3. Data Structuring: Transforms extracted information into a clean, structured JSON format.

This AI-powered approach offers major advantages over traditional methods that rely on brittle CSS selectors or fragile regular expressions, especially in use cases such as:

  • Dynamic Web Pages: Pages where the layout or DOM changes frequently (common in eCommerce sites, news portals, and other high-velocity domains).
  • Unstructured Content: Extracting structured data from long-form or poorly organized text blocks.
  • Complex Parsing Logic: Avoiding the need to maintain and debug custom scraping rules for each site or content variation.

For a deeper dive into how AI is transforming the data extraction process, explore Using AI for Web Scraping. If you’re looking for a hands-on tutorial that walks through implementing Gemini in your scraping workflow, check out our comprehensive guide: Web Scraping with Gemini.

Sentence Transformers: Generating Semantic Embeddings

Embeddings are dense vector representations of text (or other data types) in a high-dimensional space. These vectors capture semantic meaning, allowing similar pieces of text to be represented by vectors that are close together, measured using metrics like cosine similarity or Euclidean distance. This property is important for applications like semantic search, clustering, and retrieval-augmented generation (RAG), where finding relevant content depends on semantic proximity.

The Sentence Transformers library provides an easy-to-use interface for generating high-quality sentence and paragraph embeddings. Built on top of Hugging Face Transformers, it supports a wide range of pre-trained models fine-tuned for semantic tasks.

One of the most popular and effective models in this ecosystem is all-MiniLM-L6-v2. Here’s why it stands out:

  • Architecture: Based on the MiniLM architecture, optimized for speed and size while maintaining strong performance.
  • Embedding Dimension: Maps inputs to a 384-dimensional vector space, making it both efficient and compact.
  • Training Objective: Fine-tuned on over 1 billion sentence pairs using a contrastive learning approach to enhance semantic understanding.
  • Performance: Delivers state-of-the-art or near-state-of-the-art results on tasks like sentence similarity, semantic clustering, and information retrieval.
  • Input Length: Handles up to 256 word pieces (tokens), with longer text automatically truncated—an important consideration during text chunking.
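
To make semantic proximity concrete, here is a minimal sketch (the example sentences are made up) that encodes two related sentences with all-MiniLM-L6-v2 and compares them using cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two sentences that share meaning but little vocabulary
emb1 = model.encode("The battery lasts all day on a single charge.")
emb2 = model.encode("Great battery life, easily gets through a full workday.")

print(emb1.shape)                       # (384,) -> one 384-dimensional vector
print(util.cos_sim(emb1, emb2).item())  # High score despite the different wording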

While larger models may offer slightly more nuanced embeddings, all-MiniLM-L6-v2 provides an exceptional balance between performance, efficiency, and cost. Its 384-dimensional vectors are:

  • Faster to compute.
  • Less resource-intensive.
  • Easier to store and index.

For most practical use cases, especially in early-stage development or resource-constrained environments, this model is more than sufficient. The marginal drop in accuracy on edge cases is typically outweighed by the significant gains in speed and scalability. That makes all-MiniLM-L6-v2 a sensible default for the first iteration of your AI application, or whenever you are optimizing for performance on modest infrastructure.

Pinecone: Storing and Searching Vector Embeddings

Once text is transformed into vector embeddings, you need a specialized database to store, manage, and query them efficiently. Traditional databases aren’t designed for this—vector databases are purpose-built to handle the high-dimensional nature of embedding data, allowing real-time similarity search essential for RAG pipelines, semantic search, personalization, and other AI-driven applications.

Pinecone is a popular vector database known for its developer-friendly interface, low-latency search performance, and fully managed infrastructure. It handles the complexities of vector indexing and search at scale, abstracting away the underlying infrastructure. Its key components include:

  • Indexes: Storage containers for your vectors.
  • Vectors: The actual embeddings with associated metadata.
  • Collections: Static snapshots of indexes for backup and versioning.
  • Namespaces: Data partitioning within an index for multi-tenancy.

Pinecone offers two deployment architectures: Serverless and Pod-Based. For most use cases, especially when starting out or dealing with dynamic loads, Serverless is the recommended option due to its simplicity and cost efficiency.

Setup & Prerequisites

Before building the pipeline, make sure the following components are properly configured.

Prerequisites

  • Python 3.9 or later must be installed on your system
  • Gather the following API credentials:
    • Bright Data API Key and Web Unlocker Zone Name
    • Google Gemini API Key
    • Pinecone API Key

Refer to the tool-specific setup sections below for instructions on generating each API key.

Install Required Libraries

Install the core Python libraries for this project:

pip install requests python-dotenv google-generativeai sentence-transformers pinecone

These libraries provide:

  • requests: A popular HTTP client for interacting with APIs (requests guide)
  • python-dotenv: Securely loads API keys from environment variables
  • google-generativeai: Official Gemini SDK from Google (also supports JavaScript, Go, and other languages)
  • sentence-transformers: Pre-trained models for generating semantic vector embeddings
  • pinecone: SDK for Pinecone’s vector database (language SDKs available for Python, Node.js, Go, and more)

Configure Environment Variables

Create a .env file in your project’s root directory and add your API keys:

BRIGHT_DATA_API_KEY="your_bright_data_api_key_here"
GEMINI_API_KEY="your_gemini_api_key_here"
PINECONE_API_KEY="your_pinecone_api_key_here"

Bright Data Setup

To use Bright Data’s Web Unlocker:

  1. Create an API token
  2. Set up a Web Unlocker zone from your Bright Data dashboard

For implementation examples and integration code, explore the Web Unlocker GitHub repo.

If you’re still comparing solutions, this AI scraping tools comparison offers insights into how Bright Data stacks up against other platforms.

Gemini Setup

To generate a Gemini API key:

  1. Go to Google AI Studio
  2. Click “+ Create API key”
  3. Copy the key and store it securely

Tip: The free tier is sufficient for development and small-scale testing. For production use, where you may require higher throughput (RPM/RPD), larger token windows (TPM), or enterprise-grade privacy and access to advanced models, refer to rate limits and pricing plans.

Pinecone Setup

  1. Sign up at Pinecone.io
  2. Copy your API key from the dashboard
  3. To create a new index:
    • Navigate to Indexes → click Create index
    • Set the following:
      • Index Name: Choose a clear name (e.g., semantic-search-index)
      • Vector Type: Select Dense
      • Dimensions: Match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2)
      • Metric: Choose cosine (alternatives: euclidean, dotproduct)
      • Capacity Mode: Use Serverless
      • Cloud & Region: Pick your preferred provider and location (e.g., AWS us-east-1)
    • Click Create index

You’ll see the index with a green status and initially zero records once setup is complete.
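
If you prefer to create the index programmatically instead of through the dashboard, the Pinecone Python SDK supports the same configuration. Here is a minimal sketch (the index name is just an example):

import os

from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create a serverless index matching all-MiniLM-L6-v2's 384-dimensional output
pc.create_index(
    name="semantic-search-index",  # Example name; use your own
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)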


Building the Pipeline: Step-by-Step Implementation

Now that our prerequisites are configured, let’s build our data pipeline using Walmart’s MacBook Air M1 product reviews as a practical example.


Step 1: Data Acquisition with Bright Data Web Unlocker

The foundation of our pipeline involves fetching raw HTML content from target URLs. Bright Data’s Web Unlocker excels at bypassing the sophisticated anti-scraping measures commonly employed by e-commerce sites like Walmart.


Let’s start with this implementation for fetching webpage content:

import requests
import os
from typing import Optional
from dotenv import load_dotenv

# Load API key from environment
load_dotenv()
BRIGHT_DATA_API_KEY = os.getenv("BRIGHT_DATA_API_KEY")

def fetch_page(url: str) -> Optional[str]:
    """Fetch page content using Bright Data Web Unlocker (Markdown format)"""
    try:
        response = requests.post(
            "https://api.brightdata.com/request",
            headers={
                "Authorization": f"Bearer {BRIGHT_DATA_API_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "zone": "web_unlocker2",
                "url": url,
                "format": "raw",
                "data_format": "markdown",  # Direct HTML-to-Markdown conversion
            },
            timeout=60,
        )

        if response.status_code == 200 and len(response.text) > 1000:
            return response.text

    except Exception as e:
        print(f"Request failed: {str(e)}")

    return None

# Example usage
walmart_url = "https://www.walmart.com/reviews/product/609040889?page=1"
page_content = fetch_page(walmart_url)

if page_content:
    print(f"Success! Retrieved {len(page_content)} characters")
else:
    print("Failed to fetch page")

Why use Markdown instead of raw HTML? In our pipeline, we request content in Markdown format (data_format: 'markdown') for several important reasons. Markdown strips away HTML tags, styling, and other noise, reducing complexity and leaving only the essential content. This results in a significantly lower token count, making LLM processing more efficient. It also preserves the semantic structure in a cleaner, more readable format, which enhances both clarity and processing speed. Operations like embedding generation and vector indexing become faster and lighter.

For more context on why modern AI agents favor Markdown, read Why Are the New AI Agents Choosing Markdown Over HTML.

Step 2: Handling Pagination

Walmart distributes product reviews across numerous pages. To capture complete datasets, implement pagination handling. You need to:

  1. Build the correct page URL (?page=1, ?page=2, etc.)
  2. Fetch the content for each page
  3. Detect if there’s a “next page” or not
  4. Continue until no more pages are available

Here’s a simple pagination loop that fetches content until no page=n+1 reference is found:

current_page = 1

while True:
    url = f"https://www.walmart.com/reviews/product/609040889?page={current_page}"
    page_content = fetch_page(url)

    if not page_content:
        print(f"Stopping at page {current_page}: fetch failed or no content.")
        break

    # Do something with the fetched content here
    print(f"Fetched page {current_page}")

    # Check for presence of next page reference
    if f"page={current_page + 1}" not in page_content:
        print("No next page found. Ending pagination.")
        break

    current_page += 1

Step 3: Structured Data Extraction with Google Gemini

With clean Markdown content from the previous step, we’ll now use Google Gemini to extract specific information from the reviews and structure it as JSON. This transforms unstructured text into organized data that our vector database can efficiently index.

We’ll use the gemini-2.0-flash model, which offers impressive specifications for our use case:

  • Input Context: 1,048,576 tokens
  • Output Limit: 8,192 tokens
  • Multimodal Support: Text, code, images, audio, and video

In our case, the markdown text of the Walmart review page typically contains around 3,000 tokens, well within the model’s limit. This means we can send the entire page at once without splitting it into smaller chunks.


If your documents exceed the context window, you’ll need to implement chunking strategies. But for typical web pages, Gemini’s capacity makes this unnecessary.
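
If you do need chunking, a simple fixed-size split with overlap is often enough as a starting point. Here is a minimal sketch (character-based for simplicity; a production pipeline would typically count tokens instead):

def chunk_text(text: str, chunk_size: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into overlapping chunks so each piece fits the model's context window."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks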

Here’s a sample Python function that uses Gemini to extract reviews in a structured JSON format:

import google.generativeai as genai
import json
import os

from dotenv import load_dotenv

# Load the Gemini API key and configure the SDK
load_dotenv()
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# Initialize Gemini with JSON output configuration
model = genai.GenerativeModel(
    model_name="gemini-2.0-flash",
    generation_config={"response_mime_type": "application/json"},
)

def extract_reviews(markdown: str) -> list[dict]:
    """Extract structured review data from Markdown using Gemini."""
    prompt = f"""
    Extract all customer reviews from this Walmart product page content.
    Return a JSON object with a "reviews" array of review objects, following this structure:

    {{
      "reviews": [
        {{
          "date": "YYYY-MM-DD or original date format if available",
          "title": "Review title/headline",
          "description": "Review text content",
          "rating": <integer from 1–5>
        }}
      ]
    }}

    Rules:
    - Include all reviews found on the page
    - Use null for any missing fields
    - Convert ratings to integers (1–5)
    - Extract the full review text, not just snippets
    - Preserve original review text without summarizing

    Here's the page content:
    {markdown}
    """

    response = model.generate_content(prompt)
    result = json.loads(response.text)

    # Normalize and clean results
    return [
        {
            "date": review.get("date"),
            "title": review.get("title"),
            "description": review.get("description", "").strip(),
            "rating": review.get("rating"),
        }
        for review in result.get("reviews", [])
    ]

Prompt engineering is key when working with LLMs. In our implementation, we set response_mime_type: "application/json" to ensure Gemini returns valid JSON, eliminating the need for complex text parsing. The prompt itself is carefully designed to reduce hallucinations by instructing Gemini to rely solely on the provided content. It also enforces a strict JSON schema for structural consistency, preserves full review text without summarization, and handles missing fields gracefully.

After processing a Walmart review page, you’ll receive structured data like this:

[
  {
    "date": "Apr 13, 2025",
    "title": "Better than buying OPEN BOX",
    "description": "I bought an older product OPEN BOX (which I consider UNUSED) from another website. The battery was dead. Walmart offered NEW at a lower price. WOW!!!!",
    "rating": 5
  },
  {
    "date": "Dec 8, 2024",
    "title": "No support",
    "description": "The young man who delivered my laptop gave me the laptop with no receipt or directions. I asked where my receipt and some kind of manual were. He said it would be under my purchases. I would happily change this review if I knew where to go for help and support. The next day I went to the electronics department for help, and he had no idea.",
    "rating": 3
  }
  // ... more reviews
]

For a working example that combines all steps (fetching, processing, and extraction), check out the complete implementation on GitHub.

Step 4: Generating Vector Embeddings with Sentence Transformers

With clean, structured review data in JSON format, we now generate semantic vector embeddings for each review. These embeddings will be used for downstream tasks like semantic search or indexing in a vector database like Pinecone.

To capture the full context of a customer review, we combine the review title and description into a single string before embedding. This helps the model encode both the sentiment and subject matter more effectively.

Here’s the sample code:

from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

def generate_embeddings(reviews):
    """Generate 384-dimensional vector embeddings from review titles and descriptions."""
    texts = []
    valid_indices = []

    # Combine title and description into a single string for embedding
    for idx, review in enumerate(reviews):
        text_parts = []
        if review.get("title"):
            text_parts.append(f"Review Title: {review['title']}")
        if review.get("description"):
            text_parts.append(f"Review Description: {review['description']}")

        if text_parts:
            texts.append(". ".join(text_parts))
            valid_indices.append(idx)

    # Generate embeddings using batch processing
    embeddings = model.encode(
        texts, show_progress_bar=True, batch_size=32, convert_to_numpy=True
    ).tolist()

    # Attach embeddings back to original review objects
    for emb_idx, review_idx in enumerate(valid_indices):
        reviews[review_idx]["embedding"] = embeddings[emb_idx]

    return reviews

What this code does:

  1. Model Initialization: Loads the all-MiniLM-L6-v2 model, which returns 384-dimensional dense embeddings.
  2. Input Preparation: Combines the title and description of each review into a single string.
  3. Batch Encoding: Uses model.encode() with batching for efficient processing:
    • batch_size=32: Optimizes speed and memory usage
    • show_progress_bar=True: Displays a progress bar during encoding
    • convert_to_numpy=True: Converts outputs to NumPy arrays for easier manipulation
  4. Embedding Injection: Attaches each vector back to the corresponding review object under the key "embedding".

Important Note: Pinecone does not support null values in metadata. If any field is missing, you must omit the key entirely when uploading to Pinecone. Do not use "N/A" or empty strings unless they hold specific meaning in your filtering logic.

The sanitization function isn’t shown in the code above (to keep it readable), but the final implementation includes this metadata cleanup before ingestion; a minimal sketch follows.
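
As an illustration, a cleanup helper could look like this (the function name and exact rules are assumptions for this sketch, not part of the reference implementation):

def sanitize_metadata(review: dict) -> dict:
    """Keep only non-empty metadata fields, since Pinecone rejects null values."""
    metadata = {
        "date": review.get("date"),
        "title": review.get("title"),
        "description": review.get("description"),
        "rating": review.get("rating"),
    }
    # Drop keys whose values are None or empty strings before upserting
    return {key: value for key, value in metadata.items() if value not in (None, "")}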

After embedding generation, each review object includes a 384-dimensional vector:

{
  "date": "Apr 9, 2024",
  "title": "Amazing Laptop!",
  "description": "This M1 MacBook Air is incredibly fast and the battery lasts forever.",
  "rating": 5,
  "embedding": [0.0123, -0.0456, 0.0789, ..., 0.0345]  // 384 dimensions
}

With embeddings generated, our reviews are ready for vector storage in Pinecone.

Step 5: Storing Embeddings and Metadata in Pinecone

The final step in our pipeline involves uploading the embedded reviews to Pinecone.

Here is the Python code to upsert data to Pinecone:

import os
import uuid

from dotenv import load_dotenv
from pinecone import Pinecone

# Initialize Pinecone client with your API key
load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("brightdata-ai-dataset")  # Replace with your actual index name

# Sample review data (with embeddings already attached)
reviews_with_embeddings = [
    {
        "date": "Apr 9, 2024",
        "title": "Amazing Laptop!",
        "description": "This M1 MacBook Air is incredibly fast and the battery lasts forever.",
        "rating": 5,
        "embedding": [0.0123, -0.0456, ..., 0.0789],  # 384-dimensional vector
    }
    # ... more reviews
]

# Prepare vector records for upload
vectors = []
for review in reviews_with_embeddings:
    if "embedding" not in review:
        continue  # Skip entries without embeddings

    vectors.append(
        {
            "id": str(uuid.uuid4()),  # Unique vector ID
            "values": review["embedding"],
            "metadata": {
                "title": review.get("title"),
                "description": review.get("description"),
                "rating": review.get("rating"),
                # Add more metadata fields if needed
            },
        }
    )

# Batch upload to Pinecone (100 vectors per request)
for i in range(0, len(vectors), 100):
    batch = vectors[i : i + 100]
    index.upsert(vectors=batch)

Each vector you upsert into Pinecone should include:

  • id: A unique string identifier (required)
  • values: The vector itself (list of floats, e.g., 384-dimensional)
  • metadata: Optional key-value pairs for filtering and context (JSON-compatible)

Based on these fields, a single upserted record looks roughly like this (the ID and vector values are abbreviated for illustration):
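
{
  "id": "4f6c9a1e-3b2d-4c8a-9f1e-2a7b5c3d8e90",
  "values": [0.0123, -0.0456, ..., 0.0789],  // 384-dimensional embedding
  "metadata": {
    "title": "Amazing Laptop!",
    "description": "This M1 MacBook Air is incredibly fast and the battery lasts forever.",
    "rating": 5
  }
}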

Once the upload is complete, your Pinecone index will be populated with review vectors:


Your AI-ready vector dataset is now stored in Pinecone and is ready for the next steps 🔥

For a working example that combines all steps (embedding generation, Pinecone upload), check out the complete implementation on GitHub.

(Optional but Recommended) Utilizing the AI-Ready Dataset

With your embeddings now indexed in Pinecone, you can power applications like semantic search and RAG systems. This step shows how to query your vector database and generate intelligent responses.

Semantic Search

The simplest way to leverage your vectorized dataset is through semantic search. Unlike keyword search, semantic search allows users to query in natural language and retrieve conceptually similar content, even if they don’t share the same words.

Let’s test the system with natural language queries:

queries = [
    "good price for students",
    "lightweight and good for travel",
]

For the query “good price for students”, you might see:

#1 (Score: 0.6201)
ID: 75878bdc-8d96-416a-8292-484971c3bd61
Date: Aug 3, 2024
Rating: 5.0
Description: Just what my daughter needed for college and the price was perfect

#2 (Score: 0.5868)
ID: 758963ae-0927-4e82-bece-d098991f5a73
Date: Jun 13, 2024
Rating: 5.0
Description: The price was right. Perfect graduation present for my grandson

🙌 It works beautifully! Natural language queries return highly relevant results.

This is how semantic search works:

  1. Query Embedding: The search query is converted to a vector using the same all-MiniLM-L6-v2 model used for indexing.
  2. Vector Search: Pinecone finds the most similar vectors using cosine similarity.
  3. Metadata Retrieval: Results include both similarity scores and associated metadata.
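
Putting those three steps together, a minimal query sketch might look like this (it reuses the index name from Step 5 and the same embedding model; top_k is illustrative):

import os

from dotenv import load_dotenv
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

load_dotenv()
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("brightdata-ai-dataset")  # Same index as in Step 5
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # Same model used for indexing

query = "good price for students"
query_vector = embedder.encode(query).tolist()

# Retrieve the 3 most similar reviews along with their metadata
results = index.query(vector=query_vector, top_k=3, include_metadata=True)
for match in results.matches:
    print(f"{match.score:.4f}", match.metadata.get("description"))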

For the full working implementation, check out the Semantic Search Client Python file.

Beyond Search: Retrieval Augmented Generation (RAG)

Once you have semantic search working, you’re just a step away from building an LLM-powered RAG system. Retrieval Augmented Generation (RAG) lets your LLM generate grounded responses using external context, like your vectorized dataset.

RAG Flow:

  1. User asks a question (e.g., “Is this MacBook good for college students?”).
  2. Semantic search retrieves relevant documents from Pinecone.
  3. The retrieved context + question is sent to an LLM like Google Gemini.
  4. The LLM responds with an answer grounded in the facts from your dataset.
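
A stripped-down sketch of that flow, reusing the embedder and index objects from the search snippet above and the configured Gemini SDK from Step 3 (the helper name, prompt wording, and top_k are illustrative):

import google.generativeai as genai

def answer_question(question: str, top_k: int = 5) -> str:
    """Answer a question using reviews retrieved from Pinecone as grounding context."""
    # 1. Embed the question and retrieve the most relevant reviews
    query_vector = embedder.encode(question).tolist()
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    context = "\n".join(
        f"- {m.metadata.get('title', '')}: {m.metadata.get('description', '')}"
        for m in results.matches
    )

    # 2. Ask Gemini to answer using only the retrieved reviews
    llm = genai.GenerativeModel("gemini-2.0-flash")
    prompt = (
        "Answer the question using only the customer reviews below.\n\n"
        f"Reviews:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate_content(prompt).text

print(answer_question("Is this MacBook good for college students?"))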

Example RAG responses:

🤔 User: Is the battery life good for college use?
🤖 Assistant: Yes, users report long battery life—enough to last through full days of classes and study.

🤔 User: How does this compare to a Chromebook?
🤖 Assistant: One review says the MacBook Air "works so smoothly compared to a Chromebook".

See the full code used for RAG and semantic search: RAG Chatbot Implementation.

Next Steps

You’ve successfully built a complete pipeline for creating AI-ready vector datasets. Here’s how to expand and optimize your implementation:

  1. Scale Data Acquisition: For more extensive data needs, explore Bright Data’s full AI-Ready Web Data Infrastructure for unlimited, compliant web data access optimized for AI models and agents.
  2. Experiment with Embedding Models: While all-MiniLM-L6-v2 is efficient, you may get better results for certain use cases by switching to larger or multilingual models. You can also try embedding APIs from Google Gemini and OpenAI.
  3. Refine Extraction Prompts: Tailor the Gemini prompt for different website structures or data schemas you need to extract.
  4. Leverage Advanced Pinecone Features: Explore filtering, namespaces, metadata indexing, and hybrid search by diving into the official Pinecone documentation.
  5. Automate the Pipeline: Integrate this pipeline into a production workflow using tools like Apache Airflow or Prefect for orchestration, or AWS Step Functions and Google Cloud Workflows for cloud-native scheduling.
  6. Build an AI-Powered Application: Use the semantic search or RAG components to create real-world tools such as customer support chatbots, knowledge base search, and recommendation engines.

Conclusion

You’ve successfully built a complete, robust pipeline that creates and manages AI-ready vector datasets, transforming raw web data into valuable assets for large language models. By combining Bright Data for scalable web scraping, Google Gemini for intelligent structured extraction, Sentence Transformers for generating semantic embeddings, and Pinecone for vector storage and retrieval, you’ve effectively prepared your custom data to enhance LLM applications.

This approach grounds LLMs in specific domain knowledge, delivering more accurate, relevant, and valuable AI-powered solutions.


Further Reading & Resources

Explore these resources to deepen your understanding of AI, LLMs, and web data extraction: