Large Language Models (LLMs) are transforming how we access information and build intelligent applications. To harness their full potential, especially with domain-specific knowledge or proprietary data, it’s critical to create high-quality, structured vector datasets. An LLM’s performance and accuracy are directly tied to the quality of its input data. Poorly prepared datasets can lead to subpar results, while well-curated ones can turn an LLM into a true domain expert.
In this guide, we will walk through how to build an automated pipeline for generating AI-ready vector datasets, step by step.
The Challenge: Sourcing and Preparing Data for LLMs
While LLMs are trained on vast general-purpose text corpora, they often fall short when applied to specific tasks or domains, such as answering product-related queries, analyzing industry news, or interpreting customer feedback. To make them truly useful, you need high-quality data that’s tailored to your use case.
This data is typically spread across the web, hidden behind complex site structures, or protected by anti-bot measures.
Our automated workflow solves this with a streamlined pipeline that handles the toughest parts of dataset creation:
- Web Data Extraction. Uses Bright Data to extract data at scale, tapping into their AI-focused infrastructure to bypass challenges like CAPTCHAs and IP blocks.
- Data Structuring. Uses Google Gemini to parse, clean, and convert raw content into well-structured JSON.
- Semantic Embedding. Transforms text into vector embeddings that capture rich contextual meaning.
- Storage & Retrieval. Indexes vectors in Pinecone, a fast and scalable semantic search database.
- AI-Ready Output. Produces high-quality datasets ready for fine-tuning, RAG, or other domain-specific AI applications.
Core Technologies Overview
Before building the pipeline, let’s take a quick look at the core technologies involved and how each one supports the workflow.
Bright Data: Scalable Web Data Collection
The first step in creating an AI-ready vector dataset is collecting relevant and high-quality source data. While some of it may come from internal systems like knowledge bases or documentation, a large portion is often sourced from the public web.
However, modern websites use sophisticated anti-bot mechanisms, such as CAPTCHAs, IP rate limiting, and browser fingerprinting, that make scraping at scale difficult.
Bright Data solves this challenge with its Web Unlocker API, which abstracts away the complexity of data collection. It automatically handles proxy rotation, CAPTCHA solving, and browser emulation, letting you focus entirely on the data rather than how to access it.
Google Gemini: Intelligent Content Transformation
Gemini is a family of powerful multimodal AI models developed by Google that excel at understanding and processing various types of content. In our data extraction pipeline, Gemini serves three key functions:
- Content Parsing: Processes raw HTML or, preferably, cleaned Markdown content.
- Information Extraction: Identifies and extracts specific data points based on a predefined schema.
- Data Structuring: Transforms extracted information into a clean, structured JSON format.
This AI-powered approach offers major advantages over traditional methods that rely on brittle CSS selectors or fragile regular expressions, especially in use cases such as:
- Dynamic Web Pages: Pages where the layout or DOM changes frequently (common in eCommerce sites, news portals, and other high-velocity domains).
- Unstructured Content: Extracting structured data from long-form or poorly organized text blocks.
- Complex Parsing Logic: Avoiding the need to maintain and debug custom scraping rules for each site or content variation.
For a deeper dive into how AI is transforming the data extraction process, explore Using AI for Web Scraping. If you’re looking for a hands-on tutorial that walks through implementing Gemini in your scraping workflow, check out our comprehensive guide: Web Scraping with Gemini.
Sentence Transformers: Generating Semantic Embeddings
Embeddings are dense vector representations of text (or other data types) in a high-dimensional space. These vectors capture semantic meaning, allowing similar pieces of text to be represented by vectors that are close together, measured using metrics like cosine similarity or Euclidean distance. This property is important for applications like semantic search, clustering, and retrieval-augmented generation (RAG), where finding relevant content depends on semantic proximity.
The Sentence Transformers library provides an easy-to-use interface for generating high-quality sentence and paragraph embeddings. Built on top of Hugging Face Transformers, it supports a wide range of pre-trained models fine-tuned for semantic tasks.
One of the most popular and effective models in this ecosystem is all-MiniLM-L6-v2. Here’s why it stands out:
- Architecture: Based on the MiniLM architecture, optimized for speed and size while maintaining strong performance.
- Embedding Dimension: Maps inputs to a 384-dimensional vector space, making it both efficient and compact.
- Training Objective: Fine-tuned on over 1 billion sentence pairs using a contrastive learning approach to enhance semantic understanding.
- Performance: Delivers state-of-the-art or near-state-of-the-art results on tasks like sentence similarity, semantic clustering, and information retrieval.
- Input Length: Handles up to 256 word pieces (tokens), with longer text automatically truncated—an important consideration during text chunking.
While larger models may offer slightly more nuanced embeddings, all-MiniLM-L6-v2 provides an exceptional balance between performance, efficiency, and cost. Its 384-dimensional vectors are:
- Faster to compute.
- Less resource-intensive.
- Easier to store and index.
For most practical use cases, especially in early-stage development or resource-constrained environments, this model is more than sufficient. The marginal drop in accuracy on edge cases is typically outweighed by the significant gains in speed and scalability. We recommend all-MiniLM-L6-v2 when building the first iteration of your AI application or when optimizing for performance on modest infrastructure.
Pinecone: Storing and Searching Vector Embeddings
Once text is transformed into vector embeddings, you need a specialized database to store, manage, and query them efficiently. Traditional databases aren’t designed for this—vector databases are purpose-built to handle the high-dimensional nature of embedding data, allowing real-time similarity search essential for RAG pipelines, semantic search, personalization, and other AI-driven applications.
Pinecone is a popular vector database known for its developer-friendly interface, low-latency search performance, and fully managed infrastructure. It efficiently manages the complexities of vector indexing and search at scale, abstracting the intricacies of vector search infrastructure. Its key components include:
- Indexes: Storage containers for your vectors.
- Vectors: The actual embeddings with associated metadata.
- Collections: Static snapshots of indexes for backup and versioning.
- Namespaces: Data partitioning within an index for multi-tenancy.
Pinecone offers two deployment architectures: Serverless and Pod-Based. For most use cases, especially when starting out or dealing with dynamic loads, Serverless is the recommended option due to its simplicity and cost efficiency.
Setup & Prerequisites
Before building the pipeline, make sure the following components are properly configured.
Prerequisites
- Python 3.9 or later must be installed on your system
- Gather the following API credentials:
- Bright Data API Key and Web Unlocker Zone Name
- Google Gemini API Key
- Pinecone API Key
Refer to the tool-specific setup sections below for instructions on generating each API key.
Install Required Libraries
Install the core Python libraries for this project:
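```bash
pip install requests python-dotenv google-generativeai sentence-transformers pinecone
```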
These libraries provide:
- requests: A popular HTTP client for interacting with APIs (requests guide)
- python-dotenv: Securely loads API keys from environment variables
- google-generativeai: Official Gemini SDK from Google (also supports JavaScript, Go, and other languages)
- sentence-transformers: Pre-trained models for generating semantic vector embeddings
- pinecone: SDK for Pinecone’s vector database (language SDKs available for Python, Node.js, Go, and more)
Configure Environment Variables
Create a .env file in your project’s root directory and add your API keys:
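A minimal example (the variable names below are placeholders of our choosing; use whichever names your code loads):

```
BRIGHT_DATA_API_KEY="your-bright-data-api-key"
BRIGHT_DATA_ZONE="your-web-unlocker-zone-name"
GEMINI_API_KEY="your-gemini-api-key"
PINECONE_API_KEY="your-pinecone-api-key"
```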
Bright Data Setup
To use Bright Data’s Web Unlocker:
- Create an API token
- Set up a Web Unlocker zone from your Bright Data dashboard
For implementation examples and integration code, explore the Web Unlocker GitHub repo.
If you’re still comparing solutions, this AI scraping tools comparison offers insights into how Bright Data stacks up against other platforms.
Gemini Setup
To generate a Gemini API key:
- Go to Google AI Studio
- Click “+ Create API key”
- Copy the key and store it securely
Tip: The free tier is sufficient for development and small-scale testing. For production use, where you may require higher throughput (RPM/RPD), larger token windows (TPM), or enterprise-grade privacy and access to advanced models, refer to rate limits and pricing plans.
Pinecone Setup
- Sign up at Pinecone.io
- Copy your API key from the dashboard
- To create a new index:
  - Navigate to Indexes → click Create index
  - Set the following:
    - Index Name: Choose a clear name (e.g., semantic-search-index)
    - Vector Type: Select Dense
    - Dimensions: Match the output dimension of your embedding model (e.g., 384 for all-MiniLM-L6-v2)
    - Metric: Choose cosine (alternatives: euclidean, dotproduct)
    - Capacity Mode: Use Serverless
    - Cloud & Region: Pick your preferred provider and location (e.g., AWS us-east-1)
  - Click Create index
You’ll see the index with a green status and initially zero records once setup is complete.
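If you prefer to create the index from code rather than the dashboard, here is a minimal sketch using the pinecone Python SDK (the index name, cloud, and region are illustrative; match whatever you chose above):

```python
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Dimension must match the embedding model: 384 for all-MiniLM-L6-v2.
pc.create_index(
    name="semantic-search-index",  # example name; reuse the one from your dashboard setup
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```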
Building the Pipeline: Step-by-Step Implementation
Now that our prerequisites are configured, let’s build our data pipeline using Walmart’s MacBook Air M1 product reviews as a practical example.
Step 1: Data Acquisition with Bright Data Web Unlocker
The foundation of our pipeline involves fetching raw HTML content from target URLs. Bright Data’s Web Unlocker excels at bypassing the sophisticated anti-scraping measures commonly employed by e-commerce sites like Walmart.
Let’s start with this implementation for fetching webpage content:
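A minimal sketch of such a fetch helper, assuming Bright Data’s https://api.brightdata.com/request endpoint and the environment variable names from the .env example above (check your dashboard and the Web Unlocker docs for the exact values):

```python
import os
import requests
from dotenv import load_dotenv

load_dotenv()

def fetch_page_markdown(url: str) -> str:
    """Fetch a page through Bright Data Web Unlocker and return it as Markdown."""
    response = requests.post(
        "https://api.brightdata.com/request",  # assumed Web Unlocker API endpoint
        headers={"Authorization": f"Bearer {os.environ['BRIGHT_DATA_API_KEY']}"},
        json={
            "zone": os.environ["BRIGHT_DATA_ZONE"],  # your Web Unlocker zone name
            "url": url,
            "format": "raw",            # return the page body directly
            "data_format": "markdown",  # request Markdown instead of raw HTML
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.text
```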
Why use Markdown instead of raw HTML? In our pipeline, we request content in Markdown format (data_format: 'markdown') for several important reasons. Markdown strips away HTML tags, styling, and other noise, reducing complexity and leaving only the essential content. This results in a significantly lower token count, making LLM processing more efficient. It also preserves the semantic structure in a cleaner, more readable format, which enhances both clarity and processing speed. Operations like embedding generation and vector indexing become faster and lighter.
For more context on why modern AI agents favor Markdown, read Why Are the New AI Agents Choosing Markdown Over HTML.
Step 2: Handling Pagination
Walmart distributes product reviews across numerous pages. To capture complete datasets, implement pagination handling. You need to:
- Build the correct page URL (?page=1, ?page=2, etc.)
- Fetch the content for each page
- Detect whether a “next page” reference exists
- Continue until no more pages are available
Here’s a simple pagination loop that fetches content until no page=n+1 reference is found:
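A minimal sketch, reusing the fetch_page_markdown helper from Step 1 (the max_pages cap is an assumption added as a safety net):

```python
def fetch_all_review_pages(base_url: str, max_pages: int = 20) -> list[str]:
    """Fetch review pages until the current page no longer references the next one."""
    pages = []
    page = 1
    while page <= max_pages:
        markdown = fetch_page_markdown(f"{base_url}?page={page}")
        pages.append(markdown)
        # Stop when the page content contains no link to the next page number.
        if f"page={page + 1}" not in markdown:
            break
        page += 1
    return pages
```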
Step 3: Structured Data Extraction with Google Gemini
With clean Markdown content from the previous step, we’ll now use Google Gemini to extract specific information from the reviews and structure it as JSON. This transforms unstructured text into organized data that our vector database can efficiently index.
We’ll use the gemini-2.0-flash model, which offers impressive specifications for our use case:
- Input Context: 1,048,576 tokens
- Output Limit: 8,192 tokens
- Multimodal Support: Text, code, images, audio, and video
In our case, the markdown text of the Walmart review page typically contains around 3,000 tokens, well within the model’s limit. This means we can send the entire page at once without splitting it into smaller chunks.
If your documents exceed the context window, you’ll need to implement chunking strategies. But for typical web pages, Gemini’s capacity makes this unnecessary.
Here’s a sample Python function that uses Gemini to extract reviews in a structured JSON format:
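A minimal sketch, assuming a simple review schema (title, description, rating, reviewer, date are illustrative field names); adapt the prompt and fields to whatever you need to extract:

```python
import os
import json
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

EXTRACTION_PROMPT = """Extract every customer review from the page content below.
Use ONLY information present in the content; do not invent or summarize anything.
Return a JSON array where each review has these keys:
"title", "description", "rating", "reviewer", "date".
If a field is missing from the content, set it to null.

PAGE CONTENT:
{content}
"""

def extract_reviews(markdown: str) -> list[dict]:
    """Ask Gemini to turn the page Markdown into a list of structured review objects."""
    model = genai.GenerativeModel("gemini-2.0-flash")
    response = model.generate_content(
        EXTRACTION_PROMPT.format(content=markdown),
        generation_config={"response_mime_type": "application/json"},  # force valid JSON output
    )
    return json.loads(response.text)
```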
Prompt engineering is key when working with LLMs. In our implementation, we set response_mime_type: "application/json" to ensure Gemini returns valid JSON, eliminating the need for complex text parsing. The prompt itself is carefully designed to reduce hallucinations by instructing Gemini to rely solely on the provided content. It also enforces a strict JSON schema for structural consistency, preserves full review text without summarization, and handles missing fields gracefully.
After processing a Walmart review page, you’ll receive structured data like this:
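The exact fields depend on the schema in your prompt; with the illustrative schema sketched above, the output has roughly this shape (values elided):

```json
[
  {
    "title": "…",
    "description": "…",
    "rating": "…",
    "reviewer": "…",
    "date": "…"
  }
]
```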
For a working example that combines all steps (fetching, processing, and extraction), check out the complete implementation on GitHub.
Step 4: Generating Vector Embeddings with Sentence Transformers
With clean, structured review data in JSON format, we now generate semantic vector embeddings for each review. These embeddings will be used for downstream tasks like semantic search or indexing in a vector database like Pinecone.
To capture the full context of a customer review, we combine the review title and description into a single string before embedding. This helps the model encode both the sentiment and subject matter more effectively.
Here’s the sample code:
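A minimal sketch (the title and description field names follow the extraction schema used earlier; adjust to your own):

```python
from sentence_transformers import SentenceTransformer

def embed_reviews(reviews: list[dict]) -> list[dict]:
    """Attach a 384-dimensional embedding to each review dict."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Combine title and description so the vector captures both subject and sentiment.
    texts = [f"{r.get('title', '')}. {r.get('description', '')}".strip() for r in reviews]
    embeddings = model.encode(
        texts,
        batch_size=32,            # balances speed and memory usage
        show_progress_bar=True,   # display encoding progress
        convert_to_numpy=True,    # return NumPy arrays for easier handling
    )
    for review, vector in zip(reviews, embeddings):
        review["embedding"] = vector.tolist()  # store as a plain list for JSON/Pinecone
    return reviews
```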
What this code does:
- Model Initialization: Loads the all-MiniLM-L6-v2 model, which returns 384-dimensional dense embeddings.
- Input Preparation: Combines the title and description of each review into a single string.
- Batch Encoding: Uses model.encode() with batching for efficient processing:
  - batch_size=32: Optimizes speed and memory usage
  - show_progress_bar=True: Displays a progress bar during encoding
  - convert_to_numpy=True: Converts outputs to NumPy arrays for easier manipulation
- Embedding Injection: Attaches each vector back to the corresponding review object under the key "embedding".
Important Note: Pinecone does not support null values in metadata. If any field is missing, you must omit the key entirely when uploading to Pinecone. Do not use "N/A" or empty strings unless they hold specific meaning in your filtering logic.
While the sanitization function isn’t shown here (to keep the code readable), the final implementation includes metadata cleanup before ingestion.
After embedding generation, each review object includes a 384-dimensional vector:
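Roughly, each object now has this shape (values elided; the embedding is a list of 384 floats):

```python
{
    "title": "…",
    "description": "…",
    "embedding": [0.0123, -0.0456, 0.0789, ...],  # 384 values in total (illustrative numbers)
}
```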
With embeddings generated, our reviews are ready for vector storage in Pinecone.
Step 5: Storing Embeddings and Metadata in Pinecone
The final step in our pipeline involves uploading the embedded reviews to Pinecone.
Here is the Python code to upsert data to Pinecone:
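A minimal sketch, assuming the semantic-search-index index from the setup step and a walmart-reviews namespace (both illustrative names); it also drops None metadata fields, as noted above:

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("semantic-search-index")  # assumed index name from the setup step

def upsert_reviews(reviews: list[dict], namespace: str = "walmart-reviews") -> None:
    """Upload embedded reviews to Pinecone in small batches."""
    vectors = []
    for i, review in enumerate(reviews):
        # Omit None fields entirely: Pinecone metadata does not accept null values.
        metadata = {k: v for k, v in review.items() if k != "embedding" and v is not None}
        vectors.append({
            "id": f"review-{i}",            # unique string identifier
            "values": review["embedding"],  # the 384-dimensional vector
            "metadata": metadata,
        })
    # Upsert in batches to stay within request size limits.
    for start in range(0, len(vectors), 100):
        index.upsert(vectors=vectors[start:start + 100], namespace=namespace)
```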
Each vector you upsert into Pinecone should include:
- id: A unique string identifier (required)
- values: The vector itself (a list of floats, e.g., 384-dimensional)
- metadata: Optional key-value pairs for filtering and context (JSON-compatible)
Example vector structure:
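Roughly (metadata fields follow whatever schema you extracted; values here are illustrative):

```python
{
    "id": "review-0",
    "values": [0.0123, -0.0456, ...],  # the 384-dimensional embedding
    "metadata": {
        "title": "…",
        "description": "…",
    },
}
```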
Once the upload is complete, your Pinecone index will be populated with review vectors.
Your AI-ready vector dataset is now stored in Pinecone and is ready for the next steps 🔥
For a working example that combines all steps (embedding generation, Pinecone upload), check out the complete implementation on GitHub.
(Optional but Recommended) Utilizing the AI-Ready Dataset
With your embeddings now indexed in Pinecone, you can power applications like semantic search and RAG systems. This step shows how to query your vector database and generate intelligent responses.
Semantic Search
The simplest way to leverage your vectorized dataset is through semantic search. Unlike keyword search, semantic search allows users to query in natural language and retrieve conceptually similar content, even if they don’t share the same words.
Let’s test the system with natural language queries:
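A minimal sketch, reusing the index name and namespace assumed in the upsert step:

```python
import os
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("semantic-search-index")        # assumed index name from the setup step
model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the model used for indexing

def semantic_search(query: str, top_k: int = 3):
    """Embed the query with the same model used for indexing, then search Pinecone."""
    query_vector = model.encode(query).tolist()
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,        # return review metadata alongside the scores
        namespace="walmart-reviews",  # assumed namespace from the upsert step
    )
    return results.matches

for match in semantic_search("good price for students"):
    print(f"{match.score:.3f}  {match.metadata.get('title', '')}")
```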
For the query “good price for students”, you might see:
🙌 It works beautifully! Natural language queries return highly relevant results.
This is how semantic search works:
- Query Embedding: The search query is converted to a vector using the same all-MiniLM-L6-v2 model used for indexing.
- Vector Search: Pinecone finds the most similar vectors using cosine similarity.
- Metadata Retrieval: Results include both similarity scores and associated metadata.
For full working implementation, check out: Semantic Search Client Python file.
Beyond Search: Retrieval Augmented Generation (RAG)
Once you have semantic search working, you’re just a step away from building an LLM-powered RAG system. Retrieval Augmented Generation (RAG) lets your LLM generate grounded responses using external context, like your vectorized dataset.
RAG Flow:
- User asks a question (e.g., “Is this MacBook good for college students?”).
- Semantic search retrieves relevant documents from Pinecone.
- The retrieved context + question is sent to an LLM like Google Gemini.
- The LLM responds with an answer grounded in the facts from your dataset (see the sketch below).
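A minimal sketch of that flow, reusing the semantic_search helper above and the Gemini client configured in Step 3 (the prompt wording is illustrative):

```python
import google.generativeai as genai  # assumes genai.configure(...) was called earlier

def answer_with_rag(question: str, top_k: int = 5) -> str:
    """Retrieve relevant reviews from Pinecone, then ask Gemini to answer using only that context."""
    matches = semantic_search(question, top_k=top_k)
    context = "\n\n".join(
        f"- {m.metadata.get('title', '')}: {m.metadata.get('description', '')}" for m in matches
    )
    prompt = (
        "Answer the question using ONLY the customer reviews below. "
        "If the reviews do not contain the answer, say so.\n\n"
        f"REVIEWS:\n{context}\n\nQUESTION: {question}"
    )
    model = genai.GenerativeModel("gemini-2.0-flash")
    return model.generate_content(prompt).text

print(answer_with_rag("Is this MacBook good for college students?"))
```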
Example RAG responses:
See the full code used for RAG and semantic search: RAG Chatbot Implementation.
Next Steps
You’ve successfully built a complete pipeline for creating AI-ready vector datasets. Here’s how to expand and optimize your implementation:
- Scale Data Acquisition: For more extensive data needs, explore Bright Data’s full AI-Ready Web Data Infrastructure for unlimited, compliant web data access optimized for AI models and agents.
- Experiment with Embedding Models: While all-MiniLM-L6-v2 is efficient, you may get better results for certain use cases by switching to larger or multilingual models. You can also try embedding APIs from Google Gemini and OpenAI.
- Refine Extraction Prompts: Tailor the Gemini prompt for different website structures or data schemas you need to extract.
- Leverage Advanced Pinecone Features: Explore filtering, namespaces, metadata indexing, and hybrid search by diving into the official Pinecone documentation.
- Automate the Pipeline: Integrate this pipeline into a production workflow using tools like Apache Airflow or Prefect for orchestration, or AWS Step Functions and Google Cloud Workflows for cloud-native scheduling.
- Build an AI-Powered Application: Use the semantic search or RAG components to create real-world tools such as customer support chatbots, knowledge base search, and recommendation engines.
Conclusion
You’ve successfully built a complete, robust pipeline that creates and manages AI-ready vector datasets, transforming raw web data into valuable assets for large language models. By combining Bright Data for scalable web scraping, Google Gemini for intelligent structured extraction, Sentence Transformers for generating semantic embeddings, and Pinecone for vector storage and retrieval, you’ve effectively prepared your custom data to enhance LLM applications.
This approach grounds LLMs in specific domain knowledge, delivering more accurate, relevant, and valuable AI-powered solutions.