Google’s latest open-weight AI model, Gemma 3, released in March 2025, delivers impressive performance that matches many proprietary LLMs while running efficiently on hardware with limited resources. This advancement in open-source AI works across various platforms, offering developers worldwide powerful capabilities in an accessible format.
In this guide, we’ll walk you through fine-tuning Gemma 3 on a custom question-answering dataset derived from Trustpilot reviews. We’ll use Bright Data to scrape customer reviews, process them into structured QA pairs, and leverage Unsloth for efficient fine-tuning with minimal compute. By the end, you’ll have created a specialized AI assistant that understands domain-specific questions and is ready to host on Hugging Face Hub.
Let’s dive in!
Understanding Gemma 3
Google’s Gemma 3 family launched in March 2025 with four open-weight sizes—1B, 4B, 12B, and 27B parameters—all designed to run on a single GPU.
- The 1B model is text-only with a 32K-token context window.
- The 4B, 12B, and 27B models add multimodal (text + image) input and support a 128K-token window.
On the LMArena human-preference leaderboard, Gemma 3-27B-IT scores ahead of much larger models such as Llama 3 405B and DeepSeek-V3, offering state-of-the-art quality without requiring a multi-GPU footprint.
Image Source: Introducing Gemma 3
Key Features of Gemma 3 Models
Here are some notable features of the Gemma 3 models:
- Multimodal input (text + image) is available on the 4B, 12B, and 27B models.
- Long context—up to 128K tokens (32K on the 1B model).
- Multilingual capabilities – 35+ languages supported out-of-the-box; 140+ languages pretrained.
- Quantization-Aware Training – Official QAT versions significantly reduce memory usage (by approximately 3x) while maintaining high quality.
- Function Calling & Structured Output – Built-in support for function calling and structured output, making it easier to automate tasks and consume machine-readable responses.
- Efficiency – Designed to run on a single GPU/TPU or even on consumer devices, from phones and laptops to workstations.
- Safety (ShieldGemma) – Features an integrated content filtering framework.
Why Fine-Tune Gemma 3?
Fine-tuning takes a pre-trained model like Gemma 3 and teaches it new behaviors for your specific domain or task, without the time and cost of training from scratch. With its compact design and, on the 4B+ variants, multimodal support, Gemma 3 is lightweight, affordable, and feasible to fine-tune even on hardware with limited resources.
Benefits of fine-tuning include:
- Domain specialization – Helps the model understand industry-specific language and perform better on specialized tasks within your domain.
- Knowledge enhancement – Adds important facts and context that were not part of the model’s original training data.
- Behavior refinement – Adjusts how the model responds, so it matches your brand’s tone or preferred output format.
- Resource optimization – Achieves high-quality results while using significantly fewer compute resources compared to training a new model from scratch.
Prerequisites
Before you begin this tutorial, ensure you have the following:
- Python 3.9 or higher is installed on your system.
- Basic knowledge of Python programming.
- Access to a computing environment with GPU support (e.g., Google Colab, Jupyter Notebook, or Kaggle Notebooks).
- Understanding of Machine Learning and Large Language Model (LLM) fundamentals.
- Experience using an IDE such as VS Code or similar.
You’ll also need access credentials for external services:
- A Hugging Face account and a token with Write access. 👉 Create a token here.
- A Bright Data account and API token. 👉 Sign up and follow the instructions to generate a token.
- An OpenAI account and API key. 👉 Get your API key here.
💡 Make sure your OpenAI account has sufficient credit. Manage billing here and track usage here.
Building a Custom Dataset for Fine-Tuning
Fine-tuning works best when your dataset closely reflects the behavior you want the model to learn. By creating a custom dataset tailored to your specific use case, you can dramatically improve how well the model performs. Remember the classic rule: “Garbage in, garbage out.” That’s why investing time in dataset preparation is so important.
A high-quality dataset should:
- Match your specific use case – The closer your dataset aligns with your target application, the more relevant your model’s outputs will be.
- Maintain consistent formatting – A uniform structure (like question–answer pairs) helps the model learn patterns more effectively.
- Include diverse examples – A variety of scenarios helps the model generalize across different inputs.
- Be clean and error-free – Removing inconsistencies and noise prevents the model from picking up unwanted behavior.
We’ll start with raw reviews like this:
And transform them into structured question-answer pairs like this:
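For a concrete (and purely illustrative) picture of that transformation, a simplified raw review record might look like this:

```json
{
  "review_title": "Great CRM, support could be faster",
  "review_text": "HubSpot's marketing tools are easy to use, but we waited two days for a support reply.",
  "rating": 4
}
```

And the QA pair we want to derive from it would look roughly like this (wording and field names are hypothetical, not the exact pipeline output):

```json
{
  "question": "What do customers think about HubSpot's support response times?",
  "answer": "Reviewers praise how easy HubSpot's marketing tools are to use, but some report slow support responses, with waits of up to two days for a reply."
}
```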
This dataset will teach Gemma 3 to extract insights from customer feedback, identify sentiment patterns, and provide actionable recommendations.
Setup Steps
#1 Install Libraries: Open your project environment and install all the necessary Python libraries listed in the requirements.txt file. You can do this by running the following command in your terminal or notebook:
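For a standard setup, that command is:

```bash
pip install -r requirements.txt
```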
#2 Configure Environment Variables: Create a .env file in your project’s root directory and securely store your API keys.
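A minimal .env sketch might look like the following; the variable names are assumptions and should match whatever names your scripts actually read:

```
HF_TOKEN="hf_..."
BRIGHT_DATA_API_KEY="..."
OPENAI_API_KEY="sk-..."
```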
Step 1: Data Collection with Bright Data
The crucial first step is data sourcing. To build our fine-tuning dataset, we will collect raw review data from Trustpilot. Due to Trustpilot’s robust anti-bot measures, we will use Bright Data’s Trustpilot Scraper API. This API manages IP rotation, CAPTCHA resolution, and dynamic content handling, allowing efficient collection of structured reviews at scale without the complexity of building your own scraping solution.
Here’s a Python script showing how to use Bright Data’s API to collect reviews step by step:
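The sketch below uses the `requests` library and mirrors the steps described next. The endpoint paths, parameter names, and polling logic are assumptions based on Bright Data’s dataset-trigger workflow; check the official API documentation for the exact URLs and payloads before running it.

```python
import json
import time

import requests

API_KEY = "YOUR_BRIGHT_DATA_API_KEY"        # assumption: loaded from .env in the real script
DATASET_ID = "YOUR_TRUSTPILOT_DATASET_ID"   # assumption: the Trustpilot Scraper dataset ID
TARGET_URL = "https://www.trustpilot.com/review/www.hubspot.com"

HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
BASE = "https://api.brightdata.com/datasets/v3"  # assumed base path; verify in the docs

# 1) Trigger a data collection "snapshot" for the target Trustpilot page.
trigger = requests.post(
    f"{BASE}/trigger",
    headers=HEADERS,
    params={"dataset_id": DATASET_ID},
    json=[{"url": TARGET_URL}],
)
trigger.raise_for_status()
snapshot_id = trigger.json()["snapshot_id"]

# 2) Poll the snapshot status until the collection is complete.
while True:
    progress = requests.get(f"{BASE}/progress/{snapshot_id}", headers=HEADERS).json()
    if progress.get("status") == "ready":
        break
    time.sleep(10)

# 3) Retrieve the collected reviews in JSON format.
data = requests.get(
    f"{BASE}/snapshot/{snapshot_id}", headers=HEADERS, params={"format": "json"}
).json()

# 4) Save the list of review objects to disk for the next step.
with open("trustpilot_reviews.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
```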
This script performs the following steps:
- Authentication: It uses your `API_KEY` to authenticate with the Bright Data API via the `Authorization` header.
- Trigger Collection: It sends a POST request to trigger a data collection ‘snapshot’ for the specified `TARGET_URL` (HubSpot’s Trustpilot page in this case), associated with your `DATASET_ID`.
- Wait for Completion: It periodically polls the API using the returned `snapshot_id` to check if the data collection is complete.
- Retrieve Data: Once the API indicates the data is ready, the script fetches the review data in JSON format.
- Save Output: It saves the collected list of review objects into a structured JSON file (`trustpilot_reviews.json`).
Each review in the resulting JSON file provides detailed information, such as:
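The exact schema depends on the dataset you trigger, but a single review record typically carries fields along these lines (names and values here are illustrative, not Bright Data’s exact output keys):

```json
{
  "review_id": "abc123",
  "review_title": "Great CRM for growing teams",
  "review_text": "The automation features saved us hours every week...",
  "review_rating": 5,
  "review_date": "2025-04-02",
  "reviewer_location": "US",
  "company_reply": null
}
```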
Learn how to find the best data for LLM training with our guide: Top Sources for LLM Training Data.
Step 2: Converting JSON to Markdown
After collecting the raw review data, the next step is to convert it into a clean, readable format suitable for processing. We’ll use Markdown, which offers a lightweight plain-text structure that reduces noise during tokenization, potentially improves fine-tuning performance, and ensures consistent separation between different content sections.
To perform the conversion, simply run this script 👉 convert-trustpilot-json-to-markdown.py
This script reads the JSON data from the output of Step 1 and generates a Markdown file containing a structured summary and individual customer reviews.
Here’s an example of the Markdown output structure:
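The generated file follows a simple, repeatable layout, roughly like this (headings and fields are illustrative of the script’s output, not copied from it):

```markdown
# HubSpot — Trustpilot Review Summary

- Total reviews: 500
- Average rating: 4.4

## Review 1
**Rating:** 5
**Date:** 2025-04-02

The automation features saved us hours every week...

## Review 2
...
```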
Learn why AI agents prefer Markdown over HTML by reading more in our guide.
Step 3: Chunking and Processing the Document
With the Markdown document ready, the next crucial step is to split it into smaller, manageable chunks. This is important because Large Language Models (LLMs) have input token limits, and fine-tuning often works best with examples of an appropriate length. Additionally, processing these chunks can improve their clarity and coherence for the model.
We use LangChain’s `RecursiveCharacterTextSplitter` to split the Markdown file. This method recursively breaks down text based on a list of separators, which helps keep related pieces of text together. To preserve context that might span across split points, we apply an overlap between consecutive chunks. For this process, we use a chunk size of 1,024 characters with a 256-character overlap, as shown in the sketch below.
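A minimal sketch of the splitting step, using the chunk size and overlap mentioned above (the input file name is assumed to be the Markdown file produced in Step 2):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read the Markdown produced in Step 2 (file name is an assumption).
with open("trustpilot_reviews.md", "r", encoding="utf-8") as f:
    markdown_text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,    # characters per chunk
    chunk_overlap=256,  # overlap to preserve context across split points
)
chunks = splitter.split_text(markdown_text)
print(f"Created {len(chunks)} chunks")
```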
After splitting, each chunk is optionally passed to an LLM (like GPT-4o) to improve its overall clarity and coherence while strictly maintaining the original meaning of the review text. This enhancement step aims to make the data structure and content within each chunk optimally clear for the subsequent fine-tuning process.
Each processed chunk is then assigned a unique identifier and stored in JSON Lines (`.jsonl`) format, preparing the chunks for the next stage of the pipeline.
Here’s the Python function using the LLM for clarity improvement:
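The complete implementation lives in the script linked below; a sketch of what such a function can look like with the OpenAI Python SDK (the prompt wording and model choice are assumptions) is:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def improve_chunk_clarity(chunk_text: str, model: str = "gpt-4o") -> str:
    """Ask the LLM to tidy up a review chunk without changing its meaning."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You edit customer review excerpts. Improve clarity, grammar, and "
                    "structure, but strictly preserve the original meaning, facts, and "
                    "sentiment. Return only the edited text."
                ),
            },
            {"role": "user", "content": chunk_text},
        ],
    )
    return response.choices[0].message.content.strip()
```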
Find the complete code for this step here 👉 split-markdown-into-chunks.py
The output is a JSON Lines file where each line represents a review chunk with a unique identifier and the potentially improved review content:
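Each line is a standalone JSON object, for example (identifier scheme and field names are illustrative):

```json
{"chunk_id": "chunk_001", "text": "Customers repeatedly praise the ease of onboarding, though several mention that pricing rises quickly as contact lists grow..."}
```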
Step 4: Generating QA Pairs
The final data preparation step transforms the processed review chunks into structured Question–Answer (QA) pairs suitable for fine-tuning a language model. We use OpenAI’s GPT-4o to generate one QA pair for each chunk in the `.jsonl` file created in Step 3.
For each chunk, the script calls the OpenAI API using a carefully designed system prompt:
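The exact prompt lives in generate-qa-pairs.py; a sketch of the idea, with illustrative wording, looks like this:

```python
import json

SYSTEM_PROMPT = (
    "You are an analyst of customer reviews. Given a chunk of Trustpilot review text, "
    "write exactly one question a business stakeholder might ask about it, and a concise, "
    "factual answer grounded only in that chunk. Respond as JSON with 'question' and "
    "'answer' keys."
)

def generate_qa_pair(client, chunk_text: str) -> dict:
    """Generate one QA pair for a review chunk; `client` is an openai.OpenAI instance."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": chunk_text},
        ],
    )
    return json.loads(response.choices[0].message.content)
```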
The script includes built-in rate limiting and retry mechanisms to handle temporary API interruptions and ensure stable execution. You can find the complete implementation in generate-qa-pairs.py.
The output is saved as a JSON array, where each object contains a generated question and answer pair, linked by the original chunk’s ID:
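A trimmed-down example of the resulting file (content is illustrative):

```json
[
  {
    "chunk_id": "chunk_001",
    "question": "How do customers feel about the onboarding experience?",
    "answer": "Reviewers describe onboarding as straightforward, though several note that costs increase quickly as their contact lists grow."
  }
]
```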
Once generated, it’s highly recommended to push the resulting QA dataset to the Hugging Face Hub. This makes it easily accessible for fine-tuning and sharing. You can see an example of the published dataset here: trustpilot-reviews-qa-dataset.
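Pushing the dataset can be done in a few lines with the `datasets` library; the local file name and Hub repo ID below are placeholders:

```python
import json

from datasets import Dataset

# Load the generated QA pairs (output file name is an assumption).
with open("qa_pairs.json", "r", encoding="utf-8") as f:
    qa_pairs = json.load(f)

# Push to the Hub (requires a prior `huggingface-cli login` or token).
Dataset.from_list(qa_pairs).push_to_hub("your-username/trustpilot-reviews-qa-dataset")
```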
Fine-Tuning Gemma 3 with Unsloth: Step-by-Step
Now that we have our custom Q&A dataset prepared, let’s fine-tune the Gemma 3 model. We’ll use Unsloth, an open-source library that provides significant memory and speed improvements for LoRA/QLoRA training compared to standard Hugging Face implementations. These optimizations make fine-tuning models like Gemma 3 more accessible on single-GPU setups, provided the GPU has sufficient VRAM.
| Gemma 3 Size | Approximate VRAM Required* | Suitable Platforms |
|---|---|---|
| 4B | ~15 GB | Free Google Colab (T4), Kaggle (P100 16 GB) |
| 12B | ≥24 GB | Colab Pro+ (A100/A10), RTX 4090, A40 |
| 27B | 22–24 GB (with 4-bit QLoRA, batch size = 1); ~40 GB otherwise | A100 40 GB, H100, multi-GPU setups |
Note: VRAM requirements can vary based on batch size, sequence length, and specific quantization techniques. The requirement for the 27B model is with 4-bit QLoRA and a small batch size (e.g., 1 or 2); higher batch sizes or less aggressive quantization will require substantially more VRAM (~40 GB+).
For beginners, starting with the 4B model on a free Colab notebook is recommended, as it comfortably supports loading, training, and deployment with Unsloth. Upgrading to the 12B or 27B models should only be considered when access to higher-VRAM GPUs or paid cloud tiers is available.
To change the runtime type in Google Colab and select a T4 GPU, follow these steps:
- Click on the Runtime menu at the top.
- Select Change runtime type.
- In the dialog that appears, under Hardware accelerator, choose GPU.
- Click Save to apply the changes.
Step 1: Setting Up the Environment
First, install the necessary libraries. If you are in a Colab or Jupyter environment, you can run these commands directly in a code cell.
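A sketch of the install cell, covering the packages described below (version pins are omitted here; Unsloth’s own notebooks may pin specific versions):

```python
# Run in a Colab/Jupyter code cell.
%pip install --upgrade unsloth peft trl bitsandbytes accelerate datasets transformers huggingface_hub
# Optional, only needed if you plan to serve the model with vLLM later:
# %pip install vllm
```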
Here’s a brief explanation of the key packages installed:
- `unsloth`: Provides the core optimizations for faster and more memory-efficient LLM training and loading, using techniques like fused kernels.
- `peft`: Parameter-Efficient Fine-Tuning methods (like LoRA). Allows training only a small number of additional parameters instead of the full model.
- `trl`: Transformer Reinforcement Learning. Includes the `SFTTrainer`, which simplifies the process of supervised fine-tuning.
- `bitsandbytes`: Enables k-bit (4-bit and 8-bit) quantization, dramatically reducing the model’s memory footprint.
- `accelerate`: Hugging Face library to seamlessly run PyTorch training across various hardware setups (single GPU, multi-GPU, etc.).
- `datasets`: Hugging Face library for loading, processing, and managing datasets efficiently.
- `transformers`: Hugging Face’s core library for pre-trained models, tokenizers, and utilities.
- `huggingface_hub`: Utilities to interact with the Hugging Face Hub (login, download, upload).
- `vllm` (Optional): A fast LLM inference engine. Can be installed separately if needed for deployment.
Step 2: Hugging Face Authentication
You’ll need to log in to the Hugging Face Hub from your environment to download the model and potentially upload the fine-tuned result later.
In Google Colab, the most secure way to manage your Hugging Face token is by using the “Secrets” tab:
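With the token stored as a Colab secret (the secret name HF_TOKEN here is an assumption), logging in looks like this:

```python
from huggingface_hub import login

try:
    # In Colab, read the token from the Secrets tab.
    from google.colab import userdata
    login(userdata.get("HF_TOKEN"))
except ImportError:
    # Outside Colab, fall back to an interactive login prompt.
    login()
```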
Step 3: Loading the Model and Tokenizer
To begin fine-tuning, we will efficiently load the Gemma 3 instruction-tuned model using Unsloth’s `FastModel`. For this example, we’ll use the `unsloth/gemma-3-4b-it` model, which is a 4-bit quantized version optimized by Unsloth to fit within the memory constraints of typical Colab GPUs.
Check out Unsloth’s Gemma 3 collection on Hugging Face. It includes models in 1B, 4B, 12B, and 27B sizes, available in GGUF, 4-bit, and 16-bit formats.
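Reconstructed from the explanation below, the loading code looks roughly like this; treat it as a sketch rather than the exact original cell:

```python
from unsloth import FastModel
from unsloth.chat_templates import get_chat_template

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it",
    max_seq_length=2048,   # adjust to your chunk lengths / desired context window
    load_in_4bit=True,     # 4-bit weights via bitsandbytes to fit limited VRAM
    full_finetuning=False, # prepare for PEFT/LoRA instead of full fine-tuning
)

# Wrap the tokenizer so prompts are rendered in Gemma 3's chat format.
tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")
```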
What’s happening in this code:
- `FastModel.from_pretrained()`: Unsloth’s optimized model loader.
- `model_name="unsloth/gemma-3-4b-it"`: Specifies the model variant to load. We choose the 4B instruction-tuned (`it`) version, pre-optimized by Unsloth.
- `max_seq_length=2048`: Sets the maximum number of tokens the model can process at once. Adjust this based on the length of your data chunks and desired context window, balancing memory usage and the ability to process longer inputs.
- `load_in_4bit=True`: Essential for training on limited VRAM. This loads the model weights in 4-bit precision using `bitsandbytes`.
- `full_finetuning=False`: Tells Unsloth to prepare the model for PEFT/LoRA fine-tuning, meaning only adapter layers will be trained, not all model parameters.
- `get_chat_template(tokenizer, chat_template="gemma-3")`: Wraps the tokenizer to automatically format prompts into Gemma 3’s expected chat format (`<start_of_turn>user\n...\n<end_of_turn><start_of_turn>model\n...\n<end_of_turn>`). This is crucial for fine-tuning instruction-following models correctly and ensuring the model learns to generate responses in the expected conversational turns.
Step 4: Loading and Preparing the Dataset for Training
We load the dataset we previously uploaded to the Hugging Face Hub and then transform it into the chat-based format expected by the tokenizer and trainer.
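A sketch of this step (the dataset repo ID is a placeholder; point it at the QA dataset you pushed in the previous section):

```python
from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats

# Placeholder repo ID — use the QA dataset you pushed to the Hub.
dataset = load_dataset("your-username/trustpilot-reviews-qa-dataset", split="train")
dataset = standardize_data_formats(dataset)

def formatting_prompts_func(examples):
    """Render each Q&A pair with Gemma 3's chat template into a single 'text' string."""
    texts = []
    for question, answer in zip(examples["question"], examples["answer"]):
        convo = [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
        text = tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        texts.append(text.removeprefix("<bos>"))  # SFTTrainer adds its own <bos>
    return {"text": texts}

dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
    remove_columns=dataset.column_names,
)
```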
In this code:
- `load_dataset()`: Fetches our Q&A dataset from the Hugging Face Hub.
- `standardize_data_formats()`: Ensures consistent field names across different datasets, specifically looking for ‘question’ and ‘answer’ in this case.
- `formatting_prompts_func()`: This critical function processes batches of our Q&A pairs. It uses the `tokenizer.apply_chat_template()` method to convert them into strings formatted correctly for Gemma 3 instruction fine-tuning. This format includes special turn tokens like `<start_of_turn>user\n` and `<start_of_turn>model\n`, which are essential for the model to understand conversational structure. We remove the initial `<bos>` token as the `SFTTrainer` adds its own.
- `dataset.map(...)`: Applies the `formatting_prompts_func` to the entire dataset efficiently, creating a new ‘text’ column containing the formatted strings and removing the original columns.
Step 5: Configuring LoRA and the Trainer
Now we configure the PEFT (LoRA) settings and the `SFTTrainer` from the `trl` library. LoRA works by injecting small, trainable matrices into key layers of the pre-trained model. Only these small adapter matrices are updated during fine-tuning, drastically reducing the number of parameters to train and thus minimizing memory usage.
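The sketch below matches the parameters discussed next; the specific rank, learning rate, and batch settings are reasonable defaults rather than the exact values of the original run:

```python
from trl import SFTConfig, SFTTrainer
from unsloth.chat_templates import train_on_responses_only

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastModel.get_peft_model(
    model,
    r=8,
    lora_alpha=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving checkpointing
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size = 2 * 4 = 8
        max_steps=200,
        learning_rate=2e-4,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        logging_steps=10,
        output_dir="outputs",
        report_to="none",
    ),
)

# Compute the loss only on the model's responses, not on the user prompts.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n",
    response_part="<start_of_turn>model\n",
)
```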
In this code:
- `FastModel.get_peft_model()`: Configures the loaded model for LoRA fine-tuning with the specified parameters. `r` is the LoRA rank, controlling the size of the adapter matrices. `target_modules` specifies which model layers (like attention and MLP projections) will receive these adapters. `lora_alpha` is a scaling factor. `use_gradient_checkpointing` is a memory-saving technique provided by Unsloth.
- `SFTConfig()`: Defines the training hyperparameters for the `SFTTrainer`. `per_device_train_batch_size` and `gradient_accumulation_steps` work together to determine the effective batch size used for calculating gradients. `max_steps` sets the total training iterations. `learning_rate`, `optim`, `weight_decay`, and `lr_scheduler_type` control the optimization process. `dataset_text_field` tells the trainer which column in the dataset contains the formatted training examples.
- `SFTTrainer()`: Instantiates the trainer, bringing together the LoRA-configured model, the prepared dataset, the tokenizer, and the training arguments defined in `SFTConfig`.
- `train_on_responses_only()`: A utility function (provided by Unsloth’s chat-template utilities and compatible with the `trl` trainer) that modifies the trainer’s loss calculation. It sets the loss to be computed only on the tokens corresponding to the model’s expected response (`<start_of_turn>model\n...`), ignoring the tokens of the user’s prompt (`<start_of_turn>user\n...`). This is essential for teaching the model to generate relevant answers rather than simply repeating or completing the input prompt. We provide the exact string prefixes used in the chat template to delineate these sections.
Step 6: Training the Model
With everything set up, we can initiate the fine-tuning process. The `trainer.train()` method handles the training loop based on the configuration provided in the `SFTConfig`.
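Starting the run is a single call:

```python
# Launch fine-tuning; returns a TrainOutput with loss and runtime statistics.
trainer_stats = trainer.train()
```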
The trainer will output progress updates, including the training loss. You should observe the loss decreasing over steps, indicating that the model is learning from the data. The total training time depends on the dataset size, model size, hyperparameters, and the specific GPU used. For our example dataset and the 4B model on a T4 GPU, training for 200 steps should complete in roughly 15–30 minutes, depending on the exact setup and data length.
Step 7: Testing the Fine-Tuned Model (Inference)
After training, let’s test our fine-tuned model to see how well it responds to questions based on the Trustpilot review data it was trained on. We’ll use the `model.generate` method with a `TextStreamer` for more interactive output.
See the model’s responses in the image below:
🔥 Great, it’s working fine!
A successful fine-tuning process means the model generates answers that are more analytical and directly derived from the review content it was fine-tuned on, reflecting the style and insights present in your custom dataset, rather than generic responses.
Step 8: Saving and Pushing Your Fine-Tuned Model
Finally, save your fine-tuned LoRA adapters and tokenizer. You can save them locally and also push them to the Hugging Face Hub for easy sharing, versioning, and deployment.
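A sketch of the save-and-push step; the local folder and Hub repo names are placeholders:

```python
# Save the LoRA adapters and tokenizer locally.
model.save_pretrained("gemma-3-4b-trustpilot-qa")
tokenizer.save_pretrained("gemma-3-4b-trustpilot-qa")

# Push to the Hugging Face Hub (requires the earlier login with a Write token).
model.push_to_hub("your-username/gemma-3-4b-trustpilot-qa")
tokenizer.push_to_hub("your-username/gemma-3-4b-trustpilot-qa")
```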
The fine-tuned model is now available on Hugging Face Hub:
Conclusion
This guide demonstrated an end-to-end approach to fine-tuning Google’s Gemma 3 for a practical use case: generating analytical answers from customer reviews. We covered the entire workflow—from collecting high-quality, domain-specific data via Bright Data’s web scraper API, structuring it into a QA format using LLM-powered processing, to fine-tuning the Gemma 3 4B model efficiently using the Unsloth library on resource-constrained hardware.
The result is a specialized LLM that is adept at extracting insights and interpreting sentiment from raw review data, transforming it into structured, actionable answers. This method is highly adaptable—you can apply this same workflow to fine-tune Gemma 3 (or other suitable LLMs) on various domain-specific datasets to create AI assistants tailored to different needs.
For further exploration into AI-driven data extraction strategies, consider these additional resources:
- Web Scraping with LLaMA 3
- Web Scraping with MCP Servers
- AI-Powered Scraping with LLM-Scraper
- ScrapeGraphAI for LLM Web Scraping
For more fine-tuning optimizations and examples using Unsloth, check out the Unsloth Notebooks Collection.