
Fine-Tuning Gemma 3: A Step-by-Step Guide with Custom Q&A Dataset

Fine-tune Google Gemma 3 on custom Trustpilot QA data, from review scraping to deployment, with Bright Data and Unsloth.

Google’s latest open-weight AI model, Gemma 3, released in March 2025, delivers impressive performance that matches many proprietary LLMs while running efficiently on hardware with limited resources. This advancement in open-source AI works across various platforms, offering developers worldwide powerful capabilities in an accessible format.

In this guide, we’ll walk you through fine-tuning Gemma 3 on a custom question-answering dataset derived from Trustpilot reviews. We’ll use Bright Data to scrape customer reviews, process them into structured QA pairs, and leverage Unsloth for efficient fine-tuning with minimal compute. By the end, you’ll have created a specialized AI assistant that understands domain-specific questions and is ready to host on Hugging Face Hub.

Let’s dive in!

Understanding Gemma 3

Google’s Gemma 3 family launched in March 2025 with four open-weight sizes—1B, 4B, 12B, and 27B parameters—all designed to run on a single GPU.

  • The 1B model is text-only with a 32K-token context window.
  • The 4B, 12B, and 27B models add multimodal (text + image) input and support a 128K-token window.

On the LMArena human-preference leaderboard, Gemma 3-27B-IT scores ahead of much larger models such as Llama 3 405B and DeepSeek-V3, offering state-of-the-art quality without requiring a multi-GPU footprint.

[Figure: Model Elo score versus model size (billions of parameters); Gemma 3 27B IT is highlighted alongside models such as Qwen 2.5 72B Instruct, Llama 3.3 70B Instruct, and Meta Llama 3.1 70B Instruct.]

Image Source: Introducing Gemma 3

Key Features of Gemma 3 Models

Here are some notable features of the Gemma 3 models:

  • Multimodal input (text + image) is available on the 4B, 12B, and 27B models.
  • Long context—up to 128K tokens (32K on the 1B model).
  • Multilingual capabilities – 35+ languages supported out-of-the-box; 140+ languages pretrained.
  • Quantization-Aware Training (QAT) – Official QAT versions significantly reduce memory usage (by approximately 3x) while maintaining high quality.
  • Function Calling & Structured Output – Built-in support for function calling and structured responses, making it easier to automate tasks.
  • Efficiency – Designed to run on a single GPU/TPU or even on consumer devices, from phones and laptops to workstations.
  • Safety (ShieldGemma) – Features an integrated content filtering framework.

Why Fine-Tune Gemma 3?

Fine-tuning takes a pre-trained model like Gemma 3 and teaches it new behaviors for your specific domain or task, without the time and cost of training from scratch. With its compact design and, on the 4B+ variants, multimodal support, Gemma 3 is lightweight, affordable, and feasible to fine-tune even on hardware with limited resources.

Benefits of fine-tuning include:

  • Domain specialization – Helps the model understand industry-specific language and perform better on specialized tasks within your domain.
  • Knowledge enhancement – Adds important facts and context that were not part of the model’s original training data.
  • Behavior refinement – Adjusts how the model responds, so it matches your brand’s tone or preferred output format.
  • Resource optimization – Achieves high-quality results while using significantly fewer compute resources compared to training a new model from scratch.

Prerequisites

Before you begin this tutorial, ensure you have the following:

  • Python 3.9 or higher installed on your system.
  • Basic knowledge of Python programming.
  • Access to a computing environment with GPU support (e.g., Google Colab, Jupyter Notebook, or Kaggle Notebooks).
  • Understanding of Machine Learning and Large Language Model (LLM) fundamentals.
  • Experience with an IDE such as VS Code or similar.

You’ll also need access credentials for external services:

  • A Bright Data account and API key (for the Trustpilot Scraper API used in Step 1).
  • An OpenAI API key (for the LLM-powered chunk refinement and QA generation steps).
  • A Hugging Face account and access token (for pushing the dataset and fine-tuned model to the Hub).

Building a Custom Dataset for Fine-Tuning

Fine-tuning works best when your dataset closely reflects the behavior you want the model to learn. By creating a custom dataset tailored to your specific use case, you can dramatically improve how well the model performs. Remember the classic rule: “Garbage in, garbage out.” That’s why investing time in dataset preparation is so important.

A high-quality dataset should:

  • Match your specific use case – The closer your dataset aligns with your target application, the more relevant your model’s outputs will be.
  • Maintain consistent formatting – A uniform structure (like question–answer pairs) helps the model learn patterns more effectively.
  • Include diverse examples – A variety of scenarios helps the model generalize across different inputs.
  • Be clean and error-free – Removing inconsistencies and noise prevents the model from picking up unwanted behavior.

We’ll start with raw reviews like this:

[Screenshot: a Trustpilot business profile in the Electronics & Technology category showing a 2.3-star rating from 873 reviews.]

And transform them into structured question-answer pairs like this:

[Screenshot: a dataset table with id, question, and answer columns; the questions ask about HubSpot's customer support and satisfaction, and the answers summarize insights from the reviews.]

This dataset will teach Gemma 3 to extract insights from customer feedback, identify sentiment patterns, and provide actionable recommendations.

Setup Steps

#1 Install Libraries: Open your project environment and install all the necessary Python libraries listed in the requirements.txt file. You can do this by running the following command in your terminal or notebook:

pip install -r requirements.txt

#2 Configure Environment Variables: Create a .env file in your project’s root directory and securely store your API keys.

OPENAI_API_KEY="your_openai_key_here"
HF_TOKEN="your_hugging_face_token_here"
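
If you are running the data-preparation scripts locally (outside Colab), you can load these variables with python-dotenv. A minimal sketch, assuming python-dotenv is included in your requirements.txt:

import os

from dotenv import load_dotenv

# Load variables from the .env file in the project root
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
HF_TOKEN = os.getenv("HF_TOKEN")

if not OPENAI_API_KEY or not HF_TOKEN:
    raise ValueError("Missing OPENAI_API_KEY or HF_TOKEN in your .env file.")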

Step 1: Data Collection with Bright Data

The crucial first step is data sourcing. To build our fine-tuning dataset, we will collect raw review data from Trustpilot. Because Trustpilot employs robust anti-bot measures, we will use Bright Data’s Trustpilot Scraper API, which handles IP rotation, CAPTCHA resolution, and dynamic content, letting you collect structured reviews at scale without building your own scraping solution.

Here’s a Python script showing how to use Bright Data’s API to collect reviews step by step:

import time
import json
import requests
from typing import Optional

# --- Configuration ---
API_KEY = "YOUR_API_KEY"  # Replace with your Bright Data API key
DATASET_ID = "gd_lm5zmhwd2sni130p"  # Replace with your Dataset ID
TARGET_URL = "https://www.trustpilot.com/review/hubspot.com"  # Target company page
OUTPUT_FILE = "trustpilot_reviews.json"  # Output file name
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
TIMEOUT = 30  # Request timeout in seconds

# --- Functions ---
def trigger_snapshot() -> Optional[str]:
    """Triggers a Bright Data snapshot collection job."""
    print(f"Triggering snapshot for: {TARGET_URL}")
    try:
        resp = requests.post(
            "https://api.brightdata.com/datasets/v3/trigger",
            headers=HEADERS,
            params={"dataset_id": DATASET_ID},
            json=[{"url": TARGET_URL}],
            timeout=TIMEOUT,
        )
        resp.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        snapshot_id = resp.json().get("snapshot_id")
        print(f"Snapshot triggered successfully. ID: {snapshot_id}")
        return snapshot_id
    except requests.RequestException as e:
        print(f"Error triggering snapshot: {e}")
    except json.JSONDecodeError:
        print(f"Error decoding trigger response: {resp.text}")
    return None

def wait_for_snapshot(snapshot_id: str) -> Optional[list]:
    """Polls the API until snapshot data is ready and returns it."""
    check_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}"
    print(f"Waiting for snapshot {snapshot_id} to complete...")
    while True:
        try:
            resp = requests.get(
                check_url,
                headers=HEADERS,
                params={"format": "json"},
                timeout=TIMEOUT,
            )
            resp.raise_for_status()
            # Check whether the response is the final data (list) or status info (dict)
            data = resp.json()
            if isinstance(data, list):
                print("Snapshot data is ready.")
                return data
        except requests.RequestException as e:
            print(f"Error checking snapshot status: {e}")
            return None  # Stop polling on error
        except json.JSONDecodeError:
            print(f"Error decoding snapshot status response: {resp.text}")
            return None  # Stop polling on error

        print("Data not ready yet. Waiting 30 seconds...")
        time.sleep(30)

def save_reviews(reviews: list, output_file: str) -> bool:
    """Saves the collected reviews list to a JSON file."""
    try:
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(reviews, f, indent=2, ensure_ascii=False)
        print(f"Successfully saved {len(reviews)} reviews to {output_file}")
        return True
    except (IOError, TypeError) as e:
        print(f"Error saving reviews to file: {e}")
        return False
    except Exception as e:
        print(f"An unexpected error occurred during saving: {e}")
        return False

def main():
    """Main execution flow for collecting Trustpilot reviews."""
    print("Starting Trustpilot review collection process...")
    snapshot_id = trigger_snapshot()
    if not snapshot_id:
        print("Failed to trigger snapshot. Exiting.")
        return

    reviews = wait_for_snapshot(snapshot_id)
    if not reviews:
        print("Failed to retrieve reviews from snapshot. Exiting.")
        return

    if not save_reviews(reviews, OUTPUT_FILE):
        print("Failed to save the collected reviews.")
    else:
        print("Review collection process completed.")

if __name__ == "__main__":
    main()

This script performs the following steps:

  • Authentication: It uses your API_KEY to authenticate with the Bright Data API via the Authorization header.
  • Trigger Collection: It sends a POST request to trigger a data collection ‘snapshot’ for the specified TARGET_URL (HubSpot’s Trustpilot page in this case), associated with your DATASET_ID.
  • Wait for Completion: It periodically polls the API using the returned snapshot_id to check if the data collection is complete.
  • Retrieve Data: Once the API indicates the data is ready, the script fetches the review data in JSON format.
  • Save Output: It saves the collected list of review objects into a structured JSON file (trustpilot_reviews.json).

Each review in the resulting JSON file provides detailed information, such as:

{
    "review_id": "680af52fb0bab688237f75c5",
    "review_date": "2025-04-25T04:36:31.000Z",
    "review_rating": 1,
    "review_title": "Cancel Auto Renewal Doesn't Work",
    "review_content": "I was with Hubspot for almost 3 years...",
    "reviewer_name": "Steven Barrett",
    "reviewer_location": "AU",
    "is_verified_review": false,
    "review_date_of_experience": "2025-04-19T00:00:00.000Z",
    // Additional fields omitted for brevity
}

Learn how to find the best data for LLM training with our guide: Top Sources for LLM Training Data.

Step 2: Converting JSON to Markdown

After collecting the raw review data, the next step is to convert it into a clean, readable format suitable for processing. We’ll use Markdown, which offers a lightweight plain-text structure that reduces noise during tokenization, potentially improves fine-tuning performance, and ensures consistent separation between different content sections.

To perform the conversion, simply run this script 👉 convert-trustpilot-json-to-markdown.py

This script reads the JSON data from the output of Step 1 and generates a Markdown file containing a structured summary and individual customer reviews.
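
If you just want the gist of the conversion, the core logic is to loop over the review objects and render one Markdown block per review. Here is a minimal, illustrative sketch (not the full script; field names follow the JSON sample from Step 1, and the output file name is an assumption):

import json
from datetime import datetime

def format_date(iso_date: str) -> str:
    """Turn an ISO timestamp like '2025-04-25T04:36:31.000Z' into 'April 25, 2025'."""
    return datetime.fromisoformat(iso_date.replace("Z", "+00:00")).strftime("%B %d, %Y")

def review_to_markdown(r: dict) -> str:
    """Render a single review object as a Markdown section."""
    return (
        f"### Review by {r['reviewer_name']} ({r['reviewer_location']})\n"
        f"- **Posted on**: {format_date(r['review_date'])}\n"
        f"- **Experience Date**: {format_date(r['review_date_of_experience'])}\n"
        f"- **Rating**: {r['review_rating']}\n"
        f"- **Title**: *{r['review_title']}*\n\n"
        f"{r['review_content']}\n\n---\n\n"
    )

with open("trustpilot_reviews.json", encoding="utf-8") as f:
    reviews = json.load(f)

markdown = "# HubSpot Review Summary\n\n" + "".join(review_to_markdown(r) for r in reviews)

with open("trustpilot_reviews.md", "w", encoding="utf-8") as f:  # Output file name is illustrative
    f.write(markdown)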

Here’s an example of the Markdown output structure:

# HubSpot Review Summary
[Visit Website](https://www.hubspot.com/)
**Overall Rating**: 2.3
**Total Reviews**: 873
**Location**: United States
**Industry**: Electronics & Technology

> HubSpot is a leading growth platform... Grow Better.
---

### Review by Steven Barrett (AU)
- **Posted on**: April 25, 2025
- **Experience Date**: April 19, 2025
- **Rating**: 1
- **Title**: *Cancel Auto Renewal Doesn't Work*

I was with Hubspot for almost 3 years... Avoid.

[View Full Review](https://www.trustpilot.com/reviews/680af52fb0bab688237f75c5)

---

Learn why AI agents prefer Markdown over HTML in our guide.

Step 3: Chunking and Processing the Document

With the Markdown document ready, the next crucial step is to split it into smaller, manageable chunks. This is important because Large Language Models (LLMs) have input token limits, and fine-tuning often works best with examples of an appropriate length. Additionally, processing these chunks can improve their clarity and coherence for the model.

We use LangChain’s RecursiveCharacterTextSplitter to split the Markdown file. This method recursively breaks down text based on a list of separators, which helps keep related pieces of text together. To preserve context that might span across split points, we apply an overlap between consecutive chunks. For this process, we use a chunk size of 1,024 characters with a 256-character overlap.
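
For reference, the splitting step might look roughly like this (a sketch assuming the langchain-text-splitters package and the Markdown file produced in Step 2; the full logic lives in the script linked below):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Read the Markdown produced in Step 2 (file name is an assumption)
with open("trustpilot_reviews.md", encoding="utf-8") as f:
    markdown_text = f.read()

# Split recursively on Markdown-friendly separators, keeping 256 characters of overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=256,
    separators=["\n### ", "\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(markdown_text)
print(f"Created {len(chunks)} chunks")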

After splitting, each chunk is optionally passed to an LLM (like GPT-4o) to improve its overall clarity and coherence while strictly maintaining the original meaning of the review text. This enhancement step aims to make the data structure and content within each chunk optimally clear for the subsequent fine-tuning process.

Each processed chunk is then assigned a unique identifier and stored in a JSON Lines (.jsonl) file format, preparing them for the next stage of the pipeline.

Here’s the Python function using the LLM for clarity improvement:

from openai import OpenAI

def improve_review_chunk(text: str, client: OpenAI, model: str = "gpt-4o") -> str:
    """Ask the LLM to improve a chunk's clarity while preserving its meaning."""
    system_prompt = (
        "Improve this review's clarity while preserving its meaning. "
        "Return only the improved text without additional commentary."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            # The review chunk itself is passed as the user message
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

Find the complete code for this step here 👉 split-markdown-into-chunks.py

The output is a JSON Lines file where each line represents a review chunk with a unique identifier and the potentially improved review content:

{"id": "f8a3b1c9-e4d5-4f6a-8b7c-2d9e0a1b3c4d", "review": "# HubSpot Review Summary\n\n[Visit Website](https://www.hubspot.com/)...\n---\n\n### Review by Steven Barrett (AU)\n- **Posted on**: April 25, 2025...\n- **Rating**: 1\n- **Title**: *Cancel Auto Renewal Doesn't Work*\n\nI was with Hubspot for almost 3 years... [text continues, potentially improved by the LLM]"}
{"id": "...", "review": "..."}

Step 4: Generating QA Pairs

The final data preparation step transforms the processed review chunks into structured Question–Answer (QA) pairs suitable for fine-tuning a language model. We use OpenAI’s GPT-4o to generate one QA pair for each chunk in the .jsonl file created in Step 3.

For each chunk, the script calls the OpenAI API using a carefully designed system prompt:

SYSTEM_PROMPT = """
You are an expert at transforming customer reviews into insightful question–answer pairs. For each review, generate exactly 1 high-quality QA pair.

PURPOSE:
These QA pairs will train a customer service AI to understand feedback patterns about HubSpot products and identify actionable insights.

GUIDELINES FOR QUESTIONS:
- Make questions general and applicable to similar situations
- Phrase from a stakeholder perspective (e.g., "What feature gaps are causing customer frustration?")
- Focus on product features, usability, pricing, or service impact

GUIDELINES FOR ANSWERS:
- Provide analytical responses (3–5 sentences)
- Extract insights without quoting verbatim
- Offer actionable recommendations
- Maintain objectivity and clarity

FORMAT REQUIREMENTS:
- Start with "Q: " followed by your question
- Then "A: " followed by a plain-text answer
"""

The script includes built-in rate limiting and retry mechanisms to handle temporary API interruptions and ensure stable execution. You can find the complete implementation in generate-qa-pairs.py.
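
To make the flow concrete, here is a stripped-down sketch of the per-chunk call (it reuses the SYSTEM_PROMPT above and the chunk file from Step 3; file names are illustrative, and the real script adds the rate limiting and retries mentioned above):

import json
import re

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def generate_qa_pair(chunk_id: str, review_text: str) -> dict:
    """Generate one Q&A pair for a single review chunk using GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": review_text},
        ],
    )
    content = response.choices[0].message.content
    # Split the "Q: ... A: ..." response into separate fields
    match = re.search(r"Q:\s*(.+?)\s*A:\s*(.+)", content, re.DOTALL)
    question, answer = (match.group(1), match.group(2)) if match else ("", content)
    return {"id": chunk_id, "question": question.strip(), "answer": answer.strip()}

qa_pairs = []
with open("review_chunks.jsonl", encoding="utf-8") as f:  # Chunk file from Step 3 (name assumed)
    for line in f:
        chunk = json.loads(line)
        qa_pairs.append(generate_qa_pair(chunk["id"], chunk["review"]))

with open("qa_pairs.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, indent=2, ensure_ascii=False)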

The output is saved as a JSON array, where each object contains a generated question and answer pair, linked by the original chunk’s ID:

[
  {
    "id": "82d53a10-9f37-4d03-8d3b-38812e39ecdc",
    "question": "How can pricing and customer support issues impact customer satisfaction and retention for HubSpot?",
    "answer": "Pricing concerns, particularly when customers feel they are overpaying for services they find unusable or unsupported, can significantly impact customer satisfaction and retention..."
  }
  // ... more QA pairs
]

Once generated, it’s highly recommended to push the resulting QA dataset to the Hugging Face Hub. This makes it easily accessible for fine-tuning and sharing. You can see an example of the published dataset here: trustpilot-reviews-qa-dataset.
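
Publishing the file takes only a few lines with the datasets library. A minimal sketch (the repository name is an example; replace YOUR_HF_USERNAME with your own account):

import json
import os

from datasets import Dataset

# Load the QA pairs produced in Step 4 (file name is illustrative)
with open("qa_pairs.json", encoding="utf-8") as f:
    qa_pairs = json.load(f)

dataset = Dataset.from_list(qa_pairs)  # Columns: id, question, answer
dataset.push_to_hub(
    "YOUR_HF_USERNAME/trustpilot-reviews-qa-dataset",
    token=os.getenv("HF_TOKEN"),  # Hugging Face token from your .env / environment
)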

Fine-Tuning Gemma 3 with Unsloth: Step-by-Step

Now that we have our custom Q&A dataset prepared, let’s fine-tune the Gemma 3 model. We’ll use Unsloth, an open-source library that provides significant memory and speed improvements for LoRA/QLoRA training compared to standard Hugging Face implementations. These optimizations make fine-tuning models like Gemma 3 more accessible on single-GPU setups, provided the GPU has sufficient VRAM.

| Gemma 3 Size | Approximate VRAM Required* | Suitable Platforms |
| --- | --- | --- |
| 4B | ~15 GB | Free Google Colab (T4), Kaggle (P100 16 GB) |
| 12B | ≥24 GB | Colab Pro+ (A100/A10), RTX 4090, A40 |
| 27B | 22–24 GB (with 4-bit QLoRA, batch size = 1); ~40 GB otherwise | A100 40 GB, H100, multi-GPU setups |

Note: VRAM requirements can vary based on batch size, sequence length, and specific quantization techniques. The requirement for the 27B model is with 4-bit QLoRA and a small batch size (e.g., 1 or 2); higher batch sizes or less aggressive quantization will require substantially more VRAM (~40 GB+).

For beginners, starting with the 4B model on a free Colab notebook is recommended, as it comfortably supports loading, training, and deployment with Unsloth. Upgrading to the 12B or 27B models should only be considered when access to higher-VRAM GPUs or paid cloud tiers is available.

To change the runtime type in Google Colab and select a T4 GPU, follow these steps:

  1. Click on the Runtime menu at the top.
  2. Select Change runtime type.
  3. In the dialog that appears, under Hardware accelerator, choose GPU.
  4. Click Save to apply the changes.
[Screenshot: the Colab "Change runtime type" dialog with the T4 GPU hardware accelerator selected.]

Step 1: Setting Up the Environment

First, install the necessary libraries. If you are in a Colab or Jupyter environment, you can run these commands directly in a code cell.

%%capture
!pip install --no-deps unsloth vllm
# Colab workaround: drop cached PIL/google modules so the reinstalled versions load cleanly
import sys, re, requests; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

# vLLM requirements - vLLM breaks Colab due to reinstalling numpy
f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
with open("vllm_requirements.txt", "wb") as file:
    file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
!pip install -r vllm_requirements.txt

Here’s a brief explanation of the key packages installed:

  • unsloth: Provides the core optimizations for faster and more memory-efficient LLM training and loading using techniques like fused kernels.
  • peft: Parameter-Efficient Fine-Tuning methods (like LoRA). Allows training only a small number of additional parameters instead of the full model.
  • trl: Transformer Reinforcement Learning. Includes the SFTTrainer which simplifies the process of supervised fine-tuning.
  • bitsandbytes: Enables k-bit (4-bit and 8-bit) quantization, dramatically reducing the model’s memory footprint.
  • accelerate: Hugging Face library to seamlessly run PyTorch training across various hardware setups (single GPU, multi-GPU, etc.).
  • datasets: Hugging Face library for loading, processing, and managing datasets efficiently.
  • transformers: Hugging Face’s core library for pre-trained models, tokenizers, and utilities.
  • huggingface_hub: Utilities to interact with the Hugging Face Hub (login, download, upload).
  • vllm (Optional): A fast LLM inference engine. Can be installed separately if needed for deployment.

Step 2: Hugging Face Authentication

You’ll need to log in to the Hugging Face Hub from your environment to download the model and potentially upload the fine-tuned result later.

import os
from huggingface_hub import login
from google.colab import userdata

# Read the token from Colab's Secrets tab, falling back to an environment variable
hf_token = userdata.get('HF_TOKEN') or os.environ.get('HF_TOKEN')
if not hf_token:
    raise ValueError("Please add HF_TOKEN to Colab Secrets (or set it as an environment variable) before running.")

try:
    login(hf_token)
    print("Successfully logged in to Hugging Face Hub.")
except Exception as e:
    print(f"Error logging in to Hugging Face Hub: {e}")

In Google Colab, the most secure way to manage your Hugging Face token is by using the “Secrets” tab:

[Screenshot: the Colab Secrets tab with an HF_TOKEN entry added, its value hidden, and instructions for accessing secrets from Python.]

Step 3: Loading the Model and Tokenizer

To begin fine-tuning, we will efficiently load the Gemma 3 instruction-tuned model using Unsloth’s FastModel. For this example, we’ll use the unsloth/gemma-3-4b-it model, loaded in 4-bit precision so it fits within the memory constraints of typical Colab GPUs.

Check out Unsloth’s Gemma 3 collection on Hugging Face. It includes models in 1B, 4B, 12B, and 27B sizes, available in GGUF, 4-bit, and 16-bit formats.

from unsloth import FastModel
from unsloth.chat_templates import get_chat_template
import torch # Import torch for checking CUDA

# Ensure CUDA is available
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. A GPU is required for this tutorial.")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA device name: {torch.cuda.get_device_name(0)}")

model, tokenizer = FastModel.from_pretrained(
    model_name="unsloth/gemma-3-4b-it", # Using the 4B instruction-tuned model optimized by Unsloth
    max_seq_length=2048, # Set max context length
    load_in_4bit=True,   # Enable 4-bit quantization
    full_finetuning=False, # Use PEFT (LoRA)
    token=hf_token,      # Pass your Hugging Face token
)

# Apply the correct chat template for Gemma 3
tokenizer = get_chat_template(tokenizer, chat_template="gemma-3")

print("Model and Tokenizer loaded successfully.")

What’s happening in this code:

  • FastModel.from_pretrained(): Unsloth’s optimized model loader.
  • model_name="unsloth/gemma-3-4b-it": Specifies the model variant to load. We choose the 4B instruction-tuned (it) version, pre-optimized by Unsloth.
  • max_seq_length=2048: Sets the maximum number of tokens the model can process at once. Adjust this based on the length of your data chunks and desired context window, balancing memory usage and the ability to process longer inputs.
  • load_in_4bit=True: Essential for training on limited VRAM. This loads the model weights in 4-bit precision using bitsandbytes.
  • full_finetuning=False: Tells Unsloth to prepare the model for PEFT/LoRA fine-tuning, meaning only adapter layers will be trained, not all model parameters.
  • get_chat_template(tokenizer, chat_template="gemma-3"): Wraps the tokenizer to automatically format prompts into Gemma 3’s expected chat format (<start_of_turn>user\n...\n<end_of_turn><start_of_turn>model\n...\n<end_of_turn>). This is crucial for fine-tuning instruction-following models correctly and ensuring the model learns to generate responses in the expected conversational turns.

Step 4: Loading and Preparing the Dataset for Training

We load the dataset we previously uploaded to the Hugging Face Hub and then transform it into the chat-based format expected by the tokenizer and trainer.

from datasets import load_dataset
from unsloth.chat_templates import standardize_data_formats, train_on_responses_only  # train_on_responses_only is used later in Step 5

# 1. Load the dataset from Hugging Face Hub
dataset_name = "triposatt/trustpilot-reviews-qa-dataset" # Replace with your dataset name
dataset = load_dataset(dataset_name, split="train")

print(f"Dataset '{dataset_name}' loaded.")
print(dataset)

# 2. Normalize any odd formats (ensure 'question' and 'answer' fields exist)
dataset = standardize_data_formats(dataset)
print("Dataset standardized.")

# 3. Define a function to format examples into chat template
def formatting_prompts_func(examples):
    """Formats question-answer pairs into Gemma 3 chat template."""
    questions = examples["question"]
    answers = examples["answer"]
    texts = []
    for q, a in zip(questions, answers):
        # Structure the conversation as a list of roles and content
        conv = [
            {"role": "user", "content": q},
            {"role": "assistant", "content": a},
        ]
        # Apply the chat template
        txt = tokenizer.apply_chat_template(
            conv,
            tokenize=False, # Return string, not token IDs
            add_generation_prompt=False # Don't add the model's start tag at the end yet
        )
        # Gemma 3 tokenizer adds <bos> by default, which the trainer will re-add
        # We remove it here to avoid double <bos> tokens
        txt = txt.removeprefix(tokenizer.bos_token)
        texts.append(txt)
    return {"text": texts}

# 4. Apply the formatting function to the dataset
dataset = dataset.map(formatting_prompts_func, batched=True, remove_columns=["question", "answer"])
print("Dataset formatted with chat template.")
print(dataset) # Inspect the new 'text' column

In this code:

  • load_dataset(): Fetches our Q&A dataset from the Hugging Face Hub.
  • standardize_data_formats(): Ensures consistent field names across different datasets, specifically looking for ‘question’ and ‘answer’ in this case.
  • formatting_prompts_func(): This critical function processes batches of our Q&A pairs. It uses the tokenizer.apply_chat_template() method to convert them into strings formatted correctly for Gemma 3 instruction fine-tuning. This format includes special turn tokens like <start_of_turn>user\n and <start_of_turn>model\n, which are essential for the model to understand conversational structure. We remove the initial <bos> token as the SFTTrainer adds its own.
  • dataset.map(...): Applies the formatting_prompts_func to the entire dataset efficiently, creating a new ‘text’ column containing the formatted strings and removing the original columns.
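
As a quick sanity check, you can print one formatted example; it should look roughly like the turn structure described above (the exact tokens and whitespace depend on the tokenizer version):

print(dataset[0]["text"])
# Expected shape (roughly):
# <start_of_turn>user
# How can pricing and customer support issues impact customer satisfaction...<end_of_turn>
# <start_of_turn>model
# Pricing concerns, particularly when customers feel they are overpaying...<end_of_turn>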

Step 5: Configuring LoRA and the Trainer

Now we configure the PEFT (LoRA) settings and the SFTTrainer from the trl library. LoRA works by injecting small, trainable matrices into key layers of the pre-trained model. Only these small adapter matrices are updated during fine-tuning, drastically reducing the number of parameters to train and thus minimizing memory usage.

from trl import SFTTrainer, SFTConfig
import torch

# 1. Configure LoRA
model = FastModel.get_peft_model(
    model,
    r=8, # LoRA rank (a common value) - lower rank means fewer parameters, higher means more expressive
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj", # Attention layers
        "gate_proj", "up_proj", "down_proj"      # MLP layers
    ],
    # Set True if you want to fine-tune language layers (recommended for text tasks)
    # and Attention/MLP modules (where LoRA is applied)
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    # finetune_vision_layers=False, # Skip vision layers; relevant only for the multimodal (4B/12B/27B) variants
    lora_alpha=8, # LoRA scaling factor (often set equal to r)
    lora_dropout=0, # Dropout for LoRA layers
    bias="none", # Don't train bias terms
    use_gradient_checkpointing="unsloth", # Memory optimization
    random_state=1000, # Seed for reproducibility
    use_rslora=False, # Rank-Stabilized LoRA (optional alternative)
    # modules_to_save=["embed_tokens", "lm_head"], # Optional: train embedding/output layers
)

print("Model configured for PEFT (LoRA).")

# 2. Configure the SFTTrainer
# Determine a reasonable max_steps based on dataset size and epochs
# For demonstration, a small number of steps is used (e.g., 30)
# For a real use case, calculate steps = (dataset_size / batch_size / grad_accum) * num_epochs
dataset_size = len(dataset)
per_device_train_batch_size = 2 # Adjust based on your GPU VRAM
gradient_accumulation_steps = 4 # Accumulate gradients to simulate larger batch size (batch_size * grad_accum = 8)
num_train_epochs = 3 # Example: 3 epochs

# Calculate total training steps
total_steps = int((dataset_size / per_device_train_batch_size / gradient_accumulation_steps) * num_train_epochs)
# Enforce a minimum of 30 steps so very small datasets still get a meaningful number of updates
max_steps = max(30, total_steps)

print(f"Calculated total training steps for {num_train_epochs} epochs: {total_steps}. Using max_steps={max_steps}")

sft_config = SFTConfig(
    dataset_text_field="text", # The column in our dataset containing the formatted chat text
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    warmup_steps=max(5, int(max_steps * 0.03)), # Warmup for first few steps (e.g., 3% of total steps)
    max_steps=max_steps, # Total number of training steps
    learning_rate=2e-4, # Learning rate
    logging_steps=max(1, int(max_steps * 0.01)), # Log every 1% of total steps (min 1)
    optim="adamw_8bit", # 8-bit AdamW optimizer (memory efficient)
    weight_decay=0.01, # L2 regularization
    lr_scheduler_type="linear", # Linear learning rate decay
    seed=3407, # Random seed
    report_to="none", # Disable reporting to platforms like W&B unless needed
    output_dir="./results", # Directory to save checkpoints and logs
)

# 3. Build the SFTTrainer instance
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None, # Optional: provide a validation dataset
    args=sft_config,
)

print("SFTTrainer built.")

# 4. Mask out the input portion for training
# This teaches the model to only generate the assistant’s response
# It prevents the model from just copying the user’s prompt
# Pass the literal prefixes for instruction and response turns from the chat template
trainer = train_on_responses_only(
    trainer,
    instruction_part="<start_of_turn>user\n", # Literal string before user content
    response_part="<start_of_turn>model\n",  # Literal string before model content
)

print("Trainer configured to train only on responses.")

In this code:

  • FastModel.get_peft_model(): Configures the loaded model for LoRA fine-tuning with the specified parameters. r is the LoRA rank, controlling the size of the adapter matrices. target_modules specifies which model layers (like attention and MLP projections) will receive these adapters. lora_alpha is a scaling factor. use_gradient_checkpointing is a memory-saving technique provided by Unsloth.
  • SFTConfig(): Defines the training hyperparameters for the SFTTrainer. per_device_train_batch_size and gradient_accumulation_steps work together to determine the effective batch size used for calculating gradients. max_steps sets the total training iterations. learning_rate, optim, weight_decay, and lr_scheduler_type control the optimization process. dataset_text_field tells the trainer which column in the dataset contains the formatted training examples.
  • SFTTrainer(): Instantiates the trainer, bringing together the LoRA-configured model, the prepared dataset, the tokenizer, and the training arguments defined in SFTConfig.
  • train_on_responses_only(): A utility from unsloth.chat_templates that modifies the trainer’s loss calculation so the loss is computed only on the tokens of the model’s expected response (<start_of_turn>model\n...), ignoring the tokens of the user’s prompt (<start_of_turn>user\n...). This is essential for teaching the model to generate relevant answers rather than simply repeating or completing the input prompt. We provide the exact string prefixes used in the chat template to delineate these sections.

Step 6: Training the Model

With everything set up, we can initiate the fine-tuning process. The trainer.train() method handles the training loop based on the configurations provided in the SFTConfig.

# Optional: clear CUDA cache before training
torch.cuda.empty_cache()

print("Starting training...")
# Unsloth automatically handles float16/bf16 mixed precision based on GPU capabilities and the model,
# so the training loop can be started directly
trainer.train()

print("Training finished.")

The trainer will output progress updates, including the training loss. You should see the loss decrease over the steps, indicating that the model is learning from the data. Total training time depends on the dataset size, model size, hyperparameters, and the GPU used. For our example dataset and the 4B model on a T4 GPU, a run of around 200 steps typically completes in roughly 15–30 minutes, depending on the exact setup and data length.

Step 7: Testing the Fine-Tuned Model (Inference)

After training, let’s test our fine-tuned model to see how well it responds to questions based on the Trustpilot review data it was trained on. We’ll use the model.generate method with a TextStreamer for a more interactive output.

from transformers import TextStreamer

# Define some test questions related to the dataset content
questions = [
    "What are common issues or complaints mentioned in the reviews?",
    "What do customers like most about the product/service?",
    "How is the customer support perceived?",
    "Are there any recurring themes regarding pricing or value?"
    # Add more questions here based on your dataset content
]

# Set up a streamer for real-time output
# skip_prompt=True prevents printing the input prompt again
# skip_special_tokens=True removes chat template tokens from output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

print("\n--- Testing Fine-Tuned Model ---")

# Iterate through questions and generate answers
for idx, q in enumerate(questions, start=1):
    # Build the conversation prompt in the correct Gemma 3 chat format
    conv = [{"role": "user", "content": q}]

    # Apply the chat template and add the generation prompt token
    # add_generation_prompt=True includes the <start_of_turn>model tag
    prompt = tokenizer.apply_chat_template(
        conv,
        add_generation_prompt=True,
        tokenize=False
    )

    # Tokenize the prompt and move to GPU
    inputs = tokenizer([prompt], return_tensors="pt", padding=True).to("cuda")

    # Display the question
    print(f"\n=== Question {idx}: {q}\n")

    # Generate the response with streaming
    # Pass the tokenized inputs directly to model.generate
    _ = model.generate(
        **inputs,
        streamer=streamer, # Use the streamer for token-by-token output
        max_new_tokens=256, # Limit the response length
        temperature=0.7, # Control randomness (lower=more deterministic)
        top_p=0.95, # Nucleus sampling
        top_k=64, # Top-k sampling
        use_cache=True, # Use cache for faster generation
        # Add stopping criteria if needed, e.g., stopping after <end_of_turn>
        # eos_token_id=tokenizer.eos_token_id,
    )
    # Add a separator after each answer
    print("\n" + "="*40)

print("\n--- Testing Complete ---")

See the model’s responses in the image below:

[Screenshot: the model's streamed answers, noting common complaints about communication delays and product quality, praise for HubSpot's user-friendly design and CRM capabilities, a perception of customer support as friendly and efficient, and recurring themes around pricing and value for money.]

🔥 Great, it’s working fine!

A successful fine-tuning process means the model generates answers that are more analytical and directly derived from the review content it was fine-tuned on, reflecting the style and insights present in your custom dataset, rather than generic responses.

Step 8: Saving and Pushing Your Fine-Tuned Model

Finally, save your fine-tuned LoRA adapters and tokenizer. You can save them locally and also push them to the Hugging Face Hub for easy sharing, versioning, and deployment.

# Define local path and Hub repository ID
new_model_local = "gemma-3-4b-trustpilot-qa-adapter" # Local directory name
new_model_online = "YOUR_HF_USERNAME/gemma-3-4b-trustpilot-qa" # Hub repo name

# 1. Save locally
print(f"Saving model adapter and tokenizer locally to '{new_model_local}'...")
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)
print("Saved locally.")

# 2. Push to Hugging Face Hub
print(f"Pushing model adapter and tokenizer to Hugging Face Hub '{new_model_online}'...")
model.push_to_hub(new_model_online, token=hf_token)
tokenizer.push_to_hub(new_model_online, token=hf_token)

The fine-tuned model is now available on Hugging Face Hub:

[Screenshot: the Hugging Face model card for triposatt/gemma-3-4b-trustpilot-qa, listing the developer, license, and a note that the model was fine-tuned with Unsloth and Hugging Face's TRL library.]
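
To use the model later, you (or anyone you share the repository with) can reload the adapter for inference. A sketch under the assumption that Unsloth can resolve a pushed LoRA adapter repository directly (reading the adapter config and pulling the matching base model); adjust if your setup differs:

from unsloth import FastModel

# Load the fine-tuned adapter straight from the Hub (repo name from the previous step)
model, tokenizer = FastModel.from_pretrained(
    model_name="YOUR_HF_USERNAME/gemma-3-4b-trustpilot-qa",
    max_seq_length=2048,
    load_in_4bit=True,
    token=hf_token,
)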

Conclusion

This guide demonstrated an end-to-end approach to fine-tuning Google’s Gemma 3 for a practical use case: generating analytical answers from customer reviews. We covered the entire workflow: collecting high-quality, domain-specific data with Bright Data’s Trustpilot Scraper API, structuring it into a QA format using LLM-powered processing, and efficiently fine-tuning the Gemma 3 4B model with the Unsloth library on resource-constrained hardware.

The result is a specialized LLM that is adept at extracting insights and interpreting sentiment from raw review data, transforming it into structured, actionable answers. This method is highly adaptable—you can apply this same workflow to fine-tune Gemma 3 (or other suitable LLMs) on various domain-specific datasets to create AI assistants tailored to different needs.

For further exploration into AI-driven data extraction strategies, see the additional guides linked throughout this article. For more fine-tuning optimizations and examples using Unsloth, check out the Unsloth Notebooks Collection.
