Supervised Fine-Tuning for LLMs: Step-by-Step Python Guide

In this guide on supervised fine-tuning LLMs, you will learn:

What is supervised fine-tuning in the context of LLMs
What is the goal of this practice
The resources you need to implement it
The workflow for supervised fine-tuning LLMs
Technical challenges and considerations in SFT implementation
A step-by-step tutorial to replicate supervised fine-tuning on an LLM

Let’s dive in!

Defining Supervised Fine-Tuning (SFT) in the LLM Context

Supervised fine-tuning (SFT) is a form of transfer learning applied to LLMs. Transfer learning is a machine learning method where you apply the knowledge gained from solving one problem to help solve a different, but related problem.

In the context of LLMS, supervised fine-tuning involves taking a pre-trained model and further training it. You do so by adjusting its parameters, using a curated dataset with labeled examples relevant to a specific task.

The “supervised” nature is key in the process because engineers are behind the whole process:

They prepare the fine-tuning dataset with explicit input-output pairs. That dataset contains the “correct answers” that steer the model’s learning for specific prompts.
They curate the entire fine-tuning process from the setup to monitoring and validation.

In detail, compared to RAG (Retrieval-Augmented Generation), fine-tuning modifies the model itself. RAG leaves the model untouched, and it just feeds it with new relevant information. Instead, fine-tuning involves a new training process.

What Is the Goal of Supervised Fine-Tuning in LLMs?

The primary goal of supervised fine-tuning is to improve an LLM’s performance on specific tasks beyond its general pre-trained capabilities. This includes:

Task specialization: Make the model proficient in tasks like text summarization, code generation for a particular language, sentiment analysis, and more.
Domain adaptation: Infuse the model with knowledge and terminology specific to niche fields. Common examples are legal document analysis, medical report generation, and financial forecasting.
Tone alignment: Train the model to adopt a consistent brand voice, level of formality, or specific conversational persona required for applications like chatbots.

What You Need to Implement Fine-Tuning

At this point, you might be wondering: why not train an LLM to do so from the beginning? The reason is simple. Training an LLM involves:

A vast amount of data, often stored across several data centers.
A lot of hardware.
Investing a lot of money and time.

To give an idea, to train models like ChatGPT, Gemini, and similar, you need:

Several months—if not even years.
A lot of GPUs—each costing thousands of dollars—distributed over several data centers.

For fine-tuning, instead, you only need three things:

A pre-trained model.
A computer.
A small, labeled dataset.

It should now be clearer why fine-tuning is so convenient.

The Mechanics Behind The Supervised Fine-Tuning Workflow

Executing a supervised fine-tuning process involves several technical steps. Time to break the entire process down!

Step #1: Curate The High-Quality SFT Dataset

The efficacy of the process is highly dependent on the quality and relevance of the fine-tuning dataset. This involves:

Data sourcing: Obtain raw data relevant to the target task using one of the many data sourcing techniques. The size can range from hundreds to tens of thousands of rows. This depends on the task complexity and the pre-trained model.
Data structuring: Transform the raw data into a structured format suitable for the SFT process. One of the typical formats is JSON Lines (JSONL), where each line is a JSON object containing distinct fields for the input and the desired output.
Quality assurance: Ensure that the data is accurate, consistent, diverse, and free from biases that could negatively impact the model’s behavior.

Step #2: Select an Appropriate Pre-trained Model

There are several pre-trained models available. Key considerations to choose the best for your case include:

Model size: It is denoted by the number of parameters (e.g., 7B, 13B, 70B) and correlates with the model’s capacity to learn complex patterns and nuances. Larger models generally offer higher performance, but demand more hardware resources for fine-tuning.
Architecture: The specific architecture can be suited to certain tasks. For example:
- Decoder-only transformers (GPT series, Llama, Mistral, PaLM): They excel at generative tasks where the output is a continuation of the input. That includes text generation, summarization, instruction following, and dialogue systems. Their architecture is inherently suited for predicting the next token in a sequence.
- Encoder-decoder transformers (T5, BART, Flan-T5): They have distinct encoder and decoder components. These models are strong performers on tasks that involve transforming an input sequence into an output sequence, such as translation, text summarization, and question answering.
Base capabilities: Evaluate the pre-trained model’s performance on academic benchmarks relevant to your target task before committing to fine-tuning. This provides a baseline and indicates the model’s aptitude. For example:
- Models like GPT-4, Claude 3 Opus, or Gemini Advanced generally lead to complex reasoning.
- Open models like Llama 3 series have shown strong performance across a wide range of benchmarks. They compete closely with proprietary models on tasks like reasoning, coding, and general knowledge.
- Models specifically pre-trained for coding, such as CodeLlama or WizardCoder, exhibit superior performance on code generation tasks.
Licensing: The license under which a model is released dictates how it can be used, modified, and distributed. Some models have open-source permissions, while others do not.
Existing Fine-Tunes: Sometimes, starting from a model already fine-tuned for a related task can be more efficient than starting from a raw base model. This is a form of intermediate fine-tuning.

Step #3: Implement The Training Loop

The core of the process involves iterating through the labeled dataset and adjusting the model’s weights like so:

Forward pass: The model processes an input (the prompt) from the SFT dataset and generates an output.
Loss calculation: A loss function compares the model’s generated output, token-by-token, against the target output provided in the dataset. This is useful to quantify the error.
Backward pass (backpropagation): The calculated loss is used to compute gradients, indicating how much each model weight contributed to the error.
Weight update: An optimizer algorithm uses the gradients and a specified learning rate to adjust the model’s weights. The goal is to minimize the errors on subsequent iterations.
Hyperparameters tuning: Parameters controlling the training process are tuned to improve the performance of the model.

Step #4: Evaluate the Results

As the final step, you have to assess the fine-tuned model’s performance. This involves:

Validation set: A portion of the labeled dataset must be used to monitor progress and prevent overfitting.
Test set: You need a dataset with data that is new and unseen by the model, which can be a subset of the fine-tuning dataset. This step is executed for final evaluation after fine-tuning is complete.
Metrics: Define quantitative metrics relevant to the task to evaluate the performance.
Qualitative analysis: Review of model outputs to assess coherence, relevance, style adherence, and safety.

Technical Challenges and Considerations in SFT Implementation

Implementing SFT presents several technical challenges, like:

Dataset quality and scale: The most significant challenge lies in creating or sourcing a sufficiently large, high-quality, and representative labeled dataset. Poor data quality directly translates to poor model performance. Data extraction, cleaning, labeling, aggregation, and formatting require substantial effort and domain expertise.
Mitigating catastrophic forgetting: Intensively fine-tuning on a specific task can cause the model to “forget” some of its general capabilities learned during pre-training. Strategies like using lower learning rates, fine-tuning for fewer epochs, or incorporating diverse data can help mitigate this phenomenon.
Hyperparameter optimization strategies: Finding the optimal set of hyperparameters is an empirical process. That requires multiple experiments and careful monitoring of validation metrics. Automated hyperparameter tuning frameworks can assist, but add complexity.

Supervised Fine-Tuning LLMs: Step-by-Step Tutorial

Time to put the theory into practice. This tutorial section will guide you through fine-tuning a lightweight LLM so that you can use your computer without extra hardware. The goal is to fine-tune an LLM to generate e-commerce product descriptions, given a list of characteristics.

You will use DistilGPT2 as an LLM, which is a distilled version of GPT-2 with smaller size and efficiency.

Let’s fine-tune the chosen model!

Prerequisites

To replicate this tutorial on supervised fine-tuning LLMs, you must have at least Python 3.10+ installed on your machine.

You also need a CSV dataset for the fine-tuning. Here, we will use a custom dataset that contains data for e-commerce products. The dataset contains the following fields for each product:

Category: Electronics, books, kitchen, and similar.
Name: The name of the product.
Features: The main features of the product.
Color: The color of the product.
Description: Text that describes what the product is or does.

Below is an image that shows a sample of the data used:

A sample of the dataset used for fine-tuning

Step #1: Get Started

Suppose you call the main folder of your project fine_tuning/. At the end of this step, the folder will have the following structure:

fine_tuning/
    ├── data/
    |   └── data.csv
    ├── results_distilgpt2/
    ├── tuning.py
    └── venv/

Where:

data.csv contains the labeled data for fine-tuning the LLM presented earlier.
results_distilgpts/ is the folder that will contain the results. It will be automatically created during the process.
tuning.py is the Python file that contains all the logic.
venv/ contains the virtual environment.

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS/Linux, execute:

source venv/bin/activate

In the activated virtual environment, install the needed libraries for this tutorial:

pip install transformers datasets torch trl

The libraries used in this project are:

transformers: Hugging Face’s library for state-of-the-art machine learning models.
datasets: Hugging Face’s library for accessing and processing datasets.
torch: PyTorch, an open-source machine learning framework.
trl: Hugging Face’s Transformer Reinforcement Learning library, which includes tools for SFT like SFTTrainer.

Perfect! Your Python environment for LLM fine-tuning is correctly set up.

Step #2: Initial Setup, Data Loading, and Text Formatting

As a first step, in tuning.py, you have to set the whole process up:

import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, pipeline
from trl import SFTTrainer

# Configure parameters
base_model_name = "distilgpt2"  # Pre-trained model to be fine-tuned
data_file_path = "data/data.csv" # Path to your CSV data file
output_model_dir = "./results_distilgpt2"  # Directory to save training results and checkpoints
final_model_path = os.path.join(output_model_dir, "final_model") # Path to save the final fine-tuned model
max_seq_length_for_tokenization = 512 # Maximum sequence length for tokenizing inputs

# Load dataset
raw_dataset = load_dataset("csv", data_files=data_file_path)

# Define a function to format entries into a prompt-completion pair
def format_dataset_entry(data_item):
    prompt_template = "Generate a product description for the following item:\nFeatures: {features_data}\n\nDescription:"
    data_item["text"] = prompt_template.format(features_data=data_item["features"]) + " " + data_item["description"]
    return data_item

# Apply the formatting function to the train split of the dataset
text_formatted_dataset = raw_dataset["train"].map(format_dataset_entry)

This snippet:

Defines the name of the LLM to use with base_model_name.
Defines the path where the CSV file is and opens it with the method load_dataset().
Creates a folder that will store results (results_distilgpt2/).
Creates the format_dataset_entry() function that transforms each row from raw_dataset into text format for fine-tuning. It also appends the modified content into the “description” column in the CSV, overriding the current one. This will provide the model with cleaned textual descriptions.
Applies the format_dataset_entry() function to every item in the training split dataset with the map() method.

Well done! You have concluded the initial setup of the process.

Step #3: Tokenize the Dataset

Language models do not understand raw text. They operate on numerical representations called tokens. This step involves loading the pre-trained tokenizer and using it to convert the formatted text entries into sequences of token IDs:

# Load the tokenizer associated with the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token # Padding token

# Define a function to tokenize the "text" field of the dataset
def tokenize_function(data_items_batch):
    return tokenizer(
        data_items_batch["text"],
        truncation=True,
        padding="max_length",
        max_length=max_seq_length_for_tokenization,
    )

# Apply the tokenization function to the formatted dataset
tokenized_dataset = text_formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=text_formatted_dataset.column_names
)

The above code does the following:

Loads the tokenizer associated with the base_model_name with the method AutoTokenizer.from_pretrained().
Defines a padding token to make all input sequences the same length when feeding them into the model.
Tokenizes the dataset with the custom function tokenize_function() and applies the tokenization to the dataset.

Wonderful! The dataset is tokenized.

Step #4: Configure and Run The Fine-Tuning Process

The dataset is prepared and tokenized, so now you can move to the core of the fine-tuning task:

# Load the pre-trained language model
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Define training arguments for the SFTTrainer
training_args = TrainingArguments(
    output_dir=output_model_dir,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_steps=50,
    report_to="none",
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Begin the training process
trainer.train()

# Save the fine-tuned model and tokenizer to the specified path
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

This code:

Loads the base model with the method AutoModelForCausalLM.from_pretrained().
Defines the settings for the process, such as the number of epochs, batch size, and learning rate, using the method TrainingArguments().
Initializes and runs SFTTrainer(), providing the model with the tokenized dataset and training arguments. This manages the actual fine-tuning loop.
Trains the model and saves results in the dedicated folder.

Fantastic! You have started the fine-tuning process.

Step #5: Evaluate and Test the Fine-Tuned Model

You now have to evaluate the performance and see how well the model generates descriptions for new, unseen product features:

# Load the fine-tuned model and tokenizer for testing
fine_tuned_model_for_testing = AutoModelForCausalLM.from_pretrained(final_model_path)
fine_tuned_tokenizer_for_testing = AutoTokenizer.from_pretrained(final_model_path)

# Create a text-generation pipeline with the fine-tuned model
generator_fine_tuned = pipeline("text-generation", model=fine_tuned_model_for_testing, tokenizer=fine_tuned_tokenizer_for_testing)

# Define example product features for testing
test_product_features = [
    "Category: Kitchen, Name: Electric Kettle, Features: 1.7L capacity, Stainless steel, Auto shut-off, Boil-dry protection, Color: Silver",
    "Category: Office, Name: Ergonomic Office Chair, Features: Lumbar support, Adjustable height, Mesh back, Swivel, Color: Black"
]

# Define the prompt template
prompt_template_inference = "Generate a product description for the following item:\nFeatures: {features}\n\nDescription:"

# Generate descriptions for each test item
for features_item in test_product_features:
    full_prompt = prompt_template_inference.format(features=features_item)
    print(f"\nPROMPT:\n{full_prompt}")

    generated_outputs = generator_fine_tuned(
        full_prompt,
        max_new_tokens=70,
        num_return_sequences=1,
        pad_token_id=fine_tuned_tokenizer_for_testing.eos_token_id,
        eos_token_id=fine_tuned_tokenizer_for_testing.eos_token_id
    )
    print(f"GENERATED (Fine-tuned):\n{generated_outputs[0]['generated_text']}")

This snippet does the following:

Loads the fine-tuned model and tokenizer.
Creates a text-generation pipeline with the fine-tuned model using the method pipeline().
Defines a list that provides new, unseen product descriptions to feed and evaluate the model’s fine-tuning with test_product_features.
Generates the descriptions for each test item in the for loop.
Prints the generated descriptions by the fine-tuned model.

Cool! You have set up a pipeline for testing and evaluating the model’s performance.

Step #6: Put It All Together

Your tuning.py file should now contain:

import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, pipeline
from trl import SFTTrainer

# Configure parameters
base_model_name = "distilgpt2"  # Pre-trained model
data_file_path = "data/data.csv" # Path to data
output_model_dir = "./results_distilgpt2"  # Directory to save results and checkpoints
final_model_path = os.path.join(output_model_dir, "final_model") # Path to save the final fine-tuned model
max_seq_length_for_tokenization = 512 # Maximum length for tokenizing inputs

# Load dataset
raw_dataset = load_dataset("csv", data_files=data_file_path)

# Define a function to format entries into a prompt-completion pair
def format_dataset_entry(data_item):
    prompt_template = "Generate a product description for the following item:\nFeatures: {features_data}\n\nDescription:"
    data_item["text"] = prompt_template.format(features_data=data_item["features"]) + " " + data_item["description"]
    return data_item

# Apply the formatting function to the train split of the dataset
text_formatted_dataset = raw_dataset["train"].map(format_dataset_entry)

# Load the tokenizer associated with the base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token

# Define a function to tokenize the "text" field of the dataset
def tokenize_function(data_items_batch):
    return tokenizer(
        data_items_batch["text"],
        truncation=True,
        padding="max_length",
        max_length=max_seq_length_for_tokenization,
    )

# Apply the tokenization function to the formatted dataset
tokenized_dataset = text_formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=text_formatted_dataset.column_names
)

# Load the pre-trained language model
model = AutoModelForCausalLM.from_pretrained(base_model_name)

# Define training arguments for the SFTTrainer
training_args = TrainingArguments(
    output_dir=output_model_dir,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_steps=50,
    report_to="none",
)

# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Begin the training process
trainer.train()

# Save the fine-tuned model and tokenizer to the specified path
trainer.save_model(final_model_path)
tokenizer.save_pretrained(final_model_path)

# Load the fine-tuned model and tokenizer for testing
fine_tuned_model_for_testing = AutoModelForCausalLM.from_pretrained(final_model_path)
fine_tuned_tokenizer_for_testing = AutoTokenizer.from_pretrained(final_model_path)

# Create a text-generation pipeline with the fine-tuned model
generator_fine_tuned = pipeline("text-generation", model=fine_tuned_model_for_testing, tokenizer=fine_tuned_tokenizer_for_testing)

# Define example product features for testing
test_product_features = [
    "Category: Kitchen, Name: Electric Kettle, Features: 1.7L capacity, Stainless steel, Auto shut-off, Boil-dry protection, Color: Silver",
    "Category: Office, Name: Ergonomic Office Chair, Features: Lumbar support, Adjustable height, Mesh back, Swivel, Color: Black"
]

# Define the prompt template
prompt_template_inference = "Generate a product description for the following item:\nFeatures: {features}\n\nDescription:"

# Generate descriptions for each test item
for features_item in test_product_features:
    full_prompt = prompt_template_inference.format(features=features_item)
    print(f"\nPROMPT:\n{full_prompt}")

    generated_outputs = generator_fine_tuned(
        full_prompt,
        max_new_tokens=70,
        num_return_sequences=1,
        pad_token_id=fine_tuned_tokenizer_for_testing.eos_token_id,
        eos_token_id=fine_tuned_tokenizer_for_testing.eos_token_id
    )
    print(f"GENERATED (Fine-tuned):\n{generated_outputs[0]['generated_text']}")

Run your code with:

python tuning.py

The expected result will be like this:

As shown, the prompt asks the model to create the description for an electric kettle, providing the needed information. The model creates the description, as expected. If you break this result down, you can see:

Generated product name: It generates the product name “ProChef Luxury Living Hood Series X”, which sounds plausible for a kitchen item, and the description.
Contradictory detail: It concludes with “Available in a stunning Pure White finish,” which contradicts the “Color: Silver” provided in the prompt’s features. Thus, while fine-tuning guides the model, it does not guarantee perfect consistency. This happens especially with smaller models and limited training.

Imperfections and contradictions are typical of generative models, especially for smaller ones like distilgpt2. They can also depend on the size and quality of the fine-tuning dataset, and the number of training epochs. In this case, the dataset used has only 300 rows. A bigger dataset would have led to a better description of the kettle.

To show the loss in quality, below is the expected result with a CSV that only has 5 rows:

As you can see, the result is a total hallucination. To break it down:

Generated product name: ****It names the product “Office chair” instead of “Ergonomic Office chair” in the first phrase.
Contradictory detail: ****In the second phrase, the name becomes “Lumbar”. So, the model confuses the name with the feature (“Lumbar support”).
Inconsistent grammar: Both phrases have inconsistent and wrong grammar.
Missing features: The color and the features (“Lumbar support,” “Adjustable height,” “Swivel”) are not mentioned in the description.

However, if you were to provide the same prompt to the distilgpt2 model without the fine-tuning process, the output would be significantly worse. That occurs because this model was not trained on such specific data. For example, it could not be able to provide the description for the kettle.

Note that, when the process is completed, the code will automatically generate a folder called results_distilgpt2/. This will report the results. Inside it, you will find sub-folders that report the results saved by the model at different epochs:

File folder with results saved by the model

This is useful if you want to get any of these as checkpoints and use them as you wish.

Et voilà! LLM fine-tuning completed.

Conclusion

In this article, you learned what supervised fine-tuning is in the context of LLMs. You have also gone through a step-by-step process to replicate the fine-tuning process.

The core SFT relies on having high-quality datasets for fine-tuning your models. Luckily, Bright Data has you covered with numerous services for dataset acquisition or creation:

Scraping Browser: A Playwright, Selenium-, Puppeter-compatible browser with built-in unlocking capabilities.
Web Scraper APIs: Pre-configured APIs for extracting structured data from 100+ major domains.
Web Unlocker: An all-in-one API that handles site unlocking on sites with anti-bot protections.
SERP API: A specialized API that unlocks search engine results and extracts complete SERP data.
Foundation models: Access compliant, web-scale datasets to power pre-training, evaluation, and fine-tuning.
Data providers: Connect with trusted providers to source high-quality, AI-ready datasets at scale.
Data packages: Get curated, ready-to-use datasets—structured, enriched, and annotated.

Create a Bright Data account for free to test our data extraction services and explore our dataset marketplace!

Start free trial

Start free with Google

No credit card required

What Is Supervised Fine-Tuning in LLMs?