
Fine-Tuning Llama 4 with Fresh Web Data for Better Results

Discover how to fine-tune Llama 4 with web-scraped data, from dataset creation to model training and testing.

In this guide on fine-tuning Llama 4 with web data, you will learn:

  • What fine-tuning is
  • How to retrieve a fine-tuning-ready dataset using a web scraping API
  • How to set up the cloud infrastructure for the fine-tuning process
  • How to fine-tune Llama 4 with a step-by-step tutorial

Let’s dive in!

What Is Fine-tuning?

Fine-tuning, also known as supervised fine-tuning (SFT), is the process of teaching a pre-trained LLM specific knowledge or abilities. In the context of LLMs, pre-training refers to training a model from scratch on a large general-purpose corpus.

SFT works because a model learns to mimic its training data. However, today’s LLMs are mostly generalist models, which means that if you want a model to master specific knowledge, you have to fine-tune it on data from that domain.

If you want to learn more about SFT, read our guide on supervised fine-tuning in LLMs.

Scraping the Data to Fine-Tune Llama 4

To fine-tune an LLM, you first need a fine-tuning dataset. This section walks you through how to retrieve data from a website using Bright Data’s Web Scraper APIs—dedicated endpoints for 100+ domains that scrape fresh data for you and retrieve it in the desired format.

The target webpage is going to be the Amazon best-sellers office products page:

The Amazon best-seller products in the category “office products”

Follow the steps below to retrieve the fine-tuning data!

Requirements

To use the code to retrieve the data from Amazon, you need:

  • Python 3.10+ installed on your machine.
  • A valid Bright Data Scraper API key.

Follow the Bright Data documentation to retrieve your API key.
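If you prefer not to hardcode the key in your script, you can export it as an environment variable and read it at runtime. Below is a minimal sketch; the variable name BRIGHT_DATA_API_KEY is just a convention used in this tutorial:

import os

# Read the Bright Data API key from an environment variable
# (set it beforehand with: export BRIGHT_DATA_API_KEY="<your api key>")
BRIGHT_DATA_API_KEY = os.environ.get("BRIGHT_DATA_API_KEY")

if not BRIGHT_DATA_API_KEY:
    raise RuntimeError("BRIGHT_DATA_API_KEY environment variable is not set")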

Project Structure and Dependencies

Suppose you call the main folder of your project amazon_scraper/. At the end of this step, the folder will have the following structure:

amazon_scraper/
    ├── scraper.py
    └── venv/

Where:

  • scraper.py is the Python file that contains the coding logic.
  • venv/ contains the virtual environment.

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS and Linux, execute:

source venv/bin/activate

In the activated virtual environment, install the dependencies with:

pip install requests

Where requests is a library for making HTTP web requests.

Great! You are now ready to get the data of interest using the Scraper APIs by Bright Data.

Step #1: Define the Scraping Logic

The following snippet defines the whole scraping logic:

import requests
import json
import time


def trigger_amazon_products_scraping(api_key, urls):
    # Endpoint to trigger the Web Scraper API task
    url = "https://api.brightdata.com/datasets/v3/trigger"

    params = {
        "dataset_id": "gd_l7q7dkf244hwjntr0",
        "include_errors": "true",
        "type": "discover_new",
        "discover_by": "best_sellers_url",
    }

    # Convert the input data in the desired format to call the API
    data = [{"category_url": url} for url in urls]

    headers = {
      "Authorization": f"Bearer {api_key}",
      "Content-Type": "application/json",
    }

    response = requests.post(url, headers=headers, params=params, json=data)

    if response.status_code == 200:
        snapshot_id = response.json()["snapshot_id"]
        print(f"Request successful! Response: {snapshot_id}")
        return snapshot_id
    else:
        print(f"Request failed! Error: {response.status_code}")
        print(response.text)

def poll_and_retrieve_snapshot(api_key, snapshot_id, output_file, polling_timeout=20):
    snapshot_url = f"https://api.brightdata.com/datasets/v3/snapshot/{snapshot_id}?format=json"
    headers = {
        "Authorization": f"Bearer {api_key}"
    }

    print(f"Polling snapshot for ID: {snapshot_id}...")

    while True:
        response = requests.get(snapshot_url, headers=headers)

        if response.status_code == 200:
            print("Snapshot is ready. Downloading...")
            snapshot_data = response.json()

            # Write the snapshot to an output json file
            with open(output_file, "w", encoding="utf-8") as file:
                json.dump(snapshot_data, file, indent=4)

            print(f"Snapshot saved to {output_file}")
            return
        elif response.status_code == 202:
            print(F"Snapshot is not ready yet. Retrying in {polling_timeout} seconds...")
            time.sleep(polling_timeout)
        else:
            print(f"Request failed! Error: {response.status_code}")
            print(response.text)
            break

if __name__ == "__main__":
    BRIGHT_DATA_API_KEY = "<your api key>" # Replace with your Bright Data Web Scraper API key, or read it from an environment variable
    # URLs of best-selling products to retrieve data from
    urls = [
        "https://www.amazon.com/gp/bestsellers/office-products/ref=pd_zg_ts_office-products"
    ]
    snapshot_id = trigger_amazon_products_scraping(BRIGHT_DATA_API_KEY, urls)
    poll_and_retrieve_snapshot(BRIGHT_DATA_API_KEY, snapshot_id, "amazon-data.json")

This code:

  • Creates the trigger_amazon_products_scraping() function that initiates the web scraping task by:
    • Defining the Scraper API endpoint to trigger.
    • Setting up the parameters for the scraping activity.
    • Formatting the input urls into a JSON structure that the API expects.
    • Sending a POST request to the Bright Data Scraper API with the specified endpoint, headers, parameters, and data.
    • Managing the response status.
  • Creates a poll_and_retrieve_snapshot() function that checks the status of the scraping task (identified by snapshot_id) and retrieves the data once it is ready.

Note that the scraping API was called using only one URL. Thus, the above code retrieves the data only from one target Amazon page. This is sufficient for the scope of this tutorial, but you can add as many Amazon URLs as you prefer in the list.

Keep in mind that the more URLs you add, the larger the dataset will be. A bigger dataset, if well curated, generally means better fine-tuning, but it also increases the computational time required.

Perfect! Your scraping logic is well-defined, and you are now ready to run the script.

Step #2: Run the Script

To scrape the target webpage, run the script with:

python scraper.py

You will get a result as follows:

Request successful! Response: s_m9in0ojm4tu1v8h78
Polling snapshot for ID: s_m9in0ojm4tu1v8h78...
Snapshot is not ready yet. Retrying in 20 seconds...
# ...
Snapshot is not ready yet. Retrying in 20 seconds...
Snapshot is ready. Downloading...
Snapshot saved to amazon-data.json

At the end of the process, the project folder will contain:

amazon_scraper/
    ├── scraper.py
    ├── amazon-data.json # <-- Note the fine-tuning dataset
    └── venv/

The process automatically created the amazon-data.json file containing the scraped data. Below is the expected structure of the JSON file:

[
    {
        "title": "Amazon Basics Multipurpose Copy Printer Paper, 8.5 x 11 inches, 20 lb, 1 Ream, 500 Sheets, 92 Bright, White",
        "seller_name": "Amazon.com",
        "brand": "Amazon Basics",
        "description": "Product Description Amazon Basics Multipurpose Copy Printer Paper, 8.5 x 11 Inch 20Lb Paper - 1 Ream (500 Sheets), 92 GE Bright White From the Manufacturer AmazonBasics",
        "initial_price": 6.65,
        "currency": "USD",
        "availability": "In Stock",
        "reviews_count": 190989,
        "categories": [
            "Office Products",
            "Office & School Supplies",
            "Paper",
            "Copy & Printing Paper",
            "Copy & Multipurpose Paper"
        ],
        ...
        // other fields omitted for brevity...
    }
]

Very well! You have successfully scraped data from Amazon and saved it into a JSON file. This JSON file is the fine-tuning dataset you will use later in the fine-tuning process.
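Before moving on, you can optionally load amazon-data.json locally to check how many products were scraped and to drop entries without a description, since those are not useful as training targets. A minimal curation sketch, assuming the fields shown above:

import json

# Load the scraped dataset
with open("amazon-data.json", "r", encoding="utf-8") as f:
    products = json.load(f)
print(f"Scraped products: {len(products)}")

# Keep only entries with a non-empty description to use as training targets
curated = [p for p in products if p.get("description")]
print(f"Products with a description: {len(curated)}")

# Optionally save the curated version back to the same file
with open("amazon-data.json", "w", encoding="utf-8") as f:
    json.dump(curated, f, indent=4)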

Setting Up Hugging Face to Use Llama 4

The model you will use is Llama-4-Scout-17B-16E-Instruct from Hugging Face.

If you have never used Hugging Face before, when you click on the link for the first time you will be asked to create an account:

Logging in or signing up to Hugging Face

After creating the account, if you have never used any Llama 4 model before, you need to fill in the agreement form. Click on “Expand to review and access” to read and complete the form:

Expand to read and fill in the form

After filling in the form, your request will be reviewed:

The result after filling in the form

Check the status of your request in the “Gated Repositories” section:

Checking the status in the Gated Repositories section

Once your request is accepted, you can create a new token. Go to the “Access Tokens” section and create a token with write permissions. Then, copy it and save it somewhere safe to use later:

Setting the right permissions in the access token

Hooray! You have completed all the necessary steps to use a Llama 4 model with Hugging Face.

Setting Up the Cloud Infrastructure to Fine-tune Llama 4

The Llama 4 models are very large, and their names give you a sense of how large. For example, Llama-4-Scout-17B-16E-Instruct has 17 billion active parameters and 16 experts.

The fine-tuning process requires you to train the model on the fine-tuning dataset you retrieved earlier. Since the model has 17 billion active parameters, this takes a lot of hardware; specifically, more than one GPU. For this reason, you will use a cloud service to carry out the fine-tuning process.

For this tutorial, you will use RunPod as the cloud service. Go to RunPod and create an account. Then, go to the “Billing” menu and add $25 using a credit card:

The billing section in RunPod

Note: You will be charged $25 immediately, and RunPod will add the equivalent in credits to your account. Credits are consumed hourly for as long as your pod is deployed, so deploy it only when you are ready to use it; otherwise, you will burn credits without actually using them. The hourly cost depends on the type and number of GPUs you choose in the next steps.

Navigate to the “Pods” menu to begin configuring your pod. The pod serves as a virtual server that provides you with the necessary CPUs, GPUs, memory, and storage for your tasks. Click the “Deploy” button:

The Pods section in RunPod

You can choose among different configurations:

Choosing different types of GPUs in RunPod

Select the “H200 SXM GPU” option. Give the pod a name and select the number of GPUs. 3 GPUs are fine for this tutorial:

Choosing the number of GPUs in RunPod

Select “Start a Jupyter Notebook” and click on “Deploy on Demand.” Now, go to the “Pods” section and edit your Pod:

Editing a Pod in RunPod

Change the “Container Disk” and “Volume Disk” values as below, then save:

Changing the settings of a Pod in RunPod

When the setup is complete, click on the “Connect” button:

Connect the Pod to a Jupyter Lab notebook

This allows you to connect the Pod to a Jupyter Lab notebook:

Go to the Jupyter Lab notebook

Select the Notebook with the “Python 3 (ipykernel)” card:

Select a Python 3 notebook
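Before moving on, you can optionally run a quick sanity check in the first notebook cell to confirm that the pod exposes the GPUs you selected. This minimal sketch assumes PyTorch is preinstalled in the pod image (which is typically the case for RunPod PyTorch templates):

import torch

# Confirm that the GPUs attached to the pod are visible to PyTorch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")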

Very well! You now have the right infrastructure to train the Llama 4 model.

Fine-Tuning Llama 4 With the Scraped Data

Before starting fine-tuning your model, upload the amazon-data.json file to your Jupyter Lab notebook. To do so, click on the “Upload files” button:

Uploading files in Jupyter Lab notebooks

The objective of the fine-tuning in this tutorial is to train Llama 4 on the amazon-data.json dataset. This way, you teach Llama 4 how to write product descriptions for office items given characteristics such as the product name, brand, category, and features.

You are now ready to start training the model. Follow the steps below to fine-tune Llama 4 with fresh web data!

Step #1: Install the Libraries

In the first cell of your notebook, install the needed libraries:

%%capture
!pip install transformers==4.51.0
%pip install -U datasets 
%pip install -U accelerate 
%pip install -U peft 
%pip install -U trl 
%pip install -U bitsandbytes
%pip install huggingface_hub[hf_xet]

Those libraries are:

  • transformers: Provides thousands of pre-trained models.
  • datasets: Offers access to a vast collection of datasets and efficient data processing tools.
  • accelerate: Simplifies running PyTorch training scripts across various distributed configurations with minimal code changes.
  • peft: Enables fine-tuning large pre-trained models more efficiently by only updating a small subset of parameters.
  • trl: Designed for training transformer language models using reinforcement learning techniques.
  • huggingface_hub: Provides a Python interface to interact with the Hugging Face Hub. This allows you to download and upload models, datasets, and Spaces.
  • bitsandbytes: Offers easy-to-use 8-bit optimizers and quantization functions, reducing the memory footprint for training and inference of large deep learning models.

Perfect! You have installed the needed libraries for the fine-tuning process.
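Optionally, you can confirm the installation in a new cell before proceeding. This is just a quick sanity check and is not strictly required:

import transformers, datasets, peft, trl

# Print the installed versions to confirm the environment is ready
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)
print("peft:", peft.__version__)
print("trl:", trl.__version__)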

Step #2: Connect to Hugging Face

In the second cell of your notebook, write:

from huggingface_hub import notebook_login, login

# Interactive login
notebook_login()
print("Login cell executed. If successful, you can proceed.")

When you run it, it will display the following:

Inserting your token

In the “Token” box, paste the token you have created on your Hugging Face account.
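Alternatively, if you prefer a non-interactive login, you can pass the token programmatically with the login() function already imported above. A minimal sketch, assuming you stored the token in an environment variable named HF_TOKEN (the name is just a convention):

import os

# Non-interactive login using a token stored in an environment variable
login(token=os.environ["HF_TOKEN"])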

Awesome! You are now able to retrieve the Llama 4 model from Hugging Face.

Step #3: Load the Llama 4 Model

In the third cell of your notebook, write the following code:

import os
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, Llama4ForConditionalGeneration, BitsAndBytesConfig
from trl import SFTTrainer

# Load model
base_model_name = "meta-llama/Llama-4-Scout-17B-16E-Instruct" 

# Configuration for BitsAndBytes quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the Llama4 model with specified configurations
model = Llama4ForConditionalGeneration.from_pretrained(
    base_model_name,
    device_map="auto", 
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    trust_remote_code=True,
)

# Disable caching for the model
model.config.use_cache = False
# Set pre-training tensor parallelism to 1
model.config.pretraining_tp = 1

# Path to fine-tuning JSON data file.
fine_tuning_data_file_path = "amazon-data.json"

# Path to results
output_model_dir = "results_llama_office_items_finetuned/"
final_model_adapter_path = os.path.join(output_model_dir, "final_adapter")
max_seq_length_for_tokenization = 1024

# Create the output directory (no error if it already exists)
os.makedirs(output_model_dir, exist_ok=True)

The above snippet:

  • Defines the name of the model to load with base_model_name.
  • Configures 4-bit quantization with bnb_config using the BitsAndBytesConfig() method.
  • Loads the model for training with the from_pretrained() method.
  • Defines the path to the fine-tuning dataset with fine_tuning_data_file_path.
  • Defines the output directory path for the results and creates it with the method makedirs().

When the cell finishes running, you should see a result like this:

The Llama 4 model correctly loaded
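Optionally, you can add a quick check in a new cell to see how the quantized model was split across the GPUs and roughly how much memory it occupies. This is a small optional sketch using standard transformers/accelerate utilities, not part of the original flow:

# Show how device_map="auto" distributed the model layers across the GPUs
print(model.hf_device_map)

# Approximate memory footprint of the loaded (quantized) model, in GB
print(f"Model memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")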

Fantastic! Your Llama 4 model is set up and loaded into the notebook.

Step #4: Prepare the Fine-Tuning Dataset for the Training Process

Write the following code in the fourth cell of your notebook to prepare the fine-tuning dataset for the training process:

from datasets import Dataset

# Open fine-tuning dataset
with open(fine_tuning_data_file_path, "r") as f:
  data_list = json.load(f)

# Convert the list of data items into a Hugging Face Dataset object
raw_fine_tuning_dataset = Dataset.from_list(data_list)
print(f"Converted JSON data to Hugging Face Dataset. Num examples: {len(raw_fine_tuning_dataset)}")

def format_fine_tuning_entry(data_item):
    system_message = "You are an expert copywriter. Generate a concise and appealing product description based on the provided details."
    # ADJUST THE FOLLOWING LINES to your fine-tuning file
    item_title = data_item.get("title")
    item_brand = data_item.get("brand")
    item_category = data_item.get("categories")
    item_name = data_item.get("name")
    item_features_list = data_item.get("features")
    item_features_str = ", ".join(item_features_list) if isinstance(item_features_list, list) else str(item_features_list)
    target_description = data_item.get("description")

    # Training prompt
    user_prompt = (
        f"Generate a product description for the following item:n"
        f"Title: {item_title}nBrand: {item_brand}nCategory: {item_category}n"
        f"Name: {item_name}nFeatures: {item_features_str}nDescription:"
    )
    # Llama chat format
    formatted_string = (
        f"<|start_header_id|>system<|end_header_id|>nn{system_message}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>nn{user_prompt}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>nn{target_description}<|eot_id|>"
    )
    return {"text": formatted_string}

# Apply the formatting function to each entry in the raw dataset to structure it for fine-tuning
text_formatted_dataset = raw_fine_tuning_dataset.map(format_fine_tuning_entry)

# Tokenizer Setup
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Pre-tokenize the dataset
def tokenize_function_for_sft(examples):
    # Tokenize the "text" field which contains the full chat-formatted string
    tokenized_output = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=max_seq_length_for_tokenization,
    )
    return tokenized_output

# Apply the tokenization function to the formatted dataset
tokenized_train_dataset = text_formatted_dataset.map(
    tokenize_function_for_sft,
    batched=True,
    remove_columns=["text"]
  )

This cell of the notebook:

  • Opens the fine-tuning dataset and converts it into a Hugging Face Dataset object using the method Dataset.from_list().
  • Defines a format_fine_tuning_entry() function. Its purpose is to take a single data item (a product’s details) and transform it into a structured text format suitable for instruction fine-tuning a chat model like Llama. Note that this must be tailored to the structure of your fine-tuning dataset.
  • Tokenizes the dataset by applying the tokenization function with the map() method. This is needed because language models do not operate on raw text; they work on numerical representations called tokens.

When the cell finishes running, the expected result is as follows:

Data conversion process

Note that the value of “Num examples” depends on your fine-tuning dataset.
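It can also help to print one formatted training example to verify that the chat template and your field mapping look as expected. A minimal, optional check:

# Inspect the first formatted training example (truncated for readability)
print(text_formatted_dataset[0]["text"][:500])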

Incredible! Your fine-tuning dataset is ready for the fine-tuning process.

Step #5: Configure the Environment and Parameters for Parameter-Efficient Fine-Tuning (PEFT)

In a new cell of your notebook, write the following code for setting the environment and parameters for PEFT:

from transformers import BitsAndBytesConfig
from peft import LoraConfig

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16, 
    bnb_4bit_use_double_quant=True,
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

This code:

  • Defines the QLoRA configuration for quantization with the method BitsAndBytesConfig() to specify how a pre-trained language model should be quantized when loaded. Quantization is a technique to reduce computational and memory costs.
  • Defines LoRA configuration to set up the model for parameter-efficient fine-tuning with the method LoraConfig().

Very well! The environment is ready for efficient fine-tuning.

Step #6: Initialize the Training Process

In a new cell, write the following code to initialize the training process:

from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments

# Prepare model for k-bit training
model = prepare_model_for_kbit_training(
    model,
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
# Apply the PEFT (LoRA) configuration to the model.
model = get_peft_model(model, lora_config)
# Disable caching in the model's configuration.
model.config.use_cache = False
# Print the number of trainable parameters in the model.
model.print_trainable_parameters()

# Define Training Arguments
training_args = TrainingArguments(
    output_dir=output_model_dir,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=25,
    save_steps=50,
    fp16=True,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    report_to="none",
    max_grad_norm=0.3,
    save_total_limit=2,
)

# Initialize SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    peft_config=lora_config,
)

The code in this cell:

  • Prepares the quantized model for k-bit training with prepare_model_for_kbit_training(), which also enables gradient checkpointing.
  • Applies the LoRA configuration to the model with get_peft_model(), disables caching, and prints the number of trainable parameters.
  • Defines the training hyperparameters (epochs, batch size, gradient accumulation, learning rate, optimizer, scheduler, and checkpointing options) with TrainingArguments().
  • Initializes the SFTTrainer with the model, the training arguments, the tokenized dataset, and the LoRA configuration.

Below is the expected result:

Train dataset preparation

Step #7: Train the Model

Everything is now ready to train the Llama 4 model using the train() method:

# Train the model
trainer.train()

# Save Fine-tuned Model 
trainer.save_model(final_model_adapter_path) # Saves the LoRA adapter
tokenizer.save_pretrained(final_model_adapter_path) # Save the tokenizer with the adapter

The result is as follows:

Training the model with the fine-tuning data

Note that you may obtain different numbers due to the stochastic nature of the training process.
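After training completes, you can optionally inspect the loss values logged during training to confirm that the loss decreased. A minimal sketch that reads the trainer's log history:

# Print the training loss logged every `logging_steps` steps
for log_entry in trainer.state.log_history:
    if "loss" in log_entry:
        print(f"step {log_entry.get('step')}: loss = {log_entry['loss']}")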

Step #8: Prepare the Model for Inference

To prepare the model for inference, write the following code in a new cell:

from peft import PeftModel

# Load the base model with quantization for inference
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
  base_model_name,
  quantization_config=bnb_config,
  device_map="auto",
  trust_remote_code=True
)
# Load the fine-tuned LoRA adapter and attach it to the model
fine_tuned_model_for_testing = PeftModel.from_pretrained(
    base_model_for_inference,
    final_model_adapter_path
)

# Merge LoRA adapter into the base model
fine_tuned_model_for_testing = fine_tuned_model_for_testing.merge_and_unload()

# Load the tokenizer
fine_tuned_tokenizer_for_testing = AutoTokenizer.from_pretrained(
    final_model_adapter_path,
    trust_remote_code=True
)
# Configure the tokenizer for inference
fine_tuned_tokenizer_for_testing.pad_token = fine_tuned_tokenizer_for_testing.eos_token
fine_tuned_tokenizer_for_testing.padding_side = "left"
# Set the fine-tuned model to evaluation mode
fine_tuned_model_for_testing.eval()

The code in this cell:

  • Loads the base model for inference with the from_pretrained() method.
  • Loads the fine-tuned LoRA adapter, attaches it to the base model, and merges it with merge_and_unload().
  • Loads the fine-tuned tokenizer and configures it for inference.
  • Sets the model to evaluation mode with the method eval(). This disables training-specific behaviors, ensuring consistent and deterministic outputs during inference.

Here we go! Everything is set up for inference.

Step #9: Run Inference on the Model

In this last step, you will run inference. You previously fine-tuned Llama 4 on products scraped from Amazon. Now, given data that includes the name and features of similar office items, you want to see whether the model can generate a fitting description for each of them.

The following code allows you to manage the inference process:

# Define a list of synthetic product data items for testing the fine-tuned model
synthetic_test_items = [
  {
    "title": "Executive Ergonomic Office Chair", "brand": "ComfortLuxe", "category": "Office Chairs", "name": "ErgoPro-EL100",
    "features": ["High-back design", "Adjustable lumbar support", "Breathable mesh fabric", "Synchronized tilt mechanism", "Padded armrests", "Heavy-duty nylon base"]
  },
  {
    "title": "Adjustable Standing Desk Converter", "brand": "FlexiDesk", "category": "Desks & Workstations", "name": "HeightRise-FD20",
    "features": ["Spacious dual-tier surface", "Smooth gas spring lift", "Adjustable height range 6-17 inches", "Supports up to 35 lbs", "Keyboard tray included", "Non-slip rubber feet"]
  },
  {
    "title": "Wireless Keyboard and Mouse Combo", "brand": "TechGear", "category": "Computer Peripherals", "name": "SilentType-KM850",
    "features": ["Full-size keyboard with numeric keypad", "Quiet-click keys", "Ergonomic mouse with adjustable DPI", "2.4GHz wireless connectivity", "Long battery life", "Plug-and-play USB receiver"]
  },
  {
    "title": "Desktop Organizer with Drawers", "brand": "NeatOffice", "category": "Desk Accessories", "name": "SpaceSaver-DO3",
    "features": ["Multi-compartment design", "Two pull-out drawers", "Durable wooden construction", "Compact footprint", "Ideal for pens, notes, and small supplies"]
  },
  {
    "title": "LED Desk Lamp with USB Charging Port", "brand": "BrightSpark", "category": "Office Lighting", "name": "LumiCharge-LS50",
    "features": ["Adjustable brightness levels (5)", "Color temperature modes (3)", "Flexible gooseneck design", "Built-in USB charging port", "Eye-caring, flicker-free light", "Energy-efficient LED"]
  },
]

# System message and prompt structure for inference 
system_message_inference = "You are an expert copywriter. Generate a concise and appealing product description based on the provided details."

print("n--- Generating Descriptions with Fine-Tuned Model using Synthetic Test Data ---")

# Iterate through each item in the synthetic_test_items list
for item_data in synthetic_test_items: 
    # Construct the user prompt part based on the synthetic item's structure
    user_prompt_inference = (
        f"Generate a product description for the following office item:\n"
        f"Title: {item_data['title']}\n"
        f"Brand: {item_data['brand']}\n"
        f"Category: {item_data['category']}\n"
        f"Name: {item_data['name']}\n"
        f"Features: {', '.join(item_data['features'])}\n"
        f"Description:" # The model will generate text after this.
    )

    full_prompt_for_inference = (
        f"<|start_header_id|>system<|end_header_id|>\n\n{system_message_inference}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user_prompt_inference}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

    print(f"\nPROMPT for item: {item_data['name']}")

    # Tokenize the full prompt string using the fine-tuned tokenizer.
    inputs = fine_tuned_tokenizer_for_testing(
        full_prompt_for_inference,
        return_tensors="pt",
        padding=False,
        truncation=True,
        max_length=max_seq_length_for_tokenization - 150
    ).to(fine_tuned_model_for_testing.device)

    # Perform inference
    with torch.no_grad():
        outputs = fine_tuned_model_for_testing.generate(
            **inputs,
            max_new_tokens=150,
            num_return_sequences=1,
            do_sample=True,
            temperature=0.6,
            top_k=50,
            top_p=0.9,
            pad_token_id=fine_tuned_tokenizer_for_testing.eos_token_id,
            eos_token_id=[
                fine_tuned_tokenizer_for_testing.eos_token_id,
                fine_tuned_tokenizer_for_testing.convert_tokens_to_ids("<|eot_id|>")
            ]
        )

    # Decode the generated token IDs back into a human-readable text string
    generated_text_full = fine_tuned_tokenizer_for_testing.decode(outputs[0], skip_special_tokens=False)
    # Define the marker that indicates the beginning of the assistant's response in the Llama chat format.
    assistant_marker = "<|start_header_id|>assistant<|end_header_id|>\n\n"
    # Find the last occurrence of the assistant marker in the generated text
    assistant_response_start_index = generated_text_full.rfind(assistant_marker)

    # Extract the actual generated description from the full model output
    if assistant_response_start_index != -1:
        # If the assistant marker is found, extract the text that comes after it
        generated_description = generated_text_full[assistant_response_start_index + len(assistant_marker):]
        # Define the end-of-turn token for Llama
        eot_token = "<|eot_id|>"
        # Check if the extracted description ends with the Llama end-of-turn token and remove it.
        if generated_description.endswith(eot_token):
            generated_description = generated_description[:-len(eot_token)]
        # Also check if it ends with the tokenizer's standard end-of-sequence token and remove it.
        if generated_description.endswith(fine_tuned_tokenizer_for_testing.eos_token):
             generated_description = generated_description[:-len(fine_tuned_tokenizer_for_testing.eos_token)]
        # Remove any leading or trailing whitespace from the cleaned description
        generated_description = generated_description.strip()
    else:
        # Fallback: if the assistant marker is not found, assume the generated text is everything after the original input prompt
        # Decode the input prompt tokens to get its length as a string
        input_prompt_decoded_len = len(fine_tuned_tokenizer_for_testing.decode(inputs["input_ids"][0], skip_special_tokens=False))
        generated_description = generated_text_full[input_prompt_decoded_len:].strip()
        # Clean up any trailing Llama end-of-turn token from this fallback extraction.
        if generated_description.endswith("<|eot_id|>"):
            generated_description = generated_description[:-len("<|eot_id|>")]
        generated_description = generated_description.strip()

    # Print the extracted and cleaned generated description
    print(f"GENERATED (Fine-tuned):n{generated_description}")
    # Print a separator line for better readability between items.
    print("-" * 50)

This last Jupyter Notebook cell manages the inference process, which is useful to see how well the model learned during fine-tuning.

In particular, the above code:

  • Defines testing data as a list called synthetic_test_items. Each element is a dictionary representing a product, containing details like its title, brand, category, name, and a list of features. This data serves as the input for the model, and its structure must match that of the fine-tuning dataset.
  • Sets up the inference prompt structure with system_message_inference. This must match the system message used during the training process.
  • Uses the for item_data in synthetic_test_items loop to build a user prompt for each item_data. The structure of each prompt must match the one used during training.
  • Tokenizes each prompt and controls how the model produces the output text. The actual inference happens inside the with torch.no_grad() block, where the generate() method performs the core generation step.
  • Decodes back the raw output from the model (which is a sequence of token IDs) into a human-readable string (generated_text_full) using the tokenizer.
  • Uses an if-else block to clean up the raw output from the language model to extract only the assistant’s generated product description. The raw output (generated_text_full) typically includes the entire input prompt followed by the model’s response, all formatted with Llama’s special chat tokens.
  • Prints the results.

You can expect the result as follows:

--- Generating Descriptions with Fine-Tuned Model using Synthetic Test Data ---

PROMPT for item: ErgoPro-EL100
GENERATED (Fine-tuned):
**Introducing the ErgoPro-EL100: The Ultimate Executive Ergonomic Office Chair**

Experience the pinnacle of comfort and support with the ComfortLuxe ErgoPro-EL100, designed to elevate your work experience. This premium office chair boasts a high-back design that cradles your upper body, providing unparalleled lumbar support and promoting a healthy posture.

The breathable mesh fabric ensures a cool and comfortable seating experience, while the synchronized tilt mechanism allows for seamless adjustments to your preferred working position. The padded armrests provide additional support and comfort, reducing strain on your shoulders and wrists.

Built to last, the ErgoPro-EL100 features a heavy-duty nylon base that ensures stability and durability. Whether you're working long hours or simply
--------------------------------------------------

PROMPT for item: HeightRise-FD20
GENERATED (Fine-tuned):
**Elevate Your Productivity with FlexiDesk's HeightRise-FD20 Adjustable Standing Desk Converter**

Take your work to new heights with FlexiDesk's HeightRise-FD20, the ultimate adjustable standing desk converter. Designed to revolutionize your workspace, this innovative converter transforms any desk into a comfortable and ergonomic standing station.

**Experience the Benefits of Standing**

The HeightRise-FD20 features a spacious dual-tier surface, perfect for holding your laptop, monitor, and other essential work tools. The smooth gas spring lift allows for effortless height adjustments, ranging from 6 to 17 inches, ensuring a comfortable standing position that suits your needs.

**Durable and Reliable**

With a sturdy construction and non-slip rubber feet
--------------------------------------------------

Et voilà! You have fine-tuned Llama 4 with a fresh dataset retrieved using the Bright Data Scraper APIs.

Conclusion

In this article, you learned how to fine-tune Llama 4 with a dataset scraped from Amazon using Bright Data Scraper APIs. You went through the entire process, which consists of:

  • Retrieving the data from the web.
  • Setting up a Hugging Face account with a token.
  • Setting up the necessary cloud infrastructure.
  • Training and testing (inferencing) Llama 4.

The core of the fine-tuning process relies on having high-quality datasets. Luckily, Bright Data has you covered with numerous AI-ready services for dataset acquisition or creation:

  • Scraping Browser: A Playwright-, Selenium-, and Puppeteer-compatible browser with built-in unlocking capabilities.
  • Web Scraper APIs: Pre-configured APIs for extracting structured data from 100+ major domains.
  • Web Unlocker: An all-in-one API that handles site unlocking on sites with anti-bot protections.
  • SERP API: A specialized API that unlocks search engine results and extracts complete SERP data.
  • Foundation models: Access compliant, web-scale datasets to power pre-training, evaluation, and fine-tuning.
  • Data providers: Connect with trusted providers to source high-quality, AI-ready datasets at scale.
  • Data packages: Get curated, ready-to-use datasets—structured, enriched, and annotated.

Create a Bright Data account for free to test our AI-ready data infrastructure!