
How To Fine-Tune GPT-4o With a Web Scraper API Using n8n

Discover how to fine-tune GPT-4o with n8n and a web scraper API for optimized, data-driven LLM results.
Fine-tune GPT-4o with Web Scraper using N8N

In this guide, you will see:

  • What fine-tuning is.
  • How to fine-tune GPT-4o with a web scraper API through n8n.
  • A comparison between fine-tuning approaches.
  • Why high-quality data is the heart of any fine-tuning process.

Let’s dive in!

What Is Fine-tuning?

Fine-tuning—also known as supervised fine-tuning (SFT)—is the process of improving specific knowledge or abilities in a pre-trained LLM. In the context of LLMs, pre-training refers to training an AI model from scratch.

Fine-tuning is important because models mimic their training data: when you test an LLM after training, its output tends to follow the patterns of that data. Since LLMs are generalist models, if you want them to acquire domain-specific knowledge, you have to fine-tune them on domain-specific data.
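
To make this concrete, below is a minimal sketch of what a single training example looks like in OpenAI’s chat fine-tuning format, the same format you will generate later in this guide. The product wording is invented purely for illustration; the "assistant" message is the output the model should learn to imitate:

// A minimal, illustrative training example in OpenAI's chat format.
// During fine-tuning, the model learns to produce the "assistant"
// message whenever it receives the "system" and "user" messages.
const trainingExample = {
  messages: [
    { role: "system", content: "You are an expert marketing assistant." },
    { role: "user", content: "Generate a product description for a standing desk." },
    { role: "assistant", content: "Meet the UpRise Desk, a sturdy, height-adjustable workspace..." },
  ],
};

A fine-tuning dataset is simply a collection of many such examples, serialized one per line in a JSONL file.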

If you want to learn more about SFT, read our guide on supervised fine-tuning in LLMs.

How to Fine-Tune GPT-4o With the Bright Data n8n Integration

In a recent tutorial, we covered how to fine-tune Llama 4 in the cloud using data scraped with Web Scraper APIs. In this guided section, you will achieve the same result by fine-tuning GPT-4o using n8n, a popular workflow automation platform.

In detail, we will refer to the same target web page, which is the Amazon best-sellers office products page:

The Amazon best-seller products in the category “office products”

The goal of this project is to fine-tune GPT-4o-mini so that it generates descriptions of office products given a few characteristics as input in a prompt.

Follow the steps below to learn how to fine-tune GPT-4o-mini using n8n with a training dataset scraped via Bright Data’s solutions!

Requirements

To reproduce this fine-tuning process, you need the following:

  • An n8n account (the self-hosted version also works).
  • A Bright Data account with an API token to access the Web Scraper APIs.
  • An OpenAI account with an API token and some available credit.

Great! You are ready to start fine-tuning GPT-4o.

Step #1: Create a New n8n Workflow and Install the Bright Data Node

After logging in to n8n, the dashboard looks like the following image:

The n8n dashboard

To create a new workflow, click on the “Create Workflow” button. Then, click on “Open nodes panel”:

n8n’s open nodes panel

In the nodes panel, search for Bright Data’s node. In n8n, a “node” is a building block of an automated workflow, representing a distinct step or action in the data processing pipeline.

Click on the Bright Data n8n node to install it:

Bright Data’s node in n8n’s open nodes panel

For more information, refer to the official documentation page on how to set up Bright Data in n8n.

Very well! You initialized your first n8n workflow.

Step #2: Set Up the Bright Data Node and Scrape the Data

Click on “Add first step” in the UI, and select “Trigger manually”:

The node to manually trigger the workflow

This node allows you to manually trigger the whole workflow.

Click on the “+” on the right of the manual trigger node and search for Bright Data. From the “web scraper actions” section, click on “scrape data synchronously by URL”:

Selecting the Bright Data’s node in the n8n workflow

Below is how the node settings appear when you click on it:

Bright Data’s node settings

Set them up as follows:

  • “Credential to connect with”: Click on it and add your Bright Data API token. The credentials will be saved.
  • “Operation”: Select the “Scrape by URL” option. This allows you to pass a list of URLs that the Web Scraper API will use as target pages to extract the data.
  • “Dataset”: Choose the “Amazon best seller products” option, which is optimized to extract data from Amazon’s best-selling product pages.
  • “URLs”: Go to the Amazon best-sellers office products page and copy a list of at least 10 product URLs. The minimum of 10 matters because OpenAI’s fine-tuning API requires at least 10 training examples; if you pass fewer, the fine-tuning step will return an error.
  • “Format”: Select the “JSON” data format (the Web Scraper API supports several output formats).

Below is how your workflow looks so far:

The n8n workflow so far

If you press the “Execute workflow” button, the scraped data will be available inside Bright Data’s node in the output section:

The scraped data in the JSON format
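
Each product record in that output looks roughly like the following. The values below are invented for illustration, but the field names are the ones the Code node in the next step reads:

// Illustrative shape of one scraped product record (values are made up)
const sampleProduct = {
  title: "ErgoComfort Pro Executive Chair",
  brand: "OfficeSolutions",
  features: ["Adjustable lumbar support", "Breathable mesh back"],
  description: "An executive chair designed for all-day comfort...",
  rating: 4.6,
  availability: "In Stock",
};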

Fantastic! You scraped the targeted data you needed with Bright Data’s Web Scraper API without even writing a line of code.
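
For the curious, below is a rough sketch of what the node does for you behind the scenes. The endpoint, parameters, and dataset ID are assumptions based on Bright Data’s dataset-trigger API, so treat this as illustrative and check the official docs before relying on it:

// Hedged sketch of a direct Web Scraper API call (Node.js 18+).
// Endpoint, parameters, and dataset ID are assumptions, not
// guaranteed to match what the n8n node calls internally.
const token = process.env.BRIGHT_DATA_API_TOKEN;
const datasetId = "gd_xxxxxxxxxxxxxx"; // placeholder dataset ID

const response = await fetch(
  `https://api.brightdata.com/datasets/v3/trigger?dataset_id=${datasetId}&format=json`,
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    // one object per target URL, mirroring the node's "URLs" field
    body: JSON.stringify([
      { url: "https://www.amazon.com/dp/B0XXXXXXXX" }, // placeholder product URL
      // ...at least 10 URLs in total
    ]),
  }
);
console.log(await response.json());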

Step #3: Set Up the Code Node

Connect a Code node to the Bright Data node and select JavaScript in the “Language” box:

The Code node

In the “JavaScript” field, paste the following code:

// get all incoming items
const allInputItems = $input.all();

let jsonlString = "";
// define the training prompt
const systemMessage = "You are an expert marketing assistant specializing in writing compelling and informative product descriptions.";

// loop through each item retrieved from the input
for (const item of allInputItems) {
  const product = item.json;

  // validate if the product data exists and is an object
  if (!product || typeof product !== 'object') {
    console.warn('Skipping an item because product data is missing or not an object:', item);
    continue;
  }

  // extract product data
  const title = product.title || "N/A";
  const brand = product.brand || "N/A";
  let featuresString = "Not specified";
  if (product.features && Array.isArray(product.features) && product.features.length > 0) {
    featuresString = product.features.slice(0, 5).join(', ');
  }
  // create a snippet of the original product description for training
  const originalDescSnippet = (product.description || "No original description available.").substring(0, 250) + "...";
  // create prompt with specific details about the product
  const userPrompt = `Generate a product description for the following item. Title: ${title}. Brand: ${brand}. Key Features: ${featuresString}. Original Description Snippet: ${originalDescSnippet}.`;

  // create template for the kind of description the AI should generate
  let idealDescription = `Discover the ${title} from ${brand}, a top-choice for discerning customers. `;
  idealDescription += `Key highlights include: ${featuresString}. `;
  if (product.rating) {
    idealDescription += `Boasting an impressive customer rating of ${product.rating} out of 5 stars! `;
  }
  idealDescription += `This product, originally described as "${originalDescSnippet}", is perfect for anyone seeking quality and reliability. `;
  idealDescription += `Don't miss out on the ${product.availability === "In Stock" ? "readily available" : "upcoming"} ${title} – enhance your collection today!`;

  // create a training example object in the format expected by OpenAI
  const trainingExample = {
    messages: [
      { role: "system", content: systemMessage },
      { role: "user", content: userPrompt },
      { role: "assistant", content: idealDescription }
    ]
  };
  jsonlString += JSON.stringify(trainingExample) + "\n";
}

// remove any leading or trailing whitespace
const fileContentString = jsonlString.trim();

// check if any product data was actually processed
if (fileContentString.length === 0) {
  console.warn("No product data was processed, outputting empty file content.");
  return [{
    json: { error: "No products processed", fileNameToUse: "data.jsonl" },
    binary: {}
  }];
}

// convert the final JSONL string into a Buffer (raw binary data)
const buffer = Buffer.from(fileContentString, 'utf-8');
// define the filename that will be used when this data is sent to OpenAI
const actualFileNameForOpenAI = "data.jsonl";
// define the MIME type for the file
const mimeType = 'application/jsonl';

// prepare the binary data for output
const binaryData = await this.helpers.prepareBinaryData(buffer, actualFileNameForOpenAI, mimeType);

// return the processed data
return [{
  json: {
    processedFileName: actualFileNameForOpenAI
  },
  binary: {
    // the "Input Data Field Name" in the OpenAI node
    "data.jsonl": binaryData
  }
}];

The input of this node is the JSON data with the products scraped by Bright Data. However, the OpenAI node needs a JSONL file. This JavaScript code transforms the JSON into JSONL as follows:

  • It retrieves all the data coming from the previous node with the method $input.all().
  • It iterates and processes products. In particular, for each product item, it:
    • Extracts product details such as title, brand, features, description, rating, and availability. It includes fallback values if certain data is missing.
    • Constructs a userPrompt by formatting these details into a request for the LLM to generate the product description.
    • Generates an idealDescription using a template that incorporates the product’s attributes. This serves as the desired “assistant” response in the training data.
    • Combines a system message, the userPrompt, and the idealDescription into a single trainingExample object, formatted for conversational LLM training.
    • Serializes this trainingExample into a JSON string and appends it to a growing string, with each JSON object on a new line (JSONL format).
  • After processing all items, it converts the accumulated JSONL string into a Buffer of binary data.
  • It returns the binary data as a file named data.jsonl, exposed under the field name the OpenAI node will read.
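
For reference, one line of the resulting data.jsonl file looks roughly like this (wrapped here for readability; in the actual file, each training example sits on a single line, and the values depend on the scraped product):

{"messages": [{"role": "system", "content": "You are an expert marketing assistant specializing in writing compelling and informative product descriptions."}, {"role": "user", "content": "Generate a product description for the following item. Title: ErgoComfort Pro Executive Chair. Brand: OfficeSolutions. ..."}, {"role": "assistant", "content": "Discover the ErgoComfort Pro Executive Chair from OfficeSolutions, a top-choice for discerning customers. ..."}]}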

If you click on “Execute step” in the Code node, the JSONL will be available in the output section:

The data.jsonl file in the Code node’s output

Below is how your workflow looks so far:

The n8n workflow until now

The green lines and check marks indicate that every step completed successfully.

Hooray! You retrieved the data using Bright Data and saved it in the JSONL format. You are now ready to push it into the LLM.

Step #4: Push the Fine-Tuning Data Into the OpenAI Node

The fine-tuning JSONL file is ready to be uploaded to the OpenAI platform. To do so, add an OpenAI node and choose the “Upload a file” action in the “File actions” section:

Adding an OpenAI node to the workflow

Below are the settings you need to configure:

The OpenAI node settings

The above node gives the input for the fine-tuning process. Set the parameters as follows:

  • “Credential to connect with”: Add your OpenAI API token. Once you set it, the credentials will be saved.
  • “Resource”: Choose “File”. This is because you will upload a JSONL file to the platform.
  • “Operation”: Select “Upload a File”.
  • “Input Data Field Name”: Set it to data.jsonl, the binary field name produced by the Code node.
  • In the “Options” section, add “Purpose” and choose “Fine-tune.”

After executing the step, the output will look as follows:

The output of the OpenAI upload file node
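
For reference, this node performs roughly the same work as the following standalone call to OpenAI’s file upload API. This is a sketch assuming Node.js 18+; the file ID shown in the comment is illustrative:

// Hedged sketch of the upload the OpenAI node performs (Node.js 18+).
import fs from "node:fs";

const form = new FormData();
form.append("purpose", "fine-tune");
form.append("file", new Blob([fs.readFileSync("data.jsonl")]), "data.jsonl");

const res = await fetch("https://api.openai.com/v1/files", {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
  body: form,
});
const uploaded = await res.json();
console.log(uploaded.id); // e.g. "file-abc123", used as training_file in Step #5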

Now, your workflow will look like this:

The workflow until now

Amazing! You prepared everything for the fine-tuning process. Time to go through the actual process.

Step #5: Fine-tune the LLM

To perform the actual fine-tuning, connect an HTTP Request node to the OpenAI one:

The HTTP Request node settings

The settings must be as follows:

  • The “Method” must be “POST,” as you are creating a new fine-tuning job.
  • The “URL” field must point to the https://api.openai.com/v1/fine_tuning/jobs endpoint, which is the standard URL for fine-tuning jobs on the OpenAI platform.
  • For the “Authentication” field, choose “Predefined Credential Type” so that it will use your OpenAI API token.
  • For the “Credential Type,” select “OpenAi” so that the node will connect to OpenAI.
  • For the “OpenAI” box, choose your OpenAI account name.
  • The “Send Body” toggle must be enabled. Select “JSON” and “Using JSON” respectively for the fields “Body Content Type” and “Specify Body.”

The JSON field must contain the following:

{
  "training_file": "{{ $json.id }}",
  "model": "gpt-4o-mini-2024-07-18"
}

This JSON:

  • Specifies the training file via $json.id, which reads the file ID returned by the upload node in Step #4.
  • Defines the model to use for the fine-tuning. In this case, it is the version of GPT-4o-mini released on 2024-07-18.
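
For reference, the same request expressed as a standalone script would look roughly like this. It is a sketch; the training file ID is a placeholder for the one returned by the upload in Step #4:

// Hedged sketch of the fine-tuning job the HTTP Request node creates.
const res = await fetch("https://api.openai.com/v1/fine_tuning/jobs", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    training_file: "file-abc123", // placeholder: the ID returned in Step #4
    model: "gpt-4o-mini-2024-07-18",
  }),
});
const job = await res.json();
console.log(job.id, job.status); // e.g. "ftjob-..." and "validating_files"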

Below is the output you will receive:

The output of the HTTP request node

When the HTTP Request node is triggered, the fine-tuning process starts. You can monitor its progress in the fine-tuning section of the OpenAI platform. When the fine-tuning process completes successfully, OpenAI provides you with the ID of the fine-tuned model you will use in Step #7:

The successfully completed fine-tuning process in the OpenAI platform
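
If you prefer checking the job from code rather than from the dashboard, below is a minimal sketch that queries OpenAI’s fine-tuning job endpoint. The job ID is a placeholder for the one returned when the job was created:

// Hedged sketch: retrieve the status of a fine-tuning job.
const jobId = "ftjob-abc123"; // placeholder job ID
const res = await fetch(`https://api.openai.com/v1/fine_tuning/jobs/${jobId}`, {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});
const job = await res.json();
console.log(job.status); // e.g. "running", then "succeeded"
console.log(job.fine_tuned_model); // the model ID to select in Step #7 once done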

The n8n workflow should now look as follows:

The workflow until now

Congratulations! You trained your first GPT model using data retrieved with Bright Data’s Web Scraper API through n8n.

This is the final node of the first half of the entire workflow.

Step #6: Add the Chat Node

The second half of the entire workflow must start with a Chat Trigger node. There, you will insert the prompt to test the fine-tuned LLM:

The Chat Trigger node

Below is the prompt you can insert in the chat:

You are an expert marketing assistant specializing in writing compelling and informative product descriptions. Generate a product description for the following office item:

Title: ErgoComfort Pro Executive Chair.

Brand: OfficeSolutions.

Key Features: Adjustable lumbar support, Breathable mesh back, Memory foam seat cushion, 360-degree swivel, Smooth-rolling casters.

As you can see, this prompt:

  • Repeats the same “expert marketing assistant” phrasing used in the training phase.
  • Asks the model to generate a product description given the information about the office item, defined by:
    • The title.
    • The brand.
    • Key features of the office product.

It is important that the prompt follows this structure. At this stage, the model mimics its training data, so you have to give it a prompt and data similar to those used during training. The fine-tuned LLM will then write the product description based on those factors.

You can insert the prompt in the chat section at the bottom of the UI:

The prompt in the chat

This is your current n8n workflow:

The two branches of the workflow so far

Terrific! You defined the prompt to test the fine-tuned model.

Step #7: Add the AI Agent and OpenAI Chat Nodes

You now have to connect an AI Agent node to the Chat Trigger node:

The AI Agent node settings

The settings must be:

  • “Agent”: Choose “Conversational agent.” This agent type handles chat-style interactions, ingesting its input from the Chat Trigger node as any other conversational agent would.
  • Set the “Source of Prompt (User Message)” as “Connected Chat Trigger Node” so that it can ingest the prompt directly from the chat.

Connect an OpenAI Chat Model node to the AI Agent one through its “Chat Model” connection option:

The OpenAI Chat Model and AI Agent nodes connected

The image below shows the settings of the OpenAI Chat Model node:

The settings of the OpenAI Chat Model node

Configure the node as follows:

  • “Credential to connect with”: Select the OpenAI credentials you saved earlier.
  • “Model”: Choose the fine-tuned model that OpenAI returned at the end of Step #5 (its ID starts with the ft: prefix).

Return to the AI Agent node, and click on the “Execute step” button. You will see the resulting description of the product:

The resulting description of the office item

Below is the resulting description in plain text:

Introducing the remarkable ErgoComfort Pro Executive Chair by OfficeSolutions, a standout solution designed to meet your office needs. This chair shines with key features including Adjustable lumbar support, Breathable mesh back, Memory foam seat cushion, 360-degree swivel, Smooth-rolling casters, and offers exceptional comfort and durability. Crafted for long-lasting performance, the ErgoComfort Pro Executive Chair offers great value and is built to withstand the demands of daily use. Whether you're looking to enhance your productivity or upgrade your current setup, the readily available ErgoComfort Pro Executive Chair is an excellent choice. Experience the difference today!

As you can see, the description leverages the product’s title (“ErgoComfort Pro Executive Chair”), its brand (“OfficeSolutions”), and all of its features. Notably, it does not just list the input data; it weaves that data into an engaging description. The last sentences are the key:

  • “Crafted for long-lasting performance, the ErgoComfort Pro Executive Chair offers great value and is built to withstand the demands of daily use.”
  • “Whether you’re looking to enhance your productivity or upgrade your current setup, the readily available ErgoComfort Pro Executive Chair is an excellent choice. Experience the difference today!”

Et voilà! You tested your fine-tuned GPT-4o-mini model, which generated a product description to answer the given prompt (defined in Step #6).
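
As a side note, once the fine-tuning job succeeds, you can also query the fine-tuned model directly, outside n8n, via OpenAI’s chat completions API. Below is a hedged sketch; the model ID is a placeholder for the one your fine-tuning job returned:

// Hedged sketch: calling the fine-tuned model via the chat completions API.
const res = await fetch("https://api.openai.com/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "ft:gpt-4o-mini-2024-07-18:your-org::abc123", // placeholder model ID
    messages: [
      {
        role: "system",
        content: "You are an expert marketing assistant specializing in writing compelling and informative product descriptions.",
      },
      {
        role: "user",
        content: "Generate a product description for the following item. Title: ErgoComfort Pro Executive Chair. Brand: OfficeSolutions. Key Features: Adjustable lumbar support, Breathable mesh back, Memory foam seat cushion.",
      },
    ],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);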

Step #8: Put It All Together

The final GPT-4o n8n fine-tuning workflow now looks as follows:

The entire workflow to fine-tune and test GPT-4o-mini using n8n

Now that the workflow is fully set up, if you click on “Execute workflow,” it will be executed again from the beginning. However, note that the results are saved at each step. This means that if you want to try different prompts to test the fine-tuned model, you only need to write them into the Chat Trigger node and execute that node and the AI agent node.

Comparing Fine-Tuning Approaches: Cloud Infrastructure vs Workflow Automation

This guide was written for two reasons:

  1. Teaching you how to fine-tune an LLM using a workflow automation tool like n8n.
  2. Comparing this way of fine-tuning LLMs with the one used in our article “Fine-Tuning Llama 4 with Fresh Web Data for Better Results.”

Time to compare the two approaches!

Comparing the Fine-Tuning Methods

The approach we followed in our previous article to fine-tune Llama 4 requires:

  • The use of a cloud infrastructure, which takes time to set up and incurs costs.
  • Writing code to retrieve the data using Bright Data’s Scraper APIs.
  • Setting up Hugging Face.
  • Developing a notebook with the Python code for fine-tuning, which takes time and technical skills.

The technical abilities required are hard to quantify. However, you can estimate the time needed to set up the whole infrastructure and the money spent:

  • Time: about a whole working day.
  • Money: about $25 upfront for the cloud service, after which usage is billed per hour. That initial $25 is therefore the minimum price for using the cloud.

The approach you learned in this guide requires:

  • n8n, which is free to use and does not require much technical expertise.
  • An OpenAI API token to access GPT-4o or other models.
  • Basic coding skills, specifically to write a JavaScript snippet for the Code node.

In this case, far fewer technical abilities are needed. If you cannot write the JavaScript snippet on your own, any LLM can easily generate it. Other than that, you do not need to write any other code in the entire workflow.

In this case, you can estimate the time needed to set up the infrastructure and the money needed as follows:

  • Time: about half a working day.
  • Money: about $10 for an OpenAI API token. Even in this case, you pay per API request, but you can start with just $10. An n8n license currently costs $24/month for the basic plan, or it is completely free if you use the self-hosted version. So, to start, you need about $10.

Which Approach Should You Choose?

| Aspect | Cloud Infrastructure Approach | Workflow Automation Approach |
| --- | --- | --- |
| Technical skills | High (requires Python, cloud, and data retrieval coding skills) | Low (basic JavaScript) |
| Time to set up | About a full working day | About half a working day |
| Initial cost | ~$25 minimum for cloud service + hourly fees | ~$10 for OpenAI API token + $24/month n8n license, or free if self-hosted |
| Flexibility | High (suitable for advanced customization and various use cases) | Moderate (good for automating workflows and low-code customization) |
| Best for | Teams with high technical skills needing powerful, flexible infrastructure | Teams looking for a quick setup or with limited coding expertise |
| Additional benefits | Full control over the fine-tuning environment and process | Pre-built templates, low entry barrier, integrations with other workflows |

The two approaches require a similar initial investment, both in time and money. So, how do you choose between the two? Here are some guidelines:

  • n8n: Choose n8n—or any similar workflow automation tool—to fine-tune LLMs if you need to automate other workflows as well and your team is not highly technical. This low-code approach requires writing code only when you need customization, and it provides free pre-built templates that lower the entry barrier.
  • Cloud services: Choose a cloud service to fine-tune LLMs if you need it for multiple purposes and have a highly skilled team. Setting up the cloud environment and developing the fine-tuning notebook require advanced technical expertise.

The Heart of the Fine-Tuning Process: High-Quality Data

No matter which approach you choose, Bright Data plays a central role in both. The reason is simple: high-quality data is the foundation of any fine-tuning process!

Bright Data has you covered with an AI infrastructure for data, offering a range of services and solutions to support your AI applications:

  • MCP Server: An open-source Node.js MCP server that exposes over 20 tools for data retrieval in AI agents.
  • Web Scraper APIs: Pre-configured APIs for extracting structured data from 100+ major domains.
  • Web Unlocker: An all-in-one API that handles unlocking for sites with anti-bot protections.
  • SERP API: A specialized API that unlocks search engine results and extracts complete SERP data.
  • Foundation models: Access compliant, web-scale datasets to power LLM pre-training, evaluation, and fine-tuning.
  • Data providers: Connect with trusted providers to source high-quality, AI-ready datasets at scale.
  • Data packages: Get curated, ready-to-use datasets—structured, enriched, and annotated.

While this guide showed you how to fine-tune GPT-4o-mini by scraping the training data with the Web Scraper APIs, you can take a different approach using one of our other services.

Conclusion

In this article, you learned how to fine-tune GPT-4o-mini on data scraped from Amazon, using n8n to automate the entire workflow. You went through the entire process, which consists of two branches:

  1. The first performs the fine-tuning after scraping the data.
  2. The second tests the fine-tuned model by inserting a prompt via a chat trigger.

You also compared this approach, which uses a workflow automation tool, with one that relies on a cloud service.

Regardless of the approach that best suits your needs and team, remember that high-quality data remains at the core of the process. In this regard, Bright Data has you covered with several data services for AI.

Create a Bright Data account for free and test our AI-ready data infrastructure!

Federico Trotta

Technical Writer


Federico Trotta is a technical writer, editor, and data scientist. Expert in technical content management, data analysis, machine learning, and Python development.
