
What Is MoE? A Deep Dive Into a Popular AI Architecture

Discover the power of Mixture of Experts in machine learning with this in-depth guide, covering its architecture, benefits, and implementation steps.

In this guide on Mixture of Experts, you will learn:

  • What MoE is and how it differs from traditional models
  • Benefits of using it
  • A step-by-step tutorial on how to implement it

Let’s dive in!

What is MoE?

An MoE (Mixture of Experts) is a machine learning architecture that combines multiple specialized sub-models—the “experts”—within a larger system. Each expert learns to handle different aspects of a task or distinct types of data.

A fundamental component in this architecture is the “gating network” or “router”. This component decides which expert, or combination of experts, should process a specific input. The gating network also assigns weights to each expert’s output. Weights are like scores, as they show how much influence each expert’s result should have.

The MoE architecture

In simple terms, the gating network uses weights to adjust each expert’s contribution to the final answer. To do so, it considers the input’s specific features. This allows the system to handle many types of data better than a single model could.
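To make the weighting idea concrete, here is a minimal, self-contained sketch (with made-up scores and outputs, not taken from any real model) of how a gating network can turn raw scores into softmax weights and blend the experts' results:

```python
import math

def softmax(scores):
    # Convert raw gating scores into weights that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores the gating network assigns to three experts
gating_scores = [2.0, 0.5, -1.0]
weights = softmax(gating_scores)

# Hypothetical outputs from the three experts for the same input
expert_outputs = [0.8, 0.3, 0.1]

# The final answer is the weighted sum of the experts' outputs
final_output = sum(w * o for w, o in zip(weights, expert_outputs))
```

In a real MoE layer, the scores come from a small learned network and the expert outputs are vectors rather than scalars, but the weighted-sum principle is the same.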

Differences Between MoE and Traditional Dense Models

In the context of neural networks, a traditional dense model works in a different way compared to MoE. For any piece of information you feed into it, the dense model uses all of its internal parameters to perform calculations. Thus, every part of its computational machinery is engaged for every input.

The main point is that, in dense models, all parts are engaged for every task. This contrasts with MoE, which activates only relevant expert subsections.

Below are the key differences between MoE and dense models:

  • Parameter usage:
    • Dense model: For any given input, the model uses all the parameters in the computation.
    • MoE model: For any given input, the model uses only the parameters of the selected expert(s) and the gating network. Thus, if an MoE model has a large number of parameters, it activates only a fraction of these parameters for any single computation.
  • Computational cost:
    • Dense model: The amount of computation for a dense layer is fixed for every input, as all its parts are always engaged.
    • MoE model: The computational cost for processing an input through an MoE layer can be lower than a dense layer of comparable total parameter size. That is because only a subset of the model—the chosen experts—performs the work. This allows MoE models to scale to a much larger number of total parameters without a proportional increase in the computational cost for each individual input.
  • Specialization and learning:
    • Dense model: All parts of a dense layer learn to contribute to processing all types of inputs it encounters.
    • MoE model: Different expert networks can learn to become specialized. For example, one expert might become good at processing questions about history, while another specializes in scientific concepts. The gating network learns to identify the type of input and route it to the most appropriate experts. This can lead to more nuanced and effective processing.
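The sparse activation described above can be sketched with a hypothetical top-k router: given gating scores for 8 experts, only the 2 highest-scoring ones are selected, so only their parameters would take part in the computation for that input. The scores below are illustrative values, not from a trained model:

```python
def top_k_routing(gating_scores, k=2):
    # Keep only the k experts with the highest gating scores;
    # all other experts are skipped entirely for this input
    ranked = sorted(
        range(len(gating_scores)),
        key=lambda i: gating_scores[i],
        reverse=True,
    )
    return ranked[:k]

# Hypothetical scores for 8 experts; only 2 of them will run
scores = [0.1, 1.2, -0.3, 2.4, 0.0, 0.7, -1.1, 0.9]
active = top_k_routing(scores, k=2)  # indices of the chosen experts
```

This is why an MoE model with billions of total parameters can keep its per-input compute close to that of a much smaller dense model: the cost scales with the number of active experts, not the total.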

Benefits of the Mixture of Experts Architecture

The MoE architecture is highly relevant in modern AI, particularly when dealing with LLMs. The reason is that it offers a way to increase a model’s capacity, which is its ability to learn and store information, without a proportional increase in computational cost during use.

The main advantages of MoE in AI include:

  • Reduced inference latency: MoE models can decrease the time required to generate a prediction or output, known as inference latency. This is possible because only the most relevant experts are activated for each input.
  • Enhanced training scalability and efficiency: You can take advantage of the parallelism in MoE architectures during the AI training process. Different experts can be trained concurrently on diverse data subsets or specialized tasks. This can lead to faster convergence and reduced training time.
  • Improved model modularity and maintainability: The discrete nature of expert subnetworks facilitates a modular approach to model development and maintenance. Individual experts can be independently updated, retrained, or replaced with improved versions without requiring a complete retraining of the entire model. This simplifies the integration of new knowledge or capabilities and allows for more targeted interventions if a specific expert’s performance degrades.
  • Potential for increased interpretability: The specialization of experts may offer clearer insights into the model’s decision-making processes. Analyzing which experts are consistently activated for specific inputs can provide clues about how the model has learned to partition the problem space and attribute relevance. This characteristic offers a potential way for better understanding complex model behaviors compared to monolithic dense networks.
  • Greater energy efficiency at scale: MoE-based models can achieve lower energy consumption per query compared to traditional dense models. That is due to the sparse activation of parameters during inference, as they use only a fraction of the available parameters per input.

How To Implement MoE: A Step-by-Step Guide

In this tutorial section, you will learn how to use MoE. In particular, you will use a dataset containing sports news. The MoE will leverage two experts based on the following models:

  1. sshleifer/distilbart-cnn-6-6: To summarize the content of each news item.
  2. distilbert-base-uncased-finetuned-sst-2-english: To calculate the sentiment of each news item. In sentiment analysis, “sentiment” refers to the emotional tone, opinion, or attitude expressed in a text. Common categories are:
    • Positive: Expresses favorable opinions, happiness, or satisfaction.
    • Negative: Expresses unfavorable opinions, sadness, anger, or dissatisfaction.
    • Neutral: Expresses no strong emotion or opinion, often factual.

Note that this particular model is binary: it outputs only the Positive and Negative labels, each with a confidence score.

At the end of the process, each news item will be saved in a JSON file containing:

  • The ID, the headline, and the URL.
  • The summary of the content.
  • The sentiment of the content with the confidence score.

The dataset containing the news can be retrieved using Bright Data’s Web Scraper APIs, specialized scraping endpoints to retrieve structured web data from 100+ domains in real-time.

The dataset containing the input JSON data can be generated using the code in our guide “Understanding Vector Databases: The Engine Behind Modern AI.” Specifically, refer to step 1 in the “Practical Integration: A Step-by-Step Guide” chapter.

The input JSON dataset—called news-data.json—contains an array of news items as below:

[
  {
    "id": "c787dk9923ro",
    "url": "https://www.bbc.com/sport/tennis/articles/c787dk9923ro",
    "author": "BBC",
    "headline": "Wimbledon plans to increase 'Henman Hill' capacity and accessibility",
    "topics": [
      "Tennis"
    ],
    "publication_date": "2025-04-03T11:28:36.326Z",
    "content": "Wimbledon is planning to renovate its iconic 'Henman Hill' and increase capacity for the tournament's 150th anniversary. Thousands of fans have watched action on a big screen from the grass slope which is open to supporters without show-court tickets. The proposed revamp - which has not yet been approved - would increase the hill's capacity by 20% in time for the 2027 event and increase accessibility. It is the latest change planned for the All England Club, after a 39-court expansion was approved last year. Advertisement \"It's all about enhancing this whole area, obviously it's become extremely popular but accessibility is difficult for everyone,\" said four-time Wimbledon semi-finalist Tim Henman, after whom the hill was named. \"We are always looking to enhance wherever we are on the estate. This is going to be an exciting project.\"",
    "videos": [],
    "images": [
      {
        "image_url": "https://ichef.bbci.co.uk/ace/branded_sport/1200/cpsprodpb/31f9/live/0f5b2090-106f-11f0-b72e-6314f702e779.jpg",
        "image_description": "Main image"
      },
      {
        "image_url": "https://ichef.bbci.co.uk/ace/standard/2560/cpsprodpb/31f9/live/0f5b2090-106f-11f0-b72e-6314f702e779.jpg",
        "image_description": "A render of planned improvements to Wimbledon's Henman Hill"
      }
    ],
    "related_articles": [
      {
        "article_title": "Live scores, results and order of playLive scores, results and order of play",
        "article_url": "https://www.bbc.com/sport/tennis/scores-and-schedule"
      },
      {
        "article_title": "Get tennis news sent straight to your phoneGet tennis news sent straight to your phone",
        "article_url": "https://www.bbc.com/sport/articles/cl5q9dk9jl3o"
      }
    ],
    "keyword": null,
    "timestamp": "2025-05-19T15:03:16.568Z",
    "input": {
      "url": "https://www.bbc.com/sport/tennis/articles/c787dk9923ro",
      "keyword": ""
    }
  },
  // omitted for brevity...
]

Follow the instructions below and build your MoE example!

Prerequisites and Dependencies

To replicate this tutorial, you must have Python 3.10.1 or higher installed on your machine.

Suppose you call the main folder of your project moe_project/. At the end of this step, the folder will have the following structure:

moe_project/
├── venv/
├── news-data.json
└── moe_analysis.py

Where:

  • venv/ contains the Python virtual environment.
  • news-data.json is the input JSON file containing the news data you scraped with Web Scraper API.
  • moe_analysis.py is the Python file that contains the coding logic.

You can create the venv/ virtual environment directory like so:

python -m venv venv

To activate it, on Windows, run:

venv\Scripts\activate

Equivalently, on macOS and Linux, execute:

source venv/bin/activate

In the activated virtual environment, install the dependencies with:

pip install transformers torch

These libraries are:

  • transformers: Hugging Face’s library for state-of-the-art machine learning models.
  • torch: PyTorch, an open-source machine learning framework.

Step #1: Setup and Configuration

Initialize the moe_analysis.py file by importing the required libraries and setting up some constants:

import json
from transformers import pipeline

# Define the input JSON file
JSON_FILE = "news-data.json"
# Specify the model for generating summaries
SUMMARIZATION_MODEL = "sshleifer/distilbart-cnn-6-6"
# Specify the model for analyzing sentiment
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

This code defines:

  • The name of the input JSON file containing the scraped news.
  • The models to use for the experts.

Perfect! You have what it takes to get started with MoE in Python.

Step #2: Define the News Summarization Expert

This step involves creating a class that encapsulates the functionality of the expert for summarizing the news:

class NewsSummarizationLLMExpert:
    def __init__(self, model_name=SUMMARIZATION_MODEL):
        self.model_name = model_name
        self.summarizer = None

        # Initialize the summarization pipeline
        self.summarizer = pipeline(
            "summarization",
            model=self.model_name,
            tokenizer=self.model_name,
        )

    def analyze(self, article_content, article_headline=""):
        # Call the summarizer pipeline with the article content
        summary_outputs = self.summarizer(
            article_content,
            max_length=300,
            min_length=30,
            do_sample=False
        )
        # Extract the summary text from the pipeline's output
        summary = summary_outputs[0]["summary_text"]
        return { "summary": summary }

The above code:

  • Initializes the summarization pipeline with the method pipeline() from Hugging Face.
  • Defines how the summarization expert processes an article with the method analyze().

Good! You just created the first expert in the MoE architecture that takes care of summarizing the news.

Step #3: Define the Sentiment Analysis Expert

Similar to the summarization expert, define a specialized class for performing sentiment analysis on the news:

class SentimentAnalysisLLMExpert:
    def __init__(self, model_name=SENTIMENT_MODEL):
        self.model_name = model_name
        self.sentiment_analyzer = None

        # Initialize the sentiment analysis pipeline
        self.sentiment_analyzer = pipeline(
            "sentiment-analysis",
            model=self.model_name,
            tokenizer=self.model_name,
        )

    def analyze(self, article_content, article_headline=""):
        # Define the max number of characters to analyze
        max_chars_for_sentiment = 2000
        # Truncate the content if it exceeds the maximum limit
        truncated_content = article_content[:max_chars_for_sentiment]
        # Call the sentiment analyzer pipeline
        sentiment_outputs = self.sentiment_analyzer(truncated_content)
        # Extract the sentiment label
        label = sentiment_outputs[0]["label"]
        # Extract the sentiment score
        score = sentiment_outputs[0]["score"]
        return { "sentiment_label": label, "sentiment_score": score }

This snippet:

  • Initializes the sentiment analysis pipeline with the method pipeline().
  • Defines the method analyze() to perform sentiment analysis. It also returns the sentiment label—negative or positive—and the confidence score.

Very well! You now have another expert that calculates and expresses the sentiment of the text in the news.

Step #4: Implement the Gating Network

Now, you have to define the logic of the gating network that routes inputs to the experts:

def route_to_experts(item_data, experts_registry):
    chosen_experts = []
    # Select the summarizer and sentiment analyzer
    chosen_experts.append(experts_registry["summarizer"])
    chosen_experts.append(experts_registry["sentiment_analyzer"])
    return chosen_experts

In this implementation, the gating network is simple: it always selects both experts for every news item, and the orchestration then applies them sequentially:

  1. It summarizes the text.
  2. It calculates the sentiment.

Note: The gating network is quite simple in this example. At the same time, achieving the same goal with a single, larger model would require significantly more computation. In contrast, the two experts are leveraged only for the tasks that are relevant to them. This makes it a simple yet effective application of the Mixture of Experts architecture.

In other scenarios, this part of the process could be improved by training an ML model to learn how and when to activate a specific expert. This would allow the gating network to respond dynamically.
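As a taste of what dynamic routing could look like, here is a hypothetical rule-based router. The function name, thresholds, and keyword markers below are all illustrative assumptions, not part of the tutorial code: it runs the summarizer only on long articles and the sentiment analyzer only on content that looks opinionated:

```python
def dynamic_route_to_experts(item_data, experts_registry):
    # Hypothetical rule-based routing: pick experts based on
    # simple properties of the input instead of always using both
    chosen = []
    content = item_data.get("content", "")

    # Only summarize articles long enough to benefit from it
    if len(content) > 500:
        chosen.append(experts_registry["summarizer"])

    # Only run sentiment analysis on opinion-like content
    opinion_markers = ("says", "believes", "criticised", "praised")
    if any(marker in content.lower() for marker in opinion_markers):
        chosen.append(experts_registry["sentiment_analyzer"])

    return chosen
```

A learned gating network generalizes this idea: instead of hand-written rules, a small trained model scores each expert for the given input.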

Fantastic! The gating network logic is set up and ready to operate.

Step #5: Main Orchestration Logic for Processing News Data

Define the core function that manages the entire workflow, which consists of the following tasks:

  1. Load the JSON dataset.
  2. Initialize the two experts.
  3. Iterate through the news items.
  4. Route them to the chosen experts.
  5. Collect the results.

You can do it with the following code:

def process_news_json_with_moe(json_filepath):
    # Open and load news items from the JSON file
    with open(json_filepath, "r", encoding="utf-8") as f:
        news_items = json.load(f)

    # Create a dictionary to hold instances of expert classes
    experts_registry = {
        "summarizer": NewsSummarizationLLMExpert(),
        "sentiment_analyzer": SentimentAnalysisLLMExpert()
    }

    # List to store the analysis results
    all_results = []

    # Iterate through each news item in the loaded data
    for i, news_item in enumerate(news_items):
        print(f"\n--- Processing Article {i+1}/{len(news_items)} ---")
        # Extract relevant data from the news item
        news_id = news_item.get("id")
        headline = news_item.get("headline")
        content = news_item.get("content")
        url = news_item.get("url")

        # Print progress
        print(f"ID: {news_id}, Headline: {headline[:70]}...")

        # Use the gating network to determine the experts to use
        active_experts = route_to_experts(news_item, experts_registry)

        # Prepare a dictionary to store the analysis results
        news_item_analysis_results = {
            "id": news_id,
            "headline": headline,
            "url": url,
            "analyses": {}
        }

        # Iterate through the experts and apply their analysis
        for expert_instance in active_experts:
            expert_name = expert_instance.__class__.__name__ # Get the class name of the expert
            try:
                # Call the expert's analyze method
                analysis_result = expert_instance.analyze(article_content=content, article_headline=headline)
                # Store the result under the expert's name
                news_item_analysis_results["analyses"][expert_name] = analysis_result

            except Exception as e:
                # Handle any errors during analysis by a specific expert
                print(f"Error during analysis with {expert_name}: {e}")
                news_item_analysis_results["analyses"][expert_name] = { "error": str(e) }

        # Add the current item's results to the overall list
        all_results.append(news_item_analysis_results)

    return all_results

In this snippet:

  • The for loop iterates over all the loaded news items.
  • The try-except block performs the analysis and manages errors that can occur. Here, failures are mainly caused by inputs that exceed the models' length limits (see the max_length and max_chars_for_sentiment parameters defined in the previous functions). Since the retrieved articles vary in length, robust error handling is essential to keep a single failing item from stopping the whole process.
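One way to reduce length-related failures is to truncate the content before passing it to an expert. The helper below is a hypothetical sketch: the 3,000-character cap is an illustrative value chosen as a rough proxy for a model's token limit, not a verified number, and cutting at a sentence boundary keeps the summarizer from receiving a half-finished sentence:

```python
# Rough character cap used as a proxy for a model's token limit
# (the exact limit depends on the model and its tokenizer)
MAX_CHARS_FOR_SUMMARY = 3000

def safe_truncate(text, max_chars=MAX_CHARS_FOR_SUMMARY):
    # Return the text unchanged if it already fits
    if len(text) <= max_chars:
        return text
    # Otherwise cut at the last sentence boundary before the cap
    clipped = text[:max_chars]
    last_period = clipped.rfind(". ")
    return clipped[: last_period + 1] if last_period != -1 else clipped
```

Truncation by characters is only an approximation; for precise control you would count tokens with the model's own tokenizer.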

Here we go! You defined the orchestration function of the whole process.

Step #6: Launch the Processing Function

As a final part of the script, you have to execute the main processing function and then save the analyses to an output JSON file as follows:

# Call the main processing function with the input JSON file
final_analyses = process_news_json_with_moe(JSON_FILE)

print("\n\n--- MoE Analysis Complete ---")

# Write the final analysis results to a new JSON file
with open("analyzed_news_data.json", "w", encoding="utf-8") as f_out:
    json.dump(final_analyses, f_out, indent=4, ensure_ascii=False)

In the above code:

  • The final_analyses variable stores the results returned by the MoE processing function.
  • The analyzed data is stored in the analyzed_news_data.json output file.

Et voilà! The whole script is finalized, and the data is analyzed and saved.

Step #7: Put It All Together and Run the Code

Below is what the moe_analysis.py file should now contain:

import json
from transformers import pipeline

# Define the input JSON file
JSON_FILE = "news-data.json"
# Specify the model for generating summaries
SUMMARIZATION_MODEL = "sshleifer/distilbart-cnn-6-6"
# Specify the model for analyzing sentiment
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

# Define a class representing an expert for news summarization
class NewsSummarizationLLMExpert:
    def __init__(self, model_name=SUMMARIZATION_MODEL):
        self.model_name = model_name
        self.summarizer = None

        # Initialize the summarization pipeline
        self.summarizer = pipeline(
            "summarization",
            model=self.model_name,
            tokenizer=self.model_name,
        )

    def analyze(self, article_content, article_headline=""):
        # Call the summarizer pipeline with the article content
        summary_outputs = self.summarizer(
            article_content,
            max_length=300,
            min_length=30,
            do_sample=False
        )
        # Extract the summary text from the pipeline's output
        summary = summary_outputs[0]["summary_text"]
        return { "summary": summary }


# Define a class representing an expert for sentiment analysis
class SentimentAnalysisLLMExpert:
    def __init__(self, model_name=SENTIMENT_MODEL):
        self.model_name = model_name
        self.sentiment_analyzer = None

        # Initialize the sentiment analysis pipeline
        self.sentiment_analyzer = pipeline(
            "sentiment-analysis",
            model=self.model_name,
            tokenizer=self.model_name,
        )


    def analyze(self, article_content, article_headline=""):
        # Define the max number of characters to analyze
        max_chars_for_sentiment = 2000
        # Truncate the content if it exceeds the maximum limit
        truncated_content = article_content[:max_chars_for_sentiment]
        # Call the sentiment analyzer pipeline
        sentiment_outputs = self.sentiment_analyzer(truncated_content)
        # Extract the sentiment label
        label = sentiment_outputs[0]["label"]
        # Extract the sentiment score
        score = sentiment_outputs[0]["score"]
        return { "sentiment_label": label, "sentiment_score": score }


# Define a gating network
def route_to_experts(item_data, experts_registry):
    chosen_experts = []
    # Select the summarizer and sentiment analyzer
    chosen_experts.append(experts_registry["summarizer"])
    chosen_experts.append(experts_registry["sentiment_analyzer"])
    return chosen_experts


# Main function to manage the orchestration process
def process_news_json_with_moe(json_filepath):
    # Open and load news items from the JSON file
    with open(json_filepath, "r", encoding="utf-8") as f:
        news_items = json.load(f)

    # Create a dictionary to hold instances of expert classes
    experts_registry = {
        "summarizer": NewsSummarizationLLMExpert(),
        "sentiment_analyzer": SentimentAnalysisLLMExpert()
    }

    # List to store the analysis results
    all_results = []

    # Iterate through each news item in the loaded data
    for i, news_item in enumerate(news_items):
        print(f"\n--- Processing Article {i+1}/{len(news_items)} ---")
        # Extract relevant data from the news item
        news_id = news_item.get("id")
        headline = news_item.get("headline")
        content = news_item.get("content")
        url = news_item.get("url")

        # Print progress
        print(f"ID: {news_id}, Headline: {headline[:70]}...")

        # Use the gating network to determine the experts to use
        active_experts = route_to_experts(news_item, experts_registry)

        # Prepare a dictionary to store the analysis results
        news_item_analysis_results = {
            "id": news_id,
            "headline": headline,
            "url": url,
            "analyses": {}
        }

        # Iterate through the experts and apply their analysis
        for expert_instance in active_experts:
            expert_name = expert_instance.__class__.__name__ # Get the class name of the expert
            try:
                # Call the expert's analyze method
                analysis_result = expert_instance.analyze(article_content=content, article_headline=headline)
                # Store the result under the expert's name
                news_item_analysis_results["analyses"][expert_name] = analysis_result

            except Exception as e:
                # Handle any errors during analysis by a specific expert
                print(f"Error during analysis with {expert_name}: {e}")
                news_item_analysis_results["analyses"][expert_name] = { "error": str(e) }

        # Add the current item's results to the overall list
        all_results.append(news_item_analysis_results)

    return all_results

# Call the main processing function with the input JSON file
final_analyses = process_news_json_with_moe(JSON_FILE)

print("\n\n--- MoE Analysis Complete ---")

# Write the final analysis results to a new JSON file
with open("analyzed_news_data.json", "w", encoding="utf-8") as f_out:
    json.dump(final_analyses, f_out, indent=4, ensure_ascii=False)

Great! In around 130 lines of code, you have just completed your first MoE project.

Run the code with the following command:

python moe_analysis.py

The output in the terminal should contain:

# Omitted for brevity...

--- Processing Article 6/10 ---
ID: cdrgdm4ye53o, Headline: Japanese Grand Prix: Lewis Hamilton says he has 'absolute 100% faith' ...

--- Processing Article 7/10 ---
ID: czed4jk7eeeo, Headline: F1 engines: A return to V10 or hybrid - what's the future?...
Error during analysis with NewsSummarizationLLMExpert: index out of range in self

--- Processing Article 8/10 ---
ID: cy700xne614o, Headline: Monte Carlo Masters: Novak Djokovic beaten as wait for 100th title con...
Error during analysis with NewsSummarizationLLMExpert: index out of range in self

# Omitted for brevity...

--- MoE Analysis Complete ---

When the execution completes, an analyzed_news_data.json output file will appear in the project folder. Open it, and focus on one of the news items. The analyses field will contain the summary and sentiment analysis results produced by the two experts:

The result of the MoE approach in the JSON file

As you can see, the MoE approach has:

  • Summarized the content of the article and reported it under summary.
  • Detected a positive sentiment with a 0.99 confidence score.

Mission complete!

Conclusion

In this article, you learned what MoE is and how to implement it in a real-world scenario through a step-by-step tutorial.

If you want to explore more MoE scenarios and you need some fresh data to do so, Bright Data offers a suite of powerful tools and services designed to retrieve updated, real-time data from web pages while overcoming scraping obstacles.

These solutions include:

  • Web Unlocker: An API that bypasses anti-scraping protections and delivers clean HTML from any webpage with minimal effort.
  • Scraping Browser: A cloud-based, controllable browser with JavaScript rendering. It automatically handles CAPTCHAs, browser fingerprinting, retries, and more for you.
  • Web Scraper APIs: Endpoints for programmatic access to structured web data from dozens of popular domains.

For other machine learning scenarios, also explore our AI hub.

Sign up for Bright Data now and start your free trial to test our scraping solutions!

No credit card required