In this guide on Mixture of Experts, you will learn:
- What MoE is and how it differs from traditional models
- Benefits of using it
- A step-by-step tutorial on how to implement it
Let’s dive in!
What is MoE?
An MoE (Mixture of Experts) is a machine learning architecture that combines multiple specialized sub-models—the “experts”—within a larger system. Each expert learns to handle different aspects of a task or distinct types of data.
A fundamental component in this architecture is the “gating network” or “router”. This component decides which expert, or combination of experts, should process a specific input. The gating network also assigns weights to each expert’s output. Weights are like scores, as they show how much influence each expert’s result should have.
In simple terms, the gating network uses weights to adjust each expert’s contribution to the final answer. To do so, it considers the input’s specific features. This allows the system to handle many types of data better than a single model could.
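To make the idea concrete, here is a minimal sketch of an MoE layer in PyTorch, where a gating network scores the experts and only the top-k of them process each input. The class name, dimensions, and loop-based routing are illustrative assumptions, not part of any specific library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """A tiny Mixture of Experts layer: the gate picks the top-k experts per input."""

    def __init__(self, input_dim, hidden_dim, num_experts, top_k=2):
        super().__init__()
        # Each expert is a small independent feed-forward network
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim),
            )
            for _ in range(num_experts)
        ])
        # The gating network scores every expert for a given input
        self.gate = nn.Linear(input_dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        weights = F.softmax(self.gate(x), dim=-1)          # [batch, num_experts]
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep only the best experts
        output = torch.zeros_like(x)
        # Weighted sum of the selected experts' outputs (plain loop kept for clarity)
        for b in range(x.size(0)):
            for k in range(self.top_k):
                expert = self.experts[int(top_idx[b, k])]
                output[b] += top_w[b, k] * expert(x[b])
        return output

# Quick usage check: 3 inputs of dimension 16, routed among 4 experts
layer = ToyMoELayer(input_dim=16, hidden_dim=32, num_experts=4, top_k=2)
print(layer(torch.randn(3, 16)).shape)  # torch.Size([3, 16])
```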
Differences Between MoE and Traditional Dense Models
In the context of neural networks, a traditional dense model works in a different way compared to MoE. For any piece of information you feed into it, the dense model uses all of its internal parameters to perform calculations. Thus, every part of its computational machinery is engaged for every input.
The main point is that, in dense models, all parts are engaged for every task. This contrasts with MoE, which activates only relevant expert subsections.
Below are the key differences between MoE and dense models:
- Parameter usage:
  - Dense model: For any given input, the model uses all the parameters in the computation.
  - MoE model: For any given input, the model uses only the parameters of the selected expert(s) and the gating network. Thus, if an MoE model has a large number of parameters, it activates only a fraction of them for any single computation.
- Computational cost:
  - Dense model: The amount of computation for a dense layer is fixed for every input, as all its parts are always engaged.
  - MoE model: The computational cost of processing an input through an MoE layer can be lower than for a dense layer of comparable total parameter size. That is because only a subset of the model—the chosen experts—performs the work. This allows MoE models to scale to a much larger number of total parameters without a proportional increase in the computational cost of each individual input.
- Specialization and learning:
  - Dense model: All parts of a dense layer learn to contribute to processing every type of input the model encounters.
  - MoE model: Different expert networks can learn to specialize. For example, one expert might become good at processing questions about history, while another specializes in scientific concepts. The gating network learns to identify the type of input and route it to the most appropriate experts. This can lead to more nuanced and effective processing.
Benefits of the Mixture of Experts Architecture
The MoE architecture is highly relevant in modern AI, particularly when dealing with LLMs. The reason is that it offers a way to increase a model’s capacity, which is its ability to learn and store information, without a proportional increase in computational cost during use.
The main advantages of MoE in AI include:
- Reduced inference latency: MoE models can decrease the time required to generate a prediction or output—called inference latency. This happens thanks to their ability to activate only the most relevant experts for each input.
- Enhanced training scalability and efficiency: You can take advantage of the parallelism in MoE architectures during the AI training process. Different experts can be trained concurrently on diverse data subsets or specialized tasks. This can lead to faster convergence and shorter training times.
- Improved model modularity and maintainability: The discrete nature of expert subnetworks facilitates a modular approach to model development and maintenance. Individual experts can be independently updated, retrained, or replaced with improved versions without requiring a complete retraining of the entire model. This simplifies the integration of new knowledge or capabilities and allows for more targeted interventions if a specific expert’s performance degrades.
- Potential for increased interpretability: The specialization of experts may offer clearer insights into the model’s decision-making processes. Analyzing which experts are consistently activated for specific inputs can provide clues about how the model has learned to partition the problem space and attribute relevance. This characteristic offers a potential way for better understanding complex model behaviors compared to monolithic dense networks.
- Greater energy efficiency at scale: MoE-based models can achieve lower energy consumption per query compared to traditional dense models. That is due to the sparse activation of parameters during inference, as they use only a fraction of the available parameters per input.
How To Implement MoE: A Step-by-Step Guide
In this tutorial section, you will learn how to use MoE. In particular, you will use a dataset containing sports news. The MoE will leverage two experts based on the following models:
- `sshleifer/distilbart-cnn-6-6`: To summarize the content of each news article.
- `distilbert-base-uncased-finetuned-sst-2-english`: To calculate the sentiment of each news article. In sentiment analysis, “sentiment” refers to the emotional tone, opinion, or attitude expressed in a text. The output can be:
  - Positive: Expresses favorable opinions, happiness, or satisfaction.
  - Negative: Expresses unfavorable opinions, sadness, anger, or dissatisfaction.
  - Neutral: Expresses no strong emotion or opinion, often factual.
At the end of the process, each news item will be saved in a JSON file containing:
- The ID, the headline, and the URL.
- The summary of the content.
- The sentiment of the content with the confidence score.
The dataset containing the news can be retrieved using Bright Data’s Web Scraper APIs, specialized scraping endpoints to retrieve structured web data from 100+ domains in real-time.
The dataset containing the input JSON data can be generated using the code in our guide “Understanding Vector Databases: The Engine Behind Modern AI.” Specifically, refer to step 1 in the “Practical Integration: A Step-by-Step Guide” chapter.
The input JSON dataset—called `news-data.json`—contains an array of news items as below:
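For reference, each entry looks roughly like the snippet below. The exact field names depend on your scraper output; `id`, `url`, `headline`, and `content` are assumptions used throughout the rest of this tutorial:

```json
[
  {
    "id": "...",
    "url": "https://www.example.com/sports/article-1",
    "headline": "Example headline of a sports news article",
    "content": "Full text of the article retrieved by the Web Scraper API..."
  }
]
```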
Follow the instructions below and build your MoE example!
Prerequisites and Dependencies
To replicate this tutorial, you must have Python 3.10.1 or higher installed on your machine.
Suppose you call the main folder of your project `moe_project/`. At the end of this step, the folder will have the following structure:
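```
moe_project/
├── venv/
├── news-data.json
└── moe_analysis.py
```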
Where:
- `venv/` contains the Python virtual environment.
- `news-data.json` is the input JSON file containing the news data you scraped with Web Scraper API.
- `moe_analysis.py` is the Python file that contains the coding logic.
You can create the `venv/` virtual environment directory like so:
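```bash
python -m venv venv
```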
To activate it, on Windows, run:
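```bash
venv\Scripts\activate
```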
Equivalently, on macOS and Linux, execute:
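```bash
source venv/bin/activate
```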
In the activated virtual environment, install the dependencies with:
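```bash
pip install transformers torch
```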
These libraries are:
- `transformers`: Hugging Face's library for state-of-the-art machine learning models.
- `torch`: PyTorch, an open-source machine learning framework.
Step #1: Setup and Configuration
Initialize the `moe_analysis.py` file by importing the required libraries and setting up some constants:
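Below is a minimal sketch of what this setup could look like. The constant names (`JSON_FILE`, `SUMMARIZATION_MODEL`, `SENTIMENT_MODEL`) are illustrative assumptions; the model identifiers and the input file name come from the previous sections:

```python
import json
from transformers import pipeline

# Input file with the scraped news (see the previous section)
JSON_FILE = "news-data.json"

# Hugging Face models used by the two experts
SUMMARIZATION_MODEL = "sshleifer/distilbart-cnn-6-6"
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
```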
This code defines:
- The name of the input JSON file with the scraped news.
- The models to use for the experts.
Perfect! You have what it takes to get started with MoE in Python.
Step #2: Define the News Summarization Expert
This step involves creating a class that encapsulates the functionality of the expert for summarizing the news:
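A possible implementation is sketched below. The class name and the `max_length`/`min_length` values are assumptions made for illustration and may need tuning for your data:

```python
class NewsSummarizationExpert:
    """Expert #1: summarizes the content of a news article."""

    def __init__(self):
        # Load the Hugging Face summarization pipeline with the chosen model
        self.summarizer = pipeline("summarization", model=SUMMARIZATION_MODEL)

    def analyze(self, article_content):
        # Generate an abstractive summary of the article text
        # (max_length and min_length are illustrative values)
        result = self.summarizer(
            article_content,
            max_length=150,
            min_length=30,
            do_sample=False,
        )
        return {"summary": result[0]["summary_text"]}
```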
The above code:
- Initializes the summarization pipeline with the `pipeline()` method from Hugging Face.
- Defines how the summarization expert has to process an article with the `analyze()` method.
Good! You just created the first expert in the MoE architecture that takes care of summarizing the news.
Step #3: Define the Sentiment Analysis Expert
Similar to the summarization expert, define a specialized class for performing sentiment analysis on the news:
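A sketch of the sentiment expert follows. The class name is an assumption, and `max_chars_for_sentiment` is an assumed character limit used to keep the input within the model's 512-token window:

```python
class SentimentAnalysisExpert:
    """Expert #2: detects the sentiment of a news article."""

    def __init__(self):
        # Load the Hugging Face sentiment analysis pipeline with the chosen model
        self.classifier = pipeline("sentiment-analysis", model=SENTIMENT_MODEL)

    def analyze(self, article_content, max_chars_for_sentiment=512):
        # Truncate the text so it stays within the model's input limits
        # (the character limit is an illustrative assumption)
        result = self.classifier(article_content[:max_chars_for_sentiment])
        return {
            "sentiment": result[0]["label"],        # e.g. "POSITIVE" or "NEGATIVE"
            "confidence": round(result[0]["score"], 4),
        }
```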
This snippet:
- Initializes the sentiment analysis pipeline with the `pipeline()` method.
- Defines the `analyze()` method to perform sentiment analysis. It also returns the sentiment label—negative or positive—and the confidence score.
Very well! You now have another expert that calculates and expresses the sentiment of the text in the news.
Step #4: Implement the Gating Network
Now, you have to define the logic behind the gating network to route the experts:
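The snippet below sketches one way to write this router. The class and method names are assumptions; the logic simply maps each task to the expert in charge of it:

```python
class GatingNetwork:
    """Simple router: sends each task to the expert that handles it."""

    def __init__(self, summarization_expert, sentiment_expert):
        # Register the available experts by task name
        self.experts = {
            "summarization": summarization_expert,
            "sentiment": sentiment_expert,
        }

    def route(self, task, article_content):
        # Pick the expert registered for the requested task
        expert = self.experts.get(task)
        if expert is None:
            raise ValueError(f"No expert available for task: {task}")
        return expert.analyze(article_content)
```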
In this implementation, the gating network is simple. It always uses both experts for every news item, but it does so sequentially:
- It summarizes the text.
- It calculates the sentiment.
Note: The gating network is quite simple in this example. At the same time, if you wanted to achieve the same goal with a single, larger model, it would require significantly more computation. In contrast, the two experts are leveraged only for the tasks that are relevant to them. This makes it a simple yet effective application of the Mixture of Experts architecture.
In other scenarios, this part of the process could be improved by training an ML model to learn how and when to activate a specific expert. This would allow the gating network to respond dynamically.
Fantastic! The gating network logic is set up and ready to operate.
Step #5: Main Orchestration Logic for Processing News Data
Define the core function that manages the entire workflow, which consists of the following tasks:
- Load the JSON dataset.
- Initialize the two experts.
- Iterate through the news items.
- Route them to the chosen experts.
- Collect the results.
You can do it with the following code:
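One way to write this orchestration function is sketched below. The input field names (`id`, `url`, `headline`, `content`) are assumptions and must match your scraped dataset:

```python
def process_news_data(json_file):
    # 1. Load the JSON dataset
    with open(json_file, "r", encoding="utf-8") as f:
        news_items = json.load(f)

    # 2. Initialize the two experts and the gating network
    summarization_expert = NewsSummarizationExpert()
    sentiment_expert = SentimentAnalysisExpert()
    gating_network = GatingNetwork(summarization_expert, sentiment_expert)

    analyzed_news = []
    # 3. Iterate through the news items
    for item in news_items:
        content = item.get("content", "")
        analyses = {}
        try:
            # 4. Route the content to the chosen experts, one task at a time
            analyses.update(gating_network.route("summarization", content))
            analyses.update(gating_network.route("sentiment", content))
        except Exception as e:
            # Errors are mostly caused by inputs that do not fit the
            # max_length / max_chars_for_sentiment constraints
            analyses["error"] = str(e)

        # 5. Collect the results
        analyzed_news.append({
            "id": item.get("id"),
            "headline": item.get("headline"),
            "url": item.get("url"),
            "analyses": analyses,
        })

    return analyzed_news
```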
In this snippet:
- The `for` loop iterates over all the loaded news items.
- The `try-except` block performs the analysis and manages errors that can occur. In this case, errors are mainly due to the `max_length` and `max_chars_for_sentiment` parameters defined in the previous functions. Since not all the retrieved content has the same length, error management is fundamental for handling exceptions effectively.
Here we go! You defined the orchestration function of the whole process.
Step #6: Launch the Processing Function
As a final part of the script, you have to execute the main processing function and then save the analyses to an output JSON file as follows:
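A minimal version of this final step could look like the following. The output file name comes from the next steps, while the variable names are assumptions:

```python
if __name__ == "__main__":
    # Run the whole MoE pipeline on the scraped news
    final_analyses = process_news_data(JSON_FILE)

    # Save the analyzed news items to the output JSON file
    with open("analyzed_news_data.json", "w", encoding="utf-8") as f:
        json.dump(final_analyses, f, ensure_ascii=False, indent=4)

    print(f"Analysis complete. {len(final_analyses)} news items saved to analyzed_news_data.json")
```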
In the above code:
- The `final_analyses` variable stores the result of the function that processes the data with MoE.
- The analyzed data is stored in the `analyzed_news_data.json` output file.
Et voilà! The whole script is finalized, and the data is analyzed and saved.
Step #7: Put It All Together and Run the Code
Below is what the `moe_analysis.py` file should now contain:
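Putting the previous snippets together, an illustrative end-to-end version of the script could look like the one below. Remember that the class names, constants, and input field names are assumptions made throughout this guide, so adapt them to your dataset:

```python
import json
from transformers import pipeline

JSON_FILE = "news-data.json"
SUMMARIZATION_MODEL = "sshleifer/distilbart-cnn-6-6"
SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


class NewsSummarizationExpert:
    """Expert #1: summarizes the content of a news article."""

    def __init__(self):
        self.summarizer = pipeline("summarization", model=SUMMARIZATION_MODEL)

    def analyze(self, article_content):
        result = self.summarizer(article_content, max_length=150, min_length=30, do_sample=False)
        return {"summary": result[0]["summary_text"]}


class SentimentAnalysisExpert:
    """Expert #2: detects the sentiment of a news article."""

    def __init__(self):
        self.classifier = pipeline("sentiment-analysis", model=SENTIMENT_MODEL)

    def analyze(self, article_content, max_chars_for_sentiment=512):
        result = self.classifier(article_content[:max_chars_for_sentiment])
        return {"sentiment": result[0]["label"], "confidence": round(result[0]["score"], 4)}


class GatingNetwork:
    """Simple router: sends each task to the expert that handles it."""

    def __init__(self, summarization_expert, sentiment_expert):
        self.experts = {"summarization": summarization_expert, "sentiment": sentiment_expert}

    def route(self, task, article_content):
        expert = self.experts.get(task)
        if expert is None:
            raise ValueError(f"No expert available for task: {task}")
        return expert.analyze(article_content)


def process_news_data(json_file):
    # Load the JSON dataset
    with open(json_file, "r", encoding="utf-8") as f:
        news_items = json.load(f)

    # Initialize the experts and the gating network
    summarization_expert = NewsSummarizationExpert()
    sentiment_expert = SentimentAnalysisExpert()
    gating_network = GatingNetwork(summarization_expert, sentiment_expert)

    analyzed_news = []
    for item in news_items:
        content = item.get("content", "")
        analyses = {}
        try:
            analyses.update(gating_network.route("summarization", content))
            analyses.update(gating_network.route("sentiment", content))
        except Exception as e:
            analyses["error"] = str(e)

        analyzed_news.append({
            "id": item.get("id"),
            "headline": item.get("headline"),
            "url": item.get("url"),
            "analyses": analyses,
        })

    return analyzed_news


if __name__ == "__main__":
    final_analyses = process_news_data(JSON_FILE)

    with open("analyzed_news_data.json", "w", encoding="utf-8") as f:
        json.dump(final_analyses, f, ensure_ascii=False, indent=4)

    print(f"Analysis complete. {len(final_analyses)} news items saved to analyzed_news_data.json")
```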
Great! In around 130 lines of code, you have just completed your first MoE project.
Run the code with the following command:
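```bash
python moe_analysis.py
```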
The output in the terminal should contain:
When the execution completes, an `analyzed_news_data.json` output file will appear in the project folder. Open it, and focus on one of the news items. The `analyses` field will contain the summary and sentiment analysis results produced by the two experts:
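For example, an analyzed item should have a structure similar to the one below. The `id`, `headline`, and `url` values are placeholders, and the exact summary text depends on the article:

```json
{
  "id": "...",
  "headline": "...",
  "url": "...",
  "analyses": {
    "summary": "A short, automatically generated summary of the article content...",
    "sentiment": "POSITIVE",
    "confidence": 0.99
  }
}
```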
As you can see, the MoE approach has:
- Summarized the content of the article and reported it under `summary`.
- Defined a positive sentiment with a 0.99 confidence score.
Mission complete!
Conclusion
In this article, you learned what MoE is and how to implement it in a real-world scenario through a step-by-step tutorial.
If you want to explore more MoE scenarios and you need some fresh data to do so, Bright Data offers a suite of powerful tools and services designed to retrieve updated, real-time data from web pages while overcoming scraping obstacles.
These solutions include:
- Web Unlocker: An API that bypasses anti-scraping protections and delivers clean HTML from any webpage with minimal effort.
- Scraping Browser: A cloud-based, controllable browser with JavaScript rendering. It automatically handles CAPTCHAs, browser fingerprinting, retries, and more for you.
- Web Scraper APIs: Endpoints for programmatic access to structured web data from dozens of popular domains.
For other machine learning scenarios, also explore our AI hub.
Sign up for Bright Data now and start your free trial to test our scraping solutions!
No credit card required