design pattern 2025-01-10 13 min read

Mixture of Experts: How DeepSeek and Mistral Scale LLMs Efficiently

A technical deep dive into Mixture of Experts (MoE) architecture — how sparse activation, expert routing, and load balancing enable trillion-parameter models that cost less to serve than dense alternatives.


Introduction

Mixture of Experts (MoE) has gone from an obscure technique to one of the most important architectures in large-scale ML. Mixtral 8x7B outperformed Llama 2 70B while using roughly the compute of a 13B dense model. DeepSeek's MoE architecture underlies models with hundreds of billions of parameters that serve at a fraction of the cost of equivalently capable dense models.

This post explains how MoE works, why it matters for production systems, and the engineering challenges it introduces.

Dense vs. Sparse Models

Dense Models

In a standard transformer, every parameter participates in every forward pass:

  • 70B parameters × every token: all 70B weights participate in each token's forward pass
  • FLOP count per token scales linearly with parameter count

Sparse MoE Models

In MoE, the model has many "expert" FFN layers, but only a subset are activated per token:

Input token → Router → [Expert 1] (selected)
                       [Expert 2] (not selected)
                       [Expert 3] (selected)
                       ...
                       [Expert N] (not selected)

If you have 8 experts but activate 2 per token, you get roughly 8x the FFN parameters at only 2x the FFN compute of a single-expert model (attention layers are shared across all experts). This is the fundamental MoE bargain: more parameters, more capacity, with compute per token held nearly constant.

Router Architecture

The router is a small learned linear layer that maps the token representation to expert selection probabilities:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model, n_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)                     # [batch, seq, n_experts]
        scores = F.softmax(logits, dim=-1)        # routing probabilities
        top_k_scores, top_k_indices = torch.topk(scores, self.top_k, dim=-1)
        return top_k_scores, top_k_indices

The output of selected experts is weighted by their router scores and summed:

output = sum(router_score[i] × expert[i](x) for i in top_k_experts)
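
Putting the router and the weighted sum together, a full MoE layer might look like the sketch below. This is an illustrative implementation, not any particular model's code: the tiny two-layer FFN experts, the renormalization of top-k scores, and the gather/scatter loop are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative sketch of a top-k MoE layer (not a production implementation)."""

    def __init__(self, d_model, n_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k
        # Tiny two-layer FFNs stand in for the real experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [n_tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        top_scores, top_idx = torch.topk(scores, self.top_k, dim=-1)
        # Renormalize so each token's selected expert weights sum to 1
        top_scores = top_scores / top_scores.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        # Batch each expert's assigned tokens, then scatter the weighted outputs back
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() > 0:
                weights = top_scores[token_ids, slot].unsqueeze(-1)
                out[token_ids] += weights * expert(x[token_ids])
        return out
```

Note the design choice: rather than looping over tokens, the loop runs over experts so each expert processes all of its assigned tokens as one batch — this is what makes the compute savings real on a GPU.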

Load Balancing: The Critical Challenge

Without explicit balancing, routers collapse to using 1-2 experts for most tokens (expert collapse). This wastes capacity and hurts training stability.

Auxiliary Load Balancing Loss

Add a loss term that encourages uniform expert utilization:

def load_balancing_loss(router_probs, expert_indices, n_experts):
    # router_probs: [n_tokens, n_experts]; expert_indices: [n_tokens] (top-1 case)
    # f_i: fraction of tokens routed to each expert
    one_hot = F.one_hot(expert_indices, n_experts).float()
    fraction_routed = one_hot.mean(dim=0)         # [n_experts]
    # P_i: mean routing probability assigned to each expert
    mean_probs = router_probs.mean(dim=0)         # [n_experts]
    # Penalty is minimized when both distributions are uniform
    return n_experts * (fraction_routed * mean_probs).sum()

This encourages both uniform token distribution across experts and uniform probability mass.

Expert Capacity

Each expert processes at most C tokens per batch (capacity factor C). Tokens beyond capacity are dropped or passed through as residuals:

capacity = tokens_per_batch × capacity_factor / n_experts

  • Setting C too low: tokens get dropped → quality loss
  • Setting C too high: GPU memory wasted → inefficiency

Mixtral and DeepSeek use capacity factors of 1.0-1.25 in practice.
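
The capacity formula above can be sketched as a small helper (rounding up is an assumption here; real systems may also round to hardware-friendly multiples):

```python
import math

def expert_capacity(tokens_per_batch: int, capacity_factor: float, n_experts: int) -> int:
    """capacity = tokens_per_batch * capacity_factor / n_experts, rounded up."""
    return math.ceil(tokens_per_batch * capacity_factor / n_experts)

# 4096 tokens per batch across 8 experts:
print(expert_capacity(4096, 1.0, 8))    # 512 tokens per expert
print(expert_capacity(4096, 1.25, 8))   # 640 tokens per expert
```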

DeepSeek's MoE Innovations

DeepSeek-V2 and V3 introduced several improvements to standard MoE:

1. Fine-Grained Expert Segmentation

Instead of 8 large experts, use 64 small experts with 4 selected. Finer granularity gives:

  • Many more possible expert combinations (C(64,4) = 635,376 vs. C(8,2) = 28)
  • Better specialization without increasing compute
  • Improved parameter efficiency

2. Shared Expert Isolation

Designate a subset of experts as "shared" — always activated for every token — alongside the routed experts:

Every token → [Shared Expert 1] (always)
            → [Shared Expert 2] (always)
            → Router → [Routed Expert A] (selected)
                       [Routed Expert B] (selected)

Shared experts handle common knowledge; routed experts handle specialization. This improved quality on general benchmarks.
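
As a sketch, the forward pass above can be written as: shared experts run on every token, and routed experts contribute only where selected. The function below is a hypothetical illustration of that structure (not DeepSeek's actual code); it assumes `top_scores` and `top_idx` come from a top-k router.

```python
import torch

def shared_plus_routed(x, shared_experts, routed_experts, top_scores, top_idx):
    """x: [n_tokens, d]; top_scores/top_idx: [n_tokens, top_k] from a top-k router,
    so each token selects any given routed expert at most once."""
    # Shared experts process every token unconditionally
    out = sum(expert(x) for expert in shared_experts)
    # Routed experts process only the tokens that selected them
    for e, expert in enumerate(routed_experts):
        token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
        if token_ids.numel() > 0:
            weights = top_scores[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weights * expert(x[token_ids])
    return out
```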

3. Device-Level Routing Constraints

In distributed deployments, expert routing without constraints causes heavy all-to-all communication (tokens routed to experts on different GPUs). DeepSeek constrains routing so each token is mostly routed to experts on the same device — massively reducing communication overhead.

Serving MoE Models

Memory Challenges

MoE models have many more parameters than their compute-equivalent dense counterparts:

  • Mixtral 8x7B: ~47B parameters (sparse), compute ~= 13B dense
  • DeepSeek-V3: ~671B parameters, compute ~= 37B dense

This means:

  • Higher GPU memory requirements for weights
  • Larger model parallelism needs
  • More complex KV cache sharing in distributed serving

Expert Parallelism

The standard distribution strategy for MoE serving:

GPU 0: [Expert 0, Expert 1]
GPU 1: [Expert 2, Expert 3]
GPU 2: [Expert 4, Expert 5]
GPU 3: [Expert 6, Expert 7]

Each GPU hosts a subset of experts. Routing requires all-to-all communication to send tokens to the right GPU. This is the main communication bottleneck.
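
The placement in the diagram above is a simple contiguous assignment, which can be sketched as below. This is an illustrative helper only — real serving systems also account for load balance and device memory when placing experts.

```python
def expert_placement(n_experts: int, n_gpus: int) -> dict[int, list[int]]:
    """Assign experts to GPUs contiguously (assumes n_experts divisible by n_gpus)."""
    per_gpu = n_experts // n_gpus
    return {
        gpu: list(range(gpu * per_gpu, (gpu + 1) * per_gpu))
        for gpu in range(n_gpus)
    }

print(expert_placement(8, 4))  # {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```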

Batching for Expert Efficiency

MoE models realize their efficiency gains only when tokens from many concurrent requests flow through each expert in large batches. Small batch sizes → poor GPU utilization.

n_experts = 8, top_k = 2
Effective batch per expert = (total_batch × top_k) / n_experts

For 100 requests × 512 tokens = 51,200 tokens:

  • Each of 8 experts processes ~12,800 tokens per forward pass
  • Efficient batch size for GPU utilization

For a single request, each expert only sees ~128 tokens — poor utilization. This is why MoE serving latency is often higher than dense model serving for low-QPS workloads.
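
The arithmetic above can be captured in a one-line helper (assuming perfectly balanced routing, which real routers only approximate):

```python
def tokens_per_expert(total_tokens: int, top_k: int, n_experts: int) -> float:
    """Expected tokens each expert sees per forward pass, assuming balanced routing."""
    return total_tokens * top_k / n_experts

# 100 requests x 512 tokens, 8 experts, top-2 routing:
print(tokens_per_expert(100 * 512, 2, 8))   # 12800.0 -- good utilization
# A single 512-token request:
print(tokens_per_expert(512, 2, 8))         # 128.0 -- poor utilization
```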

When to Use MoE

Good fits:

  • High-throughput serving with many concurrent users
  • Tasks where specialization matters (coding, multilingual, domain-specific)
  • When you want more total model capacity without increasing serving cost per token

Poor fits:

  • Low-QPS applications where single-request latency matters most
  • Memory-constrained environments (MoE needs lots of GPU memory)
  • Applications where distributing the full set of weights is expensive (e.g., on-device or edge deployment)

Practical Considerations for ML Teams

Model selection

If you need to serve a high-quality model at scale, a MoE model (like Mixtral or DeepSeek-V3) at its actual compute level may be more cost-effective than a dense model of equivalent capability.

Quantization with MoE

Quantizing MoE models is more complex than dense models:

  • Shared experts often need higher precision
  • Expert weight quantization can be done per-expert (better quality)
  • Router should generally stay in FP16/BF16

Monitoring Expert Utilization

In production, monitor the distribution of tokens across experts:

  • Imbalanced utilization → some GPUs hot, others idle
  • Expert collapse → quality degradation
  • Routing anomalies on specific inputs → potential attack surface

Conclusion

Mixture of Experts is now a fundamental tool for building large, capable models that can be served efficiently at scale. The DeepSeek and Mistral teams have shown that MoE can rival or beat dense models that require far more compute per token, while engineering innovations in routing, load balancing, and distributed serving have made these models practical to operate.

For teams building ML systems at scale, understanding MoE is increasingly essential — both for selecting the right models and for understanding the infrastructure decisions that affect performance and cost.


Learn more about serving large models efficiently in our LLM Inference at Scale course.

Want to Go Deeper?

This article is part of our comprehensive curriculum on building ML systems at scale. Explore our full courses for hands-on learning.