Introduction
Mixture of Experts (MoE) has gone from an obscure technique to one of the most important architectures in large-scale ML. Mixtral 8x7B outperformed Llama 2 70B while using roughly the per-token compute of a 13B dense model. DeepSeek's MoE architecture underlies models with hundreds of billions of parameters that serve at a fraction of the cost of equivalently capable dense models.
This post explains how MoE works, why it matters for production systems, and the engineering challenges it introduces.
Dense vs. Sparse Models
Dense Models
In a standard transformer, every parameter participates in every forward pass:
- 70B parameters × every token = 70B operations per token
- FLOP count scales linearly with parameters
Sparse MoE Models
In MoE, the model has many "expert" FFN layers, but only a subset are activated per token:
Input token → Router → [Expert 1] (selected)
                       [Expert 2] (not selected)
                       [Expert 3] (selected)
                       ...
                       [Expert N] (not selected)
If you have 8 experts but activate 2 per token, you get 8x the parameters of a single-expert model at only 2x the compute. This is the fundamental MoE bargain: more parameters and more capacity without a proportional increase in compute per token.
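To make the bargain concrete, here is a toy parameter count for a hypothetical MoE FFN layer (the dimensions are illustrative, not those of any particular model):

```python
# Toy parameter count for a hypothetical MoE FFN layer.
# Dimensions are illustrative, not those of any real model.
d_model, d_ff = 4096, 14336
n_experts, top_k = 8, 2

ffn_params = 2 * d_model * d_ff         # up- and down-projection of one expert
total_params = n_experts * ffn_params   # what you store
active_params = top_k * ffn_params      # what each token actually touches

print(f"total:  {total_params / 1e9:.2f}B params")   # 0.94B
print(f"active: {active_params / 1e9:.2f}B params")  # 0.23B
```

Storage grows 8x over a single expert while per-token compute only doubles.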
Router Architecture
The router is a small learned linear layer that maps the token representation to expert selection probabilities:
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model, n_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)                 # [batch, seq, n_experts]
        scores = F.softmax(logits, dim=-1)    # routing probabilities
        top_k_scores, top_k_indices = torch.topk(scores, self.top_k, dim=-1)
        return top_k_scores, top_k_indices
The output of selected experts is weighted by their router scores and summed:
output = sum(router_score[i] × expert[i](x) for i in top_k_experts)
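Putting the router and the weighted sum together, a minimal MoE layer can be sketched as a naive per-expert loop. This is for clarity only: real implementations use batched dispatch kernels, and the layer sizes and GELU activation here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal top-k MoE FFN sketch: naive per-expert loop, illustrative sizes."""
    def __init__(self, d_model, d_ff, n_experts, top_k):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.top_k = top_k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: [tokens, d_model]
        scores = F.softmax(self.gate(x), dim=-1)
        top_scores, top_idx = torch.topk(scores, self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            rows, slots = torch.where(top_idx == i)  # tokens routed to expert i
            if rows.numel() == 0:
                continue
            weight = top_scores[rows, slots].unsqueeze(-1)
            out[rows] += weight * expert(x[rows])    # router-score-weighted sum
        return out
```

Each token's output is the sum of its top-k experts' outputs, weighted by the router scores, exactly as in the formula above.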
Load Balancing: The Critical Challenge
Without explicit balancing, routers collapse to using 1-2 experts for most tokens (expert collapse). This wastes capacity and hurts training stability.
Auxiliary Load Balancing Loss
Add a loss term that encourages uniform expert utilization:
def load_balancing_loss(router_probs, expert_indices, n_experts):
    # router_probs:   [num_tokens, n_experts] softmax outputs from the router
    # expert_indices: [num_tokens, top_k] ids of the selected experts
    # Fraction of tokens routed to each expert
    one_hot = F.one_hot(expert_indices, n_experts).float()  # [tokens, top_k, n_experts]
    fraction_routed = one_hot.sum(dim=1).mean(dim=0)        # [n_experts]
    # Mean routing probability for each expert
    mean_probs = router_probs.mean(dim=0)                   # [n_experts]
    # Penalty is minimized when routing is uniform across experts
    loss = n_experts * (fraction_routed * mean_probs).sum()
    return loss
This encourages both uniform token distribution across experts and uniform probability mass.
Expert Capacity
Each expert processes at most C tokens per batch (capacity factor C). Tokens beyond capacity are dropped or passed through as residuals:
capacity = tokens_per_batch × capacity_factor / n_experts
- Setting C too low: tokens get dropped → quality loss
- Setting C too high: GPU memory is wasted → inefficiency
Capacity factors in the range 1.0-1.25 are common in practice.
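A sketch of how capacity and overflow dropping might be computed (the helper names are hypothetical; production systems fuse this logic into the dispatch kernel):

```python
import torch

def expert_capacity(tokens_per_batch, capacity_factor, n_experts):
    # Maximum tokens any single expert may process in this batch
    return int(tokens_per_batch * capacity_factor / n_experts)

def drop_overflow(expert_ids, capacity):
    # Keep the first `capacity` tokens per expert; overflow tokens are
    # marked False (dropped, or passed through as residuals).
    keep = torch.ones_like(expert_ids, dtype=torch.bool)
    for e in expert_ids.unique():
        positions = torch.where(expert_ids == e)[0]
        keep[positions[capacity:]] = False
    return keep

print(expert_capacity(tokens_per_batch=4096, capacity_factor=1.25, n_experts=8))  # 640
```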
DeepSeek's MoE Innovations
DeepSeek-V2 and V3 introduced several improvements to standard MoE:
1. Fine-Grained Expert Segmentation
Instead of 8 large experts, use 64 small experts with 4 selected. Finer granularity gives:
- More combinations (C(64,4) >> C(8,2))
- Better specialization without increasing compute
- Improved parameter efficiency
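The combination count is easy to verify:

```python
from math import comb

coarse = comb(8, 2)   # ways to pick 2 of 8 large experts
fine = comb(64, 4)    # ways to pick 4 of 64 small experts
print(coarse, fine)   # 28 635376
```

With the same activated compute (if each small expert is 1/8 the size), the router gets over 20,000x more distinct expert combinations to choose from.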
2. Shared Expert Isolation
Designate a subset of experts as "shared" — always activated for every token — alongside the routed experts:
Every token → [Shared Expert 1] (always)
            → [Shared Expert 2] (always)
            → Router → [Routed Expert A] (selected)
                       [Routed Expert B] (selected)
Shared experts handle common knowledge; routed experts handle specialization. This improved quality on general benchmarks.
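A minimal sketch of shared-plus-routed experts. As with the earlier router example, the per-expert loop, sizes, and activation are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    """Sketch of shared + routed experts; sizes are illustrative."""
    def __init__(self, d_model, d_ff, n_shared, n_routed, top_k):
        super().__init__()
        make = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                     nn.Linear(d_ff, d_model))
        self.shared = nn.ModuleList(make() for _ in range(n_shared))
        self.routed = nn.ModuleList(make() for _ in range(n_routed))
        self.gate = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):  # x: [tokens, d_model]
        out = sum(e(x) for e in self.shared)   # shared experts see every token
        scores = F.softmax(self.gate(x), dim=-1)
        top_scores, top_idx = torch.topk(scores, self.top_k, dim=-1)
        for i, expert in enumerate(self.routed):
            rows, slots = torch.where(top_idx == i)
            if rows.numel() == 0:
                continue
            out[rows] += top_scores[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```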
3. Device-Level Routing Constraints
In distributed deployments, expert routing without constraints causes heavy all-to-all communication (tokens routed to experts on different GPUs). DeepSeek constrains routing so each token is mostly routed to experts on the same device — massively reducing communication overhead.
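One way to sketch this constraint: score each device by its best expert, keep only the top few devices per token, and run top-k over the surviving experts. The contiguous expert-to-device grouping here is an assumption for illustration.

```python
import torch

def device_limited_topk(scores, n_devices, max_devices, top_k):
    # scores: [tokens, n_experts], experts laid out contiguously by device
    tokens, n_experts = scores.shape
    per_dev = n_experts // n_devices
    grouped = scores.view(tokens, n_devices, per_dev)
    dev_score = grouped.max(dim=-1).values                   # best expert per device
    top_dev = dev_score.topk(max_devices, dim=-1).indices    # devices a token may use
    allowed = torch.zeros(tokens, n_devices, dtype=torch.bool)
    allowed.scatter_(1, top_dev, True)
    expert_mask = allowed.repeat_interleave(per_dev, dim=1)  # back to expert layout
    masked = scores.masked_fill(~expert_mask, float("-inf"))
    return masked.topk(top_k, dim=-1)
```

Tokens then touch at most `max_devices` GPUs, bounding the all-to-all traffic.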
Serving MoE Models
Memory Challenges
MoE models have many more parameters than their compute-equivalent dense counterparts:
- Mixtral 8x7B: ~47B parameters (sparse), compute ~= 13B dense
- DeepSeek-V3: ~671B parameters, compute ~= 37B dense
This means:
- Higher GPU memory requirements for weights
- Larger model parallelism needs
- More complex KV cache sharing in distributed serving
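A quick back-of-envelope for weight memory alone (bf16 at 2 bytes per parameter; activations and KV cache come on top):

```python
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    # bf16/fp16 weights: 2 bytes per parameter
    return n_params_billion * bytes_per_param

print(weight_memory_gb(47))   # 94 GB for a Mixtral-scale checkpoint
print(weight_memory_gb(671))  # 1342 GB for a DeepSeek-V3-scale checkpoint
```

Even though only ~13B / ~37B parameters are active per token, all of the weights must be resident somewhere in the serving cluster.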
Expert Parallelism
The standard distribution strategy for MoE serving:
GPU 0: [Expert 0, Expert 1]
GPU 1: [Expert 2, Expert 3]
GPU 2: [Expert 4, Expert 5]
GPU 3: [Expert 6, Expert 7]
Each GPU hosts a subset of experts. Routing requires all-to-all communication to send tokens to the right GPU. This is the main communication bottleneck.
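With the contiguous layout above, locating an expert's GPU is a simple integer division:

```python
def host_gpu(expert_id, experts_per_gpu=2):
    # Contiguous placement: expert e lives on GPU e // experts_per_gpu
    return expert_id // experts_per_gpu

print([host_gpu(e) for e in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```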
Batching for Expert Efficiency
MoE models achieve their efficiency gains best when tokens from many requests pass through each expert in large batches. Small batch sizes → poor GPU utilization.
n_experts = 8, top_k = 2
Effective batch per expert = (total_batch × top_k) / n_experts
For 100 requests × 512 tokens = 51,200 tokens:
- Each of 8 experts processes ~12,800 tokens per forward pass
- Efficient batch size for GPU utilization
For a single request, each expert only sees ~128 tokens — poor utilization. This is why MoE serving latency is often higher than dense model serving for low-QPS workloads.
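The arithmetic above as a helper, assuming perfectly balanced routing:

```python
def effective_batch_per_expert(total_tokens, top_k, n_experts):
    # Expected tokens per expert per forward pass, assuming balanced routing
    return total_tokens * top_k // n_experts

print(effective_batch_per_expert(100 * 512, 2, 8))  # 12800
print(effective_batch_per_expert(512, 2, 8))        # 128
```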
When to Use MoE
Good fits:
- High-throughput serving with many concurrent users
- Tasks where specialization matters (coding, multilingual, domain-specific)
- When you want more total model capacity without increasing serving cost per token
Poor fits:
- Low-QPS applications where single-request latency matters most
- Memory-constrained environments (MoE needs lots of GPU memory)
- Applications where model size distribution is expensive
Practical Considerations for ML Teams
Model selection
If you need to serve a high-quality model at scale, an MoE model (like Mixtral or DeepSeek-V3) may be more cost-effective than a dense model of equivalent capability, because you pay per-token compute for only the activated parameters.
Quantization with MoE
Quantizing MoE models is more complex than dense models:
- Shared experts often need higher precision
- Expert weight quantization can be done per-expert (better quality)
- Router should generally stay in FP16/BF16
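These guidelines might be captured as a per-component precision plan. The schema below is purely illustrative, not a real tool's config format:

```python
# Hypothetical per-component precision plan for an MoE checkpoint.
# Keys and values are illustrative; real quantization tools have their own schemas.
quant_plan = {
    "router": "bf16",          # keep routing logits in full precision
    "shared_experts": "int8",  # often need higher precision than routed experts
    "routed_experts": "int4",  # quantized per-expert to preserve quality
    "attention": "int8",
}
```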
Monitoring Expert Utilization
In production, monitor the distribution of tokens across experts:
- Imbalanced utilization → some GPUs hot, others idle
- Expert collapse → quality degradation
- Routing anomalies on specific inputs → potential attack surface
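A simple utilization metric to track; alerting thresholds are deployment-specific:

```python
import torch

def expert_utilization(expert_indices, n_experts):
    # Fraction of routed token-slots handled by each expert
    counts = torch.bincount(expert_indices.flatten(), minlength=n_experts).float()
    return counts / counts.sum()

idx = torch.tensor([[0, 1], [0, 2], [0, 1]])  # 3 tokens, top_k=2
util = expert_utilization(idx, n_experts=4)
print(util)  # expert 0 handles half the slots; expert 3 is idle
```

A near-uniform vector is healthy; a spike on one expert signals imbalance, and a near-zero entry signals collapse.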
Conclusion
Mixture of Experts is now a fundamental tool for building large, capable models that can be served efficiently at scale. The DeepSeek and Mistral teams have shown that MoE models can rival or beat dense models requiring far more compute per token, and the engineering innovations in routing, load balancing, and distributed serving have made these models practical to operate.
For teams building ML systems at scale, understanding MoE is increasingly essential — both for selecting the right models and for understanding the infrastructure decisions that affect performance and cost.
Learn more about serving large models efficiently in our LLM Inference at Scale course.