Mixture of Experts (MoE) is a neural network architecture that routes each input to a subset of specialized sub-networks, enabling massive model capacity with efficient per-token computation. Instead of activating all parameters for every input, MoE models use a gating network to select the most relevant experts for each token. This allows models such as Mixtral and, reportedly, GPT-4 to carry far more total parameters than are activated in any single forward pass, dramatically improving the performance-to-compute ratio.
MoE replaces dense feed-forward layers in transformers with multiple parallel expert networks and a learned router. For each input token, the router produces probability scores across all experts and selects the top-k (typically 2) for activation. Only the selected experts process the token, and their outputs are combined using the router weights.
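As a concrete illustration, here is a minimal sketch of that routing step in PyTorch. The layer sizes, module names, and the simple per-expert loop are illustrative assumptions, not an optimized production implementation.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE layer: a learned router plus parallel feed-forward experts."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        # Router scores every expert, then we keep only the top-k per token.
        probs = self.router(x).softmax(dim=-1)                 # (num_tokens, num_experts)
        weights, indices = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run on a given token, and each expert's output is scaled by its (renormalized) router weight before summation, which is what makes the whole layer differentiable end to end.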
The router is trained jointly with the experts using a combination of the main task loss and auxiliary load-balancing losses that prevent expert collapse — a failure mode where most tokens route to the same few experts while others remain unused.
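One widely used formulation of that auxiliary loss comes from the Switch Transformer line of work: it penalizes the dot product between the fraction of tokens dispatched to each expert and the mean router probability for that expert, which is minimized when both are uniform. The sketch below follows that idea; the function name and the coefficient value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts, coeff=0.01):
    """Switch-style auxiliary loss (illustrative).

    router_logits:  (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens, top_k) experts selected for each token
    """
    probs = router_logits.softmax(dim=-1)

    # f_i: fraction of routing slots dispatched to expert i
    one_hot = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = one_hot.sum(dim=(0, 1)) / expert_indices.numel()

    # P_i: mean router probability assigned to expert i
    prob_per_expert = probs.mean(dim=0)

    # Minimized when both f and P are uniform across experts
    return coeff * num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In training, this term is simply added to the main task loss with a small coefficient so it discourages collapse without overriding the router's learned specialization.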
Expert networks are typically standard feed-forward layers, though they can be any differentiable function. The total parameter count of the layer equals num_experts times expert_size, while the active parameters per token equal top_k times expert_size. A model with 8 experts using top-2 routing therefore has roughly 8x the feed-forward parameters of a single dense layer, but only about 2x the per-token compute.
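Plugging illustrative numbers into that arithmetic (the expert_size value here is hypothetical):

```python
# Back-of-the-envelope accounting for the 8-expert, top-2 example above,
# ignoring router and attention parameters.
expert_size = 50_000_000            # parameters in one feed-forward expert (hypothetical)
num_experts, top_k = 8, 2

total_params = num_experts * expert_size   # stored: 8x a single dense layer
active_params = top_k * expert_size        # computed per token: 2x a single dense layer

print(f"total: {total_params:,}  active per token: {active_params:,}")
# total: 400,000,000  active per token: 100,000,000
```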
MoE enables training larger, more capable models within fixed compute budgets. Sparse activation means inference costs scale with active parameters rather than total parameters, making MoE models more economical to serve at a given quality level. This architecture is critical for reaching frontier model capabilities without proportional increases in hardware costs.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.