Mixture of Experts (MoE) is a neural network architecture that routes each input to a subset of specialized sub-networks, enabling massive model capacity with efficient per-token computation. Instead of activating all parameters for every input, MoE models use a gating network to select the most relevant experts for each token. This allows models such as Mixtral and, reportedly, GPT-4 to carry far more total parameters than are activated in any single forward pass, dramatically improving the performance-to-compute ratio.
MoE replaces dense feed-forward layers in transformers with multiple parallel expert networks and a learned router. For each input token, the router produces probability scores across all experts and selects the top-k (typically 2) for activation. Only the selected experts process the token, and their outputs are combined using the router weights.
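As a concrete illustration, here is a minimal sketch of that routing step in PyTorch. The layer sizes, module names, and the simple per-expert loop are illustrative assumptions, not an optimized production implementation.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy MoE layer: a learned router plus parallel feed-forward experts."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        # Router scores every expert, then we keep only the top-k per token.
        probs = self.router(x).softmax(dim=-1)                 # (num_tokens, num_experts)
        weights, indices = torch.topk(probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run on a given token, and each expert's output is scaled by its (renormalized) router weight before summation, which is what makes the whole layer differentiable end to end.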
The router is trained jointly with the experts using a combination of the main task loss and auxiliary load-balancing losses that prevent expert collapse — a failure mode where most tokens route to the same few experts while others remain unused.
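One widely used formulation of that auxiliary loss comes from the Switch Transformer line of work: it penalizes the dot product between the fraction of tokens dispatched to each expert and the mean router probability for that expert, which is minimized when both are uniform. The sketch below follows that idea; the function name and the coefficient value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_indices, num_experts, coeff=0.01):
    """Switch-style auxiliary loss (illustrative).

    router_logits:  (num_tokens, num_experts) raw router scores
    expert_indices: (num_tokens, top_k) experts selected for each token
    """
    probs = router_logits.softmax(dim=-1)

    # f_i: fraction of routing slots dispatched to expert i
    one_hot = F.one_hot(expert_indices, num_experts).float()
    tokens_per_expert = one_hot.sum(dim=(0, 1)) / expert_indices.numel()

    # P_i: mean router probability assigned to expert i
    prob_per_expert = probs.mean(dim=0)

    # Minimized when both f and P are uniform across experts
    return coeff * num_experts * torch.sum(tokens_per_expert * prob_per_expert)
```

In training, this term is simply added to the main task loss with a small coefficient so it discourages collapse without overriding the router's learned specialization.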
Expert networks are typically standard feed-forward layers, though they can be any differentiable function. The total parameter count of the layer equals num_experts times expert_size, while the active parameters per token equal top_k times expert_size. A model with 8 experts using top-2 routing therefore has roughly 8x the feed-forward parameters of a single dense layer, but only about 2x the per-token compute.
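Plugging illustrative numbers into that arithmetic (the expert_size value here is hypothetical):

```python
# Back-of-the-envelope accounting for the 8-expert, top-2 example above,
# ignoring router and attention parameters.
expert_size = 50_000_000            # parameters in one feed-forward expert (hypothetical)
num_experts, top_k = 8, 2

total_params = num_experts * expert_size   # stored: 8x a single dense layer
active_params = top_k * expert_size        # computed per token: 2x a single dense layer

print(f"total: {total_params:,}  active per token: {active_params:,}")
# total: 400,000,000  active per token: 100,000,000
```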
MoE enables training larger, more capable models within fixed compute budgets. Sparse activation means inference costs scale with active parameters rather than total parameters, making MoE models more economical to serve at a given quality level. This architecture is critical for reaching frontier model capabilities without proportional increases in hardware costs.
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.