Mixture of Experts

Mixture of Experts (MoE) is a neural-network architecture in which a learned gating mechanism (the router) sends each input to a small subset of specialized subnetworks (the "experts"). This greatly increases total parameter count while keeping per-token compute roughly constant. The pattern dates to the early 1990s (Jacobs, Jordan, Nowlan, and Hinton, 1991). Shazeer et al. (Google, 2017) revived it at scale with the Sparsely-Gated MoE paper, which applied it to recurrent (LSTM) language models; later work such as GShard and Switch Transformer carried it into Transformers. Mixtral 8x7B (Mistral, December 2023) then brought MoE into the mainstream and helped make it a standard efficiency choice for open-weight models.

The modern MoE recipe replaces each Transformer feedforward layer with N experts (commonly 8, 16, or 64) plus a router that, per token, selects the top-k experts (often k=2). Each expert is a full feedforward network, and only the selected experts compute. A layer with 8 experts and top-2 routing therefore stores 8x the feedforward parameters of a single expert but spends only about 2x one expert's compute per token. Attention and embedding weights are shared rather than replicated per expert, so whole-model totals are lower than the expert count alone suggests. Total vs active parameters: Mixtral 8x7B has about 47B total parameters (not 56B) with ~13B active per token; DeepSeek-V3 has 671B total with 37B active; Llama 4 Maverick has 400B total with 17B active.

The trade-offs are real. Per token, an MoE model is much cheaper to serve than a dense model of the same total size, but it is harder to train (load balancing across experts, expert collapse, communication overhead in distributed training) and needs more memory at inference than a dense model with the same active parameter count, because all experts must be loaded even though only a few compute for any given token.

Prominent production MoE LLMs as of 2025 include Mixtral 8x7B and 8x22B, DBRX (Databricks), DeepSeek-V2/V3, Qwen2-MoE, and Llama 4 (Meta). GPT-4 is widely believed to be MoE based on the 2023 leaks, though OpenAI has not confirmed this. AI governance teams document MoE architecture in model cards because routing can complicate reproducibility. With fixed weights the router itself is deterministic, but in batched serving, expert capacity limits and floating-point differences can change which experts process a token, so the same input may produce different outputs.
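The top-k routing and the total-versus-active parameter arithmetic described above can be illustrated with a minimal sketch. It is not taken from any particular model: the names `make_expert`, `moe_forward`, and `ffn_param_counts`, the tiny layer sizes, and the random weights are all hypothetical, and renormalizing the gate weights with a softmax over only the selected experts is one common choice rather than the only one. Load balancing, expert capacity limits, and batching are left out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_expert(rng, d_model, d_hidden):
    """Hypothetical expert: a tiny two-layer ReLU feedforward network."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def expert(x):
        hidden = [max(0.0, h) for h in matvec(w_in, x)]
        return matvec(w_out, hidden)

    return expert


def moe_forward(x, router_weights, experts, k=2):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = matvec(router_weights, x)  # one router score per expert
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # weights over the chosen experts only
    out = [0.0] * len(x)
    for gate, idx in zip(gates, top):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top, gates


def ffn_param_counts(n_experts, k, params_per_expert):
    """Feedforward parameters stored vs. used per token (router and attention ignored)."""
    return n_experts * params_per_expert, k * params_per_expert


# Assumed toy sizes: 8 experts, top-2 routing, 4-dimensional tokens.
rng = random.Random(0)
d_model, d_hidden, n_experts = 4, 8, 8
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output, chosen, gates = moe_forward(token, router_weights, experts, k=2)
total, active = ffn_param_counts(n_experts, 2, 2 * d_model * d_hidden)

print("experts used:", chosen)
print(total, active)  # 512 128
```

In this toy layer the feedforward weights stored are 4x the weights touched per token (8 experts stored, 2 used). Calling `moe_forward` twice on the same token with the same weights selects the same experts, which matches the point that reproducibility problems come from the serving setup rather than from the router in isolation.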

Specialized routing on a 25-year-old audience-routing platform: Centralpoint has routed enterprise content to specialized audiences via audience tags, taxonomies, and entitlements for 25 years. MoE-style routing applies the same conceptual pattern at the model layer rather than the content layer. With Centralpoint, MoE models deploy on-premise, token usage is metered per skill, and MoE-served chatbots deploy through one line of JavaScript.

Related Keywords:
Mixture of Experts, Oxcyon, AI, AI Governance, Generative AI, Inference, Inferencing, RAG, Prompts, Skills Manager
