Module Overview
This module explains how Mixture-of-Experts (MoE) architectures decouple model capacity from per-token compute. It covers the routing mechanism, the load-balancing problem that destabilises naive MoE training, the practical trade-offs at training and inference time, and decision criteria for choosing between MoE and dense models.
Learning Objectives
- Explain the dense-model scaling problem MoE addresses.
- Describe top-k gating and sparse expert activation.
- Identify expert collapse and explain the role of the auxiliary load-balancing loss.
- Compare MoE training and inference trade-offs against dense models.
- Decide when MoE is appropriate for a production deployment.
Topics Covered
Mixture of Experts
- The dense model scaling problem
- The MoE idea
- Architecture deep dive (gating network and experts)
- Load balancing
- Training MoE models
- Inference with MoE
- MoE variants (sparse MoE, soft MoE)
- Production MoE models
- MoE vs dense: when to use each
Key Concepts & Terminology
Sparse activation, router/gating network, top-k routing, active vs total parameters, auxiliary load-balancing loss, expert parallelism, capacity factor.
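Several of these terms can be made concrete with a few lines of arithmetic. The following is a minimal sketch in plain Python, not the implementation of any particular model. The function names, the choice to renormalise the top-k weights, and the capacity formula (one common convention for top-1 routing; conventions vary) are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, weight), ...]. Renormalising the weights so
    they sum to 1 is an assumption; some designs skip this step.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def parameter_counts(shared, per_expert, num_experts, k):
    """Total parameters stored vs. parameters active for a single token."""
    total = shared + num_experts * per_expert
    active = shared + k * per_expert
    return total, active


def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Maximum tokens one expert accepts per batch (assumed convention)."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)


# Illustrative, made-up numbers: 4 experts, top-2 routing for one token.
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
total, active = parameter_counts(shared=1_000, per_expert=500, num_experts=8, k=2)
capacity = expert_capacity(tokens_per_batch=1024, num_experts=8, capacity_factor=1.25)
```

Here only two of the four experts run for the token (sparse activation), and the gap between `total` and `active` is the sense in which MoE decouples capacity from per-token compute.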
Tools & Frameworks Referenced
Production sparse MoE model families (conceptual); expert-parallel training and MoE-aware serving.
Prerequisites
Modules 01–03 (architecture and inference) and Module 06 (fine-tuning).