Module 08: Mixture of Experts

Module Overview

This module explains how Mixture-of-Experts architectures decouple model capacity from per-token compute. It covers the routing mechanism, the load-balancing problem that destabilises naive MoE training, the practical trade-offs at train and inference time, and decision criteria for MoE versus dense models.
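To make the routing mechanism concrete, here is a minimal sketch of top-k gating in plain Python. It is an illustration under stated assumptions, not the implementation used by any particular model: the names `top_k_route` and `moe_forward` are hypothetical, the experts are toy scalar functions, and renormalising the gate weights over the selected experts is one common convention among several.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    # Pick the k experts with the highest router probability and
    # renormalise their weights so they sum to 1 (assumed convention).
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, router_logits, experts, k):
    # Only the k selected experts are evaluated; the rest cost no compute.
    out = 0.0
    for idx, weight in top_k_route(router_logits, k):
        out += weight * experts[idx](x)
    return out

# Hypothetical toy experts: expert i multiplies its input by a fixed scale.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token

routes = top_k_route(logits, k=2)
y = moe_forward(1.0, logits, experts, k=2)
print(routes)
print(y)
```

With four experts and k=2, the layer holds the parameters of all four experts but runs only two per token, which is the sense in which capacity is decoupled from per-token compute.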

Learning Objectives

Explain the dense-model scaling problem MoE addresses.
Describe top-k gating and sparse expert activation.
Identify expert collapse and the role of load-balancing loss.
Compare MoE training and inference trade-offs against dense models.
Decide when MoE is appropriate for a production deployment.

Topics Covered

Mixture of Experts

The dense model scaling problem
The MoE idea
Architecture deep dive (gating network and experts)
Load balancing
Training MoE models
Inference with MoE
MoE variants (sparse MoE, soft MoE)
Production MoE models
MoE vs dense: when to use each

Key Concepts & Terminology

Sparse activation, router/gating network, top-k routing, active vs total parameters, auxiliary load-balancing loss, expert parallelism, capacity factor.
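The auxiliary load-balancing loss can be sketched as follows. This uses one common formulation, in the style of the Switch Transformer: the number of experts times the sum over experts of (fraction of tokens dispatched to the expert) times (mean router probability for the expert). The function name, the top-1 assignments, and the sample data are assumptions made for illustration; real training code would also scale this term by a small coefficient before adding it to the main loss.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    # router_probs: one probability row per token, one column per expert.
    # assignments: the expert index each token was dispatched to (top-1 here).
    num_tokens = len(assignments)

    # f[i]: fraction of tokens dispatched to expert i.
    f = [0.0] * num_experts
    for e in assignments:
        f[e] += 1.0 / num_tokens

    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]

    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Made-up example: 4 tokens, 4 experts.
balanced_probs = [[0.25] * 4 for _ in range(4)]
balanced = load_balance_loss(balanced_probs, [0, 1, 2, 3], 4)

# Expert collapse: the router sends every token to expert 0.
collapsed_probs = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
collapsed = load_balance_loss(collapsed_probs, [0, 0, 0, 0], 4)

print(balanced, collapsed)
```

Under this formulation a perfectly uniform router scores 1, and the score rises as traffic concentrates on fewer experts, so minimising it pushes the router away from collapse.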

Tools & Frameworks Referenced

Production sparse MoE model families (conceptual); expert-parallel training and MoE-aware serving.

Prerequisites

Modules 01-03 (architecture and inference) and Module 06 (fine-tuning).
