Module Overview
This module focuses on what makes Transformer inference efficient and how models are sized for a given compute budget. It covers memory-side optimisations (KV cache, Flash Attention), the evolution of attention variants designed to shrink the KV cache, the RoPE positional scheme, and the scaling laws that guide model and data sizing.
Learning Objectives
- Explain the naive autoregressive decoding problem and the role of the KV cache.
- Compute the memory footprint of a KV cache and relate it to context length.
- Describe how Flash Attention reduces memory traffic and how PyTorch's scaled dot-product attention (SDPA) API unifies attention backends behind a single call.
- Contrast MHA, MQA, GQA, and MLA in terms of quality and KV-cache savings.
- Summarise the Chinchilla scaling laws and their implications for compute-optimal training.
Topics Covered
KV Cache & Memory Optimization
- The naive decoding problem
- KV cache fundamentals
- KV cache memory mathematics
- Flash Attention
- PyTorch SDPA — the unified attention API
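As a preview of the KV cache memory mathematics topic, the sketch below estimates cache size from model shape and context length. The function name `kv_cache_bytes` and the example shape (32 layers, head dimension 128, 16-bit values) are assumptions chosen for illustration, not figures taken from this module. It counts one key and one value vector per layer, per KV head, per token.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Assumed example shape: 32 layers, head_dim 128, 4096-token context.
# MHA keeps one KV head per query head (32 here); GQA shares KV heads (8 here).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA cache: {mha / GIB:.2f} GiB")  # 2.00 GiB
print(f"GQA cache: {gqa / GIB:.2f} GiB")  # 0.50 GiB
```

The estimate is linear in `seq_len` and in `num_kv_heads`, which is why both longer contexts and the choice of attention variant show up directly in serving memory.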
Attention Mechanism Variants
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Grouped-Query Attention (GQA)
- Multi-Head Latent Attention (MLA)
- PagedAttention and the vLLM serving model
Positional Encoding Schemes
- Rotary Position Embedding (RoPE) and context-length behaviour
Scaling Laws for LLMs
- Scaling laws for neural language models
- Training compute-optimal large language models
- The Chinchilla scaling laws and their practical guidance
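To give a feel for the practical guidance, the sketch below splits a training compute budget between parameters and tokens. It relies on two commonly quoted approximations that are assumptions here rather than statements from this module: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Return (params, tokens) for a FLOP budget (hypothetical helper).

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budget: the compute implied by 70e9 parameters and 1.4e12 tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```

Under these assumptions both N and D grow with the square root of the budget, so a 4x increase in compute suggests roughly 2x the parameters and 2x the tokens.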
Key Concepts & Terminology
Prefill vs decode, time-to-first-token, tokens-per-second, KV cache eviction, latent attention compression, compute-optimal token-to-parameter ratio.
Tools & Frameworks Referenced
vLLM (PagedAttention), PyTorch SDPA / Flash Attention, RoPE-based open models.
Prerequisites
Modules 01–02.