Module 03: Fast Inference and Scaling Laws

Module Overview

This module focuses on what makes Transformer inference efficient and how models are sized for a given compute budget. It covers memory-side optimisations (KV cache, Flash Attention), the evolution of attention variants designed to shrink the KV cache, the RoPE positional scheme, and the scaling laws that guide model and data sizing.

Learning Objectives

Explain the naive autoregressive decoding problem and the role of the KV cache.
Compute the memory footprint of a KV cache and relate it to context length.
Describe how Flash Attention reduces memory traffic and how PyTorch's scaled dot-product attention (SDPA) API unifies attention backends.
Contrast MHA, MQA, GQA, and MLA in terms of quality and KV-cache savings.
Summarise the Chinchilla scaling laws and their implications for compute-optimal training.

Topics Covered

KV Cache & Memory Optimization

The naive decoding problem
KV cache fundamentals
KV cache memory mathematics
Flash Attention
PyTorch SDPA: the unified attention API
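
As a rough illustration of the KV cache memory mathematics listed above, the sketch below counts the bytes needed to cache keys and values for every layer, head, and token. The function name and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from this module.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing one key tensor and one
    value tensor per layer. bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
size_4k = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
size_8k = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)

print(f"4k context: {size_4k / 1024**3:.1f} GiB")
print(f"8k context: {size_8k / 1024**3:.1f} GiB")
```

Under these assumptions the cache grows linearly with context length and batch size, which is why long contexts and large batches tend to be bounded by memory rather than compute during decoding.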

Attention Mechanism Variants

Multi-Head Attention (MHA)
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)
Multi-Head Latent Attention (MLA)
PagedAttention and the vLLM serving model
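
To make the KV-cache savings of these variants concrete, here is a minimal sketch that counts cached elements per token. It assumes MHA, GQA, and MQA differ only in the number of key/value heads, and it models MLA simply as one compressed latent vector per layer. The head counts and the `latent_dim` value are hypothetical, and the sketch ignores the small positional key component that some MLA designs also cache.

```python
def kv_elems_per_token(n_layers, n_kv_heads, head_dim):
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim


def latent_elems_per_token(n_layers, latent_dim):
    # Simplified MLA view: a single shared latent vector per layer.
    return n_layers * latent_dim


N_LAYERS, HEAD_DIM = 32, 128  # illustrative values, not from any specific model

variants = {
    "MHA (32 KV heads)": kv_elems_per_token(N_LAYERS, 32, HEAD_DIM),
    "GQA (8 KV heads)": kv_elems_per_token(N_LAYERS, 8, HEAD_DIM),
    "MQA (1 KV head)": kv_elems_per_token(N_LAYERS, 1, HEAD_DIM),
    "MLA (latent_dim=512)": latent_elems_per_token(N_LAYERS, 512),
}

baseline = variants["MHA (32 KV heads)"]
for name, elems in variants.items():
    print(f"{name:22s} {elems:8d} elems/token  ({baseline // elems}x smaller than MHA)")
```

The query head count does not appear in the calculation: only keys and values are cached, so sharing KV heads across query heads shrinks the cache without changing the number of queries.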

Positional Encoding Schemes

Rotary Position Embedding (RoPE) and context-length behaviour
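
The following is a minimal pure-Python sketch of the rotation behind RoPE, assuming the adjacent-pair convention (dimensions 2i and 2i+1 are rotated together) and the commonly used base of 10000. Some implementations pair dimensions differently, so treat the layout here as one possible choice. The example vectors are arbitrary.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle that depends on pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```

The two scores agree up to floating-point error because the attention score between a rotated query and a rotated key depends only on the difference between their positions, which is the property that ties RoPE to context-length behaviour.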

Scaling Laws for LLMs

Scaling laws for neural language models
Training compute-optimal large language models
The Chinchilla scaling laws and their practical guidance
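
As a worked example of compute-optimal sizing, the sketch below combines two widely quoted approximations: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Both constants are assumptions brought in for illustration and are not stated in this module. The fitted values in the original papers vary with the estimation method.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into parameters N and tokens D.

    Assumes C = 6 * N * D and D = tokens_per_param * N, which gives
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget chosen to match 6 * 70e9 params * 1.4e12 tokens.
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

Under these assumptions, both parameters and tokens scale with the square root of the compute budget, so quadrupling compute roughly doubles each.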

Key Concepts & Terminology

Prefill vs decode, time-to-first-token, tokens-per-second, KV cache eviction, latent attention compression, compute-optimal token-to-parameter ratio.

Tools & Frameworks Referenced

vLLM (PagedAttention), PyTorch SDPA / FlashAttention, RoPE-based open models.

Prerequisites

Modules 01 (Transformers and Tokenization) and 02 (Hands-On Fine-Tuning of Transformers).
