Module 03: Fast Inference and Scaling Laws

Inference efficiency for Transformers — KV caching, Flash Attention, MQA/GQA/MLA/PagedAttention, RoPE encoding, and Chinchilla scaling laws.

May 28, 2026

Topics You Will Master

  • The naive decoding bottleneck and how KV caching resolves it
  • KV cache memory mathematics and its impact on serving cost
  • Flash Attention and the unified attention API (PyTorch SDPA)
  • Modern attention variants: MHA, MQA, GQA, MLA, and PagedAttention

Module Overview

This module focuses on what makes Transformer inference efficient and how models are sized for a given compute budget. It covers memory-side optimisations (KV cache, Flash Attention), the evolution of attention variants designed to shrink the KV cache, the RoPE positional scheme, and the scaling laws that guide model and data sizing.

Learning Objectives

  • Explain the naive autoregressive decoding problem and the role of the KV cache.
  • Compute the memory footprint of a KV cache and relate it to context length.
  • Describe how Flash Attention reduces memory traffic and why SDPA unifies attention backends.
  • Contrast MHA, MQA, GQA, and MLA in terms of quality and KV-cache savings.
  • Summarise the Chinchilla scaling laws and their implications for compute-optimal training.
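
As a rough illustration of the memory-footprint objective, the cache size can be estimated from a handful of model dimensions. The sketch below assumes the usual accounting of one key vector and one value vector per token, per layer, per KV head. The helper name `kv_cache_bytes` and the 32-layer configuration are illustrative assumptions, not figures for any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 counts keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative configuration (an assumption): 32 layers, 32 query heads,
# head_dim 128. Only the number of KV heads changes between variants.
GIB = 1024 ** 3
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(n_layers=32, n_kv_heads=kv_heads,
                          head_dim=128, seq_len=4096)
    print(f"{name}: {size / GIB:.4f} GiB at 4096 tokens")
```

Under these assumptions the footprint grows linearly with context length and batch size, and sharing KV heads (GQA, MQA) shrinks it in proportion to the number of KV heads kept.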

Topics Covered

KV Cache & Memory Optimization

  • The naive decoding problem
  • KV cache fundamentals
  • KV cache memory mathematics
  • Flash Attention
  • PyTorch SDPA — the unified attention API
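
To make the naive decoding problem concrete, here is a minimal pure-Python sketch of a single attention head decoding over a fixed list of token vectors. It is a toy under stated assumptions: one layer, one head, tiny hand-written weights, and inputs that do not depend on earlier outputs. All function names are hypothetical. The naive loop re-projects keys and values for every earlier token at each step, while the cached loop projects each token once and appends to a cache.

```python
import math


def project(x, w):
    """Compute x @ w, where w is a d x d matrix given as a list of rows."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(len(q))]


def decode_naive(xs, wq, wk, wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(xs)):
        keys = [project(x, wk) for x in xs[: t + 1]]
        values = [project(x, wv) for x in xs[: t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(xs[t], wq), keys, values))
    return outputs, projections


def decode_cached(xs, wq, wk, wv):
    """Project each token's K and V once and keep them in a cache."""
    outputs, projections = [], 0
    k_cache, v_cache = [], []
    for x in xs:
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        projections += 2
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, projections


# Toy inputs (assumptions): four 2-dimensional token vectors.
xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.3]]
wq = [[1.0, 0.2], [0.0, 1.0]]
wk = [[0.5, -0.1], [0.3, 0.8]]
wv = [[1.0, 0.0], [0.4, 0.6]]

naive_out, naive_cost = decode_naive(xs, wq, wk, wv)
cached_out, cached_cost = decode_cached(xs, wq, wk, wv)
print("K/V projections, naive vs cached:", naive_cost, cached_cost)
```

Both loops produce the same attention outputs. The difference is that the naive projection count grows quadratically with sequence length while the cached count grows linearly, at the price of holding the cache in memory.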

Attention Mechanism Variants

  • Multi-Head Attention (MHA)
  • Multi-Query Attention (MQA)
  • Grouped-Query Attention (GQA)
  • Multi-Head Latent Attention (MLA)
  • PagedAttention and the vLLM serving model
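
The core idea behind PagedAttention, storing each sequence's KV cache in fixed-size blocks that need not be contiguous and tracking them through a per-sequence block table, can be sketched with a toy allocator. The class and method names below are hypothetical and do not mirror vLLM's actual API. The sketch only tracks which block and offset each token would occupy, not the tensors themselves.

```python
class PagedKVAllocator:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's API)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence has no block yet).
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=2)
slots_a = [alloc.append_token("a") for _ in range(3)]
slot_b = alloc.append_token("b")
print("sequence a slots:", slots_a)
print("sequence b slot:", slot_b)
alloc.free("a")
print("free blocks after releasing a:", alloc.free_blocks)
```

Because blocks are allocated on demand, a sequence wastes at most one partially filled block, and blocks released by a finished request can be reused immediately by another.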

Positional Encoding Schemes

  • Rotary Position Embedding (RoPE) and context-length behaviour
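
A short sketch can show why RoPE encodes relative position: each pair of dimensions is rotated by an angle proportional to the token's position, so the dot product between a rotated query and a rotated key depends only on the distance between them. The version below pairs adjacent dimensions and uses a base of 10000, which is one common convention; many implementations pair dimensions differently, so treat the layout as an assumption.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by position * base**(-2i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (assumptions).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
print("scores match:", abs(score_near - score_far) < 1e-9)
```

The rotation also leaves vector norms unchanged, which is why it can be applied to queries and keys without rescaling them.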

Scaling Laws for LLMs

  • Scaling laws for neural language models
  • Training compute-optimal large language models
  • The Chinchilla scaling laws and their practical guidance
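
The practical guidance is often reduced to two approximations: training compute C is roughly 6·N·D FLOPs for N parameters and D tokens, and the compute-optimal ratio is roughly 20 tokens per parameter. Both are rules of thumb rather than exact laws, and the ratio is exposed as a parameter below for that reason. The helper name is hypothetical, and the example budget is chosen to land near Chinchilla's reported 70B parameters and 1.4T tokens.

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N, so
    N = sqrt(C / (6 * tokens_per_param)).
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


n_params, n_tokens = compute_optimal(5.88e23)
print(f"params ~ {n_params:.2e}, tokens ~ {n_tokens:.2e}")
```

Under these assumptions both parameters and tokens scale with the square root of compute, so a 100x larger budget suggests a model about 10x larger trained on about 10x more data.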

Key Concepts & Terminology

Prefill vs decode, time-to-first-token, tokens-per-second, KV cache eviction, latent attention compression, compute-optimal token-to-parameter ratio.

Tools & Frameworks Referenced

vLLM (PagedAttention), PyTorch SDPA / FlashAttention, RoPE-based open models.

Prerequisites

Modules 01–02.
