Module 09: Reasoning Models and Chain-of-Thought

Module Overview

This module covers the reasoning-model paradigm that emerged with long chain-of-thought training. It explains how reasoning models differ from standard generation and what role chain-of-thought plays, then walks through the reinforcement-learning training recipe, including RL-only training that skips supervised fine-tuning entirely. It closes with distillation of reasoning into smaller models.

Learning Objectives

Define a reasoning model and contrast it with a standard instruction-tuned LLM.
Explain why chain-of-thought is the foundation of reasoning behaviour.
Describe the training recipe for reasoning models.
Explain how Group Relative Policy Optimization (GRPO) enables RL-only training without a learned critic, and how R1-Zero-style training skips SFT entirely.
Describe how reasoning capability is distilled into smaller models.

Topics Covered

Reasoning Models

What is a reasoning model?
Chain-of-thought: the foundation
The training recipe
RL-only training and skipping SFT entirely (R1-Zero approach)
Group Relative Policy Optimization (GRPO): group-based reward normalisation, no separate critic model
Emergent behaviours: extended multi-step traces, self-verification, and self-correction
Distillation: transferring reasoning without running RL
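
The group-based reward normalisation listed under GRPO above can be illustrated with a minimal sketch. It assumes several completions are sampled for the same prompt and each receives a scalar reward; each completion's advantage is then its reward standardised against the group's mean and standard deviation, so no learned critic is needed to supply a baseline. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are illustrative assumptions, not taken from any particular training stack.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of completions for a single prompt.

    Hypothetical helper: the group mean acts as the baseline that a learned
    critic would otherwise provide. `eps` (an assumed constant) avoids a
    division by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored 1.0 if correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every completion scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

In this sketch, completions that score above their group's mean get a positive advantage and those below get a negative one; a group with identical rewards yields zero advantages throughout.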

Key Concepts & Terminology

Chain-of-thought, inference-time scaling, reward signal, GRPO group normalisation, critic-free RL, emergent self-reflection, reasoning distillation.

Tools & Frameworks Referenced

RL-for-reasoning training stacks (GRPO-based); reasoning-distillation datasets and recipes.

Prerequisites

Module 06 (preference alignment / RLHF) and Module 05 (data generation).

