Module Overview
This module covers the reasoning-model paradigm that emerged with long chain-of-thought training. It explains how reasoning differs from standard generation, the role of chain-of-thought, and the reinforcement-learning recipe, including RL-only training that skips supervised fine-tuning (SFT) entirely. It closes with distillation of reasoning into smaller models.
Learning Objectives
- Define a reasoning model and contrast it with a standard instruction-tuned LLM.
- Explain why chain-of-thought is the foundation of reasoning behaviour.
- Describe the training recipe for reasoning models.
- Explain how Group Relative Policy Optimization (GRPO) enables RL-only training without a learned critic, and how R1-Zero-style training skips SFT entirely.
- Describe how reasoning capability is distilled into smaller models.
Topics Covered
Reasoning Models
- What is a reasoning model?
- Chain-of-thought: the foundation
- The training recipe
- RL-only training and skipping SFT entirely (R1-Zero approach)
- Group Relative Policy Optimization (GRPO) — group-based reward normalisation, no separate critic model
- Emergent behaviours: extended multi-step traces, self-verification, and self-correction
- Distillation: transferring reasoning without running RL
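To make the GRPO topic above concrete, here is a minimal sketch of group-based reward normalisation. It assumes several completions are sampled for the same prompt and each is scored with a scalar reward. Each completion's advantage is then its reward relative to the group mean, scaled by the group standard deviation, so no separate critic model is needed to estimate a baseline. The function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are illustrative assumptions, not taken from any particular training stack.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of completions for the same prompt.

    Hypothetical helper for illustration: advantage_i = (r_i - mean) / (std + eps).
    The group mean plays the role of the baseline a learned critic would supply.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # approximately [1, -1, -1, 1]
```

When every completion in a group receives the same reward, all advantages come out as zero, so that group contributes no learning signal under this scheme.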
Key Concepts & Terminology
Chain-of-thought, inference-time scaling, reward signal, GRPO group normalisation, critic-free RL, emergent self-reflection, reasoning distillation.
Tools & Frameworks Referenced
RL-for-reasoning training stacks (GRPO-based); reasoning-distillation datasets and recipes.
Prerequisites
Module 06 (preference alignment / RLHF) and Module 05 (data generation).