Module 09: Reasoning Models and Chain-of-Thought

Reasoning models — what sets them apart from standard LLMs, chain-of-thought training, RL-only reasoning (GRPO, DeepSeek-R1-Zero), and distillation.

May 28, 2026

Topics You Will Master

What defines a reasoning model versus a standard LLM
Chain-of-thought as the foundation of reasoning
The reasoning-model training recipe
RL-only reasoning without supervised fine-tuning (GRPO, R1-Zero)

Module Overview

This module covers the reasoning-model paradigm that emerged with long chain-of-thought training. It explains how reasoning differs from standard generation, the role of chain-of-thought, and the reinforcement-learning recipe — including RL-only training that skips supervised fine-tuning entirely — before covering distillation of reasoning into smaller models.

Learning Objectives

  • Define a reasoning model and contrast it with a standard instruction-tuned LLM.
  • Explain why chain-of-thought is the foundation of reasoning behaviour.
  • Describe the training recipe for reasoning models.
  • Explain how Group Relative Policy Optimization (GRPO) enables RL-only training without a learned critic, and how R1-Zero-style training skips SFT entirely.
  • Describe how reasoning capability is distilled into smaller models.

Topics Covered

Reasoning Models

  • What is a reasoning model?
  • Chain-of-thought: the foundation
  • The training recipe
  • RL-only training and skipping SFT entirely (R1-Zero approach)
  • Group Relative Policy Optimization (GRPO) — group-based reward normalisation, no separate critic model
  • Emergent behaviours: extended multi-step traces, self-verification, and self-correction
  • Distillation: transferring reasoning without running RL
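
As a rough illustration of the group-based reward normalisation listed above, the sketch below scores several completions sampled for the same prompt and standardises each reward against the group's own mean and standard deviation. The group statistics act as the baseline, so no learned critic is needed. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for this example, not details taken from the module.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of completions for a single prompt.

    Hypothetical helper for illustration: the group mean serves as the
    baseline, replacing the value estimate a separate critic would provide.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two judged correct (1.0), two incorrect (0.0)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.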

Key Concepts & Terminology

Chain-of-thought, inference-time scaling, reward signal, GRPO group normalisation, critic-free RL, emergent self-reflection, reasoning distillation.
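
To make "reward signal" concrete, here is a minimal sketch of a rule-based reward of the kind used in RL-only reasoning setups: one component checks that the completion wraps its reasoning in think tags, another checks the final answer against a reference. The tag pattern, the function name, and the 0.5 / 1.0 weights are hypothetical choices for this example, and exact string matching stands in for whatever answer checker a real pipeline would use.

```python
import re

# Assumed output format: reasoning inside <think>...</think>, then the final answer
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score a completion with a format check plus an exact-match accuracy check.

    Hypothetical helper: the weights below are illustrative, not taken from
    any specific training recipe.
    """
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # no reasoning block or no answer: no reward at all
    final_answer = match.group(2).strip()
    format_reward = 0.5  # assumed weight for following the format
    accuracy_reward = 1.0 if final_answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward


well_formed_correct = rule_based_reward("<think>2 + 2 is 4</think> 4", "4")
well_formed_wrong = rule_based_reward("<think>2 + 2 is 5</think> 5", "4")
missing_reasoning = rule_based_reward("4", "4")
```

Because the score comes from fixed rules rather than a learned reward model, it can be computed cheaply for every sampled completion, which suits tasks such as maths or code where answers can be checked automatically.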

Tools & Frameworks Referenced

RL-for-reasoning training stacks (GRPO-based); reasoning-distillation datasets and recipes.

Prerequisites

Module 06 (preference alignment / RLHF) and Module 05 (data generation).
