#Reasoning Models #Chain-of-Thought #GRPO #DeepSeek-R1 #Reinforcement Learning #Syllabus

Module 09: Reasoning Models and Chain-of-Thought

Syllabus on reasoning models — what distinguishes them from standard LLMs, chain-of-thought training, RL-only reasoning (GRPO, DeepSeek-R1-Zero), and distilling reasoning into smaller models.

May 28, 2026 at 12:16 PM

Topics You Will Master

  • What defines a reasoning model versus a standard LLM
  • Chain-of-thought as the foundation of reasoning
  • The reasoning-model training recipe
  • RL-only reasoning without supervised fine-tuning (GRPO, R1-Zero)
  • Distilling reasoning capability into smaller, efficient models

Best For

Engineers who want to understand the reasoning-model paradigm and how RL-driven reasoning emerges.

Expected Outcome

A working understanding of how reasoning behaviour is incentivised through reinforcement learning and transferred via distillation.

Module Overview

This module covers the reasoning-model paradigm that emerged with long chain-of-thought training. It explains how reasoning differs from standard generation, the role of chain-of-thought, and the reinforcement-learning recipe — including RL-only training that skips supervised fine-tuning entirely — before covering distillation of reasoning into smaller models.

Learning Objectives

  • Define a reasoning model and contrast it with a standard instruction-tuned LLM.
  • Explain why chain-of-thought is the foundation of reasoning behaviour.
  • Describe the training recipe for reasoning models, including where supervised fine-tuning and reinforcement learning fit.
  • Explain how Group Relative Policy Optimization (GRPO) enables RL-only training without a learned critic, and how R1-Zero-style training skips SFT entirely.
  • Describe how reasoning capability is distilled into smaller models.
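
The critic-free idea behind GRPO can be illustrated with a minimal sketch. It assumes several completions are sampled for the same prompt and each receives a scalar reward; the advantage of each completion is then its reward normalised against the group's own mean and standard deviation, so no learned value model is needed. The function name, the epsilon term, and the use of the population standard deviation are illustrative assumptions, not a reference implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of completions for a single prompt.

    Assumption: population standard deviation plus a small epsilon for
    numerical stability; real training stacks may differ in both choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Each completion is scored relative to its siblings, not to a critic.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.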

Topics Covered

Reasoning Models

  • What is a reasoning model?
  • Chain-of-thought: the foundation
  • The training recipe
  • RL-only training and skipping SFT entirely (R1-Zero approach)
  • Group Relative Policy Optimization (GRPO) — group-based reward normalisation, no separate critic model
  • Emergent behaviours: extended multi-step traces, self-verification, and self-correction
  • Distillation: transferring reasoning without running RL
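
The reward signal in R1-Zero-style training is typically rule-based rather than learned. The sketch below assumes a completion format with `<think>` and `<answer>` tags and combines a format check with an exact-match accuracy check. The tag names, the exact-match comparison, and the `format_weight` value are assumptions chosen for illustration; real setups use task-specific verifiers such as math answer checkers or unit tests.

```python
import re

# Assumed output template: reasoning inside <think>, final answer inside <answer>.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completion):
    """1.0 if the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def accuracy_reward(completion, reference):
    """1.0 if the extracted answer exactly matches the reference, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion, reference, format_weight=0.5):
    """Weighted sum of both rules; format_weight is a hypothetical setting."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(
        completion
    )


good = "<think>2 + 2 = 4</think><answer>4</answer>"
wrong = "<think>a guess</think><answer>5</answer>"
unformatted = "4"
scores = [total_reward(c, "4") for c in (good, wrong, unformatted)]
```

Because the reward depends only on verifiable rules, it is cheap to compute for every sampled completion and leaves less room for the policy to exploit a learned reward model.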

Key Concepts & Terminology

Chain-of-thought, inference-time scaling, reward signal, GRPO group normalisation, critic-free RL, emergent self-reflection, reasoning distillation.

Tools & Frameworks Referenced

RL-for-reasoning training stacks (GRPO-based); reasoning-distillation datasets and recipes.
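
A reasoning-distillation dataset can be sketched as a filtering step over teacher outputs: keep only traces whose final answer matches a known reference, then format them as supervised fine-tuning pairs for the smaller model. The record fields (`prompt`, `reasoning`, `answer`, `reference`) and the `<think>` wrapper are hypothetical names for this example, and the exact-match filter stands in for whatever verifier a real recipe uses.

```python
def build_distillation_set(teacher_samples):
    """Keep correct teacher traces and format them as SFT examples.

    Each sample is a dict with hypothetical keys:
    prompt, reasoning, answer, reference.
    """
    dataset = []
    for sample in teacher_samples:
        # Discard traces that end in a wrong answer.
        if sample["answer"].strip() != sample["reference"].strip():
            continue
        # The student learns to emit the full trace, then the answer.
        completion = f"<think>{sample['reasoning']}</think>\n{sample['answer']}"
        dataset.append({"prompt": sample["prompt"], "completion": completion})
    return dataset


# Hypothetical teacher outputs for two prompts.
teacher_samples = [
    {"prompt": "What is 3 * 4?", "reasoning": "3 * 4 = 12", "answer": "12", "reference": "12"},
    {"prompt": "What is 7 - 2?", "reasoning": "7 - 2 = 4", "answer": "4", "reference": "5"},
]
sft_data = build_distillation_set(teacher_samples)
```

The student is then trained with ordinary supervised fine-tuning on `sft_data`, which is how reasoning behaviour is transferred without running RL on the smaller model.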

Prerequisites

Module 06 (preference alignment / RLHF) and Module 05 (data generation).
