#Capstone#Knowledge Distillation#GGUF#llama.cpp#Reasoning#Syllabus

Project 02: TinyReason — Distilling Reasoning to a CPU Model

Compress a larger reasoning teacher into a small student using KL divergence and attention transfer, then quantize to GGUF for cost-efficient CPU inference.

May 28, 2026 at 12:00 PM1 min readFollowFollow (Hindi)

Topics You Will Master

Teacher soft-label generation and student architecture design
A from-scratch combined distillation loss (KL divergence + attention transfer + CE)
Ablation studies against the teacher baseline
GGUF quantization and OpenAI-compatible CPU serving
Best For

Engineers who want a small model that reasons — fast, cheap, and runnable on CPU.

Expected Outcome

A quantized student model and a CPU-served endpoint with a measured quality and latency gap against the teacher.

Project Overview

TinyReason takes a larger teacher model with strong reasoning behavior and distils its knowledge into a much smaller student that can run on CPU at production cost.

Objective

Train a small student model using a combined distillation loss against a larger teacher on a math-reasoning corpus, then ship it as a quantized CPU-served endpoint.

Scope

  • Teacher soft-label generation across the training set.
  • Student architecture choice and a custom combined distillation loss.
  • Ablation study across loss configurations vs. the teacher baseline.
  • Conversion to GGUF and CPU inference benchmarking.

Datasets

  • A mathematical-reasoning problem set (e.g., GSM8K-style) for primary distillation and evaluation.
  • An extended-reasoning set for stress testing.

Stack

  • transformers + PyTorch for logit extraction from the teacher.
  • A custom training loop with loss-curve tracking.
  • KL divergence and attention transfer implemented from scratch.
  • llama.cpp with GGUF conversion for CPU serving.
  • llama-server for an OpenAI-compatible CPU inference API.

Evaluation

  • Quality gap vs. the teacher on held-out problems.
  • Tokens-per-second and first-token-latency benchmarks.

Deliverables

  • Saved soft labels for all training problems.
  • Trained student checkpoint with the combined loss.
  • Ablation comparison table across loss configurations.
  • A quantized GGUF model file.
  • A CPU-served endpoint with a benchmark report.

Prerequisites

Modules 09–10 (reasoning models, SLMs and distillation), Module 07 (quantization and serving).

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments