Project Overview
TinyReason distils the knowledge of a larger teacher model with strong reasoning behavior into a much smaller student that can run on CPU at a cost suitable for production.
Objective
Train a small student model using a combined distillation loss against a larger teacher on a math-reasoning corpus, then ship it as a quantized CPU-served endpoint.
Scope
- Teacher soft-label generation across the training set.
- Student architecture choice and a custom combined distillation loss.
- Ablation study across loss configurations vs. the teacher baseline.
- Conversion to GGUF and CPU inference benchmarking.
Datasets
- A mathematical-reasoning problem set (e.g., GSM8K-style) for primary distillation and evaluation.
- An extended-reasoning set for stress testing.
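
If the problem set follows the GSM8K convention of ending each reference solution with a `#### <answer>` line, scoring can be reduced to exact match on that final answer. The sketch below assumes this convention and also assumes that model outputs are prompted to use the same marker; `extract_final_answer` and `exact_match_accuracy` are hypothetical helper names, not part of any library.

```python
import re

# Matches the number after a '####' marker, e.g. "#### 1,234" or "#### -3.5".
_ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")


def extract_final_answer(solution_text):
    """Return the number after the last '####' marker (commas removed), or None."""
    matches = _ANSWER_RE.findall(solution_text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final answer equals the reference answer."""
    if not references:
        raise ValueError("references must not be empty")
    correct = 0
    for pred, ref in zip(predictions, references):
        pred_answer = extract_final_answer(pred)
        if pred_answer is not None and pred_answer == extract_final_answer(ref):
            correct += 1
    return correct / len(references)
```

Running the same function over teacher and student outputs on the held-out problems gives two accuracies whose difference is one simple way to express the quality gap.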
Stack
- transformers + PyTorch for logit extraction from the teacher.
- A custom training loop with loss-curve tracking.
- KL divergence and attention transfer implemented from scratch.
- llama.cpp with GGUF conversion for CPU serving.
- llama-server for an OpenAI-compatible CPU inference API.
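
The overview does not say which terms the combined loss contains or how they are weighted. As a minimal sketch, the code below assumes three terms: hard-label cross-entropy, a temperature-softened KL divergence against the teacher (scaled by T² as is common in distillation), and a mean-squared-error attention-transfer term. The weights `alpha` and `beta`, the `temperature` default, and all function names are illustrative assumptions. It is written in plain Python for a single token position so the arithmetic is easy to follow; the actual training loop would operate on PyTorch tensors.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """Temperature-softened KL(teacher || student), scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


def attention_transfer(teacher_attn, student_attn):
    """Mean squared error between two attention matrices of equal shape."""
    diffs = [
        (t - s) ** 2
        for t_row, s_row in zip(teacher_attn, student_attn)
        for t, s in zip(t_row, s_row)
    ]
    return sum(diffs) / len(diffs)


def combined_loss(teacher_logits, student_logits, teacher_attn, student_attn,
                  target_index, alpha=0.5, beta=0.1, temperature=2.0):
    """Weighted sum of cross-entropy, softened KL, and attention transfer."""
    # Hard-label cross-entropy on the student's unsoftened distribution.
    ce = -math.log(softmax(student_logits)[target_index] + 1e-12)
    kd = distillation_kl(teacher_logits, student_logits, temperature)
    at = attention_transfer(teacher_attn, student_attn)
    return (1 - alpha) * ce + alpha * kd + beta * at
```

Keeping the three terms as separate functions makes the ablation straightforward: setting `alpha` or `beta` to zero switches a term off without touching the rest of the loop. The attention term assumes teacher and student attention maps have already been brought to the same shape, which in practice requires a layer and head mapping that this sketch leaves out.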
Evaluation
- Quality gap vs. the teacher on held-out problems.
- Tokens-per-second and first-token-latency benchmarks.
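
Both speed metrics can be derived from timestamps taken while consuming a streamed response. The sketch below is one possible way to do this, assuming the endpoint's output is exposed as a Python iterable that yields one token (or chunk) at a time; `benchmark_stream` and its `clock` parameter are hypothetical names introduced here. Throughput is measured end to end, so it includes the time spent before the first token arrives.

```python
import time


def benchmark_stream(token_stream, clock=time.perf_counter):
    """Consume a token iterable and report first-token latency and throughput."""
    start = clock()
    first_token_at = None
    last_token_at = start
    count = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # time at which the first token arrived
        last_token_at = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    elapsed = last_token_at - start
    return {
        "tokens": count,
        "first_token_latency_s": first_token_at - start,
        "tokens_per_second": count / elapsed if elapsed > 0 else float("inf"),
    }
```

Passing the clock as a parameter keeps the function testable with a fake timer. For the benchmark report, repeating the measurement over several prompts and reporting a median would smooth out run-to-run noise.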
Deliverables
- Saved soft labels for all training problems.
- Trained student checkpoint with the combined loss.
- Ablation comparison table across loss configurations.
- A quantized GGUF model file.
- A CPU-served endpoint with a benchmark report.
Prerequisites
Modules 09–10 (reasoning models, SLMs and distillation), Module 07 (quantization and serving).