Module Overview
This module covers model compression for production deployment. It motivates Small Language Models (SLMs), surveys the techniques that produce them (distillation, pruning, quantization), and then details knowledge distillation, the core mechanism for transferring a large model's behaviour into a small one.
Learning Objectives
- Explain why SLMs are critical for production (cost, latency, privacy).
- Compare the key SLM-building techniques and when each applies.
- Describe the student-teacher paradigm and what "knowledge" is transferred.
- Explain soft labels, temperature scaling, and the KL divergence loss.
- Outline the stages of building a distillation pipeline.
Topics Covered
Small Language Models
- What is a Small Language Model?
- Why SLMs matter
- The SLM design philosophy
- Key techniques and architectures: knowledge distillation, pruning, quantization, Liquid Foundation Models
- Training SLMs from scratch
- Fine-tuning SLMs
- Reasoning in SLMs
- SLM vs LLM: choosing the right scale
Knowledge Distillation
- The student-teacher paradigm
- The core problem: what is knowledge?
- Hard labels vs soft labels
- Temperature scaling
- The KL divergence loss
- Attention transfer
- Building a distillation pipeline: teacher data generation, student architecture design, training the student
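As a rough preview of how soft labels, temperature scaling, and the KL divergence loss fit together, the sketch below computes a distillation loss for a single example in plain Python. It is an illustration under stated assumptions, not the module's reference implementation: the function names, the example logits, and the temperature of 2.0 are all hypothetical, and the T^2 scaling follows the common convention from the original distillation formulation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL between softened teacher and student outputs, scaled by T^2.

    The T^2 factor is a common convention that keeps gradient magnitudes
    comparable across temperatures; it is an assumption here, not a
    requirement stated in this module.
    """
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    q_student = softmax_with_temperature(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p_teacher, q_student)

# Hypothetical logits for a 3-class problem.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

hard_view = softmax_with_temperature(teacher_logits, temperature=1.0)
soft_view = softmax_with_temperature(teacher_logits, temperature=4.0)
loss = distillation_loss(teacher_logits, student_logits, temperature=2.0)

print("T=1 teacher probabilities:", [round(p, 3) for p in hard_view])
print("T=4 teacher probabilities:", [round(p, 3) for p in soft_view])
print("distillation loss:", round(loss, 4))
```

Raising the temperature flattens the teacher's distribution, so the relative probabilities of the non-top classes become visible to the student. In a real pipeline this term is typically combined with an ordinary cross-entropy loss on the hard labels and computed over batches with a tensor library such as PyTorch.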
Key Concepts & Terminology
Soft targets, distillation temperature, logit matching, attention map transfer, structured/unstructured pruning, compression-quality trade-off.
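To make the structured/unstructured pruning distinction concrete, here is a minimal sketch on a tiny weight matrix stored as nested lists. Both helper functions and the example weights are hypothetical; the magnitude and L2-norm criteria are common choices assumed for illustration, not necessarily the ones the module uses.

```python
import math

def unstructured_prune(weights, sparsity):
    """Zero the smallest-magnitude individual weights across the whole matrix.

    The matrix keeps its shape; only individual entries become zero.
    Ties at the threshold are all pruned, so slightly more than the
    requested fraction may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)
    if k == 0:
        return [row[:] for row in weights]
    threshold = magnitudes[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def structured_prune(weights, rows_to_drop):
    """Remove whole rows (for example, neurons) with the smallest L2 norm.

    The matrix becomes physically smaller, which is what lets dense
    hardware kernels benefit without sparse-matrix support.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[: len(weights) - rows_to_drop])
    return [weights[i][:] for i in keep]

# Hypothetical 3x3 weight matrix.
weights = [
    [0.9, -0.05, 0.4],
    [0.01, 0.02, -0.03],
    [-0.7, 0.6, 0.1],
]

sparse = unstructured_prune(weights, sparsity=0.5)
smaller = structured_prune(weights, rows_to_drop=1)

print("unstructured:", sparse)
print("structured:", smaller)
```

Unstructured pruning leaves a same-shaped matrix with scattered zeros, while structured pruning removes entire rows and yields a smaller dense matrix. Which one is preferable depends on whether the serving stack can exploit sparsity.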
Tools & Frameworks Referenced
PyTorch and Hugging Face Transformers (logit extraction and custom training loops), GGUF + llama.cpp for compressed CPU serving.
Prerequisites
Module 06 (fine-tuning) and Module 07 (quantization).