Full BERT is accurate but heavy — 110M parameters is a lot to run on a phone or in a high-traffic API. Knowledge distillation solves this by training a smaller, faster student model to mimic a larger, more accurate teacher model. The result keeps most of the teacher's accuracy at a fraction of the size.
This article explains the distillation idea and the three most popular distilled BERT models — DistilBERT, MobileBERT, and TinyBERT — so you know which to reach for when speed and size matter. You will fine-tune these models in the next tutorial.
Prerequisites: Familiarity with BERT's architecture and the encoder stack.
What Is Knowledge Distillation?
Knowledge distillation is a technique where a smaller, simpler model (the student) is trained to reproduce the behavior of a larger, more complex model (the teacher). Instead of learning only from hard labels, the student also learns from the teacher's full output distribution — the "soft" probabilities the teacher assigns across all classes.
Those soft probabilities carry extra information. A teacher that is 70% confident an image is a husky, 25% a wolf, and 5% a cat tells the student that huskies look more like wolves than cats — a nuance the hard label "husky" alone never conveys.
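The husky example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the two student distributions are hypothetical, invented to show that a hard label cannot tell them apart while the teacher's soft targets can.

```python
import math

# Teacher's soft probabilities from the example: husky, wolf, cat.
teacher_probs = [0.70, 0.25, 0.05]

# The hard label keeps only the winning class as a one-hot vector.
hard_label = [1.0, 0.0, 0.0]

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-weight targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Two hypothetical students that both put 70% on "husky"
# but disagree about which wrong class is more plausible.
students = {
    "A": [0.70, 0.25, 0.05],  # agrees with the teacher: wolf is likelier than cat
    "B": [0.70, 0.05, 0.25],  # reversed: cat is likelier than wolf
}

losses = {}
for name, probs in students.items():
    hard = cross_entropy(hard_label, probs)
    soft = cross_entropy(teacher_probs, probs)
    losses[name] = (hard, soft)
    print(f"student {name}: hard-label loss={hard:.3f}, soft-target loss={soft:.3f}")
```

Both students receive the same hard-label loss, because the label only looks at the "husky" probability. The soft-target loss is lower for student A, which is the signal that pushes a student toward the teacher's view that huskies resemble wolves more than cats.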
Distillation is done for three reasons:
- Model compression — far fewer parameters
- Inference speedup — faster predictions
- Deployment efficiency — fits on mobile and edge devices

A small student model learns to mimic the soft output distribution of a large teacher model.
DistilBERT
DistilBERT is a distilled version of BERT: smaller, faster, cheaper, and lighter.
Its headline numbers are striking:
- 60% faster at inference than BERT
- 44M fewer parameters — about 40% smaller overall
- Retains 97% of BERT's performance on the GLUE benchmark
Architecture
DistilBERT keeps BERT's general design but halves the depth: 6 encoder layers instead of 12. The student is trained to predict the same probability distribution over the vocabulary as the teacher for the same input, and to keep its hidden-state vectors pointing in the same direction as the teacher's. During training, the same temperature is applied to the softmax of both student and teacher, which softens the distributions and exposes more of what the teacher knows about the less likely tokens. At inference the temperature is set back to 1.
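To see what temperature scaling does to a distribution, here is a minimal sketch in plain Python. The logits and the temperature of 4.0 are hypothetical values chosen for illustration, not settings taken from DistilBERT.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher scores for three candidate tokens.
teacher_logits = [4.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

With the higher temperature the top token loses probability mass and the unlikely tokens gain it, so the ranking is preserved but the student sees a stronger signal about the runners-up.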
The Triple Loss
DistilBERT updates its weights using a loss made of three components:
| Loss component | What it teaches |
|---|---|
| Masked language modeling (MLM) loss | The standard BERT objective — predict masked tokens |
| Distillation loss | Match the teacher's soft output distribution |
| Similarity (cosine) loss | Align the student's hidden-state directions with the teacher's |
Together these form the triple loss that lets a 6-layer model recover almost all of a 12-layer model's quality.
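The three components in the table can be sketched as one weighted sum. This is a simplified illustration for a single masked position, not DistilBERT's training code: the function names, the equal weights, the temperature of 2.0, and the toy vectors are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(student_logits, true_token_id):
    """Cross-entropy against the true masked token (the hard label)."""
    return -math.log(softmax(student_logits)[true_token_id])

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity: zero when the two vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, true_token_id,
                student_hidden, teacher_hidden,
                w_mlm=1.0, w_distill=1.0, w_cos=1.0, temperature=2.0):
    """Weighted sum of the three components (weights are illustrative)."""
    return (w_mlm * mlm_loss(student_logits, true_token_id)
            + w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Toy values for one masked position with a three-token vocabulary.
teacher_logits = [3.0, 1.0, 0.2]
student_logits = [2.0, 1.5, 0.1]
teacher_hidden = [0.5, -1.0, 2.0]
student_hidden = [0.4, -0.8, 2.2]

total = triple_loss(student_logits, teacher_logits, 0, student_hidden, teacher_hidden)
print(round(total, 4))
```

When the student's logits and hidden state match the teacher's exactly, the distillation and cosine terms vanish and only the MLM term remains. In a real training loop these terms would be computed on tensors over whole batches, and the relative weights are hyperparameters to tune.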

DistilBERT trains with three combined losses: MLM, distillation, and cosine similarity.
MobileBERT
MobileBERT compresses and accelerates BERT specifically for mobile and resource-limited devices, while keeping accuracy high.
Its key properties:
- It is task-agnostic — fine-tune it for any NLP task without task-specific modifications, just like BERT.
- It is a "thin" version of BERT-large that adds bottleneck structures and rebalances the self-attention and feed-forward networks to cut the computational load.
- It is trained by transferring knowledge from an inverted-bottleneck BERT-large (IB-BERT) teacher, so the small model retains the larger model's quality.
- It is 4.3× smaller and 5.5× faster than BERT-base, with competitive results on GLUE and SQuAD.
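A rough parameter count shows why a bottleneck cuts the load: a dense layer's cost grows with the product of its input and output widths, so doing most of the work at a narrow width is much cheaper. The sketch below is a deliberately simplified model, not MobileBERT's actual layer layout, and the widths (512 and 128) and the count of four inner layers are hypothetical values chosen for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def full_width_block(width, inner_layers):
    """A block whose inner layers all operate at the full hidden width."""
    return inner_layers * dense_params(width, width)

def bottleneck_block(width, narrow, inner_layers):
    """Project down, run the inner layers at the narrow width, project back up."""
    return (dense_params(width, narrow)
            + inner_layers * dense_params(narrow, narrow)
            + dense_params(narrow, width))

# Hypothetical sizes, for illustration only.
WIDTH, NARROW, INNER_LAYERS = 512, 128, 4

full = full_width_block(WIDTH, INNER_LAYERS)
thin = bottleneck_block(WIDTH, NARROW, INNER_LAYERS)

print(f"full-width block: {full:,} parameters")
print(f"bottleneck block: {thin:,} parameters")
print(f"reduction: {full / thin:.1f}x")
```

The two extra projection layers are a fixed overhead, so the saving grows with the amount of work done inside the narrow section.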
Note
Because MobileBERT is task-agnostic, you distill it once and then fine-tune the same checkpoint for many downstream tasks — no need to re-distill per task.
TinyBERT
TinyBERT reduces BERT's size and improves its speed using a custom Transformer distillation method that transfers knowledge from a larger BERT to a much smaller one.
Its distinctive feature is a two-stage learning framework:
- General distillation from a non-fine-tuned BERT, giving the student broad language ability.
- Task-specific distillation from a fine-tuned BERT, sharpening it for the target task.
TinyBERT fits the student to the teacher's representations with a separate loss for each of three kinds of layer output:
- The output of the embedding layer
- The hidden states and attention matrices from the Transformer layers
- The logits from the prediction layer
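The layer-wise losses listed above can be sketched in plain Python. This is an illustration under stated assumptions rather than TinyBERT's implementation: the uniform layer mapping (each student layer learns from an evenly spaced teacher layer), the projection matrix that widens the student's hidden states to the teacher's width, and all toy values are assumptions introduced here.

```python
import math

def flatten(matrix):
    """Turn a list of rows into one flat list."""
    return [v for row in matrix for v in row]

def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    fa, fb = flatten(a), flatten(b)
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)

def project(vectors, weight):
    """Multiply each row vector by a weight matrix (student width to teacher width)."""
    return [[sum(v * w for v, w in zip(vec, col)) for col in zip(*weight)]
            for vec in vectors]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's prediction against the teacher's soft targets."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def teacher_layer_for(student_layer, n_student, n_teacher):
    """Assumed uniform mapping: student layer m learns from teacher layer m * N / M."""
    return student_layer * n_teacher // n_student

# Which teacher layer each layer of a 4-layer student would learn from (12-layer teacher).
mapping = [teacher_layer_for(m, 4, 12) for m in range(1, 5)]

# Toy values: 2 tokens, student width 2, teacher width 3 (all hypothetical).
student_hidden = [[1.0, 0.0], [0.0, 1.0]]
teacher_hidden = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # learned during training in practice
student_attention = [[0.6, 0.4], [0.3, 0.7]]
teacher_attention = [[0.8, 0.2], [0.3, 0.7]]

hidden_loss = mse(project(student_hidden, projection), teacher_hidden)
attention_loss = mse(student_attention, teacher_attention)
logit_loss = soft_cross_entropy([2.0, 0.5], [3.0, 0.1])

print("layer mapping:", mapping)
print("hidden:", hidden_loss, "attention:", round(attention_loss, 4),
      "logits:", round(logit_loss, 4))
```

The embedding-layer loss would take the same form as the hidden-state loss, applied to the embedding outputs instead of a Transformer layer's outputs.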
The payoff: a 4-layer TinyBERT achieves over 96.8% of BERT-base's GLUE performance while being 7.5× smaller and 9.4× faster at inference.

TinyBERT distills in two stages — general first, then task-specific — across embedding, attention, and logit layers.
How Similar Are These Models?
All three share the same goal and the same core technique, with different engineering choices:
| Model | Size vs. BERT-base | Inference speed vs. BERT-base | Distillation approach |
|---|---|---|---|
| DistilBERT | ~40% smaller | 60% faster | Triple loss, 6 layers |
| MobileBERT | 4.3× smaller | 5.5× faster | Bottleneck structures, IB-BERT teacher |
| TinyBERT | 7.5× smaller | 9.4× faster | Two-stage transformer distillation |
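The table mixes percentages and multipliers, which makes the rows hard to compare at a glance. The short sketch below converts everything to multipliers, under two assumptions: "40% smaller" means the model is 60% of the original size, and "60% faster" is read as 1.6 times the speed. The parameter estimates simply divide the 110M figure quoted for full BERT by each size factor, so they are rough approximations rather than official counts.

```python
BERT_BASE_PARAMS = 110e6  # parameter count quoted for full BERT in this article

def size_factor_from_percent_smaller(percent):
    """'40% smaller' leaves 60% of the original size: a 1 / 0.6 reduction factor."""
    return 1.0 / (1.0 - percent / 100.0)

def speed_factor_from_percent_faster(percent):
    """Assumed reading: '60% faster' means 1.6x the original speed."""
    return 1.0 + percent / 100.0

# (size reduction factor, speedup factor) for each row of the table.
models = {
    "DistilBERT": (size_factor_from_percent_smaller(40),
                   speed_factor_from_percent_faster(60)),
    "MobileBERT": (4.3, 5.5),
    "TinyBERT (4-layer)": (7.5, 9.4),
}

for name, (size_factor, speed_factor) in models.items():
    est_params = BERT_BASE_PARAMS / size_factor
    print(f"{name}: {size_factor:.2f}x smaller, {speed_factor:.1f}x faster, "
          f"roughly {est_params / 1e6:.1f}M parameters")
```

On a common scale the ordering is clear: DistilBERT is the mildest compression, and the 4-layer TinyBERT is the most aggressive.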
They have more in common than not:
- All aim to shrink BERT for efficient, deployable models.
- All use knowledge distillation to transfer a teacher's capabilities to a smaller student.
- All remain fine-tunable for a wide range of downstream NLP tasks.
- All stay close to the original BERT's performance on standard benchmarks such as GLUE.
Summary
Knowledge distillation trains a compact student model to mimic a large teacher, capturing most of its accuracy at a fraction of the cost. DistilBERT halves BERT's layers with a triple loss, MobileBERT uses bottleneck structures for on-device use, and TinyBERT distills in two stages for the most aggressive compression.
In the next tutorial you will put this into practice by fine-tuning DistilBERT, MobileBERT, and TinyBERT for fake news detection and benchmarking their speed and accuracy head to head.