Knowledge Distillation: DistilBERT, TinyBERT, MobileBERT

Understand knowledge distillation and how DistilBERT, TinyBERT, and MobileBERT compress BERT into smaller, faster models that keep most of its accuracy.

Jun 18, 2026 · 5 min read

Topics You Will Master

What knowledge distillation is and why teacher-student training works
How DistilBERT halves BERT's layers using a triple loss function
How MobileBERT uses bottleneck structures for on-device deployment
How TinyBERT's two-stage transformer distillation works

BERT-base is accurate but heavy: roughly 110M parameters is a lot to run on a phone or behind a high-traffic API. Knowledge distillation addresses this by training a smaller, faster student model to mimic a larger, more accurate teacher model. The student keeps most of the teacher's accuracy at a fraction of the size.

This article explains the distillation idea and the three most popular distilled BERT models — DistilBERT, MobileBERT, and TinyBERT — so you know which to reach for when speed and size matter. You will fine-tune these models in the next tutorial.

Prerequisites: Familiarity with BERT's architecture and the encoder stack.

What Is Knowledge Distillation?

Knowledge distillation is a technique where a smaller, simpler model (the student) is trained to reproduce the behavior of a larger, more complex model (the teacher). Instead of learning only from hard labels, the student also learns from the teacher's full output distribution — the "soft" probabilities the teacher assigns across all classes.

Those soft probabilities carry extra information. A teacher that is 70% confident an image is a husky, 25% a wolf, and 5% a cat tells the student that huskies look more like wolves than cats — a nuance the hard label "husky" alone never conveys.
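
To make the husky example concrete, here is a minimal sketch that scores two made-up students against both the hard label and the teacher's soft distribution from the paragraph above. The student distributions and variable names are illustrative assumptions, not outputs of a real model.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) between two probability distributions."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: husky, wolf, cat
hard_label = [1.0, 0.0, 0.0]        # what the dataset says
teacher_soft = [0.70, 0.25, 0.05]   # what the teacher says (numbers from the text)

# Two hypothetical students, equally confident about "husky",
# that split the remaining probability mass differently.
student_a = [0.70, 0.25, 0.05]      # treats wolf as the likely confusion
student_b = [0.70, 0.05, 0.25]      # treats cat as the likely confusion

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print("hard-label loss:", round(hard_a, 4), round(hard_b, 4))
print("soft-label loss:", round(soft_a, 4), round(soft_b, 4))
```

The hard label cannot tell the two students apart, because it only looks at the probability given to "husky". The soft targets penalize student B for ranking cat above wolf, which is exactly the extra signal distillation relies on.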

Distillation is done for three reasons:

  • Model compression — far fewer parameters
  • Inference speedup — faster predictions
  • Deployment efficiency — fits on mobile and edge devices

Diagram of knowledge distillation: a large teacher model transferring soft predictions to a small student model

A small student model learns to mimic the soft output distribution of a large teacher model.


DistilBERT

DistilBERT is a distilled version of BERT: smaller, faster, cheaper, and lighter.

Its headline numbers are striking:

  • 60% faster at inference than BERT
  • 66M parameters versus BERT-base's 110M: 44M fewer, or about 40% smaller overall
  • Retains 97% of BERT's performance on the GLUE benchmark

Architecture

DistilBERT keeps BERT's design but halves the depth: 6 encoder layers instead of 12. The student is trained to predict the same probability distribution over the vocabulary as the teacher for the same input, and to keep its hidden states pointing in the same direction as the teacher's. During training, a temperature is applied to the softmax outputs of both models. This softens the distributions and exposes more of the teacher's knowledge, namely the relative probabilities it assigns to tokens other than its top prediction.
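
The effect of temperature is easiest to see in code. The sketch below is a generic temperature-scaled distillation loss in plain Python, not DistilBERT's actual training code. The logits are made up, and the multiplication by T² is a common convention from the distillation literature that is assumed here rather than taken from this article.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature):
    """KL between softened teacher and student distributions, scaled by T^2 (assumed convention)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)

# Made-up logits over a 4-token vocabulary
teacher_logits = [6.0, 3.0, 1.0, -2.0]
student_logits = [4.0, 3.5, 0.5, -1.0]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(teacher_logits, student_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
print("distillation loss:", round(loss, 4))
```

At T=1 almost all of the teacher's probability sits on the top token. At T=4 the lower-ranked tokens become visible, so the student gets a usable signal about how the teacher ranks the alternatives.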

The Triple Loss

DistilBERT updates its weights using a loss made of three components:

Loss component | What it teaches
--- | ---
Masked language modeling (MLM) loss | The standard BERT objective: predict masked tokens
Distillation loss | Match the teacher's soft output distribution
Similarity (cosine) loss | Align the student's hidden-state directions with the teacher's

Together these form the triple loss that lets a 6-layer model recover almost all of a 12-layer model's quality.

Diagram of DistilBERT's triple loss combining MLM loss, distillation loss, and similarity loss

DistilBERT trains with three combined losses: MLM, distillation, and cosine similarity.
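
As a rough sketch of how the three terms combine, the snippet below implements the cosine term and adds it to placeholder MLM and distillation values. The weights, the vectors, and the two placeholder loss values are hypothetical. This article does not give the weighting DistilBERT uses, so equal weights are assumed purely for illustration.

```python
import math

def cosine_loss(student_vec, teacher_vec):
    """1 - cos(student, teacher): 0 when both vectors point the same way."""
    dot = sum(s * t for s, t in zip(student_vec, teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(mlm, distill, cosine, w_mlm=1.0, w_distill=1.0, w_cos=1.0):
    """Weighted sum of the three components (weights are illustrative assumptions)."""
    return w_mlm * mlm + w_distill * distill + w_cos * cosine

# Toy hidden states for a single token
teacher_hidden = [0.5, -1.0, 2.0]
student_aligned = [1.0, -2.0, 4.0]   # same direction, twice the length
student_off = [2.0, 1.0, 0.0]        # orthogonal to the teacher vector

loss_aligned = cosine_loss(student_aligned, teacher_hidden)
loss_off = cosine_loss(student_off, teacher_hidden)

# Placeholder values stand in for the MLM and distillation terms
total = triple_loss(mlm=2.1, distill=0.8, cosine=loss_off)

print("cosine loss, aligned:", round(loss_aligned, 6))
print("cosine loss, orthogonal:", round(loss_off, 6))
print("combined loss:", round(total, 4))
```

Note that the cosine term only cares about direction: the aligned student is twice as long as the teacher vector and still incurs no penalty.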


MobileBERT

MobileBERT compresses and accelerates BERT specifically for mobile and resource-limited devices, while keeping accuracy high.

Its key properties:

  • It is task-agnostic — fine-tune it for any NLP task without task-specific modifications, just like BERT.
  • It uses a "thin" version of BERT-large with bottleneck structures plus balanced self-attention and feed-forward blocks to cut the computational load.
  • It is trained by transferring knowledge from an inverted-bottleneck BERT-large (IB-BERT) teacher, so the small model retains the larger model's quality.
  • It is 4.3× smaller and 5.5× faster than BERT-base, with competitive results on GLUE and SQuAD.
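
To see why a bottleneck cuts the computational load, it helps to count weights. The sketch below compares an attention block that runs at full width with one that projects down to a narrow width, runs attention there, and projects back up. The widths are hypothetical and the block is heavily simplified (biases, feed-forward layers, and MobileBERT's exact wiring are ignored), so treat it as an illustration of the idea rather than MobileBERT's real configuration.

```python
def attention_weights(width):
    """Weights in the Q, K, V and output projections at a given width (biases ignored)."""
    return 4 * width * width

def bottleneck_attention_weights(outer_width, inner_width):
    """Project down, run attention at the narrow width, project back up."""
    down = outer_width * inner_width
    up = inner_width * outer_width
    return down + attention_weights(inner_width) + up

# Hypothetical widths chosen only for illustration
OUTER_WIDTH = 512
INNER_WIDTH = 128

full = attention_weights(OUTER_WIDTH)
narrow = bottleneck_attention_weights(OUTER_WIDTH, INNER_WIDTH)
ratio = full / narrow

print("full-width attention weights:", full)
print("bottleneck attention weights:", narrow)
print("reduction factor:", round(ratio, 2))
```

Because the cost of the projections grows with the square of the width, running the expensive part of the block at a quarter of the width saves far more than the two extra projection matrices cost.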

Note

Because MobileBERT is task-agnostic, you distill it once and then fine-tune the same checkpoint for many downstream tasks — no need to re-distill per task.


TinyBERT

TinyBERT reduces BERT's size and improves its speed using a custom Transformer distillation method that transfers knowledge from a larger BERT to a much smaller one.

Its distinctive feature is a two-stage learning framework:

  1. General distillation from a non-fine-tuned BERT, giving the student broad language ability.
  2. Task-specific distillation from a fine-tuned BERT, sharpening it for the target task.

TinyBERT matches the student to the teacher at three levels, each with its own loss term:

  • The output of the embedding layer
  • The hidden states and attention matrices from the Transformer layers
  • The logits from the prediction layer
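
A layer-wise loss needs two ingredients: a rule for which teacher layer each student layer imitates, and a distance between the matched representations. The sketch below assumes a uniform mapping (every third teacher layer for a 4-layer student and a 12-layer teacher) and a mean squared error on toy attention matrices. Both the function names and the numbers are illustrative. A real setup would also need a learned projection when the student's hidden size differs from the teacher's, which is omitted here.

```python
def uniform_layer_map(student_layers, teacher_layers):
    """Map student layer m (1-based) to teacher layer m * (teacher_layers // student_layers)."""
    step = teacher_layers // student_layers
    return {m: m * step for m in range(1, student_layers + 1)}

def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

# Which teacher layer does each student layer learn from?
layer_map = uniform_layer_map(student_layers=4, teacher_layers=12)

# Toy 2x2 attention matrices for one head (each row sums to 1)
teacher_attn = [[0.9, 0.1], [0.2, 0.8]]
student_attn = [[0.7, 0.3], [0.4, 0.6]]
attn_loss = mse(student_attn, teacher_attn)

print("layer map:", layer_map)
print("attention loss:", round(attn_loss, 4))
```

The same `mse` helper could be reused for the embedding and hidden-state terms, while the logit term would typically use a soft cross-entropy against the teacher's predictions instead.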

The payoff: a 4-layer TinyBERT achieves over 96.8% of BERT-base's GLUE performance while being 7.5× smaller and 9.4× faster at inference.

Diagram of TinyBERT's two-stage distillation: general distillation then task-specific distillation

TinyBERT distills in two stages, general first and then task-specific, across embeddings, Transformer-layer hidden states and attention matrices, and logits.


How Similar Are These Models?

All three share the same goal and the same core technique, with different engineering choices:

Model | Size reduction | Inference speedup | Distillation approach
--- | --- | --- | ---
DistilBERT | ~40% smaller | 60% faster | Triple loss, 6 layers
MobileBERT | 4.3× smaller | 5.5× faster | Bottleneck structures, IB-BERT teacher
TinyBERT | 7.5× smaller | 9.4× faster | Two-stage transformer distillation
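
The table mixes percentages and multipliers, so a quick conversion puts the three models on the same scale. The snippet below derives approximate parameter counts from the figures quoted in this article, taking BERT-base's 110M parameters as the baseline. The results are back-of-the-envelope estimates, not published counts.

```python
BERT_BASE_PARAMS_M = 110  # millions of parameters, as quoted in this article

# Estimated sizes in millions of parameters, derived from the article's figures
estimates_m = {
    "DistilBERT": BERT_BASE_PARAMS_M - 44,          # "44M fewer parameters"
    "MobileBERT": BERT_BASE_PARAMS_M / 4.3,         # "4.3x smaller"
    "TinyBERT (4-layer)": BERT_BASE_PARAMS_M / 7.5, # "7.5x smaller"
}

for name, params in estimates_m.items():
    reduction = 100 * (1 - params / BERT_BASE_PARAMS_M)
    print(f"{name}: ~{params:.1f}M parameters, about {reduction:.0f}% smaller")
```

On this common scale, the estimates come out to roughly 66M, 26M, and 15M parameters, which makes the gap between DistilBERT and the two more aggressive models easier to see.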

They have more in common than not:

  • All aim to shrink BERT for efficient, deployable models.
  • All use knowledge distillation to transfer a teacher's capabilities to a smaller student.
  • All remain fine-tunable for a wide range of downstream NLP tasks.
  • All achieve performance competitive with or close to the original BERT on standard benchmarks.

Summary

Knowledge distillation trains a compact student model to mimic a large teacher, capturing most of its accuracy at a fraction of the cost. DistilBERT halves BERT's layers with a triple loss, MobileBERT uses bottleneck structures for on-device use, and TinyBERT distills in two stages for the most aggressive compression.

In the next tutorial you will put this into practice by fine-tuning DistilBERT, MobileBERT, and TinyBERT for fake news detection and benchmarking their speed and accuracy head to head.
