Knowledge Distillation: DistilBERT, TinyBERT, MobileBERT

BERT-base is accurate but heavy. Its 110M parameters are a lot to run on a phone or behind a high-traffic API. Knowledge distillation addresses this. It trains a smaller, faster student model to mimic a larger, more accurate teacher model. The student keeps most of the teacher's accuracy at a fraction of the size.

In this post, we cover the idea behind distillation and the three most popular distilled BERT models: DistilBERT, MobileBERT, and TinyBERT. By the end, we will know which one to reach for when speed and size matter. We will fine-tune these models in the next tutorial.

Prerequisites: Familiarity with BERT's architecture and the encoder stack.

What Is Knowledge Distillation?

Knowledge distillation is a technique where a small, simple model is trained to reproduce the behavior of a large, complex one. We call the small model the student and the large one the teacher. The student does not learn only from hard labels. It also learns from the teacher's full output distribution. In other words, it learns the soft probabilities the teacher assigns across all classes.

Those soft probabilities carry extra information. Say a teacher is 70% sure an image is a husky, 25% a wolf, and 5% a cat. This tells the student that huskies look more like wolves than cats. The hard label "husky" alone never shows that.
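To make the husky example concrete, here is a minimal sketch in plain Python. The probabilities come from the example above. The two student predictions and the helper name `cross_entropy` are made up for illustration.

```python
import math

# Classes and teacher probabilities from the husky example above.
classes = ["husky", "wolf", "cat"]
hard_label = [1.0, 0.0, 0.0]       # one-hot: only says "husky"
soft_label = [0.70, 0.25, 0.05]    # the teacher's full distribution

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; zero-probability targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Two hypothetical student predictions that both give 70% to "husky".
student_a = [0.70, 0.25, 0.05]   # ranks wolf above cat, like the teacher
student_b = [0.70, 0.05, 0.25]   # ranks cat above wolf

for name, student in [("A", student_a), ("B", student_b)]:
    hard = cross_entropy(hard_label, student)
    soft = cross_entropy(soft_label, student)
    print(f"student {name}: hard-label loss={hard:.3f}, soft-label loss={soft:.3f}")
```

The hard-label loss is identical for both students because it only looks at the probability given to "husky". The soft-label loss is lower for student A, which agrees with the teacher that a husky is closer to a wolf than to a cat.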

Distillation is done for three reasons:

Model compression: far fewer parameters
Inference speedup: faster predictions
Deployment efficiency: fits on mobile and edge devices

[Figure: A small student model learns to mimic the soft output distribution of a large teacher model.]

DistilBERT

DistilBERT is a distilled version of BERT. It is smaller, faster, cheaper, and lighter.

Its numbers are impressive:

60% faster at inference than BERT
44M fewer parameters than BERT-base (66M vs. 110M), about 40% smaller overall
Keeps 97% of BERT's performance on the GLUE benchmark

Architecture

DistilBERT keeps BERT's general design but halves the depth. It uses 6 encoder layers instead of 12. The student is initialized from the teacher's weights by taking one layer out of two. It is then trained to predict the same probability distribution as the teacher for the same input. During training, temperature scaling is applied to the softmax outputs of both models. A temperature above 1 flattens the distributions, so the small probabilities the teacher assigns to the less likely classes carry more weight in the loss.
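The effect of temperature scaling is easy to see on a toy example. The sketch below is plain Python, not DistilBERT's training code. The logits and the temperature values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (illustrative values only).
teacher_logits = [4.0, 2.0, 0.5]

for t in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As the temperature rises, the top class loses probability mass and the other classes gain it. The ranking stays the same, but the student sees more clearly how the teacher rates the less likely classes.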

The Triple Loss

DistilBERT updates its weights using a loss made of three components:

Loss component	What it teaches
Masked language modeling (MLM) loss	The standard BERT objective: predict masked tokens
Distillation loss	Match the teacher's soft output distribution
Similarity (cosine) loss	Align the student's hidden-state directions with the teacher's

Together these form the triple loss. It lets a 6-layer model recover almost all of a 12-layer model's quality.
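The three components can be sketched for a single masked position as follows. This is a simplified illustration in plain Python, not the original training code. The equal loss weights, the temperature of 2.0, the T² scaling of the distillation term, and all toy values are assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(student_logits, true_token_id):
    """Cross-entropy of the student's prediction against the true masked token."""
    return -math.log(softmax(student_logits)[true_token_id])

def distillation_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2  # T^2 scaling: a common convention, assumed here

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between the two hidden-state vectors."""
    dot = sum(s * t for s, t in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(s * s for s in student_hidden))
    norm_t = math.sqrt(sum(t * t for t in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, true_token_id,
                student_hidden, teacher_hidden,
                temperature=2.0, w_mlm=1.0, w_distill=1.0, w_cos=1.0):
    """Weighted sum of the three components; the weights are placeholders."""
    return (w_mlm * mlm_loss(student_logits, true_token_id)
            + w_distill * distillation_loss(student_logits, teacher_logits, temperature)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Toy values for one masked position over a 4-token vocabulary.
teacher_logits = [3.0, 1.0, 0.2, -1.0]
student_logits = [2.5, 1.2, 0.1, -0.5]
loss = triple_loss(student_logits, teacher_logits, true_token_id=0,
                   student_hidden=[0.2, 0.9, -0.3], teacher_hidden=[0.1, 1.0, -0.2])
print(f"triple loss: {loss:.4f}")
```

Each term pulls the student toward a different target: the true token, the teacher's output distribution, and the direction of the teacher's hidden state. In a real setup the weights would be tuned, and the losses would be averaged over all masked positions in a batch.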

[Figure: DistilBERT trains with three combined losses: MLM, distillation, and cosine similarity.]

MobileBERT

MobileBERT compresses and speeds up BERT for mobile and low-resource devices. It does this while keeping accuracy high.

Its key properties:

It is task-agnostic. We fine-tune it for any NLP task with no task-specific changes, just like BERT.
It uses a thin version of BERT-large with bottleneck structures. It also balances the self-attention and feed-forward blocks to cut the compute load.
It is trained by transferring knowledge from a specially designed teacher, a BERT-large with inverted-bottleneck structures (IB-BERT). This lets the small model retain much of the larger model's quality.
It is 4.3× smaller and 5.5× faster than BERT-base, with competitive results on GLUE and SQuAD.
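To get a feel for why bottleneck structures cut the compute load, we can count the weights of a feed-forward block with and without one. This is back-of-the-envelope arithmetic in Python. The widths and the expansion factor are hypothetical values picked for illustration, not MobileBERT's published configuration, and the function names are our own.

```python
def ffn_weights(width, expansion=4):
    """Weight count of a two-layer feed-forward block (biases ignored)."""
    return 2 * width * (width * expansion)

def bottlenecked_ffn_weights(outer_width, inner_width, expansion=4):
    """Project down, run the feed-forward block at the inner width, project back up."""
    down = outer_width * inner_width
    up = inner_width * outer_width
    return down + ffn_weights(inner_width, expansion) + up

# Hypothetical widths, chosen only to illustrate the effect.
outer, inner = 512, 128
full = ffn_weights(outer)
thin = bottlenecked_ffn_weights(outer, inner)
print(f"full-width block:   {full:,} weights")
print(f"bottlenecked block: {thin:,} weights ({full / thin:.1f}x fewer)")
```

The expensive computation happens at the narrow inner width, and the two projection layers are cheap by comparison. The same idea applies to the attention block.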

Note

Because MobileBERT is task-agnostic, we distill it once. Then we fine-tune the same checkpoint for many downstream tasks. There is no need to re-distill for each task.

TinyBERT

TinyBERT cuts BERT's size and improves its speed. It uses a custom Transformer distillation method. This method transfers knowledge from a larger BERT to a much smaller one.

Its main feature is a two-stage learning framework:

General distillation from a non-fine-tuned BERT. This gives the student broad language ability.
Task-specific distillation from a fine-tuned BERT. This sharpens it for the target task.

The TinyBERT student is trained to fit representations from three parts of the teacher, each with its own loss:

The output of the embedding layer
The hidden states and attention matrices from the Transformer layers
The logits from the prediction layer
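The three loss types above can be combined as in the sketch below. It is a toy version in plain Python under several assumptions that are not in the text: student layers are mapped uniformly onto teacher layers, mean squared error is used for the intermediate representations, and student and teacher share the same widths. A real student is narrower than its teacher, so the hidden states and embeddings would first pass through a learned projection. All helper names and values are hypothetical.

```python
import math

def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student's distribution against the teacher's."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def layer_map(student_layer, n_student, n_teacher):
    """Assumed uniform mapping: student layer m learns from teacher layer m * (N / M)."""
    return student_layer * (n_teacher // n_student)

def layerwise_distillation_loss(student, teacher):
    """Sum of embedding, hidden-state, attention, and prediction-layer losses."""
    n_student, n_teacher = len(student["hidden"]), len(teacher["hidden"])
    loss = mse(student["embedding"], teacher["embedding"])
    for m in range(1, n_student + 1):
        t = layer_map(m, n_student, n_teacher)
        loss += mse(student["hidden"][m - 1], teacher["hidden"][t - 1])
        loss += mse(student["attention"][m - 1], teacher["attention"][t - 1])
    loss += soft_cross_entropy(student["logits"], teacher["logits"])
    return loss

def toy_model(n_layers, scale):
    """Hypothetical stand-in for a model's outputs, flattened to short lists."""
    return {
        "embedding": [0.1 * scale, 0.2 * scale],
        "hidden": [[0.1 * scale * k, 0.3 * scale * k] for k in range(1, n_layers + 1)],
        "attention": [[0.5, 0.5 / k, 1.0 - 0.5 / k, 0.5] for k in range(1, n_layers + 1)],
        "logits": [2.0 * scale, 0.5, -1.0],
    }

teacher = toy_model(n_layers=12, scale=1.0)
student = toy_model(n_layers=4, scale=0.8)
print(f"student layer 2 learns from teacher layer {layer_map(2, 4, 12)}")
print(f"total loss: {layerwise_distillation_loss(student, teacher):.4f}")
```

The key design point is that the student is supervised at every level, not only at the output. That gives a very shallow model many more training signals than logits alone would.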

The payoff is big. A 4-layer TinyBERT retains more than 96.8% of BERT-base's GLUE performance while being 7.5× smaller and 9.4× faster at inference.

[Figure: TinyBERT distills in two stages, general first and then task-specific, matching embeddings, hidden states, attention matrices, and logits.]

How Similar Are These Models?

All three share the same goal and the same core technique, with different engineering choices:

Model	Size vs. BERT-base	Inference speed vs. BERT-base	Distillation approach
DistilBERT	~40% smaller	60% faster	Triple loss, 6 layers
MobileBERT	4.3× smaller	5.5× faster	Bottleneck structures, IB-BERT teacher
TinyBERT	7.5× smaller	9.4× faster	Two-stage transformer distillation
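The size column mixes a percentage with "times smaller" factors, so it helps to put all three models on the same scale. The short Python sketch below converts the figures quoted in this post into approximate parameter counts. These are rough estimates derived from the quoted ratios and the 110M baseline, not official counts.

```python
BERT_BASE_PARAMS_M = 110  # millions, the figure quoted at the start of this post

# "Times smaller" factors from the table above.
size_factor = {"MobileBERT": 4.3, "TinyBERT (4-layer)": 7.5}

# DistilBERT is quoted as 44M fewer parameters rather than as a factor.
approx_params_m = {"DistilBERT": BERT_BASE_PARAMS_M - 44}
for name, factor in size_factor.items():
    approx_params_m[name] = BERT_BASE_PARAMS_M / factor

for name, params in approx_params_m.items():
    share = params / BERT_BASE_PARAMS_M
    print(f"{name}: ~{params:.1f}M parameters ({share:.0%} of BERT-base)")
```

Seen this way, DistilBERT is the gentlest compression of the three, and the 4-layer TinyBERT is the most aggressive.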

They have more in common than not:

All aim to shrink BERT into efficient, deployable models.
All use knowledge distillation to pass a teacher's skills to a smaller student.
All stay fine-tunable for many downstream NLP tasks.
All reach performance close to the original BERT on standard benchmarks.

Summary

This is how knowledge distillation works. It trains a compact student model to mimic a large teacher. The student keeps most of the accuracy at a fraction of the cost. DistilBERT halves BERT's layers and trains with a triple loss. MobileBERT uses bottleneck structures for on-device use. TinyBERT distills in two stages and achieves the strongest compression of the three.

In the next tutorial, we put this into practice. We will fine-tune DistilBERT, MobileBERT, and TinyBERT for fake news detection. We will also benchmark their speed and accuracy head to head.
