In the previous tutorial you learned how DistilBERT, MobileBERT, and TinyBERT compress BERT. Now you put them to work: you build a fake news detector, then benchmark the three distilled models against BERT-base to see which offers the best trade-off between accuracy and speed.
The task is binary classification — label a news article as Real or Fake — and the workflow mirrors the BERT sentiment tutorial, so you will recognize most of it.
Prerequisites: Familiarity with the BERT fine-tuning workflow, and a Python environment with transformers, datasets, evaluate, scikit-learn, pandas, matplotlib, openpyxl, and torch. A GPU is recommended.
Loading the Dataset
The dataset is an Excel file of news articles with id, title, author, text, and label columns. Load it and drop rows with missing values:
import pandas as pd
df = pd.read_excel("https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/fake_news.xlsx")
df = df.dropna()
df.isnull().sum()
id 0
title 0
author 0
text 0
label 0
dtype: int64
Check the class balance:
df['label'].value_counts()
label
0 10361
1 7920
Name: count, dtype: int64
Label 0 is Real and 1 is Fake — a reasonably balanced binary problem with 18,281 articles total.
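To put the accuracy figures later in this tutorial in context, it helps to know what a trivial classifier would score. The sketch below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {"Real": 10361, "Fake": 7920}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = counts[majority_label] / total

print(f"total articles: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Always predicting Real would be right a little over half the time, so a fine-tuned model needs to clear that bar by a wide margin to be useful.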
Exploring the Data
A quick bar chart confirms the balance:
import matplotlib.pyplot as plt
label_counts = df['label'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()
Estimate token counts (roughly 1.5 tokens per word) for both the short title and the long text to decide what to feed the model:
df['title_tokens'] = df['title'].apply(lambda x: len(x.split()) * 1.5)
df['text_tokens'] = df['text'].apply(lambda x: len(x.split()) * 1.5)
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].hist(df['title_tokens'], bins=50, color='skyblue')
ax[0].set_title("Title Tokens")
ax[1].hist(df['text_tokens'], bins=50, color='orange')
ax[1].set_title("Text Tokens")
plt.show()
Note
Article bodies often exceed BERT's 512-token limit, while titles are short. This tutorial classifies on the title alone — short, fast, and surprisingly effective for this task.
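The same heuristic can be wrapped in a small helper to estimate how many inputs would be truncated. This is a rough sketch: the 1.5 tokens-per-word ratio is the estimate used above, not a measured value, and the function names and sample strings are made up for illustration.

```python
TOKENS_PER_WORD = 1.5   # rough estimate used in this tutorial, not an exact ratio
MAX_TOKENS = 512        # BERT's maximum sequence length


def estimated_tokens(text: str) -> float:
    """Estimate the token count from a whitespace word count."""
    return len(text.split()) * TOKENS_PER_WORD


def share_over_limit(texts, limit: int = MAX_TOKENS) -> float:
    """Return the fraction of texts whose estimated token count exceeds the limit."""
    texts = list(texts)
    if not texts:
        return 0.0
    over = sum(1 for t in texts if estimated_tokens(t) > limit)
    return over / len(texts)


# Hypothetical samples: a short title and a long article body
samples = ["Breaking news headline about the election", "word " * 400]
print(share_over_limit(samples))
```

On the real data, calling `share_over_limit(df['text'])` and `share_over_limit(df['title'])` would give the two fractions to compare.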
Splitting and Building the Dataset
Split the data 70/20/10 into train, test, and validation sets, stratified by label, and wrap the result in a DatasetDict:
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict
train, test = train_test_split(df, test_size=0.3, stratify=df['label'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label'])
dataset = DatasetDict({
    "train": Dataset.from_pandas(train, preserve_index=False),
    "test": Dataset.from_pandas(test, preserve_index=False),
    "validation": Dataset.from_pandas(validation, preserve_index=False)
})
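The two-step split is easy to misread, so here is the arithmetic spelled out. This sketch assumes that scikit-learn rounds a fractional test_size up to a whole number of rows, and it uses the 18,281-row total from the class counts above.

```python
import math

n_rows = 18281  # rows left after dropna(), from the class counts above

# Step 1: hold out 30% of the data (assumed to be rounded up to whole rows)
n_holdout = math.ceil(0.3 * n_rows)
n_train = n_rows - n_holdout

# Step 2: one third of the held-out rows become the validation set
n_validation = math.ceil(n_holdout / 3)
n_test = n_holdout - n_validation

for name, n in [("train", n_train), ("test", n_test), ("validation", n_validation)]:
    print(f"{name:<10} {n:>6}  ({n / n_rows:.1%})")
```

Under that assumption the test split has 3,656 rows, which agrees with the support total in the classification report further down.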
Comparing the Distilled Tokenizers
Each model ships its own tokenizer. Load all three and compare how they split the same sentence:
from transformers import AutoTokenizer
text = "Machine learning is awesome!! Thanks KGP Talkie."
distilbert_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mobilebert_tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
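tinybert_tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")
# Print how each tokenizer splits the sample sentence
for tokenizer in (distilbert_tokenizer, mobilebert_tokenizer, tinybert_tokenizer):
    print(tokenizer.tokenize(text))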
All three share the same 30,522-token WordPiece vocabulary and special tokens ([CLS], [SEP], [PAD], [MASK]), so the tokenized output is consistent across models.
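The shared-vocabulary claim can be checked rather than taken on trust. The helper below is a minimal sketch: `compare_tokenizers` is a hypothetical name, and it only assumes each tokenizer exposes `get_vocab()` and `tokenize()`, which Hugging Face tokenizers do.

```python
def compare_tokenizers(tokenizers, text):
    """Check whether several tokenizers share a vocabulary and split `text` identically.

    `tokenizers` maps a display name to an object exposing get_vocab() and
    tokenize(). The first entry is used as the reference.
    """
    names = list(tokenizers)
    reference = tokenizers[names[0]]
    ref_vocab = reference.get_vocab()
    ref_tokens = reference.tokenize(text)

    report = {}
    for name in names:
        tok = tokenizers[name]
        vocab = tok.get_vocab()
        report[name] = {
            "vocab_size": len(vocab),
            "same_vocab": vocab == ref_vocab,
            "same_tokens": tok.tokenize(text) == ref_tokens,
        }
    return report
```

With the three tokenizers loaded above, `compare_tokenizers({"distilbert": distilbert_tokenizer, "mobilebert": mobilebert_tokenizer, "tinybert": tinybert_tokenizer}, text)` returns one entry per model; any `False` value would mean the tokenized dataset cannot safely be reused across models.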

Tokenize the dataset on the title field:
def tokenize(batch):
    temp = distilbert_tokenizer(batch['title'], padding=True, truncation=True)
    return temp
encoded_dataset = dataset.map(tokenize, batch_size=None, batched=True)
Building and Training DistilBERT
Set up the labels and load DistilBERT with a classification head:
from transformers import AutoModelForSequenceClassification, AutoConfig
import torch
label2id = {"Real": 0, "Fake": 1}
id2label = {0: "Real", 1: "Fake"}
model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)
Define an accuracy metric and the training arguments:
import evaluate
import numpy as np
from transformers import TrainingArguments
accuracy = evaluate.load("accuracy")
def compute_metrics_evaluate(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
batch_size = 32
training_args = TrainingArguments(
    output_dir="train_dir",
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy='epoch'
)
Train the model:
from transformers import Trainer
trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics_evaluate,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=distilbert_tokenizer
)
trainer.train()
{'loss': 0.2132, 'epoch': 0.31}
{'loss': 0.1512, 'epoch': 0.94}
{'loss': 0.0764, 'epoch': 1.25}
{'loss': 0.0232, 'epoch': 2.5}
{'train_runtime': 363.4127, 'train_loss': 0.09221653908491134, 'epoch': 3.0}
Note
This Trainer is created without passing args=training_args, so it uses Hugging Face's default training arguments (3 epochs, batch size 8) — which is why the log reaches epoch 3.0. Pass args=training_args to the Trainer to use the settings defined above.
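The epoch values in the log follow directly from those defaults. Here is a quick sanity check, assuming a single device, about 12,796 training rows (18,281 minus the 30% hold-out), and the Trainer's default of logging every 500 steps:

```python
import math

n_train = 12796          # assumed training-set size: 18,281 rows minus the 30% hold-out
default_batch_size = 8   # Trainer default per-device batch size
default_epochs = 3       # Trainer default number of epochs
logging_steps = 500      # Trainer default logging interval

steps_per_epoch = math.ceil(n_train / default_batch_size)
total_steps = steps_per_epoch * default_epochs

# Epoch value reached at each logging step
logged_epochs = [
    step / steps_per_epoch
    for step in range(logging_steps, total_steps + 1, logging_steps)
]

print(steps_per_epoch, total_steps)
print([f"{e:.2f}" for e in logged_epochs[:4]])
```

The first logged value, 500 steps out of 1,600 per epoch, is the 0.31 seen in the first log line. With batch_size = 32 and two epochs, the same data would instead take 400 steps per epoch.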
Evaluating DistilBERT
Run the test set and inspect the metrics:
preds_output = trainer.predict(encoded_dataset['test'])
preds_output.metrics
{'test_loss': 0.19827575981616974, 'test_accuracy': 0.9595185995623632, 'test_runtime': 9.4297}
A per-class report shows balanced, high performance on both classes:
from sklearn.metrics import classification_report
y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = encoded_dataset['test'][:]['label']
print(classification_report(y_true, y_pred, target_names=list(label2id)))
precision recall f1-score support
Real 0.97 0.96 0.96 2072
Fake 0.95 0.96 0.95 1584
accuracy 0.96 3656
macro avg 0.96 0.96 0.96 3656
weighted avg 0.96 0.96 0.96 3656
96% accuracy from titles alone, using a model 40% smaller than BERT.
Benchmarking All Four Models
The real question is which model gives the best trade-off. Wrap training in a function and loop over all four checkpoints, timing each:
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
model_dict = {
    "bert-base": "bert-base-uncased",
    "distilbert": "distilbert-base-uncased",
    "mobilebert": "google/mobilebert-uncased",
    "tinybert": "huawei-noah/TinyBERT_General_4L_312D"
}
def train_model(model_name):
    model_ckpt = model_dict[model_name]
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)
    def local_tokenizer(batch):
        return tokenizer(batch['title'], padding=True, truncation=True)
    encoded_dataset = dataset.map(local_tokenizer, batched=True, batch_size=None)
    trainer = Trainer(
        model=model,
        compute_metrics=compute_metrics,
        train_dataset=encoded_dataset['train'],
        eval_dataset=encoded_dataset['validation'],
        tokenizer=tokenizer
    )
    trainer.train()
    return trainer.predict(encoded_dataset['test']).metrics
import time
model_performance = {}
for model_name in model_dict:
    print("Training Model: ", model_name)
    start = time.time()
    result = train_model(model_name)
    end = time.time()
    model_performance[model_name] = {model_name: result, "time taken": end - start}
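The nested model_performance dict is awkward to read directly. One way to flatten it into rows is sketched below. The `summarize` helper is hypothetical, and it assumes the metrics dict carries the `test_accuracy`, `test_f1`, and `test_runtime` keys that `trainer.predict` produces with the `compute_metrics` function above. Note that "time taken" is the wall-clock time of the whole `train_model` call, so it includes model loading, tokenization, and the test-set prediction as well as training.

```python
def summarize(model_performance):
    """Flatten the nested benchmark dict into one row per model."""
    rows = []
    for name, entry in model_performance.items():
        metrics = entry[name]  # metrics dict returned by trainer.predict(...).metrics
        rows.append({
            "model": name,
            "accuracy": round(metrics["test_accuracy"], 4),
            "f1": round(metrics["test_f1"], 4),
            "train_seconds": round(entry["time taken"], 1),
            "test_seconds": round(metrics["test_runtime"], 1),
        })
    # Fastest model first
    return sorted(rows, key=lambda row: row["train_seconds"])
```

Passing the result to `pd.DataFrame(summarize(model_performance))` gives a table that is easy to sort and print.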
Warning
MobileBERT can show large, unstable loss spikes early in training (values in the thousands) before settling. It still converges to strong accuracy, but it is the slowest and most finicky of the four to train.
The Results
Here is the benchmark, distilled into a single table:
| Model | Test accuracy | Weighted F1 | Training time | Test runtime |
|---|---|---|---|---|
| bert-base | 0.9584 | 0.9584 | 679.7 s | 12.3 s |
| distilbert | 0.9584 | 0.9585 | 365.1 s | 6.4 s |
| mobilebert | 0.9631 | 0.9631 | 902.3 s | 23.0 s |
| tinybert | 0.9524 | 0.9523 | 107.5 s | 3.0 s |
The takeaways are clear:
- TinyBERT is the speed champion: it trains in about 107 seconds (over 6× faster than BERT), runs the fastest inference, and gives up only about 0.6 percentage points of accuracy.
- DistilBERT matches BERT's accuracy exactly while training and predicting in roughly half the time.
- MobileBERT posts the highest accuracy, but it is the slowest of the four to train and to run, slower even than BERT itself.
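The ratios behind these takeaways can be recomputed from the table. The sketch below copies the table's figures by hand and treats bert-base as the baseline; the dictionary layout and key names are illustrative.

```python
# Figures copied from the results table above
results = {
    "bert-base":  {"accuracy": 0.9584, "train_s": 679.7, "test_s": 12.3},
    "distilbert": {"accuracy": 0.9584, "train_s": 365.1, "test_s": 6.4},
    "mobilebert": {"accuracy": 0.9631, "train_s": 902.3, "test_s": 23.0},
    "tinybert":   {"accuracy": 0.9524, "train_s": 107.5, "test_s": 3.0},
}

baseline = results["bert-base"]

comparison = {}
for name, r in results.items():
    comparison[name] = {
        # Values above 1 mean faster than BERT-base
        "train_speedup": baseline["train_s"] / r["train_s"],
        "inference_speedup": baseline["test_s"] / r["test_s"],
        # Accuracy difference in percentage points relative to BERT-base
        "accuracy_delta_pp": (r["accuracy"] - baseline["accuracy"]) * 100,
    }

for name, c in comparison.items():
    print(f"{name:<11} train x{c['train_speedup']:.2f}  "
          f"inference x{c['inference_speedup']:.2f}  "
          f"accuracy {c['accuracy_delta_pp']:+.2f} pp")
```

These numbers come from a single run on one dataset, so small accuracy differences between models should be read with some caution.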

Saving and Serving
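Save the trained model and reload it as a pipeline for one-line predictions. The trainer here is the DistilBERT one created earlier, since the trainers built inside train_model are local to that function: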
trainer.save_model("fake_news")
from transformers import pipeline
classifier = pipeline('text-classification', model='fake_news')
classifier("some text data")
[{'label': 'Fake', 'score': 0.9996247291564941}]
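For more than one title at a time, a thin wrapper around the classifier can return only the confident Fake predictions. This is a sketch under assumptions: `flag_fake` and the 0.9 threshold are illustrative choices, and the classifier is assumed to accept a list of strings and return one dict per input in the shape shown above.

```python
def flag_fake(classifier, titles, threshold=0.9):
    """Return (title, score) pairs labeled 'Fake' with at least `threshold` confidence.

    `classifier` is any callable that takes a list of strings and returns a
    list of {'label': ..., 'score': ...} dicts, one per input.
    """
    titles = list(titles)
    predictions = classifier(titles)
    return [
        (title, pred["score"])
        for title, pred in zip(titles, predictions)
        if pred["label"] == "Fake" and pred["score"] >= threshold
    ]
```

Calling `flag_fake(classifier, list_of_titles)` with the pipeline loaded above keeps the thresholding logic out of the calling code; the right threshold depends on how costly false alarms are for the application.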

Summary
You fine-tuned DistilBERT to detect fake news at 96% accuracy from titles alone, then benchmarked it against BERT, MobileBERT, and TinyBERT. The lesson: distilled models are not just smaller. DistilBERT and TinyBERT deliver near-identical accuracy at a fraction of BERT's training and inference cost, with TinyBERT offering the best speed for the accuracy it gives up, while MobileBERT trades speed for a slight accuracy gain.
Next, you move from classifying whole sentences to labeling individual tokens — fine-tuning DistilBERT for named entity recognition on restaurant search queries.