Fine-Tuning Distilled BERT for Fake News Detection

Fine-tune DistilBERT, MobileBERT, and TinyBERT to detect fake news, then benchmark their accuracy and speed against full BERT in a head-to-head comparison.

Jun 18, 2026 · 14 min read

Topics You Will Master

Loading and preparing a real-world fake-news dataset
Comparing the DistilBERT, MobileBERT, and TinyBERT tokenizers
Fine-tuning a distilled model for binary text classification
Benchmarking four models on accuracy, F1, and training time

In the previous tutorial you learned how DistilBERT, MobileBERT, and TinyBERT compress BERT. Now you put them to work: building a fake news detector and benchmarking all four models head-to-head to see which gives the best accuracy-per-second.

The task is binary classification — label a news article as Real or Fake — and the workflow mirrors the BERT sentiment tutorial, so you will recognize most of it.

Prerequisites: The BERT fine-tuning workflow and a Python environment with transformers, datasets, evaluate, scikit-learn, pandas, matplotlib, openpyxl, and torch. A GPU is recommended.

Loading the Dataset

The dataset is an Excel file of news articles with title, author, text, and a label column. Load it and drop rows with missing values:

PYTHON
import pandas as pd

df = pd.read_excel("https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/fake_news.xlsx")
df = df.dropna()
df.isnull().sum()
OUTPUT
id        0
title     0
author    0
text      0
label     0
dtype: int64

Check the class balance:

PYTHON
df['label'].value_counts()
OUTPUT
label
0    10361
1     7920
Name: count, dtype: int64

Label 0 is Real and 1 is Fake. With 18,281 articles in total, about 57% Real and 43% Fake, this is a reasonably balanced binary problem.
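
To put a number on the class balance, the class shares and the accuracy of always guessing the majority class can be computed from the counts printed above. This is a minimal sketch that hard-codes those two counts rather than reading the DataFrame; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"Real": 10361, "Fake": 7920}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# A classifier that always predicts the majority class scores this accuracy,
# so any fine-tuned model should clear it comfortably.
majority_baseline = max(shares.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

The baseline gives a floor for judging the accuracy figures reported later in the tutorial.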


Exploring the Data

A quick bar chart confirms the balance:

PYTHON
import matplotlib.pyplot as plt

label_counts = df['label'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

Estimate token counts (roughly 1.5 tokens per word) for both the short title and the long text to decide what to feed the model:

PYTHON
df['title_tokens'] = df['title'].apply(lambda x: len(x.split()) * 1.5)
df['text_tokens'] = df['text'].apply(lambda x: len(x.split()) * 1.5)

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].hist(df['title_tokens'], bins=50, color='skyblue')
ax[0].set_title("Title Tokens")
ax[1].hist(df['text_tokens'], bins=50, color='orange')
ax[1].set_title("Text Tokens")
plt.show()

Note

Article bodies often exceed BERT's 512-token limit, while titles are short. This tutorial classifies on the title alone — short, fast, and surprisingly effective for this task.
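
The word-count heuristic above can be turned into a quick check of how many inputs would be truncated. The sketch below is only an illustration: the helper names and the sample strings are hypothetical, the 1.5 tokens-per-word factor is reused from the histogram code, and a real tokenizer would give different counts.

```python
MAX_TOKENS = 512  # BERT-style models accept at most 512 tokens per input

def estimated_tokens(text, tokens_per_word=1.5):
    """Rough token estimate: whitespace-separated words times a fixed factor."""
    return len(text.split()) * tokens_per_word

def share_over_limit(texts, limit=MAX_TOKENS):
    """Fraction of texts whose estimated token count exceeds the limit."""
    if not texts:
        return 0.0
    over = sum(1 for t in texts if estimated_tokens(t) > limit)
    return over / len(texts)

# Hypothetical samples: a short title and a long article body of 400 words.
samples = ["Short headline about an election", "word " * 400]
print(share_over_limit(samples))
```

On the real data, the same function could be called with df['title'].tolist() and df['text'].tolist() to compare the two fields.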


Splitting and Building the Dataset

Split the data 70/20/10 into train, test, and validation sets, stratified by label, and wrap the result in a DatasetDict:

PYTHON
from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

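# Hold out 30% of the rows, keeping the Real/Fake ratio the same in both parts.
# No random_state is set, so the exact rows differ between runs.
train, test = train_test_split(df, test_size=0.3, stratify=df['label'])
# Split the held-out 30% into test (20% of all rows) and validation (10%).
test, validation = train_test_split(test, test_size=1/3, stratify=test['label'])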

dataset = DatasetDict({
    "train": Dataset.from_pandas(train, preserve_index=False),
    "test": Dataset.from_pandas(test, preserve_index=False),
    "validation": Dataset.from_pandas(validation, preserve_index=False)
})
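
The resulting split sizes can be checked with a little arithmetic. This sketch assumes scikit-learn rounds a fractional test_size up to the next whole row, and it uses the 18,281-row total from the class counts above; the function name is illustrative.

```python
import math

def split_sizes(n_rows, first_test_share=0.3, second_test_share=1 / 3):
    """Return (train, test, validation) row counts for the two-step split."""
    # First split: hold out 30% of all rows.
    held_out = math.ceil(n_rows * first_test_share)
    train = n_rows - held_out
    # Second split: a third of the held-out rows become validation.
    validation = math.ceil(held_out * second_test_share)
    test = held_out - validation
    return train, test, validation

train_n, test_n, val_n = split_sizes(18281)
print(train_n, test_n, val_n)
```

The test count this produces can be compared with the total support shown in the classification report later in the tutorial.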

Comparing the Distilled Tokenizers

Each model ships its own tokenizer. Load all three and compare how they split the same sentence:

PYTHON
from transformers import AutoTokenizer

text = "Machine learning is awesome!! Thanks KGP Talkie."

distilbert_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mobilebert_tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
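tinybert_tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

# Print the WordPiece tokens each tokenizer produces for the same sentence.
for name, tok in [("distilbert", distilbert_tokenizer),
                  ("mobilebert", mobilebert_tokenizer),
                  ("tinybert", tinybert_tokenizer)]:
    print(name, tok.tokenize(text))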

All three share the same 30,522-token WordPiece vocabulary and special tokens ([CLS], [SEP], [PAD], [MASK]), so the tokenized output is consistent across models.

Diagram comparing DistilBERT, MobileBERT, and TinyBERT tokenizing the same input into shared WordPiece tokens

The three distilled models share the same WordPiece vocabulary, so titles tokenize identically.
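
WordPiece splits a word it does not know into the longest pieces it can find in its vocabulary, marking continuation pieces with ##. The following is a toy sketch of that greedy longest-match-first idea with a tiny made-up vocabulary. It is not the Hugging Face implementation, and the real 30,522-entry vocabulary may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical mini-vocabulary, for illustration only.
vocab = {"machine", "learning", "awesome", "talk", "##ie", "kg", "##p"}
for w in ["awesome", "talkie", "kgp", "xyz"]:
    print(w, wordpiece(w, vocab))
```

Because the three tokenizers use the same vocabulary and the same splitting rule, the same title yields the same pieces whichever one is loaded.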

Tokenize the dataset on the title field:

PYTHON
def tokenize(batch):
    # Pad to the longest title in the batch and truncate anything over the model limit.
    return distilbert_tokenizer(batch['title'], padding=True, truncation=True)

encoded_dataset = dataset.map(tokenize, batch_size=None, batched=True)
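
With padding=True, shorter titles are filled with the [PAD] token up to the longest title in the batch, and an attention mask tells the model which positions are real. A minimal pure-Python sketch of that idea follows; the function name and the example ids are illustrative, and a pad id of 0 is assumed here.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to equal length and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Illustrative ids only; real ids come from the tokenizer.
batch = pad_batch([[101, 2054, 2003, 102], [101, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the map call uses batch_size=None, each split is processed as a single batch, so every title in a split is padded to that split's longest title.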

Building and Training DistilBERT

Set up the labels and load DistilBERT with a classification head:

PYTHON
from transformers import AutoModelForSequenceClassification, AutoConfig
import torch

label2id = {"Real": 0, "Fake": 1}
id2label = {0: "Real", 1: "Fake"}

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

Define an accuracy metric and the training arguments:

PYTHON
import evaluate
import numpy as np
from transformers import TrainingArguments

accuracy = evaluate.load("accuracy")

def compute_metrics_evaluate(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

batch_size = 32
training_args = TrainingArguments(
    output_dir="train_dir",
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    # Note: newer transformers releases name this argument eval_strategy.
    evaluation_strategy='epoch'
)
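
The metric function receives raw logits, one row per example and one column per class, and turns them into class ids by taking the position of the largest value. As a rough sketch of that step, here is the same calculation in plain Python with made-up logits; the helper names are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_from_logits(logits, labels):
    """Share of examples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four titles; columns are [Real, Fake].
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 3.0]]
labels = [0, 1, 1, 1]
print(accuracy_from_logits(logits, labels))
```

The third example is predicted Real but labeled Fake, so three of the four predictions are correct.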

Train the model:

PYTHON
from transformers import Trainer

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics_evaluate,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=distilbert_tokenizer
)

trainer.train()
OUTPUT
{'loss': 0.2132, 'epoch': 0.31}
{'loss': 0.1512, 'epoch': 0.94}
{'loss': 0.0764, 'epoch': 1.25}
{'loss': 0.0232, 'epoch': 2.5}
{'train_runtime': 363.4127, 'train_loss': 0.09221653908491134, 'epoch': 3.0}

Note

This Trainer is created without passing args=training_args, so it uses Hugging Face's default training arguments (3 epochs, batch size 8) — which is why the log reaches epoch 3.0. Pass args=training_args to the Trainer to use the settings defined above.
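
The logged epoch values can be reproduced from the defaults. This sketch assumes a single device, the default logging interval of 500 steps, and a 12,796-row training split (70% of 18,281 rows); the function names are illustrative.

```python
import math

def steps_per_epoch(n_train, batch_size):
    """Optimizer steps needed to see every training row once."""
    return math.ceil(n_train / batch_size)

def epoch_at_step(step, n_train, batch_size):
    """Fractional epoch reached after a given number of steps."""
    return round(step / steps_per_epoch(n_train, batch_size), 2)

N_TRAIN = 12796

# Defaults used by the Trainer above: batch size 8, 3 epochs.
default_steps = steps_per_epoch(N_TRAIN, 8)
logged = [epoch_at_step(s, N_TRAIN, 8) for s in (500, 1500, 2000, 4000)]
print(default_steps, logged)

# Settings from training_args: batch size 32, 2 epochs.
print(steps_per_epoch(N_TRAIN, 32) * 2)
```

The four computed values can be compared with the epoch entries in the log above. Passing args=training_args would cut the run from 4,800 steps to 800 under these assumptions.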


Evaluating DistilBERT

Run the test set and inspect the metrics:

PYTHON
preds_output = trainer.predict(encoded_dataset['test'])
preds_output.metrics
OUTPUT
{'test_loss': 0.19827575981616974, 'test_accuracy': 0.9595185995623632, 'test_runtime': 9.4297}

A per-class report shows balanced, high performance on both classes:

PYTHON
from sklearn.metrics import classification_report

y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = encoded_dataset['test'][:]['label']
print(classification_report(y_true, y_pred, target_names=list(label2id)))
OUTPUT
              precision    recall  f1-score   support

        Real       0.97      0.96      0.96      2072
        Fake       0.95      0.96      0.95      1584

    accuracy                           0.96      3656
   macro avg       0.96      0.96      0.96      3656
weighted avg       0.96      0.96      0.96      3656

96% accuracy from titles alone, using a model 40% smaller than BERT.


Benchmarking All Four Models

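The real question is which model gives the best trade-off. Wrap training in a function and loop over all four checkpoints, timing each. As in the DistilBERT run above, these Trainers are created without args, so they use the default training arguments, and the timer covers model loading, tokenization, training, and test-set prediction: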

PYTHON
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

model_dict = {
    "bert-base": "bert-base-uncased",
    "distilbert": "distilbert-base-uncased",
    "mobilebert": "google/mobilebert-uncased",
    "tinybert": "huawei-noah/TinyBERT_General_4L_312D"
}

def train_model(model_name):
    model_ckpt = model_dict[model_name]
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

    def local_tokenizer(batch):
        return tokenizer(batch['title'], padding=True, truncation=True)

    encoded_dataset = dataset.map(local_tokenizer, batched=True, batch_size=None)

    trainer = Trainer(
        model=model,
        compute_metrics=compute_metrics,
        train_dataset=encoded_dataset['train'],
        eval_dataset=encoded_dataset['validation'],
        tokenizer=tokenizer
    )
    trainer.train()
    return trainer.predict(encoded_dataset['test']).metrics

import time
model_performance = {}
for model_name in model_dict:
    print("Training Model: ", model_name)
    start = time.time()
    result = train_model(model_name)
    end = time.time()
    model_performance[model_name] = {model_name: result, "time taken": end - start}
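
The benchmark reports a weighted F1, which averages the per-class F1 scores in proportion to how many true examples each class has. A small pure-Python sketch of that calculation follows, on made-up labels. The tutorial code itself calls scikit-learn's f1_score with average="weighted"; the helper names here are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for cls in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == cls)
        score += support / total * f1_for_class(y_true, y_pred, cls)
    return score

# Made-up labels: 0 = Real, 1 = Fake.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 3))
```

With classes of similar size and similar per-class scores, weighted F1 lands close to accuracy.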

Warning

MobileBERT can show large, unstable loss spikes early in training (values in the thousands) before settling. It still converges to strong accuracy, but it is the slowest and most finicky of the four to train.


The Results

Here is the benchmark, distilled into a single table:

Model        Test accuracy   Weighted F1   Training time   Test runtime
bert-base    0.9584          0.9584        679.7 s         12.3 s
distilbert   0.9584          0.9585        365.1 s         6.4 s
mobilebert   0.9631          0.9631        902.3 s         23.0 s
tinybert     0.9524          0.9523        107.5 s         3.0 s

The takeaways are clear:

  • TinyBERT is the speed champion: it trains in about 107 seconds (over 6× faster than BERT) and has the fastest test runtime, while giving up only about 0.6 percentage points of accuracy.
  • DistilBERT matches BERT's test accuracy (0.9584) while training in roughly half the time (365 s versus 680 s).
  • MobileBERT posts the highest accuracy (0.9631) but is the slowest to train and run.
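
The ratios behind these takeaways can be recomputed from the table. The sketch below copies the test accuracy and training time columns by hand and reports each model's training speedup and accuracy gap relative to bert-base; the dictionary layout is just one convenient choice.

```python
# (test accuracy, training time in seconds), copied from the results table.
results = {
    "bert-base": (0.9584, 679.7),
    "distilbert": (0.9584, 365.1),
    "mobilebert": (0.9631, 902.3),
    "tinybert": (0.9524, 107.5),
}

base_acc, base_time = results["bert-base"]

summary = {}
for name, (acc, seconds) in results.items():
    summary[name] = {
        # Values above 1 mean faster training than bert-base.
        "speedup": round(base_time / seconds, 2),
        # Accuracy difference in percentage points.
        "acc_gap_pp": round((acc - base_acc) * 100, 2),
    }

for name, row in summary.items():
    print(f"{name:<11} {row['speedup']:>5}x  {row['acc_gap_pp']:+.2f} pp")
```

These figures come from a single run per model, so small accuracy gaps should be read with that in mind.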

Diagram comparing the four models on the accuracy versus speed trade-off, highlighting TinyBERT for speed

TinyBERT wins on speed, DistilBERT matches BERT's accuracy in roughly half the training time, and MobileBERT is the most accurate but the slowest.


Saving and Serving

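Save the trained model and reload it as a pipeline for one-line predictions. Here trainer is the DistilBERT trainer from the earlier section, because the trainers created inside train_model are local to that function: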

PYTHON
trainer.save_model("fake_news")

from transformers import pipeline

classifier = pipeline('text-classification', model='fake_news')
classifier("some text data")
OUTPUT
[{'label': 'Fake', 'score': 0.9996247291564941}]

Diagram of the deployment path: fine-tuned distilled model saved and served via a Hugging Face pipeline

Save the fine-tuned model and serve predictions through a pipeline in a single line.
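
The score returned by the pipeline is a probability derived from the model's two output logits. As an illustration, and assuming the pipeline applies a softmax for this two-class setup, the conversion looks roughly like this in plain Python; the logits are made up and the function name is hypothetical.

```python
import math

id2label = {0: "Real", 1: "Fake"}

def top_prediction(logits):
    """Softmax the logits and return the best label with its probability."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

# Made-up logits in [Real, Fake] order.
print(top_prediction([-4.0, 4.0]))
```

A score near 1.0, like the one in the output above, means the two logits are far apart. It does not by itself mean the prediction is correct.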


Summary

You fine-tuned a distilled BERT to detect fake news at 96% accuracy from titles alone, then benchmarked all four models. The lesson: distilled models are not just smaller — they deliver near-identical accuracy at a fraction of the training and inference cost, with TinyBERT offering the best speed-per-accuracy for production.

Next, you move from classifying whole sentences to labeling individual tokens — fine-tuning DistilBERT for named entity recognition on restaurant search queries.
