Fine-Tuning Distilled BERT for Fake News Detection

In the previous tutorial we learned how DistilBERT, MobileBERT, and TinyBERT compress BERT. Now we put them to work. We build a fake news detector and benchmark the three distilled models against the original BERT, four models in total. This shows us which one gives the best trade-off between accuracy and training and inference time.

The task is binary classification. We label a news article as Real or Fake. The workflow mirrors the BERT sentiment tutorial, so most of it will look familiar.

Prerequisites: Familiarity with the BERT fine-tuning workflow and a Python environment with transformers, datasets, evaluate, scikit-learn, pandas, matplotlib, openpyxl, and torch. A GPU is recommended.

Loading the Dataset

The dataset is an Excel file of news articles. It has id, title, author, and text columns, plus a label column. We load it and drop rows with missing values:

PYTHON

import pandas as pd

df = pd.read_excel("https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/fake_news.xlsx")
df = df.dropna()
df.isnull().sum()

OUTPUT

id        0
title     0
author    0
text      0
label     0
dtype: int64

Now we check the class balance:

PYTHON

df['label'].value_counts()

OUTPUT

label
0    10361
1     7920
Name: count, dtype: int64

In this dataset, label 0 is Real and 1 is Fake. With 10,361 real and 7,920 fake articles (18,281 in total, roughly 57% to 43%), it is a fairly balanced binary problem.
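
The counts also give a useful reference point: a classifier that always predicts the majority class. The sketch below computes the class shares and that baseline accuracy from the counts printed above; the variable names are only illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"Real": 10361, "Fake": 7920}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Always predicting the most frequent class gets this accuracy for free.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(f"total articles: {total}")
for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"majority-class baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any fine-tuned model should clear this baseline by a wide margin before its accuracy means much.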

Exploring the Data

A quick bar chart confirms the balance:

PYTHON

import matplotlib.pyplot as plt

label_counts = df['label'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

We estimate token counts for both the short title and the long text. A rough rule is 1.5 tokens per word. This helps us decide what to feed the model:

PYTHON

df['title_tokens'] = df['title'].apply(lambda x: len(x.split()) * 1.5)
df['text_tokens'] = df['text'].apply(lambda x: len(x.split()) * 1.5)

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].hist(df['title_tokens'], bins=50, color='skyblue')
ax[0].set_title("Title Tokens")
ax[1].hist(df['text_tokens'], bins=50, color='orange')
ax[1].set_title("Text Tokens")
plt.show()

Note

Article bodies often go over BERT's 512-token limit, while titles are short. So this tutorial classifies on the title alone. It is short, fast, and surprisingly effective for this task.
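
To put a number on that, one could measure what share of titles and bodies exceed the limit under the 1.5-tokens-per-word estimate. This is a minimal sketch on made-up strings. The helpers estimate_tokens and share_over_limit are hypothetical, and a stricter check would count tokens with the tokenizer itself.

```python
MAX_TOKENS = 512          # BERT's maximum sequence length
TOKENS_PER_WORD = 1.5     # rough rule of thumb used above

def estimate_tokens(text: str) -> float:
    """Estimate the token count from the whitespace word count."""
    return len(text.split()) * TOKENS_PER_WORD

def share_over_limit(texts, limit=MAX_TOKENS) -> float:
    """Fraction of texts whose estimated token count exceeds the limit."""
    texts = list(texts)
    if not texts:
        return 0.0
    over = sum(estimate_tokens(t) > limit for t in texts)
    return over / len(texts)

# Toy data standing in for df['title'] and df['text'].
titles = ["Senate passes budget bill", "Aliens endorse candidate, sources say"]
bodies = ["word " * 400, "word " * 100, "word " * 900, "word " * 50]

print("titles over limit:", share_over_limit(titles))
print("bodies over limit:", share_over_limit(bodies))
```

With the real DataFrame, the same helper could be called as share_over_limit(df['text']).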

Splitting and Building the Dataset

We split 70/20/10, stratified by label, and wrap it in a DatasetDict:

PYTHON

from sklearn.model_selection import train_test_split
from datasets import Dataset, DatasetDict

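# Hold out 30% of the rows, keeping the Real/Fake ratio in both parts.
train, test = train_test_split(df, test_size=0.3, stratify=df['label'])
# Split the hold-out 2:1 into test (20% of all rows) and validation (10%).
test, validation = train_test_split(test, test_size=1/3, stratify=test['label'])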

dataset = DatasetDict({
    "train": Dataset.from_pandas(train, preserve_index=False),
    "test": Dataset.from_pandas(test, preserve_index=False),
    "validation": Dataset.from_pandas(validation, preserve_index=False)
})
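
The two-step split works out to 70/20/10 because one third of the 30% hold-out is 10% of the whole. The arithmetic below is a sketch that assumes the held-out size is rounded up to a whole row, as scikit-learn does for a fractional test_size. The resulting 3,656 test rows agree with the support shown in the classification report in the evaluation section.

```python
import math

n_total = 18281  # rows left after dropna(), from the class counts above

# First split: 30% held out, the rest is training data.
n_holdout = math.ceil(0.3 * n_total)
n_train = n_total - n_holdout

# Second split: one third of the hold-out becomes validation.
n_validation = math.ceil(n_holdout / 3)
n_test = n_holdout - n_validation

for name, n in [("train", n_train), ("test", n_test), ("validation", n_validation)]:
    print(f"{name:<10} {n:>6}  {n / n_total:.1%}")
```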

Comparing the Distilled Tokenizers

Each model ships its own tokenizer. We load all three and compare how they split the same sentence:

PYTHON

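from transformers import AutoTokenizer

text = "Machine learning is awesome!! Thanks KGP Talkie."

distilbert_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
mobilebert_tokenizer = AutoTokenizer.from_pretrained("google/mobilebert-uncased")
tinybert_tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

# Print the vocabulary size and the WordPiece tokens from each tokenizer.
for name, tok in [("DistilBERT", distilbert_tokenizer),
                  ("MobileBERT", mobilebert_tokenizer),
                  ("TinyBERT", tinybert_tokenizer)]:
    print(name, tok.vocab_size, tok.tokenize(text))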

All three share the same 30,522-token WordPiece vocabulary and the same special tokens ([CLS], [SEP], [PAD], [MASK]). So the tokenized output is the same across models.

Figure: The three distilled models share the same WordPiece vocabulary, so titles tokenize identically.
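
Identical tokens follow from how WordPiece works: each word is split greedily into the longest pieces found in the vocabulary, so the same vocabulary yields the same pieces. Below is a minimal sketch of that greedy longest-match-first rule with a tiny made-up vocabulary. It is an illustration, not the tokenizers' actual implementation, and it skips lowercasing and punctuation handling.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest match."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical); the real one has 30,522 entries.
toy_vocab = {"machine", "learning", "talk", "##ie", "k", "##g", "##p"}

print(wordpiece("machine", toy_vocab))  # ['machine']
print(wordpiece("talkie", toy_vocab))   # ['talk', '##ie']
print(wordpiece("kgp", toy_vocab))      # ['k', '##g', '##p']
print(wordpiece("xyz", toy_vocab))      # ['[UNK]']
```

Because the procedure depends only on the vocabulary, any two tokenizers that share a vocabulary and the same preprocessing produce the same pieces.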

Now we tokenize the dataset on the title field:

PYTHON

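def tokenize(batch):
    # With batch_size=None below, each split is one batch, so padding=True
    # pads every title to the longest title in that split.
    return distilbert_tokenizer(batch['title'], padding=True, truncation=True)

encoded_dataset = dataset.map(tokenize, batch_size=None, batched=True)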

Building and Training DistilBERT

We set up the labels and load DistilBERT with a classification head:

PYTHON

from transformers import AutoModelForSequenceClassification, AutoConfig
import torch

label2id = {"Real": 0, "Fake": 1}
id2label = {0: "Real", 1: "Fake"}

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

We define an accuracy metric and the training arguments:

PYTHON

import evaluate
import numpy as np
from transformers import TrainingArguments

accuracy = evaluate.load("accuracy")

def compute_metrics_evaluate(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

batch_size = 32
training_args = TrainingArguments(
    output_dir="train_dir",
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy='epoch'  # named eval_strategy in newer transformers releases
)

Now we train the model:

PYTHON

from transformers import Trainer

# Note: args=training_args is not passed here, so the default arguments apply.
trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics_evaluate,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    tokenizer=distilbert_tokenizer
)

trainer.train()

OUTPUT

{'loss': 0.2132, 'epoch': 0.31}
{'loss': 0.1512, 'epoch': 0.94}
{'loss': 0.0764, 'epoch': 1.25}
{'loss': 0.0232, 'epoch': 2.5}
{'train_runtime': 363.4127, 'train_loss': 0.09221653908491134, 'epoch': 3.0}

Note

This Trainer is created without args=training_args, so it falls back to Hugging Face's default training arguments: 3 epochs and a per-device batch size of 8. That is why the log reaches epoch 3.0 instead of 2.0. Pass args=training_args to the Trainer to use the settings defined above.
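
The epoch values in the log are consistent with those defaults. As a rough check, the arithmetic below assumes a single device, a training split of 12,796 rows (70% of 18,281), and the default of logging every 500 steps.

```python
import math

n_train = 12796        # 70% of 18,281 rows (assumed training split size)
logging_steps = 500    # default logging interval in TrainingArguments

def steps_per_epoch(n_rows, batch_size):
    """Optimizer steps per epoch on one device, without gradient accumulation."""
    return math.ceil(n_rows / batch_size)

default_steps = steps_per_epoch(n_train, batch_size=8)    # defaults actually used
intended_steps = steps_per_epoch(n_train, batch_size=32)  # settings in training_args

print("default run: ", default_steps, "steps/epoch,", default_steps * 3, "total")
print("intended run:", intended_steps, "steps/epoch,", intended_steps * 2, "total")

# Epoch value shown at each logging step under the defaults.
logged_epochs = [round(step / default_steps, 2)
                 for step in range(logging_steps, default_steps * 3 + 1, logging_steps)]
print(logged_epochs)
```

The 0.31, 0.94, 1.25, and 2.5 entries from the log all appear in that list, which supports the batch-size-8 explanation.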

Evaluating DistilBERT

We run the test set and inspect the metrics:

PYTHON

preds_output = trainer.predict(encoded_dataset['test'])
preds_output.metrics

OUTPUT

{'test_loss': 0.19827575981616974, 'test_accuracy': 0.9595185995623632, 'test_runtime': 9.4297}

A per-class report shows balanced, high performance on both classes:

PYTHON

from sklearn.metrics import classification_report

y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = encoded_dataset['test'][:]['label']
print(classification_report(y_true, y_pred, target_names=list(label2id)))

OUTPUT

              precision    recall  f1-score   support

        Real       0.97      0.96      0.96      2072
        Fake       0.95      0.96      0.95      1584

    accuracy                           0.96      3656
   macro avg       0.96      0.96      0.96      3656
weighted avg       0.96      0.96      0.96      3656

The model reaches 96% test accuracy from titles alone, and it is about 40% smaller than BERT.
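
For reference, the report's columns can be computed by hand from the predictions. This is a small sketch on toy labels (not the tutorial's data) showing per-class precision, recall, and F1, plus the support-weighted F1 that the benchmark section uses. The helper per_class_metrics is hypothetical.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, F1, and support, treating `positive` as the target class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

# Toy labels: 0 = Real, 1 = Fake.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

weighted_f1 = 0.0
for cls, name in [(0, "Real"), (1, "Fake")]:
    p, r, f1, support = per_class_metrics(y_true, y_pred, positive=cls)
    weighted_f1 += f1 * support / len(y_true)  # weight each class by its support
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={support}")
print(f"weighted F1: {weighted_f1:.2f}")
```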

Benchmarking All Four Models

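The real question is which model gives the best trade-off. So we wrap training in a function. Then we loop over all four checkpoints and time each one. As in the run above, the Trainer is built without args, so every model trains with the default arguments: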

PYTHON

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

model_dict = {
    "bert-base": "bert-base-uncased",
    "distilbert": "distilbert-base-uncased",
    "mobilebert": "google/mobilebert-uncased",
    "tinybert": "huawei-noah/TinyBERT_General_4L_312D"
}

def train_model(model_name):
    model_ckpt = model_dict[model_name]
    tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
    config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
    model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

    def local_tokenizer(batch):
        return tokenizer(batch['title'], padding=True, truncation=True)

    encoded_dataset = dataset.map(local_tokenizer, batched=True, batch_size=None)

    trainer = Trainer(
        model=model,
        compute_metrics=compute_metrics,
        train_dataset=encoded_dataset['train'],
        eval_dataset=encoded_dataset['validation'],
        tokenizer=tokenizer
    )
    trainer.train()
    return trainer.predict(encoded_dataset['test']).metrics

import time
model_performance = {}
for model_name in model_dict:
    print("Training Model: ", model_name)
    start = time.time()
    result = train_model(model_name)
    end = time.time()
    model_performance[model_name] = {model_name: result, "time taken": end - start}

Warning

MobileBERT can show large, unstable loss spikes early in training, with values in the thousands, before it settles. It still reaches strong accuracy. But it is the slowest and most finicky of the four to train.

The Results

Here is the benchmark in a single table:

Model        Test accuracy   Weighted F1   Training time   Test runtime
bert-base    0.9584          0.9584        679.7 s         12.3 s
distilbert   0.9584          0.9585        365.1 s         6.4 s
mobilebert   0.9631          0.9631        902.3 s         23.0 s
tinybert     0.9524          0.9523        107.5 s         3.0 s

The takeaways are clear:

TinyBERT is the speed champion. It trains in 107 seconds, over 6× faster than BERT, and has the fastest inference (3.0 s on the test set against 12.3 s). It gives up only about 0.6 percentage points of accuracy.
DistilBERT matches BERT's test accuracy (0.9584) while training in roughly half the time (365 s against 680 s).
MobileBERT reaches the highest accuracy (0.9631), but it is the slowest to train and to run.

Figure: TinyBERT wins on speed, DistilBERT matches BERT's accuracy at about half the cost, and MobileBERT is the most accurate but slowest.
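
The ratios behind these takeaways come straight from the table. A short sketch that recomputes them relative to bert-base, with the numbers copied from the table and a dictionary layout chosen only for illustration:

```python
# (test accuracy, training time in s, test runtime in s) from the table above.
results = {
    "bert-base":  (0.9584, 679.7, 12.3),
    "distilbert": (0.9584, 365.1, 6.4),
    "mobilebert": (0.9631, 902.3, 23.0),
    "tinybert":   (0.9524, 107.5, 3.0),
}

base_acc, base_train, base_test = results["bert-base"]

for name, (acc, train_s, test_s) in results.items():
    train_speedup = base_train / train_s       # above 1 means faster than BERT
    test_speedup = base_test / test_s
    acc_delta_pts = (acc - base_acc) * 100     # percentage points vs BERT
    print(f"{name:<11} train x{train_speedup:.2f}  "
          f"inference x{test_speedup:.2f}  accuracy {acc_delta_pts:+.2f} pts")
```

A speedup below 1, as for MobileBERT, means the model is slower than the BERT baseline in this benchmark.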

Saving and Serving

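We save the trained model and reload it as a pipeline for one-line predictions. Here trainer is the DistilBERT trainer from the training section, since the benchmark trainers live inside train_model: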

PYTHON

trainer.save_model("fake_news")

from transformers import pipeline

classifier = pipeline('text-classification', model='fake_news')
classifier("some text data")

OUTPUT

[{'label': 'Fake', 'score': 0.9996247291564941}]

Figure: Save the fine-tuned model and serve predictions through a pipeline in a single line.
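
The score in the pipeline output is a probability. For a two-label model like this one, the text-classification pipeline by default turns the raw logits into probabilities with a softmax and reports the top label. A minimal sketch with made-up logits (not taken from the model above):

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "Real", 1: "Fake"}
logits = [-3.5, 4.2]                          # hypothetical model output for one title

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": probs[best]})
```

A score close to 1.0 therefore means the two logits are far apart, not that the prediction is guaranteed to be correct.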

Summary

That is how we detect fake news with distilled BERT models. We fine-tuned DistilBERT to 96% accuracy from titles alone. Then we benchmarked all four models. The lesson is simple. Distilled models are not just smaller. DistilBERT and TinyBERT give near-identical accuracy at a fraction of the training and inference cost, while MobileBERT trades speed for a slightly higher score. In this benchmark, TinyBERT gives up the least accuracy for the most speed, which makes it a strong candidate for production.

Next, we move from classifying whole sentences to labeling single tokens. We will fine-tune DistilBERT for named entity recognition on restaurant search queries.
