Fine-tuning adapts a pretrained model to your own data. In this tutorial you take bert-base-uncased and teach it to classify the emotion of a tweet — one of six classes: sadness, joy, love, anger, fear, or surprise.
You will use the Hugging Face Transformers Trainer API, which handles the training loop, evaluation, and checkpointing for you. By the end you will have a saved model that predicts emotion from raw text in one line.
Prerequisites: A grasp of BERT's architecture and a Python environment with transformers, datasets, evaluate, scikit-learn, and torch installed, plus pandas, numpy, matplotlib, and seaborn for the data handling and plots. A GPU is strongly recommended for training.
Loading the Dataset
The dataset is a CSV of 16,000 tweets, each labeled with an emotion. Load it with pandas:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/twitter_multi_class_sentiment.csv")
df.info()
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   text        16000 non-null  object
 1   label       16000 non-null  int64
 2   label_name  16000 non-null  object
text 0
label 0
label_name 0
dtype: int64
There are three columns — the raw text, an integer label, and a human-readable label_name — and no missing values. Check how many examples each emotion has:
df['label'].value_counts()
label
1 5362
0 4666
3 2159
4 1937
2 1304
5 572
Name: count, dtype: int64
Note
The classes are imbalanced — surprise (label 5) has only 572 examples versus 5,362 for joy (label 1). That imbalance shows up later in the per-class scores.
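To put a number on the imbalance, the counts printed above can be turned into class shares. This is a minimal sketch in plain Python: the counts are copied from the value_counts() output, the emotion names follow the label and label_name pairing used in this tutorial, and the variable names are illustrative.

```python
# Counts copied from the value_counts() output above, keyed by emotion name.
counts = {
    "joy": 5362,
    "sadness": 4666,
    "anger": 2159,
    "fear": 1937,
    "love": 1304,
    "surprise": 572,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name:<9}{share:6.1%}")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"most/least frequent: {imbalance_ratio:.1f}x")
```

Joy is roughly nine times as frequent as surprise, so a model can score well overall while still missing many surprise tweets.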
Exploring the Data
A horizontal bar chart shows the class frequencies at a glance:
import matplotlib.pyplot as plt
label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()
It is also worth checking tweet length, since BERT has a maximum input size. Add a word-count column and box-plot it by class:
df['Words per Tweet'] = df['text'].str.split().apply(len)
df.boxplot("Words per Tweet", by="label_name")
Tokenization
BERT cannot take raw strings — text must be tokenized into integer IDs. Load the matching tokenizer with AutoTokenizer:
from transformers import AutoTokenizer
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "I love machine learning! Tokenization is awesome!!"
encoded_text = tokenizer(text)
print(encoded_text)
{'input_ids': [101, 1045, 2293, 3698, 4083, 999, 19204, 3989, 2003, 12476, 999, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The input_ids start with 101 ([CLS]) and end with 102 ([SEP]). Inspect the vocabulary size and the model's maximum sequence length:
len(tokenizer.vocab), tokenizer.vocab_size, tokenizer.model_max_length
(30522, 30522, 512)

Figure: Tokenization converts raw text into input IDs framed by the [CLS] and [SEP] special tokens.
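To see why "Tokenization" becomes two IDs in the output above, here is a toy sketch of greedy longest-match sub-word splitting in the style of WordPiece. It is not the real tokenizer: the small vocabulary is an assumption, with IDs read off the printed input_ids (plus 100 for [UNK]), and the function names are illustrative.

```python
import re

# Toy vocabulary (token -> ID). The IDs are inferred from the printed
# input_ids above; this is not the real 30,522-entry vocabulary.
TOY_VOCAB = {
    "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "i": 1045, "love": 2293, "machine": 3698, "learning": 4083,
    "!": 999, "token": 19204, "##ization": 3989, "is": 2003, "awesome": 12476,
}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def encode(text, vocab=TOY_VOCAB):
    """Lowercase, split into words and punctuation, then frame with [CLS] and [SEP]."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = ["[CLS]"]
    for word in words:
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode("I love machine learning! Tokenization is awesome!!")
print(tokens)
print(ids)
```

With this toy vocabulary, "tokenization" has no entry of its own, so it is split into "token" followed by the continuation piece "##ization".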
Train/Test/Validation Split
Split the data into 70% train, 20% test, and 10% validation, stratified by class so each split keeps the same emotion distribution:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3, stratify=df['label_name'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label_name'])
train.shape, test.shape, validation.shape
((11200, 4), (3200, 4), (1600, 4))
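The two test_size values can look odd next to the 70/20/10 target. The arithmetic below is a small sketch using exact fractions to show how the second split carves the 30% holdout into 20% test and 10% validation; the variable names are illustrative.

```python
from fractions import Fraction

n_rows = 16000
first_test_size = Fraction(3, 10)   # test_size=0.3 in the first call
second_test_size = Fraction(1, 3)   # test_size=1/3 in the second call

train_share = 1 - first_test_size
validation_share = first_test_size * second_test_size  # one third of the 30% holdout
test_share = first_test_size - validation_share        # the remaining two thirds

for name, share in [("train", train_share), ("test", test_share), ("validation", validation_share)]:
    print(name, share, int(share * n_rows))
```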
Convert the pandas splits into a Hugging Face DatasetDict, the format the Trainer expects:
from datasets import Dataset, DatasetDict
dataset = DatasetDict({
    'train': Dataset.from_pandas(train, preserve_index=False),
    'test': Dataset.from_pandas(test, preserve_index=False),
    'validation': Dataset.from_pandas(validation, preserve_index=False)
})
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 11200
    })
    test: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 3200
    })
    validation: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 1600
    })
})
Tokenizing the Whole Dataset
Define a tokenize function with padding and truncation, then map it over every split at once:
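def tokenize(batch):
    # Pad to the longest sequence in the batch; truncate anything over the model's maximum length
    return tokenizer(batch['text'], padding=True, truncation=True)
# batch_size=None hands each split to tokenize() as one batch, so every example in a split is padded to the same length
emotion_encoded = dataset.map(tokenize, batched=True, batch_size=None)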
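What padding and truncation do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: it assumes 0 is the [PAD] ID and 102 the [SEP] ID (as in bert-base-uncased), and pad_and_truncate is a hypothetical helper.

```python
PAD_ID, SEP_ID = 0, 102  # assumed [PAD] and [SEP] IDs for bert-base-uncased

def pad_and_truncate(batch_ids, max_length):
    """Cut sequences longer than max_length (keeping a final [SEP]), then pad to the longest."""
    truncated = [
        ids if len(ids) <= max_length else ids[: max_length - 1] + [SEP_ID]
        for ids in batch_ids
    ]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in truncated]
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 1045, 2293, 102],
    [101, 1045, 2293, 3698, 4083, 999, 102],
]
encoded = pad_and_truncate(batch, max_length=6)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why the tokenizer returns it alongside input_ids.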
Build the label-to-ID mappings the model needs to report human-readable predictions:
label2id = {x['label_name']: x['label'] for x in dataset['train']}
id2label = {v: k for k, v in label2id.items()}
label2id, id2label
({'love': 2, 'joy': 1, 'sadness': 0, 'fear': 4, 'anger': 3, 'surprise': 5}, {2: 'love', 1: 'joy', 0: 'sadness', 4: 'fear', 3: 'anger', 5: 'surprise'})
Building the Model
Load BERT with a classification head sized to the number of labels. AutoModelForSequenceClassification adds that head on top of the pretrained [CLS] output:
from transformers import AutoModelForSequenceClassification, AutoConfig
import torch
num_labels = len(label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)
Important
You will see a warning that classifier.bias and classifier.weight are "newly initialized." That is expected: the classification head starts with random weights, and fine-tuning trains it together with the pretrained BERT layers.

Figure: AutoModelForSequenceClassification adds a head that maps BERT's [CLS] output to the six emotion classes.
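Conceptually, the head is a single linear layer that turns one sentence vector into one score (logit) per class. The sketch below shows that mapping in plain Python under loud assumptions: a hidden size of 4 instead of BERT's 768, small random weights standing in for the newly initialized parameters, and a hypothetical classification_head function. It is not the library's implementation.

```python
import random

HIDDEN_SIZE = 4   # illustrative only; bert-base-uncased uses 768
NUM_LABELS = 6    # the six emotion classes

random.seed(0)
# Small random weights and a zero bias stand in for the "newly initialized" head.
weight = [[random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
bias = [0.0] * NUM_LABELS

def classification_head(sentence_vector):
    """Map one sentence vector to one logit per class: logits = W x + b."""
    return [
        sum(w * x for w, x in zip(row, sentence_vector)) + b
        for row, b in zip(weight, bias)
    ]

logits = classification_head([0.5, -1.0, 0.25, 2.0])
print(len(logits))  # one logit per emotion class
```

Before training, these logits carry no information about emotion, which is why the warning about newly initialized weights is expected.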
Training Arguments and Metrics
Configure the training run. A learning rate of 2e-5 and 2 epochs are solid defaults for BERT fine-tuning:
from transformers import TrainingArguments
batch_size = 64
training_dir = "bert_base_train_dir"
training_args = TrainingArguments(
    output_dir=training_dir,
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    disable_tqdm=False
)
Warning
In recent versions of Transformers, evaluation_strategy was renamed to eval_strategy. If you get a deprecation warning or error, use eval_strategy='epoch' instead.
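These settings determine how many optimizer updates the run performs. The sketch below works through the arithmetic, assuming a single device, no gradient accumulation, and a linear decay of the learning rate to zero with no warmup (which, to my knowledge, is the Trainer default). The function name is illustrative.

```python
import math

train_rows = 11200   # size of the training split
batch_size = 64
num_epochs = 2
peak_lr = 2e-5

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_decay_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero and no warmup."""
    return peak_lr * max(0.0, 1 - step / total_steps)

print(steps_per_epoch, total_steps)
print(linear_decay_lr(0), linear_decay_lr(total_steps // 2), linear_decay_lr(total_steps))
```

Under these assumptions the run makes 175 updates per epoch, and the learning rate has halved by the end of the first epoch.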
Define a metric function that reports both accuracy and weighted F1 using scikit-learn:
from sklearn.metrics import accuracy_score, f1_score
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)  # highest-scoring class per example
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
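To make "weighted F1" concrete, the sketch below recomputes accuracy and weighted F1 in plain Python on a made-up six-example batch. It illustrates the definition (per-class F1 averaged with weights equal to each class's share of the true labels) and is not scikit-learn's implementation; the function names and toy labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)

# Made-up labels: class 0 is common, classes 1 and 2 are rare.
toy_true = [0, 0, 0, 0, 1, 2]
toy_pred = [0, 0, 0, 0, 2, 2]
toy_accuracy = accuracy(toy_true, toy_pred)
toy_f1 = weighted_f1(toy_true, toy_pred)
print(f"accuracy={toy_accuracy:.3f} weighted_f1={toy_f1:.3f}")
```

In this toy batch the single missed example belongs to a rare class, so weighted F1 comes out lower than accuracy; reporting both gives a fuller picture on imbalanced data.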
Training
Assemble the Trainer with the model, arguments, metric function, datasets, and tokenizer, then train:
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_encoded['train'],
    eval_dataset=emotion_encoded['validation'],
    tokenizer=tokenizer
)
trainer.train()
{'eval_loss': 0.4704, 'eval_accuracy': 0.85125, 'eval_f1': 0.84068, 'epoch': 1.0}
{'eval_loss': 0.2952, 'eval_accuracy': 0.909375, 'eval_f1': 0.90793, 'epoch': 2.0}
{'train_runtime': 1374.5377, 'train_loss': 0.67778, 'epoch': 2.0}
Validation accuracy climbs from 85% after the first epoch to 91% after the second — a clear sign the model is learning.
Evaluating the Model
Run the held-out test set through the trained model:
preds_output = trainer.predict(emotion_encoded['test'])
preds_output.metrics
{'test_loss': 0.2910054922103882, 'test_accuracy': 0.9028125, 'test_f1': 0.9010784813634883, 'test_runtime': 78.7905}
A per-class classification report shows where the model is strong and weak:
import numpy as np
from sklearn.metrics import classification_report
y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = emotion_encoded['test'][:]['label']
print(classification_report(y_true, y_pred))
              precision    recall  f1-score   support
           0       0.93      0.97      0.95       933
           1       0.91      0.92      0.91      1072
           2       0.79      0.74      0.76       261
           3       0.94      0.93      0.93       432
           4       0.86      0.87      0.87       387
           5       0.89      0.61      0.72       115
    accuracy                           0.90      3200
   macro avg       0.89      0.84      0.86      3200
weighted avg       0.90      0.90      0.90      3200
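The macro and weighted averages in the report differ because they weight the classes differently. As a quick check, the sketch below recomputes both recall averages from the rounded per-class values printed above; the variable names are illustrative.

```python
# Per-class recall and support copied from the classification report above.
recall = [0.97, 0.92, 0.74, 0.93, 0.87, 0.61]
support = [933, 1072, 261, 432, 387, 115]

# Macro: every class counts equally, so rare classes pull the average down.
macro_recall = sum(recall) / len(recall)

# Weighted: each class counts in proportion to its number of test examples.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The macro figure (0.84) sits well below the weighted one (0.90) because the weak recall on the small classes counts as much as the strong recall on the large ones.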
As expected from the class imbalance, the rare classes — love (2) and surprise (5) — have the lowest recall. A confusion matrix makes the mistakes visible:
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(5, 5))
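# confusion_matrix orders rows and columns by label ID (0-5), so the tick labels must follow that order rather than the insertion order of label2id
class_names = [id2label[i] for i in range(len(id2label))]
sns.heatmap(cm, annot=True, xticklabels=class_names, yticklabels=class_names, fmt='d', cbar=False, cmap='Reds')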
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
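For reference, a confusion matrix is just a table of counts indexed by (actual, predicted). The sketch below builds one in plain Python on made-up labels and derives per-class recall from its rows; confusion_matrix_counts and the toy data are hypothetical and not part of the tutorial's pipeline.

```python
def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
cm_toy = confusion_matrix_counts(toy_true, toy_pred, num_classes=3)
for row in cm_toy:
    print(row)

# Recall per class is the diagonal entry divided by its row total.
recall_per_class = [cm_toy[i][i] / sum(cm_toy[i]) for i in range(3)]
print(recall_per_class)
```

Off-diagonal cells show which pairs of classes get confused with each other, a detail the classification report alone does not reveal.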
Prediction and Saving
Wrap inference in a small helper that returns the predicted emotion name:
text = "I am super happy today. I got it done. Finally!!"
def get_prediction(text):
    # Tokenize and move the tensors to the same device as the model
    input_encoded = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**input_encoded)
    logits = outputs.logits
    pred = torch.argmax(logits, dim=1).item()
    return id2label[pred]
get_prediction(text)
'joy'
Save the fine-tuned model so you can reload it later or share it:
trainer.save_model("bert-base-uncased-sentiment-model")
The cleanest way to reuse it is through a pipeline:
from transformers import pipeline
classifier = pipeline('text-classification', model='bert-base-uncased-sentiment-model')
classifier([text, 'hello, how are you?', "love you", "i am feeling low"])
[{'label': 'joy', 'score': 0.9631468057632446}, {'label': 'joy', 'score': 0.7542405128479004}, {'label': 'love', 'score': 0.6492504477500916}, {'label': 'sadness', 'score': 0.9719626307487488}]
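The score in each result is a probability derived from the model's logits. The sketch below shows the usual conversion for a single-label classifier (softmax, then pick the largest entry) in plain Python. The logits are made up for illustration, to_label_and_score is a hypothetical helper, and the pipeline's internals may differ in detail.

```python
import math

# Label IDs as used throughout this tutorial.
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def to_label_and_score(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

hypothetical_logits = [-1.2, 4.1, 0.3, -0.8, -1.5, -0.9]  # made-up values
result = to_label_and_score(hypothetical_logits)
print(result)
```

A low top score, such as the 0.65 for "love you" above, means the probability mass is spread over several classes and the prediction deserves less trust.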

Figure: The end-to-end fine-tuning workflow: load data, tokenize, train, evaluate, save, and serve predictions.
Summary
You fine-tuned bert-base-uncased for six-class emotion classification, reaching about 90% test accuracy. The recipe (tokenize, wrap the data in a DatasetDict, add a classification head, train with the Trainer, evaluate, and save) carries over to most text-classification tasks.
Next, you will apply this exact workflow to compact, distilled models for fake news detection and compare their speed and accuracy.