Fine-Tuning BERT for Sentiment Classification

Fine-tune BERT for multi-class emotion classification on tweets using Hugging Face Transformers, the Trainer API, and a custom evaluation function.

Jun 18, 2026 · 20 min read

Topics You Will Master

Loading and exploring a multi-class emotion dataset
Tokenizing text with AutoTokenizer and building a Hugging Face DatasetDict
Attaching a classification head with AutoModelForSequenceClassification
Training with the Trainer API and a custom accuracy/F1 metric

Fine-tuning adapts a pretrained model to your own data. In this tutorial you take bert-base-uncased and teach it to classify the emotion of a tweet — one of six classes: sadness, joy, love, anger, fear, or surprise.

You will use the Hugging Face Transformers Trainer API, which handles the training loop, evaluation, and checkpointing for you. By the end you will have a saved model that predicts emotion from raw text in one line.

Prerequisites: A grasp of BERT's architecture and a Python environment with transformers, datasets, evaluate, scikit-learn, and torch installed, plus pandas, matplotlib, and seaborn for loading and plotting the data. A GPU is strongly recommended for training.

Loading the Dataset

The dataset is a CSV of 16,000 tweets, each labeled with an emotion. Load it with pandas:

PYTHON
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/twitter_multi_class_sentiment.csv")
df.info()
df.isnull().sum()
OUTPUT
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        16000 non-null  object
 1   label       16000 non-null  int64 
 2   label_name  16000 non-null  object
text          0
label         0
label_name    0
dtype: int64

There are three columns — the raw text, an integer label, and a human-readable label_name — and no missing values. Check how many examples each emotion has:

PYTHON
df['label'].value_counts()
OUTPUT
label
1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: count, dtype: int64

Note

The classes are imbalanced — surprise (label 5) has only 572 examples versus 5,362 for joy (label 1). That imbalance shows up later in the per-class scores.
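
To put a number on that skew, the short sketch below recomputes each class's share of the dataset from the counts printed above. It also derives inverse-frequency ("balanced") class weights, a common heuristic that this tutorial does not use in training; they are shown only to illustrate how much rarer surprise is than joy.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4666, 1: 5362, 2: 1304, 3: 2159, 4: 1937, 5: 572}

total = sum(counts.values())
n_classes = len(counts)

# Share of the dataset held by each class, as a percentage.
shares = {label: 100 * n / total for label, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Inverse-frequency weights: total / (n_classes * count).
# Illustration only; the training run in this tutorial is unweighted.
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in sorted(counts):
    print(f"label {label}: {shares[label]:5.1f}% of rows, weight {class_weights[label]:.2f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

Joy is a little over nine times as frequent as surprise, which is worth keeping in mind when you read the per-class recall later.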


Exploring the Data

A horizontal bar chart shows the class frequencies at a glance:

PYTHON
import matplotlib.pyplot as plt

label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

It is also worth checking tweet length, since BERT has a maximum input size. Add a word-count column and box-plot it by class:

PYTHON
df['Words per Tweet'] = df['text'].str.split().apply(len)
df.boxplot("Words per Tweet", by="label_name")

Tokenization

BERT cannot take raw strings — text must be tokenized into integer IDs. Load the matching tokenizer with AutoTokenizer:

PYTHON
from transformers import AutoTokenizer

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

text = "I love machine learning! Tokenization is awesome!!"
encoded_text = tokenizer(text)
print(encoded_text)
OUTPUT
{'input_ids': [101, 1045, 2293, 3698, 4083, 999, 19204, 3989, 2003, 12476, 999, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

The input_ids start with 101 ([CLS]) and end with 102 ([SEP]). Inspect the vocabulary size and the model's maximum sequence length:

PYTHON
len(tokenizer.vocab), tokenizer.vocab_size, tokenizer.model_max_length
OUTPUT
(30522, 30522, 512)

Figure: Tokenization splits raw text into subword tokens and maps them to input IDs framed by the [CLS] and [SEP] special tokens.
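
To see why "tokenization" became two IDs in the output above, here is a minimal sketch of WordPiece-style greedy longest-match splitting. It is a toy, not the real tokenizer: the vocabulary and its IDs are made up for illustration, the input is assumed to be already split into words and punctuation, and the real bert-base-uncased vocabulary has 30,522 entries.

```python
# Toy vocabulary: these IDs are invented for the sketch and do NOT match bert-base-uncased.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "token": 6, "##ization": 7, "!": 8}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def encode(words, vocab):
    """Frame the subword tokens with [CLS] and [SEP], then look up their IDs."""
    tokens = ["[CLS]"]
    for w in words:
        tokens.extend(wordpiece(w.lower(), vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]

tokens, ids = encode(["I", "love", "tokenization", "!"], TOY_VOCAB)
print(tokens)  # ['[CLS]', 'i', 'love', 'token', '##ization', '!', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 8, 3]
```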


Train/Test/Validation Split

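Split the data into 70% train, 20% test, and 10% validation, stratified by class so each split keeps the same emotion distribution. No random_state is set, so the rows in each split (and the exact scores later on) vary between runs: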

PYTHON
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3, stratify=df['label_name'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label_name'])

train.shape, test.shape, validation.shape
OUTPUT
((11200, 4), (3200, 4), (1600, 4))
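
The two-step split can look odd at first: 30% is held out, then one third of that holdout becomes validation. The sketch below checks the arithmetic. The helper name split_sizes is hypothetical, and rounding up with ceil is an assumption about how the split size is rounded; with 16,000 rows no rounding is needed anyway.

```python
import math
from fractions import Fraction

def split_sizes(n_rows, holdout_fraction, validation_share_of_holdout):
    """Return (train, test, validation) row counts for a two-step split."""
    # Step 1: hold out a fraction of all rows.
    holdout = math.ceil(n_rows * holdout_fraction)
    # Step 2: carve the validation set out of the holdout; the rest is the test set.
    validation = math.ceil(holdout * validation_share_of_holdout)
    test = holdout - validation
    train = n_rows - holdout
    return train, test, validation

# Exact fractions avoid floating-point noise in the check.
print(split_sizes(16000, Fraction(3, 10), Fraction(1, 3)))  # (11200, 3200, 1600)
```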

Convert the pandas splits into a Hugging Face DatasetDict, the format the Trainer expects:

PYTHON
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    'train': Dataset.from_pandas(train, preserve_index=False),
    'test': Dataset.from_pandas(test, preserve_index=False),
    'validation': Dataset.from_pandas(validation, preserve_index=False)
})
dataset
OUTPUT
DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 11200
    })
    test: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 3200
    })
    validation: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 1600
    })
})

Tokenizing the Whole Dataset

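Define a tokenize function with padding and truncation, then map it over every split at once. With batched=True and batch_size=None, each split is tokenized as a single batch, so padding=True pads every example to the longest sequence in that split: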

PYTHON
def tokenize(batch):
    temp = tokenizer(batch['text'], padding=True, truncation=True)
    return temp

emotion_encoded = dataset.map(tokenize, batched=True, batch_size=None)
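
To make padding and truncation concrete, here is a simplified sketch of what they do to a batch of ID lists. The helper pad_batch is hypothetical and cuts long sequences at the tail, whereas the real tokenizer keeps the closing [SEP] when it truncates. The sample IDs are the first tokens from the earlier tokenizer output.

```python
PAD_ID = 0  # bert-base-uncased reserves ID 0 for [PAD]

def pad_batch(sequences, max_length=None, pad_id=PAD_ID):
    """Pad every sequence to the longest one (capped at max_length) and build masks."""
    longest = max(len(seq) for seq in sequences)
    target = longest if max_length is None else min(longest, max_length)
    input_ids, attention_mask = [], []
    for seq in sequences:
        kept = seq[:target]                     # truncation (simplified: drops the tail)
        n_pad = target - len(kept)
        input_ids.append(kept + [pad_id] * n_pad)             # padding
        attention_mask.append([1] * len(kept) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([
    [101, 1045, 2293, 3698, 4083, 102],  # "[CLS] i love machine learning [SEP]"
    [101, 1045, 2293, 102],              # "[CLS] i love [SEP]"
])
print(batch["input_ids"][1])       # [101, 1045, 2293, 102, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions.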

Build the label-to-ID mappings the model needs to report human-readable predictions:

PYTHON
label2id = {x['label_name']: x['label'] for x in dataset['train']}
id2label = {v: k for k, v in label2id.items()}

label2id, id2label
OUTPUT
({'love': 2, 'joy': 1, 'sadness': 0, 'fear': 4, 'anger': 3, 'surprise': 5}, {2: 'love', 1: 'joy', 0: 'sadness', 4: 'fear', 3: 'anger', 5: 'surprise'})
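
Notice that the dictionaries above come out in the order the rows happen to be visited, not in label-ID order. If you need the names sorted by ID (for plot axes, for example), one option is to build the mappings from the unique (label, label_name) pairs. The rows list here is toy stand-in data for illustration, not the real dataset.

```python
# Toy rows standing in for dataset['train']; only the two label columns matter here.
rows = [
    {"label": 2, "label_name": "love"},
    {"label": 1, "label_name": "joy"},
    {"label": 0, "label_name": "sadness"},
    {"label": 4, "label_name": "fear"},
    {"label": 1, "label_name": "joy"},
    {"label": 3, "label_name": "anger"},
    {"label": 5, "label_name": "surprise"},
]

# Deduplicate the pairs, then sort by label ID.
pairs = sorted({(row["label"], row["label_name"]) for row in rows})

id2label = {label_id: name for label_id, name in pairs}
label2id = {name: label_id for label_id, name in pairs}

# Class names in label-ID order.
class_names = [id2label[i] for i in range(len(id2label))]
print(class_names)  # ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
```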

Building the Model

Load BERT with a classification head sized to the number of labels. AutoModelForSequenceClassification adds that head on top of the pretrained [CLS] output:

PYTHON
from transformers import AutoModelForSequenceClassification, AutoConfig
import torch

num_labels = len(label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

Important

You will see a warning that classifier.bias and classifier.weight are "newly initialized." That is expected — the classification head starts random and is exactly what fine-tuning trains.

Figure: AutoModelForSequenceClassification adds a head that maps BERT's [CLS] output to the six emotion classes.


Training Arguments and Metrics

Configure the training run. A learning rate of 2e-5 and 2 epochs are solid defaults for BERT fine-tuning:

PYTHON
from transformers import TrainingArguments

batch_size = 64
training_dir = "bert_base_train_dir"

training_args = TrainingArguments(
    output_dir=training_dir,
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    disable_tqdm=False
)

Warning

In recent versions of Transformers, evaluation_strategy was renamed to eval_strategy. If you get a deprecation warning or error, use eval_strategy='epoch' instead.
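
If the same notebook has to run on both older and newer installs, one possible approach is to check which keyword the installed class accepts before building the arguments. The helper first_supported_kwarg below is hypothetical, and the two stand-in functions only mimic the old and new signatures so the sketch runs without Transformers installed.

```python
import inspect

def first_supported_kwarg(callable_obj, *candidates):
    """Return the first candidate name that callable_obj accepts as a parameter."""
    params = inspect.signature(callable_obj).parameters
    for name in candidates:
        if name in params:
            return name
    raise TypeError(f"{callable_obj!r} accepts none of {candidates}")

# Stand-ins that mimic the old and new keyword names (not the real class).
def old_style_args(output_dir, evaluation_strategy="no"):
    return {"output_dir": output_dir, "strategy": evaluation_strategy}

def new_style_args(output_dir, eval_strategy="no"):
    return {"output_dir": output_dir, "strategy": eval_strategy}

for factory in (old_style_args, new_style_args):
    # List the newer name first so it wins when both are accepted.
    key = first_supported_kwarg(factory, "eval_strategy", "evaluation_strategy")
    print(factory.__name__, "->", key, factory("bert_base_train_dir", **{key: "epoch"}))
```

With the real library you would pass TrainingArguments in place of the stand-in and unpack {key: 'epoch'} alongside the other arguments.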

Define a metric function that reports both accuracy and weighted F1 using scikit-learn:

PYTHON
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {"accuracy": acc, "f1": f1}
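
To see what those two numbers measure, here is a pure-Python sketch that computes accuracy and weighted F1 by hand on a tiny made-up example. It is meant as an illustration of the definitions (per-class F1 averaged with each class's true count as its weight), not a replacement for the scikit-learn functions used above.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its number of true examples."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for l, p in zip(labels, preds) if l == cls and p == cls)
        fp = sum(1 for l, p in zip(labels, preds) if l != cls and p == cls)
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # equals 2PR/(P+R); 0 when tp == 0
        total += f1 * n
    return total / len(labels)

# Made-up labels and predictions for three classes.
labels = [0, 0, 0, 1, 1, 2]
preds  = [0, 0, 1, 1, 1, 0]

print(f"accuracy:    {accuracy(labels, preds):.3f}")     # 0.667
print(f"weighted F1: {weighted_f1(labels, preds):.3f}")  # 0.600
```

Class 2 is never predicted correctly, so its F1 is 0, but because it has only one example it pulls the weighted score down less than it would pull down an unweighted (macro) average.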

Training

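Assemble the Trainer with the model, arguments, metric function, datasets, and tokenizer, then train. In recent Transformers releases the tokenizer argument is deprecated in favor of processing_class, so switch to processing_class=tokenizer if you see a deprecation warning: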

PYTHON
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_encoded['train'],
    eval_dataset=emotion_encoded['validation'],
    tokenizer=tokenizer
)

trainer.train()
OUTPUT
{'eval_loss': 0.4704, 'eval_accuracy': 0.85125, 'eval_f1': 0.84068, 'epoch': 1.0}
{'eval_loss': 0.2952, 'eval_accuracy': 0.909375, 'eval_f1': 0.90793, 'epoch': 2.0}
{'train_runtime': 1374.5377, 'train_loss': 0.67778, 'epoch': 2.0}

Validation accuracy climbs from 85.1% after the first epoch to 90.9% after the second, and validation loss falls from 0.47 to 0.30, so the model is still improving when training stops.
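
For a sense of scale, the sketch below works out how many optimizer steps this run takes. It assumes a single device and no gradient accumulation (neither is set in the arguments above); the helper name optimizer_steps is hypothetical.

```python
import math

def optimizer_steps(n_examples, batch_size, epochs):
    """Steps per epoch and in total, assuming one device and no gradient accumulation."""
    per_epoch = math.ceil(n_examples / batch_size)  # the last batch may be smaller
    return per_epoch, per_epoch * epochs

per_epoch, total = optimizer_steps(n_examples=11200, batch_size=64, epochs=2)
print(per_epoch, total)  # 175 350
```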


Evaluating the Model

Run the held-out test set through the trained model:

PYTHON
preds_output = trainer.predict(emotion_encoded['test'])
preds_output.metrics
OUTPUT
{'test_loss': 0.2910054922103882, 'test_accuracy': 0.9028125, 'test_f1': 0.9010784813634883, 'test_runtime': 78.7905}

A per-class classification report shows where the model is strong and weak:

PYTHON
import numpy as np
from sklearn.metrics import classification_report

y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = emotion_encoded['test'][:]['label']
print(classification_report(y_true, y_pred))
OUTPUT
              precision    recall  f1-score   support

           0       0.93      0.97      0.95       933
           1       0.91      0.92      0.91      1072
           2       0.79      0.74      0.76       261
           3       0.94      0.93      0.93       432
           4       0.86      0.87      0.87       387
           5       0.89      0.61      0.72       115

    accuracy                           0.90      3200
   macro avg       0.89      0.84      0.86      3200
weighted avg       0.90      0.90      0.90      3200
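
The two average rows in the report differ because of how they weight the classes. As a quick check, this sketch recomputes the macro and weighted recall from the per-class rows above (values copied from the report, so they carry its two-decimal rounding).

```python
# Per-class recall and support copied from the classification report above.
recall  = {0: 0.97, 1: 0.92, 2: 0.74, 3: 0.93, 4: 0.87, 5: 0.61}
support = {0: 933, 1: 1072, 2: 261, 3: 432, 4: 387, 5: 115}

# Macro average: every class counts equally, however small it is.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its number of test examples.
weighted_recall = sum(recall[c] * support[c] for c in recall) / sum(support.values())

print(f"macro recall:    {macro_recall:.2f}")     # 0.84
print(f"weighted recall: {weighted_recall:.2f}")  # 0.90
```

The macro figure is lower because the two small classes with weak recall count as much as the large ones.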

As expected from the class imbalance, the rare classes — love (2) and surprise (5) — have the lowest recall. A confusion matrix makes the mistakes visible:

PYTHON
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(5, 5))
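# confusion_matrix orders rows and columns by label ID (0-5), so order the tick labels the same way;
# label2id.keys() follows dataset order and would mislabel the axes.
class_names = [id2label[i] for i in range(len(id2label))]
sns.heatmap(cm, annot=True, xticklabels=class_names, yticklabels=class_names, fmt='d', cbar=False, cmap='Reds')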
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

Prediction and Saving

Wrap inference in a small helper that returns the predicted emotion name:

PYTHON
text = "I am super happy today. I got it done. Finally!!"

def get_prediction(text):
    input_encoded = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**input_encoded)
    logits = outputs.logits
    pred = torch.argmax(logits, dim=1).item()
    return id2label[pred]

get_prediction(text)
OUTPUT
'joy'

Save the fine-tuned model so you can reload it later or share it:

PYTHON
trainer.save_model("bert-base-uncased-sentiment-model")

The cleanest way to reuse it is through a pipeline:

PYTHON
from transformers import pipeline

classifier = pipeline('text-classification', model='bert-base-uncased-sentiment-model')
classifier([text, 'hello, how are you?', "love you", "i am feeling low"])
OUTPUT
[{'label': 'joy', 'score': 0.9631468057632446}, {'label': 'joy', 'score': 0.7542405128479004}, {'label': 'love', 'score': 0.6492504477500916}, {'label': 'sadness', 'score': 0.9719626307487488}]
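
The score in each result is a probability, while get_prediction only returned the top label. As a rough sketch of the step in between, the code below applies a softmax to a set of logits and picks the highest-probability class. The logit values are made up for illustration, and the label mapping is the one built earlier in the tutorial.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Made-up logits for one input; a real model would produce these.
logits = [-1.2, 4.1, 0.9, -0.8, -1.5, -0.3]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": round(probs[best], 4)})
```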

Figure: The end-to-end fine-tuning workflow: load data, tokenize, train, evaluate, save, and serve predictions.


Summary

You fine-tuned bert-base-uncased for six-class emotion classification, reaching about 90% test accuracy. The recipe (tokenize, wrap data in a DatasetDict, add a classification head, train with the Trainer, evaluate, and save) is one you can reuse for most text-classification tasks.

Next, you will apply this exact workflow to compact, distilled models for fake news detection and compare their speed and accuracy.
