Fine-Tuning BERT for Sentiment Classification

Fine-tuning adapts a pretrained model to our own data. In this post, we take bert-base-uncased and teach it to classify the emotion of a tweet into one of six classes: sadness, joy, love, anger, fear, or surprise.

We will use the Hugging Face Transformers Trainer API. It handles the training loop, evaluation, and checkpointing for us. By the end, we will have a saved model that predicts emotion from raw text in one line.

Prerequisites: A grasp of BERT's architecture and a Python environment with transformers, datasets, evaluate, scikit-learn, and torch installed. The code also uses pandas, matplotlib, and seaborn for loading and plotting the data. A GPU is strongly recommended for training.

Loading the Dataset

The dataset is a CSV of 16,000 tweets. Each tweet is labeled with an emotion. We load it with pandas:

PYTHON

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/twitter_multi_class_sentiment.csv")
df.info()
df.isnull().sum()

OUTPUT

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   text        16000 non-null  object
 1   label       16000 non-null  int64 
 2   label_name  16000 non-null  object
text          0
label         0
label_name    0
dtype: int64

The frame has three columns: the raw text, an integer label, and a readable label_name. None of them has missing values. Next, we check how many examples each emotion has:

PYTHON

df['label'].value_counts()

OUTPUT

label
1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: count, dtype: int64

Note

The classes are imbalanced. surprise (label 5) has only 572 examples, versus 5,362 for joy (label 1). That imbalance shows up later in the per-class scores.
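
To put numbers on the imbalance, here is a minimal sketch that turns the counts above into class shares and inverse-frequency weights. The weighting formula is a common heuristic and is an assumption added for illustration; the tutorial itself trains without class weights.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 4666, 1: 5362, 2: 1304, 3: 2159, 4: 1937, 5: 572}

total = sum(counts.values())
n_classes = len(counts)

# Share of the dataset held by each class.
shares = {label: n / total for label, n in counts.items()}

# Inverse-frequency weights: rarer classes get larger weights.
# This "balanced" heuristic is an illustrative assumption, not part of the tutorial.
weights = {label: total / (n_classes * n) for label, n in counts.items()}

for label in sorted(counts):
    print(f"label {label}: share={shares[label]:.3f}, weight={weights[label]:.2f}")
```

Weights like these could be passed to a weighted loss if the rare classes matter more than overall accuracy.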

Exploring the Data

A horizontal bar chart shows the class frequencies at a glance:

PYTHON

import matplotlib.pyplot as plt

label_counts = df['label_name'].value_counts(ascending=True)
label_counts.plot.barh()
plt.title("Frequency of Classes")
plt.show()

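We should also check tweet length, because BERT accepts at most 512 tokens per input. Word count is only a rough proxy, since the tokenizer can split one word into several subword tokens, but it is enough to spot unusually long tweets. We add a word-count column and box-plot it by class: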

PYTHON

df['Words per Tweet'] = df['text'].str.split().apply(len)
df.boxplot("Words per Tweet", by="label_name")

Tokenization

BERT cannot take raw strings. The text must be tokenized into integer IDs first. We load the matching tokenizer with AutoTokenizer:

PYTHON

from transformers import AutoTokenizer

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

text = "I love machine learning! Tokenization is awesome!!"
encoded_text = tokenizer(text)
print(encoded_text)

OUTPUT

{'input_ids': [101, 1045, 2293, 3698, 4083, 999, 19204, 3989, 2003, 12476, 999, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here, we can see the input_ids start with 101 ([CLS]) and end with 102 ([SEP]). Now we inspect the vocabulary size and the model's maximum sequence length:

PYTHON

len(tokenizer.vocab), tokenizer.vocab_size, tokenizer.model_max_length

OUTPUT

(30522, 30522, 512)

Figure: Tokenization splits raw text into subword tokens and maps them to input IDs, framed by the [CLS] and [SEP] special tokens.
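
The subword split can be illustrated with a toy version of WordPiece's greedy longest-match-first lookup. The tiny vocabulary below is hypothetical and far smaller than BERT's 30,522 entries, and the real tokenizer also lowercases, splits punctuation, and maps each piece to an ID. Treat this as a sketch of the idea, not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subwords by greedy longest match."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical mini-vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "love", "is", "awesome"}

tokens = ["[CLS]"]
for w in "love tokenization is awesome".split():
    tokens += wordpiece(w, toy_vocab)
tokens.append("[SEP]")
print(tokens)
```

In this sketch "tokenization" becomes two pieces, which is consistent with the word taking two IDs in the encoded output above.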

Train/Test/Validation Split

We split the data into 70% train, 20% test, and 10% validation. We stratify by class so each split keeps the same emotion mix:

PYTHON

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.3, stratify=df['label_name'])
test, validation = train_test_split(test, test_size=1/3, stratify=test['label_name'])

train.shape, test.shape, validation.shape

OUTPUT

((11200, 4), (3200, 4), (1600, 4))
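
The two-step split can look odd at first, so here is the arithmetic behind it as a small sketch: holding out 30% and then sending one third of that hold-out to validation gives the 70/20/10 proportions.

```python
n_total = 16000

# First split: hold out 30% of all rows.
n_holdout = round(n_total * 0.3)
n_train = n_total - n_holdout

# Second split: one third of the hold-out becomes validation.
n_validation = round(n_holdout * (1 / 3))
n_test = n_holdout - n_validation

for name, n in [("train", n_train), ("test", n_test), ("validation", n_validation)]:
    print(f"{name}: {n} rows ({n / n_total:.0%})")
```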

Now we convert the pandas splits into a Hugging Face DatasetDict. This is the format the Trainer expects:

PYTHON

from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    'train': Dataset.from_pandas(train, preserve_index=False),
    'test': Dataset.from_pandas(test, preserve_index=False),
    'validation': Dataset.from_pandas(validation, preserve_index=False)
})
dataset

OUTPUT

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 11200
    })
    test: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 3200
    })
    validation: Dataset({
        features: ['text', 'label', 'label_name', 'Words per Tweet'],
        num_rows: 1600
    })
})

Tokenizing the Whole Dataset

We define a tokenize function with padding and truncation. Then we map it over every split at once:

PYTHON

def tokenize(batch):
    temp = tokenizer(batch['text'], padding=True, truncation=True)
    return temp

emotion_encoded = dataset.map(tokenize, batched=True, batch_size=None)
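
To make the effect of padding=True and truncation=True concrete, here is a simplified pure-Python sketch of what happens to a batch of ID lists. The helper name pad_and_truncate and the short max_length are hypothetical; the real tokenizer does this internally with a limit of 512. The pad ID 0 matches the [PAD] token of bert-base-uncased.

```python
PAD_ID = 0  # ID of [PAD] in bert-base-uncased


def pad_and_truncate(batch_ids, max_length=8):
    """Mimic padding=True (pad to the longest sequence) and truncation=True."""
    # Truncate long sequences but keep the final [SEP] (ID 102) in place.
    truncated = [
        ids if len(ids) <= max_length else ids[:max_length - 1] + ids[-1:]
        for ids in batch_ids
    ]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [
    [101, 1045, 2293, 102],                   # short sequence
    [101, 1045, 2293, 3698, 4083, 999, 102],  # longer sequence
]
out = pad_and_truncate(batch)
print(out["input_ids"])
print(out["attention_mask"])
```

Because the tutorial maps with batch_size=None, each split is processed as a single batch, so every example in a split is padded to the length of that split's longest tweet.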

We also build the label-to-ID mappings. The model needs these to report readable predictions:

PYTHON

label2id = {x['label_name']: x['label'] for x in dataset['train']}
id2label = {v: k for k, v in label2id.items()}

label2id, id2label

OUTPUT

({'love': 2, 'joy': 1, 'sadness': 0, 'fear': 4, 'anger': 3, 'surprise': 5}, {2: 'love', 1: 'joy', 0: 'sadness', 4: 'fear', 3: 'anger', 5: 'surprise'})
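
The dictionary order above follows the order in which labels first appear in the shuffled training split. If a stable order by class ID is preferred, one option is to collect the unique pairs first and sort them. The rows below are hypothetical stand-ins for dataset['train'], so this is a sketch rather than a drop-in replacement.

```python
# A few hypothetical rows standing in for dataset['train'].
rows = [
    {"label_name": "love", "label": 2},
    {"label_name": "joy", "label": 1},
    {"label_name": "sadness", "label": 0},
    {"label_name": "joy", "label": 1},
    {"label_name": "fear", "label": 4},
    {"label_name": "anger", "label": 3},
    {"label_name": "surprise", "label": 5},
    {"label_name": "sadness", "label": 0},
]

# Deduplicate (id, name) pairs, then sort by class ID.
pairs = sorted({(row["label"], row["label_name"]) for row in rows})

id2label = {i: name for i, name in pairs}
label2id = {name: i for i, name in pairs}

print(id2label)
print(label2id)
```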

Building the Model

We load BERT with a classification head sized to the number of labels. AutoModelForSequenceClassification adds that head on top of the pretrained [CLS] output:

PYTHON

from transformers import AutoModelForSequenceClassification, AutoConfig
import torch

num_labels = len(label2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_ckpt, label2id=label2id, id2label=id2label)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config).to(device)

Important

We will see a warning that classifier.bias and classifier.weight are newly initialized. That is expected. The classification head starts random and is exactly what fine-tuning trains.

Figure: AutoModelForSequenceClassification adds a head that maps BERT's [CLS] output to the six emotion classes.
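
The newly initialized head is small compared with the rest of the network. Here is a back-of-the-envelope count, assuming the head is a single linear layer from the 768-dimensional pooled [CLS] vector of bert-base-uncased to six logits:

```python
hidden_size = 768   # hidden size of bert-base-uncased
num_labels = 6      # six emotion classes

weight_params = hidden_size * num_labels  # classifier.weight has shape (6, 768)
bias_params = num_labels                  # classifier.bias has shape (6,)
head_params = weight_params + bias_params

print(head_params)
```

These are the parameters named in the initialization warning. Fine-tuning updates them together with the pretrained layers.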

Training Arguments and Metrics

Now we configure the training run. A learning rate of 2e-5 and 2 epochs are common starting points for BERT fine-tuning, and both are worth tuning on the validation set:

PYTHON

from transformers import TrainingArguments

batch_size = 64
training_dir = "bert_base_train_dir"

training_args = TrainingArguments(
    output_dir=training_dir,
    overwrite_output_dir=True,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy='epoch',
    disable_tqdm=False
)

Warning

In recent versions of Transformers, evaluation_strategy was renamed to eval_strategy. If we get a deprecation warning or error, use eval_strategy='epoch' instead.
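
One way to stay compatible with both names is to check which argument the installed class accepts. The helper eval_strategy_kwarg and the two stand-in classes below are hypothetical and exist only to keep the sketch runnable without Transformers installed.

```python
import inspect


def eval_strategy_kwarg(args_cls):
    """Return the evaluation-strategy argument name that args_cls accepts."""
    params = inspect.signature(args_cls.__init__).parameters
    return "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"


# Stand-in classes that mimic older and newer TrainingArguments signatures.
class OldArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.evaluation_strategy = evaluation_strategy


class NewArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.eval_strategy = eval_strategy


for cls in (OldArgs, NewArgs):
    name = eval_strategy_kwarg(cls)
    args = cls(output_dir="bert_base_train_dir", **{name: "epoch"})
    print(cls.__name__, "->", name)
```

With the real class, the same idea would be to pass **{eval_strategy_kwarg(TrainingArguments): "epoch"} in place of the hard-coded keyword.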

We define a metric function that reports both accuracy and weighted F1 using scikit-learn:

PYTHON

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)

    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)

    return {"accuracy": acc, "f1": f1}

Training

We assemble the Trainer with the model, arguments, metric function, datasets, and tokenizer. Then we train:

PYTHON

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotion_encoded['train'],
    eval_dataset=emotion_encoded['validation'],
    tokenizer=tokenizer
)

trainer.train()

OUTPUT

{'eval_loss': 0.4704, 'eval_accuracy': 0.85125, 'eval_f1': 0.84068, 'epoch': 1.0}
{'eval_loss': 0.2952, 'eval_accuracy': 0.909375, 'eval_f1': 0.90793, 'epoch': 2.0}
{'train_runtime': 1374.5377, 'train_loss': 0.67778, 'epoch': 2.0}

Validation accuracy climbs from 85% after the first epoch to 91% after the second, while validation loss falls from 0.47 to 0.30. The model is still improving when training stops.
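
For a sense of scale, the number of optimizer steps in this run follows from the split size and batch size. The sketch assumes a single device and no gradient accumulation, neither of which the tutorial states explicitly.

```python
import math

n_train = 11200      # rows in the training split
batch_size = 64      # per-device batch size from TrainingArguments
num_epochs = 2
n_devices = 1        # assumption: a single GPU (or CPU)

steps_per_epoch = math.ceil(n_train / (batch_size * n_devices))
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```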

Evaluating the Model

We run the held-out test set through the trained model:

PYTHON

preds_output = trainer.predict(emotion_encoded['test'])
preds_output.metrics

OUTPUT

{'test_loss': 0.2910054922103882, 'test_accuracy': 0.9028125, 'test_f1': 0.9010784813634883, 'test_runtime': 78.7905}

A per-class classification report shows where the model is strong and weak:

PYTHON

import numpy as np
from sklearn.metrics import classification_report

y_pred = np.argmax(preds_output.predictions, axis=1)
y_true = emotion_encoded['test'][:]['label']
print(classification_report(y_true, y_pred))

OUTPUT

              precision    recall  f1-score   support

           0       0.93      0.97      0.95       933
           1       0.91      0.92      0.91      1072
           2       0.79      0.74      0.76       261
           3       0.94      0.93      0.93       432
           4       0.86      0.87      0.87       387
           5       0.89      0.61      0.72       115

    accuracy                           0.90      3200
   macro avg       0.89      0.84      0.86      3200
weighted avg       0.90      0.90      0.90      3200
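
The two average rows of the report can be reproduced from the per-class rows. This sketch uses the rounded F1 scores and supports printed above, so the results agree with the report to two decimals only.

```python
# (f1-score, support) per class, copied from the report above.
report = {
    0: (0.95, 933),
    1: (0.91, 1072),
    2: (0.76, 261),
    3: (0.93, 432),
    4: (0.87, 387),
    5: (0.72, 115),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(f"macro={macro_f1:.2f}, weighted={weighted_f1:.2f}")
```

The macro average is lower because the two weakest classes count as much as the two largest ones, while the weighted average is dominated by sadness and joy.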

As expected from the class imbalance, the rare classes have the lowest recall. Those are love (2) and surprise (5). A confusion matrix makes the mistakes visible:

PYTHON

import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(5, 5))
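# Order the tick labels by class ID so they match the rows and columns of cm.
labels = [id2label[i] for i in range(len(id2label))]
sns.heatmap(cm, annot=True, xticklabels=labels, yticklabels=labels, fmt='d', cbar=False, cmap='Reds')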
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()

Prediction and Saving

We wrap inference in a small helper that returns the predicted emotion name:

PYTHON

text = "I am super happy today. I got it done. Finally!!"

def get_prediction(text):
    input_encoded = tokenizer(text, return_tensors='pt').to(device)
    with torch.no_grad():
        outputs = model(**input_encoded)
    logits = outputs.logits
    pred = torch.argmax(logits, dim=1).item()
    return id2label[pred]

get_prediction(text)

OUTPUT

'joy'

We save the fine-tuned model so we can reload it later or share it:

PYTHON

trainer.save_model("bert-base-uncased-sentiment-model")

The cleanest way to reuse it is through a pipeline:

PYTHON

from transformers import pipeline

classifier = pipeline('text-classification', model='bert-base-uncased-sentiment-model')
classifier([text, 'hello, how are you?', "love you", "i am feeling low"])

OUTPUT

[{'label': 'joy', 'score': 0.9631468057632446}, {'label': 'joy', 'score': 0.7542405128479004}, {'label': 'love', 'score': 0.6492504477500916}, {'label': 'sadness', 'score': 0.9719626307487488}]
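
The score in each result is the softmax probability of the winning class. The sketch below shows that step on its own. The logits are hypothetical values chosen for illustration; real ones come from the model's output.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Label names by class ID, taken from the mapping built earlier.
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Hypothetical logits for one input, for illustration only.
logits = [-1.2, 4.1, 0.3, -0.8, -1.5, -0.9]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print({"label": id2label[best], "score": probs[best]})
```

A low top score, such as the 0.65 for "love you" above, means the probability mass is spread across several classes.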

Figure: The end-to-end fine-tuning workflow: load data, tokenize, train with the Trainer, evaluate, save, and serve predictions.

Summary

We fine-tuned bert-base-uncased for six-class emotion classification and reached about 90% test accuracy. The recipe carries over to other text-classification tasks: tokenize the text, wrap the data in a DatasetDict, add a classification head, train with the Trainer, evaluate, and save.

Next, we apply this exact workflow to compact, distilled models for fake news detection. We will also compare their speed and accuracy.
