Sentiment Classification with DistilBERT

Sentiment Classification Using DistilBERT

DistilBERT is a distilled version of BERT: roughly 40% smaller and 60% faster, while preserving over 95% of BERT's performance on the GLUE benchmark. This tutorial fine-tunes DistilBERT on the IMDB movie reviews dataset for binary sentiment classification using ktrain's one-cycle training API.

Notebook Setup

BASH

pip install ktrain

Downloading the dataset

BASH

git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git

OUTPUT

Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), done.

Importing Libraries

PYTHON

import pandas as pd
import numpy as np
import ktrain
from ktrain import text
import tensorflow as tf

Loading dataset

PYTHON

# Load the training and test sets; every column is read as a string.

data_test = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype=str)
data_train = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype=str)

PYTHON

# Print five random rows from the training set

data_train.sample(5)

OUTPUT

	Reviews	Sentiment
16715	The sequel to the ever popular Cinderella stor...	pos
11207	Excellent pirate entertainment! It has all the...	pos
12609	The Underground Comedy movie is perhaps one of...	neg
10685	My cable TV has what's called the Arts channel...	pos
1633	This movie was terrible. Throughout the whole ...	neg

PYTHON

# Print the text classification models available in ktrain

text.print_text_classifiers()

OUTPUT

fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]

PYTHON

# text.texts_from_df returns three objects: the training set, the validation set,
# and the preprocessor.
# maxlen caps each review at 400 tokens; longer reviews are truncated.
# preprocess_mode='distilbert' applies DistilBERT's own tokenization and preprocessing.

(train, val, preproc) = text.texts_from_df(train_df=data_train,
                                           text_column='Reviews',
                                           label_columns='Sentiment',
                                           val_df=data_test,
                                           maxlen=400,
                                           preprocess_mode='distilbert')

OUTPUT

preprocessing train...
language: en
train sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913

Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 234
	95percentile : 598
	99percentile : 913

Observation:

The preprocessor detects the language as English.

It also reports that this is not a multi-label classification task.
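
The length statistics above matter for the choice of maxlen, because any review longer than maxlen is cut off. The following is a minimal sketch, in plain Python with made-up review lengths (not the IMDB data), of how one might estimate the share of reviews that a given maxlen would truncate. The helper names percentile and truncated_share are hypothetical and are not part of ktrain.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def truncated_share(lengths, maxlen):
    """Fraction of sequences that are longer than maxlen."""
    return sum(1 for n in lengths if n > maxlen) / len(lengths)

# Illustrative lengths only; real values could come from the Reviews column,
# e.g. [len(r.split()) for r in data_train['Reviews']].
lengths = [120, 180, 234, 250, 310, 390, 410, 598, 640, 913]

print(percentile(lengths, 95))        # 95th percentile of the toy lengths
print(truncated_share(lengths, 400))  # share of toy reviews cut at maxlen=400
```

Raising maxlen reduces truncation but increases memory use and training time per step, so the value is a trade-off rather than a fixed rule.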

PYTHON

# name='distilbert' selects the DistilBERT model.

model = text.text_classifier(name='distilbert', train_data=train, preproc=preproc)

OUTPUT

Is Multi-Label? False
maxlen is 400
done.

PYTHON

# A batch size of 6 is used here, following the ktrain documentation's recommendation for maxlen=400.

learner = ktrain.get_learner(model = model,
                             train_data = train,
                             val_data = val,
                             batch_size = 6)
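
Batch size also determines how many optimizer steps make up one epoch: the number of examples divided by the batch size, rounded up. The sketch below assumes 25,000 rows in each split (the usual IMDB 50k split; the row counts are not printed in this tutorial) and an evaluation batch size of 32, which is an assumption about the learner's default rather than something set above. The function name steps_per_epoch is hypothetical.

```python
import math

def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

TRAIN_ROWS = 25_000  # assumed size of train.xlsx
VAL_ROWS = 25_000    # assumed size of test.xlsx

print(steps_per_epoch(TRAIN_ROWS, 6))   # 4167
print(steps_per_epoch(VAL_ROWS, 32))    # 782
```

Under these assumptions the values agree with the "Train for 4167 steps, validate for 782 steps" line that ktrain prints when training starts.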

PYTHON

# fit() runs a plain training loop, whereas fit_onecycle() applies the one-cycle learning rate policy.

learner.fit_onecycle(lr = 2e-5, epochs=2)

OUTPUT

begin training using onecycle policy with max lr of 2e-05...
Train for 4167 steps, validate for 782 steps
Epoch 1/2
4167/4167 [==============================] - 3154s 757ms/step - loss: 0.2932 - accuracy: 0.8717 - val_loss: 0.1613 - val_accuracy: 0.9406
Epoch 2/2
4167/4167 [==============================] - 3131s 751ms/step - loss: 0.1552 - accuracy: 0.9440 - val_loss: 0.0623 - val_accuracy: 0.9836
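
To make the one-cycle policy concrete, here is a simplified sketch of a triangular schedule: the learning rate rises linearly to the maximum over the first half of training and falls linearly over the second half. This is an illustration under stated assumptions, not ktrain's actual implementation; the function name one_cycle_lr and the starting divisor of 10 are hypothetical choices.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Triangular schedule: min_lr -> max_lr -> min_lr over total_steps."""
    min_lr = max_lr / start_div  # assumed starting point, not a ktrain setting
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # 0 -> 1 during warm-up
    else:
        frac = (total_steps - step) / half  # 1 -> 0 during annealing
    return min_lr + (max_lr - min_lr) * frac

total = 2 * 4167  # two epochs of 4167 steps, as in the log above
for s in (0, total // 2, total):
    print(s, one_cycle_lr(s, total, max_lr=2e-5))
```

The peak is reached halfway through training, which is why the max lr reported in the log (2e-05) is an upper bound rather than a constant rate.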

PYTHON

# Create a predictor from the trained model and the preprocessor

predictor = ktrain.get_predictor(learner.model, preproc)

PYTHON

#mounting with google drive

from google.colab import drive
drive.mount('/content/drive')

OUTPUT

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive

PYTHON

# Save the predictor (model and preprocessor) to Google Drive

predictor.save('/content/drive/My Drive/distilbert')
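
A saved predictor can be restored later with ktrain.load_predictor, so a fresh session does not need to retrain. The following is a minimal sketch, assuming the same Drive path as above is still mounted; the wrapper name load_sentiment_predictor is hypothetical.

```python
def load_sentiment_predictor(path='/content/drive/My Drive/distilbert'):
    """Reload the predictor (model plus preprocessing) written by predictor.save()."""
    import ktrain  # imported lazily so this function can be defined without ktrain loaded
    return ktrain.load_predictor(path)

# Usage (requires ktrain and the saved folder to exist):
# reloaded = load_sentiment_predictor()
# reloaded.predict(['a surprisingly good film'])
```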

PYTHON

data = ['this movie was really bad. acting was also bad. I will not watch again',
        'the movie was really great. I will see it again',
        'another great movie. must watch to everyone']

PYTHON

predictor.predict(data)

OUTPUT

['neg', 'pos', 'pos']

Interpretation of the above results:

'this movie was really bad. acting was also bad. I will not watch again' - neg

'the movie was really great. I will see it again' - pos

'another great movie. must watch to everyone' - pos

PYTHON

#printing available classes

predictor.get_classes()

OUTPUT

['neg', 'pos']

PYTHON

# return_proba=True returns the predicted probability for each class

predictor.predict(data, return_proba=True)

OUTPUT

array([[0.9944576 , 0.00554235],
       [0.00516187, 0.99483806],
       [0.00479033, 0.99520963]], dtype=float32)
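
Each row of the probability array lines up with the class order returned by get_classes(), so the predicted label is the class with the largest probability. The following is a small sketch using the numbers printed above, with plain Python lists rather than NumPy; labels_from_proba is a hypothetical helper, not a ktrain function.

```python
def labels_from_proba(proba_rows, classes):
    """Pick, for each row, the class with the highest probability."""
    labels = []
    for row in proba_rows:
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best])
    return labels

classes = ['neg', 'pos']
proba = [
    [0.9944576, 0.00554235],
    [0.00516187, 0.99483806],
    [0.00479033, 0.99520963],
]

print(labels_from_proba(proba, classes))  # ['neg', 'pos', 'pos']
```

The result matches what predictor.predict(data) returned without return_proba.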

Conclusion

In this tutorial you fine-tuned DistilBERT on the 50k IMDB dataset for binary sentiment classification using ktrain's one-cycle training API. In just 2 epochs (about 1 hour 45 minutes on GPU in the run shown above), the model reached 98.4% validation accuracy. That is comparable to a full-BERT baseline that requires more compute, and well above the 83% LSTM baseline on the same dataset.

Key takeaways:

DistilBERT retains over 95% of BERT's language-understanding performance with 40% fewer parameters and 60% faster inference, making it a practical choice when compute or latency is a constraint.
preprocess_mode="distilbert" handles the tokenization internals (WordPiece, [CLS]/[SEP], attention masks) automatically. maxlen=400 truncates reviews longer than 400 tokens; since the 95th-percentile review length reported during preprocessing is 598, some reviews are necessarily cut short.
The one-cycle policy (fit_onecycle) raises the learning rate to a peak of 2e-5 and then anneals it, which often converges in fewer epochs than a constant learning rate; 2 epochs are often sufficient when fine-tuning pre-trained transformers for classification.
predictor.save() persists both the model weights and the preprocessing pipeline together, so ktrain.load_predictor() is the only call needed to run inference on new text.

Next steps:

Compare DistilBERT against full BERT in Sentiment Classification Using BERT to quantify the accuracy-speed trade-off on the same IMDB dataset.
Explore learner.lr_find() and learner.lr_plot() to select the peak learning rate empirically rather than relying on the 2e-5 value used here.
Apply the same DistilBERT fine-tuning workflow to multi-class text categorization by pointing label_columns at a column that contains more than two classes.
