Sentiment Classification Using DistilBERT
DistilBERT is a distilled version of BERT: about 40% smaller and 60% faster, while retaining about 97% of BERT's language-understanding performance on the GLUE benchmark. This tutorial fine-tunes DistilBERT on the IMDB movie reviews dataset for binary sentiment classification using the ktrain one-cycle training API.
Notebook Setup
!pip install ktrain
Downloading the dataset
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git
Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
remote: Enumerating objects: 10, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (10/10), done.
Importing Libraries
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
import tensorflow as tf
Loading dataset
# Load the training and test sets (every column is read as a string)
data_test = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype=str)
data_train = pd.read_excel('/content/IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype=str)
# Print five randomly sampled rows
data_train.sample(5)
| | Reviews | Sentiment |
|---|---|---|
| 16715 | The sequel to the ever popular Cinderella stor... | pos |
| 11207 | Excellent pirate entertainment! It has all the... | pos |
| 12609 | The Underground Comedy movie is perhaps one of... | neg |
| 10685 | My cable TV has what's called the Arts channel... | pos |
| 1633 | This movie was terrible. Throughout the whole ... | neg |
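Before training, it can help to confirm that the two classes are roughly balanced. The sketch below uses a tiny hypothetical DataFrame with the same Reviews and Sentiment columns as data_train; the rows are made up for illustration, and in the notebook the same call would be applied to data_train directly.

```python
import pandas as pd

# Hypothetical stand-in for data_train: same column names, made-up rows.
toy_train = pd.DataFrame({
    "Reviews": ["Great film.", "Terrible plot.", "Loved it.", "Not worth watching."],
    "Sentiment": ["pos", "neg", "pos", "neg"],
})

# Fraction of rows per label; a balanced binary set gives 0.5 for each class.
label_share = toy_train["Sentiment"].value_counts(normalize=True)
print(label_share)
```

A heavily skewed split would make plain accuracy a misleading metric, so this check is worth doing before interpreting the training log.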
# Print the available text classification models
text.print_text_classifiers()
fasttext: a fastText-like model [http://arxiv.org/pdf/1607.01759.pdf]
logreg: logistic regression using a trainable Embedding layer
nbsvm: NBSVM model [http://www.aclweb.org/anthology/P12-2018]
bigru: Bidirectional GRU with pretrained fasttext word vectors [https://fasttext.cc/docs/en/crawl-vectors.html]
standard_gru: simple 2-layer GRU with randomly initialized embeddings
bert: Bidirectional Encoder Representations from Transformers (BERT) [https://arxiv.org/abs/1810.04805]
distilbert: distilled, smaller, and faster BERT from Hugging Face [https://arxiv.org/abs/1910.01108]
# text.texts_from_df returns the training data, the validation data, and a preprocessor
# maxlen: reviews longer than this many tokens are truncated
# preprocess_mode='distilbert': applies DistilBERT's tokenization and input formatting to the corpus
(train, val, preproc) = text.texts_from_df(train_df=data_train,
                                           text_column='Reviews',
                                           label_columns='Sentiment',
                                           val_df=data_test,
                                           maxlen=400,
                                           preprocess_mode='distilbert')
preprocessing train...
language: en
train sequence lengths:
mean : 234
95percentile : 598
99percentile : 913
Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
mean : 234
95percentile : 598
99percentile : 913
Observation:
The preprocessor detects the language as English.
It also reports that this is not a multi-label classification problem.
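The length statistics above suggest that maxlen = 400 cuts off a noticeable share of reviews. As a rough way to estimate that share, the sketch below counts whitespace-separated words, which is only a proxy for the WordPiece token count the model actually sees. The helper name length_report and the toy reviews are hypothetical.

```python
import numpy as np

def length_report(texts, maxlen):
    """Summarise whitespace word counts and the share of texts longer than maxlen."""
    lengths = np.array([len(t.split()) for t in texts])
    return {
        "mean": float(lengths.mean()),
        "p95": float(np.percentile(lengths, 95)),
        "share_truncated": float((lengths > maxlen).mean()),
    }

# Made-up reviews of 3, 5 and 10 words; maxlen=4 is chosen only for the demo.
toy_reviews = [
    "good movie overall",
    "a b c d e",
    "one two three four five six seven eight nine ten",
]
report = length_report(toy_reviews, maxlen=4)
print(report)
```

In the notebook, passing data_train['Reviews'] and maxlen=400 would give an estimate of how many reviews lose text to truncation.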
# name='distilbert' selects the DistilBERT classifier
model = text.text_classifier(name='distilbert', train_data=train, preproc=preproc)
Is Multi-Label? False
maxlen is 400
done.
# A batch size of 6 is used because the documentation recommends it with maxlen = 400
learner = ktrain.get_learner(model=model,
                             train_data=train,
                             val_data=val,
                             batch_size=6)
# fit is a basic training loop, whereas fit_onecycle applies the one-cycle learning rate policy
learner.fit_onecycle(lr = 2e-5, epochs=2)
begin training using onecycle policy with max lr of 2e-05...
Train for 4167 steps, validate for 782 steps
Epoch 1/2
4167/4167 [==============================] - 3154s 757ms/step - loss: 0.2932 - accuracy: 0.8717 - val_loss: 0.1613 - val_accuracy: 0.9406
Epoch 2/2
4167/4167 [==============================] - 3131s 751ms/step - loss: 0.1552 - accuracy: 0.9440 - val_loss: 0.0623 - val_accuracy: 0.9836
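The step counts and timings in the log can be checked with a little arithmetic. The sketch below assumes 25,000 reviews in each split and an evaluation batch size of 32; neither number is stated in the notebook, but both are consistent with the 4167 and 782 steps reported.

```python
import math

n_train, n_val = 25_000, 25_000   # assumed split sizes (50k reviews in total)
train_bs, eval_bs = 6, 32         # batch_size passed to get_learner; 32 is an assumed evaluation batch size

# The last, partially filled batch still counts as a step, hence the ceiling.
train_steps = math.ceil(n_train / train_bs)
val_steps = math.ceil(n_val / eval_bs)

epoch_seconds = [3154, 3131]      # per-epoch wall-clock times copied from the log
total_hours = sum(epoch_seconds) / 3600

print(train_steps, val_steps, round(total_hours, 2))
```

Under these assumptions the two epochs take about 1.75 hours in total.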
# Create a predictor that bundles the trained model with the preprocessor
predictor = ktrain.get_predictor(learner.model, preproc)
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
#saving model
predictor.save('/content/drive/My Drive/distilbert')
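A saved predictor can later be reloaded in a fresh session. The sketch below wraps that in a small helper; the function name load_and_predict is hypothetical, and it assumes ktrain is installed and that the folder written by predictor.save() is still available at the same path.

```python
def load_and_predict(predictor_path, texts):
    """Reload a predictor saved with predictor.save() and classify new texts."""
    import ktrain  # imported lazily so the helper can be defined without ktrain installed

    reloaded = ktrain.load_predictor(predictor_path)
    return reloaded.predict(texts)

# Example call (needs ktrain and the folder saved above):
# load_and_predict('/content/drive/My Drive/distilbert', ['a surprisingly good film'])
```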
data = ['this movie was really bad. acting was also bad. I will not watch again',
'the movie was really great. I will see it again', 'another great movie. must watch to everyone']
predictor.predict(data)
['neg', 'pos', 'pos']
Interpretation of the above results:
'this movie was really bad. acting was also bad. I will not watch again' - neg
'the movie was really great. I will see it again' - pos
'another great movie. must watch to everyone' - pos
#printing available classes
predictor.get_classes()
['neg', 'pos']
# return_proba=True returns the predicted probability of each class
predictor.predict(data, return_proba=True)
array([[0.9944576 , 0.00554235],
[0.00516187, 0.99483806],
[0.00479033, 0.99520963]], dtype=float32)
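Each row of the array holds the probabilities for the classes in the order returned by predictor.get_classes(). As a minimal sketch, the values printed above can be mapped back to labels by taking the most probable column in each row:

```python
import numpy as np

classes = ['neg', 'pos']   # order returned by predictor.get_classes()
probs = np.array([[0.9944576, 0.00554235],
                  [0.00516187, 0.99483806],
                  [0.00479033, 0.99520963]], dtype=np.float32)

# Index of the most probable class per row, mapped back to its label.
labels = [classes[i] for i in probs.argmax(axis=1)]
# Probability assigned to the winning class, usable as a confidence score.
confidence = probs.max(axis=1)
print(labels)
```

The recovered labels match the output of predictor.predict(data) shown earlier.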
Conclusion
In this tutorial you fine-tuned DistilBERT on the 50k IMDB dataset for binary sentiment classification using ktrain's one-cycle training API. In just 2 epochs (roughly 1.75 hours on GPU), the model reached 98.4% validation accuracy. That is comparable to a full-BERT baseline at a lower compute cost, and well above the 83% LSTM baseline on the same dataset.
Key takeaways:
- DistilBERT retains about 97% of BERT's language-understanding performance with 40% fewer parameters and 60% faster inference, making it the practical default when compute or latency is a constraint.
- preprocess_mode="distilbert" handles all tokenization internals (WordPiece, [CLS]/[SEP], attention masks) automatically. Setting maxlen=400 truncates reviews longer than 400 tokens; because the 95th-percentile review length is 598 tokens, more than 5% of reviews are truncated.
- The one-cycle policy (fit_onecycle) warms the learning rate up to 2e-5 and then anneals it, which typically converges in fewer epochs than a constant learning rate; 2 epochs are often sufficient for fine-tuning pre-trained transformers on classification tasks.
- predictor.save() persists both the model weights and the preprocessing pipeline as a single artifact, so ktrain.load_predictor() is the only call needed for inference on new text.
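The warm-up-then-anneal shape of the one-cycle policy can be illustrated with a simplified triangular schedule. This is a sketch of the general idea, not ktrain's exact implementation; the function one_cycle_lr and its start_div parameter are hypothetical.

```python
def one_cycle_lr(step, total_steps, max_lr=2e-5, start_div=10.0):
    """Simplified triangular one-cycle schedule: linear warm-up, then linear decay."""
    half = total_steps / 2
    min_lr = max_lr / start_div          # assumed starting and ending learning rate
    if step <= half:
        frac = step / half               # warm-up phase: 0 -> 1
    else:
        frac = (total_steps - step) / half   # annealing phase: 1 -> 0
    return min_lr + frac * (max_lr - min_lr)

total = 2 * 4167  # two epochs of 4167 steps, as in the training log
schedule = [one_cycle_lr(s, total) for s in (0, total // 2, total)]
print(schedule)
```

The learning rate starts low, peaks at max_lr halfway through training, and returns to its starting value at the final step.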
Next steps:
- Compare DistilBERT against full BERT in Sentiment Classification Using BERT to quantify the accuracy-speed trade-off on the same IMDB dataset.
- Explore learner.lr_find() and learner.lr_plot() to select the peak learning rate empirically rather than relying on the 2e-5 used here.
- Apply the same DistilBERT fine-tuning workflow to multi-class text categorization by pointing label_columns at a column with more than two categories.
