Sentiment Classification with BERT
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer that captures deep bidirectional context, which has made it a strong baseline across many NLP tasks. This tutorial fine-tunes BERT on the IMDB movie reviews dataset for binary sentiment classification using ktrain's one-cycle training API.
What is ktrain
ktrain is a library that helps you build, train, debug, and deploy neural networks in the deep learning framework Keras.
(ktrain uses tf.keras in TensorFlow instead of standalone Keras.) Inspired by the fastai library, ktrain lets you do the following with only a few lines of code:
- estimate an optimal learning rate for your model given your data using a learning rate finder
- employ learning rate schedules such as the triangular learning rate policy, 1cycle policy, and SGDR to more effectively train your model
- employ fast and easy-to-use pre-canned models for both text classification (e.g., NBSVM, fastText, GRU with pre-trained word embeddings) and image classification (e.g., ResNet, Wide Residual Networks, Inception)
- load and preprocess text and image data from a variety of formats
- inspect data points that were misclassified to help improve your model
- leverage a simple prediction API for saving and deploying both models and data-preprocessing steps to make predictions on new raw data
ktrain GitHub: amaiya/ktrain
Notebook Setup
pip install ktrain
Importing Libraries
import tensorflow as tf
import pandas as pd
import numpy as np
import ktrain
from ktrain import text
tf.__version__
'2.1.0'
Downloading the dataset
git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git
Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...
#loading the train dataset
data_train = pd.read_excel('IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype = str)
#loading the test dataset
data_test = pd.read_excel('IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype = str)
#dimension of the dataset
print("Size of train dataset: ",data_train.shape)
print("Size of test dataset: ",data_test.shape)
Size of train dataset: (25000, 2)
Size of test dataset: (25000, 2)
Observation: Both the train and test datasets have 25000 rows and 2 columns.
#printing last rows of train dataset
data_train.tail()
| | Reviews | Sentiment |
|---|---|---|
| 24995 | Everyone plays their part pretty well in this ... | pos |
| 24996 | It happened with Assault on Prescient 13 in 20... | neg |
| 24997 | My God. This movie was awful. I can't complain... | neg |
| 24998 | When I first popped in Happy Birthday to Me, I... | neg |
| 24999 | So why does this show suck? Unfortunately, tha... | neg |
#printing head rows of test dataset
data_test.head()
| | Reviews | Sentiment |
|---|---|---|
| 0 | Who would have thought that a movie about a ma... | pos |
| 1 | After realizing what is going on around us ...... | pos |
| 2 | I grew up watching the original Disney Cindere... | neg |
| 3 | David Mamet wrote the screenplay and made his ... | pos |
| 4 | Admittedly, I didn't have high expectations of... | neg |
Preprocessing the train and test sets for BERT:
# text.texts_from_df returns the training tuple, the validation tuple, and a preprocessor object
# maxlen is the maximum number of tokens kept per review; longer reviews are truncated
# preprocess_mode = 'bert' tokenizes and encodes the text in the format the BERT model expects
(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=data_train,
text_column = 'Reviews',
label_columns = 'Sentiment',
val_df = data_test,
maxlen = 500,
preprocess_mode = 'bert')
downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.
cleanup downloaded zip...
done.
preprocessing train...
language: en
Is Multi-Label? False
preprocessing test...
language: en
Observation:
- The language is detected as English.
- The task is not multi-label classification.
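To make the effect of maxlen concrete, here is a minimal sketch of how a review might be clipped and padded to a fixed length. It assumes plain whitespace splitting in place of BERT's real WordPiece tokenizer, and the helper name clip_and_pad is hypothetical, not part of ktrain.

```python
def clip_and_pad(text, maxlen, pad_token="[PAD]"):
    """Toy illustration of fixed-length input preparation.

    Assumption: whitespace tokens stand in for WordPiece tokens.
    """
    tokens = text.lower().split()
    # Reserve two positions for the special [CLS] and [SEP] markers.
    tokens = tokens[: maxlen - 2]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    # Pad on the right so every sequence has exactly maxlen entries.
    sequence += [pad_token] * (maxlen - len(sequence))
    return sequence


short = clip_and_pad("great plot", maxlen=6)
long = clip_and_pad("this movie was horrible and the plot was boring", maxlen=6)
print(short)  # ['[CLS]', 'great', 'plot', '[SEP]', '[PAD]', '[PAD]']
print(long)   # ['[CLS]', 'this', 'movie', 'was', 'horrible', '[SEP]']
```

With maxlen = 500 the same idea applies at a larger scale: anything past the limit is dropped, so very long reviews lose their ending.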
# name = 'bert' selects the BERT model
model = text.text_classifier(name = 'bert',
train_data = (X_train, y_train),
preproc = preproc)
Is Multi-Label? False
maxlen is 500
done.
# batch_size is set to 6, the value the documentation recommends when maxlen is 500
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
val_data = (X_test, y_test),
batch_size = 6)
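A small batch size means many weight updates per epoch. The arithmetic below is a rough sketch, assuming the final partial batch is kept, of how many training steps one epoch takes with the settings above.

```python
import math

train_examples = 25000  # rows in the training set, from the shape printed earlier
batch_size = 6          # value passed to ktrain.get_learner above

# Round up so the last, smaller batch is counted as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
print(steps_per_epoch)  # 4167
```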
# Optionally, estimate a good learning rate with the learning rate finder:
# learner.lr_find()
# learner.lr_plot()
# With BERT on the full dataset this search can take a very long time (possibly days), so it is skipped here.
# fit is a basic training loop, whereas fit_onecycle applies the one-cycle learning rate policy
learner.fit_onecycle(lr = 2e-5, epochs = 1)
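The shape of a one-cycle schedule can be sketched in plain Python. This is a simplified illustration, not ktrain's implementation: it assumes a linear warm-up to the peak at the midpoint, a linear cool-down afterwards, and a starting rate of max_lr / start_div, where start_div is a hypothetical parameter chosen for the example.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Simplified one-cycle schedule: linear ramp up, then linear ramp down.

    Assumptions: the peak sits at the midpoint and the starting rate is
    max_lr / start_div. Real implementations may differ in both choices.
    """
    min_lr = max_lr / start_div
    half = total_steps / 2
    if step <= half:
        fraction = step / half                  # 0 -> 1 during warm-up
    else:
        fraction = (total_steps - step) / half  # 1 -> 0 during cool-down
    return min_lr + fraction * (max_lr - min_lr)


total = 100
schedule = [one_cycle_lr(s, total, max_lr=2e-5) for s in range(total + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The printed values show the rate starting low, reaching the peak halfway through, and returning to the starting value at the end.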
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('/content/drive/My Drive/bert')
#sample dataset to test on
data = ['this movie was horrible, the plot was really boring. acting was okay',
'the fild is really sucked. there is not plot and acting was bad',
'what a beautiful movie. great plot. acting was good. will see it again']
predictor.predict(data)
['neg', 'neg', 'pos']
Interpretation of the above results:
- 'this movie was horrible, the plot was really boring. acting was okay' - neg
- 'the fild is really sucked. there is not plot and acting was bad' - neg
- 'what a beautiful movie. great plot. acting was good. will see it again' - pos
# return_proba = True returns the predicted probability for each class
predictor.predict(data, return_proba=True)
array([[0.99797565, 0.00202436],
[0.99606663, 0.00393336],
[0.00292433, 0.9970757 ]], dtype=float32)
#classes available
predictor.get_classes()
['neg', 'pos']
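Each probability row lines up with the class order returned by get_classes(). The following sketch, written in plain Python rather than taken from ktrain's internals, shows how picking the largest probability in each row reproduces the predicted labels.

```python
# Probabilities copied from the predictor output above.
probabilities = [
    [0.99797565, 0.00202436],
    [0.99606663, 0.00393336],
    [0.00292433, 0.9970757],
]
classes = ["neg", "pos"]  # order reported by predictor.get_classes()

labels = []
for row in probabilities:
    # Index of the largest probability in this row.
    best_index = max(range(len(row)), key=lambda i: row[i])
    labels.append(classes[best_index])

print(labels)  # ['neg', 'neg', 'pos']
```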
# saving the model and preprocessor to the local path that is zipped and reloaded below
predictor.save('/content/bert')
!zip -r /content/bert.zip /content/bert
adding: content/bert/ (stored 0%)
adding: content/bert/tf_model.h5 (deflated 11%)
adding: content/bert/tf_model.preproc (deflated 52%)
#loading the model
predictor_load = ktrain.load_predictor('/content/bert')
#predicting the data
predictor_load.predict(data)
['neg', 'neg', 'pos']
Conclusion
In this tutorial you fine-tuned a pre-trained BERT model on the 50k IMDB sentiment dataset using the ktrain one-cycle training API. After downloading BERT's uncased weights, tokenizing reviews with the WordPiece tokenizer at maxlen=500, and fine-tuning with a learning rate of 2e-5 for a single epoch, the model labeled all three sample sentences as expected, including the misspelled, ungrammatical negative review ("the fild is really sucked"). The predictor was then saved and reloaded to demonstrate deployment.
Key takeaways:
- BERT's bidirectional self-attention reads the full context of each token at once (left and right), unlike unidirectional LSTMs. This is a key reason it transfers so well to downstream tasks with minimal fine-tuning.
- ktrain's text.texts_from_df handles the BERT-specific preprocessing (WordPiece tokenization, [CLS]/[SEP] token insertion, attention masks) in one call, hiding the boilerplate that normally requires the transformers library.
- The one-cycle learning rate schedule (fit_onecycle) warms up to a peak rate and then cools down, which often trains faster and more stably than a constant rate. Even a single epoch can produce strong results on fine-tuning tasks.
- For production, predictor.save() serializes both the model weights and the preprocessing pipeline together, so inference requires no additional setup beyond ktrain.load_predictor().
Next steps:
- Compare BERT against its compressed variant in DistilBERT — Smaller, Faster, Cheaper, Lighter to see the speed-accuracy trade-off.
- Apply the same fine-tuning workflow to multi-class sentiment with a custom dataset to extend beyond binary classification.
- Try learner.lr_find() and learner.lr_plot() to choose the learning rate empirically rather than relying on the fixed value of 2e-5 used here.
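Learning rate finders typically train briefly while sweeping the rate across several orders of magnitude and recording the loss at each step. The sketch below only generates such a sweep. It is an illustration of the idea under the assumption of log-spaced rates, and exponential_lr_sweep is a hypothetical helper, not a ktrain function.

```python
def exponential_lr_sweep(start_lr, end_lr, num_steps):
    """Return num_steps learning rates spaced evenly on a log scale.

    Hypothetical helper for illustration; not part of ktrain.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    # Constant multiplier applied between consecutive steps.
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** i for i in range(num_steps)]


rates = exponential_lr_sweep(1e-7, 1e-1, num_steps=7)
for rate in rates:
    print(f"{rate:.0e}")
```

In a real range test, the loss recorded at each rate is plotted against the rate, and a value from the region where the loss falls most steeply is chosen.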
