Sentiment Classification Using BERT

In this blog, we fine-tune BERT for sentiment classification. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer encoder whose self-attention conditions every token on both its left and right context, which helped it set state-of-the-art results on many NLP benchmarks when it was released. We fine-tune it on the IMDB movie reviews dataset for binary (positive/negative) sentiment using ktrain's one-cycle training API.

What is `ktrain`

ktrain is a library to help build, train, debug, and deploy neural networks in the deep learning software framework, Keras.

ktrain uses tf.keras in TensorFlow instead of standalone Keras. Inspired by the fastai library, ktrain lets us do the following with only a few lines of code:

estimate a good learning rate for our model and data using a learning rate finder
use learning rate schedules such as the triangular policy, 1cycle policy, and SGDR to train our model more effectively
use fast, ready-made models for both text classification (e.g., NBSVM, fastText, GRU with pre-trained word embeddings) and image classification (e.g., ResNet, Wide Residual Networks, Inception)
load and preprocess text and image data from a variety of formats
inspect data points that were misclassified to help improve our model
use a simple prediction API for saving and deploying both models and preprocessing steps to make predictions on new raw data

ktrain GitHub: amaiya/ktrain

Notebook Setup

BASH

pip install ktrain

Importing Libraries

PYTHON

import tensorflow as tf
import pandas as pd
import numpy as np
import ktrain
from ktrain import text

PYTHON

tf.__version__

OUTPUT

'2.1.0'

Downloading the dataset

BASH

git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git

OUTPUT

Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...

PYTHON

#loading the train dataset

data_train = pd.read_excel('IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx', dtype = str)

PYTHON

#loading the test dataset

data_test = pd.read_excel('IMDB-Movie-Reviews-Large-Dataset-50k/test.xlsx', dtype = str)

PYTHON

#dimension of the dataset

print("Size of train dataset: ",data_train.shape)
print("Size of test dataset: ",data_test.shape)

OUTPUT

Size of train dataset:  (25000, 2)
Size of test dataset:  (25000, 2)

Observation: The train and test datasets each have 25,000 rows and 2 columns (Reviews and Sentiment).

PYTHON

#printing last rows of train dataset

data_train.tail()

OUTPUT

	Reviews	Sentiment
24995	Everyone plays their part pretty well in this ...	pos
24996	It happened with Assault on Prescient 13 in 20...	neg
24997	My God. This movie was awful. I can't complain...	neg
24998	When I first popped in Happy Birthday to Me, I...	neg
24999	So why does this show suck? Unfortunately, tha...	neg

PYTHON

#printing head rows of test dataset

data_test.head()

OUTPUT

	Reviews	Sentiment
0	Who would have thought that a movie about a ma...	pos
1	After realizing what is going on around us ......	pos
2	I grew up watching the original Disney Cindere...	neg
3	David Mamet wrote the screenplay and made his ...	pos
4	Admittedly, I didn't have high expectations of...	neg
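
Before training, it can help to check how the two sentiment labels are distributed. The sketch below is a minimal illustration: `label_shares` is a hypothetical helper, and the sample labels are only the five rows visible in the `data_train.tail()` preview above, not the full dataset. The same function should work on the full column, for example `label_shares(data_train['Sentiment'])`, since it only needs an iterable of labels.

```python
from collections import Counter

def label_shares(labels):
    """Return {label: fraction of rows} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Labels copied from the data_train.tail() preview above (5 rows only,
# so this says nothing about the balance of the full 25,000 rows).
preview_labels = ['pos', 'neg', 'neg', 'neg', 'neg']
shares = label_shares(preview_labels)
print(shares)
```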

Splitting data into Train and Test:

PYTHON

# text.texts_from_df returns the training pair, the validation pair, and a preprocessor object
# maxlen: each review is limited to 500 tokens; anything beyond that is truncated
# preprocess_mode='bert': tokenize and encode the text the way the BERT model expects

(X_train, y_train), (X_test, y_test), preproc = text.texts_from_df(train_df=data_train,
                                                                   text_column = 'Reviews',
                                                                   label_columns = 'Sentiment',
                                                                   val_df = data_test,
                                                                   maxlen = 500,
                                                                   preprocess_mode = 'bert')

OUTPUT

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en

Is Multi-Label? False
preprocessing test...
language: en

Observation:

We can see that it is detecting the language as English
Also, this is not a multilabel classification
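
To make the effect of `maxlen` more concrete, here is a simplified sketch of how a review is fitted into a fixed number of slots. It is an illustration under assumptions, not ktrain's actual code: `truncate_and_pad` is a hypothetical helper, a whitespace split stands in for BERT's WordPiece tokenizer, and the `[PAD]` string stands in for the padding id.

```python
def truncate_and_pad(tokens, maxlen, pad='[PAD]'):
    """Fit a token list into exactly `maxlen` slots, BERT-style."""
    body = tokens[:maxlen - 2]               # reserve two slots for [CLS] and [SEP]
    seq = ['[CLS]'] + body + ['[SEP]']
    n_pad = maxlen - len(seq)                # slots left over after the real tokens
    mask = [1] * len(seq) + [0] * n_pad      # 1 = real token, 0 = padding
    return seq + [pad] * n_pad, mask

# Whitespace split is a stand-in for WordPiece tokenization.
tokens = 'what a beautiful movie'.split()

short_seq, short_mask = truncate_and_pad(tokens, maxlen=8)  # padded
long_seq, long_mask = truncate_and_pad(tokens, maxlen=5)    # truncated
print(short_seq, short_mask)
print(long_seq, long_mask)
```

With `maxlen=8` the four words fit and two padding slots are added; with `maxlen=5` the last word is dropped so that the two special tokens still fit.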

PYTHON

# name = "bert" selects the BERT model.

model = text.text_classifier(name = 'bert',
                             train_data = (X_train, y_train),
                             preproc = preproc)

OUTPUT

Is Multi-Label? False
maxlen is 500
done.

PYTHON

# batch size 6, recommended by the docs when maxlen is 500

learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
                   val_data = (X_test, y_test),
                   batch_size = 6)
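
A small batch size means many weight updates per epoch. As a quick worked calculation, assuming the final partial batch is kept rather than dropped:

```python
import math

train_rows = 25000   # from data_train.shape above
batch_size = 6       # value passed to ktrain.get_learner

# Assumption: the last, smaller batch is still used for an update.
steps_per_epoch = math.ceil(train_rows / batch_size)
last_batch_rows = train_rows - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 4167
print(last_batch_rows)   # 4
```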

PYTHON

# Optionally estimate a good learning rate with the learning rate finder:
# learner.lr_find()
# learner.lr_plot()

# With BERT on 25,000 reviews this search is very slow (it may take days), so we skip it here.
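
For intuition, a learning rate finder generally trains for a short while as the learning rate grows, then plots loss against learning rate. The sketch below only generates such an exponentially growing sequence of rates. The function name and the bounds are illustrative assumptions, and ktrain's own `lr_find()` may differ in its details.

```python
def lr_range_test_schedule(start_lr, end_lr, num_steps):
    """Learning rates growing exponentially from start_lr to end_lr.

    num_steps must be at least 2.
    """
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

# Illustrative bounds only; real searches use many more steps.
lrs = lr_range_test_schedule(start_lr=1e-7, end_lr=1e-1, num_steps=7)
for lr in lrs:
    print(f'{lr:.0e}')
```

Each step multiplies the rate by the same factor (here roughly 10), so small and large rates are covered evenly on a log scale.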

PYTHON

# fit() runs a plain training loop; fit_onecycle() additionally applies the one-cycle learning rate policy

learner.fit_onecycle(lr = 2e-5, epochs = 1)

predictor = ktrain.get_predictor(learner.model, preproc)
predictor.save('/content/drive/My Drive/bert')
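
To see roughly what a one-cycle policy does to the learning rate, here is a minimal sketch: a linear warm-up to the peak at the midpoint, followed by a linear decay. The helper `one_cycle_lr`, the `start_div` factor, and the 100-step run are assumptions made for illustration. The exact shape ktrain uses inside `fit_onecycle` may differ.

```python
def one_cycle_lr(step, total_steps, max_lr, start_div=10.0):
    """Linear warm-up to max_lr at the midpoint, then linear decay back down."""
    min_lr = max_lr / start_div          # hypothetical starting and ending rate
    half = total_steps / 2
    if step <= half:
        frac = step / half               # 0 -> 1 during warm-up
    else:
        frac = (total_steps - step) / half   # 1 -> 0 during decay
    return min_lr + frac * (max_lr - min_lr)

total_steps = 100                        # illustrative, not the real step count
schedule = [one_cycle_lr(s, total_steps, max_lr=2e-5)
            for s in range(total_steps + 1)]
print(schedule[0], schedule[50], schedule[100])
```

The rate starts low, peaks at `max_lr` halfway through, and returns to the starting value by the final step.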

PYTHON

#sample dataset to test on

data = ['this movie was horrible, the plot was really boring. acting was okay',
        'the fild is really sucked. there is not plot and acting was bad',
        'what a beautiful movie. great plot. acting was good. will see it again']

PYTHON

predictor.predict(data)

OUTPUT

['neg', 'neg', 'pos']

Interpretation of above results:

'this movie was horrible, the plot was really boring. acting was okay' - neg
'the fild is really sucked. there is not plot and acting was bad' - neg
'what a beautiful movie. great plot. acting was good. will see it again' - pos

PYTHON

# return_proba=True returns the predicted probability for each class

predictor.predict(data, return_proba=True)

OUTPUT

array([[0.99797565, 0.00202436],
       [0.99606663, 0.00393336],
       [0.00292433, 0.9970757 ]], dtype=float32)

PYTHON

#classes available

predictor.get_classes()

OUTPUT

['neg', 'pos']
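
The class order matters when reading the probability array: column 0 belongs to 'neg' and column 1 to 'pos'. As a small sketch, using the probabilities printed above and a hypothetical helper `top_label`, each row can be mapped back to its most likely class:

```python
classes = ['neg', 'pos']   # order returned by predictor.get_classes()

# Probabilities copied from the predict(..., return_proba=True) output above.
probs = [[0.99797565, 0.00202436],
         [0.99606663, 0.00393336],
         [0.00292433, 0.9970757]]

def top_label(row, classes):
    """Return (class name, probability) for the highest-scoring column."""
    best = max(range(len(row)), key=row.__getitem__)
    return classes[best], row[best]

labels = [top_label(row, classes)[0] for row in probs]
print(labels)   # ['neg', 'neg', 'pos']
```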

PYTHON

# saving the model and preprocessing steps to the local folder that is zipped and reloaded below

predictor.save('/content/bert')

BASH

zip -r /content/bert.zip /content/bert

OUTPUT

adding: content/bert/ (stored 0%)
  adding: content/bert/tf_model.h5 (deflated 11%)
  adding: content/bert/tf_model.preproc (deflated 52%)

PYTHON

#loading the model

predictor_load = ktrain.load_predictor('/content/bert')

PYTHON

#predicting the data

predictor_load.predict(data)

OUTPUT

['neg', 'neg', 'pos']

Conclusion

In this blog, we fine-tuned a pre-trained BERT model on the 50k-review IMDB sentiment dataset (25,000 training and 25,000 test reviews) using the ktrain one-cycle training API. We downloaded BERT's uncased weights, tokenized the reviews with the WordPiece tokenizer at maxlen=500, and fine-tuned with a learning rate of 2e-5 for a single epoch. The model labeled all three sample phrases correctly, including the misspelled, ungrammatical negative review ("the fild is really sucked"). We then saved and reloaded the predictor and confirmed that it returns the same predictions.

Key takeaways:

BERT's bidirectional self-attention lets each token attend to its full left and right context at once, unlike a unidirectional LSTM, which reads text in one direction. This is a large part of why it transfers well to new tasks with little fine-tuning.
ktrain's text.texts_from_df handles all the BERT preprocessing (WordPiece tokenization, [CLS]/[SEP] tokens, attention masks) in one call. It hides the boilerplate that the transformers library usually needs.
The one-cycle learning rate schedule (fit_onecycle) warms the learning rate up to a peak and then cools it down, which often trains faster and more stably than a constant rate. Even a single epoch can give strong results.
For production, predictor.save() stores both the model weights and the preprocessing pipeline together. So inference needs no extra setup beyond ktrain.load_predictor().

Next steps:

Compare BERT against its compressed variant, DistilBERT ("Smaller, Faster, Cheaper, Lighter"), to see the speed-accuracy trade-off.
Apply the same fine-tuning workflow to multi-class sentiment with a custom dataset to extend beyond binary classification.
Try learner.lr_find() and learner.lr_plot() to choose the learning rate empirically rather than using the fixed 2e-5 from this tutorial.
