Amazon and IMDB Review Sentiment Classification using SpaCy

Published by georgiannacambel

Sentiment Classification using SpaCy

What is NLP?

Natural Language Processing (NLP) is the field of Artificial Intelligence concerned with the processing and understanding of human language. Since its inception during the 1950s, machine understanding of language has played a pivotal role in translation, topic modeling, document indexing, information retrieval, and extraction.

Some Applications of NLP

  • Text Classification
  • Spam Filters
  • Voice text messaging
  • Sentiment analysis
  • Spell or grammar check
  • Chat bot
  • Search Suggestion
  • Search Autocorrect
  • Automatic Review Analysis system
  • Machine translation

spaCy installation

You can run the following commands:

!pip install -U spacy

!pip install -U spacy-lookups-data

!python -m spacy download en_core_web_sm

Scikit-learn installation

You can run the following command:

!pip install scikit-learn

Data Cleaning Options

  • Case Normalization
  • Removing Stop Words
  • Removing Punctuations or Special Symbols
  • Lemmatization or Stemming
  • Parts of Speech Tagging
  • Entity Detection
  • Bag of Words
  • TF-IDF

Bag of Words - The Simplest Word Embedding Technique

This is one of the simplest methods of embedding words into numerical vectors. It is not often used in practice because it oversimplifies language, but it is often the first embedding technique taught in a classroom setting. Any algorithm we apply in NLP works on numbers; we cannot feed raw text into it directly. Hence, the Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of each unique word.

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

  • A vocabulary of known words.
  • A measure of the presence of known words.

It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding"
[Figure: bag-of-words count matrix for doc1, doc2 and doc3]
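
A minimal sketch (my own, not from the original post) of how the count matrix for doc1, doc2 and doc3 could be built with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]

# the default token pattern drops single-character tokens such as "I",
# so we override it to keep every word
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # vocabulary of known words (get_feature_names() on older scikit-learn)
print(bow.toarray())                       # 3 x 5 matrix of word counts, one row per document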

Bag of Words and Tf-idf

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.

[Figure: the tf-idf weighting formula]
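
As a rough illustration (not part of the original notebook), the same three toy documents can be run through TfidfVectorizer, which combines counting and tf-idf weighting in one step:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]

tfidf_vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
weights = tfidf_vec.fit_transform(docs)

# words that occur in every document ("i", "am") receive a low idf,
# while rarer words ("yes", "kidding") receive a higher weight
print(tfidf_vec.get_feature_names_out())
print(weights.toarray().round(2))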

Pipeline in SpaCy

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.

The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

[Figure: the default spaCy processing pipeline — tokenizer, tagger, parser, entity recognizer]
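
You can inspect the components of a loaded pipeline yourself; a small sketch (the exact component names depend on the spaCy version and model):

import spacy

nlp = spacy.load('en_core_web_sm')
# spaCy v2 models typically report ['tagger', 'parser', 'ner'];
# v3 models add components such as 'tok2vec', 'attribute_ruler' and 'lemmatizer'
print(nlp.pipe_names)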

Datasets

You can get all the datasets used in this notebook from here.

Let's Get Started

Here we are importing the necessary libraries.

import spacy
from spacy import displacy

spacy.load() loads a model. When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed using the pipeline.

nlp = spacy.load('en_core_web_sm')
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)
doc
Apple, This is first sentence. and Google this is another one. here 3rd one is

Now we will see the tokens in doc.

for token in doc:
    print(token)
Apple
,
This
is
first
sentence
.
and
Google
this
is
another
one
.
here
3rd
one
is

create_pipe() creates a pipeline component. The sentencizer adds rule-based sentence segmentation without the dependency parse. Custom components can be added to the pipeline using the add_pipe method. Optionally, you can specify a component to add it before or after, or tell spaCy to add it first or last in the pipeline. We will add the sentencizer before the parser.

sent = nlp.create_pipe('sentencizer')
nlp.add_pipe(sent, before='parser')
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Apple, This is first sentence.
and Google this is another one.
here 3rd one is
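
Note that create_pipe() and passing the component object to add_pipe() is the spaCy v2 API used in this notebook. If you are on spaCy v3, the component is added by its registered name instead; roughly:

# spaCy v3 equivalent (sketch): pass the component's string name directly
nlp.add_pipe('sentencizer', before='parser')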

Stop words are words which are filtered out before or after processing of natural language data. They are commonly used words such as “the”, “a”, “an”, “in” which don't add significant meaning to the sentence. STOP_WORDS is the set of default stop words for the English language model in spaCy. We can see the stop words below.

from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
print(stopwords)
['move', 'again', 'during', 'herself', 'him', 'hereby', 'third', 'once', 'call', 'both', '‘ll', 'doing', 'something', 'when', '’ll', 'unless', 'thereafter', 'before', 'so', 'is', 'will', 'toward', 'has', 'whom', 'it', 'who', 'what', 'his', '‘s', 'towards', 'quite', 'below', 'alone', 'yourselves', 'which', 'does', 'ca', 'moreover', 'seems', 'or', 'first', 'here', 'various', 'n‘t', 'very', 'why', 'beyond', 'mine', 'themselves', 'twenty', 'really', 'almost', 'indeed', 'amongst', 'until', 'empty', 'everyone', 'should', 'bottom', 'five', 'among', 'also', 'over', "'s", 'via', 'against', 'just', 'above', 'twelve', 'although', 'could', 'hence', 'are', 'than', 'being', 'however', 'front', 'eight', 'already', 'three', 'two', 'have', 'while', 'beforehand', 'myself', '’ve', 'much', 'rather', 'seemed', 'back', '‘ve', 'from', 'every', 'other', 'between', 'of', 'serious', 'since', 'as', 'but', 'i', 'she', 'whoever', 'used', 'those', 'whatever', 'beside', 'someone', '’m', 'to', 'within', 'forty', 'sometimes', 'upon', 'by', 'into', 'regarding', 'hereupon', 'together', 'wherever', 'made', 'that', 'own', 'must', 'namely', 'had', 'hers', 'hereafter', 'perhaps', 'afterwards', 'part', 'another', 'next', 'across', 'nor', 'latter', 'get', 'this', 'our', 'whose', 'off', 'see', 'a', 'anyhow', 'former', "'ll", 'amount', 'becomes', 'same', 'full', 'himself', 'after', 'itself', 'they', 'how', 'using', "'re", 'somewhere', 'thus', 'somehow', 'too', 'because', 'still', 'us', 'ever', 'n’t', 'give', 'and', 'if', 'we', 'most', 'no', 'ours', 'became', 'for', 'may', 'fifty', 'everywhere', 'whenever', 'be', 'everything', 'an', 'whole', 'last', 'whether', '’s', '‘d', 'besides', 'along', 'all', 'say', 'might', 'seeming', 'on', 'neither', 'these', 'anywhere', '‘m', 'more', 'per', 'ourselves', 'otherwise', 'mostly', 'make', 'due', '’re', 'becoming', 'yours', 'each', 'thereby', 'any', 'onto', 'not', 'others', 'fifteen', 'were', 'many', 'would', 'though', 'either', 'keep', 'take', 'nevertheless', "'ve", 'about', 'you', 'therefore', 'thru', 'around', 'behind', 'else', 'he', 'its', 'throughout', 'four', 'further', 'herein', '’d', 're', 'am', 'where', 'do', 'well', "n't", 'side', 'whereupon', 'none', 'latterly', "'m", "'d", 'noone', 'at', 'whereas', 'even', 'anyone', 'nine', 'nowhere', 'down', 'did', 'them', 'name', 'thereupon', 'cannot', 'me', 'least', 'anyway', 'nothing', 'top', 'few', 'therein', 'yet', 'less', 'show', 'one', 'been', 'done', 'some', 'thence', 'her', 'up', 'can', 'put', 'whereafter', 'become', 'seem', 'nobody', 'only', 'enough', 'often', 'sometime', 'out', 'now', 'your', 'their', 'always', 'ten', 'under', 'please', 'six', 'yourself', 'then', 'wherein', 'except', 'eleven', 'meanwhile', 'whither', 'whereby', 'in', 'with', 'go', 'there', 'my', 'such', '‘re', 'anything', 'hundred', 'the', 'whence', 'was', 'never', 'sixty', 'formerly', 'several', 'without', 'through', 'elsewhere']

There are 326 stopwords.

len(stopwords)
326

Here we are printing all the tokens excluding the stopwords.

for token in doc:
    if token.is_stop == False:
        print(token)
Apple
,
sentence
.
Google
.
3rd

Lemmatization

In lemmatization, words are replaced by their root word (the lemma). For example, given the word went, the lemma would be go, since went is the past tense of go. Below we have printed each token and its lemma.

doc = nlp('run runs running runner')
for lem in doc:
    print(lem.text, lem.lemma_)
run run
runs run
running run
runner runner
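
To reproduce the went → go example from above, you can run a similar snippet (the lemmas shown depend slightly on the model version, but all three inflected forms are expected to map to go):

doc = nlp('go went gone going')
for lem in doc:
    print(lem.text, lem.lemma_)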

POS

Part-of-Speech (POS) tagging may be defined as the process of assigning one of the parts of speech to a given word. In simple words, POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. In the example below we print token.pos_ for each token.

  • DET means determiner
  • AUX means auxiliary
  • ADJ means adjective
  • ADP means adposition
  • NOUN as the name suggests means common noun
  • PUNCT means punctuation
doc = nlp('All is well at your end!')
for token in doc:
    print(token.text, token.pos_)
All DET
is AUX
well ADJ
at ADP
your DET
end NOUN
! PUNCT
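
Rather than memorizing these abbreviations, you can ask spaCy to describe any tag or label with spacy.explain():

for tag in ['DET', 'AUX', 'ADJ', 'ADP', 'NOUN', 'PUNCT']:
    print(tag, '->', spacy.explain(tag))  # e.g. DET -> determiner, ADJ -> adjective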

displacy visualizes dependencies and entities in your browser or in a notebook. displaCy is able to detect whether you’re working in a Jupyter notebook, and will return markup that can be rendered in a cell straight away. The command below shows the dependency parse.

displacy.render(doc, style = 'dep')

[displaCy dependency visualization: All/DET (nsubj), is/AUX, well/ADJ (advmod), at/ADP (prep), your/DET (poss), end/NOUN (pobj), !/PUNCT]

Entity Detection

Entity detection is a popular technique used in information extraction to identify and segment the entities and classify or categorize them under various predefined classes. It locates and classifies named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.")
doc
New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.
  • GPE means Countries, cities, states.
  • DATE means Absolute or relative dates or periods.
  • CARDINAL means Numerals that do not fall under another type.
  • PERSON means People, including fictional.
  • NORP means Nationalities or religious or political groups.
  • MONEY means Monetary values, including unit.
displacy.render(doc, style = 'ent')

New York City GPE on Tuesday DATE declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 CARDINAL people have contracted measles in the city since September DATE , mostly in Brooklyn GPE ’s Williamsburg GPE neighborhood. The order covers four CARDINAL Zip codes there, Mayor Bill de Blasio PERSON (D) said Tuesday DATE . The mandate orders all unvaccinated people in the area, including a concentration of Orthodox NORP Jews NORP , to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000 MONEY .
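
Besides the rendered markup, you can iterate over doc.ents directly to get each entity's text and label:

for ent in doc.ents:
    print(ent.text, '->', ent.label_, '-', spacy.explain(ent.label_))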

Text Classification

Here we have imported the necessary libraries.

  • pandas is used to load the dataset and perform operations on the dataframe.
  • TfidfVectorizer is used to convert the data from text to numbers.
  • Pipeline is used to create a pipeline.
  • train_test_split is used to split the dataset in training and testing dataset.
  • accuracy_score, classification_report and confusion_matrix are used for validation.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

read_csv is used to load the data into a dataframe. We have passed header=None to tell pandas that there is no header row in this file. We have used sep='\t' because yelp_labelled.txt is a tab-separated file. head() can be used to see the first 5 rows of the dataset.

data_yelp = pd.read_csv('datasets/yelp_labelled.txt', sep='\t', header = None)
data_yelp.head()
   0                                                    1
0  Wow... Loved this place.                             1
1  Crust is not good.                                   0
2  Not tasty and the texture was just nasty.            0
3  Stopped by during the late May bank holiday of...    1
4  The selection on the menu was great and so wer...    1

The dataset contains 2 columns. The first column has the review and the second column has the sentiment of the review. 0 indicates a negative sentiment and 1 indicates a positive sentiment. The dataset does not have column names hence we will name the columns as Review and Sentiment.

columns_name = ['Review', 'Sentiment']
data_yelp.columns = columns_name
data_yelp.head()
   Review                                               Sentiment
0  Wow... Loved this place.                             1
1  Crust is not good.                                   0
2  Not tasty and the texture was just nasty.            0
3  Stopped by during the late May bank holiday of...    1
4  The selection on the menu was great and so wer...    1

data_yelp contains 1000 rows i.e. 1000 reviews and each row has 2 columns.

data_yelp.shape
(1000, 2)

Similarly, we will now load the second dataset into data_amazon and assign column names.

data_amazon = pd.read_csv('datasets/amazon_cells_labelled.txt', sep = '\t', header = None)
data_amazon.columns = columns_name
data_amazon.head()
   Review                                               Sentiment
0  So there is no way for me to plug it in here i...    0
1  Good case, Excellent value.                          1
2  Great for the jawbone.                               1
3  Tied to charger for conversations lasting more...    0
4  The mic is great.                                    1

data_amazon contains 1000 rows i.e. 1000 reviews and each row has 2 columns.

data_amazon.shape
(1000, 2)

Now we will load the third dataset into data_imdb and assign column names.

data_imdb = pd.read_csv('datasets/imdb_labelled.txt', sep = '\t', header = None)
data_imdb.columns = columns_name
data_imdb.head()
   Review                                               Sentiment
0  A very, very, very slow-moving, aimless movie ...    0
1  Not sure who was more lost - the flat characte...    0
2  Attempting artiness with black & white and cle...    0
3  Very little music or anything to speak of.           0
4  The best scene in the movie was when Gerardo i...    1

data_imdb contains 748 rows i.e. 748 reviews and each row has 2 columns.

data_imdb.shape
(748, 2)

Now we are going to create data, which will contain all 3 datasets. We will append the 3 datasets one after the other to get the final dataset. ignore_index=True ignores the index values on the concatenation axis, so the resulting axis is labeled 0, …, n - 1. The final dataset has 2748 rows and 2 columns.

data = data_yelp.append([data_amazon, data_imdb], ignore_index=True)
data.shape
(2748, 2)
data.head()
   Review                                               Sentiment
0  Wow... Loved this place.                             1
1  Crust is not good.                                   0
2  Not tasty and the texture was just nasty.            0
3  Stopped by during the late May bank holiday of...    1
4  The selection on the menu was great and so wer...    1
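
A note on compatibility: DataFrame.append() works on the pandas version used here, but it has since been deprecated and was removed in pandas 2.0. On newer pandas the equivalent concatenation looks like this:

# equivalent of the append() call above on newer pandas versions
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
data.shape  # (2748, 2)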

Now we will see the distribution of Sentiment in our dataset. The value_counts() function returns a Series containing counts of unique values. In data there are 1386 reviews with positive sentiment and 1362 with negative sentiment.

data['Sentiment'].value_counts()
1    1386
0    1362
Name: Sentiment, dtype: int64

Now we will check if there are null values in our data using isnull(). As you can see there are no null values.

data.isnull().sum()
Review       0
Sentiment    0
dtype: int64

Cleaning - Tokenization and Lemmatization

string.punctuation is a pre-initialized string constant that contains all the punctuation characters.

import string
punct = string.punctuation
punct
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In the function text_data_cleaning() we first convert the sentence into tokens. Then for each token we perform lemmatization. If the lemma is not a pronoun (spaCy v2 returns the placeholder -PRON- as the lemma for pronouns), we convert the lemma to lower case; otherwise we convert the token itself to lower case. Finally we remove all stopwords and punctuation marks.

def text_data_cleaning(sentence):
    doc = nlp(sentence)
    
    # lemmatize; spaCy v2 returns '-PRON-' as the lemma for pronouns,
    # so keep the lower-cased token text in that case
    tokens = []
    for token in doc:
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)
    
    # drop stopwords and punctuation
    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens

text_data_cleaning("    Hello how are you. Like this video")
['hello', 'like', 'video']

Vectorization Feature Engineering (TF-IDF)

TfidfVectorizer() converts a collection of raw documents to a matrix of TF-IDF features. We have passed text_data_cleaning() as the tokenizer. LinearSVC is a faster implementation of Support Vector Classification for the case of a linear kernel.

tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)
classifier = LinearSVC()

X contains the reviews (the feature space) and y contains the labels.

X = data['Review']
y = data['Sentiment']

Here we are dividing the data into training data and test data using train_test_split() from sklearn which we have already imported. We are going to use 80% of the data for training the model and 20% of the data for testing. random_state controls the shuffling applied to the data before applying the split.

We can see that we have got 2198 samples in the training dataset and 550 samples in the test dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, X_test.shape
((2198,), (550,))

Pipeline enables us to apply a sequence of transforms with a final estimator. It sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit and transform methods, while the final estimator only needs to implement fit. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. The parameter passed to Pipeline is a list of (name, transform) tuples (implementing fit/transform) that are chained in order, with the last object an estimator.

clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])
clf.fit(X_train, y_train)
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function text_data_cleaning at 0x0000026300CBB158>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)

Now we will predict the labels for X_test.

y_pred = clf.predict(X_test)

classification_report() builds a text report showing the main classification metrics.

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.77      0.81      0.79       285
           1       0.78      0.74      0.76       265

    accuracy                           0.78       550
   macro avg       0.78      0.78      0.78       550
weighted avg       0.78      0.78      0.78       550

confusion_matrix() computes confusion matrix to evaluate the accuracy of a classification. By definition a confusion matrix C is such that C(i,j) is equal to the number of observations known to be in group i and predicted to be in group j. Thus in binary classification, the count of true negatives is C(0,0), false negatives is C(1,0), true positives is C(1,1) and false positives is C(0,1).

confusion_matrix(y_test, y_pred)
array([[230,  55],
       [ 68, 197]], dtype=int64)
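
Following that definition, the matrix above can be unpacked into named counts (a small sketch, not in the original notebook):

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
# with the matrix above this gives TN=230, FP=55, FN=68, TP=197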

Now we will predict the label of some random sentences.

clf.predict(['Wow, this is amazing lesson'])
array([1], dtype=int64)
clf.predict(['Wow, this sucks'])
array([0], dtype=int64)
clf.predict(['Worth of watching it. Please like it'])
array([1], dtype=int64)
clf.predict(['Loved it. Amazing'])
array([1], dtype=int64)
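
If you want to reuse the trained pipeline later without refitting, a minimal sketch using joblib (the filename sentiment_clf.joblib is just an example):

import joblib

# save the fitted tf-idf + LinearSVC pipeline to disk
joblib.dump(clf, 'sentiment_clf.joblib')

# later: load it back and predict as before
# (the custom tokenizer text_data_cleaning must be defined/importable when loading)
clf_loaded = joblib.load('sentiment_clf.joblib')
clf_loaded.predict(['Loved it. Amazing'])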

In this blog we saw some features of spaCy. Then we performed sentiment analysis by loading the data, pre-processing it and training our model. We used a tf-idf vectorizer and LinearSVC to train the model, and got an accuracy of 78%.