Amazon and IMDB Review Sentiment Classification using SpaCy
What is NLP?
Natural Language Processing (NLP) is the field of Artificial Intelligence concerned with the processing and understanding of human language. Since its inception during the 1950s, machine understanding of language has played a pivotal role in translation, topic modeling, document indexing, information retrieval, and extraction.
Some Applications of NLP
- Text Classification
- Spam Filters
- Voice text messaging
- Sentiment analysis
- Spell or grammar check
- Chat bot
- Search Suggestion
- Search Autocorrect
- Automatic Review Analysis system
- Machine translation
spaCy installation
You can run the following commands:
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
Scikit-learn installation
You can run the following command:
!pip install scikit-learn
Data Cleaning Options
- Case Normalization
- Removing Stop Words
- Removing Punctuations or Special Symbols
- Lemmatization or Stemming
- Parts of Speech Tagging
- Entity Detection
- Bag of Words
- TF-IDF
Bag of Words - The Simplest Word Embedding Technique
This is one of the simplest methods of embedding words into numerical vectors. It is not often used in practice because it oversimplifies language, but it is often the first embedding technique taught in a classroom setting. Any algorithm we apply in NLP works on numbers; we cannot feed raw text into it directly. Hence, the Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of each unique word.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding"
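As an illustration (this snippet is an addition, not part of the original notebook), here is a minimal sketch of building a bag-of-words representation of these three documents with scikit-learn's CountVectorizer, assuming a recent scikit-learn where get_feature_names_out() is available:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]

# Learn the vocabulary and count the occurrences of each word per document.
# Note: the default tokenizer drops single-character tokens such as "I".
cv = CountVectorizer()
bow = cv.fit_transform(docs)

print(cv.get_feature_names_out())   # the vocabulary of known words
print(bow.toarray())                # one count vector per document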
Bag of Words and Tf-idf
In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents.
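A minimal sketch of the same three documents weighted with TfidfVectorizer (the class used later in this notebook); again, this snippet is an addition and assumes a recent scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]

# Words that appear in fewer documents receive a higher idf, and therefore a higher weight.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))   # tf-idf weight of each word in each document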
Pipeline in SpaCy
When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.
The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.
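As a quick sanity check (a small addition, not in the original notebook), you can list the components of a loaded pipeline; the exact names depend on the model and the spaCy version:

import spacy

nlp = spacy.load('en_core_web_sm')
# For spaCy 2.x models this prints roughly ['tagger', 'parser', 'ner'];
# newer models include additional components such as 'tok2vec' and 'lemmatizer'.
print(nlp.pipe_names)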
Datasets
You can get all the datasets used in this notebook from here.
Let's Get Started
Here we are importing the necessary libraries.
import spacy
from spacy import displacy
spacy.load() loads a model. When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed using the pipeline.
nlp = spacy.load('en_core_web_sm')
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)
doc
Apple, This is first sentence. and Google this is another one. here 3rd one is
Now we will see the tokens in doc.
for token in doc:
    print(token)
Apple , This is first sentence . and Google this is another one . here 3rd one is
create_pipe() creates a pipeline component. sentencizer adds rule-based sentence segmentation without the dependency parse. Custom components can be added to the pipeline using the add_pipe method. Optionally, you can specify a component to add it before or after, or tell spaCy to add it first or last in the pipeline. We will add the sentencizer before the parser.
# spaCy 2.x API; in spaCy 3.x you would write nlp.add_pipe('sentencizer', before='parser')
sent = nlp.create_pipe('sentencizer')
nlp.add_pipe(sent, before='parser')
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Apple, This is first sentence.
and Google this is another one.
here 3rd one is
Stop words are words which are filtered out before or after processing of natural language data. They are commonly used words such as “the”, “a”, “an”, “in” which don't add significant meaning to the sentence. STOP_WORDS is the set of default stop words for the English language model in spaCy. We can see the stop words below.
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
print(stopwords)
['move', 'again', 'during', 'herself', 'him', 'hereby', 'third', 'once', 'call', 'both', '‘ll', 'doing', 'something', 'when', '’ll', 'unless', 'thereafter', 'before', 'so', 'is', 'will', 'toward', 'has', 'whom', 'it', 'who', 'what', 'his', '‘s', 'towards', 'quite', 'below', 'alone', 'yourselves', 'which', 'does', 'ca', 'moreover', 'seems', 'or', 'first', 'here', 'various', 'n‘t', 'very', 'why', 'beyond', 'mine', 'themselves', 'twenty', 'really', 'almost', 'indeed', 'amongst', 'until', 'empty', 'everyone', 'should', 'bottom', 'five', 'among', 'also', 'over', "'s", 'via', 'against', 'just', 'above', 'twelve', 'although', 'could', 'hence', 'are', 'than', 'being', 'however', 'front', 'eight', 'already', 'three', 'two', 'have', 'while', 'beforehand', 'myself', '’ve', 'much', 'rather', 'seemed', 'back', '‘ve', 'from', 'every', 'other', 'between', 'of', 'serious', 'since', 'as', 'but', 'i', 'she', 'whoever', 'used', 'those', 'whatever', 'beside', 'someone', '’m', 'to', 'within', 'forty', 'sometimes', 'upon', 'by', 'into', 'regarding', 'hereupon', 'together', 'wherever', 'made', 'that', 'own', 'must', 'namely', 'had', 'hers', 'hereafter', 'perhaps', 'afterwards', 'part', 'another', 'next', 'across', 'nor', 'latter', 'get', 'this', 'our', 'whose', 'off', 'see', 'a', 'anyhow', 'former', "'ll", 'amount', 'becomes', 'same', 'full', 'himself', 'after', 'itself', 'they', 'how', 'using', "'re", 'somewhere', 'thus', 'somehow', 'too', 'because', 'still', 'us', 'ever', 'n’t', 'give', 'and', 'if', 'we', 'most', 'no', 'ours', 'became', 'for', 'may', 'fifty', 'everywhere', 'whenever', 'be', 'everything', 'an', 'whole', 'last', 'whether', '’s', '‘d', 'besides', 'along', 'all', 'say', 'might', 'seeming', 'on', 'neither', 'these', 'anywhere', '‘m', 'more', 'per', 'ourselves', 'otherwise', 'mostly', 'make', 'due', '’re', 'becoming', 'yours', 'each', 'thereby', 'any', 'onto', 'not', 'others', 'fifteen', 'were', 'many', 'would', 'though', 'either', 'keep', 'take', 'nevertheless', "'ve", 'about', 'you', 'therefore', 'thru', 'around', 'behind', 'else', 'he', 'its', 'throughout', 'four', 'further', 'herein', '’d', 're', 'am', 'where', 'do', 'well', "n't", 'side', 'whereupon', 'none', 'latterly', "'m", "'d", 'noone', 'at', 'whereas', 'even', 'anyone', 'nine', 'nowhere', 'down', 'did', 'them', 'name', 'thereupon', 'cannot', 'me', 'least', 'anyway', 'nothing', 'top', 'few', 'therein', 'yet', 'less', 'show', 'one', 'been', 'done', 'some', 'thence', 'her', 'up', 'can', 'put', 'whereafter', 'become', 'seem', 'nobody', 'only', 'enough', 'often', 'sometime', 'out', 'now', 'your', 'their', 'always', 'ten', 'under', 'please', 'six', 'yourself', 'then', 'wherein', 'except', 'eleven', 'meanwhile', 'whither', 'whereby', 'in', 'with', 'go', 'there', 'my', 'such', '‘re', 'anything', 'hundred', 'the', 'whence', 'was', 'never', 'sixty', 'formerly', 'several', 'without', 'through', 'elsewhere']
There are 326 stopwords.
len(stopwords)
326
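As an aside (not part of the original notebook), STOP_WORDS is a plain Python set, so you can extend it with domain-specific words before building the stopwords list; "btw" below is just a hypothetical example word:

from spacy.lang.en.stop_words import STOP_WORDS

# Add a custom stop word to the set; any words added here will also be
# filtered out later, since our cleaning function checks against this set.
STOP_WORDS.add("btw")
print("btw" in STOP_WORDS)  # True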
Here we are printing all the tokens excluding the stopwords.
for token in doc:
    if not token.is_stop:
        print(token)
Apple , sentence . Google . 3rd
Lemmatization
In lemmatization, the words are replaced by their root words or words with a similar context. For example, given the word 'went', the lemma would be 'go', since 'went' is the past tense of 'go'. Below we have printed the tokens and the lemma for each token.
doc = nlp('run runs running runner')
for lem in doc:
    print(lem.text, lem.lemma_)
run run
runs run
running run
runner runner
POS
Part-of-Speech (POS) tagging may be defined as the process of assigning one of the parts of speech to a given word. In simple words, we can say that POS tagging is the task of labelling each word in a sentence with its appropriate part of speech. In the example below we have printed token.pos_ for each token.
- DET means determiner
- AUX means auxiliary
- ADJ means adjective
- ADP means adposition
- NOUN, as the name suggests, means common noun
- PUNCT means punctuation
doc = nlp('All is well at your end!')
for token in doc:
    print(token.text, token.pos_)
All DET
is AUX
well ADJ
at ADP
your DET
end NOUN
! PUNCT
displacy visualizes dependencies and entities in your browser or in a notebook. displaCy is able to detect whether you’re working in a Jupyter notebook, and will return markup that can be rendered in a cell straight away. The command below will show the dependencies.
displacy.render(doc, style = 'dep')
[displaCy dependency visualization: All/DET, is/AUX, well/ADJ, at/ADP, your/DET, end/NOUN, with arcs nsubj, advmod, prep, poss, pobj]
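The same dependency information is also available programmatically on each token; a small sketch (not part of the original notebook):

for token in doc:
    # token.dep_ is the dependency label, token.head is the token it attaches to
    print(token.text, token.dep_, token.head.text)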
Entity Detection
Entity detection is a popular technique used in information extraction to identify and segment entities and classify or categorize them under various predefined classes. It locates and classifies named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.") doc
New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.
- GPE means countries, cities, states.
- DATE means absolute or relative dates or periods.
- CARDINAL means numerals that do not fall under another type.
- PERSON means people, including fictional.
- NORP means nationalities or religious or political groups.
- MONEY means monetary values, including unit.
displacy.render(doc, style = 'ent')
New York City GPE on Tuesday DATE declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 CARDINAL people have contracted measles in the city since September DATE , mostly in Brooklyn GPE ’s Williamsburg GPE neighborhood. The order covers four CARDINAL Zip codes there, Mayor Bill de Blasio PERSON (D) said Tuesday DATE . The mandate orders all unvaccinated people in the area, including a concentration of Orthodox NORP Jews NORP , to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000 MONEY .
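If you prefer plain text over the visualizer, the same entities can be read directly from doc.ents; a small sketch (not part of the original notebook):

for ent in doc.ents:
    # spacy.explain() returns a short description of the entity label
    print(ent.text, ent.label_, spacy.explain(ent.label_))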
Text Classification
Here we have imported the necessary libraries.
- pandas is used to load the dataset and perform operations on the dataframe.
- TfidfVectorizer is used to convert the data from text to numbers.
- Pipeline is used to create a pipeline.
- train_test_split is used to split the dataset into training and testing sets.
- accuracy_score, classification_report and confusion_matrix are used for validation.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
read_csv is used to load the data into a dataframe. We have passed the header=None option to tell pandas that there is no header row in this file. We have used sep='\t' because yelp_labelled.txt is a tab-separated file. data.head() can be used to see the first 5 rows of the dataset.
data_yelp = pd.read_csv('datasets/yelp_labelled.txt', sep='\t', header=None)
data_yelp.head()
|   | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
The dataset contains 2 columns. The first column has the review and the second column has the sentiment of the review; 0 indicates a negative sentiment and 1 indicates a positive sentiment. The dataset does not have column names, hence we will name the columns Review and Sentiment.
columns_name = ['Review', 'Sentiment']
data_yelp.columns = columns_name
data_yelp.head()
|   | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
data_yelp contains 1000 rows, i.e. 1000 reviews, and each row has 2 columns.
data_yelp.shape
(1000, 2)
Similarly, we will now load the second dataset into data_amazon and assign column names.
data_amazon = pd.read_csv('datasets/amazon_cells_labelled.txt', sep='\t', header=None)
data_amazon.columns = columns_name
data_amazon.head()
|   | Review | Sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |
data_amazon contains 1000 rows, i.e. 1000 reviews, and each row has 2 columns.
data_amazon.shape
(1000, 2)
Now we will load the third dataset into data_imdb and assign column names.
data_imdb = pd.read_csv('datasets/imdb_labelled.txt', sep='\t', header=None)
data_imdb.columns = columns_name
data_imdb.head()
|   | Review | Sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |
data_imdb contains 748 rows, i.e. 748 reviews, and each row has 2 columns.
data_imdb.shape
(748, 2)
Now we are going to create data, which will contain all 3 datasets. We will append the 3 datasets one after the other to get the final dataset. ignore_index=True means the index values on the concatenation axis are not used; the resulting axis will be labeled 0, …, n - 1. The final dataset has 2748 rows and 2 columns.
data = data_yelp.append([data_amazon, data_imdb], ignore_index=True)
data.shape
(2748, 2)
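Note that DataFrame.append was removed in pandas 2.0. If you are running a newer pandas, the equivalent concatenation with pd.concat looks like this:

# Same result as the append() call above, using the current pandas API
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
data.shape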
data.head()
|   | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
Now we will see the distribution of Sentiment in our dataset. The value_counts() function is used to get a Series containing counts of unique values. In data there are 1386 positive reviews and 1362 negative reviews.
data['Sentiment'].value_counts()
1    1386
0    1362
Name: Sentiment, dtype: int64
Now we will check if there are null values in our data using isnull(). As you can see, there are no null values.
data.isnull().sum()
Review       0
Sentiment    0
dtype: int64
Cleaning - Tokenization and Lemmatization
string.punctuation is a pre-initialized string constant that contains all the punctuation characters.
import string
punct = string.punctuation
punct
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In the function text_data_cleaning() we first convert the sentence into tokens. Then for each token we perform lemmatization: if the lemma is not a pronoun we convert the lemma to lower case, else we convert the token to lower case. Finally we remove all the stopwords and punctuation marks.
def text_data_cleaning(sentence):
    doc = nlp(sentence)
    tokens = []
    for token in doc:
        # "-PRON-" is the lemma spaCy 2.x assigns to all pronouns
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)
    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens

text_data_cleaning("    Hello how are you. Like this video")
['hello', 'like', 'video']
Vectorization Feature Engineering (TF-IDF)
TfidfVectorizer() converts a collection of raw documents to a matrix of TF-IDF features. We have passed text_data_cleaning() as the tokenizer. LinearSVC is a faster implementation of Support Vector Classification for the case of a linear kernel.
tfidf = TfidfVectorizer(tokenizer=text_data_cleaning)
classifier = LinearSVC()
X contains the feature space and y contains the labels.
X = data['Review']
y = data['Sentiment']
Here we are dividing the data into training data and test data using train_test_split() from sklearn, which we have already imported. We are going to use 80% of the data for training the model and 20% of the data for testing. random_state controls the shuffling applied to the data before the split.

We can see that we have got 2198 samples in the training dataset and 550 samples in the test dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
((2198,), (550,))
Pipeline enables us to apply a list of transforms together with a final estimator; it applies them sequentially. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods, while the final estimator only needs to implement fit. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. The parameter passed to Pipeline is a list of (name, transform) tuples (implementing fit/transform) that are chained in the given order, with the last object an estimator.
clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])
clf.fit(X_train, y_train)
Pipeline(memory=None, steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=<class 'numpy.float64'>, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=<function text_data_cleaning at 0x0000026300CBB158>, use_idf=True, vocabulary=None)), ('clf', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False)
Now we will predict the labels for X_test.
y_pred = clf.predict(X_test)
classification_report() builds a text report showing the main classification metrics.
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.77      0.81      0.79       285
           1       0.78      0.74      0.76       265

    accuracy                           0.78       550
   macro avg       0.78      0.78      0.78       550
weighted avg       0.78      0.78      0.78       550
confusion_matrix() computes the confusion matrix to evaluate the accuracy of a classification. By definition a confusion matrix C is such that C(i,j) is equal to the number of observations known to be in group i and predicted to be in group j. Thus in binary classification, the count of true negatives is C(0,0), false negatives is C(1,0), true positives is C(1,1) and false positives is C(0,1).
confusion_matrix(y_test, y_pred)
array([[230,  55],
       [ 68, 197]], dtype=int64)
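accuracy_score, which we imported earlier but have not used yet, gives the overall accuracy directly (this line is an addition, not part of the original notebook):

print(accuracy_score(y_test, y_pred))  # about 0.78, i.e. (230 + 197) / 550 from the confusion matrix above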
Now we will predict the label of some random sentences.
clf.predict(['Wow, this is amazing lesson'])
array([1], dtype=int64)
clf.predict(['Wow, this sucks'])
array([0], dtype=int64)
clf.predict(['Worth of watching it. Please like it'])
array([1], dtype=int64)
clf.predict(['Loved it. Amazing'])
array([1], dtype=int64)
In this blog we saw some features of SpaCy. Then we went ahead and performed sentiment analysis by loading the data, pre-processing it and then training our model. We used tf-idf vectorizer and Linear SVC to train the model. We got an accuracy of 78%.