Sentiment Classification with spaCy

Classify Amazon, IMDb, and Yelp reviews using spaCy's tokenization and scikit-learn. Build a machine learning pipeline to predict text sentiment.

Sep 12, 2020 | Updated May 17, 2026 | 30 min read

Topics You Will Master

Loading and concatenating review datasets from Yelp, Amazon, and IMDb
Understanding spaCy's tokenization, lemmatization, and named entity recognition
Building a custom text cleaning function using spaCy lemmas and stop words
Training and evaluating a TF-IDF + LinearSVC classifier inside a scikit-learn Pipeline

Sentiment analysis is one of the most practical applications of NLP. Businesses use it to process thousands of customer reviews automatically, without anyone having to read them one by one.

This tutorial builds a sentiment classifier in Python using reviews from Yelp, Amazon, and IMDb. You'll process text with spaCy's linguistic features and train a LinearSVC model inside a scikit-learn pipeline.

Prerequisites: Python 3.x, spaCy (which includes the displaCy visualizer), scikit-learn, and pandas.

Datasets used in this tutorial: Datasets for Sentiment Classification on GitHub

Natural Language Processing Concepts

Natural Language Processing (NLP) is the field of AI concerned with making computers understand human language. It started in the 1950s with rule-based translation and has grown into the systems that now run search engines, chatbots, and machine translation.

Applications of NLP

NLP is used across many domains to automate language understanding:

  • Text Classification: Automatically sorting emails or documents into categories.
  • Spam Filters: Detecting and blocking unsolicited messages.
  • Voice Text Messaging: Transcribing spoken language into written text.
  • Sentiment Analysis: Detecting opinion or emotion in a block of text.
  • Spell or Grammar Check: Suggesting corrections for writing errors.
  • Chatbots: Responding to users to answer questions or solve issues.
  • Search Auto-suggestions and Autocorrect: Predicting and correcting search terms.
  • Automatic Review Analysis: Parsing feedback to extract customer insights.
  • Machine Translation: Converting text between languages.

Data Cleaning Techniques

Before feeding text into a machine learning model, raw strings need to be cleaned and converted to numerical features.

Common preprocessing steps include:

  • Case Normalization: Converting all text to lowercase.
  • Removing Stop Words: Filtering out common words with little semantic value (e.g., "the", "is", "at").
  • Removing Punctuation or Special Symbols: Stripping noise like exclamation marks or brackets.
  • Lemmatization or Stemming: Reducing words to their base form (e.g., "running" to "run").
  • Parts of Speech Tagging: Identifying the grammatical category of each word.
  • Entity Detection: Identifying named entities such as people, dates, and locations.

Bag of Words and Word Embeddings

A Bag of Words (BoW) represents a document by counting how often each word appears. It ignores word order and grammar, focusing only on frequency.

The three documents below illustrate how this works:

PYTHON
doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding"

The table below shows how BoW constructs a document-term matrix by counting each unique word across the documents:

[Figure: Bag of Words document-term matrix showing word counts across three sample documents]
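As a rough sketch of what such a matrix contains, the counts for the three sample documents can be reproduced with the standard library alone. The lowercasing and whitespace splitting are simplifying assumptions, not what spaCy or scikit-learn do internally.

```python
from collections import Counter

# The three sample documents from above.
docs = {
    "doc1": "I am high",
    "doc2": "Yes I am high",
    "doc3": "I am kidding",
}

# Assumption: lowercase and split on whitespace (a deliberately naive tokenizer).
tokenized = {name: text.lower().split() for name, text in docs.items()}

# The vocabulary is every unique word across all documents, sorted for a stable column order.
vocab = sorted({word for tokens in tokenized.values() for word in tokens})

# One row per document, one count per vocabulary word (missing words count as 0).
matrix = {
    name: [Counter(tokens)[word] for word in vocab]
    for name, tokens in tokenized.items()
}

print(vocab)
for name, row in matrix.items():
    print(name, row)
```

Each row has the same length as the vocabulary, which is what lets a model treat every document as a fixed-size numeric vector.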

TF-IDF Vectorization

TF-IDF scores words by how common they are in one document versus across all documents. A word that appears often in one review but rarely elsewhere gets a high score. Common words like "the" score near zero.

The chart shows how TF-IDF filters out common words while amplifying distinctive terms:

[Figure: TF-IDF explanation chart showing document frequency and term relevance]
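The idea can be worked through by hand on the three Bag of Words documents. This is a minimal sketch that assumes the textbook formula (term frequency times the natural log of total documents over document frequency) and a naive whitespace tokenizer.

```python
import math
from collections import Counter

docs = {
    "doc1": "I am high",
    "doc2": "Yes I am high",
    "doc3": "I am kidding",
}

# Assumption: lowercase and split on whitespace.
tokenized = {name: text.lower().split() for name, text in docs.items()}
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for tokens in tokenized.values() for word in set(tokens))

def tfidf(tokens):
    """Score each word as (relative term frequency) * ln(n_docs / document frequency)."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / df[word])
        for word, count in counts.items()
    }

scores = {name: tfidf(tokens) for name, tokens in tokenized.items()}
for name, row in scores.items():
    print(name, {word: round(value, 3) for word, value in row.items()})
```

Words that appear in all three documents ("i", "am") score exactly zero here, while "kidding", which appears in only one document, scores highest. scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term and normalizes each row, so its numbers will differ from this sketch.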

spaCy Pipelines and Installation

Library Installation

Install spaCy and the small English language model (en_core_web_sm):

PYTHON
# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm

Also install scikit-learn:

PYTHON
# pip install scikit-learn

The spaCy Processing Pipeline

When spaCy's nlp object processes a string, it tokenizes the text into a Doc object, then runs it through a series of pipeline components that handle POS tagging, dependency parsing, and named entity recognition.

The diagram outlines the default pipeline flow:

[Figure: spaCy processing pipeline architecture showing tokenizer, tagger, parser, and NER steps]

Basic Text Processing with spaCy

Import spaCy and its visualization module, displaCy:

PYTHON
import spacy
from spacy import displacy

Load the English language model and process a sample sentence:

PYTHON
nlp = spacy.load('en_core_web_sm')
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)
doc

Output:

PLAINTEXT
Apple, This is first sentence. and Google this is another one. here 3rd one is

Iterate through the document to inspect each token:

PYTHON
for token in doc:
    print(token)
OUTPUT
Apple
,
This
is
first
sentence
.
and
Google
this
is
another
one
.
here
3rd
one
is

Sentence Segmentation

You can add extra components to spaCy's pipeline. The built-in sentencizer handles rule-based sentence segmentation without running a full dependency parse. Add it before the parser component and print each sentence:

PYTHON
# spaCy v3: add a built-in component by its string name.
# (spaCy v2 required nlp.create_pipe('sentencizer') first and passed the object.)
nlp.add_pipe('sentencizer', before='parser')
doc = nlp(text)
for sent in doc.sents:
    print(sent)

Three sentences, segmented by spaCy's rules:

PLAINTEXT
Apple, This is first sentence.
and Google this is another one.
here 3rd one is

Stop Words Filtering

Stop words are common words that carry little meaning on their own, like "the" or "is". Import spaCy's built-in English stop word list:

PYTHON
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
print(stopwords)
OUTPUT
['move', 'again', 'during', 'herself', 'him', 'hereby', 'third', 'once', 'call', 'both', ''ll', 'doing', 'something', 'when', ''ll', 'unless', 'thereafter', 'before', 'so', 'is', 'will', 'toward', 'has', 'whom', 'it', 'who', 'what', 'his', ''s', 'towards', 'quite', 'below', 'alone', 'yourselves', 'which', 'does', 'ca', 'moreover', 'seems', 'or', 'first', 'here', 'various', "n't", 'very', 'why', 'beyond', 'mine', 'themselves', 'twenty', 'really', 'almost', 'indeed', 'amongst', 'until', 'empty', 'everyone', 'should', 'bottom', 'five', 'among', 'also', 'over', "'s", 'via', 'against', 'just', 'above', 'twelve', 'although', 'could', 'hence', 'are', 'than', 'being', 'however', 'front', 'eight', 'already', 'three', 'two', 'have', 'while', 'beforehand', 'myself', ''ve', 'much', 'rather', 'seemed', 'back', ''ve', 'from', 'every', 'other', 'between', 'of', 'serious', 'since', 'as', 'but', 'i', 'she', 'whoever', 'used', 'those', 'whatever', 'beside', 'someone', ''m', 'to', 'within', 'forty', 'sometimes', 'upon', 'by', 'into', 'regarding', 'hereupon', 'together', 'wherever', 'made', 'that', 'own', 'must', 'namely', 'had', 'hers', 'hereafter', 'perhaps', 'afterwards', 'part', 'another', 'next', 'across', 'nor', 'latter', 'get', 'this', 'our', 'whose', 'off', 'see', 'a', 'anyhow', 'former', "'ll", 'amount', 'becomes', 'same', 'full', 'himself', 'after', 'itself', 'they', 'how', 'using', "'re", 'somewhere', 'thus', 'somehow', 'too', 'because', 'still', 'us', 'ever', "n't", 'give', 'and', 'if', 'we', 'most', 'no', 'ours', 'became', 'for', 'may', 'fifty', 'everywhere', 'whenever', 'be', 'everything', 'an', 'whole', 'last', 'whether', ''s', ''d', 'besides', 'along', 'all', 'say', 'might', 'seeming', 'on', 'neither', 'these', 'anywhere', ''m', 'more', 'per', 'ourselves', 'otherwise', 'mostly', 'make', 'due', ''re', 'becoming', 'yours', 'each', 'thereby', 'any', 'onto', 'not', 'others', 'fifteen', 'were', 'many', 'would', 'though', 'either', 'keep', 'take', 'nevertheless', "'ve", 
'about', 'you', 'therefore', 'thru', 'around', 'behind', 'else', 'he', 'its', 'throughout', 'four', 'further', 'herein', ''d', 're', 'am', 'where', 'do', 'well', "n't", 'side', 'whereupon', 'none', 'latterly', "'m", "'d", 'noone', 'at', 'whereas', 'even', 'anyone', 'nine', 'nowhere', 'down', 'did', 'them', 'name', 'thereupon', 'cannot', 'me', 'least', 'anyway', 'nothing', 'top', 'few', 'therein', 'yet', 'less', 'show', 'one', 'been', 'done', 'some', 'thence', 'her', 'up', 'can', 'put', 'whereafter', 'become', 'seem', 'nobody', 'only', 'enough', 'often', 'sometime', 'out', 'now', 'your', 'their', 'always', 'ten', 'under', 'please', 'six', 'yourself', 'then', 'wherein', 'except', 'eleven', 'meanwhile', 'whither', 'whereby', 'in', 'with', 'go', 'there', 'my', 'such', ''re', 'anything', 'hundred', 'the', 'whence', 'was', 'never', 'sixty', 'formerly', 'several', 'without', 'through', 'elsewhere']

Check how many stop words there are:

PYTHON
len(stopwords)
OUTPUT
326

Filter out stop words by checking the is_stop attribute on each token:

PYTHON
for token in doc:
    if not token.is_stop:
        print(token)

After filtering, only non-stop tokens remain:

PLAINTEXT
Apple
,
sentence
.
Google
.
3rd
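For a reusable version of this filter, a list comprehension can collect the surviving tokens and drop punctuation at the same time through the is_punct attribute. This sketch assumes a blank English pipeline from spacy.blank("en"), which should be enough here because is_stop and is_punct are lexical attributes that need no trained model. The helper name remove_noise is hypothetical.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is sufficient,
# since is_stop and is_punct do not depend on trained components.
blank_nlp = spacy.blank("en")

def remove_noise(text):
    """Return the tokens of `text` that are neither stop words nor punctuation."""
    return [
        token.text
        for token in blank_nlp(text)
        if not token.is_stop and not token.is_punct
    ]

kept = remove_noise("Apple, This is first sentence.")
print(kept)
```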

Lemmatization

Lemmatization maps words back to their base form. "runs", "running", and "ran" all reduce to "run". Test this on a small string:

PYTHON
doc = nlp('run runs running runner')
for lem in doc:
    print(lem.text, lem.lemma_)

"runs" and "running" map back to "run", but "runner" stays as-is since it's a different word:

PLAINTEXT
run run
runs run
running run
runner runner

Part-of-Speech (POS) Tagging

POS tagging assigns a grammatical category to each word: noun, verb, adjective, and so on. Run it on a short sentence:

PYTHON
doc = nlp('All is well at your end!')
for token in doc:
    print(token.text, token.pos_)

Each token appears alongside its POS tag:

PLAINTEXT
All DET
is AUX
well ADJ
at ADP
your DET
end NOUN
! PUNCT

Use displaCy to visualize the dependency parse inside your notebook:

PYTHON
displacy.render(doc, style = 'dep')

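displaCy renders the dependency tree as an SVG graphic in the notebook. Each token is shown with its POS tag (All DET, is AUX, well ADJ, at ADP, your DET, end! NOUN), and the arcs between tokens carry the dependency labels nsubj, advmod, prep, poss, and pobj.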

Named Entity Recognition (NER)

NER finds named entities in text, such as people, places, dates, and monetary amounts, and labels each one by type. Run it on a longer paragraph:

PYTHON
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn's Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.")
doc

The parsed doc:

PLAINTEXT
New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn's Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.

Render the named entities with displaCy:

PYTHON
displacy.render(doc, style = 'ent')

displaCy highlights each entity with its type label (GPE, DATE, CARDINAL, PERSON, NORP, MONEY):

New York City GPE on Tuesday DATE declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 CARDINAL people have contracted measles in the city since September DATE , mostly in Brooklyn GPE 's Williamsburg GPE neighborhood. The order covers four CARDINAL Zip codes there, Mayor Bill de Blasio PERSON (D) said Tuesday DATE . The mandate orders all unvaccinated people in the area, including a concentration of Orthodox NORP Jews NORP , to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000 MONEY .

Building the Sentiment Classifier

With the spaCy basics covered, it's time to build the classifier. The pipeline combines a custom spaCy tokenizer with a TF-IDF vectorizer and a LinearSVC trained on reviews from Yelp, Amazon, and IMDb.

Importing Machine Learning Libraries

Import the necessary modules from pandas and scikit-learn:

PYTHON
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Loading and Merging the Datasets

Load the Yelp review dataset. Reviews are tab-separated, with 0 for negative and 1 for positive:

PYTHON
data_yelp = pd.read_csv('datasets/yelp_labelled.txt', sep='\t', header = None)
data_yelp.head()
OUTPUT
   0                                                  1
0  Wow... Loved this place.                           1
1  Crust is not good.                                 0
2  Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1

Assign column names:

PYTHON
columns_name = ['Review', 'Sentiment']
data_yelp.columns = columns_name
data_yelp.head()
OUTPUT
   Review                                             Sentiment
0  Wow... Loved this place.                           1
1  Crust is not good.                                 0
2  Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1

Check the shape:

PYTHON
data_yelp.shape
OUTPUT
(1000, 2)

Load the Amazon dataset the same way:

PYTHON
data_amazon = pd.read_csv('datasets/amazon_cells_labelled.txt', sep = '\t', header = None)
data_amazon.columns = columns_name
data_amazon.head()
OUTPUT
   Review                                             Sentiment
0  So there is no way for me to plug it in here i...  0
1  Good case, Excellent value.                        1
2  Great for the jawbone.                             1
3  Tied to charger for conversations lasting more...  0
4  The mic is great.                                  1
PYTHON
data_amazon.shape
OUTPUT
(1000, 2)

Load IMDb:

PYTHON
data_imdb = pd.read_csv('datasets/imdb_labelled.txt', sep = '\t', header = None)
data_imdb.columns = columns_name
data_imdb.head()
OUTPUT
   Review                                             Sentiment
0  A very, very, very slow-moving, aimless movie ...  0
1  Not sure who was more lost - the flat characte...  0
2  Attempting artiness with black & white and cle...  0
3  Very little music or anything to speak of.         0
4  The best scene in the movie was when Gerardo i...  1

IMDb has fewer reviews than the other two:

PYTHON
data_imdb.shape
OUTPUT
(748, 2)

Concatenate the three datasets:

PYTHON
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
data.shape
OUTPUT
(2748, 2)
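The role of ignore_index is easiest to see on toy frames. A minimal sketch, assuming two made-up two-row frames (the names frame_a and frame_b and their contents are illustrative, not taken from the datasets):

```python
import pandas as pd

# Made-up frames for illustration only.
frame_a = pd.DataFrame({"Review": ["Great phone.", "Battery died."], "Sentiment": [1, 0]})
frame_b = pd.DataFrame({"Review": ["Loved the film.", "Too slow."], "Sentiment": [1, 0]})

# Without ignore_index each frame keeps its own 0..n-1 labels, so labels repeat.
kept_index = pd.concat([frame_a, frame_b])

# With ignore_index=True the result gets a fresh 0..n-1 index.
fresh_index = pd.concat([frame_a, frame_b], ignore_index=True)

print(list(kept_index.index))
print(list(fresh_index.index))
```

A unique index matters later on: label-based lookups such as data.loc[0] would otherwise return one row per source file instead of a single review.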

Check the first few rows:

PYTHON
data.head()
OUTPUT
   Review                                             Sentiment
0  Wow... Loved this place.                           1
1  Crust is not good.                                 0
2  Not tasty and the texture was just nasty.          0
3  Stopped by during the late May bank holiday of...  1
4  The selection on the menu was great and so wer...  1

Check the class distribution:

PYTHON
data['Sentiment'].value_counts()

The dataset is well balanced: 1,386 positive reviews and 1,362 negative:

PLAINTEXT
1    1386
0    1362
Name: Sentiment, dtype: int64

Check for missing values:

PYTHON
data.isnull().sum()

No nulls anywhere:

PLAINTEXT
Review       0
Sentiment    0
dtype: int64

Text Preprocessing Function

The cleaning function will strip punctuation. Import Python's string module to see what those characters are:

PYTHON
import string
punct = string.punctuation
punct
OUTPUT
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

Define the cleaning function. It parses each sentence with spaCy, lemmatizes the tokens, lowercases them, and removes stop words and punctuation:

PYTHON
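def text_data_cleaning(sentence):
    """Return lowercase lemmas of a sentence, minus stop words and punctuation."""
    doc = nlp(sentence)

    tokens = []
    for token in doc:
        # spaCy v2 lemmatizes every pronoun to the placeholder "-PRON-";
        # keep the lowercased surface form in that case.
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            temp = token.lower_
        tokens.append(temp)

    # Drop stop words and punctuation. Because punct is a string, "in" is a
    # substring test, so empty strings left by whitespace tokens are dropped too.
    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens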

text_data_cleaning("    Hello how are you. Like this video")

The function returns:

PLAINTEXT
['hello', 'like', 'video']

Pipeline and Model Training

Initialize the vectorizer with the custom tokenizer and define the classifier:

PYTHON
tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)
classifier = LinearSVC()

Split the data into reviews (X) and sentiment labels (y):

PYTHON
X = data['Review']
y = data['Sentiment']

Split into 80% training, 20% test:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, X_test.shape
OUTPUT
((2198,), (550,))
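The split sizes follow from simple arithmetic on the 2,748 rows. The sketch below assumes that a fractional test_size is rounded up to a whole number of samples and that the remainder goes to training, which matches the shapes printed above.

```python
import math

n_samples = 2748   # total reviews after concatenation
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the test count is rounded up; training gets what is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```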

Build and fit the pipeline:

PYTHON
clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])
clf.fit(X_train, y_train)
OUTPUT
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function text_data_cleaning at 0x...>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
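Once a pipeline like this is fitted, its steps can be inspected through named_steps to see which terms push a prediction toward each class. The sketch below is a toy illustration, not the tutorial's model: the reviews are made up, and the vectorizer's default tokenizer stands in for text_data_cleaning so the example runs without spaCy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy reviews (not from the datasets); 1 = positive, 0 = negative.
toy_reviews = [
    "great food", "great service", "great place", "loved it",
    "terrible food", "terrible service", "terrible place", "hated it",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Assumption: the default tokenizer replaces the spaCy-based cleaning function.
toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC(random_state=0)),
])
toy_clf.fit(toy_reviews, toy_labels)

# Map each vocabulary term to its learned weight (positive pushes toward class 1).
vocabulary = toy_clf.named_steps["tfidf"].vocabulary_
coefficients = toy_clf.named_steps["clf"].coef_[0]
weights = {term: coefficients[index] for term, index in vocabulary.items()}

for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{term:10s} {weight:+.3f}")
```

Terms that occur only in positive reviews, such as "great", end up with positive weights, and terms that occur only in negative reviews end up with negative ones. The same two lookups work on the tutorial's clf object, using the step names 'tfidf' and 'clf'.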

Model Evaluation and Predictions

Evaluating Model Performance

Generate predictions on the test set:

PYTHON
y_pred = clf.predict(X_test)

Print the classification report:

PYTHON
print(classification_report(y_test, y_pred))

The model hits 78% accuracy on the test set:

PLAINTEXT
              precision    recall  f1-score   support

           0       0.77      0.81      0.79       285
           1       0.78      0.74      0.76       265

    accuracy                           0.78       550
   macro avg       0.78      0.78      0.78       550
weighted avg       0.78      0.78      0.78       550

Print the confusion matrix:

PYTHON
confusion_matrix(y_test, y_pred)

Rows are actual labels, columns are predicted:

PLAINTEXT
array([[230,  55],
       [ 68, 197]], dtype=int64)
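The figures in the classification report can be recomputed from these four counts. A short worked example, using only the numbers printed above:

```python
# Counts copied from the confusion matrix (rows: actual, columns: predicted).
tn, fp = 230, 55   # actual negative: predicted negative, predicted positive
fn, tp = 68, 197   # actual positive: predicted negative, predicted positive

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 (positive)
precision_pos = tp / (tp + fp)   # of predicted positives, how many were right
recall_pos = tp / (tp + fn)      # of actual positives, how many were found

# Class 0 (negative)
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

print(f"accuracy        {accuracy:.2f}")
print(f"class 0  precision {precision_neg:.2f}  recall {recall_neg:.2f}")
print(f"class 1  precision {precision_pos:.2f}  recall {recall_pos:.2f}")
```

The model misses positive reviews (68 false negatives) somewhat more often than it mislabels negative ones (55 false positives), which is why recall is lower for class 1 than for class 0.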

Testing Custom Reviews

Test the classifier on custom reviews:

PYTHON
clf.predict(['Wow, this is amazing lesson'])

Returns 1 (positive):

PLAINTEXT
array([1], dtype=int64)
PYTHON
clf.predict(['Wow, this sucks'])

Returns 0 (negative):

PLAINTEXT
array([0], dtype=int64)
PYTHON
clf.predict(['Worth of watching it. Please like it'])
OUTPUT
array([1], dtype=int64)
PYTHON
clf.predict(['Loved it. Amazing'])
OUTPUT
array([1], dtype=int64)

Conclusion

The model was trained on 2,198 of the 2,748 reviews from Yelp, Amazon, and IMDb and reached 78% accuracy on the 550 held-out reviews. Along the way you used spaCy's tokenization, lemmatization, and NER, wrapped a custom cleaning function into a scikit-learn TfidfVectorizer, and chained everything inside a Pipeline.

Key takeaways:

  • spaCy's lemmatizer maps inflected forms like "loved", "loves", and "loving" to the same lemma, so their counts are pooled into a single TF-IDF feature.
  • Any callable that takes a string and returns a list of tokens can plug into TfidfVectorizer via the tokenizer parameter.
  • Putting preprocessing and modeling inside a Pipeline keeps the training code clean and guards against data leakage, because the vectorizer is fitted on the training data only.
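The leakage point can be made concrete with a toy split. This sketch uses made-up documents and calls the vectorizer directly to mimic what a Pipeline does: fit on the training data, then only transform the test data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy split, for illustration only.
train_docs = ["good food", "bad service"]
test_docs = ["excellent food"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)                       # learn vocabulary and IDF from training data only
test_matrix = vectorizer.transform(test_docs)    # reuse the fitted state, never refit

print(sorted(vectorizer.vocabulary_))
print("excellent" in vectorizer.vocabulary_)
print(test_matrix.shape, test_matrix.nnz)
```

The word "excellent" never enters the vocabulary because it appears only in the test document, so nothing about the test set influences the learned features.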
