Sentiment analysis is one of the most common applications of Natural Language Processing (NLP). It allows businesses to automatically track customer feedback, analyze product reviews, and monitor social media sentiment at scale. Instead of reading thousands of individual reviews manually, a machine learning classifier can instantly categorize text as positive or negative.
In this tutorial, you will build an end-to-end sentiment classification pipeline in Python. You will load review datasets from Yelp, Amazon, and IMDb, preprocess and tokenize the text using spaCy's linguistic features, and train a LinearSVC model using a scikit-learn pipeline.
Prerequisites: Python 3.x, spaCy (which bundles the displaCy visualizer), scikit-learn, and pandas.
Datasets used in this tutorial: Datasets for Sentiment Classification on GitHub
Natural Language Processing Concepts
Natural Language Processing (NLP) is the branch of artificial intelligence that empowers computers to process, understand, and interpret human language. Since the 1950s, NLP has evolved from simple rule-based translations to deep learning models that power modern search engines, translation tools, and conversational agents.
Applications of NLP
In practice, NLP is used across a wide variety of domains to automate human-like language understanding:
- Text Classification: Automatically sorting emails or documents into categories.
- Spam Filters: Detecting and blocking unsolicited messages.
- Voice Text Messaging: Transcribing spoken language into written text.
- Sentiment Analysis: Detecting emotion or opinion in a block of text.
- Spell or Grammar Check: Suggesting corrections for writing errors.
- Chatbots: Conversing with users to solve issues or answer queries.
- Search Auto-suggestions and Autocorrect: Predicting and correcting search terms in real time.
- Automatic Review Analysis: Parsing feedback to extract key customer insights.
- Machine Translation: Translating text between different languages.
Data Cleaning Techniques
Before feeding text into machine learning algorithms, the raw strings must be cleaned and transformed into numerical features.
Standard preprocessing steps include:
- Case Normalization: Converting all text to lowercase to ensure consistency.
- Removing Stop Words: Filtering out common words that add little semantic value (e.g., "the", "is", "at").
- Removing Punctuation or Special Symbols: Stripping out noise like exclamation marks or brackets.
- Lemmatization or Stemming: Reducing words to their base or dictionary form (e.g., "running" to "run").
- Part-of-Speech Tagging: Identifying the grammatical category (noun, verb, adjective) of each word.
- Entity Detection: Identifying proper nouns such as names, dates, and locations.
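As a rough illustration of the first three steps in the list, the sketch below uses only the Python standard library. The tiny `STOP_WORDS` set and the `basic_clean` helper are hypothetical names chosen for this example; they are not part of spaCy or scikit-learn.

```python
import string

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"the", "is", "at", "a", "and"}

def basic_clean(text):
    """Lowercase, strip punctuation characters, and drop stop words."""
    lowered = text.lower()  # case normalization
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

print(basic_clean("The food is GREAT, and the service at lunch was fast!"))
# ['food', 'great', 'service', 'lunch', 'was', 'fast']
```

Lemmatization, part-of-speech tagging, and entity detection need linguistic knowledge that a few string operations cannot supply, which is where a library such as spaCy comes in.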
Bag of Words and Word Embeddings
A Bag of Words (BoW) is a simple text representation that describes the occurrences of words within a document. It discards word order and grammatical structure, focusing solely on the frequency of vocabulary terms.
The example below shows a list of three simple documents that we want to represent numerically:
doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding"
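The table below illustrates how the Bag of Words model constructs a document-term matrix by counting the occurrence of each unique word across our documents:
| | I | am | high | Yes | kidding |
|---|---|---|---|---|---|
| doc1 | 1 | 1 | 1 | 0 | 0 |
| doc2 | 1 | 1 | 1 | 1 | 0 |
| doc3 | 1 | 1 | 0 | 0 | 1 |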
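A document-term matrix for these three documents can be built with a few lines of standard-library Python. This is a minimal sketch that splits on whitespace and keeps words in order of first appearance; a real vectorizer would typically also lowercase the text and apply its own tokenization rules.

```python
from collections import Counter

docs = ["I am high", "Yes I am high", "I am kidding"]

# Naive whitespace tokenization, one token list per document.
tokenized = [doc.split() for doc in docs]

# Vocabulary: unique words in order of first appearance across the documents.
vocab = list(dict.fromkeys(word for tokens in tokenized for word in tokens))

# One row per document, one count per vocabulary term.
matrix = [[Counter(tokens)[term] for term in vocab] for tokens in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, which is what lets a classifier treat documents of different lengths as fixed-size feature vectors.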

TF-IDF Vectorization
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that evaluates how relevant a word is to a document in a collection. It works by multiplying two metrics: how frequently a word appears in a document, and how rarely the word appears across the entire dataset. This downweights common words (like "the") and highlights distinctive terms.
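To make the two factors concrete, the sketch below computes a textbook TF-IDF score (relative term frequency multiplied by the natural log of N / df) for the three short documents from the Bag of Words example. The `tf_idf` helper is a hypothetical name for this illustration. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

docs = ["I am high", "Yes I am high", "I am kidding"]
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

for tokens in tokenized:
    print({term: round(tf_idf(term, tokens), 3) for term in tokens})
```

Words that occur in every document ("i" and "am") score exactly zero, while "kidding", which occurs in only one document, receives the highest weight.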
The chart below shows a visual representation of how TF-IDF filters out common stop words while emphasizing rare, high-value keywords:

spaCy Pipelines and Installation
Library Installation
First, make sure you have spaCy and the small English language model (en_core_web_sm) installed on your system. You can install these packages using the pip command:
pip install -U spacy
pip install -U spacy-lookups-data
python -m spacy download en_core_web_sm
Additionally, make sure you have scikit-learn installed to build the machine learning pipeline:
pip install scikit-learn
The spaCy Processing Pipeline
When you process a string using spaCy's nlp object, it tokenizes the text to produce a Doc object, then passes it through a series of pipeline components. By default, these components tag parts of speech, parse grammatical dependencies, and recognize named entities.
The diagram below outlines the sequential flow of a document through spaCy's built-in processing pipeline:

Basic Text Processing with spaCy
Let's begin by importing spaCy and its visualization module, displaCy:
import spacy
from spacy import displacy
Load the small English language model and parse a sample sentence to generate a parsed Doc object:
nlp = spacy.load('en_core_web_sm')
text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"
doc = nlp(text)
doc
The output displays the raw text content of the processed document:
Apple, This is first sentence. and Google this is another one. here 3rd one is
You can iterate through the document to inspect each individual token generated by spaCy:
for token in doc:
    print(token)
The parsed token output shows the words, punctuation marks, and numbers parsed sequentially:
Apple
,
This
is
first
sentence
.
and
Google
this
is
another
one
.
here
3rd
one
is
Sentence Segmentation
You can add further components to spaCy's pipeline. For instance, the sentencizer component provides rule-based sentence segmentation without needing a full dependency parse. Let's add the sentencizer to the pipeline before the parser component and print each sentence:
# spaCy v3 and later: built-in components are added by their string name.
# (spaCy v2 used nlp.create_pipe('sentencizer') and passed the object to add_pipe.)
nlp.add_pipe('sentencizer', before='parser')
doc = nlp(text)
for sent in doc.sents:
    print(sent)
The output shows three distinct sentences separated by spaCy's segmentation rules:
Apple, This is first sentence.
and Google this is another one.
here 3rd one is
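To give a feel for what rule-based segmentation means, the sketch below splits the same sample text with a single regular expression. It is a simplified stand-in written for this illustration, not spaCy's actual sentencizer logic.

```python
import re

text = "Apple, This is first sentence. and Google this is another one. here 3rd one is"

# Split wherever sentence-ending punctuation is followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in sentences:
    print(sentence)
```

A rule this simple would also split after abbreviations such as "Dr." or "e.g.", which is one reason to prefer a library component over a hand-written pattern.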
Stop Words Filtering
Stop words are frequently occurring words that do not add significant meaning to a sentence. We can import spaCy's built-in list of English stop words to inspect them:
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = list(STOP_WORDS)
print(stopwords)
The printed list contains all the default stop words:
['move', 'again', 'during', 'herself', 'him', 'hereby', 'third', 'once', 'call', 'both', '‘ll', 'doing', 'something', 'when', '’ll', 'unless', 'thereafter', 'before', 'so', 'is', 'will', 'toward', 'has', 'whom', 'it', 'who', 'what', 'his', '‘s', 'towards', 'quite', 'below', 'alone', 'yourselves', 'which', 'does', 'ca', 'moreover', 'seems', 'or', 'first', 'here', 'various', 'n‘t', 'very', 'why', 'beyond', 'mine', 'themselves', 'twenty', 'really', 'almost', 'indeed', 'amongst', 'until', 'empty', 'everyone', 'should', 'bottom', 'five', 'among', 'also', 'over', "'s", 'via', 'against', 'just', 'above', 'twelve', 'although', 'could', 'hence', 'are', 'than', 'being', 'however', 'front', 'eight', 'already', 'three', 'two', 'have', 'while', 'beforehand', 'myself', '’ve', 'much', 'rather', 'seemed', 'back', '‘ve', 'from', 'every', 'other', 'between', 'of', 'serious', 'since', 'as', 'but', 'i', 'she', 'whoever', 'used', 'those', 'whatever', 'beside', 'someone', '’m', 'to', 'within', 'forty', 'sometimes', 'upon', 'by', 'into', 'regarding', 'hereupon', 'together', 'wherever', 'made', 'that', 'own', 'must', 'namely', 'had', 'hers', 'hereafter', 'perhaps', 'afterwards', 'part', 'another', 'next', 'across', 'nor', 'latter', 'get', 'this', 'our', 'whose', 'off', 'see', 'a', 'anyhow', 'former', "'ll", 'amount', 'becomes', 'same', 'full', 'himself', 'after', 'itself', 'they', 'how', 'using', "'re", 'somewhere', 'thus', 'somehow', 'too', 'because', 'still', 'us', 'ever', 'n’t', 'give', 'and', 'if', 'we', 'most', 'no', 'ours', 'became', 'for', 'may', 'fifty', 'everywhere', 'whenever', 'be', 'everything', 'an', 'whole', 'last', 'whether', '’s', '‘d', 'besides', 'along', 'all', 'say', 'might', 'seeming', 'on', 'neither', 'these', 'anywhere', '‘m', 'more', 'per', 'ourselves', 'otherwise', 'mostly', 'make', 'due', '’re', 'becoming', 'yours', 'each', 'thereby', 'any', 'onto', 'not', 'others', 'fifteen', 'were', 'many', 'would', 'though', 'either', 'keep', 'take', 'nevertheless', "'ve", 
'about', 'you', 'therefore', 'thru', 'around', 'behind', 'else', 'he', 'its', 'throughout', 'four', 'further', 'herein', '’d', 're', 'am', 'where', 'do', 'well', "n't", 'side', 'whereupon', 'none', 'latterly', "'m", "'d", 'noone', 'at', 'whereas', 'even', 'anyone', 'nine', 'nowhere', 'down', 'did', 'them', 'name', 'thereupon', 'cannot', 'me', 'least', 'anyway', 'nothing', 'top', 'few', 'therein', 'yet', 'less', 'show', 'one', 'been', 'done', 'some', 'thence', 'her', 'up', 'can', 'put', 'whereafter', 'become', 'seem', 'nobody', 'only', 'enough', 'often', 'sometime', 'out', 'now', 'your', 'their', 'always', 'ten', 'under', 'please', 'six', 'yourself', 'then', 'wherein', 'except', 'eleven', 'meanwhile', 'whither', 'whereby', 'in', 'with', 'go', 'there', 'my', 'such', '‘re', 'anything', 'hundred', 'the', 'whence', 'was', 'never', 'sixty', 'formerly', 'several', 'without', 'through', 'elsewhere']
Check the size of the stop words list:
len(stopwords)
The output shows that the list contains 326 default English stop words:
326
We can filter out all stop words by checking the is_stop attribute of each token:
for token in doc:
    if not token.is_stop:
        print(token)
The filtered output only displays tokens that are not classified as stop words:
Apple
,
sentence
.
Google
.
3rd
Lemmatization
Lemmatization replaces words with their base dictionary form (or lemma). For example, the lemma for "runs", "running", and "ran" is "run". Let's run a lemmatizer example on a small document:
doc = nlp('run runs running runner')
for lem in doc:
    print(lem.text, lem.lemma_)
The output shows that the inflectional forms "runs" and "running" are successfully mapped back to "run":
run run
runs run
running run
runner runner
Part-of-Speech (POS) Tagging
Part-of-Speech (PoS) tagging is the process of labeling each word in a text with its grammatical category, such as verb, noun, or adjective. Let's print the part-of-speech tags for a sample sentence:
doc = nlp('All is well at your end!')
for token in doc:
    print(token.text, token.pos_)
The output shows the text token next to its predicted part-of-speech tag:
All DET
is AUX
well ADJ
at ADP
your DET
end NOUN
! PUNCT
You can render and visualize syntactic dependency parses directly inside your notebooks using displaCy:
displacy.render(doc, style = 'dep')
The visualization draws each token with its part-of-speech tag and connects the tokens with labeled dependency arcs (here nsubj, advmod, prep, poss, and pobj).
Named Entity Recognition (NER)
Named Entity Recognition (NER) identifies proper nouns in a text and classifies them into categories like names, dates, locations, or monetary amounts. Let's analyze a sample paragraph:
doc = nlp("New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.")
doc
The output displays the parsed paragraph text:
New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday. The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.
Let's render the named entities discovered in the paragraph using displaCy:
displacy.render(doc, style = 'ent')
The output highlights entities with their respective category tags (such as GPE, DATE, CARDINAL, PERSON, NORP, and MONEY):
New York City GPE on Tuesday DATE declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases. At least 285 CARDINAL people have contracted measles in the city since September DATE , mostly in Brooklyn GPE ’s Williamsburg GPE neighborhood. The order covers four CARDINAL Zip codes there, Mayor Bill de Blasio PERSON (D) said Tuesday DATE . The mandate orders all unvaccinated people in the area, including a concentration of Orthodox NORP Jews NORP , to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000 MONEY .
Building the Sentiment Classifier
Now that we have covered basic spaCy preprocessing, we will build a sentiment classifier. We will load datasets of reviews from Yelp, Amazon, and IMDb, merge them, and train a pipeline containing a custom spaCy tokenizer, a TF-IDF vectorizer, and a Linear Support Vector Classifier (LinearSVC).
Importing Machine Learning Libraries
Let's import the necessary modules from pandas and scikit-learn:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
Loading and Merging the Datasets
Load the Yelp review dataset. Each line of the file holds a review and a sentiment label separated by a tab, with 0 for negative and 1 for positive. The file has no header row, so pandas numbers the columns 0 and 1:
data_yelp = pd.read_csv('datasets/yelp_labelled.txt', sep='\t', header = None)
data_yelp.head()
The table below displays the first five reviews from Yelp:
| | 0 | 1 |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
Assign descriptive column headers to the Yelp dataset:
columns_name = ['Review', 'Sentiment']
data_yelp.columns = columns_name
data_yelp.head()
The structured DataFrame displays Yelp reviews with named columns:
| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
Check the shape of the Yelp dataset:
data_yelp.shape
The output shows that Yelp contains 1,000 rows and 2 columns:
(1000, 2)
Next, load and format the Amazon reviews dataset:
data_amazon = pd.read_csv('datasets/amazon_cells_labelled.txt', sep = '\t', header = None)
data_amazon.columns = columns_name
data_amazon.head()
The table shows the top rows of the Amazon dataset:
| | Review | Sentiment |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |
Verify the shape of the Amazon dataset:
data_amazon.shape
Like Yelp, Amazon reviews consist of 1,000 samples:
(1000, 2)
Now, load and format the IMDb reviews dataset:
data_imdb = pd.read_csv('datasets/imdb_labelled.txt', sep = '\t', header = None)
data_imdb.columns = columns_name
data_imdb.head()
The table displays the top reviews from IMDb:
| | Review | Sentiment |
|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie ... | 0 |
| 1 | Not sure who was more lost - the flat characte... | 0 |
| 2 | Attempting artiness with black & white and cle... | 0 |
| 3 | Very little music or anything to speak of. | 0 |
| 4 | The best scene in the movie was when Gerardo i... | 1 |
Inspect the shape of the IMDb dataset:
data_imdb.shape
The IMDb dataset contains 748 reviews:
(748, 2)
Concatenate the three review DataFrames to create a single master dataset. DataFrame.append was removed in pandas 2.0, so use pd.concat:
data = pd.concat([data_yelp, data_amazon, data_imdb], ignore_index=True)
data.shape
The combined dataset has a total of 2,748 rows:
(2748, 2)
Let's check the first few rows of our combined dataset:
data.head()
The table shows the top merged reviews:
| | Review | Sentiment |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |
Check the class distribution of sentiment labels in the combined dataset:
data['Sentiment'].value_counts()
The output shows that the dataset is nearly balanced, containing 1,386 positive reviews and 1,362 negative reviews:
1 1386
0 1362
Name: Sentiment, dtype: int64
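To put a number on the class balance, the short sketch below converts the two label counts into shares. The larger share is also the accuracy a trivial classifier would reach by always predicting the majority class, which gives a baseline for judging a trained model.

```python
# Label counts copied from the value_counts() output above.
counts = {1: 1386, 0: 1362}
total = sum(counts.values())

shares = {label: n / total for label, n in counts.items()}
majority_baseline = max(shares.values())

print(total)                        # 2748
print(round(shares[1], 3))          # 0.504
print(round(majority_baseline, 3))  # 0.504
```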
Verify if there are any missing values in the DataFrame:
data.isnull().sum()
The result confirms there are zero null values in either column:
Review 0
Sentiment 0
dtype: int64
Text Preprocessing Function
To clean our reviews, we need to strip punctuation. Import the standard Python string module and inspect the list of punctuation marks:
import string
punct = string.punctuation
punct
The output lists all punctuation symbols to be removed:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
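Define a cleaning function that parses the text with spaCy, replaces each token with its lowercased lemma, and filters out stop words and punctuation. For pronouns, which spaCy v2 lemmatizes to the placeholder "-PRON-", the function falls back to the lowercased token text; on spaCy v3 that placeholder no longer occurs, so the first branch always runs:
def text_data_cleaning(sentence):
    doc = nlp(sentence)

    # Replace each token with its lowercased lemma.
    tokens = []
    for token in doc:
        if token.lemma_ != "-PRON-":
            temp = token.lemma_.lower().strip()
        else:
            # spaCy v2 pronoun placeholder: keep the lowercased surface form.
            temp = token.lower_
        tokens.append(temp)

    # Drop stop words and punctuation.
    cleaned_tokens = []
    for token in tokens:
        if token not in stopwords and token not in punct:
            cleaned_tokens.append(token)
    return cleaned_tokens

text_data_cleaning(" Hello how are you. Like this video")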
The test output shows the cleaned tokens returned by our function:
['hello', 'like', 'video']
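One detail of the check `token not in punct` is worth a closer look: `punct` is a single string, so `in` performs a substring test rather than a membership test. The sketch below, using only the standard library, shows the consequence. The `is_all_punct` helper is a hypothetical alternative, not something the tutorial's function uses.

```python
import string

punct = string.punctuation            # one string of 32 characters
punct_set = set(string.punctuation)   # the same characters as a set

# Substring semantics: only text that appears verbatim inside the string matches.
print("!" in punct)     # True
print("..." in punct)   # False: "..." is not a contiguous run in the string
print("" in punct)      # True: the empty string is a substring of any string

def is_all_punct(token):
    """Hypothetical helper: True when every character is punctuation."""
    return token != "" and all(ch in punct_set for ch in token)

print(is_all_punct("..."))   # True
print(is_all_punct("wow!"))  # False
```

In practice this means a multi-character token such as an ellipsis may slip through the cleaning function, while empty strings are removed as a side effect.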
Pipeline and Model Training
Initialize the TfidfVectorizer using our custom text_data_cleaning function as the tokenizer, and define a LinearSVC classifier:
tfidf = TfidfVectorizer(tokenizer = text_data_cleaning)
classifier = LinearSVC()
Split the data into reviews (X) and sentiment labels (y):
X = data['Review']
y = data['Sentiment']
Perform an 80/20 train-test split using train_test_split:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
X_train.shape, X_test.shape
The output lists the number of samples in the training and testing splits:
((2198,), (550,))
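The split sizes can be checked by hand. The sketch below assumes that a fractional test size is rounded up and the remainder goes to training, which matches the sizes shown in the output above. The seeded shuffle illustrates the idea behind `random_state`, but it will not reproduce scikit-learn's exact row assignment.

```python
import math
import random

n_samples = 2748
test_size = 0.2

# Assumption: the test fraction is rounded up; the remainder is the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 2198 550

# A seeded shuffle makes the split reproducible from run to run.
indices = list(range(n_samples))
random.Random(42).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]
```

Fixing the seed matters when comparing models, because otherwise each run would be scored on a different set of reviews.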
Assemble the pipeline steps and fit the model on the training data:
clf = Pipeline([('tfidf', tfidf), ('clf', classifier)])
clf.fit(X_train, y_train)
The model pipeline details show the configuration of our vectorizer and SVM classifier:
Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function text_data_cleaning at 0x...>,
                                 use_idf=True, vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='squared_hinge', max_iter=1000,
                           multi_class='ovr', penalty='l2', random_state=None,
                           tol=0.0001, verbose=0))],
         verbose=False)
Model Evaluation and Predictions
Evaluating Model Performance
Generate predictions on the test set:
y_pred = clf.predict(X_test)
Print the classification report to inspect metrics like precision, recall, and F1-score:
print(classification_report(y_test, y_pred))
The classification report indicates a model accuracy of 78%:
              precision    recall  f1-score   support
           0       0.77      0.81      0.79       285
           1       0.78      0.74      0.76       265
    accuracy                           0.78       550
   macro avg       0.78      0.78      0.78       550
weighted avg       0.78      0.78      0.78       550
Print the confusion matrix:
confusion_matrix(y_test, y_pred)
The confusion matrix lists, row by row, the counts of true negatives (230), false positives (55), false negatives (68), and true positives (197):
array([[230, 55],
[ 68, 197]], dtype=int64)
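The figures in the classification report follow directly from these four counts. As a quick check, the sketch below recomputes precision, recall, F1, and accuracy from the confusion matrix using plain Python; the `prf` helper is a hypothetical name for this illustration.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions.
cm = [[230, 55],
      [68, 197]]
tn, fp = cm[0]
fn, tp = cm[1]

def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (tn + tp) / (tn + fp + fn + tp)
positive = prf(tp, fp, fn)  # class 1
negative = prf(tn, fn, fp)  # class 0: the roles of the cells are mirrored

print([round(v, 2) for v in negative])  # [0.77, 0.81, 0.79]
print([round(v, 2) for v in positive])  # [0.78, 0.74, 0.76]
print(round(accuracy, 2))               # 0.78
```

The model misses more positive reviews (68 false negatives) than it mislabels negative ones (55 false positives), which is why recall is lower for class 1 than for class 0.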
Testing Custom Reviews
Let's test the classifier on a set of custom reviews. First, predict the sentiment of a positive review:
clf.predict(['Wow, this is amazing lesson'])
The output array value of 1 indicates positive sentiment:
array([1], dtype=int64)
Predict the sentiment of a negative review:
clf.predict(['Wow, this sucks'])
The output array value of 0 indicates negative sentiment:
array([0], dtype=int64)
Predict the sentiment of another positive recommendation:
clf.predict(['Worth of watching it. Please like it'])
The output correctly identifies positive sentiment:
array([1], dtype=int64)
Finally, predict the sentiment of a short enthusiastic statement:
clf.predict(['Loved it. Amazing'])
The model classifies this review as positive:
array([1], dtype=int64)
Conclusion
In this tutorial, you built an end-to-end sentiment classification model using reviews from Yelp, Amazon, and IMDb. You loaded the data into Pandas, explored spaCy's linguistic features (including tokenization, lemmatization, and Named Entity Recognition), and created a custom text cleaning function. Finally, you integrated this clean tokenizer into a scikit-learn Pipeline with a LinearSVC model, achieving a sentiment classification accuracy of 78%.
Key takeaways:
- spaCy's lemmatizer reduces inflected words to their dictionary root, ensuring that terms like "loved", "loves", and "loving" are counted as the same word.
- Custom tokenizers can be easily integrated into scikit-learn's TfidfVectorizer via the tokenizer parameter.
- Combining preprocessing, vectorization, and modeling steps inside a scikit-learn Pipeline keeps the training and deployment workflows clean and prevents data leakage.
Next steps:
- Build a text classification system for spam text messages in Spam Text Message Classification using NLP.
- Learn how to define custom entity extraction rules in Custom Rules using spaCy.
- Explore general text processing and vectorization techniques in NLP: End to End Text Processing for Beginners.
