#sentiment analysis#scikit-learn#nlp#tfidf#svm#python#text classification

Sentiment Analysis with Scikit-learn

Build a binary sentiment classifier for IMDB movie reviews using TF-IDF text vectorization and a Linear Support Vector Machine in Python with scikit-learn.

May 19, 2026 at 6:00 PM8 min readFollowFollow (Hindi)

Topics You Will Master

How TF-IDF turns raw review text into numerical features a model can learn from
How a Linear Support Vector Machine draws a decision boundary between positive and negative sentiment
How to preprocess messy real-world text — removing HTML, URLs, and repeated characters
How to evaluate a binary classifier using precision, recall, and F1-score
How to save a trained model and vectorizer to disk for later reuse
Best For

Python developers who know basic supervised learning and want to build their first end-to-end NLP classification pipeline.

Expected Outcome

A trained LinearSVC classifier that predicts positive or negative sentiment on IMDB movie reviews, achieving 87 % accuracy, saved to disk with pickle for deployment.

Knowing what people think about a film — or any product — at scale is one of the most common real-world machine learning tasks. Sentiment analysis is the process of automatically detecting whether a piece of text expresses a positive or negative opinion. Rather than reading thousands of reviews by hand, you train a model to do it for you.

In this tutorial you will build a complete sentiment classifier from scratch. You will load 50,000 IMDB movie reviews, clean the text, convert it into numerical vectors using TF-IDF (Term Frequency–Inverse Document Frequency), and train a Linear SVM (Support Vector Machine) to label each review as pos or neg. Along the way you will learn how to evaluate the classifier and save it for production use.

Prerequisites: Python 3.x, Pandas, NumPy, scikit-learn, spaCy, BeautifulSoup4, TextBlob.

Setting Up the Environment

Install the core libraries before running any code:

PLAINTEXT
!pip install pandas
!pip install numpy
!pip install scikit-learn

Import Pandas (the table-based data analysis library) and NumPy (the numerical array library):

PYTHON
import pandas as pd
import numpy as np

Loading the Dataset

The IMDB dataset contains 50,000 reviews split into a Reviews column (raw text) and a Sentiment column (pos or neg). Clone it from GitHub using git clone:

PLAINTEXT
!git clone https://github.com/laxmimerit/IMDB-Movie-Reviews-Large-Dataset-50k.git
PLAINTEXT
Cloning into 'IMDB-Movie-Reviews-Large-Dataset-50k'...

Read the training split into a Pandas DataFrame:

PYTHON
df = pd.read_excel('IMDB-Movie-Reviews-Large-Dataset-50k/train.xlsx')

TF-IDF Vectorization

Machine learning models cannot read text directly — every word must become a number. TF-IDF is the standard technique for this: it scores each word by how often it appears in a document (term frequency) divided by how common it is across all documents (inverse document frequency). Rare but meaningful words like "masterpiece" score higher than common filler words like "the".

The diagram below illustrates the TF-IDF formula with its component parts labeled:

TF-IDF formula diagram showing TF(t,d) multiplied by IDF(t), with document frequency and number of documents labeled

The formula is:

Where:

  • — the term (word) being scored
  • — the document (review) containing the term
  • — the number of times term appears in document
  • — the total number of documents in the corpus
  • — the number of documents that contain term

In practice, a word like "beautiful" scores much higher than "is" because "is" appears in almost every document (making its IDF close to zero), while "beautiful" is selective.

Import the vectorizer, model, and evaluation utilities you will need:

PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

Inspect the first five rows to confirm the column layout:

PYTHON
# displaying top 5 rows of our dataset
df.head()
OUTPUT
ReviewsSentiment
0When I first tuned in on this morning news, I ...neg
1Mere thoughts of "Going Overboard" (aka "Babes...neg
2Why does this movie fall WELL below standards?...neg
3Wow and I thought that any Steven Segal movie ...neg
4The story is seen before, but that does'n matt...neg

Text Preprocessing

Raw user reviews contain noise — HTML tags, email addresses, URLs, and exaggerated repeated characters (e.g., "looooove") — that add no signal to the model. You need to strip all of that before vectorizing.

Install the preprocess_kgptalkie package, which bundles all the cleaning helpers used here:

PLAINTEXT
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

The package depends on spaCy, BeautifulSoup4, and TextBlob. Install them first:

PLAINTEXT
!pip install spacy==2.2.3
!python -m spacy download en_core_web_sm
!pip install beautifulsoup4==4.9.1
!pip install textblob==0.15.3

Import the preprocessing module and Python's built-in regular expression library:

PYTHON
import preprocess_kgptalkie as ps
import re

The docstring below describes each cleaning step that get_clean applies:

PYTHON
"""
Step 1: Lowering the letter then after replacing backward slash from nothing and underscore from space.
Step 2: Remove emails from the Reviews column.
Step 3: Removing html tags from the Reviews column.
Step 4: Removing special character.
Step 5: If you have multiple repeated character then it converted into single character and make meaningful.
E.g. x = 'lllooooovvveeee youuuu'
x = re.sub("(.)\1{2,}", "\1", x)
print(x)
-------
love you
"""

Define the get_clean function and apply it to every row in the Reviews column:

PYTHON
def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x

df['Reviews'] = df['Reviews'].apply(lambda x: get_clean(x))
df.head()
OUTPUT
ReviewsSentiment
0when i first tuned in on this morning news i t...neg
1mere thoughts of going overboard aka babes aho...neg
2why does this movie fall well below standards ...neg
3wow and i thought that any steven segal movie ...neg
4the story is seen before but that doesn matter...neg

Notice that upper-case letters, HTML artefacts, and punctuation are gone — the text is now uniform and ready to vectorize.

Fit the TfidfVectorizer on the cleaned reviews, keeping only the 5,000 most informative words (max_features=5000):

PYTHON
tfidf = TfidfVectorizer(max_features=5000)
X = df['Reviews']
y = df['Sentiment']

X = tfidf.fit_transform(X)
X
PYTHON
'
	with 2843804 stored elements in Compressed Sparse Row format>

The result is a sparse matrix — most of its 2.8 million stored values are zero, because each review uses only a small fraction of the 5,000-word vocabulary. Sparse storage keeps memory usage manageable.

Split the dataset into 80 % training and 20 % test sets:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Support Vector Machine Classifier

A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane — a decision boundary in high-dimensional space — that separates two classes with the widest possible margin. The data points closest to the boundary on each side are called support vectors; maximising the gap between them is what makes SVMs robust against noise.

The diagram below shows how the SVM chooses between multiple valid boundaries by picking the one with the highest margin:

Handwritten SVM diagram showing margin, support vectors, and multiple candidate decision boundaries on a scatter plot

When data is not linearly separable in two dimensions, the kernel trick maps it into a higher-dimensional space where a flat hyperplane can separate the classes. The diagram below illustrates this: data that cannot be split by a line in 2D becomes separable once a third axis is added:

Handwritten SVM diagram showing the kernel trick: non-separable 2D data becomes separable in 3D by adding a z = x² + y² axis

LinearSVC is a fast linear-kernel SVM optimised for large text datasets. It finds the hyperplane that best separates positive from negative reviews in the 5,000-dimensional TF-IDF space. Fit it on the training set and generate predictions:

PYTHON
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Evaluating the Classifier

A classification report breaks down performance by class. Precision measures how many of the predicted positives were actually positive; recall measures how many actual positives were correctly identified; F1-score is the harmonic mean of the two.

PYTHON
print(classification_report(y_test, y_pred))
OUTPUT
              precision    recall  f1-score   support

         neg       0.87      0.87      0.87      2480
         pos       0.87      0.88      0.88      2520

    accuracy                           0.87      5000
   macro avg       0.87      0.87      0.87      5000
weighted avg       0.87      0.87      0.87      5000

The classifier reaches 87 % accuracy, with balanced precision and recall on both classes — meaning it is equally reliable at detecting negative and positive reviews. Neither class is systematically ignored.

Predicting on New Text

To verify the model on a custom sentence, pass it through the same cleaning pipeline and then transform it with the fitted vectorizer before calling predict:

PYTHON
x = 'this movie is really good. thanks a lot for making it'

x = get_clean(x)
vec = tfidf.transform([x])

Check the shape of the resulting feature vector — it should have exactly 5,000 columns, one per vocabulary word:

PYTHON
vec.shape
OUTPUT
(1, 5000)

Predict the sentiment label:

PYTHON
clf.predict(vec)
OUTPUT
array(['pos'], dtype=object)

The model correctly labels the enthusiastic review as positive.

PYTHON
clf.predict(vec)
OUTPUT
array(['pos'], dtype=object)

Saving the Model with Pickle

Pickle serializes Python objects — converting them to a byte stream — so you can write them to disk and reload them later without retraining. Save both the classifier and the vectorizer (you need both at inference time):

PYTHON
import pickle
pickle.dump(clf, open('model', 'wb'))
pickle.dump(tfidf, open('tfidf', 'wb'))

Both files are now on disk. To use them in a deployment script, call pickle.load() on each file, then call tfidf.transform() and clf.predict() exactly as you did above.

Conclusion

In this tutorial you built a complete end-to-end sentiment classifier for IMDB movie reviews. You cleaned raw text with preprocess_kgptalkie, converted it to a 5,000-feature TF-IDF matrix, and trained a LinearSVC that achieved 87 % accuracy — with balanced precision and recall across both classes. Finally, you saved the trained model and vectorizer to disk using pickle.

Key takeaways:

  • TF-IDF down-weights common words and up-weights rare but meaningful words, giving the SVM better signal to learn from.
  • LinearSVC is a fast, high-quality baseline for text classification — it handles the high-dimensional sparse output of TfidfVectorizer efficiently.
  • The preprocessing step matters as much as the model: removing HTML tags, URLs, and repeated characters directly improves the quality of the learned vocabulary.
  • You must save both the vectorizer and the classifier — new text must pass through the same tfidf.transform() call before prediction.
  • Balanced precision and recall (both ~0.87) tells you the model is not biased toward one class.

Next steps:

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments