
Star Rating Prediction with SVM and TF-IDF

Predict Amazon product star ratings from review text using TF-IDF vectorization and a Support Vector Machine classifier in Python with scikit-learn.

May 18, 2026 at 8:15 AM · 10 min read

Topics You Will Master

How TF-IDF converts raw review text into numerical features a model can learn from
How a Linear Support Vector Machine classifier handles multi-class star-rating prediction
How to balance a heavily skewed rating dataset by under-sampling majority classes
How to read a classification report and interpret per-class precision, recall, and F1-score

Best For

Python developers and data scientists who know basic supervised learning and want to build a real text classification pipeline from raw reviews to predicted star ratings.

Expected Outcome

A trained LinearSVC model that predicts 1–5 star Amazon ratings from review text, evaluated with a full classification report, with two live prediction examples you can test immediately.

Amazon collects millions of product reviews every day, and each review carries a star rating from 1 to 5. Automatically predicting that rating from the text alone is a classic text classification problem — you turn words into numbers, train a classifier, and let the model decide which star bucket a new review falls into.

In this tutorial you will work with the Amazon Musical Instruments reviews dataset. You will clean the raw text, convert it into TF-IDF features, train a LinearSVC classifier, and evaluate the results. Along the way you will also handle class imbalance — the dataset has many more 5-star reviews than 1-star reviews, which can skew the model unless you correct for it.

Prerequisites: Python 3.x, Pandas, NumPy, scikit-learn, spaCy, BeautifulSoup4, TextBlob, and the preprocess_kgptalkie helper package.


The banner below shows Amazon's star-rating system — the same 1–5 scale your model will learn to predict:

Amazon product star rating banner showing various products with 1 to 5 star ratings

Setting Up the Environment

Install the required libraries before importing anything. Run these commands once in your terminal or notebook:

PLAINTEXT
!pip install pandas
!pip install numpy
!pip install scikit-learn

Once installed, import Pandas — a column-oriented data analysis library — and NumPy, which adds support for large multi-dimensional arrays and fast mathematical operations:

PYTHON
import pandas as pd
import numpy as np

Loading the Dataset

The Amazon Musical Instruments reviews dataset is hosted on GitHub. Load it directly with pd.read_csv(), keeping only the two columns you need — reviewText (the raw text) and overall (the star rating):

PYTHON
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/Amazon-Musical-Reviews-Rating-Dataset/master/Musical_instruments_reviews.csv', usecols = ['reviewText', 'overall'])

df.sample(5) draws five random rows so you can quickly inspect what the data looks like:

PYTHON
df.sample(5)
OUTPUT
      reviewText                                         overall
7959  Cheap and good. Just what I needed. No issue...    4.0
6048  It sounds like it's Behringer. Very fake, che...   3.0
1596  I already had the nickel finished version (whi...  5.0
8796  Well... it's not really too expensive, but it ...  1.0
948   The mic stand pick holder is a great way to ke...  5.0

Handling Class Imbalance

Check how many reviews exist per star rating:

PYTHON
df['overall'].value_counts()
OUTPUT
5.0    6938
4.0    2084
3.0     772
2.0     250
1.0     217
Name: overall, dtype: int64

The dataset is heavily skewed — there are 6,938 five-star reviews but only 217 one-star reviews. If you train on this raw distribution the model will simply learn to predict "5 stars" most of the time and still appear accurate. To fix this, under-sample each class down to 217 rows (the size of the smallest class):

PYTHON
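# Under-sample every rating class down to the size of the smallest class (217 rows)
samples = []
for val in df['overall'].unique():
    samples.append(df[df['overall'] == val].sample(217))
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
df1 = pd.concat(samples, ignore_index=True)
df1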
OUTPUT
      reviewText                                         overall
0     First off; let me start by saying I bought thi...  5.0
1     I purchased these cables for my Behringer 802 ...  5.0
2     It looks fine but as other reviewers have poin...  5.0
3     I bought this for my 11 year old daughter to h...  5.0
4     I use this with my mobile DJ equipment and it ...  5.0
...   ...                                                ...
1080  Bought this a while back just got around to in...  1.0
1081  DOA...no good, out of the box,plug it, nothing...  1.0
1082  I have had 2 of these tuners (you'd think I'd ...  1.0
1083  These speakers worked great for 14 months. Las...  1.0
1084  This is a cheap stand and I was not surprised ...  1.0

1085 rows × 2 columns

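After under-sampling, df1 holds 1,085 rows with exactly 217 reviews per star class, a perfectly balanced subset. Note that the remaining steps in this tutorial keep working with the full df (all 10,261 reviews) and rely on class_weight='balanced' to offset the skew; to train on the balanced subset instead, substitute df1 for df in the steps that follow.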
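The under-sampling step can also be sketched without an explicit loop by using groupby(...).sample(...), which is available in pandas 1.1 and later. The toy DataFrame and the names toy, n_min, and balanced are illustrative assumptions rather than part of the tutorial's code; fixing random_state makes the draw reproducible.

```python
import pandas as pd

# Hypothetical toy data standing in for the reviews DataFrame
toy = pd.DataFrame({
    'reviewText': [f'review {i}' for i in range(10)],
    'overall':    [5.0] * 6 + [4.0] * 2 + [1.0] * 2,
})

# Size of the smallest class (217 in the real dataset)
n_min = toy['overall'].value_counts().min()

# Draw n_min rows from every rating class, then reset the index
balanced = (
    toy.groupby('overall')
       .sample(n=n_min, random_state=0)
       .reset_index(drop=True)
)

print(balanced['overall'].value_counts().sort_index())
```

Deriving n_min from value_counts() instead of hard-coding 217 keeps the snippet valid if the dataset changes.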

Text Preprocessing

Raw review text is messy: it contains HTML tags, URLs, email addresses, special characters, and repeated letters like "looooove". Text preprocessing is the step that cleans all of this away so the model only sees meaningful words.

Install the preprocess_kgptalkie package along with its dependencies:

PLAINTEXT
!pip install spacy==2.2.3
!python -m spacy download en_core_web_sm
!pip install beautifulsoup4==4.9.1
!pip install textblob==0.15.3
PLAINTEXT
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git

The installation output confirms the package version:

PLAINTEXT
Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to c:\users\mdezaj~1\appdata\local\temp\pip-req-build-5g7bbg9w
Requirement already satisfied (use --upgrade to upgrade): preprocess-kgptalkie==0.0.5 from git+https://github.com/laxmimerit/preprocess_kgptalkie.git in c:\users\md  ezajul hassan\appdata\local\programs\python\python37\lib\site-packages
PLAINTEXT
You are using pip version 19.0.3, however version 20.2.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

Now import the package and the standard re module for regular expressions:

PYTHON
import preprocess_kgptalkie as ps
import re

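The get_clean function runs every review through a series of cleaning steps. The docstring below documents five of them; the function itself also expands contractions (ps.cont_exp), removes URLs, and strips accented characters: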

PYTHON
"""
Step 1: Lowercase the text, remove backslashes, and replace underscores with spaces.
Step 2: Remove email addresses from the review text.
Step 3: Remove HTML tags from the review text.
Step 4: Remove special characters.
Step 5: Collapse any character repeated three or more times into a single character.

          E.g. x = 'lllooooovvveeee youuuu'
               x = re.sub("(.)\\1{2,}", "\\1", x)
               print(x)
               -------
               love you
"""

Define the function and apply it to every row in the reviewText column:

PYTHON
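def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)                 # expand contractions
    x = ps.remove_emails(x)
    x = ps.remove_urls(x)
    x = ps.remove_html_tags(x)
    x = ps.remove_accented_chars(x)
    x = ps.remove_special_chars(x)
    # Collapse any character repeated 3+ times into a single character
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

# Clean every review in place; passing the function directly avoids a redundant lambda
df['reviewText'] = df['reviewText'].apply(get_clean)
df.head()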
OUTPUT
   reviewText                                         overall
0  not much to write about here but it does exact...  5.0
1  the product does exactly as it should and is q...  5.0
2  the primary job of this device is to block the...  5.0
3  nice windscreen protects my mxl mic and preven...  5.0
4  this pop filter is great it looks and performs...  5.0

The text is now lowercase, stripped of noise, and ready for vectorization.
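The repeated-character rule is worth trying in isolation, because it has side effects beyond fixing "looooove". The sketch below applies the same regular expression through a hypothetical helper named collapse_repeats (not part of preprocess_kgptalkie) to a few hand-picked strings.

```python
import re

def collapse_repeats(text):
    # Replace any character repeated 3+ times in a row with a single copy
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(collapse_repeats("lllooooovvveeee youuuu"))  # love you
print(collapse_repeats("good"))                    # good (only two o's, unchanged)
print(collapse_repeats("cooool"))                  # col (not "cool")
print(collapse_repeats("costs 1000 dollars"))      # costs 10 dollars
```

Because the pattern matches any character, runs of digits are collapsed too, so numbers such as 1000 change value. If that matters for your data, restricting the pattern to letters is one option.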

TF-IDF Vectorization

TF-IDF (Term Frequency–Inverse Document Frequency) converts text into numbers by measuring how important each word — or character sequence — is within a document relative to the whole collection. Common words like "the" get a low score; rare but meaningful words like "beautiful" get a high score.

The diagram below illustrates the TF-IDF formula and its two components:

Hand-drawn TF-IDF formula diagram showing TF(t,d) multiplied by IDF(t), with the IDF expansion log((1 + n) / (1 + df(d,t))) + 1

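The full formula is:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1$$

Where:

  • $t$: the term (word or character n-gram) being scored
  • $d$: the document (review) being scored
  • $\text{tf}(t, d)$: term frequency, the number of times term $t$ appears in document $d$
  • $n$: total number of documents in the corpus
  • $\text{df}(t)$: document frequency, the number of documents that contain term $t$
  • The $1 +$ terms in the numerator and denominator are smoothing; they prevent division by zero for terms not seen during fitting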
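To make the formula concrete, here is a minimal worked sketch in plain Python. The three-review corpus and the helper names tf, doc_freq, idf, and tfidf_score are assumptions made for illustration; word tokens are used instead of character n-grams to keep the numbers easy to follow.

```python
import math

# Hypothetical three-review corpus, already tokenised into words
docs = [
    ["great", "guitar", "great", "sound"],
    ["great", "cable"],
    ["bad", "cable"],
]
n = len(docs)

def tf(term, doc):
    # Raw count of the term in one document
    return doc.count(term)

def doc_freq(term):
    # Number of documents containing the term at least once
    return sum(term in doc for doc in docs)

def idf(term):
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1
    return math.log((1 + n) / (1 + doc_freq(term))) + 1

def tfidf_score(term, doc):
    return tf(term, doc) * idf(term)

print(round(idf("great"), 4))                   # 1.2877 (appears in 2 of 3 docs)
print(round(idf("guitar"), 4))                  # 1.6931 (appears in 1 of 3 docs)
print(round(tfidf_score("great", docs[0]), 4))  # 2.5754 (tf = 2)
```

The rarer term gets the larger IDF. TfidfVectorizer additionally L2-normalises each document vector by default (norm='l2'), so the values it produces will not match these raw products exactly.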

Import the vectorizer and all other modeling utilities in one block:

PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

Configure TfidfVectorizer with character-level n-grams up to length 5 and a vocabulary cap of 20,000 features. Character n-grams capture sub-word patterns (prefixes, suffixes, typos) that word-level tokens miss:

PYTHON
tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1,5), analyzer='char')
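To see what character n-grams look like, the sketch below enumerates them by hand. The function char_ngrams is a hypothetical, simplified stand-in for what analyzer='char' produces; the real vectorizer also normalises whitespace and, with max_features set, keeps only the most frequent n-grams across the corpus.

```python
def char_ngrams(text, min_n=1, max_n=5):
    # Every contiguous character sequence whose length is between min_n and max_n
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams

print(char_ngrams("good", 1, 3))
# ['g', 'o', 'o', 'd', 'go', 'oo', 'od', 'goo', 'ood']

# A 10-character string yields 10 + 9 + 8 + 7 + 6 = 40 n-grams for lengths 1 to 5
print(len(char_ngrams("really bad")))
```

A misspelling such as "goood" still shares the n-grams "go", "oo", and "od" with "good", which is why character features can tolerate typos.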

Fit the vectorizer on the full review column and create the feature matrix X and label vector y:

PYTHON
X = tfidf.fit_transform(df['reviewText'])
y = df['overall']

Check the shape of the resulting sparse matrix:

PYTHON
X.shape, y.shape
OUTPUT
((10261, 20000), (10261,))

Each of the 10,261 reviews is now represented as a vector of 20,000 TF-IDF scores. Split the data into 80 % training and 20 % test sets:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Confirm the training set size:

PYTHON
X_train.shape
OUTPUT
(8208, 20000)
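The steps above fit the vectorizer on every review before splitting, so the vocabulary and IDF weights have already seen the test reviews. One common alternative, sketched here on a hypothetical eight-review dataset and not the method this tutorial uses, is to split the raw text first and wrap both steps in a scikit-learn Pipeline. Passing stratify keeps the class proportions the same in both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical miniature dataset: 1.0 = negative, 5.0 = positive
texts = ["really bad product", "terrible sound", "broke after a week", "awful cable",
         "really good product", "great sound", "works perfectly", "excellent cable"]
labels = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]

# Split the raw text first; stratify keeps the class ratio equal in both parts
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier
model = make_pipeline(
    TfidfVectorizer(analyzer='char', ngram_range=(1, 5)),
    LinearSVC(class_weight='balanced'),
)
model.fit(train_texts, y_train)

# Raw strings go straight in; the pipeline applies the fitted vectorizer itself
predictions = model.predict(test_texts)
print(len(predictions))
```

A pipeline also removes the need to call tfidf.transform by hand at prediction time.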

Support Vector Machine Classifier

A Support Vector Machine (SVM) is a supervised learning algorithm that finds the hyperplane — a flat boundary in high-dimensional space — that best separates the classes while maximising the margin between them. The margin is the gap between the hyperplane and the nearest data points from each class; a wider margin means better generalisation to new data.

The diagram below shows how SVM chooses between multiple valid boundaries by picking the one with the largest margin:

Hand-drawn SVM diagram showing multiple possible boundaries between two classes, with margin, support vectors, and the concept of choosing the maximum-margin boundary labelled

When data is not linearly separable in its original feature space, SVM applies the kernel trick — a mathematical transformation that projects the data into a higher-dimensional space where a linear boundary can be drawn. The diagram below illustrates this 2D-to-3D transformation:

Hand-drawn SVM kernel trick diagram showing non-linearly separable data in 2D transformed to 3D space using Z = x² + y², where a linear hyperplane can separate the classes
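The lifting idea in the figure can be checked with a few lines of plain Python. This is a sketch under assumed data: two hypothetical classes placed on concentric circles, which no straight line in 2D can separate, are mapped to 3D with z = x² + y². The helper names ring and lift are made up for this example, and a real kernel SVM achieves the same effect implicitly without ever computing z.

```python
import math

def ring(radius, k=8):
    # k points evenly spaced on a circle of the given radius around the origin
    return [(radius * math.cos(2 * math.pi * i / k),
             radius * math.sin(2 * math.pi * i / k)) for i in range(k)]

inner = ring(1.0)   # class A: small circle
outer = ring(3.0)   # class B: larger circle surrounding class A

def lift(point):
    # Add a third coordinate z = x^2 + y^2 (squared distance from the origin)
    x, y = point
    return (x, y, x * x + y * y)

z_inner = [lift(p)[2] for p in inner]   # all close to 1
z_outer = [lift(p)[2] for p in outer]   # all close to 9

# In 3D the flat plane z = 5 separates the two classes
print(max(z_inner) < 5 < min(z_outer))  # True
```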

LinearSVC, the linear variant of SVM, skips the kernel computation and fits a linear boundary directly in the original feature space, which makes it fast enough for high-dimensional TF-IDF vectors. For the five rating classes it trains one-vs-rest classifiers by default. The C parameter controls regularisation strength (higher C = less regularisation, tighter fit), and class_weight='balanced' weights each class inversely to its frequency in the training data, which is what compensates for the skewed rating distribution here:

PYTHON
clf = LinearSVC(C = 20, class_weight='balanced')
clf.fit(X_train, y_train)

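The ConvergenceWarning below means the Liblinear solver reached its iteration limit (max_iter, 1000 by default) before meeting its stopping tolerance. The fitted model is still usable, but its weights may not be optimal; raising max_iter or lowering C usually resolves the warning: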

PLAINTEXT
c:\users\md  ezajul hassan\appdata\local\programs\python\python37\lib\site-packages\sklearn\svm\_base.py:977: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
OUTPUT
LinearSVC(C=20, class_weight='balanced')
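scikit-learn documents the 'balanced' weight of a class as n_samples / (n_classes * class_count). The sketch below applies that formula to the full-dataset counts printed earlier; the weights the classifier actually uses are computed from the training split, so they will differ slightly.

```python
# Rating counts from df['overall'].value_counts() above
counts = {5.0: 6938, 4.0: 2084, 3.0: 772, 2.0: 250, 1.0: 217}

n_samples = sum(counts.values())   # 10261
n_classes = len(counts)            # 5

# Balanced weight: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label in sorted(weights):
    print(label, round(weights[label], 2))
# 1.0 9.46
# 2.0 8.21
# 3.0 2.66
# 4.0 0.98
# 5.0 0.3
```

With these weights, misclassifying a 1-star review costs roughly 32 times as much as misclassifying a 5-star review.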

Generate predictions on the held-out test set:

PYTHON
y_pred = clf.predict(X_test)

Evaluating the Model

A classification report shows precision, recall, and F1-score for every class individually. This matters here because the test set keeps the original skewed distribution (1,374 five-star reviews versus 39 one-star reviews), so a single accuracy number would hide how poorly the model performs on the rarer classes.

PYTHON
print(classification_report(y_test, y_pred))
OUTPUT
              precision    recall  f1-score   support

         1.0       0.31      0.21      0.25        39
         2.0       0.18      0.11      0.13        55
         3.0       0.23      0.27      0.25       134
         4.0       0.34      0.33      0.34       451
         5.0       0.77      0.78      0.78      1374

    accuracy                           0.62      2053
   macro avg       0.37      0.34      0.35      2053
weighted avg       0.61      0.62      0.62      2053

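Overall accuracy is 62 %, which is below the 67 % you would get by always predicting 5 stars (1,374 of the 2,053 test reviews), so accuracy alone says little here. The model performs best on 5-star reviews (F1 = 0.78) because it was trained on the full, skewed dataset and the 5-star class dominates both the training and test sets. Every other class is weak: 4-star reaches F1 = 0.34, 1-star and 3-star reach 0.25, and 2-star only 0.13, reflecting both the scarcity of low-rating examples and genuine ambiguity in borderline reviews.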
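The summary rows of the report can be reproduced from the per-class rows, which helps clarify what "macro" and "weighted" mean. The sketch below uses only numbers copied from the report; the variable names are illustrative.

```python
# Per-class F1 and support copied from the classification report above
f1 =      {1.0: 0.25, 2.0: 0.13, 3.0: 0.25, 4.0: 0.34, 5.0: 0.78}
support = {1.0: 39,   2.0: 55,   3.0: 134,  4.0: 451,  5.0: 1374}

total = sum(support.values())  # 2053

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

# Accuracy of a trivial model that always predicts the most common class
baseline_acc = max(support.values()) / total

print(round(macro_f1, 2), round(weighted_f1, 2), round(baseline_acc, 2))
# 0.35 0.62 0.67
```

The gap between the macro F1 (0.35) and the weighted F1 (0.62) shows how much the 5-star class props up the headline numbers.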

Live Predictions

Test the pipeline end-to-end on a hand-written negative review. Clean the text first with get_clean, then transform it into a TF-IDF vector and call predict:

PYTHON
x = 'this product is really bad. i do not like it'
x = get_clean(x)
vec = tfidf.transform([x])
clf.predict(vec)
OUTPUT
array([1.])

The model correctly assigns 1 star to the negative review. Now test a positive one:

PYTHON
x = 'this product is really good. thanks a lot for speedy delivery'
x = get_clean(x)
vec = tfidf.transform([x])
clf.predict(vec)
OUTPUT
array([5.])

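The model assigns 5 stars to the positive review, so both hand-written examples land on the expected extreme. Two clear-cut sentences are only a sanity check, though; the classification report above is the better guide to how the model behaves on real reviews.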

Conclusion

In this tutorial you built a complete text classification pipeline that predicts Amazon star ratings from raw review text. You loaded a skewed dataset and saw how to under-sample it, cleaned the text with a multi-step preprocessing function, converted reviews into 20,000-dimensional TF-IDF character n-gram features, and trained a LinearSVC classifier with balanced class weights that reached 62 % overall accuracy. The model is reliable only for 5-star reviews (F1 = 0.78); the 1- to 4-star classes all score an F1 of 0.34 or lower, which reflects both how few low-rating reviews there are and the natural overlap in language between neighbouring ratings.

Key takeaways:

  • Class imbalance directly harms classifier fairness — always check value_counts() before training and correct with under-sampling or class_weight='balanced'.
  • Character-level n-grams (analyzer='char') capture sub-word patterns and tend to be more robust to typos than word-level features for informal review text.
  • LinearSVC scales efficiently to high-dimensional sparse TF-IDF matrices where kernel SVM would be too slow.
  • A classification report is essential for multi-class problems — overall accuracy alone masks per-class failures.
  • Middle star ratings (2–4) are inherently hard to distinguish; consider collapsing them into positive/negative/neutral buckets for a more reliable model.
