#chitresh vardhan verma#gensim#jupyter#kgp talkie#Machine Learning#Natural Language Processing#python#Text classification#tfidf#word2vec

Spam Text Message Classification with NLP

Classify SMS messages as spam or ham using TF-IDF and Word2Vec. Covers text preprocessing, feature extraction, and a comparison of SVM, LightGBM, and Random Forest models.

May 18, 2026 at 9:45 AM · 14 min read

Topics You Will Master

SMS spam dataset loading and class distribution exploration
Text preprocessing: tokenization, stop-word removal, and stemming
TF-IDF vectorization with sklearn and Word2Vec with gensim
Training and comparing SVM, LightGBM, and Random Forest classifiers
Precision, recall, and F1-score evaluation for spam detection

Best For

NLP beginners building and comparing their first text classification pipelines.

Expected Outcome

A trained spam detector with TF-IDF and Word2Vec feature extraction comparison.

Spam Ham text classification


Objective

  • The objective of this code is to classify text messages into two classes: spam and ham.

What is Natural Language Processing

  • Natural Language Processing (NLP) is a field of Artificial Intelligence in which we analyse text using machine learning models.

Application of NLP

  • Text Classification

  • Spam Filters

  • Voice text messaging

  • Sentiment analysis

  • Spell or grammar check

  • Chat bot

  • Search Suggestion

  • Search Autocorrect

  • Automatic Review Analysis system

  • Machine translation

  • And so much more

  • Natural Language Understanding (Text classification)

  • Natural Language Generation (Text Generation)

The Process of Natural Language Understanding (Text Classification)

Natural Language Understanding (NLU) process flow diagram showing text input through tokenization, POS tagging, and entity detection

Sentence Breakdown

Sentence breakdown diagram showing grammatical analysis of "The boy kicked the ball" with subject, verb, and object labels

Natural Language Generation

Natural Language Generation (NLG) diagram showing how structured data is converted to natural language output

How to get started with NLP

Following are the libraries which are generally used in Natural Language Processing.

  • Sklearn
  • Spacy
  • NLTK
  • Gensim
  • Tensorflow and Keras
BASH
pip install scikit-learn
pip install -U spacy
python -m spacy download en_core_web_sm
pip install nltk
pip install gensim
pip install lightgbm

NLP library ecosystem diagram listing sklearn, spaCy, NLTK, gensim, and TensorFlow with their capabilities

Application of these libraries

  • Tokenization
  • Parts of Speech Tagging
  • Entity Detection
  • Dependency Parsing
  • Noun Phrases
  • Words-to-Vectors Integration
  • Context Derivation
  • and so much more

Data Cleaning Options

  • Case Normalization
  • Removing Stop Words
  • Removing Punctuations or Special Symbols
  • Lemmatization and Stemming (word normalization)
  • Parts of Speech Tagging
  • Entity Detection
  • Bag of Words
  • Word-to-Vec

Tokenization

Tokenization breaks raw text into small chunks, such as words or sentences, called tokens. These tokens are the units that later steps and models operate on, and analyzing their sequence helps in interpreting the meaning of the text.
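As a minimal sketch of word-level tokenization, the snippet below splits a message with a regular expression. The function name `simple_tokenize` is hypothetical, and this is only a rough stand-in for a library tokenizer such as NLTK's `word_tokenize`, which handles many more cases.

```python
import re

def simple_tokenize(text):
    # \w+ matches runs of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Free entry in 2 a wkly comp!")
print(tokens)
# ['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', '!']
```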

Parts of Speech Tagging

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb, or adjective, to each word (and other token). Computational applications generally use more fine-grained POS tags, such as 'noun-plural'.

Entity Detection

Named entity recognition (NER), also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities and classifies them under various predefined classes.

Dependency Parsing

Syntactic parsing, or dependency parsing, is the task of analyzing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using parsing algorithms. Parse trees are useful in applications such as grammar checking and, more importantly, play a critical role in the semantic analysis stage.

Noun Phrases

Noun phrases are part of speech patterns that include a noun. They can also include whatever other parts of speech make grammatical sense, and can include multiple nouns.

Some common noun phrase patterns are:

  • Noun
  • Noun-Noun-...-Noun
  • Adjective(s)-Noun
  • Verb-(Adjectives-)Noun

Words-to-Vectors Integration

  • Computers interact with humans in programming languages, which are unambiguous, precise, and often structured. Natural (human) language, however, has a lot of ambiguity. There are multiple words with the same meaning (synonyms), words with multiple meanings (polysemy), some of which are entirely opposite in nature (auto-antonyms), and words that behave differently when used as a noun and as a verb. Humans can easily understand and distinguish these words from context, but machines can’t. That’s what makes NLP one of the most difficult and interesting tasks in AI.
  • Word2Vec is a group of models which helps derive relations between a word and its contextual words.

Case Normalization

Normalization is a process that converts a list of words to a more uniform sequence. For example, converting all words to lowercase will simplify the searching process.

Removing Stop Words

  • A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
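  • To check the list of stop words, you can run the following commands in the Python shell (the stop-word corpus has to be downloaded once).
PYTHON
import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpus
from nltk.corpus import stopwords

print(stopwords.words('english'))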
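To illustrate case normalization and stop-word removal together, here is a small sketch. The `STOP_WORDS` set is a tiny hand-picked assumption for the example, not NLTK's full English list, and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny, hand-picked stop-word set (assumption for illustration only).
STOP_WORDS = {"the", "a", "an", "in", "is", "to"}

def remove_stop_words(text):
    # Lowercase first so that "The" and "the" are treated alike.
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words("The prize is in a box")
print(filtered)
# ['prize', 'box']
```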

Stemming

  • Stemming is the process of reducing inflected words to their root form, that is, mapping a group of related words to the same stem, even if the stem itself is not a valid word in the language.

Playing -------> Play

Plays ---------> Play

Played --------> Play
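The mapping above can be reproduced with a deliberately crude suffix-stripping sketch. The function `toy_stem` and its three-suffix rule list are assumptions made for this example; real stemmers such as the Porter stemmer apply many more rules.

```python
def toy_stem(word):
    # Strip one common suffix, keeping at least three characters of stem.
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["Playing", "Plays", "Played"]:
    print(w, "->", toy_stem(w))
# Playing -> play
# Plays -> play
# Played -> play
```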

Lemmatisation

  • Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

is, am, are ------> be

Example

the boy's cars are different colors ------> the boy car be different color

What will be covered in this blog

  • Introduction of NLP and Spam detection using sklearn
  • Reading Text and PDF files in Python
  • Tokenization
  • Parts of Speech Tagging
  • Word-to-Vectors
  • Then real-world practical examples

Bag of Words - The Simplest Word Embedding Technique

PYTHON
doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding"

Bag of Words matrix example showing document-term frequency counts for three sample sentences

By comparing the vectors, we can see which words the documents have in common.
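A minimal sketch of how such a document-term count matrix can be built for the three sample sentences, using only the standard library. Lowercasing and whitespace splitting are simplifying assumptions for this example.

```python
from collections import Counter

docs = ["I am high", "Yes I am high", "I am kidding"]

# Lowercase and split on whitespace (a simplification for this example).
tokenized = [d.lower().split() for d in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocab = sorted({w for doc in tokenized for w in doc})

# One row per document, one column per vocabulary word.
matrix = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
# ['am', 'high', 'i', 'kidding', 'yes']
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 1]
# [1, 0, 1, 1, 0]
```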

Bag of Words and Tf-idf

TfidfTransformer Documentation (sklearn)

tf–idf stands for “Term Frequency times Inverse Document Frequency”.

TF-IDF formula diagram showing term frequency times inverse document frequency calculation with example
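The sketch below computes TF-IDF weights by hand for the three sample sentences from the Bag of Words section. It assumes the smoothed IDF variant that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; other TF-IDF variants give different numbers.

```python
import math
from collections import Counter

docs = ["I am high", "Yes I am high", "I am kidding"]
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    # Term frequency times IDF, then scale the row to unit length.
    tf = Counter(doc)
    raw = [tf[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for doc in tokenized:
    print([round(v, 3) for v in tfidf_row(doc)])
```

Words that occur in every document ("i", "am") get the minimum IDF of 1.0, while a word that occurs in only one document ("kidding") gets the largest weight in its row.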

Let's now start the coding part

PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
PYTHON
df = pd.read_csv('spam.tsv', sep='\t')
df.head()
OUTPUT
  label  message                                             length  punct
0   ham  Go until jurong point, crazy.. Available only ...      111      9
1   ham  Ok lar... Joking wif u oni...                           29      6
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...      155      6
3   ham  U dun say so early hor... U c already then say...       49      6
4   ham  Nah I don't think he goes to usf, he lives aro...       61      2
  • To check whether the dataset contains any NULL values, we run the following code.
PYTHON
df.isnull().sum()
OUTPUT
label      0
message    0
length     0
punct      0
dtype: int64
PYTHON
len(df)
OUTPUT
5572
PYTHON
df['label'].value_counts()
OUTPUT
ham     4825
spam     747
Name: label, dtype: int64
  • We balance the data so that it contains an equal number of spam and ham messages. This keeps the majority class from dominating training, so the model learns both classes well.
PYTHON
ham = df[df['label']=='ham']
ham.head()
OUTPUT
  label  message                                             length  punct
0   ham  Go until jurong point, crazy.. Available only ...      111      9
1   ham  Ok lar... Joking wif u oni...                           29      6
3   ham  U dun say so early hor... U c already then say...       49      6
4   ham  Nah I don't think he goes to usf, he lives aro...       61      2
6   ham  Even my brother is not like to speak with me. ...       77      2
PYTHON
spam = df[df['label']=='spam']
spam.head()
OUTPUT
   label  message                                             length  punct
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...      155      6
5   spam  FreeMsg Hey there darling it's been 3 week's n...      147      8
8   spam  WINNER!! As a valued network customer you have...      157      6
9   spam  Had your mobile 11 months or more? U R entitle...      154      2
11  spam  SIX chances to win CASH! From 100 to 20,000 po...      136      8
PYTHON
ham.shape, spam.shape
OUTPUT
((4825, 4), (747, 4))
PYTHON
ham = ham.sample(spam.shape[0])
ham.shape, spam.shape
OUTPUT
((747, 4), (747, 4))
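Because `sample` draws rows at random, each run selects a different set of ham messages. A small sketch of reproducible downsampling follows; the `toy` DataFrame is hypothetical stand-in data, and `random_state=0` is an arbitrary seed chosen for the example.

```python
import pandas as pd

# Hypothetical toy frame standing in for the SMS data (6 ham, 2 spam).
toy = pd.DataFrame({
    "label": ["ham"] * 6 + ["spam"] * 2,
    "message": [f"msg {i}" for i in range(8)],
})

toy_ham = toy[toy["label"] == "ham"]
toy_spam = toy[toy["label"] == "spam"]

# Downsample the majority class; random_state makes the draw repeatable.
toy_ham_small = toy_ham.sample(len(toy_spam), random_state=0)

balanced = pd.concat([toy_ham_small, toy_spam], ignore_index=True)
print(balanced["label"].value_counts())
```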
PYTHON
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
data = pd.concat([ham, spam], ignore_index=True)
data.tail()
OUTPUT
     label  message                                             length  punct
1489  spam  Want explicit SEX in 30 secs? Ring 02073162414...       90      3
1490  spam  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...      158      5
1491  spam  Had your contract mobile 11 Mnths? Latest Moto...      160      8
1492  spam  REMINDER FROM O2: To get 2.50 pounds free call...      147      3
1493  spam  This is the 2nd time we have tried 2 contact u...      160      8

Exploratory Data Analysis

PYTHON
plt.hist(data[data['label']=='ham']['length'], bins = 100, alpha = 0.7,label='Ham')
plt.hist(data[data['label']=='spam']['length'], bins = 100, alpha = 0.7,label='Spam')
plt.xlabel('length of messages')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0,300)
plt.show()

Histogram comparing message length distribution for ham (blue) vs spam (orange) SMS messages

PYTHON
plt.hist(data[data['label']=='ham']['punct'], bins = 100, alpha = 0.7,label='Ham')
plt.hist(data[data['label']=='spam']['punct'], bins = 100, alpha = 0.7,label='Spam')
plt.xlabel('punctuations')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0,30)
plt.show()

Histogram comparing punctuation count distribution for ham vs spam SMS messages

Data Preparation

PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
PYTHON
data.head()
OUTPUT
  label  message                                             length  punct
0   ham  No problem with the renewal. I.ll do it right ...       79      3
1   ham  Good afternoon, my love! How goes that day ? I...      160      5
2   ham  Can come my room but cannot come my house cos ...       77      6
3   ham  I can send you a pic if you like :)                     35      2
4   ham  I am on the way to ur home                              26      0
  • Now we convert the text data to word embeddings using Word2Vec, for which we import the gensim library.
  • To convert each message into vector form, we first tokenize it.
  • We then convert each token (word) into a word embedding using a Word2Vec model trained with gensim.
  • Finally, we average all word vectors in a message to obtain a single vector, which is the final feature vector for that message.
PYTHON
import gensim
from nltk.tokenize import word_tokenize  # needs the NLTK tokenizer data, e.g. nltk.download('punkt')
import numpy as np

embedding_dim = 100

# Word2Vec expects an iterable of token lists, one list per message
tokenized = [word_tokenize(message) for message in data['message']]

# gensim >= 4.0 names this parameter vector_size (older releases used size)
model = gensim.models.Word2Vec(sentences=tokenized, vector_size=embedding_dim, workers=4, min_count=1)
words = list(model.wv.key_to_index)  # vocabulary learned by the model

def word_2_vec(x):
    """Return the mean word vector of message x, shaped (1, embedding_dim)."""
    tokens = [t for t in word_tokenize(x) if t in model.wv.key_to_index]
    if not tokens:
        # No known tokens: fall back to a zero vector
        return np.zeros((1, embedding_dim))
    return np.mean(model.wv[tokens], axis=0).reshape(1, -1)
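To make the averaging step concrete, here is a sketch with made-up three-dimensional embeddings. The `toy_vectors` dictionary and the `message_vector` helper are hypothetical; a trained Word2Vec model would supply the real vectors.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (a trained model would supply real ones).
toy_vectors = {
    "free":  np.array([1.0, 0.0, 2.0]),
    "entry": np.array([3.0, 2.0, 0.0]),
    "now":   np.array([2.0, 4.0, 1.0]),
}

def message_vector(tokens, vectors, dim=3):
    # Average the vectors of known tokens; unknown tokens are skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

vec = message_vector(["free", "entry", "now", "unknown"], toy_vectors)
print(vec)
# [2. 2. 1.]
```

Note that the average is the same for any ordering of the tokens, which is why word order is lost in this representation.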
  • Applying word2vec to each text message
PYTHON
data['vec']=data['message'].apply(lambda x: word_2_vec(x))
data.head()
OUTPUT
  label  message                                             length  punct  vec
0   ham  No problem with the renewal. I.ll do it right ...       79      3  [[-0.028193846775037754, -0.000255275213728762...
1   ham  Good afternoon, my love! How goes that day ? I...      160      5  [[-0.03519534362069527, -0.0009830875404938859...
2   ham  Can come my room but cannot come my house cos ...       77      6  [[-0.004332678098257424, -0.000847288884769012...
3   ham  I can send you a pic if you like :)                     35      2  [[-0.04251667247577147, -0.002708293984390118,...
4   ham  I am on the way to ur home                              26      0  [[-0.040913782135248766, 0.0017535838996991515...
  • Here we stack the feature vectors of all messages into one array and convert it to a DataFrame with one column per embedding dimension.
PYTHON
w_vec=np.concatenate(data['vec'].to_numpy(), axis=0)
w_vec.shape
OUTPUT
(1494, 100)
PYTHON
word_vec=pd.DataFrame(w_vec)
word_vec.head()
OUTPUT
          0         1         2         3         4         5         6         7         8         9  ...        90        91        92        93        94        95        96        97        98        99
0 -0.028194 -0.000255 -0.005906  0.001543  0.028490  0.008546  0.023954  0.010250  0.043273  0.017577  ...  0.007982  0.010225  0.031404  0.004743 -0.006825  0.008575 -0.012575  0.006858 -0.012980 -0.005203
1 -0.035195 -0.000983 -0.006663  0.002052  0.033874  0.010026  0.028198  0.011013  0.052498  0.021042  ...  0.006637  0.012224  0.036387  0.006481 -0.008304  0.012149 -0.015957  0.009656 -0.015783 -0.005901
2 -0.004333 -0.000847 -0.001710  0.000431  0.003813  0.000388  0.003363  0.000789  0.006293  0.003545  ...  0.000006  0.001190  0.003674  0.001200 -0.000205  0.002153 -0.000952  0.001631 -0.003232  0.001002
3 -0.042517 -0.002708 -0.007234  0.001933  0.039980  0.010590  0.032837  0.011987  0.062139  0.024177  ...  0.008201  0.014543  0.043288  0.008478 -0.009570  0.012796 -0.018638  0.011211 -0.018574 -0.007211
4 -0.040914  0.001754 -0.009190  0.003661  0.040343  0.011911  0.036438  0.013463  0.061441  0.025578  ...  0.008912  0.014350  0.042732  0.009705 -0.010381  0.012885 -0.020183  0.011372 -0.018638 -0.005238

5 rows × 100 columns

PYTHON
X_train, X_test, y_train, y_test = train_test_split(word_vec, data['label'], test_size = 0.3, random_state=0, shuffle = True, stratify=data['label'])

Now we use different machine learning models to classify text messages into the spam and ham classes.

  • For hyperparameter tuning of each model we import GridSearchCV, which tunes a model by choosing the best-scoring parameter combination from a grid.
  • Here we use 5-fold cross-validation in the grid search.
  • Cross-validation scores each combination on held-out folds, which gives a more reliable estimate of how well the model will perform on unseen test data.
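As a rough illustration of what k-fold splitting does, the sketch below partitions sample indices into consecutive validation folds. The helper `kfold_indices` is hypothetical and implements plain, unshuffled k-fold; for classifiers, scikit-learn's grid search uses a stratified variant that also preserves class proportions in each fold.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold splitting."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(kfold_indices(10, 5))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: validate on {val}, train on {len(train)} samples")
```

Every sample is used for validation exactly once, and each model is trained on the remaining k - 1 folds.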

Support Vector Machine

  • In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
  • In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
PYTHON
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Grid of SVC settings to try: 3 * 3 * 2 * 3 = 54 candidates
Parameter_svc=[{'cache_size': [300,400,200],'tol': [0.0011,0.002,0.003],
                'kernel': ['rbf','poly'],
               'degree': [3,4,5]}]

clf_svc = GridSearchCV(SVC(), Parameter_svc, scoring='accuracy', verbose=2, cv=5)
clf_svc.fit(X_train,y_train)

print(clf_svc.best_params_)

# Evaluate the best estimator on the held-out test set
y_pred1 = clf_svc.predict(X_test)

accuracy_score(y_test, y_pred1)
OUTPUT
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.3s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.1s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s

0.6993318485523385
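The "54 candidates, totalling 270 fits" in the log above follows directly from the grid sizes. The sketch below enumerates the same grid with `itertools.product`. It also counts how many candidates can actually produce different models, based on two documented properties of scikit-learn's `SVC`: `degree` is only used by the `poly` kernel, and `cache_size` only affects memory use and speed.

```python
from itertools import product

param_grid = {
    "cache_size": [300, 400, 200],
    "tol": [0.0011, 0.002, 0.003],
    "kernel": ["rbf", "poly"],
    "degree": [3, 4, 5],
}
cv_folds = 5

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(candidates), "candidates,", len(candidates) * cv_folds, "fits")
# 54 candidates, 270 fits

# Collapse settings that cannot change the fitted model.
distinct = {
    (c["tol"], c["kernel"], c["degree"] if c["kernel"] == "poly" else None)
    for c in candidates
}
print(len(distinct), "distinct models")
# 12 distinct models
```

Under these assumptions, most of the search time is spent refitting equivalent models, so searching over parameters such as `C` and `gamma` instead may be a better use of the same budget.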

LGBMClassifier

  • LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
  • LightGBM grows trees leaf-wise (vertically), while many other tree-based algorithms grow trees level-wise (horizontally). It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm tends to reduce more loss than a level-wise one.
PYTHON
from lightgbm import LGBMClassifier
Parameter_lgbm=[{'num_leaves':[31,40,50],
              'max_depth':[3,4,5,6],
              'learning_rate':[0.1,0.05,0.2,0.15],
             'n_estimators':[700]}]
clf_lgbm = GridSearchCV(LGBMClassifier(), Parameter_lgbm,
                    scoring='accuracy', verbose=2, cv=5)
clf_lgbm.fit(X_train,y_train)
print(clf_lgbm.best_params_)
# Evaluate the best estimator on the held-out test set
y_pred2 = clf_lgbm.predict(X_test)
accuracy_score(y_test, y_pred2)
OUTPUT
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[CV]  learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31, total=   2.7s
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.6s remaining:    0.0s

[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  8.9min finished

{'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 700, 'num_leaves': 31}

0.821826280623608
PYTHON
print(classification_report(y_test, y_pred2))
OUTPUT
              precision    recall  f1-score   support

         ham       0.82      0.82      0.82       225
        spam       0.82      0.82      0.82       224

    accuracy                           0.82       449
   macro avg       0.82      0.82      0.82       449
weighted avg       0.82      0.82      0.82       449

Random Forest classifier

  • Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
PYTHON
from sklearn.ensemble import RandomForestClassifier
RF_Cl=RandomForestClassifier(n_estimators=900)
RF_Cl.fit(X_train,y_train)
y_pred=RF_Cl.predict(X_test)
accuracy_score(y_test, y_pred)
OUTPUT
0.8329621380846325
PYTHON
print(classification_report(y_test, y_pred))
OUTPUT
              precision    recall  f1-score   support

         ham       0.83      0.84      0.83       225
        spam       0.83      0.83      0.83       224

    accuracy                           0.83       449
   macro avg       0.83      0.83      0.83       449
weighted avg       0.83      0.83      0.83       449

Classification of texts using TF-IDF

PYTHON
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size = 0.3, random_state=0, shuffle = True, stratify=data['label'])
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_train_vect.shape
OUTPUT
(1045, 3596)

Pipeline and Random Forest classifier

PYTHON
clf_rf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
confusion_matrix(y_test, y_pred)
OUTPUT
array([[224,   1],
       [ 24, 200]], dtype=int64)
PYTHON
print(classification_report(y_test, y_pred))
OUTPUT
              precision    recall  f1-score   support

         ham       0.91      1.00      0.95       225
        spam       1.00      0.91      0.95       224

    accuracy                           0.95       449
   macro avg       0.95      0.95      0.95       449
weighted avg       0.95      0.95      0.95       449
PYTHON
accuracy_score(y_test, y_pred)
OUTPUT
0.9510022271714922
PYTHON
clf_rf.predict(["Hey, whassup?"])
OUTPUT
array(['ham'], dtype=object)
PYTHON
clf_rf.predict(["you have won tickets to the USA this summer."])
OUTPUT
array(['ham'], dtype=object)

Support Vector Machine

PYTHON
clf_svc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SVC(C = 1000, gamma = 'auto'))])
clf_svc.fit(X_train, y_train)
y_pred = clf_svc.predict(X_test)
confusion_matrix(y_test, y_pred)
OUTPUT
array([[221,   4],
       [ 16, 208]], dtype=int64)
PYTHON
print(classification_report(y_test, y_pred))
OUTPUT
              precision    recall  f1-score   support

         ham       0.93      0.98      0.96       225
        spam       0.98      0.93      0.95       224

    accuracy                           0.96       449
   macro avg       0.96      0.96      0.96       449
weighted avg       0.96      0.96      0.96       449
PYTHON
accuracy_score(y_test, y_pred)
OUTPUT
0.955456570155902
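The precision, recall, F1, and accuracy figures reported for the SVM can be recomputed by hand from its confusion matrix. The sketch below treats spam as the positive class and reads the matrix in scikit-learn's layout, with rows as true labels and columns as predicted labels.

```python
# SVM confusion matrix from above: rows = true (ham, spam), columns = predicted (ham, spam).
cm = [[221, 4],
      [16, 208]]

tn, fp = cm[0]  # true ham:  predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)          # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)             # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
# 0.98 0.93 0.95 0.9555
```

For a spam filter, the 4 false positives (ham flagged as spam) are often the more costly error, since they hide legitimate messages, while the 16 false negatives let spam through.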
PYTHON
clf_svc.predict(["Hey, whassup?"])
OUTPUT
array(['ham'], dtype=object)
PYTHON
clf_svc.predict(["you have got free tickets to the USA this summer."])
OUTPUT
array(['spam'], dtype=object)

Conclusion

In this tutorial you built and compared spam classification pipelines using Word2Vec embeddings and TF-IDF vectorization on a balanced 1,494-message SMS dataset. TF-IDF outperformed Word2Vec for both classifiers that were tried with each feature set: Random Forest reached about 95% accuracy with TF-IDF versus 83% with averaged Word2Vec vectors, and SVM reached 95.5% versus 70%. LightGBM, trained only on Word2Vec features, reached 82%. This suggests that TF-IDF has an advantage for short, keyword-driven texts like SMS messages, at least when the Word2Vec model is trained on a corpus this small.

Key takeaways:

  • Word2Vec averages word vectors per message, losing positional and keyword-frequency information that matters for spam detection. This likely contributes to its roughly 12-point accuracy gap versus TF-IDF with Random Forest (83% versus 95%).
  • Pipeline([('tfidf', TfidfVectorizer()), ('clf', classifier)]) chains vectorization and classification in a single estimator, making prediction on new text a one-call operation: clf.predict(["new text"]).
  • Balancing classes (ham.sample(spam.shape[0])) keeps models from simply predicting the majority class, which matters here since ham outnumbers spam ~6:1 in the raw data (4,825 versus 747).
  • GridSearchCV with 5-fold cross-validation (cv=5) provides reliable hyperparameter estimates but multiplies training time by 5×; for production, fix hyperparameters from a validation set first.
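As a compact illustration of the Pipeline takeaway, the sketch below fits a TF-IDF plus linear SVM pipeline on a handful of made-up messages. The toy texts and labels are hypothetical and far too few for a real model; the point is only that the vectorizer is fitted on the training text inside the pipeline and reused automatically at prediction time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy messages; the tutorial's real data comes from spam.tsv.
train_texts = [
    "are we still meeting for lunch today",
    "call me when i get home",
    "see mum at the office tomorrow",
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent claim free cash now",
]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),      # learns the vocabulary from train_texts only
    ("clf", SVC(kernel="linear")),     # linear kernel chosen for this tiny example
])
clf.fit(train_texts, train_labels)

# Raw strings go in; vectorization happens inside the pipeline.
pred = clf.predict(["free cash prize"])
print(pred)
```

Keeping the vectorizer inside the pipeline also prevents test text from leaking into the vocabulary and IDF statistics during cross-validation.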
