NLP Tutorial – Spam Text Message Classification using NLP
Spam Ham text classification
Objective
- The objective of this tutorial is to classify text messages into two classes: spam and ham.
What is Natural Language Processing?
- Natural Language Processing (NLP) is a field of Artificial Intelligence in which we analyse text using machine learning models.
Application of NLP
- Text Classification
- Spam Filters
- Voice text messaging
- Sentiment analysis
- Spell or grammar check
- Chat bot
- Search Suggestion
- Search Autocorrect
- Automatic Review Analysis system
- Machine translation
- And so much more
NLP is broadly split into two components:
- Natural Language Understanding (e.g. text classification)
- Natural Language Generation (e.g. text generation)
(The original figures here illustrate the process of Natural Language Understanding with a sentence breakdown, and Natural Language Generation.)
How to get started with NLP
The following libraries are generally used in Natural Language Processing:
- Sklearn
- Spacy
- NLTK
- Gensim
- Tensorflow and Keras
!pip install scikit-learn
!python -m spacy download en
!pip install -U spacy
!pip install gensim
!pip install lightgbm
Application of these libraries
- Tokenization
- Parts of Speech Tagging
- Entity Detection
- Dependency Parsing
- Noun Phrases
- Words-to-Vectors Integration
- Context Derivation
- and so much more
Data Cleaning Options
- Case Normalization
- Removing Stop Words
- Removing Punctuations or Special Symbols
- Lemmatization and Stemming (word normalization)
- Parts of Speech Tagging
- Entity Detection
- Bag of Words
- Word-to-Vec
Tokenization
Tokenization is the process of breaking raw text into small chunks. It splits the raw text into words or sentences, which are called tokens. These tokens help in understanding the context and in developing the model for NLP. Tokenization helps in interpreting the meaning of the text by analysing the sequence of words.
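As a quick illustration, here is a minimal sketch using NLTK's tokenizers (it assumes nltk and its 'punkt' tokenizer data are available; the sample message is just an example):

import nltk
nltk.download('punkt')  # tokenizer data (only needed once)
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Free entry in 2 a wkly comp to win FA Cup final tickets. Text WIN to enter."
print(sent_tokenize(text))  # sentence tokens
print(word_tokenize(text))  # word tokens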
Parts of Speech Tagging
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other tokens), although computational applications generally use more fine-grained POS tags like 'noun-plural'.
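As a small hedged sketch, NLTK's pos_tag can tag a tokenized sentence (this assumes the 'averaged_perceptron_tagger' data has been downloaded; the tags in the comment are indicative only):

import nltk
nltk.download('averaged_perceptron_tagger')  # tagger model (only needed once)
from nltk import pos_tag
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Had your mobile 11 months or more?")
print(pos_tag(tokens))  # e.g. [('Had', 'VBD'), ('your', 'PRP$'), ('mobile', 'NN'), ...]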
Entity Detection
Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify or categorize them under various predefined classes.
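A minimal sketch of entity detection with spaCy follows (it assumes the en_core_web_sm model is installed; the sentence and the entity labels in the comment are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, U.K. GPE, $1 billion MONEY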
Dependency Parsing
Syntactic Parsing or Dependency Parsing is the task of recognizing a sentence and assigning a syntactic structure to it.
The most widely used syntactic structure is the parse tree, which can be generated using parsing algorithms. These parse trees are useful in various applications such as grammar checking and, more importantly, they play a critical role in the semantic analysis stage.
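Here is a small sketch of dependency parsing with spaCy (again assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    # each token, its dependency label, and its syntactic head
    print(token.text, token.dep_, token.head.text)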
Noun Phrases
Noun phrases are part of speech patterns that include a noun.
They can also include whatever other parts of speech make grammatical sense, and can include multiple nouns. Some common noun phrase patterns are:
- Noun
- Noun-Noun…-Noun
- Adjective(s)-Noun
- Verb-(Adjectives-)Noun
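spaCy exposes such patterns through doc.noun_chunks; the following is a minimal sketch (assuming en_core_web_sm is installed, with an illustrative sentence):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The valued network customer has won a free FA Cup final ticket.")
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['The valued network customer', 'a free FA Cup final ticket']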
Words-to-Vectors Integration
- Computers interact with humans in programming languages which are unambiguous, precise and often structured. However, natural (human) language has a lot of ambiguity. There are multiple words with same meaning (synonyms), words with multiple meanings (polysemy) some of which are entirely opposite in nature (auto-antonyms), and words which behave differently when used as noun and verb. These words make sense contextually in natural language which humans can comprehend and distinguish easily, but machines can’t. That’s what makes NLP one of the most difficult and interesting tasks in AI.
- Word2Vec is a group of models which helps derive relations between a word and its contextual words.
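As a rough sketch of how gensim's Word2Vec learns such relations, the snippet below trains on a tiny illustrative corpus (real training on the spam data happens later in this tutorial; results on a toy corpus are not meaningful):

import gensim

sentences = [["free", "entry", "to", "win", "cup", "tickets"],
             ["call", "now", "to", "claim", "your", "free", "prize"],
             ["are", "you", "coming", "home", "for", "dinner"]]
model = gensim.models.Word2Vec(sentences=sentences, min_count=1, workers=2)
print(model.wv.most_similar("free", topn=3))  # words used in similar contexts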
Case Normalization
Normalization is a process that converts a list of words to a more uniform sequence. For example, converting all words to lowercase will simplify the searching process.
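In Python this is usually just a call to str.lower(), for example:

message = "WINNER!! As a valued network customer you have been selected"
print(message.lower())  # 'winner!! as a valued network customer you have been selected'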
Removing Stop Words
- A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
- To check the list of stop words, you can type the following commands in the Python shell.
import nltk
from nltk.corpus import stopwords
print(stopwords.words('english'))
Stemming
- Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.
Playing -------> Play
Plays ---------> Play
Played --------> Play
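A minimal sketch with NLTK's PorterStemmer reproduces the mappings above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "plays", "played"]:
    print(word, "->", stemmer.stem(word))  # all reduce to 'play'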
Lemmatisation
- Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
is, am, are ------> be
Example
the boy's cars are different colors ------> the boy car be differ color
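A minimal sketch with NLTK's WordNetLemmatizer (it assumes the 'wordnet' data, and on newer NLTK versions 'omw-1.4', has been downloaded):

import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))  # -> 'be'
print(lemmatizer.lemmatize("colors"))        # -> 'color'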
What will be covered in this blog
- Introduction to NLP and spam detection using sklearn
- Reading Text and PDF files in Python
- Tokenization
- Parts of Speech Tagging
- Word-to-Vectors
- Real-world practical examples
Bag of Words - The Simplest Word Embedding Technique
doc1 = "I am high" doc2 = "Yes I am high" doc3 = "I am kidding"
By comparing the document vectors (see the sketch below), we can see that some words are common to the documents.
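To make this concrete, here is a hedged sketch using scikit-learn's CountVectorizer on the three documents above (the custom token_pattern simply keeps one-letter tokens such as "I"; on older scikit-learn use get_feature_names() instead of get_feature_names_out()):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
vectors = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # the shared vocabulary
print(vectors.toarray())                   # one count vector per document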
Bag of Words and Tf-idf
Tf-idf stands for "Term Frequency times Inverse Document Frequency".
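A minimal sketch of tf-idf weighting on the same toy documents with scikit-learn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]
tfidf = TfidfVectorizer(token_pattern=r"\b\w+\b")
weights = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))  # rarer words such as 'kidding' get higher weight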
Now let's start the coding part.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('spam.tsv', sep='\t')
df.head()
| | label | message | length | punct |
|---|---|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |
- Now, to check whether there are any NULL values in the data set, we run the following code:
df.isnull().sum()
label      0
message    0
length     0
punct      0
dtype: int64
len(df)
5572
df['label'].value_counts()
ham     4825
spam     747
Name: label, dtype: int64
- We balance the data so that we have an equal number of spam and ham messages, so that our machine learning model learns both classes well during training.
ham = df[df['label']=='ham']
ham.head()
| | label | message | length | punct |
|---|---|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |
6 | ham | Even my brother is not like to speak with me. ... | 77 | 2 |
spam = df[df['label']=='spam']
spam.head()
| | label | message | length | punct |
|---|---|---|---|---|
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 147 | 8 |
8 | spam | WINNER!! As a valued network customer you have... | 157 | 6 |
9 | spam | Had your mobile 11 months or more? U R entitle... | 154 | 2 |
11 | spam | SIX chances to win CASH! From 100 to 20,000 po... | 136 | 8 |
ham.shape, spam.shape
((4825, 4), (747, 4))
ham = ham.sample(spam.shape[0])
ham.shape, spam.shape
((747, 4), (747, 4))
data = ham.append(spam, ignore_index=True)  # note: DataFrame.append is deprecated in newer pandas; pd.concat([ham, spam], ignore_index=True) is the equivalent
data.tail()
| | label | message | length | punct |
|---|---|---|---|---|
1489 | spam | Want explicit SEX in 30 secs? Ring 02073162414... | 90 | 3 |
1490 | spam | ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... | 158 | 5 |
1491 | spam | Had your contract mobile 11 Mnths? Latest Moto... | 160 | 8 |
1492 | spam | REMINDER FROM O2: To get 2.50 pounds free call... | 147 | 3 |
1493 | spam | This is the 2nd time we have tried 2 contact u... | 160 | 8 |
Exploratory Data Analysis
plt.hist(data[data['label']=='ham']['length'], bins=100, alpha=0.7, label='Ham')
plt.hist(data[data['label']=='spam']['length'], bins=100, alpha=0.7, label='Spam')
plt.xlabel('length of messages')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0, 300)
plt.show()
plt.hist(data[data['label']=='ham']['punct'], bins=100, alpha=0.7, label='Ham')
plt.hist(data[data['label']=='spam']['punct'], bins=100, alpha=0.7, label='Spam')
plt.xlabel('punctuations')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0, 30)
plt.show()
Data Preparation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
data.head()
| | label | message | length | punct |
|---|---|---|---|---|
0 | ham | No problem with the renewal. I.ll do it right ... | 79 | 3 |
1 | ham | Good afternoon, my love! How goes that day ? I... | 160 | 5 |
2 | ham | Can come my room but cannot come my house cos ... | 77 | 6 |
3 | ham | I can send you a pic if you like 🙂 | 35 | 2 |
4 | ham | I am on the way to ur home | 26 | 0 |
- Now we convert the text data to word embeddings using Word2Vec; for that we import the gensim library.
- To convert each message into vector form, we first tokenize each message.
- Next we convert each token (i.e. each word) into a word embedding using Word2Vec, imported from gensim.
- After converting each word to a vector (embedding), we take the average of all word vectors to obtain a single vector for the message, which is the final feature vector for that message.
import gensim
from nltk.tokenize import word_tokenize
import numpy as np

embedding_dim = 100
text = data['message']

# collect the tokens of every message into one long token list
Text = []
for i in range(data.shape[0]):
    text1 = word_tokenize(text[i])
    Text = text1 + Text

# train Word2Vec on the collected tokens (gensim 3.x API;
# in gensim >= 4.0 use vector_size=embedding_dim instead of size)
model = gensim.models.Word2Vec(sentences=[Text], size=embedding_dim, workers=4, min_count=1)
words = list(model.wv.vocab)

def word_2_vec(x):
    # tokenize the message and average its word vectors to get one feature vector
    t1 = word_tokenize(x)
    v = list(map(lambda y: sum(y) / len(y), zip(*model.wv[t1])))
    a = np.array(v)
    return a.reshape(1, -1)
- Applying word2vec to each text message:
data['vec'] = data['message'].apply(lambda x: word_2_vec(x))
data.head()
| | label | message | length | punct | vec |
|---|---|---|---|---|---|
0 | ham | No problem with the renewal. I.ll do it right ... | 79 | 3 | [[-0.028193846775037754, -0.000255275213728762... |
1 | ham | Good afternoon, my love! How goes that day ? I... | 160 | 5 | [[-0.03519534362069527, -0.0009830875404938859... |
2 | ham | Can come my room but cannot come my house cos ... | 77 | 6 | [[-0.004332678098257424, -0.000847288884769012... |
3 | ham | I can send you a pic if you like 🙂 | 35 | 2 | [[-0.04251667247577147, -0.002708293984390118,... |
4 | ham | I am on the way to ur home | 26 | 0 | [[-0.040913782135248766, 0.0017535838996991515... |
- Here we convert the feature vector of each message into columns of a dataframe.
w_vec = np.concatenate(data['vec'].to_numpy(), axis=0)
w_vec.shape
(1494, 100)
word_vec = pd.DataFrame(w_vec)
word_vec.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.028194 | -0.000255 | -0.005906 | 0.001543 | 0.028490 | 0.008546 | 0.023954 | 0.010250 | 0.043273 | 0.017577 | ... | 0.007982 | 0.010225 | 0.031404 | 0.004743 | -0.006825 | 0.008575 | -0.012575 | 0.006858 | -0.012980 | -0.005203 |
1 | -0.035195 | -0.000983 | -0.006663 | 0.002052 | 0.033874 | 0.010026 | 0.028198 | 0.011013 | 0.052498 | 0.021042 | ... | 0.006637 | 0.012224 | 0.036387 | 0.006481 | -0.008304 | 0.012149 | -0.015957 | 0.009656 | -0.015783 | -0.005901 |
2 | -0.004333 | -0.000847 | -0.001710 | 0.000431 | 0.003813 | 0.000388 | 0.003363 | 0.000789 | 0.006293 | 0.003545 | ... | 0.000006 | 0.001190 | 0.003674 | 0.001200 | -0.000205 | 0.002153 | -0.000952 | 0.001631 | -0.003232 | 0.001002 |
3 | -0.042517 | -0.002708 | -0.007234 | 0.001933 | 0.039980 | 0.010590 | 0.032837 | 0.011987 | 0.062139 | 0.024177 | ... | 0.008201 | 0.014543 | 0.043288 | 0.008478 | -0.009570 | 0.012796 | -0.018638 | 0.011211 | -0.018574 | -0.007211 |
4 | -0.040914 | 0.001754 | -0.009190 | 0.003661 | 0.040343 | 0.011911 | 0.036438 | 0.013463 | 0.061441 | 0.025578 | ... | 0.008912 | 0.014350 | 0.042732 | 0.009705 | -0.010381 | 0.012885 | -0.020183 | 0.011372 | -0.018638 | -0.005238 |
5 rows × 100 columns
X_train, X_test, y_train, y_test = train_test_split(word_vec, data['label'], test_size = 0.3, random_state=0, shuffle = True, stratify=data['label'])
Now we use different machine learning models to classify the text messages into the spam and ham classes.
- For hyperparameter tuning of each model we import GridSearchCV, which tunes a machine learning model by choosing its optimal parameters.
- Here we use 5-fold cross validation within the grid search.
- By using cross validation, the model generalizes well, that is, it performs well on test data.
Support Vector Machine
- In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
- In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

Parameter_svc = [{'cache_size': [300, 400, 200],
                  'tol': [0.0011, 0.002, 0.003],
                  'kernel': ['rbf', 'poly'],
                  'degree': [3, 4, 5]}]

clf_svc = GridSearchCV(SVC(), Parameter_svc, scoring='accuracy', verbose=2, cv=5)
clf_svc.fit(X_train, y_train)
print(clf_svc.best_params_)
y_pred1 = clf_svc.predict(X_test)
accuracy_score(y_pred1, y_test)
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.3s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.1s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
0.6993318485523385
LGBMClassifier
- Light GBM is a gradient boosting framework that uses tree-based learning algorithms.
- Light GBM grows trees vertically while other algorithms grow trees horizontally, meaning that Light GBM grows trees leaf-wise while other algorithms grow them level-wise. It chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise one.
from lightgbm import LGBMClassifier

Parameter_lgbm = [{'num_leaves': [31, 40, 50],
                   'max_depth': [3, 4, 5, 6],
                   'learning_rate': [0.1, 0.05, 0.2, 0.15],
                   'n_estimators': [700]}]

clf_lgbm = GridSearchCV(LGBMClassifier(), Parameter_lgbm, scoring='accuracy', verbose=2, cv=5)
clf_lgbm.fit(X_train, y_train)
print(clf_lgbm.best_params_)
y_pred2 = clf_lgbm.predict(X_test)
accuracy_score(y_pred2, y_test)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31, total=   2.7s
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  8.9min finished
{'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 700, 'num_leaves': 31}
0.821826280623608
print(classification_report(y_test, y_pred2))
              precision    recall  f1-score   support

         ham       0.82      0.82      0.82       225
        spam       0.82      0.82      0.82       224

    accuracy                           0.82       449
   macro avg       0.82      0.82      0.82       449
weighted avg       0.82      0.82      0.82       449
Random Forest Classifier
- Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
from sklearn.ensemble import RandomForestClassifier

RF_Cl = RandomForestClassifier(n_estimators=900)
RF_Cl.fit(X_train, y_train)
y_pred = RF_Cl.predict(X_test)
accuracy_score(y_pred, y_test)
0.8329621380846325
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.83      0.84      0.83       225
        spam       0.83      0.83      0.83       224

    accuracy                           0.83       449
   macro avg       0.83      0.83      0.83       449
weighted avg       0.83      0.83      0.83       449
Classification of texts using Tf-idf
X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size=0.3,
                                                    random_state=0, shuffle=True, stratify=data['label'])
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_train_vect.shape
(1045, 3596)
Pipeline and Random Forest Classifier
clf_rf = Pipeline([('tfidf', TfidfVectorizer()),
                   ('clf', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[224,   1],
       [ 24, 200]], dtype=int64)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.91      1.00      0.95       225
        spam       1.00      0.91      0.95       224

    accuracy                           0.95       449
   macro avg       0.95      0.95      0.95       449
weighted avg       0.95      0.95      0.95       449
accuracy_score(y_test, y_pred)
0.9510022271714922
clf_rf.predict(["Hey, whassup?"])
array(['ham'], dtype=object)
clf_rf.predict(["you have won tickets to the USA this summer."])
array(['ham'], dtype=object)
Support Vector Machine
clf_svc = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', SVC(C=1000, gamma='auto'))])
clf_svc.fit(X_train, y_train)
y_pred = clf_svc.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[221,   4],
       [ 16, 208]], dtype=int64)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.93      0.98      0.96       225
        spam       0.98      0.93      0.95       224

    accuracy                           0.96       449
   macro avg       0.96      0.96      0.96       449
weighted avg       0.96      0.96      0.96       449
accuracy_score(y_test, y_pred)
0.955456570155902
clf_svc.predict(["Hey, whassup?"])
array(['ham'], dtype=object)
clf_svc.predict(["you have got free tickets to the USA this summer."])
array(['spam'], dtype=object)
Summary
- From the above results we can conclude that Tf-idf performs better than the Word2Vec word embeddings here, since even without hyperparameter tuning the Tf-idf models performed well on the test data, with accuracy above 93%.