NLP Tutorial – Spam Text Message Classification using NLP

Published by ignacioberrios

Spam Ham text classification


Objective

  • The objective of this tutorial is to classify text messages into two classes: spam and ham.

What is Natural Language Processing

  • Natural Language Processing (NLP) is a field of Artificial Intelligence in which we analyse text using machine learning models.

Application of NLP

  • Text Classification
  • Spam Filters
  • Voice text messaging
  • Sentiment analysis
  • Spell or grammar check
  • Chat bot
  • Search Suggestion
  • Search Autocorrect
  • Automatic Review Analysis system
  • Machine translation
  • And so much more

NLP is broadly divided into two areas:

  • Natural Language Understanding (text classification)
  • Natural Language Generation (text generation)

The Process of Natural Language Understanding (Text Classification)


Sentence Breakdown


Natural Language Generation


How to get started with NLP

The following libraries are generally used in Natural Language Processing:

  • Sklearn
  • Spacy
  • NLTK
  • Gensim
  • Tensorflow and Keras
!pip install scikit-learn
!pip install -U spacy
!python -m spacy download en_core_web_sm   # newer spaCy versions use the full model name instead of the old 'en' shortcut
!pip install nltk
!pip install gensim
!pip install lightgbm

Applications of these libraries

  • Tokenization
  • Parts of Speech Tagging
  • Entity Detection
  • Dependency Parsing
  • Noun Phrases
  • Words-to-Vectors Integration
  • Context Derivation
  • and so much more

Data Cleaning Options

  • Case Normalization
  • Removing Stop Words
  • Removing Punctuations or Special Symbols
  • Lemmatization and Stemming (word normalization)
  • Parts of Speech Tagging
  • Entity Detection
  • Bag of Words
  • Word-to-Vec

Tokenization

Tokenization is breaking the raw text into small chunks: it splits the raw text into words or sentences, called tokens. These tokens help in understanding the context or in developing the model for NLP. Tokenization helps in interpreting the meaning of the text by analyzing the sequence of the words.
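For instance, a minimal sketch with NLTK (assuming the 'punkt' tokenizer models have been downloaded; the example message is made up for illustration):

import nltk
nltk.download('punkt')   # one-time download of the tokenizer models (newer NLTK may also need 'punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

msg = "Free entry in 2 a wkly comp. Text WIN to claim now!"
print(sent_tokenize(msg))   # sentence tokens: ['Free entry in 2 a wkly comp.', 'Text WIN to claim now!']
print(word_tokenize(msg))   # word tokens: ['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', '.', 'Text', 'WIN', ...]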

Parts of Speech Tagging

A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, adjective, etc., although computational applications generally use more fine-grained POS tags like 'noun-plural'.
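A minimal sketch with spaCy (assuming the en_core_web_sm model is installed; the exact tags depend on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You have won a free ticket to the final")
for token in doc:
    # coarse-grained tag (pos_) and fine-grained tag (tag_), e.g. won VERB VBN, ticket NOUN NN
    print(token.text, token.pos_, token.tag_)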

Entity Detection

Named entity recognition (NER), also known as entity chunking/extraction, is a popular technique used in information extraction to identify and segment named entities and classify or categorize them under various predefined classes.
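A minimal spaCy sketch (again assuming en_core_web_sm; which entities are found depends on the model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Call O2 before Monday to claim 150 pounds from Vodafone UK.")
for ent in doc.ents:
    # prints each detected entity with its predicted class, e.g. Monday DATE, 150 pounds MONEY
    print(ent.text, ent.label_)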

Dependency Parsing

Syntactic Parsing or Dependency Parsing is the task of recognizing a sentence and assigning a syntactic structure to it. The most widely used syntactic structure is the parse tree, which can be generated using parsing algorithms. These parse trees are useful in various applications like grammar checking and, more importantly, they play a critical role in the semantic analysis stage.
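A minimal spaCy sketch that prints each token's dependency relation and its head word (assuming en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("You have won two tickets to the final.")
for token in doc:
    # e.g. You nsubj won, have aux won, won ROOT won, tickets dobj won, ...
    print(token.text, token.dep_, token.head.text)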

Noun Phrases

Noun phrases are part-of-speech patterns that include a noun. They can also include whatever other parts of speech make grammatical sense, and can include multiple nouns. Some common noun phrase patterns are listed below, followed by a short spaCy sketch:

  • Noun
  • Noun-Noun..… -Noun
  • Adjective(s)-Noun
  • Verb-(Adjectives-)Noun
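A minimal spaCy sketch (assuming en_core_web_sm) that extracts the noun chunks from a made-up sentence:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The lucky winner gets two free concert tickets this weekend.")
for chunk in doc.noun_chunks:
    # e.g. 'The lucky winner', 'two free concert tickets', 'this weekend'
    print(chunk.text)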

Words-to-Vectors Integration

  • Computers interact with humans in programming languages which are unambiguous, precise and often structured. However, natural (human) language has a lot of ambiguity. There are multiple words with same meaning (synonyms), words with multiple meanings (polysemy) some of which are entirely opposite in nature (auto-antonyms), and words which behave differently when used as noun and verb. These words make sense contextually in natural language which humans can comprehend and distinguish easily, but machines can’t. That’s what makes NLP one of the most difficult and interesting tasks in AI.
  • Word2Vec is a group of models that helps derive relations between a word and its contextual words; a minimal gensim sketch follows.
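A minimal gensim sketch on a made-up toy corpus (assuming gensim 4.x, where the dimension argument is vector_size; gensim 3.x used size):

from gensim.models import Word2Vec

# Toy corpus: each inner list is one tokenized "sentence"
sentences = [["free", "entry", "to", "win", "cash"],
             ["win", "free", "tickets", "now"],
             ["are", "you", "coming", "home", "now"]]

model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, workers=2)
print(model.wv["free"].shape)                  # (50,) -- the learned embedding for 'free'
print(model.wv.most_similar("free", topn=2))   # words used in similar contexts in this toy corpus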

Case Normalization

Normalization is a process that converts a list of words to a more uniform sequence. For example, converting all words to lowercase will simplify the searching process.
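For instance, lowercasing a message in Python:

message = "WINNER!! Claim Your FREE Prize Now"
print(message.lower())   # 'winner!! claim your free prize now'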

Removing Stop Words

  • A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
  • To check the list of stop words, you can run the following commands in the Python shell:

import nltk
nltk.download('stopwords')   # one-time download of the stop word lists

from nltk.corpus import stopwords

print(stopwords.words('english'))
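Building on that, a minimal sketch of filtering stop words out of a tokenized message (assuming the 'punkt' models are also downloaded for word_tokenize; the message is made up):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("You have won a prize in the lucky draw")
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)   # ['won', 'prize', 'lucky', 'draw']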

Stemming

  • Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.

Playing -------> Play

Plays ---------> Play

Played --------> Play
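A minimal NLTK sketch with the Porter stemmer reproducing the mapping above:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "plays", "played"]:
    print(word, "->", stemmer.stem(word))   # playing -> play, plays -> play, played -> play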

Lemmatisation

  • Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

is, am, are ------> be

Example

the boy's cars are different colors ------> the boy car be differ color
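A minimal NLTK sketch (assuming the 'wordnet' corpus is downloaded; note that the WordNet lemmatizer needs to be told the part of speech to map these verb forms to "be"):

import nltk
nltk.download('wordnet')   # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["is", "am", "are"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))   # all three map to 'be'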

What will be covered in this blog

  • Introduction to NLP and spam detection using sklearn
  • Reading Text and PDF files in Python
  • Tokenization
  • Parts of Speech Tagging
  • Word-to-Vectors
  • Then real-world practical examples

Bag of Words - The Simplest Word Embedding Technique

doc1 = "I am high"
doc2 = "Yes I am high"
doc3 = "I am kidding" 

By comparing the vectors, we see that some words are common across the documents.
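A minimal scikit-learn sketch that produces such count vectors for the three documents above (note that the default tokenizer drops single-character tokens such as "I"; get_feature_names_out assumes scikit-learn 1.0+):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]
cv = CountVectorizer()
bow = cv.fit_transform(docs)
print(cv.get_feature_names_out())   # ['am' 'high' 'kidding' 'yes']
print(bow.toarray())                # [[1 1 0 0], [1 1 0 1], [1 0 1 0]] -- one count vector per document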

Bag of Words and Tf-idf

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

tf–idf stands for “Term Frequency times Inverse Document Frequency”.
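With scikit-learn's smoothed defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and each document vector is then L2-normalized. A minimal sketch on the same toy documents (again assuming scikit-learn 1.0+):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I am high", "Yes I am high", "I am kidding"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())   # ['am' 'high' 'kidding' 'yes']
print(X.toarray().round(2))            # rarer terms like 'yes' and 'kidding' receive higher weights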


Now let's start the coding part.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('spam.tsv', sep='\t')
df.head()
   label  message                                             length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham  Ok lar... Joking wif u oni...                           29      6
2   spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2
  • Now, to check whether there are any NULL values in the dataset, we run the following code:
df.isnull().sum()
label      0
message    0
length     0
punct      0
dtype: int64
len(df)
5572
df['label'].value_counts()
ham     4825
spam     747
Name: label, dtype: int64
  • Balancing the data so that we have an equal number of spam and ham messages, which helps the machine learning model learn both classes well during training.
ham = df[df['label']=='ham']
ham.head()
   label  message                                             length  punct
0    ham  Go until jurong point, crazy.. Available only ...     111      9
1    ham  Ok lar... Joking wif u oni...                           29      6
3    ham  U dun say so early hor... U c already then say...      49      6
4    ham  Nah I don't think he goes to usf, he lives aro...      61      2
6    ham  Even my brother is not like to speak with me. ...      77      2
spam = df[df['label']=='spam']
spam.head()
    label  message                                             length  punct
2    spam  Free entry in 2 a wkly comp to win FA Cup fina...     155      6
5    spam  FreeMsg Hey there darling it's been 3 week's n...     147      8
8    spam  WINNER!! As a valued network customer you have...     157      6
9    spam  Had your mobile 11 months or more? U R entitle...     154      2
11   spam  SIX chances to win CASH! From 100 to 20,000 po...     136      8
ham.shape, spam.shape
((4825, 4), (747, 4))
ham = ham.sample(spam.shape[0])
ham.shape, spam.shape
((747, 4), (747, 4))
data = pd.concat([ham, spam], ignore_index=True)   # DataFrame.append was removed in pandas 2.x
data.tail()
      label  message                                             length  punct
1489   spam  Want explicit SEX in 30 secs? Ring 02073162414...      90      3
1490   spam  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...     158      5
1491   spam  Had your contract mobile 11 Mnths? Latest Moto...     160      8
1492   spam  REMINDER FROM O2: To get 2.50 pounds free call...     147      3
1493   spam  This is the 2nd time we have tried 2 contact u...     160      8

Exploratory Data Analysis

plt.hist(data[data['label']=='ham']['length'], bins = 100, alpha = 0.7,label='Ham')
plt.hist(data[data['label']=='spam']['length'], bins = 100, alpha = 0.7,label='Spam')
plt.xlabel('length of messages')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0,300)
plt.show()
plt.hist(data[data['label']=='ham']['punct'], bins = 100, alpha = 0.7,label='Ham')
plt.hist(data[data['label']=='spam']['punct'], bins = 100, alpha = 0.7,label='Spam')
plt.xlabel('punctuations')
plt.ylabel('Frequency')
plt.legend()
plt.xlim(0,30)
plt.show()

Data Preparation

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
data.head()
   label  message                                             length  punct
0    ham  No problem with the renewal. I.ll do it right ...      79      3
1    ham  Good afternoon, my love! How goes that day ? I...     160      5
2    ham  Can come my room but cannot come my house cos ...      77      6
3    ham  I can send you a pic if you like 🙂                     35      2
4    ham  I am on the way to ur home                              26      0
  • Now we convert the text data to word embeddings using Word2Vec; for that we import the gensim library.
  • To convert each message into vector form, we first tokenize each message.
  • Then we convert each token, i.e. each word, into a word embedding using Word2Vec, which we import from gensim.
  • After converting each word to a vector (embedding), we take the average of all the word vectors to obtain a vector for the message, which is the final feature vector for that message.
import gensim
from nltk.tokenize import word_tokenize
import numpy as np
# nltk.download('punkt')  # uncomment if the tokenizer models are not yet downloaded

embedding_dim = 100
text = data['message']

# Collect the tokens of every message into one flat list,
# which is then passed to Word2Vec as a single "sentence".
Text = []
for i in range(data.shape[0]):
    Text = word_tokenize(text[i]) + Text

# In gensim 4.x the dimension argument is 'vector_size' (it was 'size' in gensim 3.x).
model = gensim.models.Word2Vec(sentences=[Text], vector_size=embedding_dim, workers=4, min_count=1)
words = list(model.wv.key_to_index)   # vocabulary (model.wv.vocab in gensim 3.x)

def word_2_vec(x):
    # Tokenize the message, look up the embedding of each token and
    # average them dimension-wise to get one 100-d feature vector.
    t1 = word_tokenize(x)
    v = np.mean(model.wv[t1], axis=0)
    return v.reshape(1, -1)
  • Applying word2vec to each text message
data['vec']=data['message'].apply(lambda x: word_2_vec(x))
data.head()
   label  message                                             length  punct  vec
0    ham  No problem with the renewal. I.ll do it right ...      79      3  [[-0.028193846775037754, -0.000255275213728762...
1    ham  Good afternoon, my love! How goes that day ? I...     160      5  [[-0.03519534362069527, -0.0009830875404938859...
2    ham  Can come my room but cannot come my house cos ...      77      6  [[-0.004332678098257424, -0.000847288884769012...
3    ham  I can send you a pic if you like 🙂                     35      2  [[-0.04251667247577147, -0.002708293984390118,...
4    ham  I am on the way to ur home                              26      0  [[-0.040913782135248766, 0.0017535838996991515...
  • Here we convert the feature vector of each message into the columns of a DataFrame.
w_vec=np.concatenate(data['vec'].to_numpy(), axis=0)
w_vec.shape
(1494, 100)
word_vec=pd.DataFrame(w_vec)
word_vec.head()
          0         1         2  ...        97        98        99
0 -0.028194 -0.000255 -0.005906  ...  0.006858 -0.012980 -0.005203
1 -0.035195 -0.000983 -0.006663  ...  0.009656 -0.015783 -0.005901
2 -0.004333 -0.000847 -0.001710  ...  0.001631 -0.003232  0.001002
3 -0.042517 -0.002708 -0.007234  ...  0.011211 -0.018574 -0.007211
4 -0.040914  0.001754 -0.009190  ...  0.011372 -0.018638 -0.005238

5 rows × 100 columns

X_train, X_test, y_train, y_test = train_test_split(word_vec, data['label'], test_size = 0.3, random_state=0, shuffle = True, stratify=data['label'])

Now we use different machine learning models to classify the text messages into the spam and ham classes.

  • For hyperparameter tuning of each model we import GridSearchCV, which tunes a machine learning model by choosing the optimal parameters for it.
  • Here we use 5-fold cross-validation in the grid search.
  • By using cross-validation the model generalizes better, i.e. it performs well on the test data.

Support Vector Machine

  • In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
  • In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
Parameter_svc=[{'cache_size': [300,400,200],'tol': [0.0011,0.002,0.003],
                'kernel': ['rbf','poly'],
               'degree': [3,4,5]}]
scores = ['accuracy']

clf_svc = GridSearchCV(SVC(), Parameter_svc, scoring='accuracy', verbose=2, cv=5
                         )
clf_svc.fit(X_train,y_train)


print(clf_svc.best_params_)

y_pred1 = clf_svc.predict(X_test)

accuracy_score(y_pred1,y_test)
Fitting 5 folds for each of 54 candidates, totalling 270 fits
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.3s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[CV] . cache_size=300, degree=3, kernel=rbf, tol=0.0011, total=   0.1s
[CV] cache_size=300, degree=3, kernel=rbf, tol=0.0011 ................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
0.6993318485523385

LGBMClassifier

  • LightGBM is a gradient boosting framework that uses tree-based learning algorithms.
  • LightGBM grows trees leaf-wise (vertically) while most other algorithms grow trees level-wise (horizontally): it chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
from lightgbm import LGBMClassifier
Parameter_lgbm=[{'num_leaves':[31,40,50],
              'max_depth':[3,4,5,6],
              'learning_rate':[0.1,0.05,0.2,0.15],
             'n_estimators':[700]}]
scores = ['accuracy']
clf_lgbm = GridSearchCV(LGBMClassifier(), Parameter_lgbm,
                    scoring='accuracy', verbose=2, cv=5)                         
clf_lgbm.fit(X_train,y_train)
print(clf_lgbm.best_params_)
y_pred2 = clf_lgbm.predict(X_test)
accuracy_score(y_pred2,y_test)
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31, total=   2.7s
[CV] learning_rate=0.1, max_depth=3, n_estimators=700, num_leaves=31 .
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  8.9min finished
{'learning_rate': 0.2, 'max_depth': 3, 'n_estimators': 700, 'num_leaves': 31}
0.821826280623608
print(classification_report(y_test, y_pred2))
              precision    recall  f1-score   support

         ham       0.82      0.82      0.82       225
        spam       0.82      0.82      0.82       224

    accuracy                           0.82       449
   macro avg       0.82      0.82      0.82       449
weighted avg       0.82      0.82      0.82       449

Random_Forest classifier

  • Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
from sklearn.ensemble import RandomForestClassifier
RF_Cl=RandomForestClassifier(n_estimators=900)
RF_Cl.fit(X_train,y_train)
y_pred=RF_Cl.predict(X_test)
accuracy_score(y_pred,y_test)
0.8329621380846325
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.83      0.84      0.83       225
        spam       0.83      0.83      0.83       224

    accuracy                           0.83       449
   macro avg       0.83      0.83      0.83       449
weighted avg       0.83      0.83      0.83       449

Classification of texts using TF-IDF

X_train, X_test, y_train, y_test = train_test_split(data['message'], data['label'], test_size = 0.3, random_state=0, shuffle = True, stratify=data['label'])
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_train_vect.shape
(1045, 3596)

Pipeline and Random_Forest classifier

clf_rf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', RandomForestClassifier(n_estimators=100, n_jobs=-1))])
clf_rf.fit(X_train, y_train)
y_pred = clf_rf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[224,   1],
       [ 24, 200]], dtype=int64)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.91      1.00      0.95       225
        spam       1.00      0.91      0.95       224

    accuracy                           0.95       449
   macro avg       0.95      0.95      0.95       449
weighted avg       0.95      0.95      0.95       449

accuracy_score(y_test, y_pred)
0.9510022271714922
clf_rf.predict(["Hey, whassup?"])
array(['ham'], dtype=object)
clf_rf.predict(["you have won tickets to the USA this summer."])
array(['ham'], dtype=object)

Support Vector Machine

clf_svc = Pipeline([('tfidf', TfidfVectorizer()), ('clf', SVC(C = 1000, gamma = 'auto'))])
clf_svc.fit(X_train, y_train)
y_pred = clf_svc.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[221,   4],
       [ 16, 208]], dtype=int64)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         ham       0.93      0.98      0.96       225
        spam       0.98      0.93      0.95       224

    accuracy                           0.96       449
   macro avg       0.96      0.96      0.96       449
weighted avg       0.96      0.96      0.96       449

accuracy_score(y_test, y_pred)
0.955456570155902
clf_svc.predict(["Hey, whassup?"])
array(['ham'], dtype=object)
clf_svc.predict(["you have got free tickets to the USA this summer."])
array(['spam'], dtype=object)

Summary

  • From the above results we can conclude that TF-IDF performs better than the Word2Vec word embeddings here, since even without hyperparameter tuning all models performed well on the test data, with accuracy above 93%.