NLP: End to End Text Processing for Beginners

Complete Text Processing for Beginners

Everything we express (either verbally or in writing) carries a huge amount of information. The topic we choose, our tone, our choice of words: each adds information that can be interpreted and turned into value. In theory, we can understand and even predict human behavior using that information.

But there is a problem: one person may generate hundreds or thousands of words in a declaration, each sentence with its own complexity. If you want to scale up and analyze hundreds, thousands, or millions of people or declarations in a given geography, doing it by hand becomes unmanageable.

Data generated from conversations, declarations, or even tweets are examples of unstructured data. Unstructured data doesn’t fit neatly into the traditional row and column structure of relational databases, and it represents the vast majority of data available in the real world. It is messy and hard to manipulate. Nevertheless, thanks to advances in disciplines like machine learning, a big revolution is under way. It is no longer about interpreting text or speech based on its keywords (the old-fashioned mechanical way), but about understanding the meaning behind those words (the cognitive way). This makes it possible to detect figures of speech like irony, or to perform sentiment analysis.

Natural Language Processing or NLP is a field of artificial intelligence that gives the machines the ability to read, understand and derive meaning from human languages.



Installing libraries

SpaCy is an open-source software library, published and distributed under the MIT license, for performing simple to advanced Natural Language Processing (NLP) tasks such as tokenization, part-of-speech tagging, named entity recognition, text classification, calculating semantic similarities between texts, lemmatization, and dependency parsing, among others.

# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

In this article, we are going to perform the below tasks.

General Feature Extraction

  • File loading
  • Word counts
  • Characters count
  • Average characters per word
  • Stop words count
  • Count #HashTags and @Mentions
  • If numeric digits are present in twitts
  • Upper case word counts

Preprocessing and Cleaning

  • Lower case
  • Contraction to Expansion
  • Emails removal and counts
  • URLs removal and counts
  • Removal of RT
  • Removal of Special Characters
  • Removal of multiple spaces
  • Removal of HTML tags
  • Removal of accented characters
  • Removal of Stop Words
  • Conversion into base form of words
  • Common Occurring words Removal
  • Rare Occurring words Removal
  • Word Cloud
  • Spelling Correction
  • Tokenization
  • Lemmatization
  • Detecting Entities using NER
  • Noun Detection
  • Language Detection
  • Sentence Translation
  • Using Inbuilt Sentiment Classifier

Advanced Text Processing and Feature Extraction

  • N-Gram, Bi-Gram etc
  • Bag of Words (BoW)
  • Term Frequency Calculation TF
  • Inverse Document Frequency IDF
  • TFIDF Term Frequency - Inverse Document Frequency
  • Word Embedding Word2Vec using SpaCy

Machine Learning Models for Text Classification

  • SGDClassifier
  • LogisticRegression
  • LogisticRegressionCV
  • LinearSVC
  • RandomForestClassifier

Importing libraries

import pandas as pd
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
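df = pd.read_csv('twitter16m.csv', encoding = 'latin1', header = None)
# Assumption: the file follows the Sentiment140 layout (label in column 0, tweet text in column 5)
df = df[[5, 0]]
df.columns = ['twitts', 'sentiment']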
sent_map = {0: 'negative', 4: 'positive'}

Word Counts

In this step, we split each tweet into a list of words with the split() function and then use len() on that list to count the words (tokens).

df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

Characters Count

In this step, we use the len() function to count the characters in each tweet.

df['char_counts'] = df['twitts'].apply(lambda x: len(x))

Average Word Length

In this step, we create a function get_avg_word_len() that calculates the average word length of each tweet. Spaces are not counted, which is why the result differs from len(x)/len(words).

def get_avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len = word_len + len(word)
    return word_len/len(words) # != len(x)/len(words)
df['avg_word_len'] = df['twitts'].apply(lambda x: get_avg_word_len(x))
len('this is nlp lesson')/4
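
The three features so far (word count, character count and average word length) can also be checked on a single string without pandas. The sketch below is only an illustration: basic_features is a hypothetical helper name, not part of the original notebook, and returning 0.0 for empty text is an assumption made here to avoid a division by zero.

```python
def basic_features(text):
    """Return word count, character count and average word length for one string."""
    words = str(text).split()
    word_count = len(words)
    char_count = len(str(text))
    # Average over the letters of the words only, so spaces are not counted
    avg_word_len = sum(len(w) for w in words) / word_count if words else 0.0
    return {
        "word_counts": word_count,
        "char_counts": char_count,
        "avg_word_len": avg_word_len,
    }


features = basic_features("this is nlp lesson")
print(features)
```

Here the character count includes the three spaces, while the average word length does not.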

Stop Words Count

In this section, we count the stop words in each tweet, using spaCy's STOP_WORDS set shown below.

{'one', 'up', 'further', 'herself', 'nevertheless', 'their', 'when', 'a', 'bottom', 'both', 'also', 'i', 'sometime', 'ours', "'d", 'him', 'together', 'former', 'hereafter', 'whereby', "'ll", 'three', 'same', 'is', 'say', 'hers', 'must', 'five', 'you', 'across', 'n‘t', 'mostly', 'into', 'am', 'myself', 'something', 'could', 'being', 'seems', 'go', 'only', 'fifteen', 'either', 'us', 'than', 'latter', 'so', 'after', 'name', 'there', 'that', 'next', 'even', 'without', 'along', 'behind', 'very', 'whereas', 'off', 'herein', 'although', 'such', 'themselves', 'then', 'in', 'under', 'of', 'onto', 'really', 'due', 'otherwise', 'give', 'yourself', 'indeed', 'my', 'mine', 'show', 'via', 'elsewhere', 'be', 'just', 'thence', 'them', 'beside', 'though', 'as', 'out', 'third', 'however', 'twelve', 'except', '‘d', 'anything', 'move', 'side', 'everything', 'all', 'towards', 'whatever', 'will', 'n’t', 'toward', 'keep', 'hereupon', 'might', 'no', 'own', 'itself', 'for', 'can', 'rather', 'whether', 'while', 'and', 'part', 'over', 'else', 'has', 'forty', 'about', 'hereby', 'sixty', 'using', 'here', 'please', 'often', '’re', 'any', 'ca', 'per', 'whole', 'it', 'are', 'from', 'had', 'thru', '’m', 'two', 'fifty', 'your', 'latterly', 'again', 'or', 'few', 'against', 'much', 'somewhere', 'but', '’d', 'somehow', 'never', 'becoming', 'down', 'regarding', 'always', 'other', 'amount', 'because', 'noone', 'anyone', 'six', 'each', 'thus', 'alone', 'why', 'his', 'sometimes', 'now', 'since', 'become', 'see', 'she', 'where', 'whereafter', 'various', 'perhaps', 'another', 'who', 'anyhow', 'yourselves', 'someone', 'ten', 'became', 'nothing', 'front', 'an', 'anyway', 'get', 'thereafter', "'re", 'our', 'call', 'therein', 'have', 'this', 'above', 'some', 'namely', '‘re', 'seem', 'until', '’ll', 'more', 'still', "n't", 'the', 'does', 'himself', 'take', 'he', 'which', 'seeming', 'been', 'beforehand', 'may', 'do', 'well', 'ever', 'used', 'enough', 'every', 'top', 'made', "'m", 'hundred', 'almost', 'her', 
'moreover', 'wherever', '’s', 'amongst', 'meanwhile', 'nobody', 'ourselves', 'whenever', 'at', 'wherein', 'nowhere', 'around', 'between', 'last', 'others', 'becomes', 'they', 'full', 'below', 'nor', 'before', 'what', 'within', 'these', 'besides', 'whereupon', 'how', 'throughout', 'eight', "'s", 'on', 'most', 'if', '‘ve', 'should', 'four', 'serious', 'thereby', '‘ll', 'whence', 'done', 'anywhere', 'yours', 'formerly', 'everyone', 'whose', 'back', 'make', 'among', 'first', 'we', '‘s', 'neither', 'doing', 'already', 'those', 'empty', 'did', 'not', '‘m', 'less', 'to', 'during', 'twenty', 'too', 'put', 'nine', 'yet', 'everywhere', 'quite', 'were', 'seemed', '’ve', 'through', 'once', 'whither', 'thereupon', 'whoever', "'ve", 'therefore', 'me', 'unless', 'whom', 'cannot', 'afterwards', 'none', 'least', 'hence', 'eleven', 'with', 'upon', 'was', 'would', 'by', 'beyond', 'several', 'its', 'many', 're'}
x = 'this is text data'
['this', 'is', 'text', 'data']
len([t for t in x.split() if t in STOP_WORDS])
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in STOP_WORDS]))

Count #HashTags and @Mentions

In this section, we count the words that start with # (hashtags) and with @ (mentions).

x = 'this #hashtag and this is @mention'
# x = x.split()
# x
[t for t in x.split() if t.startswith('@')]
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
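
Checking startswith() on whitespace-separated tokens misses a hashtag when punctuation precedes it, as in "(#nlp)". As a possible alternative, the sketch below counts hashtags and mentions with regular expressions. The names HASHTAG_RE, MENTION_RE and count_tags are assumptions made for this example.

```python
import re

# '#' or '@' must not be preceded by a word character, so the '@' inside
# an address like a@b.com is not counted as a mention
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def count_tags(text):
    """Return (number of hashtags, number of mentions) found in text."""
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))


print(count_tags("this #hashtag and this is @mention"))
print(count_tags("(#nlp) mail me at a@b.com"))
```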

If numeric digits are present in twitts

In this section, we count the tokens in each tweet that consist only of digits.

df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))

UPPER case words count

In this section, we count the fully upper-case words in each tweet, considering only words longer than 3 characters.

df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper() and len(t)>3]))
"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

Preprocessing and Cleaning

Lower case conversion

In this section, we convert every tweet to lower case.

df['twitts'] = df['twitts'].apply(lambda x: x.lower())

Contraction to Expansion

In this section, we expand contractions (and a few chat abbreviations) into their full forms, using the dictionary defined below and the function cont_to_exp().

x = "i don't know what you want, can't, he'll, i'd"
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}
def cont_to_exp(x):
    # Expand every known contraction; non-string values are returned unchanged
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
x = "hi, i'd be happy"
cont_to_exp(x)
'hi, i would be happy'
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
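
Because cont_to_exp() applies str.replace() key by key, a short key can fire inside a longer one: "can't" is replaced before "can't've", which leaves "cannot've" behind. One way around this is to build a single regular expression that tries the longest keys first. This is a minimal sketch on a small demo dictionary, not the article's method; demo_contractions, naive_expand and expand are hypothetical names.

```python
import re

# Small demo dictionary (a subset of the full mapping), in the same key order
demo_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "i'd": "i would",
    "don't": "do not",
}


def naive_expand(text):
    """Key-by-key replacement, as in cont_to_exp()."""
    for key, value in demo_contractions.items():
        text = text.replace(key, value)
    return text


# Longest keys first, and only when the match is not glued to other word characters
_keys = sorted(demo_contractions, key=len, reverse=True)
_pattern = re.compile(r"(?<!\w)(?:" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)")


def expand(text):
    """Single-pass replacement that prefers the longest matching contraction."""
    return _pattern.sub(lambda m: demo_contractions[m.group(0)], text)


sample = "i can't've said i'd go"
print(naive_expand(sample))
print(expand(sample))
```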

Count and Remove Emails

In this section, we extract and count the email addresses in each tweet, and then remove them from the text.

import re
x = 'hi my email me at'
re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)
['', '']
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)
'hi my email me at  '
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))
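
The same extract-then-remove step can be tried on a single string. The sketch below reuses the article's pattern (without the outer capture group) and compiles it once. The addresses are made-up examples, and EMAIL_RE and extract_and_remove_emails are hypothetical names.

```python
import re

# Same character classes as the pattern above, compiled once for reuse
EMAIL_RE = re.compile(r"[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+")


def extract_and_remove_emails(text):
    """Return (list of addresses found, text with the addresses removed)."""
    emails = EMAIL_RE.findall(text)
    cleaned = EMAIL_RE.sub("", text)
    return emails, cleaned


emails, cleaned = extract_and_remove_emails(
    "write to someone@example.com or other.person@example.org today"
)
print(emails, len(emails))
print(repr(cleaned))
```

Removing the addresses leaves double spaces behind, which the later "Remove multiple spaces" step cleans up.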

Count URLs and Remove Them

In this section, we count the URLs in each tweet and then remove them, using regular expressions.

x = 'hi, to watch more visit'
re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', '', '/kgptalkie')]
df['urls_flag'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)
'hi, to watch more visit '
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
'@switchfoot  - awww, that is a bummer.  you shoulda got david carr of third day to do it. ;d'
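
Note that re.findall() returns one tuple per match when the pattern contains capture groups, which is why the output above shows the scheme, host and path as separate items. If whole URLs are wanted instead, the groups can be made non-capturing. The sketch below does this on a made-up string; URL_RE is a hypothetical name and the example URLs are placeholders.

```python
import re

# Same structure as the pattern above, with non-capturing groups (?:...)
URL_RE = re.compile(
    r"(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+"
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"
)

text = "docs at https://example.com/guide and ftp://files.example.org"
urls = URL_RE.findall(text)      # full URLs, because no group captures
cleaned = URL_RE.sub("", text)   # text with the URLs removed

print(urls)
print(repr(cleaned))
```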

Remove RT

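In this section, we remove the retweet marker RT. Because the text was lower-cased earlier, we match rt as a whole word so that the same letters inside other words are left untouched.

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x))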

Special Chars removal or punctuation removal

In this section, we remove special characters and punctuation, keeping only letters, digits, spaces and hyphens.

df['twitts'] = df['twitts'].apply(lambda x: re.sub('[^A-Z a-z 0-9-]+', '', x))

Remove multiple spaces "hi hello "

In this section, we collapse runs of multiple spaces into a single space.

x = 'thanks    for    watching and    please    like this video'
" ".join(x.split())
'thanks for watching and please like this video'
df['twitts'] = df['twitts'].apply(lambda x: " ".join(x.split()))

Remove HTML tags

In this section, we remove HTML tags and keep only the text inside them.

from bs4 import BeautifulSoup
x = '<html><h2>Thanks for watching</h2></html>'
BeautifulSoup(x, 'lxml').get_text()
'Thanks for watching'
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

Remove Accented Chars

In this section, we replace accented characters with their unaccented equivalents.

import unicodedata
x = 'Áccěntěd těxt'
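def remove_accented_chars(x):
    # NFKD splits each accented letter into a base letter plus combining marks; the ASCII encode then drops the marks
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x
remove_accented_chars(x)
'Accented text'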

SpaCy and NLP

Remove Stop Words

In this section, we are removing the stop words from text document.

import spacy
x = 'this is stop words removal code is a the an how what'
" ".join([t for t in x.split() if t not in STOP_WORDS])
'stop words removal code'
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))

Convert into base or root form of word

In this section, we convert each word to its base form (lemma).

nlp = spacy.load('en_core_web_sm')
x = 'kenichan dived times ball managed save 50 rest'
# dive = dived, time = times, manage = managed
# x = 'i you he she they is am are'
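def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = str(token.lemma_)
        # keep pronouns ('-PRON-' is the pronoun lemma in spaCy 2.x) and forms of "be" as written
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))
make_to_base(x)
kenichan dive time ball manage save 50 rest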

Common words removal

In this section, we remove the 20 most frequent words from the text corpus.

' '.join(df.head()['twitts'])
'switchfoot - awww bummer shoulda got david carr day d upset update facebook texting cry result school today blah kenichan dived times ball managed save 50 rest bounds body feels itchy like fire nationwideclass behaving mad'
text = ' '.join(df['twitts'])
text = text.split()
freq_comm = pd.Series(text).value_counts()
f20 = freq_comm[:20]
good      89366
day       82299
like      77735
-         69662
today     64512
going     64078
love      63421
work      62804
got       60749
time      56081
lol       55094
know      51172
im        50147
want      42070
new       41995
think     41040
night     41029
amp       40616
thanks    39311
home      39168
dtype: int64
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in f20]))

Rare words removal

In this section, we remove the 20 least frequent words from the text corpus.

rare20 = freq_comm[-20:]
veru              1
80-90f            1
refrigerant       1
demaisss          1
knittingsci-fi    1
wendireed         1
danielletuazon    1
chacha8           1
a-zquot           1
krustythecat      1
westmount         1
-appreciate       1
motocycle         1
madamhow          1
felspoon          1
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
dtype: int64
rare = freq_comm[freq_comm.values == 1]
mamat             1
fiive             1
music-festival    1
leenahyena        1
11517             1
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
Length: 536196, dtype: int64
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
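
The checks t not in f20 and t not in rare20 work because the in operator on a pandas Series tests the index labels (here, the words) rather than the counts. The same idea can be sketched with the standard library on a toy corpus. Everything below (docs, drop, the cut-off of 2) is an assumption chosen for illustration, not the tweet data.

```python
from collections import Counter

docs = ["good day today", "good work today", "rare gem"]  # toy corpus

# Count every word across the whole corpus
counts = Counter(word for doc in docs for word in doc.split())

# Sets give fast and explicit membership tests
common = {word for word, _ in counts.most_common(2)}
rare = {word for word, n in counts.items() if n == 1}


def drop(doc, unwanted):
    """Remove every token of doc that appears in the unwanted set."""
    return " ".join(t for t in doc.split() if t not in unwanted)


without_common = [drop(d, common) for d in docs]
without_rare = [drop(d, rare) for d in docs]
print(without_common)
print(without_rare)
```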

Word Cloud Visualization

In this section, we visualize the text corpus using the WordCloud library.

# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
x = ' '.join(text[:20000])
wc = WordCloud(width = 800, height=400).generate(x)
# display the generated image
plt.imshow(wc)
plt.axis('off')
plt.show()

Spelling Correction

In this section, we correct the spelling of each word using TextBlob.

# !pip install -U textblob
# !python -m textblob.download_corpora
from textblob import TextBlob
x = 'tanks forr waching this vidio carri'
x = TextBlob(x).correct()
TextBlob("tanks for watching this video carry")


Tokenization

Tokenization is all about breaking sentences into individual words.

x = 'thanks#watching this video. please like it'
TextBlob(x).words
WordList(['thanks', 'watching', 'this', 'video', 'please', 'like', 'it'])
doc = nlp(x)
for token in doc:
    print(token)
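
To see what a tokenizer does without loading TextBlob or spaCy, a rough regular-expression sketch is enough. simple_tokenize is a hypothetical helper written for this example. Real tokenizers handle many more cases (URLs, emoticons, abbreviations), so treat this only as an approximation.

```python
import re


def simple_tokenize(text):
    """Split text into runs of letters/digits, keeping an inner apostrophe part such as "don't"."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


tokens = simple_tokenize("thanks#watching this video. please like it")
print(tokens)
```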


Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. It is similar to stemming, but it takes context into account, so it maps related word forms to one base word.

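x = 'runs run running ran'
from textblob import Word
# lemmatize each word with TextBlob
for token in x.split():
    print(Word(token).lemmatize())
# lemmatize the same text with spaCy
doc = nlp(x)
for token in doc:
    print(token.lemma_)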

Detect Entities using NER of SpaCy

Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:

  • Scanning news articles for the people, organizations and locations reported.
  • Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
  • Quickly retrieving geographical locations talked about in Twitter posts.
x = "Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon"
doc = nlp(x)
for ent in doc.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
Donald Trump - PERSON - People, including fictional
USA - GPE - Countries, cities, states
from spacy import displacy
displacy.render(doc, style = 'ent')

Breaking News: Donald Trump PERSON , the president of the USA GPE is looking to sign a deal to mine the moon

Detecting Nouns

In this section, we detect the noun phrases (noun chunks) in the sentence.

'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
for noun in doc.noun_chunks:
    print(noun)
Breaking News
Donald Trump
the president
the USA
a deal
the moon

Translation and Language Detection

Language Code:

'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
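tb = TextBlob(x)
# translate() called the Google Translate API in older TextBlob releases; it is deprecated and may be missing from newer ones
tb.translate(to = 'bn')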
TextBlob("ব্রেকিং নিউজ: যুক্তরাষ্ট্রের রাষ্ট্রপতি ডোনাল্ড ট্রাম্প চাঁদটি খনির জন্য একটি চুক্তিতে সই করতে চাইছেন")

Use inbuilt sentiment classifier

The TextBlob library also comes with a NaiveBayesAnalyzer. Naive Bayes is a commonly used machine learning algorithm for text classification.

from textblob.sentiments import NaiveBayesAnalyzer
x = 'we all stands together to fight with corona virus. we will win together'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
Sentiment(classification='pos', p_pos=0.8259779151942094, p_neg=0.17402208480578962)
x = 'we all are sufering from corona'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
Sentiment(classification='pos', p_pos=0.75616044472398, p_neg=0.2438395552760203)

Advanced Text Processing


N-Gram, Bi-Gram etc

An N-gram is a sequence of N words. For example, “KGPtalkie blog” is a 2-gram (a bigram), “Write on KGPtalkie” is a 3-gram (a trigram), and “A KGPtalkie blog post” is a 4-gram. N-grams keep some of the word order that single words lose, which is why they are often used as features or to estimate the probability of a word given the words before it.

x = 'thanks for watching'
tb = TextBlob(x)
tb.ngrams(3)
[WordList(['thanks', 'for', 'watching'])]
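
The sliding-window idea behind n-grams is short enough to write by hand. The sketch below is a plain-Python illustration, not TextBlob's implementation; ngrams is a hypothetical helper that splits on whitespace only.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens as a tuple."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("thanks for watching", 2)
trigrams = ngrams("thanks for watching", 3)
print(bigrams)
print(trigrams)
```

A sentence with fewer than n tokens simply yields an empty list.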

Bag of Words BoW

In this section, we are going to discuss a Natural Language Processing technique of text modeling known as the Bag of Words model. Whenever we apply any algorithm in NLP, it works on numbers. We cannot directly feed our text into that algorithm. Hence, the Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of most frequently used words.

This model can be visualized as a table in which each row is a document, each column is a word, and each cell holds the number of times that word occurs in that document.

x = ['this is first sentence this is', 'this is second', 'this is last']
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,1))
text_counts = cv.fit_transform(x)
array([[1, 2, 0, 0, 1, 2],
       [0, 1, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)
['first', 'is', 'last', 'second', 'sentence', 'this']
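# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
bow = pd.DataFrame(text_counts.toarray(), columns = cv.get_feature_names_out())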
['this is first sentence this is', 'this is second', 'this is last']
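
To make the table concrete, the same counts can be rebuilt with plain Python. This is only a sketch of the idea and not how CountVectorizer is implemented. With its default settings, CountVectorizer also lower-cases the text and ignores one-character tokens, which makes no difference for these three sentences.

```python
docs = ["this is first sentence this is", "this is second", "this is last"]

# Vocabulary: every distinct word, sorted alphabetically (one column per word)
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one count per vocabulary word
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```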

Term Frequency

Term frequency (TF), often used in text mining, NLP, and information retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since every document is different in length, it is possible that a term would appear more often in longer documents than shorter ones. Thus, term frequency is often divided by the total number of terms in the document as a way of normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

['this is first sentence this is', 'this is second', 'this is last']
(3, 6)
tf = bow.copy().astype(float)  # float columns, so the fractions assigned below are stored as-is
for index, row in enumerate(tf.iterrows()):
    for col in row[1].index:
        tf.loc[index, col] = tf.loc[index, col]/sum(row[1].values)
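
The TF formula amounts to dividing each row of counts by that row's total. The sketch below applies it to the count matrix of the three example sentences with plain lists. The variable names are assumptions made for this illustration.

```python
# Bag-of-words counts for the three example sentences
# columns: first, is, last, second, sentence, this
counts = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

# TF(t) = count of t in the document / total number of terms in the document
tf = [[c / sum(row) for c in row] for row in counts]

for row in tf:
    print([round(v, 3) for v in row])
```

Each row of tf sums to 1, which makes documents of different lengths comparable.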

Inverse Document Frequency IDF

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes.

For example, the word the appears in almost all English texts and would thus have a very low IDF score as it carries very little “topic” information. In contrast, if you take the word coffee, while it is common, it’s not used as widely as the word the. Thus, coffee would have a higher IDF score than the.

idf = log((1 + N) / (1 + n)) + 1, which is the form used in sklearn when smooth_idf = True

where N is the total number of rows (documents) and n is the number of rows in which the word is present.

import numpy as np
x_df = pd.DataFrame(x, columns=['words'])
                            words
0  this is first sentence this is
1                  this is second
2                    this is last
N = bow.shape[0]
bb = bow.astype('bool')
cols = bb.columns
Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')
nz = []
for col in cols:
    nz.append(bb[col].sum())  # number of documents that contain this word
nz
[1, 3, 1, 1, 1, 3]
idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)
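
As a cross-check, the smoothed IDF formula can be evaluated with the standard library alone. The sketch below starts from the same count matrix. The names doc_freq and idf_values are assumptions made for this example.

```python
import math

# columns: first, is, last, second, sentence, this
counts = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

N = len(counts)  # number of documents

# n for each word: in how many documents the word appears at least once
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]

# idf = log((1 + N) / (1 + n)) + 1
idf_values = [math.log((1 + N) / (1 + n)) + 1 for n in doc_freq]

print(doc_freq)
print([round(v, 8) for v in idf_values])
```

Words that occur in every document ("is", "this") get the minimum value of 1, while words that occur in a single document get a higher weight.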


TFIDF

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is one of the most important techniques in information retrieval for representing how important a specific word or phrase is to a given document. For example, when we have a bag of words and want to know which words characterise each document, we can use this approach.

The tf-idf value increases in proportion to the number of times a word appears in the document but is often offset by the frequency of the word in the corpus, which helps to adjust with respect to the fact that some words appear more frequently in general.

TF-IDF combines two statistics: term frequency and inverse document frequency. Term frequency is the number of times a given term t appears in a document, relative to the total number of words in that document. Inverse document frequency measures how much information the word provides: it shows how common or rare the word is across all documents.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_df['words'])
x_tfidf.toarray()
array([[0.45688214, 0.5396839 , 0.        , 0.        , 0.45688214,
        0.5396839 ],
       [0.        , 0.45329466, 0.        , 0.76749457, 0.        ,
        0.45329466],
       [0.        , 0.45329466, 0.76749457, 0.        , 0.        ,
        0.45329466]])
tfidf.idf_
array([1.69314718, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.        ])
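
The numbers returned by TfidfVectorizer can be reproduced by hand, which helps show what the defaults do: raw counts are multiplied by the smoothed IDF, and each row is then scaled to unit Euclidean (L2) length. The sketch below is a plain-Python reconstruction under those default settings, not scikit-learn's code.

```python
import math

# columns: first, is, last, second, sentence, this
counts = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

N = len(counts)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(counts[0]))]
idf_values = [math.log((1 + N) / (1 + n)) + 1 for n in doc_freq]

# Step 1: weight each raw count by the IDF of its word
weighted = [[c * w for c, w in zip(row, idf_values)] for row in counts]

# Step 2: divide each row by its L2 norm so every document vector has length 1
tfidf_rows = []
for row in weighted:
    norm = math.sqrt(sum(v * v for v in row))
    tfidf_rows.append([v / norm for v in row])

for row in tfidf_rows:
    print([round(v, 8) for v in row])
```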

Word Embeddings

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.

SpaCy Word2Vec

# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
doc = nlp('thank you! dog cat lion dfasaa')
for token in doc:
    print(token.text, token.has_vector)
thank True
you True
! True
dog True
cat True
lion True
dfasaa False
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
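
By default, similarity() in spaCy is the cosine similarity between the two vectors. The measure itself is easy to sketch with the standard library. The function name cosine and the tiny two-dimensional vectors below are assumptions for illustration. The real word vectors in this article have 300 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([0.0, 0.0], [1.0, 2.0]))  # zero vector, e.g. an out-of-vocabulary token
```

The zero-vector case corresponds to tokens like dfasaa above, for which has_vector is False.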

Machine Learning Models for Text Classification

# displaying the shape of the DataFrame
df.shape
(1600000, 13)
# sampling 2000 rows from each sentiment class

df0 = df[df['sentiment']==0].sample(2000)
df4 = df[df['sentiment']==4].sample(2000)
dfr = pd.concat([df0, df4])  # DataFrame.append() was removed in pandas 2.0
dfr.shape
(4000, 13)
#removing the twitts,sentiment and emails columns

dfr_feat = dfr.drop(labels=['twitts','sentiment','emails'], axis = 1).reset_index(drop=True)

4000 rows × 10 columns

y = dfr['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
text_counts = cv.fit_transform(dfr['twitts'])
(4000, 9750)
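# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
dfr_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())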

2 rows × 9750 columns


ML Algorithms

Importing Libraries for ML algorithms

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler
sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)
lgr = LogisticRegression(random_state=42, max_iter=200)
lgrcv = LogisticRegressionCV(cv = 2, random_state=42, max_iter=1000)
svm = LinearSVC(random_state=42, max_iter=200)
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)
clf = {'SGD': sgd, 'LGR': lgr, 'LGR-CV': lgrcv, 'SVM': svm, 'RFC': rfc}
dict_keys(['SGD', 'LGR', 'LGR-CV', 'SVM', 'RFC'])
#here, we are training our model by defining the function classify.

def classify(X, y):
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)
        print(key, " ---> ", ac)
classify(dfr_bow, y)
SGD  --->  0.62375
LGR  --->  0.65375
LGR-CV  --->  0.6525
SVM  --->  0.6325
RFC  --->  0.6525

Manual Features

# passing all the manual features.

classify(dfr_feat, y)
SGD  --->  0.64125
LGR  --->  0.645
LGR-CV  --->  0.65
SVM  --->  0.6475
RFC  --->  0.5675

Manual + BoW

# passing all the manual features along with bag of words features.

X = dfr_feat.join(dfr_bow)
classify(X, y)
SGD  --->  0.64875
LGR  --->  0.67125
LGR-CV  --->  0.66125
SVM  --->  0.64375
RFC  --->  0.705
# passing only the tfidf features.

from sklearn.feature_extraction.text import TfidfVectorizer
(4000, 13)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dfr['twitts'])
classify(pd.DataFrame(X.toarray()), y)
SGD  --->  0.635
LGR  --->  0.65125
LGR-CV  --->  0.6475
SVM  --->  0.63875
RFC  --->  0.6425
def get_vec(x):
    doc = nlp(x)
    return doc.vector.reshape(1, -1)
dfr['vec'] = dfr['twitts'].apply(lambda x: get_vec(x))
X = np.concatenate(dfr['vec'].to_numpy(), axis = 0)
(4000, 300)
classify(pd.DataFrame(X), y)
SGD  --->  0.5925
LGR  --->  0.70625
LGR-CV  --->  0.69375
SVM  --->  0.70125
RFC  --->  0.66625
def predict_w2v(x):
    for key in clf.keys():
        y_pred = clf[key].predict(get_vec(x))
        print(key, "-->", y_pred)
predict_w2v('hi, thanks for watching this video. please like and subscribe')
SGD --> [0]
LGR --> [4]
LGR-CV --> [0]
SVM --> [4]
RFC --> [0]
predict_w2v('please let me know if you want more video')
SGD --> [0]
LGR --> [0]
LGR-CV --> [0]
SVM --> [0]
RFC --> [0]
predict_w2v('congratulation looking good congrats')
SGD --> [4]
LGR --> [4]
LGR-CV --> [4]
SVM --> [4]
RFC --> [0]
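
One caveat about these predictions: classify() trains the models on features that went through MinMaxScaler, while predict_w2v() passes raw vectors to the fitted models, so training and prediction see data on different scales. A common way to keep them consistent is to fit the scaler on the training rows only and reuse that same fitted scaler for every later input. The stdlib sketch below only illustrates the idea on made-up numbers. MinMax is a hypothetical stand-in for sklearn's MinMaxScaler, not the article's code.

```python
class MinMax:
    """Tiny min-max scaler: learns per-column min and max from the rows given to fit()."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.lo = [min(col) for col in columns]
        self.hi = [max(col) for col in columns]
        return self

    def transform(self, rows):
        # Reuse the min and max learned in fit(); constant columns map to 0.0
        return [
            [(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, self.lo, self.hi)]
            for row in rows
        ]


train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]  # made-up training features
new = [[2.0, 40.0]]                              # made-up unseen sample

scaler = MinMax().fit(train)           # statistics come from the training rows only
scaled_train = scaler.transform(train)
scaled_new = scaler.transform(new)     # same scaling applied at prediction time

print(scaled_train)
print(scaled_new)
```

Values for unseen data can fall outside the 0 to 1 range, as the second column of the new sample shows. That is expected when the scaler is fitted on the training data alone.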


1. In this article, we first cleaned the text by removing URLs, HTML tags and other noise.

2. We then used several text featurization techniques: bag-of-words, tf-idf and word2vec.

3. After featurizing the text, we built machine learning models on top of those features.