NLP: End-to-End Text Processing for Beginners



Everything we express (either verbally or in writing) carries huge amounts of information. The topic we choose, our tone, our selection of words: everything adds some type of information that can be interpreted, and value can be extracted from it. In theory, we can understand and even predict human behavior using that information.

But there is a problem: one person may generate hundreds or thousands of words in a declaration, each sentence with its corresponding complexity. If you want to scale and analyze hundreds, thousands, or millions of people or declarations in a given geography, the situation becomes unmanageable.

Data generated from conversations, declarations, or even tweets is an example of unstructured data. Unstructured data doesn't fit neatly into the traditional row-and-column structure of relational databases and represents the vast majority of data available in the real world. It is messy and hard to manipulate. Nevertheless, thanks to advances in disciplines like machine learning, a big revolution is going on regarding this topic. Nowadays it is no longer about trying to interpret text or speech based on keywords (the old-fashioned mechanical way), but about understanding the meaning behind those words (the cognitive way). This way, it is possible to detect figures of speech like irony or even perform sentiment analysis.

Natural Language Processing (NLP) is a field of artificial intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

Dataset:

https://www.kaggle.com/kazanova/sentiment140/data#


Installing libraries

SpaCy is an open-source software library, published and distributed under the MIT license, developed for performing simple to advanced Natural Language Processing (NLP) tasks such as tokenization, part-of-speech tagging, named entity recognition, text classification, calculating semantic similarities between texts, lemmatization, and dependency parsing, among others.

# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg
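
As a quick sanity check (a minimal sketch, assuming the en_core_web_sm model downloaded without errors), you can load the pipeline and run it on a short sentence:

import spacy

# load the small English pipeline installed above
nlp = spacy.load('en_core_web_sm')

# run it on a short example sentence and inspect a few annotations
doc = nlp('SpaCy makes basic NLP tasks easy.')
print(nlp.pipe_names)                           # components in the pipeline, e.g. tagger, parser, ner
print([(t.text, t.pos_, t.lemma_) for t in doc])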

In this article, we are going to perform the following tasks.

General Feature Extraction

  • File loading
  • Word counts
  • Character counts
  • Average characters per word
  • Stop words count
  • Count #HashTags and @Mentions
  • Whether numeric digits are present in tweets
  • Uppercase word counts

Preprocessing and Cleaning

  • Lower case
  • Contraction to Expansion
  • Email removal and counts
  • URL removal and counts
  • Removal of RT
  • Removal of Special Characters
  • Removal of multiple spaces
  • Removal of HTML tags
  • Removal of accented characters
  • Removal of Stop Words
  • Conversion into base form of words
  • Commonly occurring words removal
  • Rarely occurring words removal
  • Word Cloud
  • Spelling Correction
  • Tokenization
  • Lemmatization
  • Detecting Entities using NER
  • Noun Detection
  • Language Detection
  • Sentence Translation
  • Using Inbuilt Sentiment Classifier

Advanced Text Processing and Feature Extraction

  • N-Gram, Bi-Gram etc
  • Bag of Words (BoW)
  • Term Frequency Calculation TF
  • Inverse Document Frequency IDF
  • TFIDF Term Frequency - Inverse Document Frequency
  • Word Embedding Word2Vec using SpaCy

Machine Learning Models for Text Classification

  • SGDClassifier
  • LogisticRegression
  • LogisticRegressionCV
  • LinearSVC
  • RandomForestClassifier

Importing libraries

import pandas as pd
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
df = pd.read_csv('twitter16m.csv', encoding = 'latin1', header = None)
df.head()
   | 0 | 1 | 2 | 3 | 4 | 5
0  | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ...
2  | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man...
3  | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire
4  | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all....
df = df[[5, 0]]
df.columns = ['twitts', 'sentiment']
df.head()
   | twitts | sentiment
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0
1  | is upset that he can't update his Facebook by ... | 0
2  | @Kenichan I dived many times for the ball. Man... | 0
3  | my whole body feels itchy and like its on fire | 0
4  | @nationwideclass no, it's not behaving at all.... | 0
df['sentiment'].value_counts()
4    800000
0    800000
Name: sentiment, dtype: int64
sent_map = {0: 'negative', 4: 'positive'}

Word Counts

In this step, we split each tweet into words with the split() function, which converts the sentence into a list of tokens, and then use len() to count the number of words.

df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))
df.head()
   | twitts | sentiment | word_counts
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19
1  | is upset that he can't update his Facebook by ... | 0 | 21
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18
3  | my whole body feels itchy and like its on fire | 0 | 10
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21

Characters Count

In this step, we use the len() function to count the number of characters in each tweet.

df['char_counts'] = df['twitts'].apply(lambda x: len(x))
df.head()
   | twitts | sentiment | word_counts | char_counts
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111

Average Word Length

In this step, we create a function get_avg_word_len() that calculates the average word length for each tweet.

def get_avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len = word_len + len(word)
    return word_len/len(words)  # note: not the same as len(x)/len(words), since len(x) counts every character, including spaces
df['avg_word_len'] = df['twitts'].apply(lambda x: get_avg_word_len(x))
len('this is nlp lesson')/4
4.5
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714
115/19
6.052631578947368

Stop Words Count

In this section, we count the number of stop words in each tweet, using spaCy's STOP_WORDS set.

print(STOP_WORDS)
{'one', 'up', 'further', 'herself', 'nevertheless', 'their', 'when', 'a', 'bottom', 'both', 'also', 'i', 'sometime', 'ours', "'d", 'him', 'together', 'former', 'hereafter', 'whereby', "'ll", 'three', 'same', 'is', 'say', 'hers', 'must', 'five', 'you', 'across', 'n‘t', 'mostly', 'into', 'am', 'myself', 'something', 'could', 'being', 'seems', 'go', 'only', 'fifteen', 'either', 'us', 'than', 'latter', 'so', 'after', 'name', 'there', 'that', 'next', 'even', 'without', 'along', 'behind', 'very', 'whereas', 'off', 'herein', 'although', 'such', 'themselves', 'then', 'in', 'under', 'of', 'onto', 'really', 'due', 'otherwise', 'give', 'yourself', 'indeed', 'my', 'mine', 'show', 'via', 'elsewhere', 'be', 'just', 'thence', 'them', 'beside', 'though', 'as', 'out', 'third', 'however', 'twelve', 'except', '‘d', 'anything', 'move', 'side', 'everything', 'all', 'towards', 'whatever', 'will', 'n’t', 'toward', 'keep', 'hereupon', 'might', 'no', 'own', 'itself', 'for', 'can', 'rather', 'whether', 'while', 'and', 'part', 'over', 'else', 'has', 'forty', 'about', 'hereby', 'sixty', 'using', 'here', 'please', 'often', '’re', 'any', 'ca', 'per', 'whole', 'it', 'are', 'from', 'had', 'thru', '’m', 'two', 'fifty', 'your', 'latterly', 'again', 'or', 'few', 'against', 'much', 'somewhere', 'but', '’d', 'somehow', 'never', 'becoming', 'down', 'regarding', 'always', 'other', 'amount', 'because', 'noone', 'anyone', 'six', 'each', 'thus', 'alone', 'why', 'his', 'sometimes', 'now', 'since', 'become', 'see', 'she', 'where', 'whereafter', 'various', 'perhaps', 'another', 'who', 'anyhow', 'yourselves', 'someone', 'ten', 'became', 'nothing', 'front', 'an', 'anyway', 'get', 'thereafter', "'re", 'our', 'call', 'therein', 'have', 'this', 'above', 'some', 'namely', '‘re', 'seem', 'until', '’ll', 'more', 'still', "n't", 'the', 'does', 'himself', 'take', 'he', 'which', 'seeming', 'been', 'beforehand', 'may', 'do', 'well', 'ever', 'used', 'enough', 'every', 'top', 'made', "'m", 'hundred', 'almost', 'her', 'moreover', 'wherever', '’s', 'amongst', 'meanwhile', 'nobody', 'ourselves', 'whenever', 'at', 'wherein', 'nowhere', 'around', 'between', 'last', 'others', 'becomes', 'they', 'full', 'below', 'nor', 'before', 'what', 'within', 'these', 'besides', 'whereupon', 'how', 'throughout', 'eight', "'s", 'on', 'most', 'if', '‘ve', 'should', 'four', 'serious', 'thereby', '‘ll', 'whence', 'done', 'anywhere', 'yours', 'formerly', 'everyone', 'whose', 'back', 'make', 'among', 'first', 'we', '‘s', 'neither', 'doing', 'already', 'those', 'empty', 'did', 'not', '‘m', 'less', 'to', 'during', 'twenty', 'too', 'put', 'nine', 'yet', 'everywhere', 'quite', 'were', 'seemed', '’ve', 'through', 'once', 'whither', 'thereupon', 'whoever', "'ve", 'therefore', 'me', 'unless', 'whom', 'cannot', 'afterwards', 'none', 'least', 'hence', 'eleven', 'with', 'upon', 'was', 'would', 'by', 'beyond', 'several', 'its', 'many', 're'}
x = 'this is text data'
x.split()
['this', 'is', 'text', 'data']
len([t for t in x.split() if t in STOP_WORDS])
2
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in STOP_WORDS]))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10

Count #HashTags and @Mentions

In this section, we count the number of words starting with # (hashtags) and with @ (mentions).

x = 'this #hashtag and this is @mention'
# x = x.split()
# x
[t for t in x.split() if t.startswith('@')]
    
['@mention']
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1

Whether numeric digits are present in tweets

In this section, we count the number of purely numeric tokens (digits) in each tweet.

df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0

Uppercase word counts

In this section, we count the number of fully uppercase words in each tweet.

df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper() and len(x)>3]))  # note: len(x) > 3 checks the tweet length; use len(t) > 3 to count only uppercase words longer than three characters
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts
0  | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1
1  | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0
2  | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0
4  | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1
df.loc[96]['twitts']
"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

Preprocessing and Cleaning

In this section, we convert all the tweets to lowercase.

Lower case conversion

df['twitts'] = df['twitts'].apply(lambda x: x.lower())
df.head(2)
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts
0  | @switchfoot http://twitpic.com/2y1zl - awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1
1  | is upset that he can't update his facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0

Contraction to Expansion

In this section, we expand contractions into their full forms, based on the mappings defined in the contractions dictionary, using the function cont_to_exp().

x = "i don't know what you want, can't, he'll, i'd"
contractions = { 
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}
def cont_to_exp(x):
    # replace every contraction found in the dictionary with its expanded form
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
x = "hi, i'd be happy"
cont_to_exp(x)
'hi, i would be happy'
%%time
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
Wall time: 52.7 s
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts
0  | @switchfoot http://twitpic.com/2y1zl - awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1
1  | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0
2  | @kenichan i dived many times for the ball. man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0
4  | @nationwideclass no, it is not behaving at all... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1

Count and Remove Emails

In this section, we count and then remove email addresses using a regular expression.

import re
x = 'hi my email me at email@email.com another@email.com'
re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)
['email@email.com', 'another@email.com']
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))
df['emails_count'] = df['emails'].apply(lambda x: len(x))
df[df['emails_count']>0].head()
      | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count
4054  | i want a new laptop. hp tx2000 is the bomb. :... | 0 | 20 | 103 | 4.150000 | 6 | 0 | 0 | 0 | 4 | [gabbehhramos@yahoo.com] | 1
7917  | who stole elledell@gmail.com? | 0 | 3 | 31 | 9.000000 | 1 | 0 | 0 | 0 | 0 | [elledell@gmail.com] | 1
8496  | @alexistehpom really? did you send out all th... | 0 | 20 | 130 | 5.500000 | 11 | 0 | 1 | 0 | 0 | [missataari@gmail.com] | 1
10290 | @laureystack awh...that is kinda sad lol add ... | 0 | 8 | 76 | 8.500000 | 0 | 0 | 1 | 0 | 0 | [hello.kitty.65@hotmail.com] | 1
16413 | @jilliancyork got 2 bottom of it, human error... | 0 | 21 | 137 | 5.428571 | 7 | 0 | 1 | 1 | 0 | [press@linkedin.com] | 1
re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)
'hi my email me at  '
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))
df[df['emails_count']>0].head()
      | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count
4054  | i want a new laptop. hp tx2000 is the bomb. :... | 0 | 20 | 103 | 4.150000 | 6 | 0 | 0 | 0 | 4 | [gabbehhramos@yahoo.com] | 1
7917  | who stole ? | 0 | 3 | 31 | 9.000000 | 1 | 0 | 0 | 0 | 0 | [elledell@gmail.com] | 1
8496  | @alexistehpom really? did you send out all th... | 0 | 20 | 130 | 5.500000 | 11 | 0 | 1 | 0 | 0 | [missataari@gmail.com] | 1
10290 | @laureystack awh...that is kinda sad lol add ... | 0 | 8 | 76 | 8.500000 | 0 | 0 | 1 | 0 | 0 | [hello.kitty.65@hotmail.com] | 1
16413 | @jilliancyork got 2 bottom of it, human error... | 0 | 21 | 137 | 5.428571 | 7 | 0 | 1 | 1 | 0 | [press@linkedin.com] | 1

Count URLs and Remove it

In this section, we count and then remove URLs using regular expressions.

x = 'hi, to watch more visit https://youtube.com/kgptalkie'
re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', 'youtube.com', '/kgptalkie')]
df['urls_flag'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)
'hi, to watch more visit '
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag
0  | @switchfoot - awww, that is a bummer. you sh... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1
1  | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0
2  | @kenichan i dived many times for the ball. man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0
4  | @nationwideclass no, it is not behaving at all... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0
df.loc[0]['twitts']
'@switchfoot  - awww, that is a bummer.  you shoulda got david carr of third day to do it. ;d'

Remove RT

In this section, we remove the retweet marker. Since the tweets were lowercased earlier, we match 'rt' rather than the uppercase 'RT'.

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', ' ', x))

Special character or punctuation removal

In this section, we remove special characters and punctuation.

df['twitts'] = df['twitts'].apply(lambda x: re.sub('[^A-Z a-z 0-9-]+', '', x))  # keep only letters, digits, spaces, and hyphens
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag
0  | switchfoot - awww that is a bummer you shoul... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1
1  | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0
2  | kenichan i dived many times for the ball manag... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0
3  | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0
4  | nationwideclass no it is not behaving at all i... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0
 

Remove multiple spaces

In this section, we collapse multiple consecutive spaces into a single space.

x = 'thanks    for    watching and    please    like this video'
" ".join(x.split())
'thanks for watching and please like this video'
df['twitts'] = df['twitts'].apply(lambda x: " ".join(x.split()))
df.head(2)
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag
0  | switchfoot - awww that is a bummer you shoulda... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1
1  | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0

Remove HTML tags

In this section, we are removing the HTML tags.

from bs4 import BeautifulSoup
x = '<html><h2>Thanks for watching</h2></html>'
BeautifulSoup(x, 'lxml').get_text()
'Thanks for watching'
%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
Wall time: 11min 37s

Remove Accented Chars

In this section, we are removing the accented characters.

import unicodedata
x = 'Áccěntěd těxt'
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x
remove_accented_chars(x)
'Accented text'
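The function is defined here but not applied to the dataframe in this notebook; if you also wanted to strip accents from the tweets, a one-line apply (a sketch, assuming df['twitts'] still holds the cleaned text) would do it:

df['twitts'] = df['twitts'].apply(lambda x: remove_accented_chars(x))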

SpaCy and NLP

Remove Stop Words

In this section, we remove stop words from the tweets.

import spacy
x = 'this is stop words removal code is a the an how what'
" ".join([t for t in x.split() if t not in STOP_WORDS])
'stop words removal code'
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag
0  | switchfoot - awww bummer shoulda got david car... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1
1  | upset update facebook texting cry result schoo... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0
2  | kenichan dived times ball managed save 50 rest... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0
3  | body feels itchy like fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0
4  | nationwideclass behaving mad | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0

Convert into base or root form of word

In this section, we convert words to their base (root) form using spaCy lemmatization.

nlp = spacy.load('en_core_web_sm')
x = 'kenichan dived times ball managed save 50 rest'
# dive = dived, time = times, manage = managed
# x = 'i you he she they is am are'
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    
    for token in doc:
        lemma = str(token.lemma_)
        # older spaCy versions return '-PRON-' for pronouns; keep the original text in that case
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))
        
make_to_base(x)
kenichan dive time ball manage save 50 rest

Common words removal

In this section, we remove the 20 most frequently occurring words from the text corpus.

' '.join(df.head()['twitts'])
'switchfoot - awww bummer shoulda got david carr day d upset update facebook texting cry result school today blah kenichan dived times ball managed save 50 rest bounds body feels itchy like fire nationwideclass behaving mad'
text = ' '.join(df['twitts'])
text = text.split()
freq_comm = pd.Series(text).value_counts()
f20 = freq_comm[:20]
f20
good      89366
day       82299
like      77735
-         69662
today     64512
going     64078
love      63421
work      62804
got       60749
time      56081
lol       55094
know      51172
im        50147
want      42070
new       41995
think     41040
night     41029
amp       40616
thanks    39311
home      39168
dtype: int64
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in f20]))

Rare words removal

In this section, we remove the 20 rarest words (here, words that occur only once) from the text corpus.

rare20 = freq_comm[-20:]
rare20
veru              1
80-90f            1
refrigerant       1
demaisss          1
knittingsci-fi    1
wendireed         1
danielletuazon    1
chacha8           1
a-zquot           1
krustythecat      1
westmount         1
-appreciate       1
motocycle         1
madamhow          1
felspoon          1
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
dtype: int64
rare = freq_comm[freq_comm.values == 1]
rare
mamat             1
fiive             1
music-festival    1
leenahyena        1
11517             1
                 ..
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
Length: 536196, dtype: int64
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
df.head()
   | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag
0  | switchfoot awww bummer shoulda david carr d | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1
1  | upset update facebook texting cry result schoo... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0
2  | kenichan dived times ball managed save 50 rest... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0
3  | body feels itchy fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0
4  | nationwideclass behaving mad | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0

Word Cloud Visualization

In this section, we visualize the text corpus using the WordCloud library.

# !pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
x = ' '.join(text[:20000])
len(text)
10837079
wc = WordCloud(width = 800, height=400).generate(x)
plt.imshow(wc)
plt.axis('off')
plt.show()

Spelling Correction

In this section, we correct the spelling of words using TextBlob.

# !pip install -U textblob
# !python -m textblob.download_corpora
from textblob import TextBlob
x = 'tanks forr waching this vidio carri'
x = TextBlob(x).correct()
x
TextBlob("tanks for watching this video carry")

Tokenization

Tokenization is the process of breaking sentences into individual tokens (words).

x = 'thanks#watching this video. please like it'
TextBlob(x).words
WordList(['thanks', 'watching', 'this', 'video', 'please', 'like', 'it'])
doc = nlp(x)
for token in doc:
    print(token)
thanks#watching
this
video
.
please
like
it

Lemmatization

Lemmatization is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word.

x = 'runs run running ran'
from textblob import Word
for token in x.split():
    print(Word(token).lemmatize())
run
run
running
ran
doc = nlp(x)
for token in doc:
    print(token.lemma_)
run
run
run
run

Detect Entities using NER of SpaCy

Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:

  • Scanning news articles for the people, organizations and locations reported.
  • Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
  • Quickly retrieving geographical locations talked about in Twitter posts.
x = "Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon"
doc = nlp(x)
for ent in doc.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
Donald Trump - PERSON - People, including fictional
USA - GPE - Countries, cities, states
from spacy import displacy
displacy.render(doc, style = 'ent')

Breaking News: Donald Trump PERSON , the president of the USA GPE is looking to sign a deal to mine the moon

Detecting Nouns

In this section, we detect noun chunks (noun phrases) using spaCy.

x
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
for noun in doc.noun_chunks:
    print(noun)
Breaking News
Donald Trump
the president
the USA
a deal
the moon

Translation and Language Detection

Language Code: https://www.loc.gov/standards/iso639-2/php/code_list.php

x
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
tb = TextBlob(x)
tb.detect_language()
'en'
tb.translate(to='bn')
TextBlob("ব্রেকিং নিউজ: যুক্তরাষ্ট্রের রাষ্ট্রপতি ডোনাল্ড ট্রাম্প চাঁদটি খনির জন্য একটি চুক্তিতে সই করতে চাইছেন")

Use inbuilt sentiment classifier

The TextBlob library also comes with a NaiveBayesAnalyzer; Naive Bayes is a commonly used machine-learning text-classification algorithm.

from textblob.sentiments import NaiveBayesAnalyzer
x = 'we all stands together to fight with corona virus. we will win together'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
Sentiment(classification='pos', p_pos=0.8259779151942094, p_neg=0.17402208480578962)
x = 'we all are sufering from corona'
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
tb.sentiment
Sentiment(classification='pos', p_pos=0.75616044472398, p_neg=0.2438395552760203)

Advanced Text Processing

N-Grams

An N-gram is a sequence of N words. For example, “KGPtalkie blog” is a 2-gram (a bigram), “Write on KGPtalkie” is a 3-gram (a trigram), and “A KGPtalkie blog post” is a 4-gram. N-gram language models go further and attach probabilities to these sequences; here we simply use n-grams as features.

x = 'thanks for watching'
tb = TextBlob(x)
tb.ngrams(3)
[WordList(['thanks', 'for', 'watching'])]
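To use n-grams as features rather than just listing them, scikit-learn's CountVectorizer accepts an ngram_range argument; here is a small sketch extracting bigrams from a made-up sentence (the text below is only an illustration):

from sklearn.feature_extraction.text import CountVectorizer

sample = ['thanks for watching this video']
cv2 = CountVectorizer(ngram_range=(2, 2))     # bigrams only
cv2.fit_transform(sample)
print(cv2.get_feature_names())
# ['for watching', 'thanks for', 'this video', 'watching this']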

Bag of Words BoW

In this section, we discuss a Natural Language Processing technique of text modeling known as the Bag of Words model. Whenever we apply any algorithm in NLP, it works on numbers; we cannot feed raw text into the algorithm directly. Hence, the Bag of Words model is used to preprocess the text by converting it into a bag of words, which keeps a count of the total occurrences of each word.

This model can be visualized using a table, which contains the count of words corresponding to the word itself.

x = ['this is first sentence this is', 'this is second', 'this is last']
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(1,1))
text_counts = cv.fit_transform(x)
text_counts.toarray()
array([[1, 2, 0, 0, 1, 2],
       [0, 1, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)
cv.get_feature_names()
['first', 'is', 'last', 'second', 'sentence', 'this']
bow = pd.DataFrame(text_counts.toarray(), columns = cv.get_feature_names())
bow
   | first | is | last | second | sentence | this
0  | 1 | 2 | 0 | 0 | 1 | 2
1  | 0 | 1 | 0 | 1 | 0 | 1
2  | 0 | 1 | 1 | 0 | 0 | 1
x
['this is first sentence this is', 'this is second', 'this is last']

Term Frequency

Term frequency (TF), often used in text mining, NLP, and information retrieval, tells you how frequently a term occurs in a document. In the context of natural language, terms correspond to words or phrases. Since every document is different in length, it is possible that a term would appear more often in longer documents than shorter ones. Thus, term frequency is often divided by the total number of terms in the document as a way of normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).

x
['this is first sentence this is', 'this is second', 'this is last']
bow
   | first | is | last | second | sentence | this
0  | 1 | 2 | 0 | 0 | 1 | 2
1  | 0 | 1 | 0 | 1 | 0 | 1
2  | 0 | 1 | 1 | 0 | 0 | 1
bow.shape
(3, 6)
tf = bow.copy()
for index, row in enumerate(tf.iterrows()):
    for col in row[1].index:
        tf.loc[index, col] = tf.loc[index, col]/sum(row[1].values)
tf
   | first | is | last | second | sentence | this
0  | 0.166667 | 0.333333 | 0.000000 | 0.000000 | 0.166667 | 0.333333
1  | 0.000000 | 0.333333 | 0.000000 | 0.333333 | 0.000000 | 0.333333
2  | 0.000000 | 0.333333 | 0.333333 | 0.000000 | 0.000000 | 0.333333

Inverse Document Frequency IDF

Inverse Document Frequency (IDF) is a weight indicating how commonly a word is used. The more frequent its usage across documents, the lower its score. The lower the score, the less important the word becomes.

For example, the word the appears in almost all English texts and would thus have a very low IDF score as it carries very little “topic” information. In contrast, if you take the word coffee, while it is common, it’s not used as widely as the word the. Thus, coffee would have a higher IDF score than the.

idf(t) = log((1 + N) / (1 + n)) + 1, which is the formula used in sklearn when smooth_idf=True,

where N is the total number of documents (rows) and n is the number of documents in which the word appears.

import numpy as np
x_df = pd.DataFrame(x, columns=['words'])
x_df
   | words
0  | this is first sentence this is
1  | this is second
2  | this is last
bow
   | first | is | last | second | sentence | this
0  | 1 | 2 | 0 | 0 | 1 | 2
1  | 0 | 1 | 0 | 1 | 0 | 1
2  | 0 | 1 | 1 | 0 | 0 | 1
N = bow.shape[0]
N
3
bb = bow.astype('bool')
bb
   | first | is | last | second | sentence | this
0  | True | True | False | False | True | True
1  | False | True | False | True | False | True
2  | False | True | True | False | False | True
bb['is'].sum()
3
cols = bb.columns
cols
Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')
nz = []
for col in cols:
    nz.append(bb[col].sum())
nz
[1, 3, 1, 1, 1, 3]
idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)
idf
[1.6931471805599454,
 1.0,
 1.6931471805599454,
 1.6931471805599454,
 1.6931471805599454,
 1.0]
bow
   | first | is | last | second | sentence | this
0  | 1 | 2 | 0 | 0 | 1 | 2
1  | 0 | 1 | 0 | 1 | 0 | 1
2  | 0 | 1 | 1 | 0 | 0 | 1

TFIDF

TF-IDF stands for Term Frequency – Inverse Document Frequency. It is one of the most important techniques used in information retrieval to represent how important a specific word or phrase is to a given document. For example, if we have a string or Bag of Words (BoW) and we need to extract information from it, we can use this approach.

The TF-IDF value increases in proportion to the number of times a word appears in the document, but it is offset by the frequency of the word in the corpus, which adjusts for the fact that some words appear more frequently in general.

TF-IDF combines two statistical measures: term frequency and inverse document frequency. Term frequency is the number of times a given term t appears in a document relative to the total number of words in that document, while inverse document frequency measures how much information the word provides; it shows how common or rare a given word is across all documents.
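
Before calling sklearn, it is worth connecting the pieces computed by hand above: multiplying the raw counts in bow by the idf weights and then L2-normalising each row should reproduce the TfidfVectorizer output shown below (a sketch that reuses the bow dataframe and the idf list from the previous sections):

# element-wise: each count column is scaled by its idf weight
tfidf_manual = bow * idf

# L2-normalise every row, as TfidfVectorizer does by default (norm='l2')
tfidf_manual = tfidf_manual.div(np.sqrt((tfidf_manual ** 2).sum(axis=1)), axis=0)
tfidf_manual.round(4)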

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_df['words'])
x_tfidf.toarray()
array([[0.45688214, 0.5396839 , 0.        , 0.        , 0.45688214,
        0.5396839 ],
       [0.        , 0.45329466, 0.        , 0.76749457, 0.        ,
        0.45329466],
       [0.        , 0.45329466, 0.76749457, 0.        , 0.        ,
        0.45329466]])
tfidf.idf_
array([1.69314718, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.        ])
idf
[1.6931471805599454,
 1.0,
 1.6931471805599454,
 1.6931471805599454,
 1.6931471805599454,
 1.0]

Word Embeddings

Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.

SpaCy Word2Vec

# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
doc = nlp('thank you! dog cat lion dfasaa')
for token in doc:
    print(token.text, token.has_vector)
thank True
you True
! True
dog True
cat True
lion True
dfasaa False
token.vector.shape
(300,)
nlp('cat').vector.shape
(300,)
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
    print()
thank thank 1.0
thank you 0.5647585
thank ! 0.52147406
thank dog 0.2504265
thank cat 0.20648485
thank lion 0.13629764
thank dfasaa 0.0

you thank 0.5647585
you you 1.0
you ! 0.4390223
you dog 0.36494097
you cat 0.3080798
you lion 0.20392051
you dfasaa 0.0

! thank 0.52147406
! you 0.4390223
! ! 1.0
! dog 0.29852203
! cat 0.29702348
! lion 0.19601382
! dfasaa 0.0

dog thank 0.2504265
dog you 0.36494097
dog ! 0.29852203
dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dfasaa 0.0

cat thank 0.20648485
cat you 0.3080798
cat ! 0.29702348
cat dog 0.80168545
cat cat 1.0
cat lion 0.52654374
cat dfasaa 0.0

lion thank 0.13629764
lion you 0.20392051
lion ! 0.19601382
lion dog 0.47424486
lion cat 0.52654374
lion lion 1.0
lion dfasaa 0.0

dfasaa thank 0.0
dfasaa you 0.0
dfasaa ! 0.0
dfasaa dog 0.0
dfasaa cat 0.0
dfasaa lion 0.0
dfasaa dfasaa 1.0

(every comparison involving the out-of-vocabulary token dfasaa also triggers UserWarning [W008]: "Evaluating Token.similarity based on empty vectors")
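
Individual token vectors can also be averaged into a single document vector (doc.vector), which is how tweets will be turned into 300-dimensional features later on. A short sketch comparing two phrases at the document level (assuming the en_core_web_lg pipeline is still loaded as nlp):

doc1 = nlp('i love dogs and cats')
doc2 = nlp('pets make people happy')

print(doc1.vector.shape)       # (300,) - the average of the token vectors
print(doc1.similarity(doc2))   # cosine similarity between the two document vectors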

 

Machine Learning Models for Text Classification

BoW

# displaying the shape of the dataframe

df.shape
(1600000, 13)
# sampling 2,000 tweets from each sentiment class

df0 = df[df['sentiment']==0].sample(2000)
df4 = df[df['sentiment']==4].sample(2000)
dfr = df0.append(df4)
dfr.shape
(4000, 13)
# dropping the twitts, sentiment, and emails columns, keeping only the manual features

dfr_feat = dfr.drop(labels=['twitts','sentiment','emails'], axis = 1).reset_index(drop=True)
dfr_feat
     | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails_count | urls_flag
0    | 15 | 81 | 4.400000 | 6 | 0 | 0 | 0 | 0 | 0 | 0
1    | 8 | 47 | 4.875000 | 4 | 0 | 1 | 0 | 0 | 0 | 0
2    | 15 | 69 | 3.600000 | 6 | 0 | 1 | 0 | 0 | 0 | 0
3    | 9 | 42 | 3.666667 | 4 | 0 | 0 | 0 | 2 | 0 | 0
4    | 14 | 77 | 4.500000 | 5 | 0 | 0 | 0 | 0 | 0 | 0
...  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
3995 | 3 | 33 | 9.666667 | 1 | 0 | 1 | 0 | 0 | 0 | 0
3996 | 16 | 78 | 3.875000 | 4 | 0 | 1 | 0 | 2 | 0 | 0
3997 | 27 | 134 | 3.962963 | 9 | 0 | 1 | 0 | 2 | 0 | 0
3998 | 6 | 44 | 6.333333 | 0 | 0 | 1 | 1 | 1 | 0 | 0
3999 | 5 | 25 | 4.000000 | 3 | 0 | 1 | 0 | 0 | 0 | 0

4000 rows × 10 columns

y = dfr['sentiment']
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
text_counts = cv.fit_transform(dfr['twitts'])
text_counts.toarray().shape
(4000, 9750)
dfr_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names())
dfr_bow.head(2)
(dfr_bow.head(2): both rows are all zeros for the displayed columns; the 9,750 vocabulary columns run from numeric tokens such as 00 through words like zomg, zonked, zoo, zooey, zrovna, zshare, zsk, and zwel to strings of repeated z's)

2 rows × 9750 columns

 

ML Algorithms

Importing Libraries for ML algorithms

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier


from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler
sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)
lgr = LogisticRegression(random_state=42, max_iter=200)
lgrcv = LogisticRegressionCV(cv = 2, random_state=42, max_iter=1000)
svm = LinearSVC(random_state=42, max_iter=200)
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)
clf = {'SGD': sgd, 'LGR': lgr, 'LGR-CV': lgrcv, 'SVM': svm, 'RFC': rfc}
clf.keys()
dict_keys(['SGD', 'LGR', 'LGR-CV', 'SVM', 'RFC'])
# here we train our models by defining the function classify(), which scales the features, splits the data, and fits each classifier

def classify(X, y):
    # scale features to [0, 1] so that all models receive comparable inputs
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    
    # hold out 20% of the data for testing, stratified by sentiment
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
    
    # fit each classifier and report its test accuracy
    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)
        print(key, " ---> ", ac)
 
%%time
classify(dfr_bow, y)
SGD  --->  0.62375
LGR  --->  0.65375
LGR-CV  --->  0.6525
SVM  --->  0.6325
RFC  --->  0.6525
Wall time: 1min 42s
Manual Features
#passing all the manual features.

dfr_feat.head(2)
       | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails_count | urls_flag
453843 | 15 | 81 | 4.400 | 6 | 0 | 0 | 0 | 0 | 0 | 0
388280 | 8 | 47 | 4.875 | 4 | 0 | 1 | 0 | 0 | 0 | 0
%%time
classify(dfr_feat, y)
SGD  --->  0.64125
LGR  --->  0.645
LGR-CV  --->  0.65
SVM  --->  0.6475
RFC  --->  0.5675
Wall time: 1.35 s
Manual + BoW
#passing all the manual features along with bag of words features.

X = dfr_feat.join(dfr_bow)
%%time
classify(X, y)
SGD  --->  0.64875
LGR  --->  0.67125
LGR-CV  --->  0.66125
SVM  --->  0.64375
RFC  --->  0.705
Wall time: 1min 18s

TFIDF

#passing the tfidf features.

from sklearn.feature_extraction.text import TfidfVectorizer
dfr.shape
(4000, 13)
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dfr['twitts'])
%%time
classify(pd.DataFrame(X.toarray()), y)
SGD  --->  0.635
LGR  --->  0.65125
LGR-CV  --->  0.6475
SVM  --->  0.63875
RFC  --->  0.6425
Wall time: 1min 37s

Word2Vec

def get_vec(x):
    # doc.vector is the average of the token vectors: a single 300-dimensional representation of the tweet
    doc = nlp(x)
    return doc.vector.reshape(1, -1)
%%time
dfr['vec'] = dfr['twitts'].apply(lambda x: get_vec(x))
Wall time: 51.8 s
X = np.concatenate(dfr['vec'].to_numpy(), axis = 0)
X.shape
(4000, 300)
classify(pd.DataFrame(X), y)
SGD  --->  0.5925
LGR  --->  0.70625
LGR-CV  --->  0.69375
C:\Users\Laxmi\AppData\Roaming\Python\Python37\site-packages\sklearn\svm\_base.py:947: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
SVM  --->  0.70125
RFC  --->  0.66625
def predict_w2v(x):
    for key in clf.keys():
        y_pred = clf[key].predict(get_vec(x))
        print(key, "-->", y_pred)
predict_w2v('hi, thanks for watching this video. please like and subscribe')
SGD --> [0]
LGR --> [4]
LGR-CV --> [0]
SVM --> [4]
RFC --> [0]
predict_w2v('please let me know if you want more video')
SGD --> [0]
LGR --> [0]
LGR-CV --> [0]
SVM --> [0]
RFC --> [0]
predict_w2v('congratulation looking good congrats')
SGD --> [4]
LGR --> [4]
LGR-CV --> [4]
SVM --> [4]
RFC --> [0]

Summary

1. In this article, we first cleaned the text, for example by removing URLs and various tags.

2. We then used several text featurization techniques: bag-of-words, TF-IDF, and word2vec.

3. Finally, we built machine learning models on top of those features.