NLP: End to End Text Processing for Beginners

Complete Text Processing for Beginners

Natural Language Processing (NLP) enables computers to understand, interpret, and manipulate human language. While raw text from tweets, documents, and transcripts is highly unstructured and messy, modern text processing pipelines can extract clean semantic representations to predict sentiment or classify topics.

In this tutorial, you will build an end-to-end text processing pipeline in Python. You will clean and preprocess the Sentiment140 Twitter dataset, extract features using Bag of Words, TF-IDF, and Word2Vec, and train multiple machine learning classifiers to predict sentiment.

Prerequisites: Python 3.x, NumPy, pandas, spaCy, scikit-learn, TextBlob, BeautifulSoup (bs4) with lxml, wordcloud, and matplotlib.

Dataset used in this tutorial: Sentiment140 Dataset on Kaggle

The diagram below illustrates the general workflow of an end-to-end NLP pipeline:

Cognitive NLP vs keyword matching visual

Installing Libraries

You can install spaCy and its associated English language models using pip:

PYTHON

# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

Importing Libraries

We begin by importing the basic data manipulation libraries and the spaCy stop word list:

PYTHON

import pandas as pd
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

Next, load the Sentiment140 CSV dataset with Latin-1 encoding:

PYTHON

df = pd.read_csv('twitter16m.csv', encoding = 'latin1', header = None)

Display the first few rows of the loaded DataFrame:

PYTHON

df.head()

OUTPUT

	1	2	3	4	5
0	1467810369	Mon Apr 06 22:19:45 PDT 2009	NO_QUERY	_TheSpecialOne_	@switchfoot http://twitpic.com/2y1zl - Awww, t...
1	1467810672	Mon Apr 06 22:19:49 PDT 2009	NO_QUERY	scotthamilton	is upset that he can't update his Facebook by ...
2	1467810917	Mon Apr 06 22:19:53 PDT 2009	NO_QUERY	mattycus	@Kenichan I dived many times for the ball. Man...
3	1467811184	Mon Apr 06 22:19:57 PDT 2009	NO_QUERY	ElleCTF	my whole body feels itchy and like its on fire
4	1467811193	Mon Apr 06 22:19:57 PDT 2009	NO_QUERY	Karoli	@nationwideclass no, it's not behaving at all....

Keep only the text content (column index 5) and the target sentiment label (column index 0):

PYTHON

df = df[[5, 0]]

Assign descriptive column names to the target DataFrame:

PYTHON

df.columns = ['twitts', 'sentiment']
df.head()

OUTPUT

	twitts	sentiment
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	0
1	is upset that he can't update his Facebook by ...	0
2	@Kenichan I dived many times for the ball. Man...	0
3	my whole body feels itchy and like its on fire	0
4	@nationwideclass no, it's not behaving at all....	0

Check the class balance for positive and negative sentiment labels:

PYTHON

df['sentiment'].value_counts()

OUTPUT

4    800000
0    800000
Name: sentiment, dtype: int64

Create a lookup dictionary mapping sentiment label integers to descriptive string categories:

PYTHON

sent_map = {0: 'negative', 4: 'positive'}

Word Counts

We can calculate word counts by splitting each tweet sentence on whitespace and finding the length of the resulting word list:

PYTHON

df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

Display the DataFrame with the new word counts column:

PYTHON

df.head()

OUTPUT

	twitts	word_counts
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19
1	is upset that he can't update his Facebook by ...	21
2	@Kenichan I dived many times for the ball. Man...	18
3	my whole body feels itchy and like its on fire	10
4	@nationwideclass no, it's not behaving at all....	21

Characters Count

Next, we count the total number of characters in each tweet string:

PYTHON

df['char_counts'] = df['twitts'].apply(lambda x: len(x))

Display the DataFrame with the character counts included:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115
1	is upset that he can't update his Facebook by ...	21	111
2	@Kenichan I dived many times for the ball. Man...	18	89
3	my whole body feels itchy and like its on fire	10	47
4	@nationwideclass no, it's not behaving at all....	21	111

Average Word Length

We define a helper function to compute the average character length of words inside each tweet:

PYTHON

def get_avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len = word_len + len(word)
    return word_len/len(words) # != len(x)/len(words)

Apply the function to generate average word lengths:

PYTHON

df['avg_word_len'] = df['twitts'].apply(lambda x: get_avg_word_len(x))

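Note that dividing the raw string length by the word count is not the same calculation, because the string length includes spaces. For a dummy string with 15 letters in 4 words (true average 3.75), the naive ratio gives: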

PYTHON

len('this is nlp lesson')/4

OUTPUT

4.5

Display the head of the updated DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115	5.052632
1	is upset that he can't update his Facebook by ...	21	111	4.285714
2	@Kenichan I dived many times for the ball. Man...	18	89	3.944444
3	my whole body feels itchy and like its on fire	10	47	3.700000
4	@nationwideclass no, it's not behaving at all....	21	111	4.285714

The same discrepancy shows on the first row: dividing its character count by its word count gives about 6.05, not the 5.05 computed from word lengths alone:

PYTHON

115/19

OUTPUT

6.052631578947368
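The helper above raises ZeroDivisionError on a tweet that contains only whitespace. A minimal sketch of a guarded variant follows; the name avg_word_len and the choice to return 0.0 for empty input are assumptions made for illustration, not part of the original pipeline.

```python
def avg_word_len(text):
    """Mean length of whitespace-separated tokens; 0.0 for empty input."""
    words = text.split()
    if not words:  # guard against ZeroDivisionError on blank tweets
        return 0.0
    return sum(len(w) for w in words) / len(words)

sample = 'this is nlp lesson'
print(avg_word_len(sample))               # 3.75 (15 letters over 4 words)
print(len(sample) / len(sample.split()))  # 4.5 (spaces inflate the ratio)
```

The two printed values make the gap between the true average and the naive length ratio explicit.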

Stop Words Count

Check the default list of stop words provided by spaCy:

PYTHON

print(STOP_WORDS)

OUTPUT

{'one', 'up', 'further', 'herself', 'nevertheless', 'their', 'when', 'a', 'bottom', 'both', 'also', 'i', 'sometime', 'ours', "'d", 'him', 'together', 'former', 'hereafter', 'whereby', "'ll", 'three', 'same', 'is', 'say', 'hers', 'must', 'five', 'you', 'across', 'n‘t', 'mostly', 'into', 'am', 'myself', 'something', 'could', 'being', 'seems', 'go', 'only', 'fifteen', 'either', 'us', 'than', 'latter', 'so', 'after', 'name', 'there', 'that', 'next', 'even', 'without', 'along', 'behind', 'very', 'whereas', 'off', 'herein', 'although', 'such', 'themselves', 'then', 'in', 'under', 'of', 'onto', 'really', 'due', 'otherwise', 'give', 'yourself', 'indeed', 'my', 'mine', 'show', 'via', 'elsewhere', 'be', 'just', 'thence', 'them', 'beside', 'though', 'as', 'out', 'third', 'however', 'twelve', 'except', '‘d', 'anything', 'move', 'side', 'everything', 'all', 'towards', 'whatever', 'will', 'n’t', 'toward', 'keep', 'hereupon', 'might', 'no', 'own', 'itself', 'for', 'can', 'rather', 'whether', 'while', 'and', 'part', 'over', 'else', 'has', 'forty', 'about', 'hereby', 'sixty', 'using', 'here', 'please', 'often', '’re', 'any', 'ca', 'per', 'whole', 'it', 'are', 'from', 'had', 'thru', '’m', 'two', 'fifty', 'your', 'latterly', 'again', 'or', 'few', 'against', 'much', 'somewhere', 'but', '’d', 'somehow', 'never', 'becoming', 'down', 'regarding', 'always', 'other', 'amount', 'because', 'noone', 'anyone', 'six', 'each', 'thus', 'alone', 'why', 'his', 'sometimes', 'now', 'since', 'become', 'see', 'she', 'where', 'whereafter', 'various', 'perhaps', 'another', 'who', 'anyhow', 'yourselves', 'someone', 'ten', 'became', 'nothing', 'front', 'an', 'anyway', 'get', 'thereafter', "'re", 'our', 'call', 'therein', 'have', 'this', 'above', 'some', 'namely', '‘re', 'seem', 'until', '’ll', 'more', 'still', "n't", 'the', 'does', 'himself', 'take', 'he', 'which', 'seeming', 'been', 'beforehand', 'may', 'do', 'well', 'ever', 'used', 'enough', 'every', 'top', 'made', "'m", 'hundred', 'almost', 'her', 
'moreover', 'wherever', '’s', 'amongst', 'meanwhile', 'nobody', 'ourselves', 'whenever', 'at', 'wherein', 'nowhere', 'around', 'between', 'last', 'others', 'becomes', 'they', 'full', 'below', 'nor', 'before', 'what', 'within', 'these', 'besides', 'whereupon', 'how', 'throughout', 'eight', "'s", 'on', 'most', 'if', '‘ve', 'should', 'four', 'serious', 'thereby', '‘ll', 'whence', 'done', 'anywhere', 'yours', 'formerly', 'everyone', 'whose', 'back', 'make', 'among', 'first', 'we', '‘s', 'neither', 'doing', 'already', 'those', 'empty', 'did', 'not', '‘m', 'less', 'to', 'during', 'twenty', 'too', 'put', 'nine', 'yet', 'everywhere', 'quite', 'were', 'seemed', '’ve', 'through', 'once', 'whither', 'thereupon', 'whoever', "'ve", 'therefore', 'me', 'unless', 'whom', 'cannot', 'afterwards', 'none', 'least', 'hence', 'eleven', 'with', 'upon', 'was', 'would', 'by', 'beyond', 'several', 'its', 'many', 're'}

Initialize a test string:

PYTHON

x = 'this is text data'

Tokenize the test string:

PYTHON

x.split()

OUTPUT

['this', 'is', 'text', 'data']

Filter out and count stop words in the token list:

PYTHON

len([t for t in x.split() if t in STOP_WORDS])

OUTPUT

2

Compute the number of stop words contained in each tweet:

PYTHON

df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in STOP_WORDS]))

Display the updated DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115	5.052632	4
1	is upset that he can't update his Facebook by ...	21	111	4.285714	9
2	@Kenichan I dived many times for the ball. Man...	18	89	3.944444	7
3	my whole body feels itchy and like its on fire	10	47	3.700000	5
4	@nationwideclass no, it's not behaving at all....	21	111	4.285714	10
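The entries in spaCy's stop word set are lowercase, so tokens such as I are not counted when the check runs on raw, mixed-case tweets. The sketch below illustrates the effect; DEMO_STOP_WORDS is a tiny hypothetical stand-in for STOP_WORDS so the example runs without spaCy, and count_stop_words is an illustrative helper name.

```python
# Tiny hypothetical stand-in for spaCy's STOP_WORDS (assumption for this demo)
DEMO_STOP_WORDS = {'i', 'for', 'the', 'many', 'is', 'that'}

def count_stop_words(text, stop_words=DEMO_STOP_WORDS, lowercase=True):
    """Count tokens found in stop_words, optionally lowercasing them first."""
    tokens = text.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return sum(1 for t in tokens if t in stop_words)

tweet = 'I dived many times for the ball'
print(count_stop_words(tweet, lowercase=False))  # 3: 'I' is missed
print(count_stop_words(tweet))                   # 4
```

Whether to lowercase before counting is a design choice; the counts in this tutorial were produced without it.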

Count #HashTags and @Mentions

Initialize a sample string with a hashtag and a mention:

PYTHON

x = 'this #hashtag and this is @mention'
# x = x.split()
# x

Find all tokens starting with @:

PYTHON

[t for t in x.split() if t.startswith('@')]

OUTPUT

['@mention']

Calculate the occurrences of hashtags (#) and user mentions (@) across all tweets:

PYTHON

df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))

Inspect the head of the updated DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115	5.052632	4	1
1	is upset that he can't update his Facebook by ...	21	111	4.285714	9	0
2	@Kenichan I dived many times for the ball. Man...	18	89	3.944444	7	1
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0
4	@nationwideclass no, it's not behaving at all....	21	111	4.285714	10	1
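Splitting on whitespace misses hashtags and mentions that are glued to other characters, such as thanks#watching. One possible alternative, sketched here with regular expressions, is shown below; the patterns and the helper name count_tags are illustrative assumptions rather than the method used for the counts above.

```python
import re

HASHTAG_RE = re.compile(r'#\w+')
# The lookbehind skips '@' inside email-like strings such as me@example.com
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def count_tags(text):
    """Return the number of hashtags and mentions found anywhere in text."""
    return {'hashtags': len(HASHTAG_RE.findall(text)),
            'mentions': len(MENTION_RE.findall(text))}

print(count_tags('this #hashtag and this is @mention'))
print(count_tags('thanks#watching, cc:@user mail me@example.com'))
```

Both calls report one hashtag and one mention, whereas the split-based check would find none in the second string.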

Numeric Tokens Count

Find and count space-separated digit tokens in the tweets:

PYTHON

df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))

Inspect the head of the DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115	5.052632	4	1
1	is upset that he can't update his Facebook by ...	21	111	4.285714	9	0
2	@Kenichan I dived many times for the ball. Man...	18	89	3.944444	7	1
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0
4	@nationwideclass no, it's not behaving at all....	21	111	4.285714	10	1

UPPER case words count

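Count fully uppercase tokens. The intent is to count only tokens longer than 3 characters, but the condition len(x) > 3 tests the length of the whole tweet rather than the token, so short tokens such as I or ;D are counted as well. Use len(t) > 3 to apply the filter per token; the outputs shown below come from the code as written: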

PYTHON

df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper() and len(x)>3]))

Inspect the head of the DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts
0	@switchfoot http://twitpic.com/2y1zl - Awww, t...	19	115	5.052632	4	1	1
1	is upset that he can't update his Facebook by ...	21	111	4.285714	9	0	0
2	@Kenichan I dived many times for the ball. Man...	18	89	3.944444	7	1	1
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0	0
4	@nationwideclass no, it's not behaving at all....	21	111	4.285714	10	1	1
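A minimal sketch of an uppercase counter that applies the length filter to each token is given below. The helper name count_upper_tokens, its min_len parameter, and the sample string are hypothetical and only serve to illustrate the per-token check.

```python
def count_upper_tokens(text, min_len=4):
    """Count fully uppercase tokens with at least min_len characters."""
    return sum(1 for t in text.split() if t.isupper() and len(t) >= min_len)

sample = 'I saw it LIVE on BBC and it was SO GOOD'
print(count_upper_tokens(sample))             # 2: LIVE, GOOD
print(count_upper_tokens(sample, min_len=1))  # 5: I, LIVE, BBC, SO, GOOD
```

Setting min_len=1 reproduces the behaviour of counting every uppercase token regardless of length.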

View a specific tweet index to check formatting:

PYTHON

df.loc[96]['twitts']

OUTPUT

"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

Preprocessing and Cleaning

In this section, we apply standard normalization, contraction expansion, and regex filters to prepare the dataset.

Lower case conversion

Normalize the text by mapping all characters in the tweet column to lower case:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: x.lower())

Verify case conversion on the first two rows:

PYTHON

df.head(2)

OUTPUT

	twitts	sentiment	word_counts	char_counts	avg_word_len	stop_words_len	hashtags_count	mentions_count	numerics_count	upper_counts
0	@switchfoot http://twitpic.com/2y1zl - awww, t...	0	19	115	5.052632	4	0	1	0	1
1	is upset that he can't update his facebook by ...	0	21	111	4.285714	9	0	0	0	0

Contraction to Expansion

Initialize a raw test string with various common contractions:

PYTHON

x = "i don't know what you want, can't, he'll, i'd"

Define a dictionary mapping English contractions to their expanded forms:

PYTHON

contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}

Define a utility function to apply string replacements from the contractions dictionary:

PYTHON

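def cont_to_exp(x):
    # Non-string values (for example NaN) are returned unchanged
    if not isinstance(x, str):
        return x
    # Replace every contraction key found in the text with its expansion
    for key, value in contractions.items():
        x = x.replace(key, value)
    return x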

Test the expansion on a sample string:

PYTHON

x = "hi, i'd be happy"

Run the contraction expansion function:

PYTHON

cont_to_exp(x)

OUTPUT

'hi, i would be happy'
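Plain str.replace applies the keys in dictionary order and matches substrings, so "can't" is expanded before "can't've" is ever seen, leaving "cannot've" behind. One possible refinement, sketched here, compiles the keys into a single regular expression with the longest keys first. The names contractions_subset and expand_contractions are hypothetical, and the dictionary is a small illustrative subset of the one above.

```python
import re

# Small illustrative subset of the contractions dictionary (assumption for this demo)
contractions_subset = {
    "can't": "cannot",
    "can't've": "cannot have",
    "he'll": "he will",
    "i'd": "i would",
}

# Longest keys first so "can't've" wins over "can't"; the lookarounds
# stop a key from matching in the middle of a longer word
_keys = sorted(contractions_subset, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)"
)

def expand_contractions(text):
    """Expand every contraction in text in a single regex pass."""
    return _pattern.sub(lambda m: contractions_subset[m.group(1)], text)

print(expand_contractions("i'd say he'll win, but i can't've known"))
# i would say he will win, but i cannot have known
```

A single compiled pattern also scans each tweet once instead of once per dictionary key, which may help on a dataset of this size.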

Apply contraction expansion across all tweet texts:

PYTHON

%%time
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))

OUTPUT

Wall time: 52.7 s

Inspect the expanded tweet entries:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts
0	@switchfoot http://twitpic.com/2y1zl - awww, t...	19	115	5.052632	4	1	1
1	is upset that he cannot update his facebook by...	21	111	4.285714	9	0	0
2	@kenichan i dived many times for the ball. man...	18	89	3.944444	7	1	1
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0	0
4	@nationwideclass no, it is not behaving at all...	21	111	4.285714	10	1	1

Count and Remove Emails

Import the Python regular expressions library:

PYTHON

import re

Initialize a test string containing email addresses:

PYTHON

x = 'hi my email me at email@email.com another@email.com'

Find all email addresses using regex pattern matching:

PYTHON

re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)

OUTPUT

['email@email.com', 'another@email.com']

Extract lists of emails into a temporary metadata column:

PYTHON

df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))

Calculate the length of the email address list:

PYTHON

df['emails_count'] = df['emails'].apply(lambda x: len(x))

Display rows that contain at least one email address:

PYTHON

df[df['emails_count']>0].head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	numerics_count	upper_counts	emails	emails_count
4054	i want a new laptop. hp tx2000 is the bomb. :...	20	103	4.150000	6	0	0	4	[gabbehhramos@yahoo.com]	1
7917	who stole elledell@gmail.com?	3	31	9.000000	1	0	0	0	[elledell@gmail.com]	1
8496	@alexistehpom really? did you send out all th...	20	130	5.500000	11	1	0	0	[missataari@gmail.com]	1
10290	@laureystack awh...that is kinda sad lol add ...	8	76	8.500000	0	1	0	0	[hello.kitty.65@hotmail.com]	1
16413	@jilliancyork got 2 bottom of it, human error...	21	137	5.428571	7	1	1	0	[press@linkedin.com]	1

Remove emails from the test string using regex substitution:

PYTHON

re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)

OUTPUT

'hi my email me at  '
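The same email pattern is repeated in several cells above. A minimal sketch that compiles it once and returns both the matches and the cleaned text is shown below; EMAIL_RE and extract_and_strip_emails are illustrative names, not part of the original notebook.

```python
import re

# Same pattern as above, compiled once and without the capturing group
EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')

def extract_and_strip_emails(text):
    """Return (list of emails found, text with those emails removed)."""
    emails = EMAIL_RE.findall(text)
    cleaned = EMAIL_RE.sub('', text)
    return emails, cleaned

emails, cleaned = extract_and_strip_emails(
    'hi my email me at email@email.com another@email.com')
print(emails)         # ['email@email.com', 'another@email.com']
print(repr(cleaned))  # 'hi my email me at  '
```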

Apply the email removal substitution across all tweet entries:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))

Verify email removal on the subset:

PYTHON

df[df['emails_count']>0].head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	numerics_count	upper_counts	emails	emails_count
4054	i want a new laptop. hp tx2000 is the bomb. :...	20	103	4.150000	6	0	0	4	[gabbehhramos@yahoo.com]	1
7917	who stole ?	3	31	9.000000	1	0	0	0	[elledell@gmail.com]	1
8496	@alexistehpom really? did you send out all th...	20	130	5.500000	11	1	0	0	[missataari@gmail.com]	1
10290	@laureystack awh...that is kinda sad lol add ...	8	76	8.500000	0	1	0	0	[hello.kitty.65@hotmail.com]	1
16413	@jilliancyork got 2 bottom of it, human error...	21	137	5.428571	7	1	1	0	[press@linkedin.com]	1

Count and Remove URLs

Initialize a test string containing a URL:

PYTHON

x = 'hi, to watch more visit https://youtube.com/kgptalkie'

Identify any URL links using regular expression matching:

PYTHON

re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)

OUTPUT

[('https', 'youtube.com', '/kgptalkie')]

Count the total URLs present in each tweet string:

PYTHON

df['urls_flag'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

Test URL removal with regex substitution on our test string:

PYTHON

re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)

OUTPUT

'hi, to watch more visit '
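Because the pattern above contains capturing groups, re.findall returns tuples of URL parts rather than whole URLs. If the full URL strings are wanted, one option is to make the groups non-capturing, as in this sketch; URL_RE, find_urls, and strip_urls are hypothetical names introduced for illustration.

```python
import re

# Same pattern as above with every group made non-capturing (?:...)
URL_RE = re.compile(
    r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def find_urls(text):
    """Return every URL in text as a complete string."""
    return URL_RE.findall(text)

def strip_urls(text):
    """Return text with every URL removed."""
    return URL_RE.sub('', text)

x = 'hi, to watch more visit https://youtube.com/kgptalkie'
print(find_urls(x))         # ['https://youtube.com/kgptalkie']
print(repr(strip_urls(x)))  # 'hi, to watch more visit '
```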

Apply URL removal across all tweet entries in the DataFrame:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))

Display the head of the updated DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts	emails	urls_flag
0	@switchfoot - awww, that is a bummer. you sh...	19	115	5.052632	4	1	1	[]	1
1	is upset that he cannot update his facebook by...	21	111	4.285714	9	0	0	[]	0
2	@kenichan i dived many times for the ball. man...	18	89	3.944444	7	1	1	[]	0
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0	0	[]	0
4	@nationwideclass no, it is not behaving at all...	21	111	4.285714	10	1	1	[]	0

Display the first row text entry:

PYTHON

df.loc[0]['twitts']

OUTPUT

'@switchfoot  - awww, that is a bummer.  you shoulda got david carr of third day to do it. ;d'

Remove RT

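Remove retweet indicators (RT) from all tweets. Because the text was lowercased earlier, the pattern must match the standalone lowercase token rt; word boundaries keep it from touching words such as start:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', "", x))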
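Matching retweet markers is sensitive to both letter case and word boundaries. The sketch below shows one way to handle both; RT_RE and remove_rt are illustrative names, and the sample tweets are made up for the demonstration.

```python
import re

# \b keeps 'rt' inside words like 'start' intact; IGNORECASE covers 'RT' and 'rt'
RT_RE = re.compile(r'\brt\b\s*', flags=re.IGNORECASE)

def remove_rt(text):
    """Drop standalone retweet markers and the whitespace that follows them."""
    return RT_RE.sub('', text)

print(remove_rt('rt @user what a great start'))  # @user what a great start
print(remove_rt('RT @user smart move'))          # @user smart move
```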

Special Characters and Punctuation Removal

Remove all special characters, symbols, and punctuation from the tweet text, keeping only letters, digits, spaces, and hyphens:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: re.sub('[^A-Z a-z 0-9-]+', '', x))

Verify punctuation removal:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts	emails	urls_flag
0	switchfoot - awww that is a bummer you shoul...	19	115	5.052632	4	1	1	[]	1
1	is upset that he cannot update his facebook by...	21	111	4.285714	9	0	0	[]	0
2	kenichan i dived many times for the ball manag...	18	89	3.944444	7	1	1	[]	0
3	my whole body feels itchy and like its on fire	10	47	3.700000	5	0	0	[]	0
4	nationwideclass no it is not behaving at all i...	21	111	4.285714	10	1	1	[]	0

Remove multiple spaces

Initialize a test string with multiple consecutive spaces:

PYTHON

x = 'thanks    for    watching and    please    like this video'

Split on whitespace and rejoin using a single space character to strip extra spaces:

PYTHON

" ".join(x.split())

OUTPUT

'thanks for watching and please like this video'

Clean up multiple spaces across all tweet texts:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: " ".join(x.split()))

Display the first two rows:

PYTHON

df.head(2)

OUTPUT

	twitts	sentiment	word_counts	char_counts	avg_word_len	stop_words_len	hashtags_count	mentions_count	numerics_count	upper_counts	emails	emails_count	urls_flag
0	switchfoot - awww that is a bummer you shoulda...	0	19	115	5.052632	4	0	1	0	1	[]	0	1
1	is upset that he cannot update his facebook by...	0	21	111	4.285714	9	0	0	0	0	[]	0	0

Remove HTML tags

Import BeautifulSoup to handle HTML stripping:

PYTHON

from bs4 import BeautifulSoup

Initialize a test string with HTML content:

PYTHON

x = '<html><h2>Thanks for watching</h2></html>'

Strip HTML tags using lxml parser:

PYTHON

BeautifulSoup(x, 'lxml').get_text()

OUTPUT

'Thanks for watching'

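Apply HTML tag stripping to all tweet documents. Because the special-character step above has already deleted angle brackets, this pass finds little to strip at this point in the pipeline; in a fresh run, apply it before punctuation removal: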

PYTHON

%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())

OUTPUT

Wall time: 11min 37s
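Building a BeautifulSoup object for every tweet is likely a large part of that run time, since most tweets contain no markup at all. One possible shortcut, sketched here with the standard library's html.parser instead of BeautifulSoup, skips parsing when the text holds neither < nor &. The names _TextExtractor and strip_html are hypothetical, and the result may differ from lxml on malformed HTML.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects the text nodes of a document and ignores the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    # Fast path: without '<' or '&' there is no tag or entity to handle
    if '<' not in text and '&' not in text:
        return text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)

print(strip_html('<html><h2>Thanks for watching</h2></html>'))  # Thanks for watching
print(strip_html('a &amp; b'))                                  # a & b
```

Running a step like this before punctuation removal would also decode entities such as &amp;, which otherwise tend to survive as stray tokens like amp.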

Remove Accented Chars

Import the unicodedata library:

PYTHON

import unicodedata

Initialize a string with accented characters:

PYTHON

x = 'Áccěntěd těxt'

Normalize and encode to ASCII to remove accent marks:

PYTHON

def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

Verify normalization output:

PYTHON

remove_accented_chars(x)

OUTPUT

'Accented text'
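NFKD normalization splits an accented letter into a base letter plus a combining mark, and the ASCII encoding step then drops the mark. Characters that have no such decomposition are dropped entirely rather than transliterated, as this small sketch illustrates (the sample words are chosen for the demonstration):

```python
import unicodedata

def remove_accented_chars(text):
    """Decompose accented characters, then discard everything non-ASCII."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(remove_accented_chars('café'))        # cafe
print(remove_accented_chars('smørrebrød'))  # smrrebrd (ø has no decomposition)
```

If losing letters such as ø is not acceptable, a dedicated transliteration table would be needed; that is outside the scope of this step.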

spaCy and NLP

In this section, we leverage spaCy's language models to perform tokenization, lemmatization, and linguistic parsing.

Remove Stop Words

Import spaCy library:

PYTHON

import spacy

Initialize a test string with multiple stop words:

PYTHON

x = 'this is stop words removal code is a the an how what'

Filter out any tokens listed in the standard stop words set:

PYTHON

" ".join([t for t in x.split() if t not in STOP_WORDS])

OUTPUT

'stop words removal code'

Apply stop word removal to all tweet texts:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))

Inspect the stop-word-free DataFrame:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts	emails	urls_flag
0	switchfoot - awww bummer shoulda got david car...	19	115	5.052632	4	1	1	[]	1
1	upset update facebook texting cry result schoo...	21	111	4.285714	9	0	0	[]	0
2	kenichan dived times ball managed save 50 rest...	18	89	3.944444	7	1	1	[]	0
3	body feels itchy like fire	10	47	3.700000	5	0	0	[]	0
4	nationwideclass behaving mad	21	111	4.285714	10	1	1	[]	0

spaCy Lemmatization

Load the small English core spaCy pipeline:

PYTHON

nlp = spacy.load('en_core_web_sm')

Initialize a test string with varying inflected forms:

PYTHON

x = 'kenichan dived times ball managed save 50 rest'

Create a custom function to parse the text and return its lemmatized form:

PYTHON

def make_to_base(x):
    x_list = []
    doc = nlp(x)

    for token in doc:
        lemma = str(token.lemma_)
        # Keep the surface form for pronouns ('-PRON-' is the spaCy v2
        # placeholder lemma) and for forms of "be"
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    # Return the result so the function can be used with DataFrame.apply
    return " ".join(x_list)

Verify lemmatization:

PYTHON

make_to_base(x)

OUTPUT

'kenichan dive time ball manage save 50 rest'

Common words removal

Join the first five cleaned tweets together to check word frequencies:

PYTHON

' '.join(df.head()['twitts'])

OUTPUT

'switchfoot - awww bummer shoulda got david carr day d upset update facebook texting cry result school today blah kenichan dived times ball managed save 50 rest bounds body feels itchy like fire nationwideclass behaving mad'

Aggregate and compute the most common words in our dataset:

PYTHON

text = ' '.join(df['twitts'])

Split string on spaces:

PYTHON

text = text.split()

Retrieve value count series:

PYTHON

freq_comm = pd.Series(text).value_counts()

Select the top 20 most frequent words:

PYTHON

f20 = freq_comm[:20]

Display top 20 list:

PYTHON

f20

OUTPUT

good      89366
day       82299
like      77735
-         69662
today     64512
going     64078
love      63421
work      62804
got       60749
time      56081
lol       55094
know      51172
im        50147
want      42070
new       41995
think     41040
night     41029
amp       40616
thanks    39311
home      39168
dtype: int64
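Joining every tweet into one large string and splitting it again works, but the frequencies can also be accumulated tweet by tweet. A minimal sketch with collections.Counter follows; the three tweets are hypothetical stand-ins for the cleaned twitts column.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for df['twitts']
tweets = ['good day today', 'good day', 'good work']

# Count words tweet by tweet without building one giant string
counts = Counter(word for tweet in tweets for word in tweet.split())
print(counts.most_common(2))  # [('good', 3), ('day', 2)]

# Drop the most frequent words from each tweet
top_words = {word for word, _ in counts.most_common(2)}
filtered = [' '.join(w for w in t.split() if w not in top_words) for t in tweets]
print(filtered)  # ['today', '', 'work']
```

Note that a tweet made up entirely of frequent words becomes an empty string, which later steps need to tolerate.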

Remove these 20 high-frequency words from all tweets:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in f20]))

Rare words removal

Find the 20 least common words in the dataset:

PYTHON

rare20 = freq_comm[-20:]

Display the bottom 20 list:

PYTHON

rare20

OUTPUT

veru              1
80-90f            1
refrigerant       1
demaisss          1
knittingsci-fi    1
wendireed         1
danielletuazon    1
chacha8           1
a-zquot           1
krustythecat      1
westmount         1
-appreciate       1
motocycle         1
madamhow          1
felspoon          1
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
dtype: int64

Isolate all words that occur exactly once:

PYTHON

rare = freq_comm[freq_comm.values == 1]

Display stats on words with single occurrences:

PYTHON

rare

OUTPUT

mamat             1
fiive             1
music-festival    1
leenahyena        1
11517             1
                 ..
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
Length: 536196, dtype: int64

Remove the 20 rarest words (rare20) from all tweets. To drop every word that occurs only once, filter against rare instead of rare20:

PYTHON

df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))

Verify removal on the first 5 rows:

PYTHON

df.head()

OUTPUT

	twitts	word_counts	char_counts	avg_word_len	stop_words_len	mentions_count	upper_counts	emails	urls_flag
0	switchfoot awww bummer shoulda david carr d	19	115	5.052632	4	1	1	[]	1
1	upset update facebook texting cry result schoo...	21	111	4.285714	9	0	0	[]	0
2	kenichan dived times ball managed save 50 rest...	18	89	3.944444	7	1	1	[]	0
3	body feels itchy fire	10	47	3.700000	5	0	0	[]	0
4	nationwideclass behaving mad	21	111	4.285714	10	1	1	[]	0
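To remove every word that occurs exactly once, rather than only the last 20, the singletons can be collected into a set and used as a filter. The sketch below assumes a small hypothetical list of tweets in place of the DataFrame column.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for df['twitts']
tweets = ['upset update facebook',
          'update facebook veru',
          'facebook refrigerant update']

counts = Counter(w for t in tweets for w in t.split())

# Words seen exactly once across the whole corpus
singletons = {w for w, c in counts.items() if c == 1}

cleaned = [' '.join(w for w in t.split() if w not in singletons) for t in tweets]
print(sorted(singletons))  # ['refrigerant', 'upset', 'veru']
print(cleaned)             # ['update facebook', 'update facebook', 'facebook update']
```

A set gives constant-time membership checks, which matters when the filter holds several hundred thousand words.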

Word Cloud Visualization

Import the WordCloud library to visually inspect the text tokens:

PYTHON

from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

Create a combined text corpus string of the first 20,000 words:

PYTHON

x = ' '.join(text[:20000])

Print the total word count in the dataset:

PYTHON

len(text)

OUTPUT

10837079

Generate and render the Word Cloud representation:

PYTHON

wc = WordCloud(width = 800, height=400).generate(x)
plt.imshow(wc)
plt.axis('off')
plt.show()

The Word Cloud below displays the most prominent terms among the first 20,000 words of the Twitter dataset:

Word cloud visualization of top Twitter dataset words

Spelling Correction

Import TextBlob:

PYTHON

from textblob import TextBlob

Initialize a misspelled test string:

PYTHON

x = 'tanks forr waching this vidio carri'

Correct spelling using TextBlob:

PYTHON

x = TextBlob(x).correct()

Check the corrected output. Note that tanks is left unchanged because it is itself a valid English word:

PYTHON

x

OUTPUT

TextBlob("tanks for watching this video carry")

Tokenization

Initialize a test string with no spacing around a hashtag:

PYTHON

x = 'thanks#watching this video. please like it'

Tokenize using TextBlob:

PYTHON

TextBlob(x).words

OUTPUT

WordList(['thanks', 'watching', 'this', 'video', 'please', 'like', 'it'])

Parse and print tokens using spaCy:

PYTHON

doc = nlp(x)
for token in doc:
    print(token)

OUTPUT

thanks#watching
this
video
.
please
like
it

TextBlob Lemmatization

Initialize inflected variants:

PYTHON

x = 'runs run running ran'

Import TextBlob Word wrapper:

PYTHON

from textblob import Word

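Lemmatize using TextBlob. With no part-of-speech argument, lemmatize() treats each word as a noun, so verb forms such as running and ran are left unchanged: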

PYTHON

for token in x.split():
    print(Word(token).lemmatize())

OUTPUT

run
run
running
ran

Lemmatize using spaCy:

PYTHON

doc = nlp(x)
for token in doc:
    print(token.lemma_)

OUTPUT

run
run
run
run

Detect Entities Using spaCy NER

Named Entity Recognition (NER) identifies spans of text in unstructured input and assigns them to predefined categories such as people, places, organizations, or dates.

Initialize a news string:

PYTHON

x = "Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon"

Perform NER extraction:

PYTHON

doc = nlp(x)
for ent in doc.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))

OUTPUT

Donald Trump - PERSON - People, including fictional
USA - GPE - Countries, cities, states

Import displacy:

PYTHON

from spacy import displacy

Render named entities visually in the document:

PYTHON

displacy.render(doc, style = 'ent')

Breaking News: Donald Trump PERSON , the president of the USA GPE is looking to sign a deal to mine the moon

Detecting Nouns

Verify the test string:

PLAINTEXT

'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'

Extract the noun chunks using spaCy properties:

PYTHON

for noun in doc.noun_chunks:
    print(noun)

OUTPUT

Breaking News
Donald Trump
the president
the USA
a deal
the moon

Translation and Language Detection

Verify the test string:

PLAINTEXT

'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'

Initialize TextBlob container:

PYTHON

tb = TextBlob(x)

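Detect language. Note that detect_language() and translate() relied on the Google Translate API; they were deprecated in TextBlob 0.16.0 and removed in later releases, so the next two calls only run on older versions: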

PYTHON

tb.detect_language()

OUTPUT

'en'

Translate document text into Bengali:

PYTHON

tb.translate(to='bn')

OUTPUT

TextBlob("ব্রেকিং নিউজ: যুক্তরাষ্ট্রের রাষ্ট্রপতি ডোনাল্ড ট্রাম্প চাঁদটি খনির জন্য একটি চুক্তিতে সই করতে চাইছেন")

Use inbuilt sentiment classifier

Import NaiveBayesAnalyzer:

PYTHON

from textblob.sentiments import NaiveBayesAnalyzer

Initialize a positive test sentence:

PYTHON

x = 'we all stands together to fight with corona virus. we will win together'

Evaluate sentiment predictions:

PYTHON

tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())

Check sentiment distribution:

PYTHON

tb.sentiment

OUTPUT

Sentiment(classification='pos', p_pos=0.8259779151942094, p_neg=0.17402208480578962)

Initialize a second test string:

PYTHON

x = 'we all are sufering from corona'

Evaluate sentiment:

PYTHON

tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())

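Check sentiment prediction. The sentence is negative, yet the analyzer labels it positive; NaiveBayesAnalyzer is trained on a movie-review corpus, so it can generalize poorly to text like this: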

PYTHON

tb.sentiment

OUTPUT

Sentiment(classification='pos', p_pos=0.75616044472398, p_neg=0.2438395552760203)

Advanced Text Processing

In this section, we cover vectorization techniques that convert clean tokens into numeric feature matrices: n-grams, Bag of Words, TF-IDF, and word embeddings.

N-Grams

Initialize a test string:

PYTHON

x = 'thanks for watching'

Convert string to TextBlob object:

PYTHON

tb = TextBlob(x)

Extract 3-gram sequences:

PYTHON

tb.ngrams(3)

OUTPUT

[WordList(['thanks', 'for', 'watching'])]
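An n-gram is simply a sliding window of n consecutive tokens. The sketch below shows the idea in plain Python; the `ngrams` helper is a hypothetical name for illustration and is not TextBlob's implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows, in order."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = "thanks for watching".split()

print(ngrams(tokens, 3))  # one window covering the whole sentence
print(ngrams(tokens, 2))  # two overlapping bigrams
```

A three-word sentence yields a single 3-gram but two 2-grams, which is why lower n values produce more features per document.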

Bag of Words (BoW) Representation

Initialize a list of dummy sentences:

PYTHON

x = ['this is first sentence this is', 'this is second', 'this is last']

Import CountVectorizer:

PYTHON

from sklearn.feature_extraction.text import CountVectorizer

Fit and generate BoW vectors:

PYTHON

cv = CountVectorizer(ngram_range=(1,1))
text_counts = cv.fit_transform(x)

Convert sparse representation to array:

PYTHON

text_counts.toarray()

OUTPUT

array([[1, 2, 0, 0, 1, 2],
       [0, 1, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)

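Check feature names mapping. In scikit-learn 1.0 and later, use get_feature_names_out(); the older get_feature_names() was removed in version 1.2:

PYTHON

cv.get_feature_names_out()

OUTPUT

array(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype=object)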

Convert array to a structured DataFrame:

PYTHON

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
bow = pd.DataFrame(text_counts.toarray(), columns = cv.get_feature_names_out())

Display BoW DataFrame:

PYTHON

bow

OUTPUT

	first	is	last	second	sentence	this
0	1	2	0	0	1	2
1	0	1	0	1	0	1
2	0	1	1	0	0	1

Verify documents:

PLAINTEXT

['this is first sentence this is', 'this is second', 'this is last']
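To see what the vectorizer is doing, the same matrix can be rebuilt by hand. This is a minimal sketch using only the standard library; it assumes simple whitespace tokenization, whereas CountVectorizer by default also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

docs = ["this is first sentence this is", "this is second", "this is last"]

# Vocabulary: every distinct whitespace-separated token, sorted alphabetically.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary term.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[term] for term in vocab])

print(vocab)
for row in bow_rows:
    print(row)
```

Each row matches the corresponding row of the BoW DataFrame above, because the columns are ordered alphabetically in both cases.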

Term Frequency

Term frequency measures the relative occurrence of a term in a specific document. The formula is:

$$TF(t) = \frac{\text{Number of times term } t \text{ appears in a document}}{\text{Total number of terms in the document}}$$

Where:

$TF(t)$: the normalized term frequency of term $t$.

Verify documents:

PLAINTEXT

['this is first sentence this is', 'this is second', 'this is last']

Display BoW matrix:

PYTHON

bow

OUTPUT

	first	is	last	second	sentence	this
0	1	2	0	0	1	2
1	0	1	0	1	0	1
2	0	1	1	0	0	1

Check matrix dimensions:

PYTHON

bow.shape

OUTPUT

(3, 6)

Compute normalized TF by dividing each count by the total number of terms in its row:

PYTHON

# Vectorized row-wise division; avoids writing floats into integer columns cell by cell
tf = bow.div(bow.sum(axis=1), axis=0)

Display normalized TF DataFrame:

PYTHON

tf

OUTPUT

	first	is	last	second	sentence	this
0	0.166667	0.333333	0.000000	0.000000	0.166667	0.333333
1	0.000000	0.333333	0.000000	0.333333	0.000000	0.333333
2	0.000000	0.333333	0.333333	0.000000	0.000000	0.333333
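The same normalization can be checked without pandas. The sketch below hard-codes the BoW counts from above and divides each count by its row total; the variable names are chosen for illustration only.

```python
# BoW counts for the three example documents (columns: first, is, last, second, sentence, this).
bow_rows = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

# Term frequency: each count divided by the number of terms in that document.
tf_rows = [[count / sum(row) for count in row] for row in bow_rows]

for row in tf_rows:
    print([round(value, 6) for value in row])
```

Every row of the result sums to 1, which makes documents of different lengths directly comparable.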

Inverse Document Frequency IDF

Inverse Document Frequency (IDF) measures how rare a term is across the entire corpus: terms that appear in few documents receive a higher weight. The formula used in scikit-learn when smooth_idf=True (the default) is shown below, where $\log$ is the natural logarithm:

$$IDF(t) = \log\left(\frac{1 + N}{1 + n}\right) + 1$$

Where:

$N$: total number of documents in the corpus.
$n$: number of documents containing term $t$.

Import NumPy:

PYTHON

import numpy as np

Convert the list of documents to a DataFrame:

PYTHON

x_df = pd.DataFrame(x, columns=['words'])

Display words:

PYTHON

x_df

OUTPUT

	words
0	this is first sentence this is
1	this is second
2	this is last

Display BoW DataFrame:

PYTHON

bow

OUTPUT

	first	is	last	second	sentence	this
0	1	2	0	0	1	2
1	0	1	0	1	0	1
2	0	1	1	0	0	1

Get total document count $N$:

PYTHON

N = bow.shape[0]
N

OUTPUT

3

Convert values to boolean flags to find document presence:

PYTHON

bb = bow.astype('bool')
bb

OUTPUT

	first	is	last	second	sentence	this
0	True	True	False	False	True	True
1	False	True	False	True	False	True
2	False	True	True	False	False	True

Count the documents that contain the term "is":

PYTHON

bb['is'].sum()

OUTPUT

3

Retrieve columns:

PYTHON

cols = bb.columns
cols

OUTPUT

Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')

Count the number of documents that contain each term (the document frequency $n$):

PYTHON

nz = []
for col in cols:
    nz.append(bb[col].sum())

Check occurrences list:

PYTHON

nz

OUTPUT

[1, 3, 1, 1, 1, 3]

Calculate IDF values:

PYTHON

idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)

Check IDF scores:

PYTHON

idf

OUTPUT

[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]
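As a cross-check, the smoothed IDF formula can be evaluated with nothing but the math module. This sketch hard-codes the BoW counts from above; the variable names are illustrative.

```python
import math

# BoW counts for the three example documents (columns: first, is, last, second, sentence, this).
bow_rows = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

n_docs = len(bow_rows)

# Document frequency: in how many documents does each term appear at least once?
doc_freq = [sum(1 for row in bow_rows if row[j] > 0) for j in range(len(bow_rows[0]))]

# Smoothed IDF: natural log of (1 + N) / (1 + n), plus 1.
idf_values = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

print(doc_freq)
print([round(value, 6) for value in idf_values])
```

Terms that occur in every document ("is" and "this") get the minimum weight of 1.0, while terms confined to a single document get about 1.69.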


TFIDF

Import TfidfVectorizer:

PYTHON

from sklearn.feature_extraction.text import TfidfVectorizer

Fit and transform text using TfidfVectorizer:

PYTHON

tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_df['words'])

Convert sparse matrix to array:

PYTHON

x_tfidf.toarray()

OUTPUT

array([[0.45688214, 0.5396839 , 0.        , 0.        , 0.45688214,
        0.5396839 ],
       [0.        , 0.45329466, 0.        , 0.76749457, 0.        ,
        0.45329466],
       [0.        , 0.45329466, 0.76749457, 0.        , 0.        ,
        0.45329466]])

Print fitted IDF scores:

PYTHON

tfidf.idf_

OUTPUT

array([1.69314718, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.        ])

Compare with the manually calculated IDF scores, which match the fitted values:

PYTHON

idf

OUTPUT

[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]
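The TF-IDF values returned by TfidfVectorizer are not the normalized TF from the earlier section multiplied by IDF. With default settings, scikit-learn multiplies the raw term counts by IDF and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the first row of the matrix by hand; the variable names are illustrative.

```python
import math

counts = [1, 2, 0, 0, 1, 2]  # raw counts for 'this is first sentence this is'
rare = math.log(4 / 2) + 1   # IDF of a term found in 1 of 3 documents
common = math.log(4 / 4) + 1 # IDF of a term found in all 3 documents
idf_values = [rare, common, rare, rare, rare, common]

# Step 1: weight each raw count by its IDF.
raw = [count * weight for count, weight in zip(counts, idf_values)]

# Step 2: divide by the row's Euclidean length so the row has unit norm.
norm = math.sqrt(sum(value * value for value in raw))
tfidf_row = [value / norm for value in raw]

print([round(value, 8) for value in tfidf_row])
```

The L2 step explains why no value in the matrix exceeds 1 and why a count of 2 for "is" ends up only slightly larger than a count of 1 for "first".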

Word Embeddings

A word embedding maps each word to a dense vector of real numbers so that words used in similar contexts end up close together in the vector space. Embeddings can be learned in several ways, including neural networks, co-occurrence matrix factorization, and probabilistic models.

SpaCy Word2Vec

Load the large English pipeline model to retrieve semantic word vectors:

PYTHON

nlp = spacy.load('en_core_web_lg')

Initialize sample tokens:

PYTHON

doc = nlp('thank you! dog cat lion dfasaa')

Check if each token is associated with a pre-trained vector representation:

PYTHON

for token in doc:
    print(token.text, token.has_vector)

OUTPUT

thank True
you True
! True
dog True
cat True
lion True
dfasaa False

Check token vector dimensions:

PYTHON

token.vector.shape

OUTPUT

(300,)

Check vector dimension on "cat":

PYTHON

nlp('cat').vector.shape

OUTPUT

(300,)

Calculate similarity scores across combinations of words:

PYTHON

for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
    print()

OUTPUT

thank thank 1.0
thank you 0.5647585
thank ! 0.52147406
thank dog 0.2504265
thank cat 0.20648485
thank lion 0.13629764
thank dfasaa 0.0

you thank 0.5647585
you you 1.0
you ! 0.4390223
you dog 0.36494097
you cat 0.3080798
you lion 0.20392051
you dfasaa 0.0

! thank 0.52147406
! you 0.4390223
! ! 1.0
! dog 0.29852203
! cat 0.29702348
! lion 0.19601382
! dfasaa 0.0

dog thank 0.2504265
dog you 0.36494097
dog ! 0.29852203
dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dfasaa 0.0

cat thank 0.20648485
cat you 0.3080798
cat ! 0.29702348
cat dog 0.80168545
cat cat 1.0
cat lion 0.52654374
cat dfasaa 0.0

lion thank 0.13629764
lion you 0.20392051
lion ! 0.19601382
lion dog 0.47424486
lion cat 0.52654374
lion lion 1.0
lion dfasaa 0.0

dfasaa thank 0.0
dfasaa you 0.0
dfasaa ! 0.0
dfasaa dog 0.0
dfasaa cat 0.0
dfasaa lion 0.0
dfasaa dfasaa 1.0
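spaCy documents its default similarity estimate as the cosine similarity between vectors. The sketch below implements that measure in plain Python on made-up three-dimensional vectors (real en_core_web_lg vectors have 300 dimensions). Returning 0.0 for an all-zero vector is an assumption that mirrors the dfasaa rows above; the 1.0 that spaCy reports for dfasaa against itself suggests identical tokens are handled as a special case, which this sketch does not replicate.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# Toy vectors, invented for illustration only.
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
unknown = [0.0, 0.0, 0.0]  # stands in for an out-of-vocabulary token

print(round(cosine_similarity(dog, cat), 4))
print(cosine_similarity(dog, unknown))
```

Because cosine similarity depends only on direction, not length, two words can score highly even when their vectors differ in magnitude.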

Machine Learning Models for Text Classification

In this section, we compare machine learning models trained on Bag of Words features, manual features, TF-IDF features, and Word2Vec semantic embeddings.

BoW Features Setup

Inspect the shape of the main dataset DataFrame:

PYTHON

df.shape

OUTPUT

(1600000, 13)

Create a balanced sample dataset consisting of 2000 positive and 2000 negative sentiment rows:

PYTHON

df0 = df[df['sentiment']==0].sample(2000)
df4 = df[df['sentiment']==4].sample(2000)

Concatenate positive and negative samples:

PYTHON

# DataFrame.append() was removed in pandas 2.0; pd.concat() is the supported equivalent
dfr = pd.concat([df0, df4])

Check the dimension of the concatenated sample dataset:

PYTHON

dfr.shape

OUTPUT

(4000, 13)
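Because sample() is called without a seed, each run draws different rows, so the accuracy figures later in this section will vary slightly from run to run. In pandas, passing random_state to sample() fixes the draw. The sketch below shows the same balanced, seeded sampling idea with the standard library; the rows are hypothetical placeholders, not real tweets.

```python
import random

# Hypothetical (label, text) rows; 0 = negative, 4 = positive as in Sentiment140.
rows = [(0, f"neg tweet {i}") for i in range(50)] + [(4, f"pos tweet {i}") for i in range(30)]

rng = random.Random(42)  # fixed seed so the same rows are drawn on every run
per_class = 10

sample = []
for label in (0, 4):
    pool = [row for row in rows if row[0] == label]
    sample.extend(rng.sample(pool, per_class))  # draw without replacement

negatives = sum(1 for label, _ in sample if label == 0)
print(len(sample), negatives)
```

Drawing the same number of rows from each class keeps the sample balanced even though the underlying pools differ in size.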

Drop the tweet text, the sentiment label, and the extracted email addresses to isolate the numeric manual features:

PYTHON

dfr_feat = dfr.drop(labels=['twitts','sentiment','emails'], axis = 1).reset_index(drop=True)

Inspect the manual feature DataFrame:

PYTHON

dfr_feat

OUTPUT

	word_counts	char_counts	avg_word_len	stop_words_len	hashtags_count	mentions_count	numerics_count	upper_counts	emails_count	urls_flag
0	15	81	4.400000	6	0	0	0	0	0	0
1	8	47	4.875000	4	0	1	0	0	0	0
2	15	69	3.600000	6	0	1	0	0	0	0
3	9	42	3.666667	4	0	0	0	2	0	0
4	14	77	4.500000	5	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...
3995	3	33	9.666667	1	0	1	0	0	0	0
3996	16	78	3.875000	4	0	1	0	2	0	0
3997	27	134	3.962963	9	0	1	0	2	0	0
3998	6	44	6.333333	0	0	1	1	1	0	0
3999	5	25	4.000000	3	0	1	0	0	0	0

4000 rows × 10 columns

Extract target sentiment labels:

PYTHON

y = dfr['sentiment']

Import CountVectorizer:

PYTHON

from sklearn.feature_extraction.text import CountVectorizer

Generate BoW representation for the sample dataset:

PYTHON

cv = CountVectorizer()
text_counts = cv.fit_transform(dfr['twitts'])

Check the size of the generated vocabulary feature space. The sparse matrix exposes shape directly, so there is no need to convert it to a dense array first:

PYTHON

text_counts.shape

OUTPUT

(4000, 9750)

Construct a DataFrame from the generated vocabulary:

PYTHON

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
dfr_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())

Inspect the generated BoW DataFrame:

PYTHON

dfr_bow.head(2)

OUTPUT

	007peter	05	060594	09	10	100	1000	10000000000000000000000000000	1038	1041	...	zomg	zonked	zoo	zooey	zrovna	zshare	zsk	zwel	zzz	zzzzz
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

2 rows × 9750 columns

Classifier Models Setup

Import the classifier algorithms, evaluation metrics, and preprocessing utilities from scikit-learn:

PYTHON

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler

Initialize the five comparison models:

PYTHON

sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)
lgr = LogisticRegression(random_state=42, max_iter=200)
lgrcv = LogisticRegressionCV(cv = 2, random_state=42, max_iter=1000)
svm = LinearSVC(random_state=42, max_iter=200)
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)

Map models to standard shorthand keys:

PYTHON

clf = {'SGD': sgd, 'LGR': lgr, 'LGR-CV': lgrcv, 'SVM': svm, 'RFC': rfc}

Verify dictionary keys:

PYTHON

clf.keys()

OUTPUT

dict_keys(['SGD', 'LGR', 'LGR-CV', 'SVM', 'RFC'])

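Create a generic training and evaluation function. It scales every feature to the range [0, 1], holds out a stratified 20% test split, and prints each model's accuracy on that split:

PYTHON

def classify(X, y):
    # Scale every feature column to the [0, 1] range
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)

    # Hold out 20% of the rows, preserving the class balance
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

    # Fit each model and report its accuracy on the held-out rows
    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)
        print(key, " ---> ", ac)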
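One caveat about classify(): the scaler is fit on the full matrix before the split, so the minimum and maximum of the test rows influence how the training rows are scaled. A stricter evaluation learns the scaling from the training rows only and then applies it to the test rows; in scikit-learn that means calling fit_transform on X_train and transform on X_test. The sketch below illustrates the idea in plain Python; the helper names and the sample values are hypothetical.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and range from the training rows only."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    # Guard against constant columns, whose range would otherwise be zero.
    ranges = [(max(col) - lo) or 1.0 for col, lo in zip(columns, mins)]
    return mins, ranges

def apply_min_max(rows, mins, ranges):
    """Scale rows using statistics learned elsewhere."""
    return [[(v - lo) / rng for v, lo, rng in zip(row, mins, ranges)] for row in rows]

train = [[15, 81], [8, 47], [9, 42]]  # hypothetical word_counts, char_counts
test = [[14, 77], [27, 134]]

mins, ranges = fit_min_max(train)
train_scaled = apply_min_max(train, mins, ranges)
test_scaled = apply_min_max(test, mins, ranges)

print(train_scaled)
print(test_scaled)
```

Test values that lie outside the training range land outside [0, 1], which is expected and is a more honest picture of how the model will see unseen data.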

Evaluate accuracy scores using the BoW features representation:

PYTHON

%%time
classify(dfr_bow, y)

OUTPUT

SGD  --->  0.62375
LGR  --->  0.65375
LGR-CV  --->  0.6525
SVM  --->  0.6325
RFC  --->  0.6525
Wall time: 1min 42s

Next, evaluate the manual feature set. Preview its first two rows:

PYTHON

dfr_feat.head(2)

OUTPUT

	word_counts	char_counts	avg_word_len	stop_words_len	hashtags_count	mentions_count	numerics_count	upper_counts	emails_count	urls_flag
0	15	81	4.400	6	0	0	0	0	0	0
1	8	47	4.875	4	0	1	0	0	0	0

Run classifier evaluation on the manual features:

PYTHON

%%time
classify(dfr_feat, y)

OUTPUT

SGD  --->  0.64125
LGR  --->  0.645
LGR-CV  --->  0.65
SVM  --->  0.6475
RFC  --->  0.5675
Wall time: 1.35 s

Combine manual features and vocabulary-based BoW features:

PYTHON

X = dfr_feat.join(dfr_bow)

Evaluate accuracy on the combined features matrix:

PYTHON

%%time
classify(X, y)

OUTPUT

SGD  --->  0.64875
LGR  --->  0.67125
LGR-CV  --->  0.66125
SVM  --->  0.64375
RFC  --->  0.705
Wall time: 1min 18s

TF-IDF Features Setup

Import TfidfVectorizer:

PYTHON

from sklearn.feature_extraction.text import TfidfVectorizer

Verify dataset shape:

PYTHON

dfr.shape

OUTPUT

(4000, 13)

Vectorize using TfidfVectorizer:

PYTHON

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dfr['twitts'])

Train and evaluate the models using the generated TF-IDF features matrix:

PYTHON

%%time
classify(pd.DataFrame(X.toarray()), y)

OUTPUT

SGD  --->  0.635
LGR  --->  0.65125
LGR-CV  --->  0.6475
SVM  --->  0.63875
RFC  --->  0.6425
Wall time: 1min 37s

Word2Vec Features Setup

Create a function to calculate the average semantic vector representation of a sentence using spaCy:

PYTHON

def get_vec(x):
    doc = nlp(x)
    return doc.vector.reshape(1, -1)
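spaCy documents Doc.vector as defaulting to the average of the token vectors, which is what makes a single fixed-length feature row per tweet possible. The sketch below shows that averaging step on made-up three-dimensional vectors; `average_vector` is a hypothetical helper, not a spaCy function.

```python
def average_vector(token_vectors):
    """Element-wise mean of equally sized token vectors."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# Toy vectors for a three-token sentence (real spaCy vectors have 300 dimensions).
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

sentence_vector = average_vector(token_vectors)
print(sentence_vector)
```

Averaging discards word order, so "dog bites man" and "man bites dog" map to the same vector; this is one reason embedding averages are a baseline rather than a complete sentence representation.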

Apply the vectorization function to all tweets:

PYTHON

%%time
dfr['vec'] = dfr['twitts'].apply(get_vec)

OUTPUT

Wall time: 51.8 s

Concatenate individual vectors into a single feature array:

PYTHON

X = np.concatenate(dfr['vec'].to_numpy(), axis = 0)

Check the shape of the feature matrix:

PYTHON

X.shape

OUTPUT

(4000, 300)

Evaluate accuracy on Word2Vec feature vectors:

PYTHON

classify(pd.DataFrame(X), y)

OUTPUT

SGD  --->  0.5925
LGR  --->  0.70625
LGR-CV  --->  0.69375
SVM  --->  0.70125
RFC  --->  0.66625

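Create a helper that runs every fitted classifier on a new sentence. The models in clf were last fit on min-max-scaled Word2Vec features, whereas get_vec() returns unscaled vectors, so treat these predictions as a rough illustration only: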

PYTHON

def predict_w2v(x):
    for key in clf.keys():
        y_pred = clf[key].predict(get_vec(x))
        print(key, "-->", y_pred)

Predict the sentiment of a positive sentence (0 = negative, 4 = positive):

PYTHON

predict_w2v('hi, thanks for watching this video. please like and subscribe')

OUTPUT

SGD --> [0]
LGR --> [4]
LGR-CV --> [0]
SVM --> [4]
RFC --> [0]

Predict the sentiment of a neutral request:

PYTHON

predict_w2v('please let me know if you want more video')

OUTPUT

SGD --> [0]
LGR --> [0]
LGR-CV --> [0]
SVM --> [0]
RFC --> [0]

Predict the sentiment of strongly positive feedback:

PYTHON

predict_w2v('congratulation looking good congrats')

OUTPUT

SGD --> [4]
LGR --> [4]
LGR-CV --> [4]
SVM --> [4]
RFC --> [0]

Conclusion

In this tutorial, you built a complete text preprocessing and feature engineering pipeline in Python. Starting with raw tweets from the Sentiment140 dataset, you calculated meta-features like word counts and mentions, cleaned the text by stripping HTML tags and accented characters, and extracted lemmas using spaCy. Using the cleaned text, you generated numerical representations with Bag of Words, TF-IDF, and Word2Vec, then trained five different classifiers on a balanced 4,000-tweet sample. Logistic Regression trained on dense Word2Vec embeddings achieved the highest accuracy (70.6%), narrowly ahead of the random forest on combined manual and BoW features (70.5%) and LinearSVC on Word2Vec embeddings (70.1%).

Key takeaways:

Cleaning operations like contraction expansion and lemmatization reduce the overall size of the vocabulary, helping classifiers avoid overfitting.
spaCy provides ready-to-use, high-quality, pre-trained word embeddings that capture semantic similarities better than basic Bag of Words arrays.
Combining manual features (like word length or stop word counts) with BoW word counts improved accuracy for all five classifiers compared with BoW alone; the random forest gained the most, rising from about 65% to 70.5%.

Next steps:

Apply the same workflow to movie and product reviews in Sentiment Classification with spaCy.
Build a binary classifier to identify spam messages in Spam Text Message Classification using NLP.
Explore summarization techniques on raw text blocks in Text Summarization using NLP.
