NLP: End to End Text Processing for Beginners

Master end-to-end NLP text processing in Python. Covers Bag of Words, TF-IDF, Word2Vec, spaCy tokenization, and classification with machine learning.

May 24, 2026 at 9:00 PM · 31 min read

Topics You Will Master

Text cleaning: lowercasing, punctuation removal, and stop-word filtering
Bag of Words vectorization with sklearn CountVectorizer
TF-IDF weighting and sklearn TfidfVectorizer implementation
Word2Vec embeddings with spaCy for semantic word similarity
End-to-end classification pipeline: vectorize, train, and evaluate

Best For

ML beginners learning the full NLP preprocessing stack before training classifiers.

Expected Outcome

A complete text processing pipeline ready to feed any machine learning classifier.

Complete Text Processing for Beginners

Natural Language Processing (NLP) enables computers to understand, interpret, and manipulate human language. While raw text from tweets, documents, and transcripts is highly unstructured and messy, modern text processing pipelines can extract clean semantic representations to predict sentiment or classify topics.

In this tutorial, you will build an end-to-end text processing pipeline in Python. You will clean and preprocess the Sentiment140 Twitter dataset, extract features using Bag of Words, TF-IDF, and Word2Vec, and train multiple machine learning classifiers to predict sentiment.

Prerequisites: Python 3.x, NumPy, pandas, spaCy, scikit-learn, TextBlob. Later sections also use BeautifulSoup (bs4) with the lxml parser, wordcloud, and matplotlib.

Dataset used in this tutorial: Sentiment140 Dataset on Kaggle

The diagram below illustrates the general workflow of an end-to-end NLP pipeline:

Cognitive NLP vs keyword matching visual

Installing Libraries

You can install spaCy and its associated English language models using pip:

PYTHON
# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

Importing Libraries

We begin by importing the basic data manipulation libraries and the spaCy stop word list:

PYTHON
import pandas as pd
import numpy as np

import spacy
from spacy.lang.en.stop_words import STOP_WORDS

Next, load the Sentiment140 CSV dataset with Latin-1 encoding:

PYTHON
df = pd.read_csv('twitter16m.csv', encoding = 'latin1', header = None)

Display the first few rows of the loaded DataFrame:

PYTHON
df.head()
OUTPUT
   0           1                             2         3                4  5
0  0  1467810369  Mon Apr 06 22:19:45 PDT 2009  NO_QUERY  _TheSpecialOne_  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1  0  1467810672  Mon Apr 06 22:19:49 PDT 2009  NO_QUERY    scotthamilton  is upset that he can't update his Facebook by ...
2  0  1467810917  Mon Apr 06 22:19:53 PDT 2009  NO_QUERY         mattycus  @Kenichan I dived many times for the ball. Man...
3  0  1467811184  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY          ElleCTF  my whole body feels itchy and like its on fire
4  0  1467811193  Mon Apr 06 22:19:57 PDT 2009  NO_QUERY           Karoli  @nationwideclass no, it's not behaving at all....

Keep only the text content (column index 5) and the target sentiment label (column index 0):

PYTHON
df = df[[5, 0]]

Assign descriptive column names to the target DataFrame:

PYTHON
df.columns = ['twitts', 'sentiment']
df.head()
OUTPUT
   twitts                                             sentiment
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0
1  is upset that he can't update his Facebook by ...          0
2  @Kenichan I dived many times for the ball. Man...          0
3  my whole body feels itchy and like its on fire             0
4  @nationwideclass no, it's not behaving at all....          0

Check the class balance for positive and negative sentiment labels:

PYTHON
df['sentiment'].value_counts()
OUTPUT
4    800000
0    800000
Name: sentiment, dtype: int64

Create a lookup dictionary mapping sentiment label integers to descriptive string categories:

PYTHON
sent_map = {0: 'negative', 4: 'positive'}

Word Counts

We can calculate word counts by splitting each tweet sentence on whitespace and finding the length of the resulting word list:

PYTHON
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))

Display the DataFrame with the new word counts column:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19
1  is upset that he can't update his Facebook by ...          0           21
2  @Kenichan I dived many times for the ball. Man...          0           18
3  my whole body feels itchy and like its on fire             0           10
4  @nationwideclass no, it's not behaving at all....          0           21

Characters Count

Next, we count the total number of characters in each tweet string:

PYTHON
df['char_counts'] = df['twitts'].apply(lambda x: len(x))

Display the DataFrame with the character counts included:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115
1  is upset that he can't update his Facebook by ...          0           21          111
2  @Kenichan I dived many times for the ball. Man...          0           18           89
3  my whole body feels itchy and like its on fire             0           10           47
4  @nationwideclass no, it's not behaving at all....          0           21          111

Average Word Length

We define a helper function to compute the average character length of words inside each tweet:

PYTHON
def get_avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len = word_len + len(word)
    return word_len/len(words) # != len(x)/len(words)

Apply the function to generate average word lengths:

PYTHON
df['avg_word_len'] = df['twitts'].apply(lambda x: get_avg_word_len(x))

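For comparison, dividing the full string length by the word count also counts the spaces, so it overstates the average. The four words in this dummy string average 15/4 = 3.75 characters, but the naive ratio gives: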

PYTHON
len('this is nlp lesson')/4
OUTPUT
4.5
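The difference between the two calculations can be sketched in plain Python. The helper name `avg_word_len` and the empty-text guard are illustrative additions, not part of the original notebook:

```python
def avg_word_len(text):
    """Mean number of characters per whitespace-separated token."""
    words = text.split()
    if not words:  # guard against empty or whitespace-only text
        return 0.0
    return sum(len(w) for w in words) / len(words)

sample = 'this is nlp lesson'
naive = len(sample) / len(sample.split())  # also counts the three spaces
exact = avg_word_len(sample)               # 15 letters spread over 4 words
print(naive, exact)
```

The guard matters on real data: a tweet that is empty or only whitespace would otherwise raise a ZeroDivisionError.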

Display the head of the updated DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115      5.052632
1  is upset that he can't update his Facebook by ...          0           21          111      4.285714
2  @Kenichan I dived many times for the ball. Man...          0           18           89      3.944444
3  my whole body feels itchy and like its on fire             0           10           47      3.700000
4  @nationwideclass no, it's not behaving at all....          0           21          111      4.285714

On the first row, the naive ratio of character count to word count is about 6.05, higher than the avg_word_len value of 5.05, because char_counts includes whitespace:

PYTHON
115/19
OUTPUT
6.052631578947368

Stop Words Count

Check the default list of stop words provided by spaCy:

PYTHON
print(STOP_WORDS)
OUTPUT
{'one', 'up', 'further', 'herself', 'nevertheless', 'their', 'when', 'a', 'bottom', 'both', 'also', 'i', 'sometime', 'ours', "'d", 'him', 'together', 'former', 'hereafter', 'whereby', "'ll", 'three', 'same', 'is', 'say', 'hers', 'must', 'five', 'you', 'across', 'n‘t', 'mostly', 'into', 'am', 'myself', 'something', 'could', 'being', 'seems', 'go', 'only', 'fifteen', 'either', 'us', 'than', 'latter', 'so', 'after', 'name', 'there', 'that', 'next', 'even', 'without', 'along', 'behind', 'very', 'whereas', 'off', 'herein', 'although', 'such', 'themselves', 'then', 'in', 'under', 'of', 'onto', 'really', 'due', 'otherwise', 'give', 'yourself', 'indeed', 'my', 'mine', 'show', 'via', 'elsewhere', 'be', 'just', 'thence', 'them', 'beside', 'though', 'as', 'out', 'third', 'however', 'twelve', 'except', '‘d', 'anything', 'move', 'side', 'everything', 'all', 'towards', 'whatever', 'will', 'n’t', 'toward', 'keep', 'hereupon', 'might', 'no', 'own', 'itself', 'for', 'can', 'rather', 'whether', 'while', 'and', 'part', 'over', 'else', 'has', 'forty', 'about', 'hereby', 'sixty', 'using', 'here', 'please', 'often', '’re', 'any', 'ca', 'per', 'whole', 'it', 'are', 'from', 'had', 'thru', '’m', 'two', 'fifty', 'your', 'latterly', 'again', 'or', 'few', 'against', 'much', 'somewhere', 'but', '’d', 'somehow', 'never', 'becoming', 'down', 'regarding', 'always', 'other', 'amount', 'because', 'noone', 'anyone', 'six', 'each', 'thus', 'alone', 'why', 'his', 'sometimes', 'now', 'since', 'become', 'see', 'she', 'where', 'whereafter', 'various', 'perhaps', 'another', 'who', 'anyhow', 'yourselves', 'someone', 'ten', 'became', 'nothing', 'front', 'an', 'anyway', 'get', 'thereafter', "'re", 'our', 'call', 'therein', 'have', 'this', 'above', 'some', 'namely', '‘re', 'seem', 'until', '’ll', 'more', 'still', "n't", 'the', 'does', 'himself', 'take', 'he', 'which', 'seeming', 'been', 'beforehand', 'may', 'do', 'well', 'ever', 'used', 'enough', 'every', 'top', 'made', "'m", 'hundred', 'almost', 'her', 
'moreover', 'wherever', '’s', 'amongst', 'meanwhile', 'nobody', 'ourselves', 'whenever', 'at', 'wherein', 'nowhere', 'around', 'between', 'last', 'others', 'becomes', 'they', 'full', 'below', 'nor', 'before', 'what', 'within', 'these', 'besides', 'whereupon', 'how', 'throughout', 'eight', "'s", 'on', 'most', 'if', '‘ve', 'should', 'four', 'serious', 'thereby', '‘ll', 'whence', 'done', 'anywhere', 'yours', 'formerly', 'everyone', 'whose', 'back', 'make', 'among', 'first', 'we', '‘s', 'neither', 'doing', 'already', 'those', 'empty', 'did', 'not', '‘m', 'less', 'to', 'during', 'twenty', 'too', 'put', 'nine', 'yet', 'everywhere', 'quite', 'were', 'seemed', '’ve', 'through', 'once', 'whither', 'thereupon', 'whoever', "'ve", 'therefore', 'me', 'unless', 'whom', 'cannot', 'afterwards', 'none', 'least', 'hence', 'eleven', 'with', 'upon', 'was', 'would', 'by', 'beyond', 'several', 'its', 'many', 're'}

Initialize a test string:

PYTHON
x = 'this is text data'

Tokenize the test string:

PYTHON
x.split()
OUTPUT
['this', 'is', 'text', 'data']

Filter out and count stop words in the token list:

PYTHON
len([t for t in x.split() if t in STOP_WORDS])
OUTPUT
2
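The membership test above is case-sensitive, and the stop-word set printed earlier contains only lowercase entries, so capitalized tokens such as "This" are not counted when the text has not yet been lowercased. A minimal sketch of a case-insensitive count follows; the small `stop_words` set is a stand-in for spaCy's STOP_WORDS so the example runs without spaCy, and `count_stop_words` is a hypothetical helper name:

```python
# Stand-in for a few entries of spaCy's STOP_WORDS (assumption for illustration)
stop_words = {'this', 'is', 'i', 'a', 'the'}

def count_stop_words(text, stop_set=stop_words):
    # Lowercase each token before the lookup so 'This' matches 'this'
    return sum(1 for t in text.split() if t.lower() in stop_set)

sample = 'This is text data'
case_sensitive = len([t for t in sample.split() if t in stop_words])
case_insensitive = count_stop_words(sample)
print(case_sensitive, case_insensitive)
```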

Compute the number of stop words contained in each tweet:

PYTHON
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in STOP_WORDS]))

Display the updated DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115      5.052632               4
1  is upset that he can't update his Facebook by ...          0           21          111      4.285714               9
2  @Kenichan I dived many times for the ball. Man...          0           18           89      3.944444               7
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5
4  @nationwideclass no, it's not behaving at all....          0           21          111      4.285714              10

Count #HashTags and @Mentions

Initialize a sample string with a hashtag and a mention:

PYTHON
x = 'this #hashtag and this is @mention'
# x = x.split()
# x

Find all tokens starting with @:

PYTHON
[t for t in x.split() if t.startswith('@')]
OUTPUT
['@mention']

Calculate the occurrences of hashtags (#) and user mentions (@) across all tweets:

PYTHON
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
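Splitting on whitespace misses tags wrapped in punctuation, such as "(#nlp)". As an alternative sketch (not the approach used in the rest of this tutorial), a regular expression can find tags wherever they start, while a lookbehind keeps email addresses from being counted as mentions. The names `HASHTAG_RE`, `MENTION_RE`, and `count_tags` are illustrative:

```python
import re

# '#' or '@' must not be preceded by a word character, so 'a@b.com' is skipped
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')
MENTION_RE = re.compile(r'(?<!\w)@\w+')

def count_tags(text):
    """Return (hashtag_count, mention_count) for one tweet."""
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))

sample = 'loving it (#nlp) thanks to @kgp, mail me at a@b.com'
print(count_tags(sample))
```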

Inspect the head of the updated DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115      5.052632               4               0               1
1  is upset that he can't update his Facebook by ...          0           21          111      4.285714               9               0               0
2  @Kenichan I dived many times for the ball. Man...          0           18           89      3.944444               7               0               1
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0
4  @nationwideclass no, it's not behaving at all....          0           21          111      4.285714              10               0               1

Count Numeric Tokens

Find and count space-separated digit tokens in the tweets:

PYTHON
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
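Note that `str.isdigit()` only accepts tokens made entirely of digits, so values such as "50%", "1,000", or "3.5" are not counted. A hedged sketch of a more permissive counter is shown below; the pattern `NUMERIC_RE`, the helper `count_numerics`, and the set of stripped punctuation are assumptions chosen for illustration:

```python
import re

# Digits, optionally grouped by '.' or ',', with an optional trailing percent sign
NUMERIC_RE = re.compile(r'^\d+(?:[.,]\d+)*%?$')

def count_numerics(text):
    # Strip surrounding punctuation so '3.5.' at the end of a sentence still matches
    return sum(1 for t in text.split() if NUMERIC_RE.match(t.strip('.,!?;:()')))

sample = 'saved 50% of 1,000 shots in 2 games, rating 3.5.'
strict = len([t for t in sample.split() if t.isdigit()])
loose = count_numerics(sample)
print(strict, loose)
```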

Inspect the head of the DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115      5.052632               4               0               1               0
1  is upset that he can't update his Facebook by ...          0           21          111      4.285714               9               0               0               0
2  @Kenichan I dived many times for the ball. Man...          0           18           89      3.944444               7               0               1               0
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0               0
4  @nationwideclass no, it's not behaving at all....          0           21          111      4.285714              10               0               1               0

UPPER case words count

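Count the uppercase tokens in each tweet. Note that the condition len(x) > 3 in the code below tests the length of the whole tweet rather than of each token, so short uppercase tokens such as I are counted too; use len(t) > 3 to count only uppercase tokens longer than three characters: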

PYTHON
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper() and len(x)>3]))
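A per-token length check can be sketched as follows, using an all-caps tweet from the dataset as sample input. The helper `count_upper_tokens` and its `min_len` parameter are illustrative names, not part of the original code:

```python
def count_upper_tokens(text, min_len=4):
    """Count whitespace-separated tokens that are fully uppercase and at least min_len long."""
    return sum(1 for t in text.split() if t.isupper() and len(t) >= min_len)

tweet = ("so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER "
         "I WNT B ABLE 2 DO MUCH  BUT OHH WELL.....")

print(count_upper_tokens(tweet))             # only tokens of four or more characters
print(count_upper_tokens(tweet, min_len=1))  # every uppercase token, including 'I' and 'B'
```

Trailing punctuation still counts toward the token length here ("WELL....." passes the check), which may or may not be what you want.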

Inspect the head of the DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
0  @switchfoot http://twitpic.com/2y1zl - Awww, t...          0           19          115      5.052632               4               0               1               0             1
1  is upset that he can't update his Facebook by ...          0           21          111      4.285714               9               0               0               0             0
2  @Kenichan I dived many times for the ball. Man...          0           18           89      3.944444               7               0               1               0             1
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0               0             0
4  @nationwideclass no, it's not behaving at all....          0           21          111      4.285714              10               0               1               0             1

View a specific tweet index to check formatting:

PYTHON
df.loc[96]['twitts']
OUTPUT
"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH  BUT OHH WELL....."

Preprocessing and Cleaning

In this section, we apply standard normalization, contraction expansion, and regex filters to prepare the dataset.

Lower case conversion

Normalize the text by mapping all characters in the tweet column to lower case:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: x.lower())

Verify case conversion on the first two rows:

PYTHON
df.head(2)
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
0  @switchfoot http://twitpic.com/2y1zl - awww, t...          0           19          115      5.052632               4               0               1               0             1
1  is upset that he can't update his facebook by ...          0           21          111      4.285714               9               0               0               0             0

Contraction to Expansion

Initialize a raw test string with various common contractions:

PYTHON
x = "i don't know what you want, can't, he'll, i'd"

Define a dictionary mapping English contractions to their expanded forms:

PYTHON
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}

Define a utility function to apply string replacements from the contractions dictionary:

PYTHON
def cont_to_exp(x):
    if type(x) is str:
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x

Test the expansion on a sample string:

PYTHON
x = "hi, i'd be happy"

Run the contraction expansion function:

PYTHON
cont_to_exp(x)
OUTPUT
'hi, i would be happy'
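One caveat with sequential `str.replace` calls: the dictionary is traversed in insertion order, so "can't" is replaced before "can't've" gets a chance, and a key can also match inside a longer word. The sketch below is an alternative, not the method used in the rest of this tutorial. It compiles a single pattern with the longest keys first and word-boundary checks; `contractions_subset` is a small illustrative subset of the dictionary above, and `expand_contractions` is a hypothetical helper name:

```python
import re

# Small subset of the contractions dictionary, for illustration only
contractions_subset = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
    "he'll": "he will",
    "i'd": "i would",
}

# Longest keys first, so "can't've" wins over "can't"
_keys = sorted(contractions_subset, key=len, reverse=True)
_pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)")

def expand_contractions(text):
    # Replace each matched contraction with its expansion in a single pass
    return _pattern.sub(lambda m: contractions_subset[m.group(1)], text)

print(expand_contractions("i can't've known, he'll say i'd go"))
```

A single compiled pattern also avoids scanning every tweet once per dictionary key, which may help with the runtime on 1.6 million rows.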

Apply contraction expansion across all tweet texts:

PYTHON
%%time
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
OUTPUT
Wall time: 52.7 s

Inspect the expanded tweet entries:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts
0  @switchfoot http://twitpic.com/2y1zl - awww, t...          0           19          115      5.052632               4               0               1               0             1
1  is upset that he cannot update his facebook by...          0           21          111      4.285714               9               0               0               0             0
2  @kenichan i dived many times for the ball. man...          0           18           89      3.944444               7               0               1               0             1
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0               0             0
4  @nationwideclass no, it is not behaving at all...          0           21          111      4.285714              10               0               1               0             1

Count and Remove Emails

Import the Python regular expressions library:

PYTHON
import re

Initialize a test string containing email addresses:

PYTHON
x = 'hi my email me at email@email.com another@email.com'

Find all email addresses using regex pattern matching:

PYTHON
re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)
OUTPUT
['email@email.com', 'another@email.com']
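Since the same pattern is used for both counting and removal, it can be compiled once and reused. This is a minimal sketch under that assumption; `EMAIL_RE` and `extract_and_strip_emails` are illustrative names, and the pattern is the one used above without the capturing group:

```python
import re

# Same pattern as in the tutorial, compiled once for reuse
EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')

def extract_and_strip_emails(text):
    """Return the list of email addresses found and the text with them removed."""
    return EMAIL_RE.findall(text), EMAIL_RE.sub('', text)

sample = 'hi my email me at email@email.com another@email.com'
emails, cleaned = extract_and_strip_emails(sample)
print(emails)
print(repr(cleaned))
```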

Extract lists of emails into a temporary metadata column:

PYTHON
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))

Calculate the length of the email address list:

PYTHON
df['emails_count'] = df['emails'].apply(lambda x: len(x))

Display rows that contain at least one email address:

PYTHON
df[df['emails_count']>0].head()
OUTPUT
       twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts                        emails  emails_count
 4054  i want a new laptop. hp tx2000 is the bomb. :...           0           20          103      4.150000               6               0               0               0             4      [gabbehhramos@yahoo.com]             1
 7917  who stole elledell@gmail.com?                              0            3           31      9.000000               1               0               0               0             0          [elledell@gmail.com]             1
 8496  @alexistehpom really? did you send out all th...           0           20          130      5.500000              11               0               1               0             0        [missataari@gmail.com]             1
10290  @laureystack awh...that is kinda sad lol add ...           0            8           76      8.500000               0               0               1               0             0  [hello.kitty.65@hotmail.com]             1
16413  @jilliancyork got 2 bottom of it, human error...           0           21          137      5.428571               7               0               1               1             0          [press@linkedin.com]             1

Remove emails from the test string using regex substitution:

PYTHON
re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)
OUTPUT
'hi my email me at  '

Apply the email removal substitution across all tweet entries:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))

Verify email removal on the subset:

PYTHON
df[df['emails_count']>0].head()
OUTPUT
       twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts                        emails  emails_count
 4054  i want a new laptop. hp tx2000 is the bomb. :...           0           20          103      4.150000               6               0               0               0             4      [gabbehhramos@yahoo.com]             1
 7917  who stole ?                                                0            3           31      9.000000               1               0               0               0             0          [elledell@gmail.com]             1
 8496  @alexistehpom really? did you send out all th...           0           20          130      5.500000              11               0               1               0             0        [missataari@gmail.com]             1
10290  @laureystack awh...that is kinda sad lol add ...           0            8           76      8.500000               0               0               1               0             0  [hello.kitty.65@hotmail.com]             1
16413  @jilliancyork got 2 bottom of it, human error...           0           21          137      5.428571               7               0               1               1             0          [press@linkedin.com]             1

Count and Remove URLs

Initialize a test string containing a URL:

PYTHON
x = 'hi, to watch more visit https://youtube.com/kgptalkie'

Identify any URL links using regular expression matching:

PYTHON
re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
OUTPUT
[('https', 'youtube.com', '/kgptalkie')]

Count the total URLs present in each tweet string:

PYTHON
df['urls_flag'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))

Test URL removal with regex substitution on our test string:

PYTHON
re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)
OUTPUT
'hi, to watch more visit '
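Because the pattern above uses capturing groups, `re.findall` returns a tuple of pieces (scheme, host, path) for each URL. If whole URLs are preferred, the groups can be made non-capturing. The following is a sketch of that variation; `URL_RE` is an illustrative name and the pattern is otherwise the same as the one used above:

```python
import re

# Non-capturing groups (?:...) make findall return each full match as one string
URL_RE = re.compile(
    r'(?:http|ftp|https)://[\w-]+(?:\.[\w-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

sample = 'hi, to watch more visit https://youtube.com/kgptalkie'
urls = URL_RE.findall(sample)
stripped = URL_RE.sub('', sample)
print(urls)
print(repr(stripped))
```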

Apply URL removal across all tweet entries in the DataFrame:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))

Display the head of the updated DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  urls_flag
0  @switchfoot - awww, that is a bummer. you sh...            0           19          115      5.052632               4               0               1               0             1      []             0          1
1  is upset that he cannot update his facebook by...          0           21          111      4.285714               9               0               0               0             0      []             0          0
2  @kenichan i dived many times for the ball. man...          0           18           89      3.944444               7               0               1               0             1      []             0          0
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0               0             0      []             0          0
4  @nationwideclass no, it is not behaving at all...          0           21          111      4.285714              10               0               1               0             1      []             0          0

Display the first row text entry:

PYTHON
df.loc[0]['twitts']
OUTPUT
'@switchfoot  - awww, that is a bummer.  you shoulda got david carr of third day to do it. ;d'

Remove RT

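Remove retweet indicators (RT) from all tweets. Because the text was lowercased earlier, the pattern matches a standalone lowercase rt; a case-sensitive 'RT' would never match at this point:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x))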

Special Chars removal or punctuation removal

Remove special characters, symbols, and punctuation from the tweet text, keeping only letters, digits, spaces, and hyphens:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'[^A-Za-z0-9 -]+', '', x))

Verify punctuation removal:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  urls_flag
0  switchfoot - awww that is a bummer you shoul...            0           19          115      5.052632               4               0               1               0             1      []             0          1
1  is upset that he cannot update his facebook by...          0           21          111      4.285714               9               0               0               0             0      []             0          0
2  kenichan i dived many times for the ball manag...          0           18           89      3.944444               7               0               1               0             1      []             0          0
3  my whole body feels itchy and like its on fire             0           10           47      3.700000               5               0               0               0             0      []             0          0
4  nationwideclass no it is not behaving at all i...          0           21          111      4.285714              10               0               1               0             1      []             0          0

Remove multiple spaces

Initialize a test string with multiple consecutive spaces:

PYTHON
x = 'thanks    for    watching and    please    like this video'

Split on whitespace and rejoin using a single space character to strip extra spaces:

PYTHON
" ".join(x.split())
OUTPUT
'thanks for watching and please like this video'

Clean up multiple spaces across all tweet texts:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: " ".join(x.split()))
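The regex-based cleaning steps so far can be composed into one function, which makes the order of operations explicit and lets you test it on a single string before touching the DataFrame. This is a sketch, not the tutorial's own code: `clean_tweet` and the compiled pattern names are hypothetical, contraction expansion is left out for brevity, and the example address and URL are made up:

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')
URL_RE = re.compile(
    r'(?:http|ftp|https)://[\w-]+(?:\.[\w-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)
RT_RE = re.compile(r'\brt\b')               # standalone retweet marker, after lowercasing
NON_ALNUM_RE = re.compile(r'[^a-z0-9 -]+')  # keep letters, digits, spaces, hyphens

def clean_tweet(text):
    text = text.lower()
    text = EMAIL_RE.sub('', text)     # emails first, while '@' and '.' are still present
    text = URL_RE.sub('', text)       # URLs before punctuation is stripped
    text = RT_RE.sub('', text)
    text = NON_ALNUM_RE.sub('', text)
    return ' '.join(text.split())     # collapse repeated whitespace

print(clean_tweet('RT @user Check https://example.com/page NOW!!  mail me@example.com'))
```

The ordering is the important design choice: emails and URLs have to be removed before punctuation, otherwise their patterns no longer match.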

Display the first two rows:

PYTHON
df.head(2)
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  urls_flag
0  switchfoot - awww that is a bummer you shoulda...          0           19          115      5.052632               4               0               1               0             1      []             0          1
1  is upset that he cannot update his facebook by...          0           21          111      4.285714               9               0               0               0             0      []             0          0

Remove HTML tags

Import BeautifulSoup to handle HTML stripping:

PYTHON
from bs4 import BeautifulSoup

Initialize a test string with HTML content:

PYTHON
x = '<html><h2>Thanks for watching</h2></html>'

Strip HTML tags using lxml parser:

PYTHON
BeautifulSoup(x, 'lxml').get_text()
OUTPUT
'Thanks for watching'
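Parsing every tweet with BeautifulSoup is slow on 1.6 million rows. One possible shortcut, sketched below with the standard library only, is to skip strings that contain neither a tag nor an entity and parse only the rest. The class `_TextExtractor` and the function `strip_html` are hypothetical names, and this is a swapped-in technique rather than the BeautifulSoup call used in this tutorial:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    # Fast path: nothing that looks like a tag or an entity
    if '<' not in text and '&' not in text:
        return text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)

print(strip_html('<p>Thanks for <b>watching</b></p>'))
```

For this to have any effect in the pipeline, HTML stripping would need to run before punctuation removal, since that step already deletes the angle brackets.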

Apply HTML tag stripping to all tweet documents:

PYTHON
%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
OUTPUT
Wall time: 11min 37s

Remove Accented Chars

Import the unicodedata library:

PYTHON
import unicodedata

Initialize a string with accented characters:

PYTHON
x = 'Áccěntěd těxt'

Normalize and encode to ASCII to remove accent marks:

PYTHON
def remove_accented_chars(x):
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x

Verify normalization output:

PYTHON
remove_accented_chars(x)
OUTPUT
'Accented text'
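To see why this works: NFKD normalization splits an accented letter into its base letter plus a separate combining mark, and the ASCII encoding step then drops the mark. The sketch below shows a variation that removes only the combining marks, which leaves non-Latin text intact instead of deleting it. The name `strip_accents` is illustrative:

```python
import unicodedata

def strip_accents(text):
    # Decompose characters, then drop only the combining marks (accents)
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(len(unicodedata.normalize('NFKD', 'Á')))  # base letter plus one combining mark
print(strip_accents('Áccěntěd těxt'))
print(strip_accents('naïve 日本'))               # non-Latin characters survive
```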

SpaCy and NLP

In this section, we leverage spaCy's language models to perform tokenization, lemmatization, and linguistic parsing.

Remove Stop Words

Import spaCy library:

PYTHON
import spacy

Initialize a test string with multiple stop words:

PYTHON
x = 'this is stop words removal code is a the an how what'

Filter out any tokens listed in the standard stop words set:

PYTHON
" ".join([t for t in x.split() if t not in STOP_WORDS])
OUTPUT
'stop words removal code'

Apply stop word removal to all tweet texts:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))
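For sentiment tasks it may be worth keeping negation words, since the stop-word set printed earlier includes 'not', 'no', and 'never', and removing them can flip the apparent meaning of a tweet. A minimal sketch follows; the small `stop_words` set is a stand-in for spaCy's STOP_WORDS so the example runs without spaCy, and the choice of negation words is an assumption:

```python
# Stand-in for a few entries of spaCy's STOP_WORDS (assumption for illustration)
stop_words = {'this', 'is', 'not', 'no', 'never', 'a', 'the', 'it', 'at', 'all'}

# Keep negations out of the removal list
negations = {'not', 'no', 'never'}
sentiment_stop_words = stop_words - negations

def remove_stop_words(text, stop_set):
    return ' '.join(t for t in text.split() if t not in stop_set)

sample = 'no it is not behaving at all'
print(remove_stop_words(sample, stop_words))            # negations are lost
print(remove_stop_words(sample, sentiment_stop_words))  # negations are kept
```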

Inspect the stop-word-free DataFrame:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  urls_flag
0  switchfoot - awww bummer shoulda got david car...          0           19          115      5.052632               4               0               1               0             1      []             0          1
1  upset update facebook texting cry result schoo...          0           21          111      4.285714               9               0               0               0             0      []             0          0
2  kenichan dived times ball managed save 50 rest...          0           18           89      3.944444               7               0               1               0             1      []             0          0
3  body feels itchy like fire                                 0           10           47      3.700000               5               0               0               0             0      []             0          0
4  nationwideclass behaving mad                               0           21          111      4.285714              10               0               1               0             1      []             0          0

spaCy Lemmatization

Load the small English core spaCy pipeline:

PYTHON
nlp = spacy.load('en_core_web_sm')

Initialize a test string with varying inflected forms:

PYTHON
x = 'kenichan dived times ball managed save 50 rest'

Create a custom function to parse the text and return the lemma of each token:

PYTHON
def make_to_base(x):
    x_list = []
    doc = nlp(x)

    for token in doc:
        lemma = str(token.lemma_)
        # Keep the original token for pronouns ('-PRON-' in spaCy v2) and forms of 'be'
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    # Return the joined string instead of printing it, so the function works with df.apply
    return " ".join(x_list)

Verify lemmatization:

PYTHON
print(make_to_base(x))
OUTPUT
kenichan dive time ball manage save 50 rest

Common words removal

Join the first five cleaned tweets together to check word frequencies:

PYTHON
' '.join(df.head()['twitts'])
OUTPUT
'switchfoot - awww bummer shoulda got david carr day d upset update facebook texting cry result school today blah kenichan dived times ball managed save 50 rest bounds body feels itchy like fire nationwideclass behaving mad'

Aggregate and compute the most common words in our dataset:

PYTHON
text = ' '.join(df['twitts'])

Split string on spaces:

PYTHON
text = text.split()

Retrieve value count series:

PYTHON
freq_comm = pd.Series(text).value_counts()

Select the top 20 most frequent words:

PYTHON
f20 = freq_comm[:20]

Display top 20 list:

PYTHON
f20
OUTPUT
good      89366
day       82299
like      77735
-         69662
today     64512
going     64078
love      63421
work      62804
got       60749
time      56081
lol       55094
know      51172
im        50147
want      42070
new       41995
think     41040
night     41029
amp       40616
thanks    39311
home      39168
dtype: int64

Remove these 20 high-frequency words from all tweets (a membership test on a pandas Series checks its index, which here holds the words):

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in f20]))
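The same idea can be sketched without pandas using `collections.Counter`, converting the top words to a set for fast membership tests. The toy `tweets` list below is made up for illustration, and the cutoff of two words is arbitrary:

```python
from collections import Counter

# Toy corpus standing in for the cleaned tweet column
tweets = ['good day today', 'good day', 'good work', 'love this']

# Count every whitespace-separated token across the corpus
counts = Counter(word for tweet in tweets for word in tweet.split())

# Keep the most frequent words in a set for constant-time lookups
top_words = {word for word, _ in counts.most_common(2)}

filtered = [' '.join(w for w in tweet.split() if w not in top_words) for tweet in tweets]
print(top_words)
print(filtered)
```

Note that a tweet made up only of high-frequency words becomes an empty string, so it can be worth checking for empty rows after this step.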

Rare words removal

Find the 20 least common words in the dataset:

PYTHON
rare20 = freq_comm[-20:]

Display the bottom 20 list:

PYTHON
rare20
OUTPUT
veru              1
80-90f            1
refrigerant       1
demaisss          1
knittingsci-fi    1
wendireed         1
danielletuazon    1
chacha8           1
a-zquot           1
krustythecat      1
westmount         1
-appreciate       1
motocycle         1
madamhow          1
felspoon          1
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
dtype: int64

Isolate all words that occur exactly once:

PYTHON
rare = freq_comm[freq_comm.values == 1]

Display stats on words with single occurrences:

PYTHON
rare
OUTPUT
mamat             1
fiive             1
music-festival    1
leenahyena        1
11517             1
                 ..
fastbloke         1
900pmno           1
nxec              1
laassssttt        1
update-uri        1
Length: 536196, dtype: int64

Remove the 20 rarest words (rare20) from all tweets. To drop every word that occurs only once, filter against rare instead:

PYTHON
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
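Removing every singleton, rather than only 20 of them, can be sketched with a `Counter` and a set. The toy `tweets` list is made up for illustration (two of its words are borrowed from the rare-word output above):

```python
from collections import Counter

# Toy corpus standing in for the cleaned tweet column
tweets = ['upset update facebook', 'update facebook veru', 'facebook refrigerant update']

counts = Counter(word for tweet in tweets for word in tweet.split())

# Words that occur exactly once anywhere in the corpus
singletons = {word for word, c in counts.items() if c == 1}

filtered = [' '.join(w for w in tweet.split() if w not in singletons) for tweet in tweets]
print(sorted(singletons))
print(filtered)
```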

Verify removal on the first 5 rows:

PYTHON
df.head()
OUTPUT
   twitts                                             sentiment  word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails  emails_count  urls_flag
0  switchfoot awww bummer shoulda david carr d                0           19          115      5.052632               4               0               1               0             1      []             0          1
1  upset update facebook texting cry result schoo...          0           21          111      4.285714               9               0               0               0             0      []             0          0
2  kenichan dived times ball managed save 50 rest...          0           18           89      3.944444               7               0               1               0             1      []             0          0
3  body feels itchy fire                                      0           10           47      3.700000               5               0               0               0             0      []             0          0
4  nationwideclass behaving mad                               0           21          111      4.285714              10               0               1               0             1      []             0          0

Word Cloud Visualization

Import the WordCloud library to visually inspect the text tokens:

PYTHON
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline

Create a combined text corpus string of the first 20,000 words:

PYTHON
x = ' '.join(text[:20000])

Print the total word count in the dataset:

PYTHON
len(text)
OUTPUT
10837079

Generate and render the Word Cloud representation:

PYTHON
wc = WordCloud(width = 800, height=400).generate(x)
plt.imshow(wc)
plt.axis('off')
plt.show()

The Word Cloud below displays the most prominent terms among the first 20,000 tokens of the Twitter dataset:

Word cloud visualization of top Twitter dataset words

Spelling Correction

Import TextBlob:

PYTHON
from textblob import TextBlob

Initialize a misspelled test string:

PYTHON
x = 'tanks forr waching this vidio carri'

Correct spelling using TextBlob:

PYTHON
x = TextBlob(x).correct()

Check corrected output:

PYTHON
x
OUTPUT
TextBlob("tanks for watching this video carry")

Tokenization

Initialize a test string with no spacing around a hashtag:

PYTHON
x = 'thanks#watching this video. please like it'

Tokenize using TextBlob:

PYTHON
TextBlob(x).words
OUTPUT
WordList(['thanks', 'watching', 'this', 'video', 'please', 'like', 'it'])

Parse and print tokens using spaCy:

PYTHON
doc = nlp(x)
for token in doc:
    print(token)
OUTPUT
thanks#watching
this
video
.
please
like
it

TextBlob Lemmatization

Initialize inflected variants:

PYTHON
x = 'runs run running ran'

Import TextBlob Word wrapper:

PYTHON
from textblob import Word

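Lemmatize using TextBlob. Word.lemmatize() treats each word as a noun unless a part of speech is passed, which is why running and ran are left unchanged below: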

PYTHON
for token in x.split():
    print(Word(token).lemmatize())
OUTPUT
run
run
running
ran

Lemmatize using spaCy:

PYTHON
doc = nlp(x)
for token in doc:
    print(token.lemma_)
OUTPUT
run
run
run
run

Detect Entities using NER of SpaCy

Named Entity Recognition (NER) identifies span elements in unstructured text and groups them into predefined categories like places, organizations, or dates.

Initialize a news string:

PYTHON
x = "Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon"

Perform NER extraction:

PYTHON
doc = nlp(x)
for ent in doc.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
OUTPUT
Donald Trump - PERSON - People, including fictional
USA - GPE - Countries, cities, states

Import displacy:

PYTHON
from spacy import displacy

Render named entities visually in the document:

PYTHON
displacy.render(doc, style = 'ent')

Breaking News: Donald Trump PERSON , the president of the USA GPE is looking to sign a deal to mine the moon

Detecting Nouns

Verify the test string:

PYTHON
x
OUTPUT
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'

Extract the noun chunks using spaCy properties:

PYTHON
for noun in doc.noun_chunks:
    print(noun)
OUTPUT
Breaking News
Donald Trump
the president
the USA
a deal
the moon

Translation and Language Detection

Verify the test string:

PYTHON
x
OUTPUT
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'

Initialize TextBlob container:

PYTHON
tb = TextBlob(x)

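Detect language. Note that detect_language() and translate() relied on the Google Translate API; they were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so the next two cells only run on older TextBlob releases: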

PYTHON
tb.detect_language()
OUTPUT
'en'

Translate document text into Bengali:

PYTHON
tb.translate(to='bn')
OUTPUT
TextBlob("ব্রেকিং নিউজ: যুক্তরাষ্ট্রের রাষ্ট্রপতি ডোনাল্ড ট্রাম্প চাঁদটি খনির জন্য একটি চুক্তিতে সই করতে চাইছেন")

Use inbuilt sentiment classifier

Import NaiveBayesAnalyzer:

PYTHON
from textblob.sentiments import NaiveBayesAnalyzer

Initialize a positive test sentence:

PYTHON
x = 'we all stands together to fight with corona virus. we will win together'

Evaluate sentiment predictions:

PYTHON
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())

Check sentiment distribution:

PYTHON
tb.sentiment
OUTPUT
Sentiment(classification='pos', p_pos=0.8259779151942094, p_neg=0.17402208480578962)

Initialize a second test string:

PYTHON
x = 'we all are sufering from corona'

Evaluate sentiment:

PYTHON
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())

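Check sentiment prediction. The analyzer labels this sentence positive even though it reads as negative, a reminder that the built-in classifier (trained on a movie review corpus) can misjudge text from other domains: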

PYTHON
tb.sentiment
OUTPUT
Sentiment(classification='pos', p_pos=0.75616044472398, p_neg=0.2438395552760203)

Advanced Text Processing

In this section, we cover structural vectorization techniques to convert clean tokens into numeric feature matrices.

N-Grams

Initialize a test string:

PYTHON
x = 'thanks for watching'

Convert string to TextBlob object:

PYTHON
tb = TextBlob(x)

Extract 3-gram sequences:

PYTHON
tb.ngrams(3)
OUTPUT
[WordList(['thanks', 'for', 'watching'])]
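
An n-gram is simply a sliding window of n consecutive tokens. The sketch below shows the idea in plain Python; `ngrams` here is a hypothetical helper written for this example, not TextBlob's implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'thanks for watching'.split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # [('thanks', 'for'), ('for', 'watching')]
print(trigrams)  # [('thanks', 'for', 'watching')]
```

A three-word string yields two bigrams but only one trigram, which matches the single 3-gram shown above.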

Bag of Words (BoW) Representation

Initialize a list of dummy sentences:

PYTHON
x = ['this is first sentence this is', 'this is second', 'this is last']

Import CountVectorizer:

PYTHON
from sklearn.feature_extraction.text import CountVectorizer

Fit and generate BoW vectors:

PYTHON
cv = CountVectorizer(ngram_range=(1,1))
text_counts = cv.fit_transform(x)

Convert sparse representation to array:

PYTHON
text_counts.toarray()
OUTPUT
array([[1, 2, 0, 0, 1, 2],
       [0, 1, 0, 1, 0, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)

Check feature names mapping:

PYTHON
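# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cv.get_feature_names_out()
OUTPUT
array(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype=object)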

Convert array to a structured DataFrame:

PYTHON
bow = pd.DataFrame(text_counts.toarray(), columns = cv.get_feature_names_out())

Display BoW DataFrame:

PYTHON
bow
OUTPUT
   first  is  last  second  sentence  this
0      1   2     0       0         1     2
1      0   1     0       1         0     1
2      0   1     1       0         0     1

Verify documents:

PYTHON
x
OUTPUT
['this is first sentence this is', 'this is second', 'this is last']
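
To see what the vectorizer is doing, the same count matrix can be rebuilt by hand. The sketch below uses only `collections.Counter` and a plain whitespace split, which is an assumption that happens to match these three sentences; CountVectorizer's own tokenization (lowercasing, dropping single-character tokens) is more involved.

```python
from collections import Counter

docs = ['this is first sentence this is', 'this is second', 'this is last']

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one count per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

The printed rows line up with the BoW table above: each column is a vocabulary word and each cell is how often it occurs in that sentence.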

Term Frequency

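Term frequency measures the relative occurrence of a term in a specific document. The formula is:

$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}$$

Where:

  • $\mathrm{TF}(t, d)$: the normalized term frequency of term $t$ in document $d$.
  • $f_{t,d}$: the number of times term $t$ occurs in document $d$.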

Verify documents:

PYTHON
x
OUTPUT
['this is first sentence this is', 'this is second', 'this is last']

Display BoW matrix:

PYTHON
bow
OUTPUT
   first  is  last  second  sentence  this
0      1   2     0       0         1     2
1      0   1     0       1         0     1
2      0   1     1       0         0     1

Check matrix dimensions:

PYTHON
bow.shape
OUTPUT
(3, 6)

Copy the BoW DataFrame:

PYTHON
tf = bow.copy()

Compute normalized TF for each cell:

PYTHON
# Divide every count by its row total (vectorized; the result is float-typed)
tf = bow.div(bow.sum(axis=1), axis=0)

Display normalized TF DataFrame:

PYTHON
tf
OUTPUT
      first        is      last    second  sentence      this
0  0.166667  0.333333  0.000000  0.000000  0.166667  0.333333
1  0.000000  0.333333  0.000000  0.333333  0.000000  0.333333
2  0.000000  0.333333  0.333333  0.000000  0.000000  0.333333

Inverse Document Frequency IDF

Inverse Document Frequency (IDF) measures how rare a term is across the entire corpus. The formula used in scikit-learn when smooth_idf=True is:

$$\mathrm{IDF}(t) = \ln\left(\frac{N + 1}{\mathrm{df}(t) + 1}\right) + 1$$

Where:

  • $N$: total number of documents in the corpus.
  • $\mathrm{df}(t)$: number of documents containing term $t$.

Import NumPy:

PYTHON
import numpy as np

Convert string array to DataFrame:

PYTHON
x_df = pd.DataFrame(x, columns=['words'])

Display words:

PYTHON
x_df
OUTPUT
                            words
0  this is first sentence this is
1                  this is second
2                    this is last

Display BoW DataFrame:

PYTHON
bow
OUTPUT
   first  is  last  second  sentence  this
0      1   2     0       0         1     2
1      0   1     0       1         0     1
2      0   1     1       0         0     1

Get the total document count N:

PYTHON
N = bow.shape[0]
N
OUTPUT
3

Convert values to boolean flags to find document presence:

PYTHON
bb = bow.astype('bool')
bb
OUTPUT
   first    is   last  second  sentence  this
0   True  True  False   False      True  True
1  False  True  False    True     False  True
2  False  True   True   False     False  True

Count the documents that contain "is":

PYTHON
bb['is'].sum()
OUTPUT
3

Retrieve columns:

PYTHON
cols = bb.columns
cols
OUTPUT
Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')

Calculate total document occurrences for each term:

PYTHON
nz = []
for col in cols:
    nz.append(bb[col].sum())

Check occurrences list:

PYTHON
nz
OUTPUT
[1, 3, 1, 1, 1, 3]

Calculate IDF values:

PYTHON
idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)

Check IDF scores:

PYTHON
idf
OUTPUT
[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]

Review BoW DataFrame:

PYTHON
bow
OUTPUT
   first  is  last  second  sentence  this
0      1   2     0       0         1     2
1      0   1     0       1         0     1
2      0   1     1       0         0     1

TFIDF

Import TfidfVectorizer:

PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer

Fit and transform text using TfidfVectorizer:

PYTHON
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_df['words'])

Convert sparse matrix to array:

PYTHON
x_tfidf.toarray()
OUTPUT
array([[0.45688214, 0.5396839 , 0.        , 0.        , 0.45688214,
        0.5396839 ],
       [0.        , 0.45329466, 0.        , 0.76749457, 0.        ,
        0.45329466],
       [0.        , 0.45329466, 0.76749457, 0.        , 0.        ,
        0.45329466]])

Print fitted IDF scores:

PYTHON
tfidf.idf_
OUTPUT
array([1.69314718, 1.        , 1.69314718, 1.69314718, 1.69314718,
       1.        ])

Print calculated manual IDF scores:

PYTHON
idf
OUTPUT
[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]
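
The TF-IDF matrix printed above can be reproduced by hand: with its default settings, TfidfVectorizer multiplies each raw count by the smoothed IDF and then scales every row to unit L2 length. The sketch below works through that arithmetic in plain Python, assuming those defaults (smooth_idf=True, norm='l2'); the variable names are illustrative.

```python
import math

# Count matrix for the three example sentences (columns in vocabulary order).
counts = [[1, 2, 0, 0, 1, 2],
          [0, 1, 0, 1, 0, 1],
          [0, 1, 1, 0, 0, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed IDF, the same expression as the manual calculation above.
idf = [math.log((n_docs + 1) / (df + 1)) + 1 for df in doc_freq]

tfidf_rows = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]      # count * idf
    norm = math.sqrt(sum(v * v for v in weighted))    # L2 length of the row
    tfidf_rows.append([v / norm for v in weighted])

for row in tfidf_rows:
    print([round(v, 8) for v in row])
```

Because each row is rescaled to unit length, starting from raw counts or from the normalized TF values gives the same final matrix.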

Word Embeddings

A word embedding maps words to vectors of real numbers, so that words or phrases become points in a multi-dimensional vector space. Embeddings can be generated with neural networks, co-occurrence matrices, probabilistic models, and other methods.

SpaCy Word2Vec

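Load the large English pipeline to retrieve semantic word vectors. It is not covered by the install commands at the start of this tutorial, so download it first with python -m spacy download en_core_web_lg: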

PYTHON
nlp = spacy.load('en_core_web_lg')

Initialize sample tokens:

PYTHON
doc = nlp('thank you! dog cat lion dfasaa')

Check if each token is associated with a pre-trained vector representation:

PYTHON
for token in doc:
    print(token.text, token.has_vector)
OUTPUT
thank True
you True
! True
dog True
cat True
lion True
dfasaa False

Check token vector dimensions:

PYTHON
token.vector.shape
OUTPUT
(300,)

Check vector dimension on "cat":

PYTHON
nlp('cat').vector.shape
OUTPUT
(300,)

Calculate similarity scores across combinations of words:

PYTHON
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
    print()
OUTPUT
thank thank 1.0
thank you 0.5647585
thank ! 0.52147406
thank dog 0.2504265
thank cat 0.20648485
thank lion 0.13629764
thank dfasaa 0.0

you thank 0.5647585
you you 1.0
you ! 0.4390223
you dog 0.36494097
you cat 0.3080798
you lion 0.20392051
you dfasaa 0.0

! thank 0.52147406
! you 0.4390223
! ! 1.0
! dog 0.29852203
! cat 0.29702348
! lion 0.19601382
! dfasaa 0.0

dog thank 0.2504265
dog you 0.36494097
dog ! 0.29852203
dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dfasaa 0.0

cat thank 0.20648485
cat you 0.3080798
cat ! 0.29702348
cat dog 0.80168545
cat cat 1.0
cat lion 0.52654374
cat dfasaa 0.0

lion thank 0.13629764
lion you 0.20392051
lion ! 0.19601382
lion dog 0.47424486
lion cat 0.52654374
lion lion 1.0
lion dfasaa 0.0

dfasaa thank 0.0
dfasaa you 0.0
dfasaa ! 0.0
dfasaa dog 0.0
dfasaa cat 0.0
dfasaa lion 0.0
dfasaa dfasaa 1.0
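
By default, spaCy's `similarity` is a cosine similarity between the two vectors. The sketch below shows that calculation on made-up 3-dimensional vectors (real spaCy vectors here have 300 dimensions); returning 0.0 for an all-zero vector is a convention chosen for this example, in line with the 0.0 scores reported for the out-of-vocabulary token above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings", invented for illustration.
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
unknown = [0.0, 0.0, 0.0]

sim_dog_cat = cosine_similarity(dog, cat)
sim_dog_unknown = cosine_similarity(dog, unknown)
print(round(sim_dog_cat, 4), sim_dog_unknown)
```

Vectors that point in nearly the same direction score close to 1.0, which is why related words such as "dog" and "cat" rank far above unrelated pairs.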

Machine Learning Models for Text Classification

In this section, we compare machine learning models trained on Bag of Words features, manual features, and Word2Vec semantic embeddings.

BoW Features Setup

Inspect the shape of the main dataset DataFrame:

PYTHON
df.shape
OUTPUT
(1600000, 13)

Create a balanced sample dataset consisting of 2000 positive and 2000 negative sentiment rows:

PYTHON
df0 = df[df['sentiment']==0].sample(2000)
df4 = df[df['sentiment']==4].sample(2000)

Concatenate positive and negative samples:

PYTHON
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
dfr = pd.concat([df0, df4])

Check the dimension of the concatenated sample dataset:

PYTHON
dfr.shape
OUTPUT
(4000, 13)

Drop the raw text, label, and email-list columns to isolate the manual feature set:

PYTHON
dfr_feat = dfr.drop(labels=['twitts','sentiment','emails'], axis = 1).reset_index(drop=True)

Inspect the manual feature DataFrame:

PYTHON
dfr_feat
OUTPUT
      word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails_count  urls_flag
0              15           81      4.400000               6               0               0               0             0             0          0
1               8           47      4.875000               4               0               1               0             0             0          0
2              15           69      3.600000               6               0               1               0             0             0          0
3               9           42      3.666667               4               0               0               0             2             0          0
4              14           77      4.500000               5               0               0               0             0             0          0
...           ...          ...           ...             ...             ...             ...             ...           ...           ...        ...
3995            3           33      9.666667               1               0               1               0             0             0          0
3996           16           78      3.875000               4               0               1               0             2             0          0
3997           27          134      3.962963               9               0               1               0             2             0          0
3998            6           44      6.333333               0               0               1               1             1             0          0
3999            5           25      4.000000               3               0               1               0             0             0          0

4000 rows × 10 columns

Extract target sentiment labels:

PYTHON
y = dfr['sentiment']

Import CountVectorizer:

PYTHON
from sklearn.feature_extraction.text import CountVectorizer

Generate BoW representation for the sample dataset:

PYTHON
cv = CountVectorizer()
text_counts = cv.fit_transform(dfr['twitts'])

Check the size of the generated vocabulary feature space:

PYTHON
text_counts.toarray().shape
OUTPUT
(4000, 9750)

Construct a DataFrame from the generated vocabulary:

PYTHON
dfr_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())

Inspect the generated BoW DataFrame:

PYTHON
dfr_bow.head(2)
OUTPUT
007peter05060594091010010001000000000000000000000000000010381041...zomgzonkedzoozooeyzrovnazsharezskzwelzzzzzzzz
00000000000...0000000000
10000000000...0000000000

2 rows × 9750 columns

Classifier Models Setup

Import the classifier algorithms, evaluation metrics, and preprocessing utilities from scikit-learn:

PYTHON
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler

Initialize the five comparison models:

PYTHON
sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)
lgr = LogisticRegression(random_state=42, max_iter=200)
lgrcv = LogisticRegressionCV(cv = 2, random_state=42, max_iter=1000)
svm = LinearSVC(random_state=42, max_iter=200)
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)

Map models to standard shorthand keys:

PYTHON
clf = {'SGD': sgd, 'LGR': lgr, 'LGR-CV': lgrcv, 'SVM': svm, 'RFC': rfc}

Verify dictionary keys:

PYTHON
clf.keys()
OUTPUT
dict_keys(['SGD', 'LGR', 'LGR-CV', 'SVM', 'RFC'])

Create a generic training and evaluation pipeline function:

PYTHON
def classify(X, y):
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)
        print(key, " ---> ", ac)
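
One design point worth noting: `classify` fits the scaler on all rows before splitting, so the minimum and maximum of the test rows influence how the training data is scaled. A common alternative is to learn the ranges from the training rows only and reuse them for the test rows. The sketch below illustrates that idea in plain Python with made-up numbers; `fit_min_max` and `apply_min_max` are hypothetical helpers, not scikit-learn functions.

```python
def fit_min_max(rows):
    """Learn per-column (min, max) from the training rows only."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def apply_min_max(rows, ranges):
    """Scale rows with previously learned ranges; constant columns map to 0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled

# Made-up feature rows: [word_count, char_count]
train_rows = [[5, 25], [15, 81], [10, 50]]
test_rows = [[20, 60]]

ranges = fit_min_max(train_rows)              # learned from training data only
train_scaled = apply_min_max(train_rows, ranges)
test_scaled = apply_min_max(test_rows, ranges)

print(train_scaled)
print(test_scaled)
```

With scikit-learn this corresponds to calling `fit_transform` on `X_train` and `transform` on `X_test` after the split. Test values can then fall outside the 0 to 1 range, as the 1.5 here shows, which is expected.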

Evaluate accuracy scores using the BoW features representation:

PYTHON
%%time
classify(dfr_bow, y)
OUTPUT
SGD  --->  0.62375
LGR  --->  0.65375
LGR-CV  --->  0.6525
SVM  --->  0.6325
RFC  --->  0.6525
Wall time: 1min 42s
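
The scores printed by `classify` are plain accuracy: the share of test tweets whose predicted label equals the true label. The sketch below computes it by hand on made-up labels that follow the Sentiment140 coding used here (0 for negative, 4 for positive); `accuracy` is a hypothetical helper for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels: 0 = negative, 4 = positive.
y_true = [0, 0, 4, 4, 4, 0, 4, 0]
y_pred = [0, 4, 4, 4, 0, 0, 4, 4]

acc = accuracy(y_true, y_pred)
print(acc)  # 0.625
```

Since the sample is balanced, a classifier that always guessed one class would score 0.5, which is the baseline to compare these results against.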

Preview the manual feature set:

PYTHON
dfr_feat.head(2)
OUTPUT
        word_counts  char_counts  avg_word_len  stop_words_len  hashtags_count  mentions_count  numerics_count  upper_counts  emails_count  urls_flag
453843           15           81         4.400               6               0               0               0             0             0          0
388280            8           47         4.875               4               0               1               0             0             0          0

Run classifier evaluation on the manual features:

PYTHON
%%time
classify(dfr_feat, y)
OUTPUT
SGD  --->  0.64125
LGR  --->  0.645
LGR-CV  --->  0.65
SVM  --->  0.6475
RFC  --->  0.5675
Wall time: 1.35 s

Combine manual features and vocabulary-based BoW features:

PYTHON
X = dfr_feat.join(dfr_bow)

Evaluate accuracy on the combined features matrix:

PYTHON
%%time
classify(X, y)
OUTPUT
SGD  --->  0.64875
LGR  --->  0.67125
LGR-CV  --->  0.66125
SVM  --->  0.64375
RFC  --->  0.705
Wall time: 1min 18s

TF-IDF Features Setup

Import TfidfVectorizer:

PYTHON
from sklearn.feature_extraction.text import TfidfVectorizer

Verify dataset shape:

PYTHON
dfr.shape
OUTPUT
(4000, 13)

Vectorize using TfidfVectorizer:

PYTHON
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dfr['twitts'])

Train and evaluate the models using the generated TF-IDF features matrix:

PYTHON
%%time
classify(pd.DataFrame(X.toarray()), y)
OUTPUT
SGD  --->  0.635
LGR  --->  0.65125
LGR-CV  --->  0.6475
SVM  --->  0.63875
RFC  --->  0.6425
Wall time: 1min 37s

Word2Vec Features Setup

Create a function to calculate the average semantic vector representation of a sentence using spaCy:

PYTHON
def get_vec(x):
    doc = nlp(x)
    return doc.vector.reshape(1, -1)
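
In spaCy, `doc.vector` defaults to the average of the token vectors, which is how a tweet of any length becomes a single fixed-size row. The sketch below shows that averaging on made-up 4-dimensional vectors; `mean_vector` is a hypothetical helper and the numbers are invented.

```python
def mean_vector(token_vectors):
    """Element-wise average of equally sized token vectors."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# Hypothetical 4-dimensional token vectors for a three-word tweet.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]

sentence_vector = mean_vector(token_vectors)
print(sentence_vector)  # [2.0, 2.0, 1.0, 4.0]
```

Averaging discards word order, so two tweets with the same words in a different order receive the same vector.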

Apply the vectorization function to all tweets:

PYTHON
%%time
dfr['vec'] = dfr['twitts'].apply(lambda x: get_vec(x))
OUTPUT
Wall time: 51.8 s

Concatenate individual vectors into a single feature array:

PYTHON
X = np.concatenate(dfr['vec'].to_numpy(), axis = 0)

Check the shape of the feature matrix:

PYTHON
X.shape
OUTPUT
(4000, 300)

Evaluate accuracy on Word2Vec feature vectors:

PYTHON
classify(pd.DataFrame(X), y)
OUTPUT
SGD  --->  0.5925
LGR  --->  0.70625
LGR-CV  --->  0.69375
SVM  --->  0.70125
RFC  --->  0.66625

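Create a custom function to run predictions on Word2Vec inputs. Note that classify rescales features with a MinMaxScaler that is local to that function, so the raw spaCy vectors passed to predict here are on a different scale from the training data; treat the predictions below as illustrative: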

PYTHON
def predict_w2v(x):
    for key in clf.keys():
        y_pred = clf[key].predict(get_vec(x))
        print(key, "-->", y_pred)

Evaluate prediction on a positive review:

PYTHON
predict_w2v('hi, thanks for watching this video. please like and subscribe')
OUTPUT
SGD --> [0]
LGR --> [4]
LGR-CV --> [0]
SVM --> [4]
RFC --> [0]

Predict sentiment of a neutral request:

PYTHON
predict_w2v('please let me know if you want more video')
OUTPUT
SGD --> [0]
LGR --> [0]
LGR-CV --> [0]
SVM --> [0]
RFC --> [0]

Predict sentiment for clearly positive feedback:

PYTHON
predict_w2v('congratulation looking good congrats')
OUTPUT
SGD --> [4]
LGR --> [4]
LGR-CV --> [4]
SVM --> [4]
RFC --> [0]

Conclusion

In this tutorial, you built a complete text preprocessing and feature engineering pipeline in Python. Starting with raw tweets from the Sentiment140 dataset, you calculated meta-features like word counts and mentions, cleaned the text by stripping HTML tags and accented characters, and extracted lemmas using spaCy. From the cleaned text, you generated numerical representations using Bag of Words, TF-IDF, and Word2Vec, then trained five different classifiers on a balanced 4,000-tweet sample. Logistic Regression on dense Word2Vec embeddings achieved the highest accuracy at 70.6%, with LinearSVC on the same features (70.1%) and Random Forest on the combined manual and BoW features (70.5%) close behind.

Key takeaways:

  • Cleaning operations like contraction expansion and lemmatization reduce the overall size of the vocabulary, helping classifiers avoid overfitting.
  • spaCy provides ready-to-use, high-quality, pre-trained word embeddings that capture semantic similarities better than basic Bag of Words arrays.
  • Combining manual features (like word length or stop word counts) with Bag of Words counts improved accuracy over BoW alone for all five classifiers in this experiment, most notably Random Forest (65.3% versus 70.5%).
