Complete Text Processing for Beginners
Natural Language Processing (NLP) enables computers to understand, interpret, and manipulate human language. While raw text from tweets, documents, and transcripts is highly unstructured and messy, modern text processing pipelines can extract clean semantic representations to predict sentiment or classify topics.
In this tutorial, you will build an end-to-end text processing pipeline in Python. You will clean and preprocess the Sentiment140 Twitter dataset, extract features using Bag of Words, TF-IDF, and Word2Vec, and train multiple machine learning classifiers to predict sentiment.
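Prerequisites: Python 3.x, NumPy, pandas, spaCy, scikit-learn, TextBlob. The cleaning steps below also use BeautifulSoup (bs4) with the lxml parser, wordcloud, and Matplotlib.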
Dataset used in this tutorial: Sentiment140 Dataset on Kaggle
The diagram below illustrates the general workflow of an end-to-end NLP pipeline:

Installing Libraries
You can install spaCy and its associated English language models using pip:
# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg
Importing Libraries
We begin by importing the basic data manipulation libraries and the spaCy stop word list:
import pandas as pd
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
Next, load the Sentiment140 CSV dataset with Latin-1 encoding:
df = pd.read_csv('twitter16m.csv', encoding = 'latin1', header = None)
Display the first few rows of the loaded DataFrame:
df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
Keep only the text content (column index 5) and the target sentiment label (column index 0):
df = df[[5, 0]]
Assign descriptive column names to the target DataFrame:
df.columns = ['twitts', 'sentiment']
df.head()
| | twitts | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |
Check the class balance for positive and negative sentiment labels:
df['sentiment'].value_counts()
4 800000
0 800000
Name: sentiment, dtype: int64
Create a lookup dictionary mapping sentiment label integers to descriptive string categories:
sent_map = {0: 'negative', 4: 'positive'}
Word Counts
We can calculate word counts by splitting each tweet sentence on whitespace and finding the length of the resulting word list:
df['word_counts'] = df['twitts'].apply(lambda x: len(str(x).split()))
Display the DataFrame with the new word counts column:
df.head()
| | twitts | sentiment | word_counts |
|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 |
Characters Count
Next, we count the total number of characters in each tweet string:
df['char_counts'] = df['twitts'].apply(lambda x: len(x))
Display the DataFrame with the character counts included:
df.head()
| | twitts | sentiment | word_counts | char_counts |
|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 |
Average Word Length
We define a helper function to compute the average character length of words inside each tweet:
def get_avg_word_len(x):
    words = x.split()
    word_len = 0
    for word in words:
        word_len = word_len + len(word)
    # Divide by the word count; len(x)/len(words) would also count the spaces
    return word_len/len(words)
Apply the function to generate average word lengths:
df['avg_word_len'] = df['twitts'].apply(lambda x: get_avg_word_len(x))
For contrast, dividing the full string length by the number of words overstates the average, because the string length includes the spaces:
len('this is nlp lesson')/4
4.5
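The difference between the two calculations can be checked directly. The sketch below is a variant of the helper with a guard for empty strings; the guard and the name avg_word_len are additions for illustration, not part of the original notebook:

```python
def avg_word_len(text):
    """Mean number of characters per whitespace-separated token."""
    words = text.split()
    if not words:  # avoid ZeroDivisionError on empty or whitespace-only text
        return 0.0
    return sum(len(w) for w in words) / len(words)


sample = 'this is nlp lesson'
print(avg_word_len(sample))               # 3.75 (15 letters / 4 words)
print(len(sample) / len(sample.split()))  # 4.5  (18 characters, spaces included)
```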
Display the head of the updated DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len |
|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 |
Dividing char_counts by word_counts for the first row gives a larger value than avg_word_len, because char_counts includes whitespace:
115/19
6.052631578947368
Stop Words Count
Check the default list of stop words provided by spaCy:
print(STOP_WORDS)
{'one', 'up', 'further', 'herself', 'nevertheless', 'their', 'when', 'a', 'bottom', 'both', 'also', 'i', 'sometime', 'ours', "'d", 'him', 'together', 'former', 'hereafter', 'whereby', "'ll", 'three', 'same', 'is', 'say', 'hers', 'must', 'five', 'you', 'across', 'n‘t', 'mostly', 'into', 'am', 'myself', 'something', 'could', 'being', 'seems', 'go', 'only', 'fifteen', 'either', 'us', 'than', 'latter', 'so', 'after', 'name', 'there', 'that', 'next', 'even', 'without', 'along', 'behind', 'very', 'whereas', 'off', 'herein', 'although', 'such', 'themselves', 'then', 'in', 'under', 'of', 'onto', 'really', 'due', 'otherwise', 'give', 'yourself', 'indeed', 'my', 'mine', 'show', 'via', 'elsewhere', 'be', 'just', 'thence', 'them', 'beside', 'though', 'as', 'out', 'third', 'however', 'twelve', 'except', '‘d', 'anything', 'move', 'side', 'everything', 'all', 'towards', 'whatever', 'will', 'n’t', 'toward', 'keep', 'hereupon', 'might', 'no', 'own', 'itself', 'for', 'can', 'rather', 'whether', 'while', 'and', 'part', 'over', 'else', 'has', 'forty', 'about', 'hereby', 'sixty', 'using', 'here', 'please', 'often', '’re', 'any', 'ca', 'per', 'whole', 'it', 'are', 'from', 'had', 'thru', '’m', 'two', 'fifty', 'your', 'latterly', 'again', 'or', 'few', 'against', 'much', 'somewhere', 'but', '’d', 'somehow', 'never', 'becoming', 'down', 'regarding', 'always', 'other', 'amount', 'because', 'noone', 'anyone', 'six', 'each', 'thus', 'alone', 'why', 'his', 'sometimes', 'now', 'since', 'become', 'see', 'she', 'where', 'whereafter', 'various', 'perhaps', 'another', 'who', 'anyhow', 'yourselves', 'someone', 'ten', 'became', 'nothing', 'front', 'an', 'anyway', 'get', 'thereafter', "'re", 'our', 'call', 'therein', 'have', 'this', 'above', 'some', 'namely', '‘re', 'seem', 'until', '’ll', 'more', 'still', "n't", 'the', 'does', 'himself', 'take', 'he', 'which', 'seeming', 'been', 'beforehand', 'may', 'do', 'well', 'ever', 'used', 'enough', 'every', 'top', 'made', "'m", 'hundred', 'almost', 'her', 
'moreover', 'wherever', '’s', 'amongst', 'meanwhile', 'nobody', 'ourselves', 'whenever', 'at', 'wherein', 'nowhere', 'around', 'between', 'last', 'others', 'becomes', 'they', 'full', 'below', 'nor', 'before', 'what', 'within', 'these', 'besides', 'whereupon', 'how', 'throughout', 'eight', "'s", 'on', 'most', 'if', '‘ve', 'should', 'four', 'serious', 'thereby', '‘ll', 'whence', 'done', 'anywhere', 'yours', 'formerly', 'everyone', 'whose', 'back', 'make', 'among', 'first', 'we', '‘s', 'neither', 'doing', 'already', 'those', 'empty', 'did', 'not', '‘m', 'less', 'to', 'during', 'twenty', 'too', 'put', 'nine', 'yet', 'everywhere', 'quite', 'were', 'seemed', '’ve', 'through', 'once', 'whither', 'thereupon', 'whoever', "'ve", 'therefore', 'me', 'unless', 'whom', 'cannot', 'afterwards', 'none', 'least', 'hence', 'eleven', 'with', 'upon', 'was', 'would', 'by', 'beyond', 'several', 'its', 'many', 're'}
Initialize a test string:
x = 'this is text data'
Tokenize the test string:
x.split()
['this', 'is', 'text', 'data']
Filter out and count stop words in the token list:
len([t for t in x.split() if t in STOP_WORDS])
2
Compute the number of stop words contained in each tweet:
df['stop_words_len'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t in STOP_WORDS]))
Display the updated DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len |
|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 |
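One caveat with the membership test: the STOP_WORDS entries are lower-case, and punctuation attached to a token prevents a match, so tokens such as 'Is' or 'the,' are not counted when this step runs before lower-casing. A minimal sketch of a more forgiving counter follows. SAMPLE_STOP_WORDS is a hypothetical stand-in that keeps the example self-contained; in the tutorial you would pass STOP_WORDS instead.

```python
import string

# Hypothetical stand-in for spaCy's STOP_WORDS, used only to keep the sketch self-contained
SAMPLE_STOP_WORDS = {'this', 'is', 'the', 'a', 'i'}


def count_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Count tokens that are stop words after lower-casing and trimming punctuation."""
    count = 0
    for token in text.split():
        normalized = token.lower().strip(string.punctuation)
        if normalized in stop_words:
            count += 1
    return count


print(count_stop_words('this is text data'))        # 2
print(count_stop_words('This IS, the text data.'))  # 3
```

Counts produced this way will generally be higher than the stop_words_len values shown in the table, which come from the exact-match version.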
Count #HashTags and @Mentions
Initialize a sample string with a hashtag and a mention:
x = 'this #hashtag and this is @mention'
# x = x.split()
# x
Find all tokens starting with @:
[t for t in x.split() if t.startswith('@')]
['@mention']
Calculate the occurrences of hashtags (#) and user mentions (@) across all tweets:
df['hashtags_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('#')]))
df['mentions_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.startswith('@')]))
Inspect the head of the updated DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count |
|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 |
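The startswith check only sees hashtags and mentions that begin a whitespace-separated token, so '(@user)' is missed. A regular expression is one alternative. The sketch below assumes a tag is '#' or '@' followed by word characters and not preceded by a word character, which also keeps the domain part of an email address from being counted as a mention. The names HASHTAG_RE, MENTION_RE and count_tags are illustrative.

```python
import re

# '#' or '@' followed by word characters, not glued to a preceding word character
HASHTAG_RE = re.compile(r'(?<!\w)#\w+')
MENTION_RE = re.compile(r'(?<!\w)@\w+')


def count_tags(text):
    """Return (hashtag_count, mention_count) for one tweet."""
    return len(HASHTAG_RE.findall(text)), len(MENTION_RE.findall(text))


print(count_tags('this #hashtag and this is @mention'))       # (1, 1)
print(count_tags('(@user) loves #nlp! mail me@example.com'))  # (1, 1)
```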
Numeric Tokens Count
Find and count space-separated digit tokens in the tweets:
df['numerics_count'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isdigit()]))
Inspect the head of the DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count |
|---|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 |
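Note that str.isdigit() is only true for tokens made up entirely of digits, so values such as '27.5' or '1,000' are not counted. If those should count as numbers, a pattern match is one option. A minimal sketch, assuming a numeric token is a run of digits optionally followed by groups separated by '.' or ',' (NUMERIC_RE and count_numeric_tokens are illustrative names):

```python
import re

# Digits, optionally followed by '.'- or ','-separated digit groups (e.g. 27.5, 1,000)
NUMERIC_RE = re.compile(r'\d+(?:[.,]\d+)*')


def count_numeric_tokens(text):
    """Count whitespace-separated tokens that consist entirely of a number."""
    return sum(1 for token in text.split() if NUMERIC_RE.fullmatch(token))


sample = 'i ran 5 km in 27.5 minutes'
print(count_numeric_tokens(sample))                   # 2
print(len([t for t in sample.split() if t.isdigit()]))  # 1
```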
UPPER case words count
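Count fully upper-case tokens. As written, the length check len(x)>3 applies to the whole tweet rather than to each token, so short tokens such as ;D are counted as well: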
df['upper_counts'] = df['twitts'].apply(lambda x: len([t for t in x.split() if t.isupper() and len(x)>3]))
Inspect the head of the DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 |
| 1 | is upset that he can't update his Facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 |
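To count only upper-case tokens that are themselves longer than three characters, the length test has to be applied to each token. A minimal sketch (count_upper_tokens and min_len are illustrative names, not from the original notebook):

```python
def count_upper_tokens(text, min_len=3):
    """Count tokens that are fully upper-case and longer than min_len characters."""
    return len([t for t in text.split() if t.isupper() and len(t) > min_len])


tweet = ("so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER "
         "I WNT B ABLE 2 DO MUCH BUT OHH WELL.....")
print(count_upper_tokens(tweet))        # 6: SADLY, SINCE, EASTER, ABLE, MUCH, WELL.....
print(count_upper_tokens('do it. ;D'))  # 0: ';D' is upper-case but only two characters
```

Values computed this way will differ from the upper_counts column shown in the tables, which was produced with the whole-tweet length check.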
View a specific tweet index to check formatting:
df.loc[96]['twitts']
"so rylee,grace...wana go steve's party or not?? SADLY SINCE ITS EASTER I WNT B ABLE 2 DO MUCH BUT OHH WELL....."
Preprocessing and Cleaning
In this section, we apply standard normalization, contraction expansion, and regex filters to prepare the dataset.
Lower case conversion
Normalize the text by mapping all characters in the tweet column to lower case:
df['twitts'] = df['twitts'].apply(lambda x: x.lower())
Verify case conversion on the first two rows:
df.head(2)
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 |
| 1 | is upset that he can't update his facebook by ... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 |
Contraction to Expansion
Initialize a raw test string with various common contractions:
x = "i don't know what you want, can't, he'll, i'd"
Define a dictionary mapping English contractions to their expanded forms:
contractions = {
"ain't": "am not",
"aren't": "are not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he would",
"he'd've": "he would have",
"he'll": "he will",
"he'll've": "he will have",
"he's": "he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how does",
"i'd": "i would",
"i'd've": "i would have",
"i'll": "i will",
"i'll've": "i will have",
"i'm": "i am",
"i've": "i have",
"isn't": "is not",
"it'd": "it would",
"it'd've": "it would have",
"it'll": "it will",
"it'll've": "it will have",
"it's": "it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she would",
"she'd've": "she would have",
"she'll": "she will",
"she'll've": "she will have",
"she's": "she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so is",
"that'd": "that would",
"that'd've": "that would have",
"that's": "that is",
"there'd": "there would",
"there'd've": "there would have",
"there's": "there is",
"they'd": "they would",
"they'd've": "they would have",
"they'll": "they will",
"they'll've": "they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
" u ": " you ",
" ur ": " your ",
" n ": " and "}
Define a utility function to apply string replacements from the contractions dictionary:
def cont_to_exp(x):
    # Only strings are processed; other values (e.g. NaN) are returned unchanged
    if isinstance(x, str):
        for key in contractions:
            value = contractions[key]
            x = x.replace(key, value)
        return x
    else:
        return x
Test the expansion on a sample string:
x = "hi, i'd be happy"
Run the contraction expansion function:
cont_to_exp(x)
'hi, i would be happy'
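Because str.replace runs through the keys in dictionary order and matches substrings, a longer contraction such as can't've is never expanded as a unit: can't is rewritten first. One way around this is a single regular expression that tries longer keys first and only matches whole tokens. The sketch below uses a small subset of the dictionary to stay short; CONTRACTIONS_SUBSET and expand_contractions are illustrative names, not part of the original notebook.

```python
import re

# Small illustrative subset of the contractions dictionary
CONTRACTIONS_SUBSET = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
    "he'll": "he will",
    "i'd": "i would",
}

# Longest keys first, so "can't've" is tried before "can't"
_keys = sorted(CONTRACTIONS_SUBSET, key=len, reverse=True)
_pattern = re.compile(r"(?<!\w)(" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)")


def expand_contractions(text):
    """Replace each whole-token contraction with its expansion in one pass."""
    return _pattern.sub(lambda m: CONTRACTIONS_SUBSET[m.group(1)], text)


print(expand_contractions("i don't know what you want, can't, he'll, i'd"))
# i do not know what you want, cannot, he will, i would
print(expand_contractions("you can't've seen it"))
# you cannot have seen it
```

A single compiled pattern also scans each tweet once instead of once per dictionary key, which may help with the run time on 1.6 million rows.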
Apply contraction expansion across all tweet texts:
%%time
df['twitts'] = df['twitts'].apply(lambda x: cont_to_exp(x))
Wall time: 52.7 s
Inspect the expanded tweet entries:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - awww, t... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 |
| 1 | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 |
| 2 | @kenichan i dived many times for the ball. man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 |
| 4 | @nationwideclass no, it is not behaving at all... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 |
Count and Remove Emails
Import the Python regular expressions library:
import re
Initialize a test string containing email addresses:
x = 'hi my email me at email@email.com another@email.com'
Find all email addresses using regex pattern matching:
re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x)
['email@email.com', 'another@email.com']
Extract lists of emails into a temporary metadata column:
df['emails'] = df['twitts'].apply(lambda x: re.findall(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', x))
Calculate the length of the email address list:
df['emails_count'] = df['emails'].apply(lambda x: len(x))
Display rows that contain at least one email address:
df[df['emails_count']>0].head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4054 | i want a new laptop. hp tx2000 is the bomb. :... | 0 | 20 | 103 | 4.150000 | 6 | 0 | 0 | 0 | 4 | [gabbehhramos@yahoo.com] | 1 |
| 7917 | who stole elledell@gmail.com? | 0 | 3 | 31 | 9.000000 | 1 | 0 | 0 | 0 | 0 | [elledell@gmail.com] | 1 |
| 8496 | @alexistehpom really? did you send out all th... | 0 | 20 | 130 | 5.500000 | 11 | 0 | 1 | 0 | 0 | [missataari@gmail.com] | 1 |
| 10290 | @laureystack awh...that is kinda sad lol add ... | 0 | 8 | 76 | 8.500000 | 0 | 0 | 1 | 0 | 0 | [hello.kitty.65@hotmail.com] | 1 |
| 16413 | @jilliancyork got 2 bottom of it, human error... | 0 | 21 | 137 | 5.428571 | 7 | 0 | 1 | 1 | 0 | [press@linkedin.com] | 1 |
Remove emails from the test string using regex substitution:
re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x)
'hi my email me at '
Apply the email removal substitution across all tweet entries:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', x))
Verify email removal on the subset:
df[df['emails_count']>0].head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4054 | i want a new laptop. hp tx2000 is the bomb. :... | 0 | 20 | 103 | 4.150000 | 6 | 0 | 0 | 0 | 4 | [gabbehhramos@yahoo.com] | 1 |
| 7917 | who stole ? | 0 | 3 | 31 | 9.000000 | 1 | 0 | 0 | 0 | 0 | [elledell@gmail.com] | 1 |
| 8496 | @alexistehpom really? did you send out all th... | 0 | 20 | 130 | 5.500000 | 11 | 0 | 1 | 0 | 0 | [missataari@gmail.com] | 1 |
| 10290 | @laureystack awh...that is kinda sad lol add ... | 0 | 8 | 76 | 8.500000 | 0 | 0 | 1 | 0 | 0 | [hello.kitty.65@hotmail.com] | 1 |
| 16413 | @jilliancyork got 2 bottom of it, human error... | 0 | 21 | 137 | 5.428571 | 7 | 0 | 1 | 1 | 0 | [press@linkedin.com] | 1 |
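The same pattern appears three times in this section. Compiling it once and wrapping extraction and removal in one helper keeps the two steps consistent. A minimal sketch, using the pattern from the tutorial without the outer capture group (EMAIL_RE and extract_and_remove_emails are illustrative names):

```python
import re

# Same email pattern as in the tutorial, compiled once for reuse
EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')


def extract_and_remove_emails(text):
    """Return (list of emails found, text with emails removed and spaces collapsed)."""
    emails = EMAIL_RE.findall(text)
    cleaned = ' '.join(EMAIL_RE.sub('', text).split())
    return emails, cleaned


print(extract_and_remove_emails('hi my email me at email@email.com another@email.com'))
# (['email@email.com', 'another@email.com'], 'hi my email me at')
print(extract_and_remove_emails('who stole elledell@gmail.com?'))
# (['elledell@gmail.com'], 'who stole ?')
```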
Count and Remove URLs
Initialize a test string containing a URL:
x = 'hi, to watch more visit https://youtube.com/kgptalkie'
Identify any URL links using regular expression matching:
re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)
[('https', 'youtube.com', '/kgptalkie')]
Count the total URLs present in each tweet string:
df['urls_flag'] = df['twitts'].apply(lambda x: len(re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', x)))
Test URL removal with regex substitution on our test string:
re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)
'hi, to watch more visit '
Apply URL removal across all tweet entries in the DataFrame:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x))
Display the head of the updated DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | @switchfoot - awww, that is a bummer. you sh... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1 |
| 1 | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 2 | @kenichan i dived many times for the ball. man... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 4 | @nationwideclass no, it is not behaving at all... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
Display the first row text entry:
df.loc[0]['twitts']
'@switchfoot - awww, that is a bummer. you shoulda got david carr of third day to do it. ;d'
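The findall call earlier in this section returns a tuple of (scheme, host, path) per match because the pattern contains capture groups. If you want the complete URL strings instead, the groups can be made non-capturing. A minimal sketch with the same pattern rewritten that way (URL_RE and find_urls are illustrative names):

```python
import re

# Tutorial's URL pattern with non-capturing groups, so findall returns whole matches
URL_RE = re.compile(
    r'(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)


def find_urls(text):
    """Return every full URL found in the text."""
    return URL_RE.findall(text)


print(find_urls('hi, to watch more visit https://youtube.com/kgptalkie'))
# ['https://youtube.com/kgptalkie']
print(find_urls('see http://twitpic.com/2y1zl.'))
# ['http://twitpic.com/2y1zl']  (the trailing period is not part of the match)
```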
Remove RT
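Remove retweet indicators (RT) from all tweets. The text is already lower-cased at this point, so the pattern matches the lower-case token rt on word boundaries:
df['twitts'] = df['twitts'].apply(lambda x: re.sub(r'\brt\b', '', x))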
Special Chars removal or punctuation removal
Remove special characters, symbols, and punctuation from the tweet text, keeping only letters, digits, spaces, and hyphens:
df['twitts'] = df['twitts'].apply(lambda x: re.sub('[^A-Z a-z 0-9-]+', '', x))
Verify punctuation removal:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | switchfoot - awww that is a bummer you shoul... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1 |
| 1 | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 2 | kenichan i dived many times for the ball manag... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 4 | nationwideclass no it is not behaving at all i... | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
Remove multiple spaces
Initialize a test string with multiple consecutive spaces:
x = 'thanks    for    watching and    please like this video'
Split on whitespace and rejoin using a single space character to strip extra spaces:
" ".join(x.split())
'thanks for watching and please like this video'
Clean up multiple spaces across all tweet texts:
df['twitts'] = df['twitts'].apply(lambda x: " ".join(x.split()))
Display the first two rows:
df.head(2)
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | switchfoot - awww that is a bummer you shoulda... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1 |
| 1 | is upset that he cannot update his facebook by... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
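The regex-based cleaning steps so far can be chained into a single function, which is easier to test and to reuse on new text. The sketch below is a minimal version under a few assumptions: it omits contraction expansion, it matches rt in lower case because the text is lower-cased first, and the sample tweet is invented for illustration. The name clean_tweet is not from the original notebook.

```python
import re

EMAIL_RE = re.compile(r'[a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+')
URL_RE = re.compile(
    r'(?:http|ftp|https)://[\w_-]+(?:\.[\w_-]+)+'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)
RT_RE = re.compile(r'\brt\b')              # retweet marker, after lower-casing
SPECIAL_RE = re.compile(r'[^a-z0-9 -]+')   # keep letters, digits, spaces, hyphens


def clean_tweet(text):
    """Apply the cleaning steps in the same order as the tutorial."""
    text = text.lower()
    text = EMAIL_RE.sub('', text)
    text = URL_RE.sub('', text)
    text = RT_RE.sub('', text)
    text = SPECIAL_RE.sub('', text)
    return ' '.join(text.split())          # collapse repeated whitespace


# Invented sample tweet, for illustration only
sample = 'RT @Sample_User: Check https://example.com/page now!!  Mail me@example.com'
print(clean_tweet(sample))  # sampleuser check now mail
```

With a DataFrame, the function could be applied in one pass with df['twitts'].apply(clean_tweet) instead of one apply call per step.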
Remove HTML tags
Import BeautifulSoup to handle HTML stripping:
from bs4 import BeautifulSoup
Initialize a test string with HTML content:
x = '<html><h2>Thanks for watching</h2></html>'
Strip HTML tags using lxml parser:
BeautifulSoup(x, 'lxml').get_text()
'Thanks for watching'
Apply HTML tag stripping to all tweet documents:
%%time
df['twitts'] = df['twitts'].apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
Wall time: 11min 37s
Remove Accented Chars
Import the unicodedata library:
import unicodedata
Initialize a string with accented characters:
x = 'Áccěntěd těxt'
Normalize and encode to ASCII to remove accent marks:
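def remove_accented_chars(x):
    # NFKD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with 'ignore' then drops the marks
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return x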
Verify normalization output:
remove_accented_chars(x)
'Accented text'
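Encoding to ASCII also discards every character that has no ASCII form, such as 'ß' or non-Latin scripts. If only the accent marks should go, one alternative is to drop the combining characters after decomposition and keep everything else. A minimal sketch (strip_accents is an illustrative name, not part of the original notebook):

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after NFKD decomposition, keeping all base characters."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents('Áccěntěd těxt'))  # Accented text
print(strip_accents('naïve café'))     # naive cafe
print(strip_accents('straße'))         # straße (kept, since ß has no decomposition)
```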
SpaCy and NLP
In this section, we leverage spaCy's language models to perform tokenization, lemmatization, and linguistic parsing.
Remove Stop Words
Import spaCy library:
import spacy
Initialize a test string with multiple stop words:
x = 'this is stop words removal code is a the an how what'
Filter out any tokens listed in the standard stop words set:
" ".join([t for t in x.split() if t not in STOP_WORDS])
'stop words removal code'
Apply stop word removal to all tweet texts:
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in STOP_WORDS]))
Inspect the stop-word-free DataFrame:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | switchfoot - awww bummer shoulda got david car... | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1 |
| 1 | upset update facebook texting cry result schoo... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 2 | kenichan dived times ball managed save 50 rest... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
| 3 | body feels itchy like fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 4 | nationwideclass behaving mad | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
spaCy Lemmatization
Load the small English core spaCy pipeline:
nlp = spacy.load('en_core_web_sm')
Initialize a test string with varying inflected forms:
x = 'kenichan dived times ball managed save 50 rest'
Create a custom function to parse the text and extract lemma representations:
def make_to_base(x):
    x_list = []
    doc = nlp(x)
    for token in doc:
        lemma = str(token.lemma_)
        # Keep the surface form for pronouns ('-PRON-' in spaCy v2) and forms of 'be'
        if lemma == '-PRON-' or lemma == 'be':
            lemma = token.text
        x_list.append(lemma)
    print(" ".join(x_list))
Verify lemmatization:
make_to_base(x)
kenichan dive time ball manage save 50 rest
Common words removal
Join the first five cleaned tweets together to check word frequencies:
' '.join(df.head()['twitts'])
'switchfoot - awww bummer shoulda got david carr day d upset update facebook texting cry result school today blah kenichan dived times ball managed save 50 rest bounds body feels itchy like fire nationwideclass behaving mad'
Aggregate and compute the most common words in our dataset:
text = ' '.join(df['twitts'])
Split string on spaces:
text = text.split()
Retrieve value count series:
freq_comm = pd.Series(text).value_counts()
Select the top 20 most frequent words:
f20 = freq_comm[:20]
Display top 20 list:
f20
good 89366
day 82299
like 77735
- 69662
today 64512
going 64078
love 63421
work 62804
got 60749
time 56081
lol 55094
know 51172
im 50147
want 42070
new 41995
think 41040
night 41029
amp 40616
thanks 39311
home 39168
dtype: int64
Remove these 20 high-frequency words from all tweets:
df['twitts'] = df['twitts'].apply(lambda x: " ".join([t for t in x.split() if t not in f20]))
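Note that t not in f20 works because the in operator on a pandas Series tests its index, which here holds the words. The same filtering can also be done without pandas. A minimal sketch using collections.Counter and a set for fast membership tests; remove_top_k and the three sample documents are made up for illustration:

```python
from collections import Counter


def remove_top_k(docs, k):
    """Drop the k most frequent tokens from every document; return (docs, removed set)."""
    counts = Counter(token for doc in docs for token in doc.split())
    top = {word for word, _ in counts.most_common(k)}
    cleaned = [' '.join(t for t in doc.split() if t not in top) for doc in docs]
    return cleaned, top


docs = ['good day today', 'good work', 'love this good day']
cleaned, removed = remove_top_k(docs, k=2)
print(removed)  # {'good', 'day'} (set order may vary)
print(cleaned)  # ['today', 'work', 'love this']
```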
Rare words removal
Find the 20 least common words in the dataset:
rare20 = freq_comm[-20:]
Display the bottom 20 list:
rare20
veru 1
80-90f 1
refrigerant 1
demaisss 1
knittingsci-fi 1
wendireed 1
danielletuazon 1
chacha8 1
a-zquot 1
krustythecat 1
westmount 1
-appreciate 1
motocycle 1
madamhow 1
felspoon 1
fastbloke 1
900pmno 1
nxec 1
laassssttt 1
update-uri 1
dtype: int64
Isolate all words that occur exactly once:
rare = freq_comm[freq_comm.values == 1]
Display stats on words with single occurrences:
rare
mamat 1
fiive 1
music-festival 1
leenahyena 1
11517 1
..
fastbloke 1
900pmno 1
nxec 1
laassssttt 1
update-uri 1
Length: 536196, dtype: int64
Remove the 20 rarest words (rare20) from all tweets. The full set of single-occurrence words in rare is left in place here:
df['twitts'] = df['twitts'].apply(lambda x: ' '.join([t for t in x.split() if t not in rare20]))
Verify removal on the first 5 rows:
df.head()
| | twitts | sentiment | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | switchfoot awww bummer shoulda david carr d | 0 | 19 | 115 | 5.052632 | 4 | 0 | 1 | 0 | 1 | [] | 0 | 1 |
| 1 | upset update facebook texting cry result schoo... | 0 | 21 | 111 | 4.285714 | 9 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 2 | kenichan dived times ball managed save 50 rest... | 0 | 18 | 89 | 3.944444 | 7 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
| 3 | body feels itchy fire | 0 | 10 | 47 | 3.700000 | 5 | 0 | 0 | 0 | 0 | [] | 0 | 0 |
| 4 | nationwideclass behaving mad | 0 | 21 | 111 | 4.285714 | 10 | 0 | 1 | 0 | 1 | [] | 0 | 0 |
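If the goal is to drop every word that occurs only once (the 536,196 entries in rare) rather than just the last 20, the lookup structure matters at this scale. A minimal sketch that builds a set of singleton words and filters against it; remove_singletons and the sample documents are illustrative, not from the original notebook:

```python
from collections import Counter


def remove_singletons(docs):
    """Remove every token that occurs exactly once across all documents."""
    counts = Counter(token for doc in docs for token in doc.split())
    rare = {word for word, n in counts.items() if n == 1}  # set gives O(1) lookups
    return [' '.join(t for t in doc.split() if t not in rare) for doc in docs]


docs = ['good day veru', 'good day', 'nxec good']
print(remove_singletons(docs))  # ['good day', 'good day', 'good']
```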
Word Cloud Visualization
Import the WordCloud library to visually inspect the text tokens:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
Create a combined text corpus string of the first 20,000 words:
x = ' '.join(text[:20000])
Print the total word count in the dataset:
len(text)
10837079
Generate and render the Word Cloud representation:
wc = WordCloud(width = 800, height=400).generate(x)
plt.imshow(wc)
plt.axis('off')
plt.show()
The word cloud below shows the most prominent terms among the first 20,000 tokens of the cleaned corpus:

Spelling Correction
Import TextBlob:
from textblob import TextBlob
Initialize a misspelled test string:
x = 'tanks forr waching this vidio carri'
Correct spelling using TextBlob:
x = TextBlob(x).correct()
Check corrected output:
x
TextBlob("tanks for watching this video carry")
Tokenization
Initialize a test string with no spacing around a hashtag:
x = 'thanks#watching this video. please like it'
Tokenize using TextBlob:
TextBlob(x).words
WordList(['thanks', 'watching', 'this', 'video', 'please', 'like', 'it'])
Parse and print tokens using spaCy:
doc = nlp(x)
for token in doc:
    print(token)
thanks#watching
this
video
.
please
like
it
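In the output above, spaCy keeps `thanks#watching` as a single token, while TextBlob splits it and drops the punctuation. If neither behavior suits your data, one option is a small regular-expression tokenizer. The pattern and the function name `simple_tokenize` below are illustrative assumptions, not what either library does internally.

```python
import re

def simple_tokenize(text):
    """Return runs of word characters, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize('thanks#watching this video. please like it')
print(tokens)
```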
TextBlob Lemmatization
Initialize inflected variants:
x = 'runs run running ran'
Import TextBlob Word wrapper:
from textblob import Word
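Lemmatize using TextBlob. Word.lemmatize() treats each word as a noun unless a part of speech is given, so the verb forms "running" and "ran" come back unchanged:
for token in x.split():
    print(Word(token).lemmatize())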
run
run
running
ran
Lemmatize using spaCy:
doc = nlp(x)
for token in doc:
    print(token.lemma_)
run
run
run
run
Detect Entities using NER of SpaCy
Named Entity Recognition (NER) identifies span elements in unstructured text and groups them into predefined categories like places, organizations, or dates.
Initialize a news string:
x = "Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon"
Perform NER extraction:
doc = nlp(x)
for ent in doc.ents:
    print(ent.text + ' - ' + ent.label_ + ' - ' + str(spacy.explain(ent.label_)))
Donald Trump - PERSON - People, including fictional
USA - GPE - Countries, cities, states
Import displacy:
from spacy import displacy
Render named entities visually in the document:
displacy.render(doc, style = 'ent')
Breaking News: Donald Trump PERSON , the president of the USA GPE is looking to sign a deal to mine the moon
Detecting Nouns
Verify the test string:
x
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
Extract the noun chunks using spaCy properties:
for noun in doc.noun_chunks:
    print(noun)
Breaking News
Donald Trump
the president
the USA
a deal
the moon
Translation and Language Detection
Verify the test string:
x
'Breaking News: Donald Trump, the president of the USA is looking to sign a deal to mine the moon'
Initialize TextBlob container:
tb = TextBlob(x)
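Detect language. Note that detect_language() and translate() relied on the Google Translate web service; they were deprecated in TextBlob 0.16.0 and removed in later releases, so the calls below only run on older versions: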
tb.detect_language()
'en'
Translate document text into Bengali:
tb.translate(to='bn')
TextBlob("ব্রেকিং নিউজ: যুক্তরাষ্ট্রের রাষ্ট্রপতি ডোনাল্ড ট্রাম্প চাঁদটি খনির জন্য একটি চুক্তিতে সই করতে চাইছেন")
Use inbuilt sentiment classifier
Import NaiveBayesAnalyzer:
from textblob.sentiments import NaiveBayesAnalyzer
Initialize a positive test sentence:
x = 'we all stands together to fight with corona virus. we will win together'
Evaluate sentiment predictions:
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
Check sentiment distribution:
tb.sentiment
Sentiment(classification='pos', p_pos=0.8259779151942094, p_neg=0.17402208480578962)
Initialize a second test string:
x = 'we all are sufering from corona'
Evaluate sentiment:
tb = TextBlob(x, analyzer=NaiveBayesAnalyzer())
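Check the sentiment prediction. The analyzer labels this negative sentence as positive; NaiveBayesAnalyzer is trained on a movie-review corpus, so its predictions on other kinds of text can be unreliable: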
tb.sentiment
Sentiment(classification='pos', p_pos=0.75616044472398, p_neg=0.2438395552760203)
Advanced Text Processing
In this section, we cover structural vectorization techniques to convert clean tokens into numeric feature matrices.
N-Grams
Initialize a test string:
x = 'thanks for watching'
Convert string to TextBlob object:
tb = TextBlob(x)
Extract 3-gram sequences:
tb.ngrams(3)
[WordList(['thanks', 'for', 'watching'])]
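To make the sliding-window idea behind n-grams concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization, which is cruder than TextBlob's tokenizer, and the function name `ngrams` is only illustrative.

```python
def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    # Slide a window of width n across the token list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

trigrams = ngrams('thanks for watching', 3)
bigrams = ngrams('thanks for watching', 2)
print(trigrams)
print(bigrams)
```

A three-word string yields one 3-gram and two 2-grams; asking for more words than the string contains returns an empty list.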
Bag of Words (BoW) Representation
Initialize a list of dummy sentences:
x = ['this is first sentence this is', 'this is second', 'this is last']
Import CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
Fit and generate BoW vectors:
cv = CountVectorizer(ngram_range=(1,1))
text_counts = cv.fit_transform(x)
Convert sparse representation to array:
text_counts.toarray()
array([[1, 2, 0, 0, 1, 2],
[0, 1, 0, 1, 0, 1],
[0, 1, 1, 0, 0, 1]], dtype=int64)
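Check the feature name mapping (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out(), available since 1.0):
cv.get_feature_names_out()
array(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype=object)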
Convert the array to a structured DataFrame:
bow = pd.DataFrame(text_counts.toarray(), columns = cv.get_feature_names_out())
Display BoW DataFrame:
bow
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |
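The counting behind this table can be reproduced with the standard library, which may help show what a Bag of Words matrix actually stores. This is a sketch under a simplifying assumption: tokens are split on whitespace, whereas CountVectorizer also lowercases text and ignores single-character tokens by default. The names `docs`, `vocab`, and `bow_rows` are illustrative.

```python
from collections import Counter

docs = ['this is first sentence this is', 'this is second', 'this is last']

# Vocabulary: every distinct token, sorted alphabetically like the columns above.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row of counts per document, one column per vocabulary term.
bow_rows = []
for doc in docs:
    counts = Counter(doc.split())
    bow_rows.append([counts[term] for term in vocab])

print(vocab)
for row in bow_rows:
    print(row)
```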
Verify documents:
x
['this is first sentence this is', 'this is second', 'this is last']
Term Frequency
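Term frequency measures the relative occurrence of a term in a specific document. The formula is:
$$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}$$
Where:
- $\mathrm{TF}(t, d)$: the normalized term frequency of term $t$ in document $d$.
- $f_{t,d}$: the number of times term $t$ occurs in document $d$.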
Verify documents:
x
['this is first sentence this is', 'this is second', 'this is last']
Display BoW matrix:
bow
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |
Check matrix dimensions:
bow.shape
(3, 6)
Copy the BoW DataFrame as floats so the normalized values can be stored in place:
tf = bow.astype(float)
Compute normalized TF for each cell by dividing each count by its row total:
for index, row in tf.iterrows():
    total = row.sum()
    for col in tf.columns:
        tf.loc[index, col] = row[col] / total
Display normalized TF DataFrame:
tf
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 0.166667 | 0.333333 | 0.000000 | 0.000000 | 0.166667 | 0.333333 |
| 1 | 0.000000 | 0.333333 | 0.000000 | 0.333333 | 0.000000 | 0.333333 |
| 2 | 0.000000 | 0.333333 | 0.333333 | 0.000000 | 0.000000 | 0.333333 |
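As a cross-check on the table above, the same normalization can be worked through in plain Python. The count rows are copied from the BoW matrix; the names `bow_rows` and `tf_rows` are illustrative.

```python
# Count rows copied from the BoW matrix (columns: first, is, last, second, sentence, this).
bow_rows = [
    [1, 2, 0, 0, 1, 2],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
]

# Divide every count by the total number of terms in its document.
tf_rows = [[count / sum(row) for count in row] for row in bow_rows]

for row in tf_rows:
    print([round(value, 6) for value in row])
```

Each row sums to 1, which is what makes term frequencies comparable across documents of different lengths. In pandas, the same result can be obtained without a loop using `bow.div(bow.sum(axis=1), axis=0)`.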
Inverse Document Frequency IDF
Inverse Document Frequency (IDF) measures how unique or rare a term is across the entire corpus. The formula used in scikit-learn when smooth_idf=True is:
$$\mathrm{IDF}(t) = \ln\left(\frac{1 + N}{1 + \mathrm{df}(t)}\right) + 1$$
Where:
- $N$: total number of documents in the corpus.
- $\mathrm{df}(t)$: number of documents containing term $t$.
Import NumPy:
import numpy as np
Convert string array to DataFrame:
x_df = pd.DataFrame(x, columns=['words'])
Display words:
x_df
| | words |
|---|---|
| 0 | this is first sentence this is |
| 1 | this is second |
| 2 | this is last |
Display BoW DataFrame:
bow
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |
Get the total document count N:
N = bow.shape[0]
N
3
Convert values to boolean flags to find document presence:
bb = bow.astype('bool')
bb
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | True | True | False | False | True | True |
| 1 | False | True | False | True | False | True |
| 2 | False | True | True | False | False | True |
Count the documents that contain the term "is":
bb['is'].sum()
3
Retrieve columns:
cols = bb.columns
cols
Index(['first', 'is', 'last', 'second', 'sentence', 'this'], dtype='object')
Calculate the number of documents containing each term:
nz = []
for col in cols:
    nz.append(bb[col].sum())
Check occurrences list:
nz
[1, 3, 1, 1, 1, 3]
Calculate IDF values:
idf = []
for index, col in enumerate(cols):
    idf.append(np.log((N + 1)/(nz[index] + 1)) + 1)
Check IDF scores:
idf
[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]
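To see what the smoothing terms do, the sketch below computes the IDF both with and without them, using only the standard library. The unsmoothed form, ln(N / df) + 1, is the one scikit-learn documents for smooth_idf=False; the function names and the `doc_freq` dictionary are illustrative, with document frequencies copied from the list above.

```python
import math

n_docs = 3
# Document frequencies copied from the nz list above.
doc_freq = {'first': 1, 'is': 3, 'last': 1, 'second': 1, 'sentence': 1, 'this': 3}

def smoothed_idf(n, df):
    """IDF with one extra 'virtual' document containing every term."""
    return math.log((1 + n) / (1 + df)) + 1

def unsmoothed_idf(n, df):
    """IDF without smoothing; undefined when df is 0."""
    return math.log(n / df) + 1

for term, df in doc_freq.items():
    print(term, round(smoothed_idf(n_docs, df), 6), round(unsmoothed_idf(n_docs, df), 6))
```

Terms that appear in every document ("is", "this") get an IDF of 1.0 either way, while smoothing pulls the score of rarer terms down slightly and avoids a division by zero for terms absent from the corpus.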
Review BoW DataFrame:
bow
| | first | is | last | second | sentence | this |
|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 0 | 0 | 1 | 2 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 1 |
TFIDF
Import TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
Fit and transform text using TfidfVectorizer:
tfidf = TfidfVectorizer()
x_tfidf = tfidf.fit_transform(x_df['words'])
Convert sparse matrix to array:
x_tfidf.toarray()
array([[0.45688214, 0.5396839 , 0. , 0. , 0.45688214,
0.5396839 ],
[0. , 0.45329466, 0. , 0.76749457, 0. ,
0.45329466],
[0. , 0.45329466, 0.76749457, 0. , 0. ,
0.45329466]])
Print fitted IDF scores:
tfidf.idf_
array([1.69314718, 1. , 1.69314718, 1.69314718, 1.69314718,
1. ])
Print calculated manual IDF scores:
idf
[1.6931471805599454, 1.0, 1.6931471805599454, 1.6931471805599454, 1.6931471805599454, 1.0]
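The fitted IDF scores match the manual ones, but the TF-IDF matrix itself is not simply TF times IDF. With its default settings, TfidfVectorizer multiplies the raw counts by the IDF and then scales each row to unit Euclidean (L2) length. The sketch below reproduces the first row of the matrix by hand under that assumption; the variable names are illustrative.

```python
import math

# Raw counts for the first document (columns: first, is, last, second, sentence, this).
counts = [1, 2, 0, 0, 1, 2]

# Smoothed IDF for N = 3 documents: ln(4 / 2) + 1 for rare terms, 1.0 for common ones.
rare = math.log(4 / 2) + 1
idf = [rare, 1.0, rare, rare, rare, 1.0]

# Weight each raw count by its IDF.
weighted = [c * w for c, w in zip(counts, idf)]

# Scale the row to unit L2 length.
norm = math.sqrt(sum(v * v for v in weighted))
tfidf_row = [v / norm for v in weighted]

print([round(v, 8) for v in tfidf_row])
```

The printed values should agree with the first row of `x_tfidf.toarray()` shown earlier.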
Word Embeddings
Word Embedding is a language modeling technique used for mapping words to vectors of real numbers. It represents words or phrases in vector space with several dimensions. Word embeddings can be generated using various methods like neural networks, co-occurrence matrix, probabilistic models, etc.
SpaCy Word2Vec
Load the large English pipeline model to retrieve semantic word vectors:
nlp = spacy.load('en_core_web_lg')
Initialize sample tokens:
doc = nlp('thank you! dog cat lion dfasaa')
Check if each token is associated with a pre-trained vector representation:
for token in doc:
    print(token.text, token.has_vector)
thank True
you True
! True
dog True
cat True
lion True
dfasaa False
Check token vector dimensions:
token.vector.shape
(300,)
Check vector dimension on "cat":
nlp('cat').vector.shape
(300,)
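Calculate similarity scores across all pairs of tokens. A token without a vector, such as "dfasaa", scores 0.0 against every other token, and spaCy may warn that it is evaluating similarity on empty vectors:
for token1 in doc:
    for token2 in doc:
        print(token1.text, token2.text, token1.similarity(token2))
    print()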
thank thank 1.0
thank you 0.5647585
thank ! 0.52147406
thank dog 0.2504265
thank cat 0.20648485
thank lion 0.13629764
thank dfasaa 0.0
you thank 0.5647585
you you 1.0
you ! 0.4390223
you dog 0.36494097
you cat 0.3080798
you lion 0.20392051
you dfasaa 0.0
! thank 0.52147406
! you 0.4390223
! ! 1.0
! dog 0.29852203
! cat 0.29702348
! lion 0.19601382
! dfasaa 0.0
dog thank 0.2504265
dog you 0.36494097
dog ! 0.29852203
dog dog 1.0
dog cat 0.80168545
dog lion 0.47424486
dog dfasaa 0.0
cat thank 0.20648485
cat you 0.3080798
cat ! 0.29702348
cat dog 0.80168545
cat cat 1.0
cat lion 0.52654374
cat dfasaa 0.0
lion thank 0.13629764
lion you 0.20392051
lion ! 0.19601382
lion dog 0.47424486
lion cat 0.52654374
lion lion 1.0
lion dfasaa 0.0
dfasaa thank 0.0
dfasaa you 0.0
dfasaa ! 0.0
dfasaa dog 0.0
dfasaa cat 0.0
dfasaa lion 0.0
dfasaa dfasaa 1.0
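By default, spaCy's similarity scores are cosine similarities between the underlying vectors. The sketch below shows that computation in plain Python, including the zero-vector case that produces the 0.0 scores for "dfasaa". The three-dimensional vectors are made up for illustration; the real vectors have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real spaCy embeddings.
dog = [0.9, 0.1, 0.3]
cat = [0.8, 0.2, 0.4]
unknown = [0.0, 0.0, 0.0]  # stands in for a token without a vector

print(cosine_similarity(dog, cat))
print(cosine_similarity(dog, unknown))
```

Vectors pointing in similar directions score close to 1, while an all-zero vector has no direction and is assigned 0.0 here.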
Machine Learning Models for Text Classification
In this section, we compare machine learning models trained on Bag of Words features, manual features, TF-IDF features, and Word2Vec semantic embeddings.
BoW Features Setup
Inspect the shape of the main dataset DataFrame:
df.shape
(1600000, 13)
Create a balanced sample dataset consisting of 2000 positive and 2000 negative sentiment rows:
df0 = df[df['sentiment']==0].sample(2000)
df4 = df[df['sentiment']==4].sample(2000)
Concatenate the positive and negative samples (DataFrame.append() was removed in pandas 2.0, so use pd.concat()):
dfr = pd.concat([df0, df4])
Check the dimension of the concatenated sample dataset:
dfr.shape
(4000, 13)
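The sampling step above draws different rows on every run, so the accuracy figures later in this section will vary slightly between runs. The sketch below shows balanced sampling with a fixed seed in plain Python. The toy rows, the seed value, and the function name `balanced_sample` are assumptions for illustration only.

```python
import random

random.seed(42)  # fixed seed so the same rows are drawn on every run

# Hypothetical labeled rows: (text, sentiment), with 0 = negative and 4 = positive.
rows = [(f"tweet {i}", 0) for i in range(50)] + [(f"tweet {i}", 4) for i in range(50, 80)]

def balanced_sample(rows, per_class):
    """Draw the same number of rows from every label, without replacement."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    sample = []
    for label_rows in by_label.values():
        sample.extend(random.sample(label_rows, per_class))
    return sample

sample = balanced_sample(rows, per_class=10)
print(len(sample))
```

In pandas, passing `random_state` to `DataFrame.sample()` plays the same role as the seed here.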
Drop non-feature labels and email text listings to isolate the manual feature set:
dfr_feat = dfr.drop(labels=['twitts','sentiment','emails'], axis = 1).reset_index(drop=True)
Inspect the manual feature DataFrame:
dfr_feat
| | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 81 | 4.400000 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | 47 | 4.875000 | 4 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 15 | 69 | 3.600000 | 6 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 9 | 42 | 3.666667 | 4 | 0 | 0 | 0 | 2 | 0 | 0 |
| 4 | 14 | 77 | 4.500000 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3995 | 3 | 33 | 9.666667 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3996 | 16 | 78 | 3.875000 | 4 | 0 | 1 | 0 | 2 | 0 | 0 |
| 3997 | 27 | 134 | 3.962963 | 9 | 0 | 1 | 0 | 2 | 0 | 0 |
| 3998 | 6 | 44 | 6.333333 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| 3999 | 5 | 25 | 4.000000 | 3 | 0 | 1 | 0 | 0 | 0 | 0 |
4000 rows × 10 columns
Extract target sentiment labels:
y = dfr['sentiment']
Import CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
Generate BoW representation for the sample dataset:
cv = CountVectorizer()
text_counts = cv.fit_transform(dfr['twitts'])
Check the size of the generated vocabulary feature space (the sparse matrix exposes its shape directly, so there is no need to densify it first):
text_counts.shape
(4000, 9750)
Construct a DataFrame from the generated vocabulary:
dfr_bow = pd.DataFrame(text_counts.toarray(), columns=cv.get_feature_names_out())
Inspect the generated BoW DataFrame:
dfr_bow.head(2)
| | 007peter | 05 | 060594 | 09 | 10 | 100 | 1000 | 10000000000000000000000000000 | 1038 | 1041 | ... | zomg | zonked | zoo | zooey | zrovna | zshare | zsk | zwel | zzz | zzzzz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 rows × 9750 columns
Classifier Models Setup
Import the classifier algorithms, evaluation metrics, and preprocessing utilities from scikit-learn:
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import MinMaxScaler
Initialize the five comparison models:
sgd = SGDClassifier(n_jobs=-1, random_state=42, max_iter=200)
lgr = LogisticRegression(random_state=42, max_iter=200)
lgrcv = LogisticRegressionCV(cv = 2, random_state=42, max_iter=1000)
svm = LinearSVC(random_state=42, max_iter=200)
rfc = RandomForestClassifier(random_state=42, n_jobs=-1, n_estimators=200)
Map models to standard shorthand keys:
clf = {'SGD': sgd, 'LGR': lgr, 'LGR-CV': lgrcv, 'SVM': svm, 'RFC': rfc}
Verify dictionary keys:
clf.keys()
dict_keys(['SGD', 'LGR', 'LGR-CV', 'SVM', 'RFC'])
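Create a generic training and evaluation pipeline function:
def classify(X, y):
    # Scale every feature to the [0, 1] range.
    # Note: the scaler is fitted before the split, so test-set ranges influence
    # the scaling; fit it on X_train only for a stricter evaluation.
    scaler = MinMaxScaler(feature_range=(0, 1))
    X = scaler.fit_transform(X)
    # Hold out 20% of the rows, preserving the class balance.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)
    # Fit each classifier and report its accuracy on the held-out rows.
    for key in clf.keys():
        clf[key].fit(X_train, y_train)
        y_pred = clf[key].predict(X_test)
        ac = accuracy_score(y_test, y_pred)
        print(key, " ---> ", ac)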
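The function above fits the scaler on the full feature matrix before splitting it. A stricter alternative is to learn the scaling ranges from the training rows only and reuse them on the test rows. The sketch below illustrates that idea in plain Python; the feature values and the helper names `fit_min_max` and `apply_min_max` are hypothetical and are not part of scikit-learn.

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows only."""
    columns = list(zip(*rows))
    return [min(col) for col in columns], [max(col) for col in columns]

def apply_min_max(rows, mins, maxs):
    """Scale rows with previously learned ranges; constant columns map to 0.0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Hypothetical feature rows: (word_counts, char_counts).
X_train = [[15, 81], [8, 47], [9, 42]]
X_test = [[27, 134]]

mins, maxs = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, mins, maxs)
X_test_scaled = apply_min_max(X_test, mins, maxs)

print(X_train_scaled)
print(X_test_scaled)
```

Test values can fall outside [0, 1] when they lie outside the training range, which is expected. With scikit-learn, the equivalent is calling `scaler.fit(X_train)` and then `scaler.transform()` on both splits.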
Evaluate accuracy scores using the BoW features representation:
%%time
classify(dfr_bow, y)
SGD ---> 0.62375
LGR ---> 0.65375
LGR-CV ---> 0.6525
SVM ---> 0.6325
RFC ---> 0.6525
Wall time: 1min 42s
Preview the manual feature set before evaluating it:
dfr_feat.head(2)
| | word_counts | char_counts | avg_word_len | stop_words_len | hashtags_count | mentions_count | numerics_count | upper_counts | emails_count | urls_flag |
|---|---|---|---|---|---|---|---|---|---|---|
| 453843 | 15 | 81 | 4.400 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| 388280 | 8 | 47 | 4.875 | 4 | 0 | 1 | 0 | 0 | 0 | 0 |
Run classifier evaluation on the manual features:
%%time
classify(dfr_feat, y)
SGD ---> 0.64125
LGR ---> 0.645
LGR-CV ---> 0.65
SVM ---> 0.6475
RFC ---> 0.5675
Wall time: 1.35 s
Combine manual features and vocabulary-based BoW features:
X = dfr_feat.join(dfr_bow)
Evaluate accuracy on the combined features matrix:
%%time
classify(X, y)
SGD ---> 0.64875
LGR ---> 0.67125
LGR-CV ---> 0.66125
SVM ---> 0.64375
RFC ---> 0.705
Wall time: 1min 18s
TF-IDF Features Setup
Import TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
Verify dataset shape:
dfr.shape
(4000, 13)
Vectorize using TfidfVectorizer:
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(dfr['twitts'])
Train and evaluate the models using the generated TF-IDF features matrix:
%%time
classify(pd.DataFrame(X.toarray()), y)
SGD ---> 0.635
LGR ---> 0.65125
LGR-CV ---> 0.6475
SVM ---> 0.63875
RFC ---> 0.6425
Wall time: 1min 37s
Word2Vec Features Setup
Create a function to calculate the average semantic vector representation of a sentence using spaCy:
def get_vec(x):
    doc = nlp(x)
    return doc.vector.reshape(1, -1)
Apply the vectorization function to all tweets:
%%time
dfr['vec'] = dfr['twitts'].apply(lambda x: get_vec(x))
Wall time: 51.8 s
Concatenate individual vectors into a single feature array:
X = np.concatenate(dfr['vec'].to_numpy(), axis = 0)
Check the shape of the feature matrix:
X.shape
(4000, 300)
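For pipelines with static word vectors, spaCy's `doc.vector` defaults to the average of the token vectors, which is why every tweet ends up as a single 300-dimensional row regardless of its length. The sketch below shows that averaging step in plain Python with made-up four-dimensional vectors; the function name `average_vector` is illustrative.

```python
def average_vector(token_vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]

# Hypothetical 4-dimensional token vectors; the large spaCy model uses 300 dimensions.
token_vectors = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
]

sentence_vector = average_vector(token_vectors)
print(sentence_vector)
```

Averaging discards word order, so two tweets with the same words in a different order receive the same vector.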
Evaluate accuracy on Word2Vec feature vectors:
classify(pd.DataFrame(X), y)
SGD ---> 0.5925
LGR ---> 0.70625
LGR-CV ---> 0.69375
SVM ---> 0.70125
RFC ---> 0.66625
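With results for all five feature sets now available, it can help to tabulate them before drawing conclusions. The sketch below copies the accuracy scores printed in this section into a dictionary and picks the best model per feature set. The dictionary layout and variable names are illustrative, and because the scores come from a single 80/20 split of 4,000 sampled tweets, small differences should be read with caution.

```python
# Accuracy scores copied from the classify() outputs in this section.
results = {
    "BoW": {"SGD": 0.62375, "LGR": 0.65375, "LGR-CV": 0.6525, "SVM": 0.6325, "RFC": 0.6525},
    "Manual": {"SGD": 0.64125, "LGR": 0.645, "LGR-CV": 0.65, "SVM": 0.6475, "RFC": 0.5675},
    "BoW + manual": {"SGD": 0.64875, "LGR": 0.67125, "LGR-CV": 0.66125, "SVM": 0.64375, "RFC": 0.705},
    "TF-IDF": {"SGD": 0.635, "LGR": 0.65125, "LGR-CV": 0.6475, "SVM": 0.63875, "RFC": 0.6425},
    "Word2Vec": {"SGD": 0.5925, "LGR": 0.70625, "LGR-CV": 0.69375, "SVM": 0.70125, "RFC": 0.66625},
}

# Best (model, accuracy) pair for each feature set.
best_per_feature_set = {
    name: max(scores.items(), key=lambda item: item[1])
    for name, scores in results.items()
}

# Best (feature set, model, accuracy) triple overall.
overall = max(
    ((name, model, acc) for name, scores in results.items() for model, acc in scores.items()),
    key=lambda row: row[2],
)

for name, (model, acc) in best_per_feature_set.items():
    print(f"{name:<13} {model:<7} {acc:.3f}")
print("Best overall:", overall)
```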
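Create a custom function to run predictions on Word2Vec inputs:
def predict_w2v(x):
    # Caveat: classify() trained the models on MinMax-scaled features, but the
    # vector here is passed unscaled, so treat these predictions as illustrative.
    for key in clf.keys():
        y_pred = clf[key].predict(get_vec(x))
        print(key, "-->", y_pred)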
Evaluate prediction on a positive review:
predict_w2v('hi, thanks for watching this video. please like and subscribe')
SGD --> [0]
LGR --> [4]
LGR-CV --> [0]
SVM --> [4]
RFC --> [0]
Predict the sentiment of a neutral request:
predict_w2v('please let me know if you want more video')
SGD --> [0]
LGR --> [0]
LGR-CV --> [0]
SVM --> [0]
RFC --> [0]
Predict the sentiment of highly positive feedback:
predict_w2v('congratulation looking good congrats')
SGD --> [4]
LGR --> [4]
LGR-CV --> [4]
SVM --> [4]
RFC --> [0]
Conclusion
In this tutorial, you built a complete text preprocessing and feature engineering pipeline in Python. Starting with raw tweet logs from the Sentiment140 dataset, you calculated meta-features like word counts and mentions, cleaned the text by stripping HTML tags and accented characters, and extracted lemmas using spaCy. Using these cleaned tokens, you generated numerical representations using Bag of Words, TF-IDF, and Word2Vec models, then trained five different classifiers. Logistic Regression achieved the highest overall accuracy, 70.6%, when trained on dense Word2Vec semantic embeddings, with LinearSVC close behind at 70.1%. The Random Forest came within a fraction of a point, reaching 70.5% on the combined BoW and manual features.
Key takeaways:
- Cleaning operations like contraction expansion and lemmatization reduce the overall size of the vocabulary, helping classifiers avoid overfitting.
- spaCy provides ready-to-use, high-quality, pre-trained word embeddings that capture semantic similarities better than basic Bag of Words arrays.
- Combining manual features (like word length or stop word counts) with Bag of Words counts improved accuracy over BoW alone for all five classifiers in this run, most notably for the Random Forest (from about 65% to 70.5%).
Next steps:
- Apply the same workflow to movie and product reviews in Sentiment Classification with spaCy.
- Build a binary classifier to identify spam messages in Spam Text Message Classification using NLP.
- Explore summarization techniques on raw text blocks in Text Summarization using NLP.
