Feature Dimension Reduction Using LDA and PCA with Python | Principal Component Analysis in Feature Selection | KGP Talkie
Feature Dimension Reduction
Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH
What is LDA (Linear Discriminant Analysis)?
The idea behind LDA is simple. Mathematically speaking, we need to find a new feature space onto which to project the data in order to maximize class separability.
Linear Discriminant Analysis is a supervised algorithm as it takes the class label into consideration. It is a way to reduce ‘dimensionality’ while at the same time preserving as much of the class discrimination information as possible.
LDA helps you find the boundaries around clusters of classes. It projects your data points onto a line so that the clusters are as separated as possible, with the points of each cluster lying at a relatively close distance to a centroid.
So the question arises: how are these clusters defined, and how do we get the reduced feature set in the case of LDA?
LDA finds the centroid of the data points of each class. For example, with thirteen different features, LDA will find the centroid of each class using all thirteen features. On this basis, it determines a new dimension, which is nothing but an axis that should satisfy two criteria:
- Maximize the distance between the centroids of the classes.
- Minimize the variation (which LDA calls scatter and is represented by s²) within each category.
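The two criteria above can be illustrated with a minimal NumPy sketch for the two-class case. The data is synthetic, and the helper name fisher_ratio is an assumption made for this illustration; it is not part of the tutorial or of scikit-learn. The direction w is computed with the classical two-class Fisher formula (within-class scatter inverse times the difference of the centroids).

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic classes in 2-D (illustrative data, not the tutorial's dataset).
class_a = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
class_b = rng.normal(loc=[4.0, 1.0], scale=1.0, size=(200, 2))

# Centroid (mean vector) of each class.
mu_a = class_a.mean(axis=0)
mu_b = class_b.mean(axis=0)

# Within-class scatter: summed outer products of the centred points.
scatter_a = (class_a - mu_a).T @ (class_a - mu_a)
scatter_b = (class_b - mu_b).T @ (class_b - mu_b)
s_within = scatter_a + scatter_b

# Two-class Fisher direction: w is proportional to S_w^{-1} (mu_b - mu_a).
w = np.linalg.solve(s_within, mu_b - mu_a)
w /= np.linalg.norm(w)

def fisher_ratio(direction):
    """Squared distance between projected centroids over summed projected scatter."""
    proj_a = class_a @ direction
    proj_b = class_b @ direction
    between = (proj_a.mean() - proj_b.mean()) ** 2
    within = ((proj_a - proj_a.mean()) ** 2).sum() + ((proj_b - proj_b.mean()) ** 2).sum()
    return between / within

print("Fisher direction:", w)
print("ratio along w:      ", fisher_ratio(w))
print("ratio along feature 0:", fisher_ratio(np.array([1.0, 0.0])))
```

The ratio along w should be at least as large as along either original feature axis, which is what "maximize centroid distance while minimizing scatter" means in practice.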
What is PCA?
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that can be utilized for extracting information from a high-dimensional space by projecting it into a lower-dimensional sub-space. It tries to preserve the essential parts of the data that have more variation and to remove the non-essential parts that have less variation.
Dimensions are nothing but features that represent the data. For example, a 28 x 28 image has 784 picture elements (pixels), which are the dimensions or features that together represent that image.
One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: you can cluster similar data points based on the feature correlation between them without any supervision (or labels), and you will learn how to achieve this practically using Python in later sections of this tutorial!
According to Wikipedia, PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components.
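As a rough illustration of that definition, the sketch below builds two correlated synthetic variables and applies an orthogonal transformation obtained from the eigenvectors of their covariance matrix. This is one textbook way to compute PCA and is only an assumption about how to demonstrate the idea; scikit-learn's own implementation works differently internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic variables (illustrative data only).
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# Centre the data, then eigen-decompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is meant for symmetric matrices

# Sort directions by decreasing variance; the eigenvector columns are orthonormal.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Orthogonal transformation: project the data onto the principal directions.
components = centered @ eigenvectors

print("correlation before:", np.corrcoef(data, rowvar=False)[0, 1])
print("correlation after: ", np.corrcoef(components, rowvar=False)[0, 1])
```

The correlation after the transformation should be zero up to floating-point error, which is the "linearly uncorrelated" property in the definition.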
When to use PCA
When working on any data-related problem, the challenge in today's world is the sheer volume of data and the number of variables/features that define that data. To solve a problem where data is the key, you need extensive data exploration, such as finding out how the variables are correlated or understanding the distribution of a few variables. Considering that there are a large number of variables or dimensions along which the data is distributed, visualization can be a challenge and almost impossible.
Speeding Up a Machine Learning (ML) Algorithm:
Since PCA's main idea is dimensionality reduction, you can leverage it to speed up your machine learning algorithm's training and testing time when your data has a lot of features and the ML algorithm's learning is too slow.
How to do PCA
We can calculate a Principal Component Analysis on a dataset using the PCA() class in the scikit-learn library. The benefit of this approach is that once the projection is calculated, it can be applied to new data again and again quite easily.
When creating the class, the number of components can be specified as a parameter.
The class is first fit on a dataset by calling the fit() function, and then the original dataset or other data can be projected into a subspace with the chosen number of dimensions by calling the transform() function.
Once fit, the explained variances and the principal components can be accessed on the PCA class via the explained_variance_ and components_ attributes.
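The fit/transform workflow described above might look like the following minimal sketch. The arrays train and new_data are synthetic stand-ins invented for this example; the tutorial's real data is loaded further down.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins for a training set and for new, unseen data.
train = rng.normal(size=(100, 5))
new_data = rng.normal(size=(10, 5))

pca = PCA(n_components=2)   # number of components chosen at construction
pca.fit(train)              # learn the projection from the training data only

train_2d = pca.transform(train)    # project the data used for fitting
new_2d = pca.transform(new_data)   # reuse the same projection on new data

print(train_2d.shape, new_2d.shape)
print(pca.explained_variance_)   # variance captured by each component
print(pca.components_.shape)     # one row per component, one column per feature
```

Fitting once and calling transform() on later data keeps the projection identical for training and test sets, which is why the tutorial fits on the training split only.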
Let's go ahead and learn with the script.
Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
Here, we read the first 20,000 rows of the Santander data into the variable data.
data = pd.read_csv('santander.csv', nrows = 20000)
data.head()
Let's split this into the feature matrix X and the target vector y used for training.
X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
((20000, 370), (20000,))
Now we will split the dataset into train and test sets, as you can see in the following script.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
Remove Constant, Quasi Constant and Duplicate Features
Let's remove constant and quasi-constant features from the data with a variance threshold of 0.01. That means features whose variance on the training set does not exceed 0.01, i.e. features that take almost the same value in nearly every row, are removed.
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
X_train_filter.shape, X_test_filter.shape
((16000, 245), (4000, 245))
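To make the effect of the threshold concrete, here is a small sketch on a toy matrix. The three columns are invented for illustration: one constant, one quasi-constant (a single 1 among 200 zeros), and one that really varies.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix with 200 rows (illustrative values, not the Santander data).
constant = np.zeros(200)
quasi_constant = np.zeros(200)
quasi_constant[0] = 1.0            # a single differing value
varying = np.arange(200) % 2       # alternates 0, 1, 0, 1, ...

X_toy = np.column_stack([constant, quasi_constant, varying])

selector = VarianceThreshold(threshold=0.01)
X_kept = selector.fit_transform(X_toy)

print(X_toy.var(axis=0))        # per-column variance
print(selector.get_support())   # True for the columns that survive
print(X_kept.shape)
```

Only the varying column has a variance above 0.01, so it should be the only one kept.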
Let's remove duplicated features from the data.
X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)
X_train_T.duplicated().sum()
duplicated_features = X_train_T.duplicated()
features_to_keep = [not index for index in duplicated_features]
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
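The transpose trick used here works because DataFrame.duplicated() compares rows, not columns. A minimal sketch on a toy frame (the column names a, b and c are made up for the example):

```python
import pandas as pd

# Toy frame in which column "c" is an exact copy of column "a".
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "c": [1, 2, 3, 4],
})

# duplicated() flags repeated rows, so transpose first to compare columns.
is_duplicate = df.T.duplicated()             # True for later copies of a column
df_unique = df.loc[:, ~is_duplicate.values]  # keep only the first occurrence

print(is_duplicate.tolist())
print(df_unique.columns.tolist())
```

Only the later copy is flagged, so the first occurrence of each duplicated feature is retained.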
Let's go ahead and standardize the data to get the same scale.
scaler = StandardScaler().fit(X_train_unique)
X_train_unique = scaler.transform(X_train_unique)
X_test_unique = scaler.transform(X_test_unique)
X_train_unique = pd.DataFrame(X_train_unique)
X_test_unique = pd.DataFrame(X_test_unique)
X_train_unique.shape, X_test_unique.shape
((16000, 227), (4000, 227))
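A small sketch of what standardization does, using two toy features on very different scales (the numbers are illustrative). Note that the scaler is fitted on the training rows only and then reused for the test rows, as in the tutorial code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (illustrative values).
train = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
test = np.array([[2.0, 4000.0]])

scaler = StandardScaler().fit(train)   # mean and std come from the training set only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # reuses the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so no feature dominates merely because of its units.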
Removal of Correlated Features
Now we will find out the correlated features from the following code:
corrmat = X_train_unique.corr()
def get_correlation(data, threshold):
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col

corr_features = get_correlation(X_train_unique, 0.70)
print('correlated features: ', len(set(corr_features)))
correlated features: 148
X_train_uncorr = X_train_unique.drop(labels=corr_features, axis = 1)
X_test_uncorr = X_test_unique.drop(labels = corr_features, axis = 1)
X_train_uncorr.shape, X_test_uncorr.shape
((16000, 79), (4000, 79))
Here, we can observe that the features are reduced from 370 to 79.
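To see which column of a correlated pair the get_correlation function discards, the sketch below runs it on a toy frame. The function body is repeated so the example is self-contained; the column names f0, f1 and f2 and the data are synthetic assumptions.

```python
import numpy as np
import pandas as pd

def get_correlation(data, threshold):
    """Return the later column of every pair whose |correlation| exceeds threshold."""
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                corr_col.add(corrmat.columns[i])
    return corr_col

rng = np.random.default_rng(0)
base = rng.normal(size=300)

# "f1" is a noisy copy of "f0"; "f2" is independent noise (synthetic data).
toy = pd.DataFrame({
    "f0": base,
    "f1": base + rng.normal(scale=0.1, size=300),
    "f2": rng.normal(size=300),
})

to_drop = get_correlation(toy, 0.70)
toy_uncorr = toy.drop(labels=list(to_drop), axis=1)

print(to_drop)
print(toy_uncorr.columns.tolist())
```

Because the inner loop only looks at columns to the left of column i, the first feature of each correlated pair is kept and the later one is dropped.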
Feature Dimension Reduction by LDA, or Is It a Classifier?
Here the question is whether LDA is a dimensionality reduction technique or a classifier. We can say that it works as both.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
The number of components we can pass here is 1 because, if you remember, the maximum number of components we can select is the total number of classes minus 1. The Santander problem is a two-class problem, with labels that are either 0 or 1, so the maximum number of components is 1.
In older scikit-learn releases, a value larger than 1 was reduced to 1 with a warning; newer releases raise a ValueError instead, so pass 1 here.
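The "number of classes minus 1" limit, and the fact that one fitted LDA object can both project and classify, can be checked with a short sketch on synthetic data. The helper make_blobs_by_class is a hypothetical function written for this example; n_components is left at its default so that scikit-learn picks the maximum itself.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

def make_blobs_by_class(n_classes, n_per_class=50, n_features=4):
    """Synthetic data: class k is a Gaussian blob shifted by 4 along feature k."""
    centers = 4.0 * np.eye(n_classes, n_features)
    X = np.vstack([
        rng.normal(loc=centers[k], scale=1.0, size=(n_per_class, n_features))
        for k in range(n_classes)
    ])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

components_found = {}
for n_classes in (2, 3):
    X, y = make_blobs_by_class(n_classes)
    lda = LinearDiscriminantAnalysis()    # n_components left at its default (None)
    projected = lda.fit_transform(X, y)   # LDA used as a dimensionality reducer
    predicted = lda.predict(X)            # the same fitted model used as a classifier
    components_found[n_classes] = projected.shape[1]
    print(n_classes, "classes ->", projected.shape[1], "component(s),",
          "training accuracy", (predicted == y).mean())
```

With two classes the projection has a single column, and with three classes it has two, matching the rule stated above.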
Let's go ahead and transform the data by using fit_transform().
lda = LDA(n_components=1)
X_train_lda = lda.fit_transform(X_train_uncorr, y_train)
X_test_lda = lda.transform(X_test_uncorr)
Here, we can see the shape of the transformed data from the following code.
X_train_lda.shape, X_test_lda.shape
((16000, 1), (4000, 1))
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))
%%time
run_randomForest(X_train_lda, X_test_lda, y_train, y_test)
Accuracy on test set:
0.93025
CPU times: user 3.35 s, sys: 64.3 ms, total: 3.41 s
Wall time: 1.1 s
%%time
run_randomForest(X_train_uncorr, X_test_uncorr, y_train, y_test)
Accuracy on test set:
0.9585
CPU times: user 3.41 s, sys: 82 ms, total: 3.49 s
Wall time: 1.2 s
Let's go ahead and run this on the original dataset.
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set:
0.9585
CPU times: user 6.68 s, sys: 143 ms, total: 6.83 s
Wall time: 2.01 s
So, as we can see here, the accuracy on the original dataset (0.9585) is higher than on the LDA-transformed dataset (0.93025). However, training on the original dataset takes roughly twice as long as on the transformed version (2.01 s versus 1.1 s wall time), and the dimension has been reduced as well.
From this, we can observe that LDA gives no guarantee on accuracy, but it does reduce the dimension, and in this run it also reduced the CPU time.
Feature Reduction by PCA
from sklearn.decomposition import PCA
Let's reduce the features by using the Principal Component Analysis (PCA) method.
pca = PCA(n_components=2, random_state=42)
pca.fit(X_train_uncorr)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=42, svd_solver='auto', tol=0.0, whiten=False)
Let's go ahead and get training and testing dataset by PCA transformation.
X_train_pca = pca.transform(X_train_uncorr)
X_test_pca = pca.transform(X_test_uncorr)
X_train_pca.shape, X_test_pca.shape
((16000, 2), (4000, 2))
Now, find out the accuracy and cpu time of the transformed dataset.
%%time
run_randomForest(X_train_pca, X_test_pca, y_train, y_test)
Accuracy on test set:
0.956
CPU times: user 3.06 s, sys: 67.9 ms, total: 3.12 s
Wall time: 999 ms
Let's get the accuracy and cpu time of the original dataset.
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set:
0.9585
CPU times: user 6.75 s, sys: 138 ms, total: 6.89 s
Wall time: 2.13 s
Let's check the dimension of the uncorrelated dataset.
X_train_uncorr.shape
(16000, 79)
Let's check the accuracy for various selected components.
for component in range(1,5):
    pca = PCA(n_components=component, random_state=42)
    pca.fit(X_train_uncorr)
    X_train_pca = pca.transform(X_train_uncorr)
    X_test_pca = pca.transform(X_test_uncorr)
    print('Selected Components: ', component)
    run_randomForest(X_train_pca, X_test_pca, y_train, y_test)
    print()
Selected Components: 1
Accuracy on test set:
0.92375

Selected Components: 2
Accuracy on test set:
0.956

Selected Components: 3
Accuracy on test set:
0.95675

Selected Components: 4
Accuracy on test set:
0.95825
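Besides looping over component counts and retraining, one common way to judge how many components are worth keeping is to look at the cumulative explained variance ratio of a fitted PCA. The sketch below is only an illustration on synthetic data (10 features generated from 3 latent factors, names invented here); it is not what the tutorial ran, and the 95% cut-off is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.05, size=(500, 10))

pca_full = PCA().fit(X_demo)   # keep every component to inspect the full spectrum
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(cumulative, 3))
print("components needed for 95% of the variance:", n_needed)
```

A high explained variance does not by itself guarantee good classification accuracy, so a check like the accuracy loop above is still worth running on the real data.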