Feature Selection with Filtering Method | Constant, Quasi Constant and Duplicate Feature Removal

Published by KGP Talkie

Filtering method

Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH

Unnecessary and redundant features not only slow down the training of an algorithm, they can also degrade its performance.

There are several advantages of performing feature selection before training machine learning models:

  • Models with fewer features are easier to explain.
  • It is easier to implement machine learning models with reduced features.
  • Fewer features lead to enhanced generalization which in turn reduces overfitting.
  • Feature selection removes data redundancy.
  • Training time of models with fewer features is significantly lower.
  • Models with fewer features are less prone to errors.

What is the filter method?

Filter methods select features based on statistical properties of the data, independently of any learning algorithm, so features selected this way can be used as input to any machine learning model.

  • Univariate -> Fisher score, mutual information gain, variance, etc.
  • Multivariate -> Pearson correlation

Univariate filter methods rank individual features according to a specific criterion and then select the top N features. Different ranking criteria are used for univariate filter methods, for example the Fisher score, mutual information, and the variance of the feature.

Multivariate filter methods can also remove redundant features, since they take the mutual relationships between the features into account.
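
For a quick illustration of univariate ranking (a minimal sketch on synthetic data, not the dataset used below), scikit-learn's SelectKBest scores every feature with a chosen criterion, such as mutual information, and keeps the top k:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# synthetic toy data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# score each feature with mutual information and keep the best 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top5 = selector.fit_transform(X_demo, y_demo)

print(selector.scores_)  # one score per feature
print(X_top5.shape)      # (200, 5)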

Part 1 of the filtering method
Part 2 of the filtering method

Univariate Filtering Methods in this lesson

  • Constant Removal
  • Quasi Constant Removal
  • Duplicate Feature Removal

Download Data Files

https://github.com/laxmimerit/Data-Files-for-Feature-Selection

Constant Feature Removal

Importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold

Now let's read the dataset with pandas; we initially load only the first 20,000 rows.

data = pd.read_csv('santander.csv', nrows = 20000)
data.head()
[Output: the first five rows, showing columns ID, var3, var15, imp_ent_var16_ult1, ..., saldo_medio_var44_ult3, var38 and TARGET; too wide to reproduce here]

5 rows × 371 columns

Let's load the data into the feature matrix X and the target vector y.

X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
((20000, 370), (20000,))

Let's split this dataset into train and test sets using the code below.

Here test_size = 0.2 means 20% of the data is held out for testing and the rest is used for training; stratify = y keeps the class proportions the same in both splits.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
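
As a quick sanity check (a small addition, not in the original notebook), we can confirm that stratification preserved the fraction of positive targets in both splits:

# TARGET is binary, so the mean is the fraction of positives;
# with stratify=y the two numbers should be nearly identical
print(y_train.mean(), y_test.mean())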

Constant Features Removal

First we create a VarianceThreshold filter with threshold=0 (a feature with zero variance is constant) and fit it on the training set.

constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X_train)
VarianceThreshold(threshold=0)

Let's get the number of features left after removing constant features.

constant_filter.get_support().sum()
291
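
get_support() returns a boolean mask that is True for the features kept, so its sum is the number of non-constant features: 291 of the 370, meaning 79 columns are constant. As a quick cross-check (an alternative not used in the tutorial itself), exact constants can also be found with plain pandas:

# a column with a single unique value is constant;
# the count should match 370 - 291 = 79
constant_cols = [col for col in X_train.columns if X_train[col].nunique() == 1]
print(len(constant_cols))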

Let's invert the mask so that True marks the constant features, and print it.

constant_list = [not temp for temp in constant_filter.get_support()]
constant_list
[False,
 False,
 False,
 False,
 False,
 ...
 True,
 True,
 ...
 False,
 False,
 False]

(output truncated; 370 booleans in total, True marks a constant feature)

Now let's get the names of the constant features.

X.columns[constant_list]
Index(['ind_var2_0', 'ind_var2', 'ind_var13_medio_0', 'ind_var13_medio',
       'ind_var18_0', 'ind_var18', 'ind_var27_0', 'ind_var28_0', 'ind_var28',
       'ind_var27', 'ind_var34_0', 'ind_var34', 'ind_var41', 'ind_var46_0',
       'ind_var46', 'num_var13_medio_0', 'num_var13_medio', 'num_var18_0',
       'num_var18', 'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27',
       'num_var34_0', 'num_var34', 'num_var41', 'num_var46_0', 'num_var46',
       'saldo_var13_medio', 'saldo_var18', 'saldo_var28', 'saldo_var27',
       'saldo_var34', 'saldo_var41', 'saldo_var46',
       'delta_imp_amort_var18_1y3', 'delta_imp_amort_var34_1y3',
       'delta_imp_reemb_var33_1y3', 'delta_imp_trasp_var17_out_1y3',
       'delta_imp_trasp_var33_out_1y3', 'delta_num_reemb_var33_1y3',
       'delta_num_trasp_var17_out_1y3', 'delta_num_trasp_var33_out_1y3',
       'imp_amort_var18_hace3', 'imp_amort_var18_ult1',
       'imp_amort_var34_hace3', 'imp_amort_var34_ult1', 'imp_var7_emit_ult1',
       'imp_reemb_var13_hace3', 'imp_reemb_var17_hace3',
       'imp_reemb_var33_hace3', 'imp_reemb_var33_ult1',
       'imp_trasp_var17_in_hace3', 'imp_trasp_var17_out_hace3',
       'imp_trasp_var17_out_ult1', 'imp_trasp_var33_in_hace3',
       'imp_trasp_var33_out_hace3', 'imp_trasp_var33_out_ult1',
       'ind_var7_emit_ult1', 'num_var2_0_ult1', 'num_var2_ult1',
       'num_var7_emit_ult1', 'num_meses_var13_medio_ult3',
       'num_reemb_var13_hace3', 'num_reemb_var17_hace3',
       'num_reemb_var33_hace3', 'num_reemb_var33_ult1',
       'num_trasp_var17_in_hace3', 'num_trasp_var17_out_hace3',
       'num_trasp_var17_out_ult1', 'num_trasp_var33_in_hace3',
       'num_trasp_var33_out_hace3', 'num_trasp_var33_out_ult1',
       'saldo_var2_ult1', 'saldo_medio_var13_medio_hace2',
       'saldo_medio_var13_medio_hace3', 'saldo_medio_var13_medio_ult1',
       'saldo_medio_var13_medio_ult3', 'saldo_medio_var29_hace3'],
      dtype='object')

Let's go ahead and transform X_train and X_test to drop the constant features.

X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)

Let's get the shape of the datasets.

X_train_filter.shape, X_test_filter.shape, X_train.shape
((16000, 291), (4000, 291), (16000, 370))

Quasi Constant Feature Removal

Quasi constant features are features that are almost constant: they take the same value for a large subset of the samples. Such features are not very useful for making predictions.

There is no fixed rule for the threshold, but a common choice is to require at least 99% of the samples to share the same value, i.e. to allow at most 1% variation. Since VarianceThreshold compares raw variances, this interpretation assumes the features are on comparable scales (for a binary feature, 99% identical values corresponds to a variance of roughly 0.01).
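
A minimal sketch of that arithmetic (illustration only, not part of the original notebook): a binary feature in which 99% of the samples share one value has a variance just below the 0.01 threshold, so it would be dropped.

import numpy as np

# 99 zeros and a single one: 99% of the values are identical
x = np.array([0] * 99 + [1])
print(x.var())  # ≈ 0.0099, below the 0.01 threshold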

Let's go ahead and see how many quasi constant features there are.

quasi_constant_filter = VarianceThreshold(threshold=0.01)
quasi_constant_filter.fit(X_train_filter)
VarianceThreshold(threshold=0.01)

Let's see how many features are not quasi constant.

quasi_constant_filter.get_support().sum()
245
291-245
46

To remove the quasi constant features, we apply transform with the fitted quasi_constant_filter object.

X_train_quasi_filter = quasi_constant_filter.transform(X_train_filter)
X_test_quasi_filter = quasi_constant_filter.transform(X_test_filter)
X_train_quasi_filter.shape, X_test_quasi_filter.shape
((16000, 245), (4000, 245))
370-245
125

In this way, we have reduced the dataset from 370 to 245 features.

Remove Duplicate Features

If two features are exactly the same, they are called duplicate features. A duplicate feature provides no new information and only makes the model more complex.

Here we have a problem: unlike constant and quasi constant removal, sklearn has no direct utility for removing duplicate features.

So we will first transpose the dataset, turning columns into rows, and then use pandas' duplicated() method, which flags duplicate rows.
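
To see why the transpose trick works, here is a minimal sketch on a hypothetical toy frame (not the Santander data):

import pandas as pd

# columns 'a' and 'b' are identical, so after transposing,
# row 'b' duplicates row 'a' and is flagged True
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [0, 1, 0]})
print(df.T.duplicated())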

Let's transpose the training and test datasets using the following code.

X_train_T = X_train_quasi_filter.T
X_test_T = X_test_quasi_filter.T
type(X_train_T)
numpy.ndarray

Let's convert them into pandas DataFrames.

X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)

Let's check the shapes of the datasets.

X_train_T.shape, X_test_T.shape
((245, 16000), (245, 4000))

Let's go ahead and get the duplicate features.

X_train_T.duplicated().sum()
18

So, here we have 18 duplicated features.

duplicated_features = X_train_T.duplicated()
duplicated_features
0      False
1      False
2      False
3      False
4      False
       ...  
240    False
241    False
242    False
243    False
244    False
Length: 245, dtype: bool

Now we invert the mask to keep only the non-duplicated features.

features_to_keep = [not index for index in duplicated_features]
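
Equivalently (a more idiomatic alternative, not the tutorial's original line), a boolean Series can be negated element-wise with the ~ operator:

# same result as the list comprehension above
features_to_keep = ~duplicated_features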

Let's transpose back to restore the original orientation.

X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T

Let's check the shape of the datasets.

X_train_unique.shape, X_train.shape
((16000, 227), (16000, 370))

Here we can see that the original dataset had 370 features; after removing constant, quasi constant, and duplicate features, 227 remain.

370-227
143

Build an ML model and compare the performance of the selected features

Let's go ahead and compare model performance on the original and the transformed dataset.

Here we are going to build a random forest classifier.

def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))

Let's calculate the accuracy.

%%time
run_randomForest(X_train_unique, X_test_unique, y_train, y_test)
Accuracy on test set: 
0.95875
Wall time: 2.18 s

Let's check the accuracy of the original dataset.

%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 
0.9585
Wall time: 2.87 s
That is, training on the reduced feature set took (2.87 − 2.18) / 2.87 ≈ 24% less time than on the full dataset, with essentially the same accuracy.

Feature Selection with Filtering Method- Correlated Feature Removal


A dataset can also contain correlated features. Two or more features are correlated if they are close to each other in the linear space, i.e. they carry largely the same information.

Correlation between an input feature and the target, on the other hand, is desirable, and such features should be retained.
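
As a hedged sketch using variables defined earlier in this lesson (X_train_unique and y_train; the transpose steps above reset the row labels, so we align y_train by position), feature-to-target correlation can be inspected with pandas:

import pandas as pd

# align y_train to X_train_unique's row labels before correlating,
# because the transpose steps replaced the original index
y_aligned = pd.Series(y_train.values, index=X_train_unique.index)
target_corr = X_train_unique.corrwith(y_aligned).abs().sort_values(ascending=False)
print(target_corr.head())  # features most correlated with the target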


Summary

  • Feature-to-target correlation is desired
  • Feature-to-feature correlation is not desired
  • If two features are highly correlated, one of them is redundant
  • Correlation in the feature space increases model complexity
  • Removing correlated features can improve model performance
  • Different models respond differently to correlated features
Let's compute the correlation matrix of the remaining features and visualize it as a heatmap.

corrmat = X_train_unique.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corrmat)
Next we define a helper that collects every feature whose absolute correlation with an earlier feature exceeds a threshold. Scanning only the lower triangle (j < i) visits each pair once and skips the diagonal, so the later column of each correlated pair is the one marked for removal.

def get_correlation(data, threshold):
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):  # lower triangle only: each pair visited once
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]  # flag the later feature of the pair
                corr_col.add(colname)
    return corr_col
corr_features = get_correlation(X_train_unique, 0.85)
corr_features
{5,
 7,
 9,
 11,
 12,
 14,
 ...
 239,
 240,
 241,
 242,
 243}

(output truncated; 124 correlated feature indices in total)

Let's get the number of correlated features.

len(corr_features)
124

Let's drop the correlated features from the dataset.

X_train_uncorr = X_train_unique.drop(labels=corr_features, axis = 1)
X_test_uncorr = X_test_unique.drop(labels = corr_features, axis = 1)
X_train_uncorr.shape, X_test_uncorr.shape
((16000, 103), (4000, 103))

Let's find out the accuracy and training time of the uncorrelated dataset.

%%time
run_randomForest(X_train_uncorr, X_test_uncorr, y_train, y_test)
Accuracy on test set: 
0.95875
Wall time: 912 ms

Now let's find out the accuracy and training time on the original dataset.

%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 
0.9585
Wall time: 1.53 s
(1.53-0.912)*100/1.53
40.3921568627451

So the uncorrelated dataset trains about 40% faster with essentially the same accuracy.

Feature Grouping and Feature Importance

Let's look at the correlation matrix of the deduplicated training set again.

corrmat
[Output: the 227 × 227 correlation matrix of the remaining features, too wide to reproduce here; off-diagonal values close to 1 (e.g. 0.953411 between features 239 and 236) mark highly correlated pairs]

227 rows × 227 columns

Let's flatten the correlation matrix into one (feature, feature) pair per row: abs() lets us treat positive and negative correlations alike, and stack() turns the matrix into a Series indexed by index pairs.

corrdata = corrmat.abs().stack()
corrdata
0    0      1.000000
     1      0.025277
     2      0.001942
     3      0.003594
     4      0.004054
              ...   
244  240    0.011106
     241    0.011807
     242    0.008604
     243    0.009136
     244    1.000000
Length: 51529, dtype: float64
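
If stack() is unfamiliar, here is what it does on a hypothetical 2 × 2 matrix:

import pandas as pd

# stack() turns each (row, column) cell into one entry of a Series
m = pd.DataFrame([[1.0, 0.9], [0.9, 1.0]])
print(m.stack())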

Let's sort the pairs by correlation in descending order.

corrdata = corrdata.sort_values(ascending=False)
corrdata
29   58     1.000000e+00
58   29     1.000000e+00
134  158    1.000000e+00
158  134    1.000000e+00
182  182    1.000000e+00
                ...     
111  229    1.934954e-06
229  111    1.934954e-06
231  150    6.044672e-07
150  231    6.044672e-07
231  123    3.966696e-07
123  231    3.966696e-07
Length: 51529, dtype: float64

Let's keep only the pairs with correlation between 0.85 and 1. Excluding values equal to 1 drops, in particular, each feature's perfect correlation with itself; the pairs displayed as 1.000000 below survived the filter because they are actually slightly below 1 and only rounded up in the display. Note also that every pair still appears twice, once in each order.

corrdata = corrdata[corrdata>0.85]
corrdata = corrdata[corrdata<1]
corrdata
143  135    1.000000
135  143    1.000000
136  128    1.000000
128  136    1.000000
31   62     1.000000
62   31     1.000000
              ...   
66   67     0.851384
67   66     0.851384
61   28     0.851022
28   61     0.851022
72   35     0.850893
35   72     0.850893
Length: 534, dtype: float64
Finally, let's turn this Series into a DataFrame with named columns.

corrdata = pd.DataFrame(corrdata).reset_index()
corrdata.columns = ['features1', 'features2', 'corr_value']
corrdata
     features1  features2  corr_value
0          143        135    1.000000
1          135        143    1.000000
2          136        128    1.000000
3          128        136    1.000000
4           31         62    1.000000
..         ...        ...         ...
529         67         66    0.851384
530         61         28    0.851022
531         28         61    0.851022
532         72         35    0.850893
533         35         72    0.850893

534 rows × 3 columns

Now let's group the correlated features: for each feature not yet assigned to a group, collect every feature correlated with it into one block, so that each feature ends up in at most one group.

grouped_feature_list = []
correlated_groups_list = []
for feature in corrdata.features1.unique():
    if feature not in grouped_feature_list:
        # all pairs in which this feature appears first
        correlated_block = corrdata[corrdata.features1 == feature]
        # mark this feature and all of its partners as grouped
        grouped_feature_list = grouped_feature_list + list(correlated_block.features2.unique()) + [feature]
        correlated_groups_list.append(correlated_block)
len(correlated_groups_list)
len(correlated_groups_list)
56
X_train.shape, X_train_uncorr.shape
((16000, 370), (16000, 103))
Each entry in correlated_groups_list is a small DataFrame describing one group. Let's print them.

for group in correlated_groups_list:
    print(group)
   features1  features2  corr_value
0        143        135         1.0
     features1  features2  corr_value
2          136        128    1.000000
197        136        169    0.959468
   features1  features2  corr_value
4         31         62         1.0
   features1  features2  corr_value
6         20         47         1.0
     features1  features2  corr_value
8           52         23    1.000000
297         52         24    0.927683
299         52         53    0.927683
448         52         21    0.877297
505         52        183    0.860163

... (output truncated: 56 groups in total) ...

     features1  features2  corr_value
525        173        151    0.854991
     features1  features2  corr_value
528         66         67    0.851384

Feature Importance Based on Tree-Based Classifiers

From each correlated group we will keep only the most predictive member: for every group, we train a random forest on just that group's features and record the feature with the highest importance.

important_features = []
for group in correlated_groups_list:
    # all features belonging to this correlated group
    features = list(group.features1.unique()) + list(group.features2.unique())
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train_unique[features], y_train)

    # pair each feature with its importance and keep the top one
    importance = pd.concat([pd.Series(features), pd.Series(rf.feature_importances_)], axis = 1)
    importance.columns = ['features', 'importance']
    importance.sort_values(by = 'importance', ascending = False, inplace = True)
    feat = importance.iloc[0]
    important_features.append(feat)
important_features
important_features
[features      135.00
 importance      0.51
 Name: 1, dtype: float64, features      128.000000
 importance      0.563757
 Name: 1, dtype: float64, features      62.00
 importance     0.51
 Name: 1, dtype: float64,
 ...
 features      153.00
 importance      0.51
 Name: 1, dtype: float64, features      151.00
 importance      0.51
 Name: 1, dtype: float64, features      66.000000
 importance     0.630293
 Name: 0, dtype: float64]

(output truncated; one (feature, importance) pair per group, 56 entries in total)
Let's tidy the result into a DataFrame.

important_features = pd.DataFrame(important_features)
important_features.reset_index(inplace=True, drop = True)
important_features
    features  importance
0      135.0    0.510000
1      128.0    0.563757
2       62.0    0.510000
3       47.0    0.510000
4      183.0    0.285817
..       ...         ...
51      87.0    0.746200
52      86.0    0.447693
53     153.0    0.510000
54     151.0    0.510000
55      66.0    0.630293

56 rows × 2 columns

Let's compute the set of features to discard: every correlated feature except the chosen representatives.

# representatives to keep; stored as floats (e.g. 135.0), which compare
# equal to the integer column labels collected in corr_features
features_to_consider = set(important_features['features'])
features_to_discard = set(corr_features) - set(features_to_consider)
features_to_discard = list(features_to_discard)
X_train_grouped_uncorr = X_train_unique.drop(labels = features_to_discard, axis = 1)

Let's get the shape of the uncorrelated dataset.

X_train_grouped_uncorr.shape
(16000, 140)
X_test_grouped_uncorr = X_test_unique.drop(labels=features_to_discard, axis = 1)
X_test_grouped_uncorr.shape
(4000, 140)
Finally, let's compare all three versions: the grouped-uncorrelated, the original, and the plain uncorrelated dataset.

%%time
run_randomForest(X_train_grouped_uncorr, X_test_grouped_uncorr, y_train, y_test)
Accuracy on test set: 
0.95775
Wall time: 1.01 s
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 
0.9585
Wall time: 1.48 s
%%time
run_randomForest(X_train_uncorr, X_test_uncorr, y_train, y_test)
Accuracy on test set: 
0.95875
Wall time: 891 ms

All three models reach practically the same accuracy, while the reduced feature sets train noticeably faster.