Feature Selection with Filtering Method | Constant, Quasi Constant and Duplicate Feature Removal
Filtering method
Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH
Unnecessary and redundant features not only slow down the training of an algorithm, they can also degrade its performance.
There are several advantages to performing feature selection before training machine learning models:
- Models with fewer features are easier to explain.
- It is easier to implement machine learning models with reduced features.
- Fewer features lead to enhanced generalization which in turn reduces overfitting.
- Feature selection removes data redundancy.
- Training time of models with fewer features is significantly lower.
- Models with fewer features are less prone to errors.
What is the filter method?
Filter methods evaluate features independently of any learning algorithm, so the features they select can be used as input to any machine learning model.
- Univariate -> Fisher score, mutual information gain, variance, etc.
- Multivariate -> Pearson correlation
Univariate filter methods rank individual features according to a specific criterion and then select the top N features. Different ranking criteria can be used, for example the Fisher score, mutual information, or the variance of the feature.
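As a minimal sketch of univariate ranking (this assumes a feature matrix X and target y like the ones loaded later in this lesson, and uses scikit-learn's mutual_info_classif scorer), each feature can be scored on its own and the top N kept:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def top_n_by_mutual_info(X, y, n=10):
    # Score each feature individually against the target
    scores = mutual_info_classif(X, y, random_state=0)
    ranked = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    # Return the names of the n highest-scoring features
    return ranked.head(n).index.tolist()
Because each feature is scored on its own, this ranking ignores redundancy between features; that is exactly what the multivariate methods below address.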
Multivariate filter methods take the mutual relationships between features into account, so they are able to remove redundant features from the data.
Filtering Methods Covered in this Lesson
- Constant Removal
- Quasi Constant Removal
- Duplicate Feature Removal
Download Data Files
Constant Feature Removal
Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold
Now read the dataset with pandas; to keep things fast we initially load only the first 20,000 rows.
data = pd.read_csv('santander.csv', nrows = 20000)
data.head()
ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.170000 | 0 |
1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.030000 | 0 |
2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.770000 | 0 |
3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.970000 | 0 |
4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
5 rows × 371 columns
Let's split the data into the feature matrix X and the target vector y.
X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
((20000, 370), (20000,))
Let's split this dataset into training and test sets using the code below. Here test_size = 0.2 means 20% of the rows are held out for testing and the remaining 80% are used for training the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
Constant Features Removal
First we create a VarianceThreshold filter with threshold=0, which flags features whose variance is zero (i.e., constant features), and fit it on the training set.
constant_filter = VarianceThreshold(threshold=0)
constant_filter.fit(X_train)
VarianceThreshold(threshold=0)
Let's get the number of features left after removing constant features.
constant_filter.get_support().sum()
291
Let's build a boolean mask that is True for the constant features (get_support() returns True for the features that are kept, so we negate it).
constant_list = [not temp for temp in constant_filter.get_support()]
constant_list
[False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, True, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, True, True, False, False, False, False, False, False, False, True, False, False, False, True, False, False, False, False, True, True, False, False, False, False, False, True, False, False, True, False, False, True, False, True, True, False, False, False, False, False, False, True, False, True, False, True, False, False, False, False, False, False, False, True, False, True, False, True, False, True, True, True, True, False, False, False, False, False, False, True, False, False, False, True, False, True, False, True, True, False, False, True, False, True, True, True, False, True, True, False, False, True, False, False, False, False, False, False, False, False, True, True, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, True, False, True, True, False, False, False, False, True, False, True, True, True, False, True, True, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, True, True, True, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False]
Now let's get the names of the constant features.
X.columns[constant_list]
Index(['ind_var2_0', 'ind_var2', 'ind_var13_medio_0', 'ind_var13_medio', 'ind_var18_0', 'ind_var18', 'ind_var27_0', 'ind_var28_0', 'ind_var28', 'ind_var27', 'ind_var34_0', 'ind_var34', 'ind_var41', 'ind_var46_0', 'ind_var46', 'num_var13_medio_0', 'num_var13_medio', 'num_var18_0', 'num_var18', 'num_var27_0', 'num_var28_0', 'num_var28', 'num_var27', 'num_var34_0', 'num_var34', 'num_var41', 'num_var46_0', 'num_var46', 'saldo_var13_medio', 'saldo_var18', 'saldo_var28', 'saldo_var27', 'saldo_var34', 'saldo_var41', 'saldo_var46', 'delta_imp_amort_var18_1y3', 'delta_imp_amort_var34_1y3', 'delta_imp_reemb_var33_1y3', 'delta_imp_trasp_var17_out_1y3', 'delta_imp_trasp_var33_out_1y3', 'delta_num_reemb_var33_1y3', 'delta_num_trasp_var17_out_1y3', 'delta_num_trasp_var33_out_1y3', 'imp_amort_var18_hace3', 'imp_amort_var18_ult1', 'imp_amort_var34_hace3', 'imp_amort_var34_ult1', 'imp_var7_emit_ult1', 'imp_reemb_var13_hace3', 'imp_reemb_var17_hace3', 'imp_reemb_var33_hace3', 'imp_reemb_var33_ult1', 'imp_trasp_var17_in_hace3', 'imp_trasp_var17_out_hace3', 'imp_trasp_var17_out_ult1', 'imp_trasp_var33_in_hace3', 'imp_trasp_var33_out_hace3', 'imp_trasp_var33_out_ult1', 'ind_var7_emit_ult1', 'num_var2_0_ult1', 'num_var2_ult1', 'num_var7_emit_ult1', 'num_meses_var13_medio_ult3', 'num_reemb_var13_hace3', 'num_reemb_var17_hace3', 'num_reemb_var33_hace3', 'num_reemb_var33_ult1', 'num_trasp_var17_in_hace3', 'num_trasp_var17_out_hace3', 'num_trasp_var17_out_ult1', 'num_trasp_var33_in_hace3', 'num_trasp_var33_out_hace3', 'num_trasp_var33_out_ult1', 'saldo_var2_ult1', 'saldo_medio_var13_medio_hace2', 'saldo_medio_var13_medio_hace3', 'saldo_medio_var13_medio_ult1', 'saldo_medio_var13_medio_ult3', 'saldo_medio_var29_hace3'], dtype='object')
Let's transform X_train and X_test to drop the constant features.
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
Let's get the shape of the datasets.
X_train_filter.shape, X_test_filter.shape, X_train.shape
((16000, 291), (4000, 291), (16000, 370))
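For comparison, here is a pandas-only way to spot the constant columns (a minimal sketch, assuming the X_train DataFrame created above): a constant feature is simply one with a single unique value in the training set.
# Minimal sketch: constant columns are those with only one unique value in X_train
constant_cols = [col for col in X_train.columns if X_train[col].nunique() == 1]
len(constant_cols)  # should match the 370 - 291 = 79 features dropped by VarianceThreshold(0)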
Quasi-Constant Feature Removal
Quasi-constant features are almost constant: they take the same value for a large fraction of the observations, so they carry very little information for making predictions.
There is no fixed rule for choosing the threshold, but a common choice is to treat a feature as quasi-constant when about 99% of its values are the same; here we approximate that with a variance threshold of 0.01.
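As an alternative way of expressing that 99% rule, a minimal sketch (using the X_train DataFrame from the split above, since X_train_filter is already a NumPy array at this point): flag a feature as quasi-constant when its most frequent value covers at least 99% of the rows.
# Minimal sketch: quasi-constant features by dominant-value share rather than variance
quasi_constant_cols = [
    col for col in X_train.columns
    if X_train[col].value_counts(normalize=True).iloc[0] >= 0.99
]
len(quasi_constant_cols)
Note that this is not identical to VarianceThreshold(threshold=0.01): the variance criterion depends on the scale of the feature, while the dominant-value criterion does not.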
Let's see how many quasi-constant features there are.
quasi_constant_filter = VarianceThreshold(threshold=0.01)
quasi_constant_filter.fit(X_train_filter)
VarianceThreshold(threshold=0.01)
Let's see how many features are not quasi-constant.
quasi_constant_filter.get_support().sum()
245
291-245
46
To remove the quasi-constant features, we apply the transform method of the fitted filter.
X_train_quasi_filter = quasi_constant_filter.transform(X_train_filter)
X_test_quasi_filter = quasi_constant_filter.transform(X_test_filter)
X_train_quasi_filter.shape, X_test_quasi_filter.shape
((16000, 245), (4000, 245))
370-245
125
In this way, we have reduced the dataset from 370 to 245 features.
Remove Duplicate Features
If two features are exactly the same they are called duplicate features; the second copy provides no new information and only makes the model more complex.
Here we have a problem: unlike constant and quasi-constant removal, scikit-learn does not provide a transformer for duplicate features. So we first transpose the dataset and then use pandas' duplicated() method, which detects duplicate rows, to find the duplicate features.
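For reference, the same duplicates can also be found without transposing, by comparing columns pairwise (a minimal sketch on a generic pandas DataFrame df; it avoids building a transposed copy in memory at the cost of comparing every pair of columns):
# Minimal sketch: detect duplicated columns by pairwise comparison
def duplicated_columns(df):
    dup = set()
    cols = list(df.columns)
    for i in range(len(cols)):
        if cols[i] in dup:
            continue
        for j in range(i + 1, len(cols)):
            if cols[j] not in dup and df[cols[i]].equals(df[cols[j]]):
                dup.add(cols[j])
    return dup
In this lesson we follow the transpose approach shown below, which keeps the code short.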
Let's transpose the training and testing dataset by using following code.
X_train_T = X_train_quasi_filter.T
X_test_T = X_test_quasi_filter.T
type(X_train_T)
numpy.ndarray
Let's convert these NumPy arrays into pandas DataFrames.
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)
Let's check the shapes of the datasets.
X_train_T.shape, X_test_T.shape
((245, 16000), (245, 4000))
Let's go ahead and get the duplicate features.
X_train_T.duplicated().sum()
18
So, here we have 18 duplicated features.
duplicated_features = X_train_T.duplicated()
duplicated_features
0 False 1 False 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False ... 215 False 216 False 217 False 218 False 219 False 220 False 221 False 222 False 223 False 224 False 225 False 226 False 227 False 228 False 229 False 230 False 231 False 232 False 233 False 234 False 235 False 236 False 237 False 238 False 239 False 240 False 241 False 242 False 243 False 244 False Length: 245, dtype: bool
Now let's build a mask that selects the non-duplicated features using the following code.
features_to_keep = [not index for index in duplicated_features]
Let's select the unique features and transpose again to get back the original orientation.
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
Let's check the shape of the datasets.
X_train_unique.shape, X_train.shape
((16000, 227), (16000, 370))
Here we can observe that the original dataset has 370 features, and after removing constant, quasi-constant, and duplicate features we are left with 227.
370-227
143
Build an ML Model and Compare the Performance of the Selected Features
Let's compare model performance between the original dataset and the transformed dataset. Here we are going to build a random forest classifier.
def run_randomForest(X_train, X_test, y_train, y_test):
    # Fit a random forest and report accuracy on the held-out test set
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))
Let's calculate the accuracy.
%%time
run_randomForest(X_train_unique, X_test_unique, y_train, y_test)
Accuracy on test set: 0.95875 Wall time: 2.18 s
Let's check the accuracy of the original dataset.
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 0.9585 Wall time: 2.87 s
Let's compute the percentage reduction in training time (the wall-clock values used in this cell appear to come from a different run than the timings printed above).
(1.51-1.26)*100/1.51
16.556291390728475
Feature Selection with Filtering Method | Correlated Feature Removal
A dataset can also contain correlated features. Two or more features are correlated if they are close to each other in the linear feature space. Correlation between the input features and the target, on the other hand, is desirable: features that are strongly correlated with the output should be retained.
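As a minimal sketch of checking the feature-to-target relationship (using the X and y defined earlier; Pearson correlation against the binary TARGET is only a rough, point-biserial style measure here):
# Minimal sketch: rank features by absolute Pearson correlation with the target
target_corr = X.corrwith(y).abs().sort_values(ascending=False)
print(target_corr.head(10))  # the features most correlated with the target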
Summary
- Feature-to-target correlation is desired
- Feature-to-feature correlation is not desired
- If two features are highly correlated, either one of them is redundant
- Correlation in the feature space increases model complexity
- Removing correlated features can improve model performance and training time
- Different models are affected differently by correlated features
# Visualize pairwise correlations among the remaining features
corrmat = X_train_unique.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corrmat)
def get_correlation(data, threshold):
    # Return the set of column names whose absolute correlation with an
    # earlier column exceeds the threshold
    corr_col = set()
    corrmat = data.corr()
    for i in range(len(corrmat.columns)):
        for j in range(i):
            if abs(corrmat.iloc[i, j]) > threshold:
                colname = corrmat.columns[i]
                corr_col.add(colname)
    return corr_col
corr_features = get_correlation(X_train_unique, 0.85)
corr_features
{5, 7, 9, 11, 12, 14, 15, 16, 17, 18, 23, 24, 28, 29, 30, 32, 33, 35, 36, 38, 42, 46, 47, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 65, 67, 68, 69, 70, 72, 76, 80, 81, 82, 83, 84, 86, 87, 88, 91, 93, 95, 98, 100, 101, 103, 104, 111, 115, 117, 120, 121, 125, 136, 138, 143, 146, 149, 153, 154, 157, 158, 161, 162, 163, 164, 169, 170, 173, 180, 182, 183, 184, 185, 188, 189, 190, 191, 192, 193, 194, 195, 197, 198, 199, 204, 205, 207, 208, 215, 216, 217, 219, 220, 221, 223, 224, 227, 228, 229, 230, 231, 232, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243}
Let's get the length of the correlated features.
len(corr_features)
124
Let's drop the correlated features from the dataset.
X_train_uncorr = X_train_unique.drop(labels=corr_features, axis = 1)
X_test_uncorr = X_test_unique.drop(labels = corr_features, axis = 1)
X_train_uncorr.shape, X_test_uncorr.shape
((16000, 103), (4000, 103))
Let's find out the accuracy and training time of the uncorrelated dataset.
%%time
run_randomForest(X_train_uncorr, X_test_uncorr, y_train, y_test)
Accuracy on test set: 0.95875 Wall time: 912 ms
Now let's find out the accuracy and training time on the original dataset.
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 0.9585 Wall time: 1.53 s
The reduction in training time, in percent:
(1.53-0.912)*100/1.53
40.3921568627451
Feature Grouping and Feature Importance
corrmat
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 235 | 236 | 237 | 238 | 239 | 240 | 241 | 242 | 243 | 244 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.000000 | -0.025277 | -0.001942 | 0.003594 | 0.004054 | -0.001697 | -0.015882 | -0.019807 | 0.000956 | -0.000588 | ... | -0.001337 | 0.002051 | -0.008500 | 0.006554 | 0.005907 | 0.008825 | -0.009174 | 0.012031 | 0.012128 | 0.006612 |
1 | -0.025277 | 1.000000 | -0.007647 | 0.001819 | 0.008981 | 0.009232 | 0.001638 | 0.001746 | 0.000614 | 0.000695 | ... | 0.000544 | 0.000586 | 0.000337 | 0.000550 | 0.000563 | 0.000922 | 0.000598 | 0.000875 | 0.000942 | 0.000415 |
2 | -0.001942 | -0.007647 | 1.000000 | 0.030919 | 0.106245 | 0.109140 | 0.048524 | 0.055708 | 0.004040 | 0.005796 | ... | 0.025522 | 0.020168 | 0.011550 | 0.019325 | 0.019527 | 0.041321 | 0.016172 | 0.043577 | 0.044281 | -0.000810 |
3 | 0.003594 | 0.001819 | 0.030919 | 1.000000 | 0.029418 | 0.024905 | 0.014513 | 0.013857 | -0.000613 | -0.000691 | ... | 0.014032 | -0.000583 | -0.000337 | -0.000548 | -0.000561 | 0.000541 | -0.000577 | 0.000231 | 0.000235 | 0.000966 |
4 | 0.004054 | 0.008981 | 0.106245 | 0.029418 | 1.000000 | 0.888789 | 0.381632 | 0.341266 | 0.012927 | 0.019674 | ... | 0.002328 | 0.016743 | -0.001662 | 0.020509 | 0.021276 | -0.001905 | -0.000635 | -0.002552 | -0.002736 | 0.003656 |
5 | -0.001697 | 0.009232 | 0.109140 | 0.024905 | 0.888789 | 1.000000 | 0.363680 | 0.384820 | 0.017671 | 0.030060 | ... | 0.000328 | 0.010860 | -0.001706 | 0.012963 | 0.013553 | 0.000871 | 0.007096 | -0.001672 | -0.001844 | 0.002257 |
6 | -0.015882 | 0.001638 | 0.048524 | 0.014513 | 0.381632 | 0.363680 | 1.000000 | 0.908158 | 0.030397 | 0.036359 | ... | -0.000485 | 0.006351 | -0.000301 | 0.002590 | 0.003867 | -0.000818 | -0.000515 | -0.000779 | -0.000839 | 0.004448 |
7 | -0.019807 | 0.001746 | 0.055708 | 0.013857 | 0.341266 | 0.384820 | 0.908158 | 1.000000 | 0.047667 | 0.056456 | ... | -0.000514 | 0.006336 | -0.000318 | 0.002476 | 0.003707 | -0.000866 | -0.000545 | -0.000825 | -0.000888 | 0.002427 |
8 | 0.000956 | 0.000614 | 0.004040 | -0.000613 | 0.012927 | 0.017671 | 0.030397 | 0.047667 | 1.000000 | 0.988256 | ... | -0.000184 | -0.000197 | -0.000114 | -0.000185 | -0.000189 | -0.000309 | -0.000195 | -0.000295 | -0.000317 | -0.000739 |
9 | -0.000588 | 0.000695 | 0.005796 | -0.000691 | 0.019674 | 0.030060 | 0.036359 | 0.056456 | 0.988256 | 1.000000 | ... | -0.000207 | -0.000222 | -0.000128 | -0.000208 | -0.000213 | -0.000349 | -0.000220 | -0.000332 | -0.000358 | -0.000811 |
10 | -0.012443 | 0.001517 | 0.042368 | 0.012451 | 0.298916 | 0.280081 | 0.805265 | 0.706608 | 0.388309 | 0.398826 | ... | -0.000452 | -0.000485 | -0.000280 | -0.000213 | -0.000100 | -0.000762 | -0.000480 | -0.000726 | -0.000782 | 0.003341 |
11 | 0.010319 | 0.009097 | 0.096719 | 0.026377 | 0.938409 | 0.824893 | 0.038751 | 0.029445 | 0.002612 | 0.007677 | ... | 0.002699 | 0.015727 | -0.001684 | 0.021203 | 0.021555 | -0.001754 | -0.000494 | -0.002467 | -0.002645 | 0.002290 |
12 | 0.005268 | 0.009360 | 0.098070 | 0.021968 | 0.838953 | 0.943622 | 0.067664 | 0.057591 | 0.002018 | 0.012266 | ... | 0.000540 | 0.009474 | -0.001732 | 0.013133 | 0.013330 | 0.001253 | 0.007871 | -0.001513 | -0.001676 | 0.001570 |
13 | 0.017605 | -0.002511 | 0.082025 | 0.016331 | 0.266746 | 0.254702 | 0.040788 | 0.033996 | 0.108329 | 0.106806 | ... | -0.001670 | 0.034002 | -0.001035 | 0.038103 | 0.038047 | 0.007400 | 0.002248 | 0.006688 | 0.006283 | 0.000707 |
14 | 0.016960 | -0.001086 | 0.095485 | 0.016458 | 0.326051 | 0.359897 | 0.048914 | 0.045136 | 0.081030 | 0.081962 | ... | -0.002040 | 0.025566 | -0.001264 | 0.028641 | 0.028613 | 0.004485 | 0.001176 | 0.004012 | 0.003599 | -0.001992 |
15 | 0.018040 | 0.002426 | 0.106415 | 0.024014 | 0.638412 | 0.565620 | 0.043920 | 0.033716 | 0.083438 | 0.084688 | ... | 0.000009 | 0.033265 | -0.001589 | 0.038978 | 0.039103 | 0.004773 | 0.001465 | 0.003984 | 0.003632 | 0.001339 |
16 | 0.017400 | -0.002401 | 0.081028 | 0.015979 | 0.263482 | 0.252160 | 0.043357 | 0.038548 | 0.214397 | 0.211633 | ... | -0.001661 | 0.033386 | -0.001029 | 0.037417 | 0.037362 | 0.007237 | 0.002188 | 0.006539 | 0.006139 | 0.000614 |
17 | 0.016745 | -0.001019 | 0.095009 | 0.016239 | 0.324417 | 0.358769 | 0.051373 | 0.049260 | 0.160240 | 0.162113 | ... | -0.002037 | 0.025295 | -0.001262 | 0.028341 | 0.028312 | 0.004412 | 0.001147 | 0.003946 | 0.003534 | -0.002038 |
18 | 0.015206 | 0.002629 | 0.110912 | 0.025558 | 0.673593 | 0.599584 | 0.190138 | 0.162168 | 0.152035 | 0.155173 | ... | -0.000074 | 0.032155 | -0.001591 | 0.037742 | 0.037884 | 0.004487 | 0.001333 | 0.003729 | 0.003377 | 0.001910 |
19 | -0.000103 | 0.000519 | 0.016886 | -0.000520 | 0.049579 | 0.042621 | 0.012454 | 0.007797 | -0.000175 | -0.000198 | ... | -0.000156 | -0.000167 | -0.000096 | -0.000156 | -0.000160 | -0.000262 | -0.000165 | -0.000250 | -0.000269 | 0.000213 |
20 | -0.001198 | 0.004590 | 0.107680 | 0.007478 | 0.227803 | 0.238159 | 0.306165 | 0.284353 | 0.125108 | 0.138077 | ... | -0.001365 | -0.001462 | -0.000846 | 0.000768 | 0.001829 | 0.009008 | -0.001449 | 0.009156 | 0.011164 | -0.001227 |
21 | -0.006814 | -0.008975 | -0.105502 | -0.002101 | -0.208030 | -0.211873 | -0.071459 | -0.078593 | -0.012763 | -0.021318 | ... | 0.002674 | -0.049920 | -0.037729 | -0.038529 | -0.041438 | -0.000421 | -0.010257 | 0.002149 | 0.002306 | -0.016447 |
22 | -0.002037 | 0.041015 | -0.102487 | 0.017541 | 0.041167 | 0.041372 | -0.006549 | -0.008179 | 0.003576 | 0.001088 | ... | 0.009114 | -0.012641 | -0.011070 | -0.008324 | -0.009368 | 0.013263 | 0.004114 | 0.013717 | 0.014768 | -0.056029 |
23 | 0.010356 | 0.008019 | 0.107570 | 0.003429 | 0.200514 | 0.182937 | 0.035401 | 0.025512 | 0.012907 | 0.018006 | ... | -0.002401 | 0.031418 | -0.001487 | 0.033291 | 0.034573 | 0.001397 | 0.011918 | -0.001487 | -0.001591 | 0.012346 |
24 | 0.012021 | 0.007439 | 0.101605 | 0.004843 | 0.220673 | 0.201909 | 0.039018 | 0.028469 | 0.014240 | 0.019760 | ... | -0.002227 | 0.034079 | -0.001380 | 0.036066 | 0.037450 | 0.002086 | 0.013156 | -0.001036 | -0.001105 | -0.003767 |
25 | 0.001732 | 0.011525 | 0.273152 | 0.010099 | 0.027387 | 0.026378 | 0.046258 | 0.033114 | -0.003889 | -0.004384 | ... | -0.003450 | 0.020817 | -0.002138 | 0.022279 | 0.023163 | -0.000543 | -0.003662 | -0.001555 | -0.001463 | 0.012034 |
26 | 0.001138 | 0.009467 | 0.231649 | 0.015117 | 0.033757 | 0.037053 | 0.044225 | 0.034049 | -0.003194 | -0.003600 | ... | -0.002834 | 0.026148 | -0.001756 | 0.027806 | 0.028888 | 0.001500 | -0.003008 | 0.000193 | 0.000460 | 0.006643 |
27 | -0.004836 | 0.009771 | 0.299165 | 0.036569 | -0.010411 | -0.013701 | 0.020327 | 0.019508 | -0.003295 | -0.003715 | ... | -0.002924 | -0.003132 | -0.001811 | -0.001902 | -0.001440 | -0.003893 | -0.003103 | -0.003587 | -0.003989 | 0.012240 |
28 | -0.006480 | 0.008796 | 0.241707 | 0.040420 | -0.012628 | -0.018755 | 0.009992 | 0.003331 | -0.002969 | -0.003347 | ... | -0.002634 | -0.002822 | -0.001632 | -0.001507 | -0.000986 | -0.004438 | -0.002796 | -0.004228 | -0.004553 | 0.007400 |
29 | -0.005811 | 0.008676 | 0.237830 | 0.041165 | -0.012035 | -0.018146 | 0.010326 | 0.003592 | -0.002929 | -0.003301 | ... | -0.002599 | -0.002784 | -0.001610 | -0.001456 | -0.000927 | -0.004378 | -0.002758 | -0.004171 | -0.004491 | 0.006121 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
215 | 0.006937 | 0.002152 | 0.043278 | 0.002314 | 0.073627 | 0.084908 | 0.009789 | 0.009653 | 0.000540 | 0.013546 | ... | -0.000644 | 0.003153 | -0.000399 | 0.003511 | 0.003709 | 0.082936 | 0.230452 | 0.029915 | 0.033617 | 0.000338 |
216 | 0.004924 | 0.002210 | 0.045622 | 0.003234 | 0.086904 | 0.093401 | 0.042081 | 0.029906 | 0.000420 | 0.012702 | ... | -0.000662 | 0.003721 | -0.000410 | 0.004129 | 0.004359 | 0.074334 | 0.206808 | 0.026758 | 0.030066 | 0.000244 |
217 | 0.008100 | 0.003979 | 0.149586 | 0.001554 | -0.003401 | -0.004867 | 0.024589 | 0.017603 | -0.001336 | -0.001506 | ... | -0.001185 | 0.011108 | -0.000734 | 0.011385 | 0.011627 | -0.001886 | -0.001258 | -0.001828 | -0.001966 | 0.017276 |
218 | -0.000582 | 0.002581 | 0.093124 | -0.001262 | -0.007050 | -0.006547 | -0.002282 | -0.002111 | -0.000865 | -0.000976 | ... | -0.000768 | 0.012837 | -0.000476 | 0.013076 | 0.013340 | -0.001294 | -0.000815 | -0.001232 | -0.001327 | 0.006644 |
219 | 0.007130 | 0.004811 | 0.178546 | 0.002540 | -0.002079 | -0.004782 | 0.036168 | 0.025934 | -0.001618 | -0.001823 | ... | -0.001435 | 0.010157 | -0.000889 | 0.010430 | 0.010646 | -0.002222 | -0.001523 | -0.002103 | -0.002237 | 0.018092 |
220 | 0.007675 | 0.004879 | 0.179565 | 0.002948 | -0.001151 | -0.003808 | 0.038964 | 0.028107 | -0.001640 | -0.001849 | ... | -0.001455 | 0.009732 | -0.000902 | 0.010011 | 0.010224 | -0.002257 | -0.001545 | -0.002116 | -0.002243 | 0.017579 |
221 | -0.006477 | 0.005759 | 0.178263 | -0.005438 | 0.002963 | -0.001631 | 0.048470 | 0.029154 | -0.001943 | -0.002190 | ... | -0.001724 | -0.001846 | -0.001068 | -0.001729 | -0.001768 | -0.002904 | -0.001829 | -0.002766 | -0.002979 | 0.014736 |
222 | -0.010219 | 0.003183 | 0.094741 | -0.003083 | 0.040691 | 0.027749 | 0.133603 | 0.084716 | -0.001073 | -0.001210 | ... | -0.000952 | -0.001020 | -0.000590 | -0.000958 | -0.000981 | -0.001604 | -0.001011 | -0.001528 | -0.001646 | 0.002052 |
223 | -0.011386 | 0.006355 | 0.200415 | 0.025778 | -0.000914 | -0.005809 | 0.038767 | 0.022672 | -0.002144 | -0.002417 | ... | -0.001903 | -0.002038 | -0.001179 | -0.001788 | -0.001769 | -0.003205 | -0.002019 | -0.003054 | -0.003288 | 0.014980 |
224 | -0.011200 | 0.006248 | 0.195652 | 0.033042 | -0.000322 | -0.005729 | 0.043137 | 0.025504 | -0.002108 | -0.002376 | ... | -0.001871 | -0.002004 | -0.001159 | -0.001801 | -0.001804 | -0.003151 | -0.001985 | -0.003002 | -0.003233 | 0.014628 |
225 | 0.006455 | 0.002629 | 0.125618 | -0.001532 | 0.003267 | 0.004018 | 0.013532 | 0.021485 | -0.000883 | -0.000995 | ... | -0.000783 | -0.000839 | -0.000485 | -0.000788 | -0.000807 | -0.001195 | -0.000831 | -0.001146 | -0.001241 | 0.014567 |
226 | 0.008361 | 0.001482 | 0.059293 | 0.000238 | 0.012429 | 0.010896 | 0.001034 | 0.002878 | -0.000492 | -0.000555 | ... | -0.000437 | -0.000468 | -0.000271 | -0.000439 | -0.000450 | -0.000573 | -0.000463 | -0.000556 | -0.000608 | 0.005688 |
227 | 0.003765 | 0.002827 | 0.135362 | -0.001817 | 0.000824 | 0.001493 | 0.012108 | 0.019415 | -0.000950 | -0.001071 | ... | -0.000843 | -0.000903 | -0.000522 | -0.000848 | -0.000868 | -0.001417 | -0.000895 | -0.001348 | -0.001452 | 0.015351 |
228 | 0.005352 | 0.002770 | 0.132537 | -0.001698 | 0.001272 | 0.001627 | 0.010082 | 0.016429 | -0.000930 | -0.001048 | ... | -0.000825 | -0.000884 | -0.000511 | -0.000830 | -0.000850 | -0.001387 | -0.000876 | -0.001318 | -0.001421 | 0.014485 |
229 | 0.008042 | 0.000356 | 0.023435 | -0.000354 | -0.001629 | -0.001719 | -0.000318 | -0.000337 | -0.000120 | -0.000136 | ... | -0.000107 | 0.002856 | -0.000066 | 0.004306 | 0.004022 | 0.000265 | -0.000113 | 0.000089 | 0.000121 | 0.013197 |
230 | 0.007870 | 0.000338 | 0.022679 | -0.000338 | -0.001669 | -0.001713 | -0.000302 | -0.000320 | -0.000114 | -0.000129 | ... | -0.000101 | -0.000108 | -0.000063 | -0.000102 | -0.000104 | -0.000171 | -0.000107 | -0.000163 | -0.000175 | 0.012842 |
231 | 0.007952 | 0.000411 | 0.025362 | -0.000285 | -0.001677 | -0.001846 | -0.000365 | -0.000387 | -0.000138 | -0.000156 | ... | -0.000123 | 0.007701 | -0.000076 | 0.011513 | 0.010768 | 0.001004 | -0.000130 | 0.000510 | 0.000619 | 0.013321 |
232 | 0.008021 | 0.000408 | 0.025406 | -0.000334 | -0.001730 | -0.001875 | -0.000362 | -0.000383 | -0.000137 | -0.000154 | ... | -0.000121 | 0.006078 | -0.000075 | 0.009101 | 0.008510 | 0.000746 | -0.000129 | 0.000360 | 0.000443 | 0.013418 |
233 | -0.001596 | 0.000391 | 0.013612 | -0.000391 | -0.001930 | -0.001981 | -0.000349 | -0.000370 | -0.000132 | -0.000149 | ... | 0.538270 | -0.000125 | -0.000073 | -0.000118 | -0.000121 | 0.031970 | -0.000124 | 0.068648 | 0.057673 | -0.000203 |
234 | 0.001830 | 0.000453 | 0.023446 | 0.008469 | 0.000833 | -0.000407 | -0.000404 | -0.000428 | -0.000153 | -0.000172 | ... | 0.950232 | -0.000145 | -0.000084 | -0.000136 | -0.000140 | 0.004649 | -0.000144 | 0.010219 | 0.008541 | -0.003446 |
235 | -0.001337 | 0.000544 | 0.025522 | 0.014032 | 0.002328 | 0.000328 | -0.000485 | -0.000514 | -0.000184 | -0.000207 | ... | 1.000000 | -0.000174 | -0.000101 | -0.000164 | -0.000168 | 0.012705 | -0.000173 | 0.027515 | 0.023072 | -0.003399 |
236 | 0.002051 | 0.000586 | 0.020168 | -0.000583 | 0.016743 | 0.010860 | 0.006351 | 0.006336 | -0.000197 | -0.000222 | ... | -0.000174 | 1.000000 | 0.484331 | 0.938668 | 0.953411 | 0.021540 | -0.000185 | 0.012393 | 0.014523 | -0.000773 |
237 | -0.008500 | 0.000337 | 0.011550 | -0.000337 | -0.001662 | -0.001706 | -0.000301 | -0.000318 | -0.000114 | -0.000128 | ... | -0.000101 | 0.484331 | 1.000000 | 0.193281 | 0.225912 | -0.000170 | -0.000107 | -0.000162 | -0.000174 | -0.000402 |
238 | 0.006554 | 0.000550 | 0.019325 | -0.000548 | 0.020509 | 0.012963 | 0.002590 | 0.002476 | -0.000185 | -0.000208 | ... | -0.000164 | 0.938668 | 0.193281 | 1.000000 | 0.998497 | 0.032162 | -0.000174 | 0.018565 | 0.021742 | -0.000525 |
239 | 0.005907 | 0.000563 | 0.019527 | -0.000561 | 0.021276 | 0.013553 | 0.003867 | 0.003707 | -0.000189 | -0.000213 | ... | -0.000168 | 0.953411 | 0.225912 | 0.998497 | 1.000000 | 0.030087 | -0.000178 | 0.017358 | 0.020331 | -0.000589 |
240 | 0.008825 | 0.000922 | 0.041321 | 0.000541 | -0.001905 | 0.000871 | -0.000818 | -0.000866 | -0.000309 | -0.000349 | ... | 0.012705 | 0.021540 | -0.000170 | 0.032162 | 0.030087 | 1.000000 | 0.329805 | 0.935317 | 0.919036 | 0.011106 |
241 | -0.009174 | 0.000598 | 0.016172 | -0.000577 | -0.000635 | 0.007096 | -0.000515 | -0.000545 | -0.000195 | -0.000220 | ... | -0.000173 | -0.000185 | -0.000107 | -0.000174 | -0.000178 | 0.329805 | 1.000000 | 0.127224 | 0.140902 | 0.011807 |
242 | 0.012031 | 0.000875 | 0.043577 | 0.000231 | -0.002552 | -0.001672 | -0.000779 | -0.000825 | -0.000295 | -0.000332 | ... | 0.027515 | 0.012393 | -0.000162 | 0.018565 | 0.017358 | 0.935317 | 0.127224 | 1.000000 | 0.993536 | 0.008604 |
243 | 0.012128 | 0.000942 | 0.044281 | 0.000235 | -0.002736 | -0.001844 | -0.000839 | -0.000888 | -0.000317 | -0.000358 | ... | 0.023072 | 0.014523 | -0.000174 | 0.021742 | 0.020331 | 0.919036 | 0.140902 | 0.993536 | 1.000000 | 0.009136 |
244 | 0.006612 | 0.000415 | -0.000810 | 0.000966 | 0.003656 | 0.002257 | 0.004448 | 0.002427 | -0.000739 | -0.000811 | ... | -0.003399 | -0.000773 | -0.000402 | -0.000525 | -0.000589 | 0.011106 | 0.011807 | 0.008604 | 0.009136 | 1.000000 |
227 rows × 227 columns
Let's unstack the correlation matrix into a long list of pairwise absolute correlations.
corrdata = corrmat.abs().stack()
corrdata
0 0 1.000000 1 0.025277 2 0.001942 3 0.003594 4 0.004054 5 0.001697 6 0.015882 7 0.019807 8 0.000956 9 0.000588 10 0.012443 11 0.010319 12 0.005268 13 0.017605 14 0.016960 15 0.018040 16 0.017400 17 0.016745 18 0.015206 19 0.000103 20 0.001198 21 0.006814 22 0.002037 23 0.010356 24 0.012021 25 0.001732 26 0.001138 27 0.004836 28 0.006480 29 0.005811 ... 244 215 0.000338 216 0.000244 217 0.017276 218 0.006644 219 0.018092 220 0.017579 221 0.014736 222 0.002052 223 0.014980 224 0.014628 225 0.014567 226 0.005688 227 0.015351 228 0.014485 229 0.013197 230 0.012842 231 0.013321 232 0.013418 233 0.000203 234 0.003446 235 0.003399 236 0.000773 237 0.000402 238 0.000525 239 0.000589 240 0.011106 241 0.011807 242 0.008604 243 0.009136 244 1.000000 Length: 51529, dtype: float64
Let's sort the correlation values in descending order.
corrdata = corrdata.sort_values(ascending=False)
corrdata
29 58 1.000000e+00 58 29 1.000000e+00 134 158 1.000000e+00 158 134 1.000000e+00 182 182 1.000000e+00 181 181 1.000000e+00 159 159 1.000000e+00 160 160 1.000000e+00 161 161 1.000000e+00 162 162 1.000000e+00 163 163 1.000000e+00 164 164 1.000000e+00 165 165 1.000000e+00 166 166 1.000000e+00 167 167 1.000000e+00 168 168 1.000000e+00 169 169 1.000000e+00 170 170 1.000000e+00 171 171 1.000000e+00 158 158 1.000000e+00 173 173 1.000000e+00 174 174 1.000000e+00 175 175 1.000000e+00 176 176 1.000000e+00 177 177 1.000000e+00 183 183 1.000000e+00 178 178 1.000000e+00 179 179 1.000000e+00 180 180 1.000000e+00 172 172 1.000000e+00 ... 113 60 8.925381e-06 60 113 8.925381e-06 82 193 8.892757e-06 193 82 8.892757e-06 230 110 8.848510e-06 110 230 8.848510e-06 235 15 8.707147e-06 15 235 8.707147e-06 186 243 7.715459e-06 243 186 7.715459e-06 150 120 7.232908e-06 120 150 7.232908e-06 103 189 5.738723e-06 189 103 5.738723e-06 13 120 5.200500e-06 120 13 5.200500e-06 243 162 3.905074e-06 162 243 3.905074e-06 186 126 3.594093e-06 126 186 3.594093e-06 159 242 2.877380e-06 242 159 2.877380e-06 107 68 2.392837e-06 68 107 2.392837e-06 111 229 1.934954e-06 229 111 1.934954e-06 231 150 6.044672e-07 150 231 6.044672e-07 231 123 3.966696e-07 123 231 3.966696e-07 Length: 51529, dtype: float64
Let's keep the pairs whose absolute correlation lies between 0.85 and 1, which excludes the self-correlations on the diagonal.
corrdata = corrdata[corrdata>0.85]
corrdata = corrdata[corrdata<1]
corrdata
143 135 1.000000 135 143 1.000000 136 128 1.000000 128 136 1.000000 31 62 1.000000 62 31 1.000000 20 47 1.000000 47 20 1.000000 52 23 1.000000 23 52 1.000000 53 24 1.000000 24 53 1.000000 33 69 1.000000 69 33 1.000000 157 133 1.000000 133 157 1.000000 237 149 1.000000 149 237 1.000000 154 132 1.000000 132 154 1.000000 146 230 0.999997 230 146 0.999997 238 122 0.999945 122 238 0.999945 148 149 0.999929 149 148 0.999929 237 148 0.999929 148 237 0.999929 231 232 0.999892 232 231 0.999892 ... 183 52 0.860163 52 183 0.860163 183 23 0.860163 23 183 0.860163 79 195 0.859806 195 79 0.859806 8 193 0.859270 193 8 0.859270 29 61 0.858830 61 29 0.858830 58 0.858830 58 61 0.858830 84 77 0.858529 77 84 0.858529 83 189 0.858484 189 83 0.858484 84 194 0.857731 194 84 0.857731 76 190 0.857717 190 76 0.857717 151 173 0.854991 173 151 0.854991 41 163 0.852233 163 41 0.852233 66 67 0.851384 67 66 0.851384 61 28 0.851022 28 61 0.851022 72 35 0.850893 35 72 0.850893 Length: 534, dtype: float64
corrdata = pd.DataFrame(corrdata).reset_index()
corrdata.columns = ['features1', 'features2', 'corr_value']
corrdata
features1 | features2 | corr_value | |
---|---|---|---|
0 | 143 | 135 | 1.000000 |
1 | 135 | 143 | 1.000000 |
2 | 136 | 128 | 1.000000 |
3 | 128 | 136 | 1.000000 |
4 | 31 | 62 | 1.000000 |
5 | 62 | 31 | 1.000000 |
6 | 20 | 47 | 1.000000 |
7 | 47 | 20 | 1.000000 |
8 | 52 | 23 | 1.000000 |
9 | 23 | 52 | 1.000000 |
10 | 53 | 24 | 1.000000 |
11 | 24 | 53 | 1.000000 |
12 | 33 | 69 | 1.000000 |
13 | 69 | 33 | 1.000000 |
14 | 157 | 133 | 1.000000 |
15 | 133 | 157 | 1.000000 |
16 | 237 | 149 | 1.000000 |
17 | 149 | 237 | 1.000000 |
18 | 154 | 132 | 1.000000 |
19 | 132 | 154 | 1.000000 |
20 | 146 | 230 | 0.999997 |
21 | 230 | 146 | 0.999997 |
22 | 238 | 122 | 0.999945 |
23 | 122 | 238 | 0.999945 |
24 | 148 | 149 | 0.999929 |
25 | 149 | 148 | 0.999929 |
26 | 237 | 148 | 0.999929 |
27 | 148 | 237 | 0.999929 |
28 | 231 | 232 | 0.999892 |
29 | 232 | 231 | 0.999892 |
... | ... | ... | ... |
504 | 183 | 52 | 0.860163 |
505 | 52 | 183 | 0.860163 |
506 | 183 | 23 | 0.860163 |
507 | 23 | 183 | 0.860163 |
508 | 79 | 195 | 0.859806 |
509 | 195 | 79 | 0.859806 |
510 | 8 | 193 | 0.859270 |
511 | 193 | 8 | 0.859270 |
512 | 29 | 61 | 0.858830 |
513 | 61 | 29 | 0.858830 |
514 | 61 | 58 | 0.858830 |
515 | 58 | 61 | 0.858830 |
516 | 84 | 77 | 0.858529 |
517 | 77 | 84 | 0.858529 |
518 | 83 | 189 | 0.858484 |
519 | 189 | 83 | 0.858484 |
520 | 84 | 194 | 0.857731 |
521 | 194 | 84 | 0.857731 |
522 | 76 | 190 | 0.857717 |
523 | 190 | 76 | 0.857717 |
524 | 151 | 173 | 0.854991 |
525 | 173 | 151 | 0.854991 |
526 | 41 | 163 | 0.852233 |
527 | 163 | 41 | 0.852233 |
528 | 66 | 67 | 0.851384 |
529 | 67 | 66 | 0.851384 |
530 | 61 | 28 | 0.851022 |
531 | 28 | 61 | 0.851022 |
532 | 72 | 35 | 0.850893 |
533 | 35 | 72 | 0.850893 |
534 rows × 3 columns
Let's group the correlated features: for each feature that is not yet assigned to a group, collect all the features correlated with it into one block.
grouped_feature_list = []
correlated_groups_list = []
for feature in corrdata.features1.unique():
    if feature not in grouped_feature_list:
        # All pairs in which this feature appears as features1
        correlated_block = corrdata[corrdata.features1 == feature]
        grouped_feature_list = grouped_feature_list + list(correlated_block.features2.unique()) + [feature]
        correlated_groups_list.append(correlated_block)
len(correlated_groups_list)
56
X_train.shape, X_train_uncorr.shape
((16000, 370), (16000, 103))
for group in correlated_groups_list:
    print(group)
features1 features2 corr_value 0 143 135 1.0 features1 features2 corr_value 2 136 128 1.000000 197 136 169 0.959468 features1 features2 corr_value 4 31 62 1.0 features1 features2 corr_value 6 20 47 1.0 features1 features2 corr_value 8 52 23 1.000000 297 52 24 0.927683 299 52 53 0.927683 448 52 21 0.877297 505 52 183 0.860163 features1 features2 corr_value 12 33 69 1.000000 224 33 32 0.947113 228 33 68 0.946571 322 33 26 0.917665 337 33 55 0.914178 422 33 184 0.884383 features1 features2 corr_value 14 157 133 1.0 features1 features2 corr_value 16 237 149 1.000000 26 237 148 0.999929 features1 features2 corr_value 18 154 132 1.0 features1 features2 corr_value 20 146 230 0.999997 36 146 229 0.999778 59 146 231 0.997052 68 146 232 0.996772 76 146 113 0.996424 89 146 120 0.993307 245 146 170 0.944314 features1 features2 corr_value 22 238 122 0.999945 49 238 239 0.998497 264 238 236 0.938668 features1 features2 corr_value 34 82 78 0.999859 features1 features2 corr_value 40 108 115 0.999478 97 108 219 0.992870 115 108 125 0.987333 142 108 220 0.982474 280 108 217 0.933815 features1 features2 corr_value 46 199 197 0.998753 362 199 196 0.905699 371 199 198 0.904341 features1 features2 corr_value 50 181 208 0.997718 345 181 205 0.911453 467 181 207 0.871801 features1 features2 corr_value 72 17 14 0.996739 396 17 16 0.890442 408 17 13 0.888669 features1 features2 corr_value 86 242 243 0.993536 122 242 126 0.986744 276 242 240 0.935317 features1 features2 corr_value 92 28 57 0.993186 124 28 58 0.986371 126 28 29 0.986371 185 28 185 0.964067 381 28 27 0.901032 399 28 30 0.889321 531 28 61 0.851022 features1 features2 corr_value 94 51 22 0.992882 385 51 182 0.899063 features1 features2 corr_value 100 44 46 0.990593 377 44 98 0.902736 410 44 95 0.888337 features1 features2 corr_value 102 77 81 0.989793 461 77 80 0.874240 517 77 84 0.858529 features1 features2 corr_value 104 109 223 0.989341 151 109 224 0.980951 356 109 221 0.907987 413 109 111 0.887721 features1 features2 corr_value 112 9 8 0.988256 417 9 193 0.886955 444 9 192 0.878045 features1 features2 corr_value 116 227 228 0.987304 188 227 225 0.962657 features1 features2 corr_value 118 116 117 0.987013 features1 features2 corr_value 128 91 49 0.985951 features1 features2 corr_value 130 54 25 0.985875 419 54 100 0.886309 features1 features2 corr_value 134 76 75 0.984751 353 76 74 0.908497 477 76 191 0.870551 522 76 190 0.857717 features1 features2 corr_value 136 38 35 0.984077 261 38 34 0.940390 306 38 36 0.922699 496 38 72 0.864661 features1 features2 corr_value 138 18 15 0.983164 465 18 16 0.872133 470 18 13 0.870936 features1 features2 corr_value 140 215 107 0.983156 146 215 216 0.981815 features1 features2 corr_value 161 56 61 0.976942 187 56 27 0.962726 211 56 30 0.953194 features1 features2 corr_value 164 162 163 0.975002 288 162 161 0.930635 369 162 164 0.904702 463 162 41 0.874083 features1 features2 corr_value 166 102 103 0.974341 features1 features2 corr_value 168 83 79 0.973140 263 83 188 0.938960 273 83 84 0.936080 315 83 194 0.919405 351 83 80 0.910385 518 83 189 0.858484 features1 features2 corr_value 174 70 72 0.972088 500 70 35 0.862850 features1 features2 corr_value 180 59 60 0.968504 features1 features2 corr_value 207 195 189 0.956666 313 195 80 0.920961 330 195 194 0.916442 378 195 84 0.902276 428 195 188 0.882312 509 195 79 0.859806 features1 features2 corr_value 216 235 234 0.950232 349 235 106 0.911179 features1 features2 corr_value 220 10 104 0.948845 features1 features2 corr_value 234 180 179 0.945288 features1 features2 
corr_value 236 241 151 0.944812 features1 features2 corr_value 243 42 41 0.944451 415 42 161 0.887059 503 42 164 0.861507 features1 features2 corr_value 248 12 5 0.943622 434 12 11 0.881673 features1 features2 corr_value 266 4 11 0.938409 402 4 5 0.888789 features1 features2 corr_value 274 93 92 0.935867 features1 features2 corr_value 290 89 121 0.928898 features1 features2 corr_value 304 88 87 0.924 features1 features2 corr_value 318 174 204 0.918533 features1 features2 corr_value 333 50 21 0.916137 features1 features2 corr_value 354 6 7 0.908158 features1 features2 corr_value 372 64 65 0.904095 488 64 87 0.866430 features1 features2 corr_value 374 101 86 0.903641 394 101 40 0.892951 features1 features2 corr_value 390 131 153 0.89633 features1 features2 corr_value 525 173 151 0.854991 features1 features2 corr_value 528 66 67 0.851384
Feature Importance Based on Tree-Based Classifiers
For each correlated group, let's fit a random forest on just that group of features and keep the feature with the highest importance.
important_features = []
for group in correlated_groups_list:
    features = list(group.features1.unique()) + list(group.features2.unique())
    # Fit a random forest on this correlated group only
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train_unique[features], y_train)
    # Rank the features in the group by importance and keep the best one
    importance = pd.concat([pd.Series(features), pd.Series(rf.feature_importances_)], axis = 1)
    importance.columns = ['features', 'importance']
    importance.sort_values(by = 'importance', ascending = False, inplace = True)
    feat = importance.iloc[0]
    important_features.append(feat)
important_features
[features 135.00 importance 0.51 Name: 1, dtype: float64, features 128.000000 importance 0.563757 Name: 1, dtype: float64, features 62.00 importance 0.51 Name: 1, dtype: float64, features 47.00 importance 0.51 Name: 1, dtype: float64, features 183.000000 importance 0.285817 Name: 5, dtype: float64, features 184.00000 importance 0.34728 Name: 6, dtype: float64, features 157.00 importance 0.34 Name: 0, dtype: float64, features 148.000000 importance 0.505844 Name: 2, dtype: float64, features 132.00 importance 0.39 Name: 1, dtype: float64, features 120.000000 importance 0.749683 Name: 6, dtype: float64, features 122.00 importance 0.34 Name: 1, dtype: float64, features 82.000000 importance 0.518827 Name: 0, dtype: float64, features 125.000000 importance 0.940524 Name: 3, dtype: float64, features 197.000000 importance 0.289727 Name: 1, dtype: float64, features 207.000000 importance 0.312834 Name: 3, dtype: float64, features 17.000000 importance 0.286833 Name: 0, dtype: float64, features 243.000000 importance 0.431557 Name: 1, dtype: float64, features 185.000000 importance 0.391367 Name: 4, dtype: float64, features 182.000000 importance 0.432045 Name: 2, dtype: float64, features 95.000000 importance 0.487162 Name: 3, dtype: float64, features 84.000000 importance 0.299008 Name: 3, dtype: float64, features 221.00000 importance 0.28555 Name: 3, dtype: float64, features 8.000000 importance 0.345509 Name: 1, dtype: float64, features 228.000000 importance 0.434186 Name: 1, dtype: float64, features 117.000000 importance 0.517013 Name: 1, dtype: float64, features 49.000000 importance 0.500161 Name: 1, dtype: float64, features 100.000000 importance 0.386775 Name: 2, dtype: float64, features 191.000000 importance 0.345104 Name: 3, dtype: float64, features 34.000000 importance 0.283901 Name: 2, dtype: float64, features 15.000000 importance 0.400677 Name: 1, dtype: float64, features 107.000000 importance 0.349126 Name: 1, dtype: float64, features 61.000000 importance 0.323735 Name: 1, dtype: float64, features 41.000000 importance 0.386338 Name: 4, dtype: float64, features 102.000000 importance 0.508955 Name: 0, dtype: float64, features 189.000000 importance 0.229269 Name: 6, dtype: float64, features 72.000000 importance 0.490102 Name: 1, dtype: float64, features 60.00000 importance 0.50052 Name: 1, dtype: float64, features 79.000000 importance 0.213903 Name: 6, dtype: float64, features 234.000000 importance 0.445719 Name: 1, dtype: float64, features 104.000000 importance 0.640915 Name: 1, dtype: float64, features 179.000000 importance 0.634779 Name: 1, dtype: float64, features 151.00 importance 0.51 Name: 1, dtype: float64, features 161.000000 importance 0.346426 Name: 2, dtype: float64, features 5.000000 importance 0.356386 Name: 1, dtype: float64, features 5.000000 importance 0.403831 Name: 2, dtype: float64, features 93.000000 importance 0.544349 Name: 0, dtype: float64, features 121.00 importance 0.51 Name: 1, dtype: float64, features 87.000000 importance 0.553622 Name: 1, dtype: float64, features 174.000000 importance 0.743723 Name: 0, dtype: float64, features 50.000000 importance 0.616659 Name: 0, dtype: float64, features 7.000000 importance 0.545702 Name: 1, dtype: float64, features 87.0000 importance 0.7462 Name: 2, dtype: float64, features 86.000000 importance 0.447693 Name: 1, dtype: float64, features 153.00 importance 0.51 Name: 1, dtype: float64, features 151.00 importance 0.51 Name: 1, dtype: float64, features 66.000000 importance 0.630293 Name: 0, dtype: float64]
important_features = pd.DataFrame(important_features)
important_features.reset_index(inplace=True, drop = True)
important_features
features | importance | |
---|---|---|
0 | 135.0 | 0.510000 |
1 | 128.0 | 0.563757 |
2 | 62.0 | 0.510000 |
3 | 47.0 | 0.510000 |
4 | 183.0 | 0.285817 |
5 | 184.0 | 0.347280 |
6 | 157.0 | 0.340000 |
7 | 148.0 | 0.505844 |
8 | 132.0 | 0.390000 |
9 | 120.0 | 0.749683 |
10 | 122.0 | 0.340000 |
11 | 82.0 | 0.518827 |
12 | 125.0 | 0.940524 |
13 | 197.0 | 0.289727 |
14 | 207.0 | 0.312834 |
15 | 17.0 | 0.286833 |
16 | 243.0 | 0.431557 |
17 | 185.0 | 0.391367 |
18 | 182.0 | 0.432045 |
19 | 95.0 | 0.487162 |
20 | 84.0 | 0.299008 |
21 | 221.0 | 0.285550 |
22 | 8.0 | 0.345509 |
23 | 228.0 | 0.434186 |
24 | 117.0 | 0.517013 |
25 | 49.0 | 0.500161 |
26 | 100.0 | 0.386775 |
27 | 191.0 | 0.345104 |
28 | 34.0 | 0.283901 |
29 | 15.0 | 0.400677 |
30 | 107.0 | 0.349126 |
31 | 61.0 | 0.323735 |
32 | 41.0 | 0.386338 |
33 | 102.0 | 0.508955 |
34 | 189.0 | 0.229269 |
35 | 72.0 | 0.490102 |
36 | 60.0 | 0.500520 |
37 | 79.0 | 0.213903 |
38 | 234.0 | 0.445719 |
39 | 104.0 | 0.640915 |
40 | 179.0 | 0.634779 |
41 | 151.0 | 0.510000 |
42 | 161.0 | 0.346426 |
43 | 5.0 | 0.356386 |
44 | 5.0 | 0.403831 |
45 | 93.0 | 0.544349 |
46 | 121.0 | 0.510000 |
47 | 87.0 | 0.553622 |
48 | 174.0 | 0.743723 |
49 | 50.0 | 0.616659 |
50 | 7.0 | 0.545702 |
51 | 87.0 | 0.746200 |
52 | 86.0 | 0.447693 |
53 | 153.0 | 0.510000 |
54 | 151.0 | 0.510000 |
55 | 66.0 | 0.630293 |
Let's get the features to be discarded: all correlated features except the most important one from each group.
features_to_consider = set(important_features['features'])
features_to_discard = set(corr_features) - set(features_to_consider)
features_to_discard = list(features_to_discard)
X_train_grouped_uncorr = X_train_unique.drop(labels = features_to_discard, axis = 1)
Let's get the shape of the uncorrelated dataset.
X_train_grouped_uncorr.shape
(16000, 140)
X_test_grouped_uncorr = X_test_unique.drop(labels=features_to_discard, axis = 1)
X_test_grouped_uncorr.shape
(4000, 140)
%%time
run_randomForest(X_train_grouped_uncorr, X_test_grouped_uncorr, y_train, y_test)
Accuracy on test set: 0.95775 Wall time: 1.01 s
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 0.9585 Wall time: 1.48 s
%%time
run_randomForest(X_train_uncorr, X_test_uncorr, y_train, y_test)
Accuracy on test set: 0.95875 Wall time: 891 ms