# Feature Selection Based on Univariate (ANOVA) Test for Classification | Machine Learning | KGP Talkie

## Feature Selection Based on Univariate (ANOVA) Test for Classification

Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH

### What is Univariate (ANOVA) Test

The elimination process aims to reduce the size of the input feature set and at the same time to retain the class discriminatory information for classification problems.

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis.

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyze the differences among group means in a sample.

The F-test is used for comparing the factors of the total deviation. For example, in one-way, or single-factor ANOVA, statistical significance is tested for by comparing the F test statistic

### Classification Problem

Importing required libraries

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```
```from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import f_classif, f_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile```

```data = pd.read_csv('train.csv', nrows = 20000)

5 rows Ă— 371 columns

```X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
```
`((20000, 370), (20000,))`

Now train, test and split the data with test size equals to `0.2`.

```X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
```

### Remove Constant, Quasi Constant, and Correlated Features

These are the filters that are almost `constant` or `quasi constant` in other words these features have same values for `large subset` of outputs and such features are not very useful for making `predictions`

Let's remove the constant and quasi constant features with the following code.

```constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
```

Let's get the shape of the filtered data.

```X_train_filter.shape, X_test_filter.shape
```
`((16000, 245), (4000, 245))`

If two features are exactly same those are called as `duplicate features` that means these features doesn't provide any `new information` and makes our model complex.

Here we have a problem as we did in `quasi constant` and `constant` removal sklearn doesn't have direct library to handle with duplicate features .

So, first we will do `transpose` the dataset and then `python` have a method to remove duplicate features.

Let's remove the duplicated features by using the following code.

```X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
```
```X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)
```

Let's get the total number of duplicated features.

```X_train_T.duplicated().sum()
```
`18`
```duplicated_features = X_train_T.duplicated()
```

Now, we will find out the non duplicated features from the dataset.

```features_to_keep = [not index for index in duplicated_features]
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
```
```X_train_unique.shape, X_train.shape
```
`((16000, 227), (16000, 370))`

#### Now do F-Test

```sel = f_classif(X_train_unique, y_train)
sel
```
```(array([3.42911520e-01, 1.22929093e+00, 1.61291330e+02, 4.01025132e-01,
8.37661151e-01, 2.39279390e-03, 4.41633351e-02, 1.36337510e-01,
1.84647123e+00, 2.03640367e+00, 7.98057954e-03, 1.14063993e+00,
6.32266614e-03, 1.55626237e+01, 1.53553790e+01, 1.28615978e+01,
1.61834746e+01, 1.59638013e+01, 1.21977511e+01, 9.03776687e-02,
1.00443179e+00, 1.53946148e+01, 2.50428951e+02, 2.98696944e+01,
1.06266841e+01, 2.63630437e+01, 1.66417611e+01, 3.13699473e+01,
2.47256550e+01, 2.60021376e+01, 3.26742018e+01, 9.94259060e+00,
1.48208220e+01, 1.50040146e+01, 1.34739830e+01, 7.03118653e+00,
1.36234772e+01, 7.95962134e+00, 3.15161070e+02, 1.79631284e+00,
1.66910747e+00, 1.21138302e+01, 1.10928892e+01, 1.00443179e+00,
2.31851572e+00, 8.93973153e+01, 7.53868668e+00, 2.38490562e+02,
2.98696944e+01, 1.06266841e+01, 2.61694409e+01, 1.66053267e+01,
2.93013259e+01, 2.44433356e+01, 2.60021376e+01, 5.59623841e+00,
5.65080530e+00, 3.11715028e+01, 9.94259060e+00, 6.69237272e-01,
6.73931889e-01, 5.91355150e-01, 2.16653744e+00, 1.57036464e+00,
1.48180592e+01, 1.50040146e+01, 4.10147572e+00, 5.08119829e+00,
2.86061739e-01, 4.74076524e-04, 3.22895933e-02, 3.61497992e+00,
2.62641383e-01, 1.44465136e+00, 2.39577575e+00, 3.25151692e+00,
2.66120176e-01, 1.33584657e+00, 2.15986976e+00, 2.95680783e+01,
2.74320562e+02, 1.79136749e+00, 1.65942415e+00, 4.55732338e-01,
8.03423196e+01, 5.33753163e+00, 3.43569515e+00, 5.38991827e+00,
6.48705021e+00, 1.14907051e+01, 2.46676043e+02, 1.48964854e+00,
1.48528608e+00, 1.35499717e+00, 5.04105291e+00, 8.00857735e-02,
5.92081628e-01, 7.49538059e+00, 1.43768803e+01, 3.96797511e+00,
1.84630418e+01, 5.93034025e-01, 6.23117305e-02, 1.32846978e-01,
7.36058444e+00, 4.67453255e-01, 6.53434886e-01, 2.32603599e+01,
8.82160365e-02, 4.03681937e-01, 1.12281656e-01, 1.22229167e+00,
9.50849020e+00, 3.31504999e-01, 1.52799424e+02, 9.58201843e-01,
3.81283407e-01, 8.05456673e-01, 2.11768899e-01, 4.23427422e-02,
4.23427422e-02, 4.23427422e-02, 4.23675848e-01, 9.58201843e-01,
8.05456673e-01, 4.23675848e-01, 7.83475034e+00, 7.84514734e-01,
4.28901812e-02, 1.44260945e-01, 4.33508271e-02, 4.23427422e-02,
3.34880062e-02, 1.90957786e-01, 4.06328805e-01, 1.70136127e-01,
4.23427422e-02, 5.36587189e-01, 1.87563339e+00, 4.23427422e-02,
4.23427422e-02, 4.23427422e-02, 1.25864897e-01, 1.50227029e-01,
7.58252261e-01, 3.69870284e-01, 6.31366809e-02, 1.39484806e+00,
5.24649450e+00, 8.74444426e-02, 1.20564528e+01, 1.08123286e+00,
8.46910021e-02, 2.36606015e-01, 5.89389684e+00, 2.77252663e-01,
4.15074036e-01, 1.44558159e-01, 1.17723957e+00, 9.22407334e-01,
1.45895164e+01, 1.86656969e+00, 5.43234215e+00, 1.86971763e-02,
3.09123385e+02, 7.12088878e+00, 1.49660894e+01, 2.43275497e+01,
4.52466899e+00, 2.03980835e-01, 5.87673213e-03, 4.98543138e-02,
5.16359722e-02, 1.09646850e-01, 2.06155459e+00, 2.99184059e+00,
2.21995621e-02, 1.13858713e-01, 1.14255501e+01, 1.13785982e+01,
1.19082872e+01, 1.18528440e+01, 2.65465286e-02, 1.52894509e-01,
4.63685902e+00, 2.10080736e+00, 1.65523608e-01, 2.16891078e-01,
1.40302586e+00, 5.48359285e-01, 6.35218588e-02, 4.88987865e+00,
2.49656443e+00, 4.58216058e+00, 4.15099427e+00, 4.56305342e-01,
1.66491238e-01, 3.90777488e-01, 3.50953637e-01, 5.52484208e+00,
2.37194124e+00, 7.35792170e+00, 7.47930913e+00, 1.19139338e+01,
3.63667170e+00, 1.46817492e+01, 1.40921857e+01, 2.55113543e+00,
7.93363123e-01, 2.95584767e+00, 2.83339311e+00, 4.73780486e-02,
4.26696894e-02, 6.24420202e-02, 6.13788649e-02, 5.70774760e-02,
7.65160310e-02, 1.10327676e-01, 1.26598304e-01, 4.23427422e-02,
1.11726086e-01, 1.17106404e-01, 3.13117156e-01, 1.24267517e-01,
2.84184735e-01, 3.29540269e-01, 1.12297080e+01]),
array([5.58161700e-01, 2.67561647e-01, 8.89333290e-37, 5.26569363e-01,
3.60080335e-01, 9.60986695e-01, 8.33552698e-01, 7.11954403e-01,
1.74213527e-01, 1.53591870e-01, 9.28817521e-01, 2.85533263e-01,
9.36623841e-01, 8.01575252e-05, 8.94375507e-05, 3.36393721e-04,
5.77577141e-05, 6.48544590e-05, 4.79763179e-04, 7.63701483e-01,
3.16255673e-01, 8.76012543e-05, 5.56578484e-56, 4.68990120e-08,
1.11700314e-03, 2.86219940e-07, 4.53647534e-05, 2.16766394e-08,
6.67830586e-07, 3.44933857e-07, 1.10916535e-08, 1.61796584e-03,
1.18682969e-04, 1.07709938e-04, 2.42680916e-04, 8.01812206e-03,
2.24116226e-04, 4.78913410e-03, 7.66573763e-70, 1.80177928e-01,
1.96396787e-01, 5.01825968e-04, 8.68554202e-04, 3.16255673e-01,
1.27861727e-01, 3.66783202e-21, 6.04554908e-03, 2.03825983e-53,
4.68990120e-08, 1.11700314e-03, 3.16348432e-07, 4.62436764e-05,
6.28457802e-08, 7.73029885e-07, 3.44933857e-07, 1.80109375e-02,
1.74590458e-02, 2.40048097e-08, 1.61796584e-03, 4.13329839e-01,
4.11696353e-01, 4.41906921e-01, 1.41063166e-01, 2.10172382e-01,
1.18856798e-04, 1.07709938e-04, 4.28623726e-02, 2.42001211e-02,
5.92762818e-01, 9.82629065e-01, 8.57395823e-01, 5.72793629e-02,
6.08318344e-01, 2.29405921e-01, 1.21683164e-01, 7.13761984e-02,
6.05953475e-01, 2.47785024e-01, 1.41676361e-01, 5.47783585e-08,
4.18717532e-61, 1.80778657e-01, 1.97699723e-01, 4.99635011e-01,
3.49020462e-19, 2.08836652e-02, 6.38201266e-02, 2.02659144e-02,
1.08755946e-02, 7.01140336e-04, 3.55791184e-55, 2.22288988e-01,
2.22967265e-01, 2.44423761e-01, 2.47670413e-02, 7.77184868e-01,
4.41626654e-01, 6.19258331e-03, 1.50177954e-04, 4.63904749e-02,
1.74254389e-05, 4.41259644e-01, 8.02881978e-01, 7.15503084e-01,
6.67404495e-03, 4.94171032e-01, 4.18899308e-01, 1.42788438e-06,
7.66461321e-01, 5.25202939e-01, 7.37565700e-01, 2.68928005e-01,
2.04871607e-03, 5.64782330e-01, 6.11812415e-35, 3.27655140e-01,
5.36925966e-01, 3.69480411e-01, 6.45390733e-01, 8.36970444e-01,
8.36970444e-01, 8.36970444e-01, 5.15117892e-01, 3.27655140e-01,
3.69480411e-01, 5.15117892e-01, 5.13125866e-03, 3.75777233e-01,
8.35934829e-01, 7.04086312e-01, 8.35068733e-01, 8.36970444e-01,
8.54802468e-01, 6.62126552e-01, 5.23847887e-01, 6.79996382e-01,
8.36970444e-01, 4.63861289e-01, 1.70850496e-01, 8.36970444e-01,
8.36970444e-01, 8.36970444e-01, 7.22763262e-01, 6.98323652e-01,
3.83889093e-01, 5.43083617e-01, 8.01608490e-01, 2.37605638e-01,
2.20039777e-02, 7.67455348e-01, 5.17497373e-04, 2.98437634e-01,
7.71041949e-01, 6.26674911e-01, 1.52044006e-02, 5.98514898e-01,
5.19414532e-01, 7.03796034e-01, 2.77935032e-01, 3.36858193e-01,
1.34160887e-04, 1.71887678e-01, 1.97795129e-02, 8.91239914e-01,
1.49493801e-68, 7.62677506e-03, 1.09894509e-04, 8.20840983e-07,
3.34247932e-02, 6.51532760e-01, 9.38895107e-01, 8.23319823e-01,
8.20243529e-01, 7.40550887e-01, 1.51075546e-01, 8.37042875e-02,
8.81559318e-01, 7.35797526e-01, 7.26140977e-04, 7.44713683e-04,
5.60290090e-04, 5.77206713e-04, 8.70574777e-01, 6.95789674e-01,
3.13071224e-02, 1.47240985e-01, 6.84126595e-01, 6.41425380e-01,
2.36235196e-01, 4.58999733e-01, 8.01016915e-01, 2.70286705e-02,
1.14114721e-01, 3.23215275e-02, 4.16265059e-02, 4.99365521e-01,
6.83254631e-01, 5.31899920e-01, 5.53582168e-01, 1.87603701e-02,
1.23553129e-01, 6.68393143e-03, 6.24807382e-03, 5.58595551e-04,
5.65377099e-02, 1.27759206e-04, 1.74681308e-04, 1.10234773e-01,
3.73098519e-01, 8.55867690e-02, 9.23426211e-02, 8.27692918e-01,
8.36351110e-01, 8.02680255e-01, 8.04332893e-01, 8.11179263e-01,
7.82079079e-01, 7.39775741e-01, 7.21990132e-01, 8.36970444e-01,
7.38191886e-01, 7.32198763e-01, 5.75781477e-01, 7.24455979e-01,
5.93978825e-01, 5.65937990e-01, 8.06846135e-04]))```

Let's arrange the values in ascending order.

```p_values = pd.Series(sel[1])
p_values.index = X_train_unique.columns
p_values.sort_values(ascending = True, inplace = True)
```

Let's plot the pvalues with respect to the features using the bar plot.

```p_values.plot.bar(figsize = (16, 5))
plt.title('pvalues with respect to features')
plt.show()
```
```p_values = p_values[p_values<0.05]
```
```p_values.index
```
```Int64Index([ 40, 182,  86,  22, 101,  51,   2, 127,  49,  91,  30,  27,  61,
52,  23,  85,  56,  25,  54,  29,  58,  28,  57, 185, 119, 111,
26,  55,  16,  17,  13,  21,  14,  69,  33, 184,  32,  68, 223,
178, 109, 224,  36,  34,  15,  18,  44, 168, 221, 198, 199, 100,
196, 197, 244,  46,  24,  53,  62,  31, 125,  38, 144,  50, 108,
220, 115, 219, 183,  35,  98, 172,  60,  59, 217, 180,  95,  92,
166,  72, 105, 209, 202, 211, 186, 212,  70, 110],
dtype='int64')```
```X_train_p = X_train_unique[p_values.index]
X_test_p = X_test_unique[p_values.index]
```

### Build the classifiers and compare the performance

Let's build the Random forest classifier and then get the predicted y value.

```def run_randomForest(X_train, X_test, y_train, y_test):
clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs = -1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
```

Let's calculate the accuracy and training time of trained dataset.

```%%time
run_randomForest(X_train_p, X_test_p, y_train, y_test)
```
```Accuracy:  0.953
Wall time: 814 ms
```

Let's calculate the accuracy and training time of original dataset.

```%%time
run_randomForest(X_train, X_test, y_train, y_test)
```
```Accuracy:  0.9585
Wall time: 1.49 s
```