# Feature Selection Based on Mutual Information (Entropy) Gain for Classification and Regression

## Feature Selection Based on Mutual Information (Entropy) Gain

### What is Mutual Information?

The elimination process aims to `reduce` the size of the input feature set while `retaining` the class-`discriminatory` information needed for classification problems.

Mutual information (MI) measures the amount of `information` shared between two `random variables`. It is symmetric and non-negative, and it is zero if and only if the variables are `independent`.
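Formally, MI can be written in terms of entropies; for discrete random variables X and Y:

```
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} = H(Y) - H(Y \mid X)
```

In words, MI is the reduction in uncertainty (entropy) about one variable after observing the other, which is why it is also called information gain.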

Finding the optimal feature subset is an NP-hard `optimization` problem, so in practice we settle for a greedy solution to `feature selection`. The usual greedy approaches are step-wise forward feature selection and step-wise backward feature elimination; a minimal sketch of the forward variant follows.
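To make the greedy idea concrete, here is a minimal sketch of step-wise forward selection. It is illustrative only: the synthetic dataset, the `LogisticRegression` scorer, and the stopping rule of five features are all assumptions, not part of this tutorial's pipeline.

```
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(5):  # stop after five features, purely for brevity
    # Score every candidate feature when added to the current subset
    scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                 X[:, selected + [f]], y, cv=3).mean()
              for f in remaining}
    best = max(scores, key=scores.get)  # greedily keep the best addition
    selected.append(best)
    remaining.remove(best)

print('Selected feature indices:', selected)
```

Backward elimination works the same way in reverse: start from all features and greedily drop the one whose removal hurts the score least.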

### Classification Problem

The dataset is available at: https://github.com/laxmimerit/Data-Files-for-Feature-Selection

Let's import the required libraries.

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

from sklearn.feature_selection import VarianceThreshold, mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile
```

Let's read the data into the variable `data`.

```
data = pd.read_csv('train.csv', nrows = 20000)
data.head()
```

`5 rows × 371 columns`

```
X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
```

`((20000, 370), (20000,))`

Let's go ahead and split the dataset into training and testing sets.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
```

### Remove constant, quasi constant, and duplicate features

These filters target features that are `constant` or `quasi constant`: in other words, features that have the same value for all or a `large subset` of the samples. Such features carry almost no information and are not very useful for making `predictions`.

There is no hard rule for fixing the threshold value, but a common choice is `99%` similarity and `1%` non-similarity. For example, a binary feature that takes one value in 99% of the samples has variance 0.01 × 0.99 ≈ 0.0099, just below a threshold of `0.01`.

Let's go ahead and see how many `quasi constant` features there are, and remove them with a `VarianceThreshold` filter.

```
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
```
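If you want to check how many features survive the filter, the fitted `VarianceThreshold` exposes `get_support()`, a boolean mask over the original columns. A quick illustrative check:

```
# Number of original features vs. features kept by the variance filter
print(X_train.shape[1], '->', constant_filter.get_support().sum())
```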

Let's transpose the training and testing datasets, so that duplicate features (columns) become duplicate rows that pandas can detect directly.

```
X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)
```

Let's get the total number of duplicated rows, i.e. duplicated features.

```
X_train_T.duplicated().sum()
```

`18`

```
duplicated_features = X_train_T.duplicated()
```

Let's get the non-duplicated features.

```
features_to_keep = [not index for index in duplicated_features]
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
X_train_unique.shape, X_test_unique.shape
```

`((16000, 227), (4000, 227))`

Now we can observe that, out of the original `370` features, only `227` remain.
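As an aside, the same deduplication can be written more compactly with pandas; a minimal equivalent sketch, assuming the transposed frames from above:

```
# Drop duplicate features (rows of the transposed frame), then transpose back
X_train_unique = X_train_T.drop_duplicates().T
X_test_unique = X_test_T.loc[~X_train_T.duplicated()].T
```

Note that the duplicates are identified on the training data only and the same rows are dropped from the test data, exactly as above.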

### Calculate the MI

Let's calculate the mutual information between each of the `227` features and the target.

#### mutual_info_classif()

Mutual information measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

```
mi = mutual_info_classif(X_train_unique, y_train)
len(mi)
```

`227`

`mi[:10]`

```
array([0.0025571 , 0.        , 0.01479401, 0.        , 0.        ,
       0.00133223, 0.        , 0.        , 0.00197431, 0.        ])
```
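Note that `mutual_info_classif` estimates MI with a nearest-neighbour method, so the values change slightly from run to run. Passing `random_state` makes them reproducible; a small illustrative tweak:

```
# Fix the random state for reproducible MI estimates
mi = mutual_info_classif(X_train_unique, y_train, random_state=0)
```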
```
mi = pd.Series(mi)
mi.index = X_train_unique.columns
```

Let's sort the values in descending order.

```
mi.sort_values(ascending=False, inplace = True)
```

Let's observe the mutual information of each feature in the following bar plot.

```
mi.plot.bar(figsize = (16,5))
plt.title('Mutual information with respect to features')
plt.show()
```

Let's go ahead and work with percentiles: we will keep the top 10 percent of features ranked by mutual information. Let's have a look at the following code.

```
sel = SelectPercentile(mutual_info_classif, percentile=10).fit(X_train_unique, y_train)
X_train_unique.columns[sel.get_support()]
```

```
Int64Index([  2,  22,  40,  49,  50,  51,  52,  61,  86,  91,  98, 100, 101,
            105, 119, 125, 127, 182, 187, 209, 210, 211, 212],
           dtype='int64')
```

```
len(X_train_unique.columns[sel.get_support()])
```

`23`

Ten percent of `227` features rounds to `23`, which matches the count above. Let's transform the training and testing datasets.

```
X_train_mi = sel.transform(X_train_unique)
X_test_mi = sel.transform(X_test_unique)
X_train_mi.shape
```

`(16000, 23)`
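If you prefer a fixed number of features instead of a percentile, `SelectKBest` (already imported above) works the same way. A brief illustrative sketch; the choice of `k = 20` here is arbitrary:

```
# Alternative: keep exactly the 20 highest-MI features
sel_k = SelectKBest(mutual_info_classif, k=20).fit(X_train_unique, y_train)
X_train_k = sel_k.transform(X_train_unique)
```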

### Build the model and compare the performance

Let's apply the `Random forest classifier` with the number of estimators equal to `100`, and then predict the y values on the testing dataset.

```
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))
```

Now we will measure the accuracy and training time on the `23` selected features.

```
%%time
run_randomForest(X_train_mi, X_test_mi, y_train, y_test)
```

```
Accuracy on test set:
0.95825
Wall time: 1.14 s
```

For comparison, let's measure the accuracy and training time on the full feature set.

```
%%time
run_randomForest(X_train, X_test, y_train, y_test)
```

```
Accuracy on test set:
0.9585
Wall time: 2.41 s
```
Let's quantify the speed-up using the wall times above.

```
round((2.41 - 1.14) * 100 / 2.41, 2)
```

`52.7`

So the selected `23` features train roughly `53%` faster while giving essentially the same accuracy (`0.95825` vs `0.9585`).

### Mutual Information Gain in Regression

```
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
```

Let's load the Boston housing dataset into the variable `boston`.

```
boston = load_boston()
print(boston.DESCR)
```
```
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
- CRIM     per capita crime rate by town
- ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS    proportion of non-retail business acres per town
- CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX      nitric oxides concentration (parts per 10 million)
- RM       average number of rooms per dwelling
- AGE      proportion of owner-occupied units built prior to 1940
- DIS      weighted distances to five Boston employment centres
- RAD      index of accessibility to radial highways
- TAX      full-value property-tax rate per $10,000
- PTRATIO  pupil-teacher ratio by town
- B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT    % lower status of the population
- MEDV     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.
```
```
X = pd.DataFrame(data = boston.data, columns=boston.feature_names)
y = boston.target
```

Now, split the dataset into training and testing sets with a test size of `0.2`.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```
```
mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False, inplace = True)
mi
```

```
LSTAT      0.676729
RM         0.557777
INDUS      0.504754
PTRATIO    0.492141
NOX        0.445376
TAX        0.373128
CRIM       0.349371
AGE        0.347299
DIS        0.321057
ZN         0.201467
B          0.152778
CHAS       0.008383
dtype: float64
```
```
mi.plot.bar()
plt.title('Mutual information with respect to features')
plt.show()
```
Let's keep the top `9` features ranked by mutual information with `SelectKBest`.

```
sel = SelectKBest(mutual_info_regression, k = 9).fit(X_train, y_train)
X_train.columns[sel.get_support()]
```

`Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')`

Let's first fit a `LinearRegression` model on all `13` features as a baseline and find the predicted values of y.

```
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)
```

`0.5892223849182507`

Let's calculate the root mean squared error (RMSE).

```
np.sqrt(mean_squared_error(y_test, y_predict))
```

`5.783509315085146`

Let's get the `standard deviation` of y. Comparing the RMSE with the spread of the target puts the error on a meaningful scale.

```
np.std(y)
```

`9.188011545278203`

Let's transform the datasets so they keep only the `9` selected features, and refit the model.

```
X_train_9 = sel.transform(X_train)
X_train_9.shape
```

`(404, 9)`

```
X_test_9 = sel.transform(X_test)
model = LinearRegression()
model.fit(X_train_9, y_train)
y_predict = model.predict(X_test_9)
r2_score(y_test, y_predict)
```

`0.5317127606961576`

Let's calculate the RMSE again.

```
print('rmse')
np.sqrt(mean_squared_error(y_test, y_predict))
```

```
rmse
6.175103151293747
```

With only `9` of the `13` features, the r2_score drops from `0.589` to `0.532` and the RMSE rises from `5.78` to `6.18`: a modest cost for a noticeably simpler model.
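Finally, the selection step and the model can be bundled into a single scikit-learn `Pipeline`, so the feature selection is refit automatically whenever the model is retrained (for example inside cross-validation). A minimal sketch using the objects already defined above:

```
from sklearn.pipeline import Pipeline

# Bundle MI-based selection and the regression model into one estimator
pipe = Pipeline([
    ('select', SelectKBest(mutual_info_regression, k=9)),
    ('model', LinearRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))  # r2 on the held-out test set
```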