Feature Selection Based on Mutual Information (Entropy) Gain for Classification and Regression | Machine Learning | KGP Talkie

Published by Srishailam Sri on

Feature Selection Based on Mutual Information (Entropy) Gain

Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH

What is Mutual Information

Feature selection aims to reduce the size of the input feature set while retaining the class-discriminatory information needed for classification problems.

Mutual information (MI) measures the amount of information shared between two random variables. It is symmetric and non-negative, and it is zero if and only if the variables are independent.
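For discrete variables, MI can be written as the sum over all value pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))). The snippet below is a minimal sketch of that formula in plain Python. The function name `mutual_information` and the two toy joint tables are illustrative assumptions, not part of the tutorial's dataset.

```python
import math

def mutual_information(joint):
    """MI in nats for a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # terms with p(x, y) = 0 contribute 0
                mi += pxy * math.log(pxy / (px[i] * py[j]))
    return mi

# Hypothetical joint tables for two binary variables.
independent = [[0.25, 0.25], [0.25, 0.25]]    # p(x, y) = p(x) * p(y) everywhere
dependent = [[0.5, 0.0], [0.0, 0.5]]          # Y is fully determined by X

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(mi_independent, mi_dependent)
```

The independent table gives an MI of zero, while the fully dependent one gives log(2) nats, the entropy of a fair coin.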


Finding the optimal subset of features is an NP-hard optimization problem. In practice we generally follow a greedy approach to feature selection, such as step-wise forward selection or step-wise backward elimination.
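As a rough illustration of the greedy idea, here is a minimal sketch of step-wise forward selection. It assumes an ordinary least-squares R^2 as the score and uses synthetic data. The helper names `r2_of_subset` and `forward_select` are hypothetical and are not used elsewhere in this tutorial.

```python
import numpy as np

def r2_of_subset(X, y, cols):
    """R^2 of a least-squares fit (with intercept) using only the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def forward_select(X, y, k):
    """Greedily add, one at a time, the feature that most improves the score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: r2_of_subset(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Synthetic data (an assumption for the demo): only columns 1 and 3 drive y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 1] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X_demo, y_demo, k=2)
print(chosen)
```

Each step evaluates only the remaining candidates, so the search costs on the order of k times the number of features instead of enumerating every subset. Backward elimination works the same way in reverse, starting from all features and dropping the least useful one at each step.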


Classification Problem

Dataset Available at: https://github.com/laxmimerit/Data-Files-for-Feature-Selection

Importing required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile

Let’s read the data into the variable data.

data = pd.read_csv('train.csv', nrows = 20000)
data.head()
5 rows × 371 columns

X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
((20000, 370), (20000,))

Let’s go ahead and split the dataset into training and test sets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
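The `stratify = y` argument keeps the class proportions the same in both splits, which matters when the target is imbalanced. The following sketch shows the effect on a made-up label vector with 10% positives. `X_demo` and `y_demo` are illustrative placeholders, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 100 positives out of 1000 samples.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratification, both splits keep the 10% positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```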

Remove constant, quasi constant, and duplicate features

Constant and quasi-constant features take the same value for all, or almost all, of the observations. Such features carry very little information and are not very useful for making predictions.

There is no fixed rule for choosing the threshold, but a common choice is to treat a feature as quasi-constant when about 99% of its values are identical. For a binary feature this corresponds to a variance of about 0.01, which is the threshold used below.
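To make the threshold concrete, the sketch below builds three hypothetical columns and checks which ones `VarianceThreshold(threshold=0.01)` keeps. A binary column with 1% ones has variance 0.01 * 0.99 = 0.0099, which falls just under the threshold. The toy array is an assumption for the demo.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 1000
quasi = np.zeros(n)
quasi[:10] = 1.0                        # 99% zeros, 1% ones -> variance 0.0099
varied = np.tile([0.0, 1.0], n // 2)    # half zeros, half ones -> variance 0.25
constant = np.full(n, 7.0)              # a single value -> variance 0

X_demo = np.column_stack([quasi, varied, constant])

vt = VarianceThreshold(threshold=0.01).fit(X_demo)
kept = vt.get_support()                 # boolean mask of the features that survive
print(vt.variances_, kept)
```

Only the middle column is kept. The filter has to be fitted before `transform` or `get_support` can be called.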

Let’s go ahead and remove the constant and quasi-constant features.

constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)  # learn the per-feature variances from the training set only
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)

Let’s transpose the training and test sets so that each feature becomes a row.

X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)

Let’s flag the duplicated rows of the transposed training set. Each duplicated row corresponds to a duplicated feature.

duplicated_features = X_train_T.duplicated()

Let’s get the non-duplicated features.

features_to_keep = [not index for index in duplicated_features]
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
X_train_unique.shape, X_test_unique.shape
((16000, 227), (4000, 227))

We can see that only 227 of the original 370 features remain.
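The transpose trick works because `DataFrame.duplicated()` compares rows, so transposing first makes it compare columns. Here is a minimal sketch on a tiny made-up frame. The column names `a`, `b`, and `c` are purely illustrative.

```python
import pandas as pd

# Hypothetical frame in which column "b" is an exact copy of column "a".
demo = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [0, 1, 0]})

# After transposing, each original column is a row, so duplicated()
# marks every column that repeats an earlier one.
dup_mask = demo.T.duplicated()

# Keep only the columns that are not marked as duplicates.
unique = demo.loc[:, ~dup_mask.to_numpy()]
unique_cols = list(unique.columns)
print(dup_mask.to_dict(), unique_cols)
```

Only the first occurrence of each set of identical columns is kept, so `a` survives and `b` is dropped.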

Calculate the MI

Let’s calculate the mutual information between each of the 227 features and the target.

mutual_info_classif( )

Mutual information measures the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.
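As a quick sanity check of that behaviour, the sketch below scores one informative and one pure-noise feature on synthetic data. The data and the names `informative` and `noise` are assumptions made for the demo. Because the scores are estimates, the noise feature may come out slightly above zero.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)                     # binary target

informative = y_demo + rng.normal(scale=0.3, size=500)    # depends on the target
noise = rng.normal(size=500)                              # independent of the target
X_demo = np.column_stack([informative, noise])

# random_state fixes the small random jitter the estimator adds internally.
scores = mutual_info_classif(X_demo, y_demo, random_state=0)
print(scores)
```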

mi = mutual_info_classif(X_train_unique, y_train) 
mi[: 10]
array([0.0025571 , 0.        , 0.01479401, 0.        , 0.        ,
       0.00133223, 0.        , 0.        , 0.00197431, 0.        ])
mi = pd.Series(mi)
mi.index = X_train_unique.columns

Let’s sort the values in the descending order.

mi.sort_values(ascending=False, inplace = True)

Let’s look at the mutual information of each feature in the following bar plot.

plt.title('Mutual information with respect to features')
mi.plot.bar(figsize = (16,5))

Let’s go ahead and work with percentiles. We will keep the top 10 percent of the features, ranked by mutual information. Let’s have a look at the following code.

sel = SelectPercentile(mutual_info_classif, percentile=10).fit(X_train_unique, y_train)
X_train_unique.columns[sel.get_support()]
Int64Index([  2,  22,  40,  49,  50,  51,  52,  61,  86,  91,  98, 100, 101,
            105, 119, 125, 127, 182, 187, 209, 210, 211, 212],

Let’s transform the training and testing dataset. Let’s have a look at the following code.

X_train_mi = sel.transform(X_train_unique)
X_test_mi = sel.transform(X_test_unique)
X_train_mi.shape
(16000, 23)
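The selection step can be reproduced on a small synthetic problem to see what `SelectPercentile` keeps. This is a sketch under assumed data: 20 features, of which only columns 0 and 1 depend on the target. Wrapping the score function in `functools.partial` to fix `random_state` is a choice made here for repeatability and is not something the tutorial does.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=500)
X_demo = rng.normal(size=(500, 20))
X_demo[:, 0] += 3 * y_demo      # informative feature
X_demo[:, 1] -= 3 * y_demo      # informative feature

score_func = partial(mutual_info_classif, random_state=0)
sel_demo = SelectPercentile(score_func, percentile=10).fit(X_demo, y_demo)

selected = np.flatnonzero(sel_demo.get_support())   # indices of the kept columns
X_small = sel_demo.transform(X_demo)
print(selected, X_small.shape)
```

With 20 features and `percentile=10`, two columns are kept, and here they are the two informative ones.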

Build the model and compare the performance

Let’s apply a random forest classifier with 100 estimators, and then predict the y values for the test set.

def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))

Now we will measure the accuracy and training time on the features selected by mutual information.

%%time
run_randomForest(X_train_mi, X_test_mi, y_train, y_test)
Accuracy on test set: 
Wall time: 1.14 s

For comparison, we will measure the accuracy and training time on the full original feature set.

%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy on test set: 
Wall time: 2.41 s
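The wall times above come from a notebook timing magic. Outside a notebook, a similar comparison can be sketched with `time.perf_counter`. The helper `timed_random_forest` and the synthetic data below are assumptions for illustration, and the measured time depends entirely on the machine.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def timed_random_forest(X_train, X_test, y_train, y_test):
    """Fit a random forest and return (test accuracy, elapsed seconds)."""
    start = time.perf_counter()
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    return acc, time.perf_counter() - start

# Hypothetical data: the label depends only on the sign of the first column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

acc, seconds = timed_random_forest(X_demo[:200], X_demo[200:], y_demo[:200], y_demo[200:])
print(f"accuracy={acc:.3f}, time={seconds:.2f}s")
```

Returning both values makes it easy to run the helper once on the reduced feature set and once on the full one, then compare the two results side by side.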

Mutual Information Gain in Regression

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

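Load the Boston dataset into the variable boston. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so this section requires an older scikit-learn release.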

boston = load_boston()
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
X = pd.DataFrame(data = boston.data, columns=boston.feature_names)
y = boston.target

Now, split the dataset into training and test sets with a test size of 0.2.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False, inplace = True)
mi
LSTAT      0.676729
RM         0.557777
INDUS      0.504754
PTRATIO    0.492141
NOX        0.445376
TAX        0.373128
CRIM       0.349371
AGE        0.347299
DIS        0.321057
RAD        0.203106
ZN         0.201467
B          0.152778
CHAS       0.008383
dtype: float64
plt.title('Mutual information with respect to features')
mi.plot.bar()
sel = SelectKBest(mutual_info_regression, k = 9).fit(X_train, y_train)
X_train.columns[sel.get_support()]
Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')
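One reason to rank regression features by mutual information instead of linear correlation is that MI also picks up non-linear dependence. The sketch below uses an assumed toy relationship, y = x^2 plus a little noise, where the Pearson correlation is close to zero but the MI estimate is clearly positive. The data is synthetic and unrelated to the Boston features.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y_demo = x ** 2 + rng.normal(scale=0.01, size=1000)    # symmetric, non-linear dependence

corr = np.corrcoef(x, y_demo)[0, 1]                    # linear correlation, near zero
mi_score = mutual_info_regression(x.reshape(-1, 1), y_demo, random_state=0)[0]
print(f"correlation={corr:.3f}, mutual information={mi_score:.3f}")
```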

Let’s fit a linear regression model on all the features and predict y for the test set.

model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)

Let’s calculate the RMS error value.

np.sqrt(mean_squared_error(y_test, y_predict))

Let’s get the standard deviation of y, which gives a baseline for judging the RMSE.

np.std(y)
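The standard deviation of y is a useful yardstick because it equals the RMSE of a model that always predicts the mean of y. A model is only helping if its RMSE is below that value. Here is a minimal sketch with made-up numbers. The `rmse` helper and the `y_true` values are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical target values.
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline "model": always predict the mean of y.
baseline_pred = np.full_like(y_true, y_true.mean())

baseline_rmse = rmse(y_true, baseline_pred)
std_y = float(np.std(y_true))
print(baseline_rmse, std_y)    # the two values coincide
```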


Let’s transform the training and test sets so that they keep only the 9 selected features, then fit the model again.

X_train_9 = sel.transform(X_train)
X_train_9.shape
(404, 9)
X_test_9 = sel.transform(X_test)
model = LinearRegression()
model.fit(X_train_9, y_train)
y_predict = model.predict(X_test_9)
r2_score(y_test, y_predict)

Let’s calculate the RMS error value.

np.sqrt(mean_squared_error(y_test, y_predict))
