Feature Selection with Mutual Information

Learn how to use mutual information (information gain) to select the most predictive features for classification and regression in Python with scikit-learn.

May 25, 2026 at 11:15 AM · 11 min read

Topics You Will Master

  • What mutual information is and why it detects non-linear relationships that correlation misses
  • How to compute mutual_info_classif and mutual_info_regression scores for your features
  • How to use SelectKBest and SelectPercentile to automatically keep only the top features
  • How to measure the speed and accuracy trade-off of using fewer features

Best For

Python developers and data scientists who understand basic supervised learning and want a fast, model-agnostic way to rank and trim features before training.

Expected Outcome

A working feature selection pipeline using mutual information for both a classification dataset (370 features reduced to 23) and a regression dataset (Boston housing), with a side-by-side accuracy comparison showing that the reduced set matches full-feature performance at half the training time.

Choosing which features to keep is one of the most impactful decisions in a machine learning project. Including too many features slows training, increases overfitting risk, and makes models harder to interpret. Mutual information is a filter-based method that scores each feature by how much information it shares with the target variable — before any model is trained.

Mutual information (MI) measures statistical dependence between two variables. Unlike Pearson correlation, which only captures straight-line (linear) relationships, mutual information can detect any kind of relationship — linear or not. A score of 0 means the feature and the target are completely independent; a higher score means the feature carries more information about the target.
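
The contrast with Pearson correlation can be illustrated on synthetic data. The sketch below is not part of the tutorial's datasets: it assumes a target that is exactly the square of a single feature, a relationship with no linear trend, and compares the two measures. The variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Illustrative data (an assumption, not one of the tutorial's datasets):
# y depends on x through a symmetric, purely non-linear relationship.
x = np.linspace(-1.0, 1.0, 1001)
y = x ** 2

# Pearson correlation is close to zero because there is no linear trend.
pearson = np.corrcoef(x, y)[0, 1]

# mutual_info_regression expects a 2-D feature matrix, hence the reshape.
mi_score = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson:.3f}")
print(f"Estimated mutual information: {mi_score:.3f}")
```

The correlation comes out at essentially zero while the mutual information estimate is clearly positive, which is the behavior described above.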

The formal definition of mutual information between two random variables $X$ and $Y$ is:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Where:

  • $I(X; Y)$: the mutual information score between variables $X$ and $Y$
  • $p(x, y)$: the joint probability of $x$ and $y$ occurring together
  • $p(x)$: the marginal probability of $x$ alone
  • $p(y)$: the marginal probability of $y$ alone

When $p(x, y) = p(x)\, p(y)$, the variables are independent and $I(X; Y) = 0$. The more their joint distribution diverges from independence, the larger $I(X; Y)$ becomes.
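
The definition can be checked numerically on a small joint probability table. The following is a minimal sketch that assumes discrete variables and the natural logarithm (so the result is in nats); the function name `mutual_information` is hypothetical and is not a scikit-learn API.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)  # marginal p(x) as a column vector
    py = joint.sum(axis=0, keepdims=True)  # marginal p(y) as a row vector
    product = px @ py                      # p(x) * p(y) for every cell
    nonzero = joint > 0                    # cells with p(x, y) = 0 contribute nothing
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) equals p(x) * p(y) everywhere
dependent = [[0.5, 0.0], [0.0, 0.5]]        # y is fully determined by x

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # log(2), about 0.693
```

The independent table gives a score of zero, and the fully dependent table gives log 2, the entropy of a fair binary variable.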

In this tutorial you will apply mutual information to a high-dimensional bank customer churn dataset (370 features) for a classification task and to the Boston Housing dataset (13 features) for a regression task. You will compare model accuracy and training speed before and after feature selection.

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn.

Classification Problem

We will demonstrate mutual information feature selection on a classification dataset. The dataset is available at github.com/laxmimerit/Data-Files-for-Feature-Selection.

Import the required libraries for data manipulation, visualization, and modeling:

PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Import Scikit-learn modules for modeling, metrics, and feature selection:

PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
PYTHON
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif, mutual_info_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile

Read the first 20,000 rows of the dataset:

PYTHON
data = pd.read_csv('train.csv', nrows = 20000)
data.head()
OUTPUT
   ID  var3  var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1  imp_op_var39_comer_ult3  imp_op_var40_comer_ult1  imp_op_var40_comer_ult3  imp_op_var40_efect_ult1  imp_op_var40_efect_ult3  ...  saldo_medio_var33_hace2  saldo_medio_var33_hace3  saldo_medio_var33_ult1  saldo_medio_var33_ult3  saldo_medio_var44_hace2  saldo_medio_var44_hace3  saldo_medio_var44_ult1  saldo_medio_var44_ult3          var38  TARGET
0   1     2     23                 0.0                      0.0                      0.0                      0.0                      0.0                        0                        0  ...                      0.0                      0.0                     0.0                     0.0                      0.0                      0.0                     0.0                     0.0   39205.170000       0
1   3     2     34                 0.0                      0.0                      0.0                      0.0                      0.0                        0                        0  ...                      0.0                      0.0                     0.0                     0.0                      0.0                      0.0                     0.0                     0.0   49278.030000       0
2   4     2     23                 0.0                      0.0                      0.0                      0.0                      0.0                        0                        0  ...                      0.0                      0.0                     0.0                     0.0                      0.0                      0.0                     0.0                     0.0   67333.770000       0
3   8     2     37                 0.0                    195.0                    195.0                      0.0                      0.0                        0                        0  ...                      0.0                      0.0                     0.0                     0.0                      0.0                      0.0                     0.0                     0.0   64007.970000       0
4  10     2     37                 0.0                      0.0                      0.0                      0.0                      0.0                        0                        0  ...                      0.0                      0.0                     0.0                     0.0                      0.0                      0.0                     0.0                     0.0  117310.979016       0

Separate the target variable and inspect the shape of the features:

PYTHON
X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
OUTPUT
((20000, 370), (20000,))

The dataset contains 20,000 rows and 370 feature columns. Split it into training and testing sets, stratifying on the target to preserve class balance:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)

Remove Constant, Quasi-Constant, and Duplicate Features

Before calculating mutual information, it is important to remove features that carry no information at all. A constant feature has the same value for every row. A quasi-constant feature has almost the same value (for example, 99% of rows are zero). A duplicate feature is an exact copy of another column. None of these add predictive power, and keeping them wastes computation.

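Use VarianceThreshold with a threshold of 0.01 to remove constant and quasi-constant features, that is, columns whose variance does not exceed 0.01. For a binary column this corresponds roughly to 99% of the rows sharing one value: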

PYTHON
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
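
To see what a variance threshold of 0.01 means in practice, the sketch below applies VarianceThreshold to three made-up binary columns. The toy data and variable names are assumptions for illustration; the point is that a column with 1% ones has variance 0.01 × 0.99 = 0.0099, just under the threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 1000
constant = np.zeros(n)                                  # variance 0
quasi_constant = np.r_[np.ones(10), np.zeros(n - 10)]   # 1% ones, variance 0.0099
balanced = np.r_[np.ones(n // 2), np.zeros(n // 2)]     # 50% ones, variance 0.25

X_toy = np.column_stack([constant, quasi_constant, balanced])

selector = VarianceThreshold(threshold=0.01).fit(X_toy)
print(selector.variances_)     # per-column variances
print(selector.get_support())  # True marks columns that are kept
```

Only the balanced column survives; both the constant and the quasi-constant columns are dropped.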

Transpose the filtered datasets so that each feature becomes a row, then wrap them in a DataFrame — this makes it easy to use duplicated() to spot identical features:

PYTHON
X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)

Count the number of duplicated rows (features) in the training dataset:

PYTHON
X_train_T.duplicated().sum()
OUTPUT
18

There are 18 duplicate features. Identify them and keep only the unique ones:

PYTHON
duplicated_features = X_train_T.duplicated()
PYTHON
# Invert the mask so that True marks the first occurrence of each distinct feature
features_to_keep = ~duplicated_features
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
X_train_unique.shape, X_test_unique.shape
OUTPUT
((16000, 227), (4000, 227))
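
Because the filtered arrays are re-wrapped in plain DataFrames, the original column names are replaced by integer positions. If you want to keep the names, one possible variant is to apply the same transpose-and-duplicated() idea directly to a named DataFrame. The sketch below uses a made-up four-column frame; the column names are hypothetical.

```python
import pandas as pd

X_toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "a_copy": [1, 2, 3, 4],  # exact duplicate of "a"
    "c": [5, 5, 6, 6],
})

# Transpose so each feature is a row, then flag rows identical to an earlier row.
is_duplicate = X_toy.T.duplicated()

# Boolean-mask the columns; the original column names survive.
X_unique = X_toy.loc[:, ~is_duplicate.values]
print(list(X_unique.columns))  # ['a', 'b', 'c']
```

The same mask can then be applied to the test set so both sets keep identical columns.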

After removing constant, quasi-constant, and duplicate features, the feature space drops from 370 to 227 columns. This is a clean input for mutual information scoring.

Calculate the Mutual Information

With the cleaned feature set, compute mutual information scores for all 227 features against the target using mutual_info_classif. This function estimates how much information each feature shares with the class label:

PYTHON
mi = mutual_info_classif(X_train_unique, y_train)
len(mi)
OUTPUT
227

Inspect the first 10 mutual information scores to get a sense of the range:

PYTHON
mi[: 10]
OUTPUT
array([0.0025571 , 0.        , 0.01479401, 0.        , 0.        ,
       0.00133223, 0.        , 0.        , 0.00197431, 0.        ])
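
The scikit-learn estimator is based on nearest neighbors and adds a small amount of random noise to continuous features, so scores can differ slightly from run to run. Passing random_state makes them reproducible. The sketch below demonstrates this on synthetic stand-in data generated with make_classification (an assumption; the tutorial's train.csv is not used here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data, not the tutorial's dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=0
)

# Fixing random_state makes repeated calls return identical scores.
scores_a = mutual_info_classif(X_demo, y_demo, random_state=0)
scores_b = mutual_info_classif(X_demo, y_demo, random_state=0)

print(np.array_equal(scores_a, scores_b))  # True
print(bool((scores_a >= 0).all()))         # True: negative estimates are clipped to 0
```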

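Several features score exactly 0. Because the estimate is computed from a finite sample and negative estimates are clipped to zero, a score of 0 means no dependence on the target was detected, not proof of independence. Convert the scores to a Pandas Series, align them with the column names, and sort in descending order: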

PYTHON
mi = pd.Series(mi)
mi.index = X_train_unique.columns
PYTHON
mi.sort_values(ascending=False, inplace = True)

The bar plot below shows that only a small number of features carry a high mutual information score — most features cluster near zero:

Bar plot showing sorted mutual information scores for 227 features in classification dataset, titled "Mutual information gain in classification with respect to features"
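
A plot of this kind can be produced with a few lines of Matplotlib. The following is a sketch rather than the code behind the figure: the helper name `plot_mi_scores`, the figure size, and the demo scores are all assumptions. In the tutorial's workflow you would pass the sorted `mi` Series instead of `demo_scores`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_mi_scores(scores, title):
    """Draw mutual information scores as a sorted bar chart and return the Axes."""
    ordered = scores.sort_values(ascending=False)
    fig, ax = plt.subplots(figsize=(16, 5))
    ax.bar(range(len(ordered)), ordered.values)
    ax.set_xlabel("Feature rank")
    ax.set_ylabel("Mutual information")
    ax.set_title(title)
    return ax

# Hypothetical scores for illustration only.
demo_scores = pd.Series([0.015, 0.0, 0.003, 0.001], index=["f0", "f1", "f2", "f3"])
ax = plot_mi_scores(
    demo_scores,
    "Mutual information gain in classification with respect to features",
)
```

In a notebook the figure renders automatically; in a script, call plt.show() or ax.figure.savefig(...) afterwards.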

This long-tail distribution is typical: a handful of features drive most of the predictive signal. Use SelectPercentile to keep the top 10% of features automatically:

PYTHON
sel = SelectPercentile(mutual_info_classif, percentile=10).fit(X_train_unique, y_train)
X_train_unique.columns[sel.get_support()]
OUTPUT
Int64Index([  2,  22,  40,  49,  50,  51,  52,  61,  86,  91,  98, 100, 101,
            105, 119, 125, 127, 182, 187, 209, 210, 211, 212],
           dtype='int64')

Count the total number of features kept:

PYTHON
len(X_train_unique.columns[sel.get_support()])
OUTPUT
23

The top 10% corresponds to 23 features. Transform both datasets to keep only these:

PYTHON
X_train_mi = sel.transform(X_train_unique)
X_test_mi = sel.transform(X_test_unique)
X_train_mi.shape
OUTPUT
(16000, 23)

Build the Model and Compare Performance

To confirm that 23 features are enough, train a RandomForestClassifier on both the reduced set and the full original dataset, then compare accuracy and wall time.

Define a helper function that trains the classifier and prints accuracy:

PYTHON
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ')
    print(accuracy_score(y_test, y_pred))

Train and evaluate the model using only the 23 mutual-information-selected features:

PYTHON
%%time
run_randomForest(X_train_mi, X_test_mi, y_train, y_test)
OUTPUT
Accuracy on test set: 
0.95825
Wall time: 1.14 s

Train and evaluate the model using all 370 original features:

PYTHON
%%time
run_randomForest(X_train, X_test, y_train, y_test)
OUTPUT
Accuracy on test set: 
0.9585
Wall time: 2.41 s
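
The %%time magic only works in Jupyter or IPython. Outside a notebook, a similar measurement can be taken with time.perf_counter. The helpers below are a minimal sketch with hypothetical names (`timed`, `percent_decrease`); the stand-in workload would be replaced by a call to run_randomForest.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

def percent_decrease(baseline_seconds, reduced_seconds):
    """Percentage by which reduced_seconds is lower than baseline_seconds."""
    return (baseline_seconds - reduced_seconds) * 100 / baseline_seconds

# Stand-in workload; in the tutorial this would be run_randomForest(...).
_, elapsed = timed(sum, range(1_000_000))
print(f"Elapsed: {elapsed:.4f} s")

# Applied to the two wall times reported above (2.41 s and 1.14 s).
print(round(percent_decrease(2.41, 1.14), 1))  # 52.7
```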

Calculate the percentage decrease in training time when using the selected features, based on the two wall times reported above:

PLAINTEXT
round((2.41 - 1.14) * 100 / 2.41, 1)
PLAINTEXT
52.7

The reduced model (23 features) scores 95.83% accuracy versus 95.85% for the full model, a negligible difference, yet training time drops by roughly 53% (from 2.41 s to 1.14 s). Wall times vary between runs and machines, so treat the exact figure as indicative. This is the core value of mutual information filtering: near-identical performance at a fraction of the computational cost.
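
The filtering and modeling steps can also be chained in a scikit-learn Pipeline, so that feature selection is always fitted on the training data only. This is a sketch under stated assumptions, not the tutorial's own workflow: it uses synthetic data from make_classification, skips the duplicate-removal step, and wraps mutual_info_classif in functools.partial to fix random_state.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, VarianceThreshold, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the tutorial's dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=40, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=0.01)),
    # partial() fixes random_state so the MI scores are reproducible.
    ("mi", SelectPercentile(partial(mutual_info_classif, random_state=0), percentile=10)),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)),
])
pipe.fit(X_tr, y_tr)

n_kept = int(pipe.named_steps["mi"].get_support().sum())
accuracy = pipe.score(X_te, y_te)
print(f"Features kept: {n_kept}")
print(f"Test accuracy: {accuracy:.3f}")
```

A pipeline like this can be passed to cross_val_score or GridSearchCV without leaking test-fold information into the selection step.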

Mutual Information for Regression

Mutual information also works for regression tasks where the target is a continuous number rather than a class label. mutual_info_regression estimates the information each feature shares with a continuous target. We will demonstrate this on the Boston Housing dataset.

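Import the required libraries for regression datasets and metrics. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older scikit-learn release: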

PYTHON
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

Load the Boston housing dataset and print its description:

PYTHON
boston = load_boston()
print(boston.DESCR)
OUTPUT
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.
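
On scikit-learn 1.2 or later, load_boston is no longer available. One possible workaround, following the approach suggested in scikit-learn's deprecation notice, is to rebuild the arrays from the raw StatLib file. The sketch below assumes that URL is still reachable and requires network access; the names `BOSTON_FEATURES` and `load_boston_from_source` are hypothetical.

```python
import numpy as np
import pandas as pd

# Column names as listed in the dataset description above.
BOSTON_FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                   "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

def load_boston_from_source(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Rebuild (X, y) from the raw StatLib file; needs network access."""
    # Each record spans two physical lines after a 22-line text header.
    raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    data = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
    target = raw.values[1::2, 2]
    return pd.DataFrame(data, columns=BOSTON_FEATURES), target
```

Calling `X, y = load_boston_from_source()` would then stand in for the `boston.data` and `boston.target` attributes used in the rest of this section.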

Create a Pandas DataFrame from the features:

PYTHON
X = pd.DataFrame(data = boston.data, columns=boston.feature_names)
X.head()
OUTPUT
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

Assign the target variable (median home value) and split into training and test sets:

PYTHON
y = boston.target
PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Ranking Features by Mutual Information

Calculate mutual information scores for each of the 13 features against the continuous target, then sort them from most informative to least:

PYTHON
mi = mutual_info_regression(X_train, y_train)
mi = pd.Series(mi)
mi.index = X_train.columns
mi.sort_values(ascending=False, inplace = True)
mi
OUTPUT
LSTAT      0.676729
RM         0.557777
INDUS      0.504754
PTRATIO    0.492141
NOX        0.445376
TAX        0.373128
CRIM       0.349371
AGE        0.347299
DIS        0.321057
RAD        0.203106
ZN         0.201467
B          0.152778
CHAS       0.008383
dtype: float64

LSTAT (percentage of lower-status population) and RM (average rooms per dwelling) rank highest, followed closely by INDUS and PTRATIO, while CHAS (a river-adjacency dummy variable) scores nearly zero. The bar plot below makes this ranking easy to read at a glance:

Bar plot showing sorted mutual information scores for Boston housing dataset features in regression, titled "Mutual information gain in regression with respect to features"

Selecting the Top Features

Use SelectKBest with k=9 to keep the nine most informative features:

PYTHON
sel = SelectKBest(mutual_info_regression, k = 9).fit(X_train, y_train)
X_train.columns[sel.get_support()]
OUTPUT
Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')
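
A fitted selector also exposes the per-feature scores through its scores_ attribute, which makes it easy to list what was kept and what was dropped. The sketch below shows this on synthetic regression data with hypothetical column names f0 to f5; it is an illustration, not a rerun of the Boston example.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-in data with hypothetical column names f0..f5.
X_arr, y_demo = make_regression(n_samples=300, n_features=6, n_informative=3,
                                noise=5.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# partial() fixes random_state so the selection is reproducible.
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func, k=3).fit(X_demo, y_demo)

mask = selector.get_support()
scores = pd.Series(selector.scores_, index=X_demo.columns)

print("Kept:   ", list(X_demo.columns[mask]))
print("Dropped:", list(X_demo.columns[~mask]))
print(scores.sort_values(ascending=False))
```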

Comparing Full vs. Reduced Feature Set

Fit a linear regression model on the full 13-feature dataset and compute the $R^2$ score, a measure of how much variance in the target the model explains (1.0 is perfect):

PYTHON
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)
OUTPUT
0.5892223849182507

Calculate the Root Mean Squared Error (RMSE) for the full-feature model. RMSE is in the same units as the target (thousands of dollars), so lower is better:

PYTHON
np.sqrt(mean_squared_error(y_test, y_predict))
OUTPUT
5.783509315085146
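
Both metrics can be written out by hand, which may help when interpreting them. The sketch below uses four made-up target values and predictions (assumptions for illustration) and computes the coefficient of determination as 1 minus the ratio of residual to total sum of squares, and RMSE as the square root of the mean squared error.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 2.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20.0

r2 = 1 - ss_res / ss_tot                         # 0.9
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # sqrt(0.5), about 0.707

print(r2, rmse)
```

These are the same quantities that r2_score and np.sqrt(mean_squared_error(...)) return.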

For context, check the standard deviation of the target variable to understand the scale of the error:

PYTHON
np.std(y)
OUTPUT
9.188011545278203

The RMSE of 5.78 is well below the target standard deviation of 9.19, which is a reasonable result for a simple linear model. Now transform the datasets to keep only the 9 selected features and repeat the evaluation:

PYTHON
X_train_9 = sel.transform(X_train)
X_train_9.shape
OUTPUT
(404, 9)
PYTHON
X_test_9 = sel.transform(X_test)
model = LinearRegression()
model.fit(X_train_9, y_train)
y_predict = model.predict(X_test_9)
print('r2_score')
r2_score(y_test, y_predict)
OUTPUT
r2_score
0.5317127606961576

Calculate the RMSE on the reduced dataset:

PYTHON
print('rmse')
np.sqrt(mean_squared_error(y_test, y_predict))
OUTPUT
rmse
6.175103151293747

The 9-feature model yields an $R^2$ of 0.53 versus 0.59 for the full 13-feature model, and RMSE rises from 5.78 to 6.18, so here the reduction costs some accuracy. For a dataset with only 13 features to begin with, the trade-off is less favorable than in the classification example, but SelectKBest still removes the four lowest-scoring features (RAD, ZN, B, and CHAS) without any model training.
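
Rather than fixing k by hand, the number of features can be treated as a hyperparameter and chosen by cross-validation. The following is a sketch on synthetic data from make_regression (an assumption, not the Boston data), with an arbitrary grid of candidate k values.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; 13 features to mirror the size of the tutorial's dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=13, n_informative=6,
                                 noise=10.0, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(partial(mutual_info_regression, random_state=0))),
    ("model", LinearRegression()),
])

# Try several values of k; each is scored by 5-fold cross-validated R^2.
candidate_k = [3, 6, 9, 13]
search = GridSearchCV(pipe, {"select__k": candidate_k}, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the selector sits inside the pipeline, it is refitted on each training fold, which keeps the cross-validated score honest.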

Conclusion

In this tutorial you applied mutual information as a filter-based feature selection method to both a classification task (370 features, churn dataset) and a regression task (13 features, Boston Housing). By measuring the entropy shared between each feature and the target, you ranked all features without training any model, then selected the most informative subset using SelectPercentile and SelectKBest.

Key takeaways:

  • Mutual information captures both linear and non-linear relationships, making it more powerful than simple correlation for feature ranking.
  • In the classification experiment, keeping only 23 out of 370 features cut training time roughly in half (2.41 s to 1.14 s wall time) with no meaningful drop in accuracy (95.83% vs 95.85%).
  • mutual_info_classif is used for discrete targets; mutual_info_regression is used for continuous targets — the selection API (SelectKBest, SelectPercentile) is identical for both.
  • Always remove constant, quasi-constant, and duplicate features before computing MI scores — they inflate the feature count without contributing information.
  • MI as used here is a univariate filter method: it scores each feature independently and does not account for interactions or redundancy between features.
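
The limitation around feature interactions can be made concrete with an XOR target, where neither feature is informative alone but the pair determines the target completely. The sketch below uses made-up binary data and sklearn.metrics.mutual_info_score, which computes mutual information between two discrete label arrays.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Every combination of two binary features, repeated so each is equally likely.
x1 = np.tile([0, 0, 1, 1], 50)
x2 = np.tile([0, 1, 0, 1], 50)
y = x1 ^ x2  # XOR: the target depends on both features jointly

mi_x1 = mutual_info_score(x1, y)              # x1 alone
mi_x2 = mutual_info_score(x2, y)              # x2 alone
mi_joint = mutual_info_score(2 * x1 + x2, y)  # the pair encoded as one variable

print(mi_x1, mi_x2, mi_joint)
```

Each feature scores zero on its own, so a univariate filter would discard both, while the joint score equals log 2 (about 0.693). Wrapper or embedded methods are better suited to cases like this.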
