
Feature Selection with ROC-AUC and MSE

Learn how to select features using ROC-AUC for classification and Mean Squared Error for regression. Score every feature individually, rank them, and keep only the most predictive ones.

May 21, 2026 at 4:30 PM · 14 min read

Topics You Will Master

What ROC-AUC is and why it measures classification quality for a single feature
What Mean Squared Error (MSE) is and how it signals a feature's value in regression
How to loop over features, fit a simple model on each one, and record a score
How to rank features by their individual score and discard the weakest ones
How to compare a reduced feature set against the full set to confirm the trade-off

Best For

Python developers and data scientists who understand supervised learning basics and want a practical, metric-driven way to filter features before training a full model.

Expected Outcome

A working feature selection pipeline that reduces 370 classification features to 11 using ROC-AUC, and identifies the 2 strongest regression features using MSE — with accuracy and speed comparisons that confirm the value of the reduction.

When you have hundreds of features in a dataset, many of them carry little or no useful information. Training a model on all of them wastes time and can hurt performance. Univariate feature selection is a simple remedy: you score each feature on its own, rank them, and keep only the ones that score well.

This tutorial covers two scoring methods. For binary classification — predicting one of two outcomes — we score each feature by its ROC-AUC (Receiver Operating Characteristic — Area Under the Curve). A score above 0.5 means the feature carries signal; exactly 0.5 means it is no better than a random guess. For regression — predicting a continuous number — we score each feature by its Mean Squared Error (MSE): the lower the error, the stronger the feature's individual relationship with the target.

You will work through two complete examples: a 370-feature bank classification dataset and the 13-feature Boston Housing dataset. By the end you will have reduced both datasets to their most informative features and confirmed that the smaller sets retain nearly all predictive power.

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Matplotlib, Seaborn.

Understanding ROC-AUC

The ROC curve plots the True Positive Rate (the fraction of actual positives that the model correctly identifies) against the False Positive Rate (the fraction of actual negatives that the model incorrectly flags as positive) at every possible classification threshold.
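
As a minimal sketch of these two definitions, the snippet below computes the True Positive Rate and False Positive Rate at a single threshold. The labels, scores, and the 0.5 threshold are hypothetical values invented for illustration; they are not taken from the tutorial's dataset.

```python
import numpy as np

# Hypothetical ground-truth labels and model scores (illustration only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.6, 0.8, 0.3, 0.2, 0.1, 0.7, 0.4])

threshold = 0.5  # assumed cut-off; the ROC curve repeats this for every threshold
y_hat = (y_score >= threshold).astype(int)

tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # positives correctly flagged
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # positives missed
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # negatives wrongly flagged
tn = int(np.sum((y_hat == 0) & (y_true == 0)))  # negatives correctly rejected

tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr, fpr)  # 0.75 0.25
```

Each threshold yields one (FPR, TPR) point; sweeping the threshold from high to low traces out the curve.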

The plot below shows a typical ROC curve. The blue line is the model; the green diagonal is a random classifier with AUC = 0.5. The further the blue line bows toward the top-left corner, the better the model:

[Figure: ROC curve chart showing Sensitivity (True Positive Rate) on the y-axis and 1 minus Specificity (False Positive Rate) on the x-axis, with the model curve bowing above the diagonal reference line]

The AUC (Area Under the Curve) summarises the entire curve in a single number:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(t)\, d\,\mathrm{FPR}(t)$$

Where:

  • $\mathrm{AUC}$: a value between 0 and 1; 1 is a perfect classifier, 0.5 is random, below 0.5 means the model is worse than random
  • $\mathrm{TPR}(t)$: True Positive Rate at decision threshold $t$, the fraction of genuine positives correctly predicted
  • $\mathrm{FPR}(t)$: False Positive Rate at threshold $t$, the fraction of genuine negatives incorrectly predicted as positive
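
To make the "area under the curve" idea concrete, here is a small sketch that approximates the area from the ROC points with the trapezoidal rule and compares it with scikit-learn's `roc_auc_score`. The toy labels and scores are hypothetical; only `roc_curve` and `roc_auc_score` are real scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical toy labels and scores, invented for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.55, 0.90])

# roc_curve sweeps the threshold and returns matching FPR/TPR points.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Trapezoidal rule: width of each FPR step times the average TPR over it.
auc_manual = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

auc_sklearn = roc_auc_score(y_true, y_score)
print(auc_manual, auc_sklearn)
```

The two numbers should agree up to floating-point error, which is a quick way to convince yourself that the score really is the area under the plotted curve.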

In this tutorial, we fit a Random Forest on a single feature at a time and record the resulting AUC as that feature's score. Features that score above 0.5 are kept; the rest are discarded.

Use of ROC-AUC in Classification Feature Selection

Import the required libraries for data structures, visualization, and calculation:

PYTHON
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Import Scikit-learn packages for validation, metrics, and models:

PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.feature_selection import VarianceThreshold

The dataset used in this section is available at laxmimerit/Data-Files-for-Feature-Selection on GitHub.

Load the classification dataset and inspect its first rows:

PYTHON
data = pd.read_csv('train.csv', nrows = 20000)
data.head()
OUTPUT
   ID  var3  var15  imp_ent_var16_ult1  imp_op_var39_comer_ult1  imp_op_var39_comer_ult3  \
0   1     2     23                 0.0                      0.0                      0.0
1   3     2     34                 0.0                      0.0                      0.0
2   4     2     23                 0.0                      0.0                      0.0
3   8     2     37                 0.0                    195.0                    195.0
4  10     2     37                 0.0                      0.0                      0.0

   imp_op_var40_comer_ult1  imp_op_var40_comer_ult3  imp_op_var40_efect_ult1  imp_op_var40_efect_ult3  ...  \
0                      0.0                      0.0                        0                        0  ...
1                      0.0                      0.0                        0                        0  ...
2                      0.0                      0.0                        0                        0  ...
3                      0.0                      0.0                        0                        0  ...
4                      0.0                      0.0                        0                        0  ...

   saldo_medio_var33_hace2  saldo_medio_var33_hace3  saldo_medio_var33_ult1  saldo_medio_var33_ult3  \
0                      0.0                      0.0                     0.0                     0.0
1                      0.0                      0.0                     0.0                     0.0
2                      0.0                      0.0                     0.0                     0.0
3                      0.0                      0.0                     0.0                     0.0
4                      0.0                      0.0                     0.0                     0.0

   saldo_medio_var44_hace2  saldo_medio_var44_hace3  saldo_medio_var44_ult1  saldo_medio_var44_ult3          var38  TARGET
0                      0.0                      0.0                     0.0                     0.0   39205.170000       0
1                      0.0                      0.0                     0.0                     0.0   49278.030000       0
2                      0.0                      0.0                     0.0                     0.0   67333.770000       0
3                      0.0                      0.0                     0.0                     0.0   64007.970000       0
4                      0.0                      0.0                     0.0                     0.0  117310.979016       0

Separate features from the target column and confirm the dataset dimensions:

PYTHON
X = data.drop('TARGET', axis = 1)
y = data['TARGET']
X.shape, y.shape
OUTPUT
((20000, 370), (20000,))

Split the data into 80% training and 20% test sets, stratifying on class labels so that both splits preserve the same class balance:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
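
To illustrate what `stratify` does, the sketch below splits a small, deliberately imbalanced label vector and compares the positive rate in each split. The arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratification both splits keep the original 10% positive rate.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which makes per-feature scores harder to compare.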

Remove Constant, Quasi-Constant, and Duplicate Features

Before scoring features by ROC-AUC, remove any features that carry little or no variance, since a column that barely changes cannot be a useful predictor.

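Apply a variance threshold of 0.01 to filter out constant and quasi-constant features. VarianceThreshold(threshold=0.01) drops every feature whose training-set variance does not exceed 0.01. For a 0/1 feature this corresponds to roughly 99% of values being identical; for features measured on larger scales the cutoff depends on the units: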

PYTHON
#remove constant and quasi constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(X_train)
X_train_filter = constant_filter.transform(X_train)
X_test_filter = constant_filter.transform(X_test)
X_train_filter.shape, X_test_filter.shape
OUTPUT
((16000, 245), (4000, 245))
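
As a rough illustration of how the variance filter behaves, the sketch below applies the same `threshold=0.01` to a hypothetical three-column frame. The column names and values are made up for the example.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: 200 rows, three columns with different variance.
toy = pd.DataFrame({
    "constant": [1] * 200,              # variance 0
    "quasi_constant": [0] * 199 + [1],  # 99.5% zeros, variance about 0.005
    "balanced": [0, 1] * 100,           # variance 0.25
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

# get_support() returns a boolean mask of the columns that survive.
kept = toy.columns[selector.get_support()].tolist()
print(kept)  # ['balanced']
```

Using `get_support()` on the original column index is also a simple way to keep feature names, which `transform` discards when it returns a NumPy array.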

Check how many features were removed by the variance filter:

PYTHON
370-245
OUTPUT
125

The variance filter removed 125 features. The remaining 245 still include duplicates — pairs of columns that are byte-for-byte identical and therefore add no new information.

For more details on constant and duplicate removal, see Constant, Quasi-Constant, and Duplicate Feature Removal.

Transpose the filtered datasets so that each feature becomes a row, making it straightforward to identify and drop duplicate rows:

PYTHON
#remove duplicate features
X_train_T = X_train_filter.T
X_test_T = X_test_filter.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)

Count the number of duplicate feature rows in the training set:

PYTHON
X_train_T.duplicated().sum()
OUTPUT
18

Mark which rows (features) are duplicates:

PYTHON
duplicated_features = X_train_T.duplicated()

Keep only the unique features and transpose back to the original shape — rows as samples, columns as features:

PYTHON
# Boolean mask over features: True for first occurrences, False for repeats
features_to_keep = ~duplicated_features
X_train_unique = X_train_T[features_to_keep].T
X_test_unique = X_test_T[features_to_keep].T
X_train_unique.shape, X_train.shape
OUTPUT
((16000, 227), (16000, 370))
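
The transpose-and-deduplicate idea can be seen on a tiny example. The sketch below uses a hypothetical frame with one copied column and, as a variation on the steps above, filters the original columns directly so that the column names are preserved.

```python
import pandas as pd

# Hypothetical toy frame; "a_copy" is an exact duplicate of column "a".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "a_copy": [1, 2, 3, 4],
})

# After transposing, each original column is a row, so duplicated() flags
# every repeat of an earlier column (the first occurrence is kept).
is_duplicate = toy.T.duplicated()

# Select the non-duplicated columns from the original frame by position.
unique_cols = toy.loc[:, ~is_duplicate.to_numpy()]
print(unique_cols.columns.tolist())  # ['a', 'b']
```

Transposing a wide frame copies the data, so on very large datasets this step can be memory-hungry; that trade-off is worth keeping in mind.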

After removing constant, quasi-constant, and duplicate features, 227 unique features remain, down from the original 370. Whether they are informative is what the next step measures.

Calculate ROC-AUC Score per Feature

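With the reduced feature set ready, score each remaining feature individually. The loop below fits a RandomForestClassifier on a single feature at a time, predicts on the test set, and records the ROC-AUC score. Note that it passes the hard class labels returned by predict to roc_auc_score rather than class probabilities, so each score reflects a single operating point instead of the full ROC curve; this is one reason the values below sit so close to 0.5: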

PYTHON
roc_auc = []
for feature in X_train_unique.columns:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train_unique[feature].to_frame(), y_train)
    y_pred = clf.predict(X_test_unique[feature].to_frame())
    roc_auc.append(roc_auc_score(y_test, y_pred))
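
The loop above scores hard 0/1 predictions. `roc_auc_score` is usually given continuous scores, so a common variation is to pass the positive-class probability from `predict_proba` instead. The sketch below shows that variation on synthetic stand-in data (the `signal` and `noise_*` columns are assumptions made up for the example, not columns of `train.csv`), and it is not the code that produced the outputs in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one informative column and two pure-noise columns.
rng = np.random.default_rng(0)
signal = rng.normal(size=600)
X_demo = pd.DataFrame({
    "signal": signal,
    "noise_a": rng.normal(size=600),
    "noise_b": rng.normal(size=600),
})
y_demo = (signal + 0.5 * rng.normal(size=600) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

proba_auc = {}
for feature in X_tr.columns:
    clf = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
    clf.fit(X_tr[[feature]], y_tr)
    # Column 1 of predict_proba is the estimated probability of class 1.
    y_score = clf.predict_proba(X_te[[feature]])[:, 1]
    proba_auc[feature] = roc_auc_score(y_te, y_score)

proba_auc = pd.Series(proba_auc).sort_values(ascending=False)
print(proba_auc)
```

With probabilities the ranking uses the whole ROC curve, so informative features tend to separate more clearly from 0.5 than they do with hard labels. The shallow `max_depth=3` is an assumption chosen to keep single-feature trees from memorising the training data.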

Print the full list of ROC-AUC scores — one value per feature:

PYTHON
print(roc_auc)
OUTPUT
[0.5020561820568537, 0.5, 0.5, 0.49986968986187125, 0.501373452866903, 0.49569976544175137, 0.5028068643863192, 0.49986968986187125, 0.5, 0.5, 0.4997393797237425, 0.5017643832812891, 0.49569976544175137, 0.49960906958561374, 0.49895751889497003, 0.49700286682303885, 0.49960906958561374, 0.5021553136956755, 0.4968725566849101, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.49986968986187125, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5029371745244479, 0.4959603857180089, 0.5, 0.5048318679438659, 0.4997393797237425, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.49921813917122754, 0.49921813917122754, 0.49824600955181303, 0.5, 0.5, 0.5, 0.4990878290330988, 0.4983763196899418, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5025462441100617, 0.4990878290330988, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.49986968986187125, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.4997393797237425, 0.5, 0.5, 0.49986968986187125, 0.4991581805187143, 0.4988272087568413, 0.49674224654678134, 0.4995491109331005, 0.5, 0.5, 0.5022856238338043, 0.5012431427287742, 0.5, 0.5, 0.5, 0.49986968986187125, 0.5, 0.4997393797237425, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5076595179963898]

Convert the list to a Pandas Series, align it with the feature column names, and sort from highest to lowest score:

PYTHON
roc_values = pd.Series(roc_auc)
roc_values.index = X_train_unique.columns
roc_values.sort_values(ascending =False, inplace = True)

Inspect the sorted ROC-AUC values:

PYTHON
roc_values
OUTPUT
244    0.507660
107    0.504832
104    0.502937
6      0.502807
155    0.502546
         ...   
18     0.496873
211    0.496742
105    0.495960
12     0.495700
5      0.495700
Length: 227, dtype: float64

Features with a score of 0.5 or below offer no predictive value above random chance. The bar plot below confirms that the vast majority of features cluster right at 0.5, with only a handful rising above it:

[Figure: bar plot showing univariate ROC-AUC scores for all 227 classification features, with most bars at 0.5 and a few exceeding it]

Filter the Series to keep only features that score above 0.5:

PYTHON
sel = roc_values[roc_values>0.5]
sel
OUTPUT
244    0.507660
107    0.504832
104    0.502937
6      0.502807
155    0.502546
215    0.502286
17     0.502155
0      0.502056
11     0.501764
4      0.501373
216    0.501243
dtype: float64

Eleven features survive the ROC-AUC filter. Build the reduced training and test sets using those feature indices:

PYTHON
X_train_roc = X_train_unique[sel.index]
X_test_roc = X_test_unique[sel.index]

Compare Model Performance Before and After Selection

To confirm the selection is meaningful, train a RandomForestClassifier on both the reduced set and the original full set and compare accuracy and training time.

Define a reusable helper function that fits the classifier and prints accuracy:

PYTHON
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy on test set: ', accuracy_score(y_test, y_pred))

Train and evaluate using only the 11 ROC-AUC-selected features:

PYTHON
%%time
run_randomForest(X_train_roc, X_test_roc, y_train, y_test)
OUTPUT
Accuracy on test set:  0.95275
Wall time: 917 ms

Check the shape of the selected feature space to confirm you are working with 11 features:

PYTHON
X_train_roc.shape
OUTPUT
(16000, 11)

Train and evaluate using the full original 370 features as a baseline:

PYTHON
%%time
run_randomForest(X_train, X_test, y_train, y_test)
OUTPUT
Accuracy on test set:  0.9585
Wall time: 1.76 s

Using only 11 features instead of 370 cuts training time from 1.76 s to 917 ms while losing only 0.58 percentage points of accuracy — a strong trade-off for most real-world pipelines.
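
`%%time` is an IPython/Jupyter cell magic, so it is not available in a plain Python script. A minimal sketch of the same comparison using `time.perf_counter` is shown below; the helper name `timed_random_forest` and the synthetic data are hypothetical stand-ins for `run_randomForest` and the bank dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def timed_random_forest(X_train, X_test, y_train, y_test):
    """Fit a random forest and return (test accuracy, seconds elapsed)."""
    start = time.perf_counter()
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    return accuracy, time.perf_counter() - start


# Synthetic stand-in data (an assumption; not the tutorial's train.csv).
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

acc_all, secs_all = timed_random_forest(X_tr, X_te, y_tr, y_te)
acc_few, secs_few = timed_random_forest(X_tr[:, :5], X_te[:, :5], y_tr, y_te)
print(f"all 20 features: accuracy={acc_all:.3f}, time={secs_all:.3f}s")
print(f"first 5 features: accuracy={acc_few:.3f}, time={secs_few:.3f}s")
```

A single wall-clock measurement is noisy, so repeating each run a few times and comparing medians gives a more trustworthy picture of the speed-up.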

Feature Selection Using MSE in Regression

Univariate performance-based selection works for regression too. Instead of AUC, we measure Mean Squared Error (MSE) — the average squared difference between predicted and actual values. A feature with low MSE has a strong individual linear relationship with the target; a high MSE means the feature is a weak predictor on its own.

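The MSE formula is:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Where:

  • $n$: the number of samples in the test set
  • $y_i$: the actual target value for sample $i$
  • $\hat{y}_i$: the predicted target value for sample $i$
  • $\mathrm{MSE}$: the average squared prediction error; lower values indicate a better-fitting feature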
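
The definition (the mean of the squared differences between actual and predicted values) can be checked by hand. The sketch below does so on four hypothetical values and compares the result with scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values, invented for illustration.
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 30.7, 35.4])

# Errors are -1, 1, 4 and -2, so the squared errors are 1, 1, 16 and 4.
mse_manual = float(np.mean((y_true - y_pred) ** 2))
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)  # both approximately 5.5
```

Because the errors are squared, the single error of 4 contributes 16 of the 22 total, which shows how strongly MSE penalises large misses.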

Import the packages for linear modeling and regression metrics:

PYTHON
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
PYTHON
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

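Load the Boston Housing dataset and print its description to understand the 13 available features. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below runs only on older versions: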

PYTHON
boston = load_boston()
print(boston.DESCR)
OUTPUT
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
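
On scikit-learn 1.2 or later, `load_boston` is no longer available. The removal notice in scikit-learn pointed to rebuilding the arrays from the original StatLib file, where each record is spread over two lines. The sketch below follows that approach; the helper name `split_boston_raw` is hypothetical, and the URL and file layout are taken from that notice, so they are worth verifying before relying on them.

```python
import numpy as np
import pandas as pd

BOSTON_COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                  "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]


def split_boston_raw(raw_df):
    """Rebuild (features, target) from the two-lines-per-record raw layout."""
    values = raw_df.to_numpy()
    # Even rows hold the first 11 features; odd rows hold 2 more plus MEDV.
    data = np.hstack([values[::2, :], values[1::2, :2]])
    target = values[1::2, 2]
    return pd.DataFrame(data, columns=BOSTON_COLUMNS), target


# Usage against the original source (needs network access, so left commented):
# raw_df = pd.read_csv("http://lib.stat.cmu.edu/datasets/boston",
#                      sep=r"\s+", skiprows=22, header=None)
# X, y = split_boston_raw(raw_df)
```

The returned frame and target array can then stand in for `boston.data` and `boston.target` in the steps that follow.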

Create a Pandas DataFrame from the feature matrix and inspect the first rows:

PYTHON
X = pd.DataFrame(boston.data, columns=boston.feature_names)
X.head()
OUTPUT
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

Assign the target variable (median house prices):

PYTHON
y = boston.target

Split the regression dataset into 80% training and 20% test sets:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Scoring Each Feature by MSE

The loop below fits a LinearRegression on each feature individually and records its MSE on the test set. A lower MSE means the feature alone can predict the target more accurately:

PYTHON
mse = []
for feature in X_train.columns:
    clf = LinearRegression()
    clf.fit(X_train[feature].to_frame(), y_train)
    y_pred = clf.predict(X_test[feature].to_frame())
    mse.append(mean_squared_error(y_test, y_pred))

Print the raw MSE values in feature order:

PYTHON
mse
OUTPUT
[76.38674157646072, 84.66034377707905, 77.02905244667242, 79.36120219345942, 76.95375968209433, 46.907351627395315, 80.3915476111525, 82.61874125667718, 82.46499985731933, 78.30831374720843, 81.79497121208001, 77.75285601192718, 46.33630536002592]

Convert the results to a Pandas Series indexed by feature name and sort from highest MSE to lowest:

PYTHON
mse = pd.Series(mse, index = X_train.columns)
mse.sort_values(ascending=False, inplace = True)
mse
OUTPUT
ZN         84.660344
DIS        82.618741
RAD        82.465000
PTRATIO    81.794971
AGE        80.391548
CHAS       79.361202
TAX        78.308314
B          77.752856
INDUS      77.029052
NOX        76.953760
CRIM       76.386742
RM         46.907352
LSTAT      46.336305
dtype: float64

The bar plot below makes the winners obvious: RM (average rooms per dwelling) and LSTAT (lower-status population percentage) stand apart from the rest with much lower MSE values:

[Figure: bar plot showing univariate Mean Squared Error values for each of the 13 Boston housing features, with RM and LSTAT clearly lower than all others]

Build the reduced training and test sets using only the two best-performing features, RM and LSTAT:

PYTHON
X_train_2 = X_train[['RM', 'LSTAT']]
X_test_2 = X_test[['RM', 'LSTAT']]
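
Rather than typing the feature names by hand, the two lowest-MSE features can be picked programmatically. The sketch below rebuilds the score Series from the values printed above and uses `nsmallest`; the names `mse_scores` and `top_k` are hypothetical.

```python
import pandas as pd

# Per-feature test MSE values copied from the sorted output above.
mse_scores = pd.Series({
    "ZN": 84.660344, "DIS": 82.618741, "RAD": 82.465000,
    "PTRATIO": 81.794971, "AGE": 80.391548, "CHAS": 79.361202,
    "TAX": 78.308314, "B": 77.752856, "INDUS": 77.029052,
    "NOX": 76.953760, "CRIM": 76.386742, "RM": 46.907352,
    "LSTAT": 46.336305,
})

top_k = 2  # assumed number of features to keep
best_features = mse_scores.nsmallest(top_k).index.tolist()
print(best_features)  # ['LSTAT', 'RM']
```

The resulting list can be used directly as a column selector, for example `X_train[best_features]`, which keeps the selection step in sync with the scores if the data changes.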

Evaluating the Reduced Regression Model

Evaluate a LinearRegression trained on only the 2 selected features:

PYTHON
%%time
model = LinearRegression()
model.fit(X_train_2, y_train)
y_pred = model.predict(X_test_2)
print('r2_score: ', r2_score(y_test, y_pred))
print('rmse: ', np.sqrt(mean_squared_error(y_test, y_pred)))
print('sd of house price: ', np.std(y))
OUTPUT
r2_score:  0.5409084827186417
rmse:  6.114172522817782
sd of house price:  9.188011545278203
Wall time: 3 ms

Now train on all 13 original features as the baseline:

PYTHON
%%time
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('r2_score: ', r2_score(y_test, y_pred))
print('rmse: ', np.sqrt(mean_squared_error(y_test, y_pred)))
print('sd of house price: ', np.std(y))
OUTPUT
r2_score:  0.5892223849182507
rmse:  5.783509315085135
sd of house price:  9.188011545278203
Wall time: 4 ms

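With just 2 features the model reaches an R² of 0.54 compared to 0.59 with all 13, so it retains roughly 92% of the full model's explained variance after discarding 11 features. The RMSE rises only from 5.78 to 6.11, and both values sit well below the standard deviation of house prices (9.19), which is roughly the error you would make by always predicting the mean price. RM and LSTAT together therefore capture most of the signal that the full linear model finds.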

Conclusion

In this tutorial you implemented univariate performance-based feature selection across two tasks. For the 370-feature classification dataset, you scored every feature by its individual ROC-AUC and reduced the set from 227 unique features to 11, cutting wall time nearly in half relative to the full 370-feature model while retaining 95.3% accuracy (versus 95.85%). For the Boston Housing regression dataset, you scored each of the 13 features by MSE from a single-feature linear regression and identified RM and LSTAT as the two most informative predictors.

Key takeaways:

  • An ROC-AUC of exactly 0.5 means a feature performs no better than random guessing — any feature at or below that threshold can be dropped.
  • MSE scores features by their individual linear fit to the target; RM and LSTAT each achieved a test MSE of about 46 to 47, against 76 to 85 for every other Boston feature, making the selection clear-cut.
  • Removing constant, quasi-constant, and duplicate features first makes the per-feature scoring loop faster and avoids inflating the results with trivially useless columns.
  • A small accuracy loss (0.58 pp in this case) is often acceptable in exchange for a significantly smaller feature set and faster training.
  • This approach is model-agnostic for the scoring step — you can swap in any estimator to score features, as long as you apply the same estimator consistently across all features.
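
As a sketch of the last point, the scoring step can be wrapped in a helper that accepts any estimator and any scikit-learn scoring name. The function `score_features` and the synthetic data are hypothetical. The example also swaps in cross-validation via `cross_val_score` in place of the single train/test split used in this tutorial, which is a different (and usually less noisy) evaluation scheme.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def score_features(estimator, X, y, scoring, cv=5):
    """Return one mean cross-validated score per column of X, best first."""
    scores = {
        col: cross_val_score(estimator, X[[col]], y, scoring=scoring, cv=cv).mean()
        for col in X.columns
    }
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in data: one informative column and one noise column.
rng = np.random.default_rng(0)
signal = rng.normal(size=400)
X_demo = pd.DataFrame({"signal": signal, "noise": rng.normal(size=400)})
y_demo = (signal + 0.5 * rng.normal(size=400) > 0).astype(int)

ranking = score_features(LogisticRegression(), X_demo, y_demo, scoring="roc_auc")
print(ranking)
```

For regression, the same helper could be called with a regressor and `scoring="neg_mean_squared_error"`; scikit-learn negates MSE so that higher is always better, which keeps the descending sort meaningful.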
