When you have a dataset with many features, not all of them help the model. Some are redundant, some add noise, and some are simply irrelevant to the target. Training on all of them wastes time and can hurt accuracy. Feature selection is the process of choosing a subset of features that carries the most predictive signal.
Recursive Feature Elimination (RFE) is a wrapper method: it wraps around any estimator that exposes feature weights or importances (in Scikit-learn, a coef_ or feature_importances_ attribute) and repeatedly trims the weakest features until you reach your target count. The name "recursive" captures exactly what it does: fit, rank, remove the weakest, then repeat on the remaining features.
In this tutorial you will work with the breast cancer dataset (569 samples, 30 numeric features, binary target: malignant or benign). You will first use SelectFromModel as a simple baseline, then apply RFE with a RandomForestClassifier, and finally with a GradientBoostingClassifier. You will also sweep over every possible feature count to find the sweet spot where accuracy peaks.
Prerequisites: Python 3.x, NumPy, Pandas, Scikit-learn, Seaborn, Matplotlib.
Setting Up: Imports and Data
Start by importing every library the notebook needs in one place:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
Load the breast cancer dataset from Scikit-learn's built-in datasets:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
Print the full dataset description to understand what each feature measures:
print(data.DESCR)
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
Build the feature matrix X as a labelled DataFrame so column names are preserved throughout the tutorial:
X = pd.DataFrame(data = data.data, columns=data.feature_names)
X.head()
| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
Store the binary target labels in y:
y = data.target
Split the data into 80 % training and 20 % test sets:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
X_train.shape, X_test.shape
((455, 30), (114, 30))
The training set has 455 samples and the test set has 114 samples, each with 30 features.
Baseline: Feature Selection with SelectFromModel
Before applying full RFE, it is useful to establish a baseline using SelectFromModel, a faster, one-step approach. It trains a single RandomForestClassifier, reads the importance score the forest assigns to each feature, and keeps only the features whose score is at or above the mean of those scores (the default threshold for this kind of estimator).
Feature importance in a Random Forest measures how much each feature reduces impurity (measured by the Gini criterion) on average across all trees in the ensemble. Features that appear near the top of many trees get a high importance score.
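The selection rule described above can be sketched by hand. This is an illustration, not Scikit-learn's internal code: it reloads the data as plain arrays, fits the same forest, keeps every feature whose importance is at least the mean, and compares the result with SelectFromModel. The names forest, manual_mask, and selector are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit one forest and read its impurity-based importances (one value per feature).
forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_

# Keep every feature whose importance is at least the mean importance.
threshold = importances.mean()
manual_mask = importances >= threshold

# Same rule via SelectFromModel, which uses the mean as its default threshold here.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
selector.fit(X_train, y_train)

print("importances sum to:", importances.sum())
print("mean threshold:", threshold)
print("masks agree:", np.array_equal(manual_mask, selector.get_support()))
```

Because the importances of a fitted forest are normalised to sum to 1, the mean threshold is simply 1 divided by the number of features, which is why it does not depend on the data.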
Fit the selector on the training data:
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sel.fit(X_train, y_train)
sel.get_support()
array([ True, False, True, True, False, False, True, True, False, False, False, False, False, True, False, False, False, False, False, False, True, False, True, True, False, False, False, True, False, False])
True entries mark the features whose importance exceeded the mean threshold. Retrieve their names:
X_train.columns
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension'], dtype='object')
Filter the column list to keep only the selected features:
features = X_train.columns[sel.get_support()]
features
Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity', 'mean concave points', 'area error', 'worst radius', 'worst perimeter', 'worst area', 'worst concave points'],dtype='object')
len(features)
10
SelectFromModel selected 10 of the 30 features. Now inspect the underlying importance scores to understand why. The mean importance score (the threshold used for selection) is:
np.mean(sel.estimator_.feature_importances_)
0.03333333333333333
The individual per-feature importance scores are:
sel.estimator_.feature_importances_
array([0.03699612, 0.01561296, 0.06016409, 0.0371452 , 0.0063401 ,
0.00965994, 0.0798662 , 0.08669071, 0.00474992, 0.00417092,
0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304])
Any feature with a score above 0.033 (the mean) was selected; the remaining features were pruned. Now create a helper function to evaluate a Random Forest on any train/test split and report accuracy:
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
Transform the splits down to the selected features and measure accuracy:
X_train_rfc = sel.transform(X_train)
X_test_rfc = sel.transform(X_test)
%%time
run_randomForest(X_train_rfc, X_test_rfc, y_train, y_test)
Accuracy: 0.9473684210526315
Wall time: 250 ms
Compare that against training on all 30 features:
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy: 0.9649122807017544
Wall time: 256 ms
Dropping from 30 to 10 features cost about 1.75 percentage points of accuracy (96.49 % down to 94.74 %). The baseline is good but not yet optimal, and that is where RFE comes in.
Recursive Feature Elimination with Random Forest
SelectFromModel selects features in one shot. RFE does it recursively: it fits the model, drops the single least important feature, refits on the remaining features, and repeats until only n_features_to_select remain. This allows the model to re-evaluate feature relationships at each step, capturing interactions that a one-shot importance ranking can miss.
How RFE Works
At each step $t$ of the elimination loop, the estimator is refitted on the current feature set $F_t$ and a ranking score $s_j^{(t)}$ is computed for every remaining feature $j \in F_t$. The feature with the lowest score is removed:
$$j_t^{*} = \arg\min_{j \in F_t} s_j^{(t)}, \qquad F_{t+1} = F_t \setminus \{ j_t^{*} \}$$
Where:
- $F_t$: the set of features still in use at step $t$
- $s_j^{(t)}$: the importance score (e.g. Gini-based impurity reduction) assigned to feature $j$ by the estimator
- $j_t^{*}$: the feature with the lowest importance score, which is removed at this step
The loop runs until $|F_t| = k$, where $k$ is the target number of features you specify.
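The elimination loop can also be written out by hand. The sketch below is a simplified illustration, not Scikit-learn's RFE implementation: manual_rfe is a hypothetical helper, it removes exactly one feature per iteration, and it uses a 30-tree forest only to keep the run short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the lowest-importance feature until the target count is left."""
    remaining = list(range(X.shape[1]))  # column indices still in use
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the estimator on the current feature subset.
        model = clone(estimator).fit(X[:, remaining], y)
        # Position (within `remaining`) of the feature with the lowest score.
        weakest = int(np.argmin(model.feature_importances_))
        del remaining[weakest]  # eliminate it, then loop again
    return remaining


X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

estimator = RandomForestClassifier(n_estimators=30, random_state=0)
kept = manual_rfe(estimator, X_train, y_train, n_features_to_select=15)
print("kept column indices:", kept)
```

Each pass refits the model before ranking, so a feature's score can change once a correlated feature has been removed. That refit is the difference from a one-shot ranking.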
Apply RFE with a Random Forest estimator, targeting 15 features:
from sklearn.feature_selection import RFE
sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1), n_features_to_select = 15)
sel.fit(X_train, y_train)
RFE(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
oob_score=False, random_state=0, verbose=0, warm_start=False),
n_features_to_select=15, step=1, verbose=0)
Check which of the 30 features were retained:
sel.get_support()
array([ True, True, True, True, False, False, True, True, False,
False, False, False, False, True, False, False, False, False,
False, False, True, True, True, True, True, False, True,
True, True, False])
Retrieve their names:
features = X_train.columns[sel.get_support()]
features
Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean concavity', 'mean concave points', 'area error', 'worst radius',
'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
'worst concavity', 'worst concave points', 'worst symmetry'],
dtype='object')
Confirm the count:
len(features)
15
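Beyond the boolean mask, a fitted RFE object also exposes a ranking_ attribute that records the elimination order. The following sketch refits a smaller selector on plain arrays (30 trees instead of 100, an arbitrary choice for speed) and lists the five features that were discarded first. The variable names rfe and order are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

# Fewer trees than the tutorial's 100, purely to keep this sketch quick.
rfe = RFE(RandomForestClassifier(n_estimators=30, random_state=0), n_features_to_select=15)
rfe.fit(X_train, y_train)

# ranking_ is 1 for every kept feature; larger numbers were eliminated earlier.
order = np.argsort(rfe.ranking_)[::-1]
for idx in order[:5]:
    print(f"rank {rfe.ranking_[idx]:2d}  {data.feature_names[idx]}")
```

With the default step of 1, ranks run from 1 (kept) up to the number of eliminated features plus one, so the feature printed first is the one the estimator judged weakest on the full feature set.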
Transform both splits to the 15 selected features and evaluate:
X_train_rfe = sel.transform(X_train)
X_test_rfe = sel.transform(X_test)
%%time
run_randomForest(X_train_rfe, X_test_rfe, y_train, y_test)
Accuracy: 0.9736842105263158
Wall time: 251 ms
Compare against the full 30-feature baseline:
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy: 0.9649122807017544
Wall time: 254 ms
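With just 15 features, RFE achieved 97.37 %, about 0.9 percentage points better than using all 30 (96.49 %). On a 114-sample test set that gap is a single additional correct prediction, so read it as evidence that half the features can be dropped without hurting accuracy rather than as proof of a large gain.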
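To put the reported accuracies in perspective, it helps to convert them back into counts of test predictions. The short calculation below uses only numbers printed earlier in this tutorial (the 114-sample test split and the three accuracy values); the dictionary labels are just descriptive strings.

```python
n_test = 114  # size of the test split reported above

# Accuracies copied from the outputs shown earlier in the tutorial.
reported = {
    "all 30 features": 0.9649122807017544,
    "SelectFromModel (10 features)": 0.9473684210526315,
    "RFE + Random Forest (15 features)": 0.9736842105263158,
}

# Accuracy times test-set size gives the number of correct predictions.
correct = {name: round(acc * n_test) for name, acc in reported.items()}
for name, hits in correct.items():
    print(f"{name}: {hits}/{n_test} correct, {n_test - hits} misclassified")

print(f"one test sample is worth {100 / n_test:.2f} percentage points")
```

Seen this way, the three configurations differ by only a handful of test samples, which is worth remembering when comparing accuracies that are a point or two apart.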
Recursive Feature Elimination with Gradient Boosting
Gradient Boosting builds trees sequentially, where each new tree corrects the errors of the previous ones. This gives it a different perspective on feature importance compared to a Random Forest, which builds trees in parallel. Using a GradientBoostingClassifier inside RFE can therefore surface a different optimal feature subset.
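One way to see the two perspectives side by side is to fit both estimators on the full training set and compare their highest-ranked features. This is a rough sketch that reloads the data as arrays; the dictionary names models and top5 are invented for the example, and the exact rankings may vary with library version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)

models = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

top5 = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Indices of the five largest importance scores, highest first.
    best = np.argsort(model.feature_importances_)[::-1][:5]
    top5[name] = [str(data.feature_names[i]) for i in best]
    print(name, "->", top5[name])
```

Any disagreement between the two lists at this stage tends to compound during elimination, since each removal changes what the next fit sees.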
Import the estimator:
from sklearn.ensemble import GradientBoostingClassifier
Apply RFE with the gradient boosting estimator, targeting 12 features:
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select = 12)
sel.fit(X_train, y_train)
RFE(estimator=GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_sampl... subsample=1.0, tol=0.0001, validation_fraction=0.1,
verbose=0, warm_start=False),
n_features_to_select=12, step=1, verbose=0)
Check the selected features:
sel.get_support()
array([False, True, False, False, True, False, False, True, True,
False, False, False, False, True, False, False, True, False,
False, False, True, True, True, True, False, False, True,
True, False, False])
features = X_train.columns[sel.get_support()]
features
Index(['mean texture', 'mean smoothness', 'mean concave points',
'mean symmetry', 'area error', 'concavity error', 'worst radius',
'worst texture', 'worst perimeter', 'worst area', 'worst concavity',
'worst concave points'],
dtype='object')
len(features)
12
Notice that the gradient boosting selector chose a different set of 12 features than the Random Forest chose at 15 — mean smoothness, mean symmetry, and concavity error appear here but not in the earlier RFE run. Transform the splits and evaluate:
X_train_rfe = sel.transform(X_train)
X_test_rfe = sel.transform(X_test)
%%time
run_randomForest(X_train_rfe, X_test_rfe, y_train, y_test)
Accuracy: 0.9736842105263158
Wall time: 253 ms
%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy: 0.9649122807017544
Wall time: 253 ms
The gradient boosting selector at 12 features matches the Random Forest selector at 15 — 97.37 % — using 3 fewer features.
Sweeping Feature Counts to Find the Optimal Subset
Rather than guessing how many features to keep, you can iterate over all possible counts and record accuracy at each step. This sweep reveals whether there is a small subset that outperforms the full feature set.
Gradient Boosting RFE Sweep
Run RFE with the gradient boosting estimator for every feature count from 1 to 30:
for index in range(1, 31):
    sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select = index)
    sel.fit(X_train, y_train)
    X_train_rfe = sel.transform(X_train)
    X_test_rfe = sel.transform(X_test)
    print('Selected Feature: ', index)
    run_randomForest(X_train_rfe, X_test_rfe, y_train, y_test)
    print()
Selected Feature: 1
Accuracy: 0.8771929824561403
Selected Feature: 2
Accuracy: 0.9035087719298246
Selected Feature: 3
Accuracy: 0.9649122807017544
Selected Feature: 4
Accuracy: 0.9736842105263158
Selected Feature: 5
Accuracy: 0.9649122807017544
Selected Feature: 6
Accuracy: 0.9912280701754386
Selected Feature: 7
Accuracy: 0.9736842105263158
Selected Feature: 8
Accuracy: 0.9649122807017544
Selected Feature: 9
Accuracy: 0.9736842105263158
Selected Feature: 10
Accuracy: 0.956140350877193
Selected Feature: 11
Accuracy: 0.956140350877193
Selected Feature: 12
Accuracy: 0.9736842105263158
Selected Feature: 13
Accuracy: 0.956140350877193
Selected Feature: 14
Accuracy: 0.9649122807017544
Selected Feature: 15
Accuracy: 0.9649122807017544
Selected Feature: 16
Accuracy: 0.9824561403508771
Selected Feature: 17
Accuracy: 0.9649122807017544
Selected Feature: 18
Accuracy: 0.9736842105263158
Selected Feature: 19
Accuracy: 0.9649122807017544
Selected Feature: 20
Accuracy: 0.956140350877193
Selected Feature: 21
Accuracy: 0.9736842105263158
Selected Feature: 22
Accuracy: 0.9824561403508771
Selected Feature: 23
Accuracy: 0.9649122807017544
Selected Feature: 24
Accuracy: 0.9649122807017544
Selected Feature: 25
Accuracy: 0.9736842105263158
Selected Feature: 26
Accuracy: 0.9736842105263158
Selected Feature: 27
Accuracy: 0.9649122807017544
Selected Feature: 28
Accuracy: 0.9649122807017544
Selected Feature: 29
Accuracy: 0.9649122807017544
Selected Feature: 30
Accuracy: 0.9649122807017544
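Accuracy peaks at 6 features (99.12 %), higher than with all 30 (96.49 %). Beyond 6 it fluctuates between 95.61 % and 98.25 % with no clear trend, sometimes above the full-feature baseline and sometimes below it. This suggests that most of the remaining 24 features add little signal beyond the top six. Keep in mind that each step on a 114-sample test set is a single prediction, and that the feature count is being chosen on the same split used to report accuracy, so the peak is likely to be somewhat optimistic.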
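Reading thirty printed lines by eye is error-prone, so it can help to put the sweep results in an array and query them. The sketch below copies the accuracies from the output above, rounded to four decimals, and reports the best count along with the counts that land above or below the all-features result. The variable names are illustrative.

```python
import numpy as np

# Accuracies from the gradient boosting sweep output above (k = 1..30), rounded.
gb_acc = np.array([
    0.8772, 0.9035, 0.9649, 0.9737, 0.9649,  # k = 1..5
    0.9912, 0.9737, 0.9649, 0.9737, 0.9561,  # k = 6..10
    0.9561, 0.9737, 0.9561, 0.9649, 0.9649,  # k = 11..15
    0.9825, 0.9649, 0.9737, 0.9649, 0.9561,  # k = 16..20
    0.9737, 0.9825, 0.9649, 0.9649, 0.9737,  # k = 21..25
    0.9737, 0.9649, 0.9649, 0.9649, 0.9649,  # k = 26..30
])

k = np.arange(1, len(gb_acc) + 1)
baseline = gb_acc[-1]  # k = 30 keeps every feature

best_k = int(k[np.argmax(gb_acc)])
above = k[gb_acc > baseline]
below = k[gb_acc < baseline]

print("best feature count:", best_k, "accuracy:", gb_acc.max())
print("counts above the all-features result:", above.tolist())
print("counts below the all-features result:", below.tolist())
```

In a notebook, the same idea applies directly to the sweep loop if the helper returns the accuracy instead of printing it, so the values can be appended to a list as they are computed.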
Locking In the Best Gradient Boosting Subset
Retrain with exactly 6 features to confirm the peak result:
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select = 6)
sel.fit(X_train, y_train)
X_train_rfe = sel.transform(X_train)
X_test_rfe = sel.transform(X_test)
print('Selected Feature: ', 6)
run_randomForest(X_train_rfe, X_test_rfe, y_train, y_test)
print()
Selected Feature: 6
Accuracy: 0.9912280701754386
The 6-feature model achieves 99.12 %. Inspect which features were chosen:
features = X_train.columns[sel.get_support()]
features
Index(['mean concave points', 'area error', 'worst texture', 'worst perimeter',
'worst area', 'worst concave points'],
dtype='object')
These six features, mostly "worst" measurements plus a few key shape descriptors, were enough to match or beat every larger subset on this test split.
Random Forest RFE Sweep
Run the same sweep using a Random Forest estimator to compare how the two estimators rank features differently:
for index in range(1, 31):
    sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1), n_features_to_select = index)
    sel.fit(X_train, y_train)
    X_train_rfe = sel.transform(X_train)
    X_test_rfe = sel.transform(X_test)
    print('Selected Feature: ', index)
    run_randomForest(X_train_rfe, X_test_rfe, y_train, y_test)
    print()
Selected Feature: 1
Accuracy: 0.8947368421052632
Selected Feature: 2
Accuracy: 0.9298245614035088
Selected Feature: 3
Accuracy: 0.9473684210526315
Selected Feature: 4
Accuracy: 0.9649122807017544
Selected Feature: 5
Accuracy: 0.9649122807017544
Selected Feature: 6
Accuracy: 0.956140350877193
Selected Feature: 7
Accuracy: 0.956140350877193
Selected Feature: 8
Accuracy: 0.9649122807017544
Selected Feature: 9
Accuracy: 0.9736842105263158
Selected Feature: 10
Accuracy: 0.9736842105263158
Selected Feature: 11
Accuracy: 0.9649122807017544
Selected Feature: 12
Accuracy: 0.9736842105263158
Selected Feature: 13
Accuracy: 0.9649122807017544
Selected Feature: 14
Accuracy: 0.9736842105263158
Selected Feature: 15
Accuracy: 0.9736842105263158
Selected Feature: 16
Accuracy: 0.9736842105263158
Selected Feature: 17
Accuracy: 0.9824561403508771
Selected Feature: 18
Accuracy: 0.9649122807017544
Selected Feature: 19
Accuracy: 0.9649122807017544
Selected Feature: 20
Accuracy: 0.9736842105263158
Selected Feature: 21
Accuracy: 0.9736842105263158
Selected Feature: 22
Accuracy: 0.9736842105263158
Selected Feature: 23
Accuracy: 0.9649122807017544
Selected Feature: 24
Accuracy: 0.9824561403508771
Selected Feature: 25
Accuracy: 0.956140350877193
Selected Feature: 26
Accuracy: 0.956140350877193
Selected Feature: 27
Accuracy: 0.9649122807017544
Selected Feature: 28
Accuracy: 0.9649122807017544
Selected Feature: 29
Accuracy: 0.9649122807017544
Selected Feature: 30
Accuracy: 0.9649122807017544
With the Random Forest estimator, accuracy stays between 95.61 % and 98.25 % once four or more features are kept. It peaks at 98.25 % (at 17 and 24 features) and never reaches the 99.12 % that Gradient Boosting achieved at 6 features. This illustrates an important practical point: the choice of estimator inside RFE influences which features are deemed important and how high accuracy can go.
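Both sweeps above choose the feature count by looking at test accuracy. A different technique, not used in this tutorial, is Scikit-learn's RFECV, which picks the count by cross-validation on the training data and leaves the test split untouched until the end. The sketch below is a minimal example under assumed settings: a 25-tree forest and 3 folds were chosen only to keep it fast, and its result is not expected to match the sweeps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Small forest and 3 folds are arbitrary choices to keep the sketch fast.
selector = RFECV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    step=1,
    cv=3,
    scoring="accuracy",
)
selector.fit(X_train, y_train)  # the test split is never seen during selection

# Train the final model on the selected columns and score it once on the test split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(selector.transform(X_train), y_train)
test_acc = accuracy_score(y_test, clf.predict(selector.transform(X_test)))

print("features chosen by cross-validation:", selector.n_features_)
print("held-out accuracy:", test_acc)
```

The trade-off is runtime: every fold repeats the whole elimination, so a slow estimator such as gradient boosting makes this noticeably more expensive than a single RFE fit.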
Conclusion
In this tutorial you applied Recursive Feature Elimination to the breast cancer dataset using two estimators: RandomForestClassifier and GradientBoostingClassifier. Starting from 30 features, the Gradient Boosting RFE sweep identified a 6-feature subset that achieved 99.12 % accuracy, surpassing the 96.49 % baseline trained on all features. The Random Forest RFE topped out at 98.25 % (at 17 and 24 features), showing that the internal estimator meaningfully affects which subset RFE discovers.
Key takeaways:
- RFE re-evaluates feature importance at every elimination step, allowing it to capture interactions that a single-pass importance threshold (like SelectFromModel) can miss.
- The choice of estimator inside RFE matters: Gradient Boosting found a stronger 6-feature subset than Random Forest found at any count.
- Sweeping over all feature counts is the safest way to find the optimal number: never assume the full feature set gives the best accuracy.
- Feature reduction does not always hurt accuracy; removing noisy or redundant features can actually improve it.
- SelectFromModel is a faster alternative when you only need a rough first cut and cannot afford the runtime of multiple RFE fits.
Next steps:
- Learn the forward and backward selection variants of wrapper methods in Step Forward and Step Backward Feature Selection.
- Explore how regularization-based embedded methods compare to RFE in Lasso and Ridge Regularisation for Feature Selection.
- Understand the filter methods that can serve as a cheap pre-screening step before running RFE in Constant, Quasi-Constant, and Duplicate Feature Removal.
