Lasso and Ridge Regularization for Feature Selection

When you train a machine learning model on too many features, it can memorize patterns in the training data that do not generalize to new data — a problem called overfitting. One of the most effective ways to fight overfitting is regularization — adding a penalty to the model that discourages large coefficients.

What makes regularization especially useful for feature selection is that the L1 variant (called Lasso) can push the coefficients of unimportant features all the way to zero. A feature with a zero coefficient contributes nothing to the prediction, so Lasso effectively removes it from the model automatically. This makes Lasso an embedded feature selection method — the selection happens as part of the training process, not as a separate preprocessing step.

In this tutorial you will apply Lasso and Ridge regularization to the Titanic survival dataset. You will use SelectFromModel from Scikit-Learn to extract the features Lasso considers important, then compare model accuracy against a Ridge classifier tuned with cross-validation.

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib.

What is Regularization?

Every supervised model tries to minimize a loss function — for linear models that is typically the Residual Sum of Squares (RSS), which measures how far the model's predictions are from the true labels. The problem is that minimizing RSS alone can produce very large coefficients that fit training noise perfectly but fail on unseen data.

Regularization fixes this by adding an extra term to the loss function that penalizes large coefficients. The model must now find a balance: reduce prediction error while also keeping coefficients small.

There are two common penalty types:

L1 Regularization (Lasso) — penalizes the sum of the absolute values of the coefficients. This can shrink some coefficients to exactly zero, removing their features from the model entirely.
L2 Regularization (Ridge) — penalizes the sum of the squared values of the coefficients. This shrinks all coefficients toward zero but rarely reaches exactly zero, so all features remain in the model with reduced influence.
Elastic Net — combines both L1 and L2 penalties, giving you the benefits of each.

Bias, Variance, and Model Complexity

To understand why regularization works, you need to understand bias and variance — the two main sources of prediction error you can actually control.

Bias error measures how far the model's average predictions are from the true values. A high-bias model is too simple and consistently wrong in the same direction (underfitting).
Variance measures how much the model's predictions change when trained on different samples of the data. A high-variance model is too complex and fits training noise (overfitting).

The diagram below shows how these two errors behave as model complexity increases:

Bias-variance tradeoff diagram showing how bias decreases and variance increases with model complexity, with MSE as the sum of both

As complexity rises, bias falls but variance climbs. The total error (MSE = bias² + variance) has a sweet spot — the optimal model complexity where the combined error is lowest. Regularization is a direct tool for controlling this tradeoff: a stronger penalty reduces variance at the cost of a small increase in bias.

Lasso (L1) Regularization

The Lasso loss function adds the L1 norm of the weight vector to the RSS:

Cost (w) = i = 1 \sum N (y_{i} - w_{0} h_{0} (x_{i}) - w_{1} h_{1} (x_{i}))^{2} + λ (∣ w_{0} ∣ + ∣ w_{1} ∣)

Where:

$y_{i}$ — the true label for the $i$ -th training sample
$w_{0}, w_{1}$ — the model's regression coefficients (one per feature)
$h (x_{i})$ — the feature values for sample $i$
$λ$ — the regularization strength; higher values impose a stronger penalty
$∣ w_{0} ∣ + ∣ w_{1} ∣$ — the L1 norm, i.e., the sum of absolute coefficient values

Because the L1 penalty creates sharp corners in the constraint geometry, the optimizer frequently lands on solutions where some coefficients are exactly zero. The plot below shows Lasso coefficient paths as the regularization strength increases — several features are driven to zero well before others:

Lasso coefficient path plot showing how feature coefficients shrink to zero at different L1 regularization strengths

Features whose coefficients reach zero can be dropped from the model entirely. This is the core mechanism behind Lasso as a feature selector.

Choosing the Regularization Strength λ

The value of $λ$ determines how aggressively the model penalizes large coefficients. Too small a $λ$ gives you almost no regularization; too large a $λ$ forces all coefficients toward zero and produces a useless model. The plot below shows how the test error (RMSE) evolves as $λ$ increases:

Test error vs lambda plot showing a U-shaped curve with a minimum at the optimal regularization strength

The curve has a clear minimum — the optimal $λ$ sits at the bottom.

The standard method for finding this value is to split your data into three parts:

Training set — fit the model and learn the coefficients for a given $λ$ .
Validation set — measure performance to pick the best $λ$ without touching the test set.
Test set — assess the final generalization error once after selecting $λ$ .

The diagram below illustrates this three-way split and how each set is used:

Data split diagram showing training set used to fit the model, validation set used to select the optimal lambda, and test set used to assess final generalization error

Ridge (L2) Regularization

Ridge regularization uses the squared L2 norm of the weight vector instead of the absolute values:

Cost (w) = RSS (w) + λ ∥ w ∥_{2}^{2}

Where:

$RSS (w)$ — residual sum of squares, the standard least-squares prediction error
$λ$ — regularization strength; controls the balance between fit quality and coefficient size
$∥ w ∥_{2}^{2}$ — the squared L2 norm, i.e., the sum of squared coefficient values

The diagram below summarizes the practical effect of $λ$ in Ridge:

Diagram showing that large lambda produces high bias and low variance while small lambda produces low bias and high variance, with the note that lambda controls model complexity

A large $λ$ pushes all coefficients close to zero (high bias, low variance). A small $λ$ approaches standard least-squares regression (low bias, high variance). Because the L2 penalty uses squares rather than absolute values, it never forces a coefficient to exactly zero — it only shrinks them progressively. The Ridge coefficient path confirms this:

Ridge coefficient path plot showing all feature coefficients decreasing gradually toward zero as L2 regularization strength increases, without any coefficient reaching exactly zero

Compare this to the Lasso plot: Ridge coefficients decrease smoothly but remain non-zero, while Lasso coefficients hit zero abruptly. This is why Lasso selects features and Ridge does not.

Lasso Feature Selection on Titanic

Now you will apply everything above to a real classification problem. The goal is to predict whether a Titanic passenger survived using a subset of features selected by Lasso regularization.

Load and Clean the Data

Import all required libraries in one block:

PYTHON

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

PYTHON

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Load the Titanic dataset and check for missing values:

PYTHON

titanic = sns.load_dataset('titanic')
titanic.isnull().sum()

OUTPUT

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

The age column has 177 missing values and deck has 688 — too many to impute reliably. Drop them both:

PYTHON

titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)

Drop the remaining rows with any missing values and confirm the dataset is clean:

PYTHON

titanic = titanic.dropna()
titanic.isnull().sum()

OUTPUT

survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

Inspect the first few rows to understand the structure:

PYTHON

titanic.head()

OUTPUT

	survived	pclass	sex	sibsp	fare	embarked	class	who	adult_male	embark_town	alive	alone
0	0	3	male	1	7.2500	S	Third	man	True	Southampton	no	False
1	1	1	female	1	71.2833	C	First	woman	False	Cherbourg	yes	False
2	1	3	female	0	7.9250	S	Third	woman	False	Southampton	yes	True
3	1	1	female	1	53.1000	S	First	woman	False	Southampton	yes	False
4	0	3	male	0	8.0500	S	Third	man	True	Southampton	no	True

Prepare Features

Select the seven features you will use for classification:

PYTHON

data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()

PYTHON

data.head()

OUTPUT

	pclass	sex	sibsp	embarked	who	alone
0	3	male	1	S	man	False
1	1	female	1	C	woman	False
2	3	female	0	S	woman	True
3	1	female	1	S	woman	False
4	3	male	0	S	man	True

Encode the categorical columns as integers so they can be passed to a linear model:

PYTHON

sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)                       ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)             who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)                       alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)

Assign features to X and the target to y, then split into training and test sets:

PYTHON

X = data.copy()
y = titanic['survived']
x.head()

OUTPUT

No.	pclass	sex	sibsp	embarked	who	alone
0	3	0	1	0	0	0
1	1	1	1	1	1	0
2	3	1	0	0	1	1
3	1	1	1	0	1	0
4	3	0	0	0	0	1

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

Select Features with SelectFromModel

SelectFromModel is a meta-transformer — it wraps any estimator that produces coefficients or feature importances and uses those values to select the most important features. Here you wrap a LogisticRegression with L1 penalty (penalty='l1') and a small C value. C is the inverse of the regularization strength — a small C means strong regularization, which forces more coefficients to zero:

PYTHON

sel = SelectFromModel(LogisticRegression(C = 0.05, penalty = 'l1', solver = 'liblinear'))
sel.fit(X_train, y_train)

PYTHON

SelectFromModel(estimator=LogisticRegression(C=0.05, penalty='l1', solver='liblinear'))

get_support() returns a boolean mask showing which features were kept (True) and which were eliminated (False):

PYTHON

sel.get_support()

OUTPUT

array([ True,  True,  True, False, False,  True, False])

Four features were selected. Retrieve their names:

PYTHON

features = X_train.columns[sel.get_support()]
features

OUTPUT

Index(['pclass', 'sex', 'sibsp', 'who'], dtype='object')

Lasso kept pclass, sex, sibsp, and who — the features most correlated with survival. Apply the transformation to produce reduced training and test sets:

PYTHON

X_train_l1 = sel.transform(X_train)
X_test_l1 = sel.transform(X_test)
X_train_l1.shape, X_test_l1.shape

OUTPUT

((595, 4), (294, 4))

The training set now has 4 columns instead of 7, and the test set matches.

Build and Compare Random Forest Models

Define a helper function that trains a RandomForestClassifier and prints the accuracy:

PYTHON

def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs = -1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))

Train on the Lasso-selected features (4 columns):

PYTHON

%%time
run_randomForest(X_train_l1, X_test_l1, y_train, y_test)

OUTPUT

Accuracy:  0.826530612244898
Wall time: 517 ms

Train on all 7 original features for comparison:

PYTHON

%%time
run_randomForest(X_train, X_test, y_train, y_test)

OUTPUT

Accuracy:  0.8163265306122449
Wall time: 169 ms

The Lasso-selected model achieves 82.7 % accuracy versus 81.6 % for the full feature set — removing the three weakest features actually improved performance slightly, showing that those features were adding noise rather than signal.

Ridge Regression for Classification

Ridge regularization is available for classification through RidgeClassifier. Import it:

PYTHON

from sklearn.linear_model import RidgeClassifier

Fit a Ridge classifier with a fixed alpha of 300 (alpha is equivalent to $λ$ in the formula — a larger value means stronger regularization):

PYTHON

rr = RidgeClassifier(alpha=300)
rr.fit(X_train, y_train)               RidgeClassifier(alpha=300)

Check the accuracy on the test set:

PYTHON

rr.score(X_test, y_test)

OUTPUT

0.8231292517006803

Inspect the coefficients — because Ridge uses L2 regularization, none of them are exactly zero:

PYTHON

rr.coef_

OUTPUT

array([[-0.20537487,  0.24017869, -0.07964489, -0.00072071,  0.05154718, 0.26474716, -0.07454003]])

All seven features remain in the model, but the smaller coefficients reveal the less influential features (parch at −0.00072 is nearly irrelevant).

Tune Alpha with Cross-Validation

Instead of guessing alpha, use RidgeClassifierCV to find the best value automatically. It performs Generalized Cross-Validation (GCV) — an efficient approximation of Leave-One-Out cross-validation — across every alpha you specify:

PYTHON

from sklearn.linear_model import RidgeClassifierCV

PYTHON

rr = RidgeClassifierCV(alphas=[10, 20, 50, 100, 200, 300], cv = 10 )
rr.fit(X_train, y_train)        RidgeClassifierCV(alphas=array([ 10, 20, 50, 100, 200, 300]), cv=10)

Evaluate on the test set:

PYTHON

rr.score(X_test, y_test)

OUTPUT

0.8197278911564626

Read the coefficients the cross-validated model learned:

PYTHON

rr.coef_

OUTPUT

array([[-0.23422431,  0.29215915, -0.09681069, -0.01263653,  0.05860246, 0.31323408, -0.09073738]])

Check which alpha was selected as the best:

PYTHON

rr.alpha_

OUTPUT

Confirm the full list of candidates that were evaluated:

PYTHON

rr.alphas

OUTPUT

array([ 10,  20,  50, 100, 200, 300])

RidgeClassifierCV selected alpha=200 as the best regularization strength across 10-fold cross-validation, achieving approximately 82 % accuracy.

Conclusion

In this tutorial you used Lasso (L1) and Ridge (L2) regularization as embedded feature selection methods on the Titanic dataset. Lasso's SelectFromModel pipeline automatically eliminated three of seven features, and the resulting Random Forest model scored 82.7 % — slightly better than using all features. The Ridge classifier with cross-validated alpha selection achieved 82 % using all features with shrunk coefficients.

Key takeaways:

Lasso (L1) forces some coefficients to exactly zero, making it a true feature selector; features with zero coefficients can be safely dropped.
Ridge (L2) shrinks all coefficients toward zero but never eliminates any feature; it reveals relative importance rather than performing hard selection.
SelectFromModel wraps any coefficient-based estimator and turns it into a feature selection transformer compatible with any Scikit-Learn pipeline.
RidgeClassifierCV automates regularization strength tuning via cross-validation, removing the need to guess alpha manually.
Removing weak features with Lasso can improve model accuracy by reducing noise, even when the number of features is small.

Next steps:

Explore how raw regression coefficients compare to regularized ones in Use of Linear and Logistic Regression Coefficients with Lasso and Ridge for Feature Selection.
Apply a model-agnostic wrapper approach in Recursive Feature Elimination (RFE) using tree-based and gradient-based estimators.
Start from first principles with statistical filters in Feature Selection with Filter Methods to see how constant, quasi-constant, and duplicate features are removed before modeling.

Lasso and Ridge Regularization for Feature Selection

Topics You Will Master

What is Regularization?

Bias, Variance, and Model Complexity

Lasso (L1) Regularization

Choosing the Regularization Strength λ

Ridge (L2) Regularization

Lasso Feature Selection on Titanic

Load and Clean the Data

Prepare Features

Select Features with SelectFromModel

Build and Compare Random Forest Models

Ridge Regression for Classification

Tune Alpha with Cross-Validation

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments