#feature selection#lasso#ridge#regularization#scikit-learn#python#logistic regression

Lasso and Ridge Regularization for Feature Selection

Learn how Lasso (L1) and Ridge (L2) regularization act as embedded feature selectors. Apply SelectFromModel and RidgeClassifierCV on the Titanic dataset in Python.

May 22, 2026 at 8:15 AM11 min readFollowFollow (Hindi)

Topics You Will Master

What regularization is and why it prevents overfitting
How L1 (Lasso) regularization forces feature coefficients to exactly zero
How L2 (Ridge) regularization shrinks coefficients without eliminating them
How to use SelectFromModel to automatically select features from a Lasso model
How to tune the regularization strength with RidgeClassifierCV
Best For

Python developers and data scientists who understand basic classification and want to add automated feature selection directly inside their model training pipeline.

Expected Outcome

A working Lasso-based feature selection pipeline on the Titanic dataset, plus a Ridge classifier with cross-validated regularization strength, with accuracy scores you can compare.

When you train a machine learning model on too many features, it can memorize patterns in the training data that do not generalize to new data — a problem called overfitting. One of the most effective ways to fight overfitting is regularization — adding a penalty to the model that discourages large coefficients.

What makes regularization especially useful for feature selection is that the L1 variant (called Lasso) can push the coefficients of unimportant features all the way to zero. A feature with a zero coefficient contributes nothing to the prediction, so Lasso effectively removes it from the model automatically. This makes Lasso an embedded feature selection method — the selection happens as part of the training process, not as a separate preprocessing step.

In this tutorial you will apply Lasso and Ridge regularization to the Titanic survival dataset. You will use SelectFromModel from Scikit-Learn to extract the features Lasso considers important, then compare model accuracy against a Ridge classifier tuned with cross-validation.

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib.

What is Regularization?

Every supervised model tries to minimize a loss function — for linear models that is typically the Residual Sum of Squares (RSS), which measures how far the model's predictions are from the true labels. The problem is that minimizing RSS alone can produce very large coefficients that fit training noise perfectly but fail on unseen data.

Regularization fixes this by adding an extra term to the loss function that penalizes large coefficients. The model must now find a balance: reduce prediction error while also keeping coefficients small.

There are two common penalty types:

  • L1 Regularization (Lasso) — penalizes the sum of the absolute values of the coefficients. This can shrink some coefficients to exactly zero, removing their features from the model entirely.
  • L2 Regularization (Ridge) — penalizes the sum of the squared values of the coefficients. This shrinks all coefficients toward zero but rarely reaches exactly zero, so all features remain in the model with reduced influence.
  • Elastic Net — combines both L1 and L2 penalties, giving you the benefits of each.

Bias, Variance, and Model Complexity

To understand why regularization works, you need to understand bias and variance — the two main sources of prediction error you can actually control.

  • Bias error measures how far the model's average predictions are from the true values. A high-bias model is too simple and consistently wrong in the same direction (underfitting).
  • Variance measures how much the model's predictions change when trained on different samples of the data. A high-variance model is too complex and fits training noise (overfitting).

The diagram below shows how these two errors behave as model complexity increases:

Bias-variance tradeoff diagram showing how bias decreases and variance increases with model complexity, with MSE as the sum of both

As complexity rises, bias falls but variance climbs. The total error (MSE = bias² + variance) has a sweet spot — the optimal model complexity where the combined error is lowest. Regularization is a direct tool for controlling this tradeoff: a stronger penalty reduces variance at the cost of a small increase in bias.

Lasso (L1) Regularization

The Lasso loss function adds the L1 norm of the weight vector to the RSS:

Where:

  • — the true label for the -th training sample
  • — the model's regression coefficients (one per feature)
  • — the feature values for sample
  • — the regularization strength; higher values impose a stronger penalty
  • — the L1 norm, i.e., the sum of absolute coefficient values

Because the L1 penalty creates sharp corners in the constraint geometry, the optimizer frequently lands on solutions where some coefficients are exactly zero. The plot below shows Lasso coefficient paths as the regularization strength increases — several features are driven to zero well before others:

Lasso coefficient path plot showing how feature coefficients shrink to zero at different L1 regularization strengths

Features whose coefficients reach zero can be dropped from the model entirely. This is the core mechanism behind Lasso as a feature selector.

Choosing the Regularization Strength λ

The value of determines how aggressively the model penalizes large coefficients. Too small a gives you almost no regularization; too large a forces all coefficients toward zero and produces a useless model. The plot below shows how the test error (RMSE) evolves as increases:

Test error vs lambda plot showing a U-shaped curve with a minimum at the optimal regularization strength

The curve has a clear minimum — the optimal sits at the bottom.

The standard method for finding this value is to split your data into three parts:

  • Training set — fit the model and learn the coefficients for a given .
  • Validation set — measure performance to pick the best without touching the test set.
  • Test set — assess the final generalization error once after selecting .

The diagram below illustrates this three-way split and how each set is used:

Data split diagram showing training set used to fit the model, validation set used to select the optimal lambda, and test set used to assess final generalization error

Ridge (L2) Regularization

Ridge regularization uses the squared L2 norm of the weight vector instead of the absolute values:

Where:

  • — residual sum of squares, the standard least-squares prediction error
  • — regularization strength; controls the balance between fit quality and coefficient size
  • — the squared L2 norm, i.e., the sum of squared coefficient values

The diagram below summarizes the practical effect of in Ridge:

Diagram showing that large lambda produces high bias and low variance while small lambda produces low bias and high variance, with the note that lambda controls model complexity

A large pushes all coefficients close to zero (high bias, low variance). A small approaches standard least-squares regression (low bias, high variance). Because the L2 penalty uses squares rather than absolute values, it never forces a coefficient to exactly zero — it only shrinks them progressively. The Ridge coefficient path confirms this:

Ridge coefficient path plot showing all feature coefficients decreasing gradually toward zero as L2 regularization strength increases, without any coefficient reaching exactly zero

Compare this to the Lasso plot: Ridge coefficients decrease smoothly but remain non-zero, while Lasso coefficients hit zero abruptly. This is why Lasso selects features and Ridge does not.

Lasso Feature Selection on Titanic

Now you will apply everything above to a real classification problem. The goal is to predict whether a Titanic passenger survived using a subset of features selected by Lasso regularization.

Load and Clean the Data

Import all required libraries in one block:

PYTHON
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Load the Titanic dataset and check for missing values:

PYTHON
titanic = sns.load_dataset('titanic')
titanic.isnull().sum()
OUTPUT
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

The age column has 177 missing values and deck has 688 — too many to impute reliably. Drop them both:

PYTHON
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)

Drop the remaining rows with any missing values and confirm the dataset is clean:

PYTHON
titanic = titanic.dropna()
titanic.isnull().sum()
OUTPUT
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

Inspect the first few rows to understand the structure:

PYTHON
titanic.head()
OUTPUT
survivedpclasssexsibspparchfareembarkedclasswhoadult_maleembark_townalivealone
003male107.2500SThirdmanTrueSouthamptonnoFalse
111female1071.2833CFirstwomanFalseCherbourgyesFalse
213female007.9250SThirdwomanFalseSouthamptonyesTrue
311female1053.1000SFirstwomanFalseSouthamptonyesFalse
403male008.0500SThirdmanTrueSouthamptonnoTrue

Prepare Features

Select the seven features you will use for classification:

PYTHON
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
PYTHON
data.head()
OUTPUT
pclasssexsibspparchembarkedwhoalone
03male10SmanFalse
11female10CwomanFalse
23female00SwomanTrue
31female10SwomanFalse
43male00SmanTrue

Encode the categorical columns as integers so they can be passed to a linear model:

PYTHON
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)                       ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)             who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)                       alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)

Assign features to X and the target to y, then split into training and test sets:

PYTHON
X = data.copy()
y = titanic['survived']
x.head()
OUTPUT
No.pclasssexsibspparchembarkedwhoalone
03010000
11110110
23100011
31110010
43000001
PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)

Select Features with SelectFromModel

SelectFromModel is a meta-transformer — it wraps any estimator that produces coefficients or feature importances and uses those values to select the most important features. Here you wrap a LogisticRegression with L1 penalty (penalty='l1') and a small C value. C is the inverse of the regularization strength — a small C means strong regularization, which forces more coefficients to zero:

PYTHON
sel = SelectFromModel(LogisticRegression(C = 0.05, penalty = 'l1', solver = 'liblinear'))
sel.fit(X_train, y_train)
PYTHON
SelectFromModel(estimator=LogisticRegression(C=0.05, penalty='l1', solver='liblinear'))

get_support() returns a boolean mask showing which features were kept (True) and which were eliminated (False):

PYTHON
sel.get_support()
OUTPUT
array([ True,  True,  True, False, False,  True, False])

Four features were selected. Retrieve their names:

PYTHON
features = X_train.columns[sel.get_support()]
features
OUTPUT
Index(['pclass', 'sex', 'sibsp', 'who'], dtype='object')

Lasso kept pclass, sex, sibsp, and who — the features most correlated with survival. Apply the transformation to produce reduced training and test sets:

PYTHON
X_train_l1 = sel.transform(X_train)
X_test_l1 = sel.transform(X_test)
X_train_l1.shape, X_test_l1.shape
OUTPUT
((595, 4), (294, 4))

The training set now has 4 columns instead of 7, and the test set matches.

Build and Compare Random Forest Models

Define a helper function that trains a RandomForestClassifier and prints the accuracy:

PYTHON
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs = -1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))

Train on the Lasso-selected features (4 columns):

PYTHON
%%time
run_randomForest(X_train_l1, X_test_l1, y_train, y_test)
OUTPUT
Accuracy:  0.826530612244898
Wall time: 517 ms

Train on all 7 original features for comparison:

PYTHON
%%time
run_randomForest(X_train, X_test, y_train, y_test)
OUTPUT
Accuracy:  0.8163265306122449
Wall time: 169 ms

The Lasso-selected model achieves 82.7 % accuracy versus 81.6 % for the full feature set — removing the three weakest features actually improved performance slightly, showing that those features were adding noise rather than signal.

Ridge Regression for Classification

Ridge regularization is available for classification through RidgeClassifier. Import it:

PYTHON
from sklearn.linear_model import RidgeClassifier

Fit a Ridge classifier with a fixed alpha of 300 (alpha is equivalent to in the formula — a larger value means stronger regularization):

PYTHON
rr = RidgeClassifier(alpha=300)
rr.fit(X_train, y_train)               RidgeClassifier(alpha=300)

Check the accuracy on the test set:

PYTHON
rr.score(X_test, y_test)
OUTPUT
0.8231292517006803

Inspect the coefficients — because Ridge uses L2 regularization, none of them are exactly zero:

PYTHON
rr.coef_
OUTPUT
array([[-0.20537487,  0.24017869, -0.07964489, -0.00072071,  0.05154718, 0.26474716, -0.07454003]])

All seven features remain in the model, but the smaller coefficients reveal the less influential features (parch at −0.00072 is nearly irrelevant).

Tune Alpha with Cross-Validation

Instead of guessing alpha, use RidgeClassifierCV to find the best value automatically. It performs Generalized Cross-Validation (GCV) — an efficient approximation of Leave-One-Out cross-validation — across every alpha you specify:

PYTHON
from sklearn.linear_model import RidgeClassifierCV
PYTHON
rr = RidgeClassifierCV(alphas=[10, 20, 50, 100, 200, 300], cv = 10 )
rr.fit(X_train, y_train)        RidgeClassifierCV(alphas=array([ 10, 20, 50, 100, 200, 300]), cv=10)

Evaluate on the test set:

PYTHON
rr.score(X_test, y_test)
OUTPUT
0.8197278911564626

Read the coefficients the cross-validated model learned:

PYTHON
rr.coef_
OUTPUT
array([[-0.23422431,  0.29215915, -0.09681069, -0.01263653,  0.05860246, 0.31323408, -0.09073738]])

Check which alpha was selected as the best:

PYTHON
rr.alpha_
OUTPUT
200

Confirm the full list of candidates that were evaluated:

PYTHON
rr.alphas
OUTPUT
array([ 10,  20,  50, 100, 200, 300])

RidgeClassifierCV selected alpha=200 as the best regularization strength across 10-fold cross-validation, achieving approximately 82 % accuracy.

Conclusion

In this tutorial you used Lasso (L1) and Ridge (L2) regularization as embedded feature selection methods on the Titanic dataset. Lasso's SelectFromModel pipeline automatically eliminated three of seven features, and the resulting Random Forest model scored 82.7 % — slightly better than using all features. The Ridge classifier with cross-validated alpha selection achieved 82 % using all features with shrunk coefficients.

Key takeaways:

  • Lasso (L1) forces some coefficients to exactly zero, making it a true feature selector; features with zero coefficients can be safely dropped.
  • Ridge (L2) shrinks all coefficients toward zero but never eliminates any feature; it reveals relative importance rather than performing hard selection.
  • SelectFromModel wraps any coefficient-based estimator and turns it into a feature selection transformer compatible with any Scikit-Learn pipeline.
  • RidgeClassifierCV automates regularization strength tuning via cross-validation, removing the need to guess alpha manually.
  • Removing weak features with Lasso can improve model accuracy by reducing noise, even when the number of features is small.

Next steps:

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments