
Regression Coefficients for Feature Selection with Lasso and Ridge

Learn how to use linear and logistic regression coefficients with Lasso (L1) and Ridge (L2) regularization to select the most informative features in Python.

May 17, 2026 at 5:15 PM | 13 min read

Topics You Will Master

How regression coefficients act as feature importance scores
How Lasso (L1) regularization drives unimportant coefficients to exactly zero
How Ridge (L2) regularization shrinks coefficients without removing features
How to use SelectFromModel with regularized regression to automate feature pruning
How to evaluate the impact of feature selection on classification accuracy

Best For

Python developers and data scientists who understand basic supervised learning and want a practical, code-first guide to embedded feature selection using regularized regression.

Expected Outcome

A working feature selection pipeline on the Titanic dataset that uses both Lasso and Ridge regularized logistic regression to rank and prune features, with accuracy benchmarked against the full feature set.

When you train a regression model, each feature gets a coefficient — a number that tells you how strongly that feature influences the prediction. Features with large coefficients matter a lot; features with coefficients near zero contribute almost nothing. Regularization — adding a penalty to the model's loss function — forces these coefficients to shrink, and with the right type of regularization you can push the weakest ones all the way to zero, automatically eliminating those features.

This tutorial shows you how to apply that idea in practice. You will use the Titanic dataset to select features with three models: a plain Linear Regression (unregularized, as a baseline), a Logistic Regression with an L1 (Lasso) penalty, and a Logistic Regression with an L2 (Ridge) penalty. For each approach you will extract the selected features with SelectFromModel, then measure accuracy using a Random Forest classifier on the pruned feature set.

Prerequisites: Python 3.x, scikit-learn, Pandas, NumPy, Seaborn.

Linear Regression and Coefficient-Based Feature Importance

Linear regression predicts a continuous target variable as a weighted sum of input features. The weight assigned to each feature is its coefficient, and the magnitude of that coefficient tells you how important the feature is.

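The model takes this form:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Where:

  • $y_i$: the target value for observation $i$
  • $\beta_0$: the intercept (the predicted value when all features are zero)
  • $\beta_1$: the regression coefficient (how much $y$ changes for a one-unit increase in $x$)
  • $x_i$: the predictor (input) feature for observation $i$
  • $\varepsilon_i$: the random error term (noise not captured by the model)

With several features, the sum extends to one coefficient per feature: $\beta_1 x_{i1} + \dots + \beta_p x_{ip}$.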
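
As a minimal sketch of the model described in this section, the snippet below fits a one-feature linear regression with NumPy least squares. The data is made up and noise-free, and the names `x`, `y`, `beta_0`, and `beta_1` are illustrative assumptions rather than part of the Titanic example.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# The least-squares solution is [intercept, slope]
(beta_0, beta_1), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"intercept={beta_0:.2f}, slope={beta_1:.2f}")
```

Because the data has no noise, the fit should recover an intercept close to 2 and a slope close to 3.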

Assumptions of Linear Regression

For coefficient magnitudes to reliably rank feature importance, the model's core assumptions should hold:

  • There is a linear relationship between each feature and the target $y$.
  • The residuals (errors) should be roughly normally (Gaussian) distributed; the features themselves do not need to be.
  • Features should not be highly correlated with each other (multicollinearity can distort coefficients).
  • Features should be on the same scale — if one feature is measured in thousands and another in fractions, their raw coefficients are not directly comparable.
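
The scaling point can be illustrated with a small synthetic sketch. The two features below (`income` and `ratio`) are hypothetical and constructed to contribute equally to the target, yet their raw coefficients differ by three orders of magnitude until the features are standardized.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Two hypothetical features that matter equally but live on very different scales
income = rng.normal(0, 1000, n)  # spread in the thousands
ratio = rng.normal(0, 1, n)      # spread around one
X = np.column_stack([income, ratio])
y = 0.001 * income + 1.0 * ratio

# Coefficients on the raw features
raw_coef = LinearRegression().fit(X, y).coef_

# Coefficients after standardizing each feature to zero mean, unit variance
X_scaled = StandardScaler().fit_transform(X)
scaled_coef = LinearRegression().fit(X_scaled, y).coef_

print("raw:   ", np.round(raw_coef, 3))
print("scaled:", np.round(scaled_coef, 3))
```

Ranking by the raw coefficients would wrongly suggest that `income` is nearly irrelevant; after standardization the two coefficients are of similar size.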

Lasso (L1) and Ridge (L2) Regularization

Without regularization, a linear model freely assigns large coefficients, which can lead to overfitting — the model memorizes the training data but performs poorly on new data. Regularization adds a penalty term to the loss function that discourages large coefficients, keeping the model simpler and more general.

There are two main forms, plus a hybrid of the two:

  • L1 regularization (Lasso): Penalizes the sum of the absolute values of the coefficients. This penalty can shrink weak coefficients all the way to exactly zero, effectively removing those features from the model. Lasso is therefore a built-in feature selection method.
  • L2 regularization (Ridge): Penalizes the sum of the squared values of the coefficients. This shrinks all coefficients toward zero but rarely reaches exactly zero, so all features are retained. You can still use coefficient magnitude to rank importance.
  • Elastic Net (L1 + L2 combined): A hybrid that blends both penalties, balancing sparsity and stability.
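
The three penalties can be compared side by side on synthetic data. This is a sketch under stated assumptions: the data, the `alpha` values, and `l1_ratio=0.5` are arbitrary illustrative choices, and scikit-learn calls the regularization strength `alpha`.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: only the first 3 of 10 features influence the target
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ true_coef + rng.normal(scale=0.5, size=200)

models = {
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Count how many coefficients each model sets to exactly zero
zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name:12s} exact zeros: {zero_counts[name]}")
```

Lasso should zero out several of the uninformative features, while Ridge leaves every coefficient non-zero.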

Before diving into the code, it is worth understanding the bias-variance tradeoff that regularization manages. The diagram below illustrates how total prediction error decomposes into bias and variance as model complexity changes:

Bias-variance tradeoff curve showing how MSE splits into bias and variance components across model complexity

Increasing model complexity reduces bias but increases variance. Regularization lets you tune this tradeoff by controlling the strength of the penalty with a parameter $\lambda$.

Lasso (L1) Regularization

The Lasso loss function is the standard residual sum of squares plus the L1 penalty:

$$L_{\text{lasso}}(\beta) = \text{RSS} + \lambda \sum_{j=1}^{p} |\beta_j|$$

Where:

  • $\text{RSS}$: residual sum of squares, the total squared difference between actual and predicted values
  • $\lambda$: regularization strength; a larger $\lambda$ applies a heavier penalty
  • $\beta_j$: the regression coefficient for feature $j$
  • $p$: the total number of features

Because the L1 penalty uses absolute values, its gradient has a constant magnitude, which can push coefficients to exactly zero. The chart below shows how Lasso coefficient paths evolve as the L1 regularization strength increases — most features are driven to zero at high penalty values:

Lasso L1 coefficient paths showing how feature coefficients shrink and reach zero as L1 regularization strength increases
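
The zeroing behavior is easiest to see for a single coefficient. Assuming a simplified one-coefficient problem, minimizing 0.5 * (beta - z)^2 + lam * |beta| for an unregularized estimate z has a closed-form solution known as soft-thresholding. The sketch below applies it to three made-up estimates; it illustrates the mechanism and is not how scikit-learn fits the models in this tutorial.

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5 * (beta - z)**2 + lam * abs(beta)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Hypothetical unregularized estimates for three coefficients
z = np.array([3.0, -1.0, 0.2])

# Increasing the penalty moves each estimate toward zero by lam,
# and clips it to exactly zero once lam exceeds its magnitude
for lam in [0.0, 0.5, 1.5, 4.0]:
    print(lam, soft_threshold(z, lam))
```

Each coefficient moves toward zero by the same fixed amount, so the smallest one reaches zero first and the largest one last.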

Choosing the Regularization Strength λ

The choice of $\lambda$ controls how aggressively the model penalizes complexity. You can observe its effect by plotting test error against different values of $\lambda$: error first falls as the model becomes less overfit, then rises again once the model is too constrained:

Test error (RMSE) vs lambda value curve showing a U-shaped relationship with an optimal lambda at the minimum

To find the optimal $\lambda$, split your data into three parts:

  • Training set: Fit the model and learn the regression coefficients with regularization applied.
  • Validation set: Evaluate model performance to choose the best $\lambda$. If accuracy is insufficient, adjust $\lambda$ and retrain.
  • Test set: Final evaluation of generalization error using the $\lambda$ selected on the validation set.

The diagram below shows which role each split plays:

Diagram of training, validation, and test set splits showing how each is used: training to fit coefficients, validation to select lambda, test for final evaluation
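
The three-way split can be sketched as follows. The synthetic data, the 60/20/20 proportions, and the candidate grid are assumptions made for illustration; scikit-learn's `alpha` plays the role of the regularization strength.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=300)

# 60% train, 20% validation, 20% test
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Fit on the training set, score each candidate on the validation set
candidates = [0.001, 0.01, 0.1, 1.0, 10.0]
val_errors = {}
for alpha in candidates:
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    val_errors[alpha] = mean_squared_error(y_val, model.predict(X_val))

best_alpha = min(val_errors, key=val_errors.get)

# Refit with the chosen strength and touch the test set only once
final_model = Lasso(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
print(f"best alpha={best_alpha}, test MSE={test_mse:.3f}")
```

In practice, cross-validated estimators such as `LassoCV` automate the same search; the explicit loop is shown here only to mirror the roles of the three splits.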

Ridge Regularization (L2)

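Ridge regression adds a penalty equal to the sum of the squared coefficient values:

$$L_{\text{ridge}}(\beta) = \text{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2$$

Where:

  • $\text{RSS}$: residual sum of squares
  • $\lambda$: regularization strength (tuning parameter)
  • $\beta_j$: the regression coefficient for feature $j$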

The gradient of the squared penalty is proportional to the coefficient itself, so the pull toward zero weakens as a coefficient shrinks and rarely lands it on exactly zero. Ridge therefore keeps all features in the model but makes their coefficients small. The diagram below names this quantity and highlights the role of $\lambda$ as a balance between fit quality and coefficient magnitude:

Ridge regression (L2 regularization) formula diagram showing RSS(w) plus lambda times the squared L2 norm of the coefficients

The effect of $\lambda$ on model behavior is straightforward: a large $\lambda$ produces high bias and low variance (all coefficients near zero); a small $\lambda$ produces low bias and high variance (coefficients grow freely):

Diagram showing effect of large vs small lambda on Ridge regression: large lambda gives high bias low variance, small lambda gives low bias high variance

As you increase L2 regularization, all coefficients shrink smoothly toward zero without crossing it. The chart below illustrates this gradual shrinkage across features:

Ridge L2 coefficient paths showing all feature coefficients shrinking smoothly toward zero as L2 regularization strength increases

The L2 regularization forces parameters to stay relatively small. The bigger the penalty, the smaller and more stable the coefficients become — but no feature is ever fully eliminated.
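
A short sketch can make the shrinkage concrete. The synthetic data and the three `alpha` values below are illustrative assumptions; the point is that the overall size of the coefficient vector falls as the penalty grows, while no coefficient becomes exactly zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.1, 10.0, 1000.0]
norms = []        # L2 norm of the coefficient vector per alpha
zero_counts = []  # number of exactly-zero coefficients per alpha

for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))
    zero_counts.append(int(np.sum(coef == 0)))
    print(f"alpha={alpha}: coef={np.round(coef, 3)}, exact zeros={zero_counts[-1]}")
```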

Difference Between L1 and L2 Regularization

| L1 Regularization | L2 Regularization |
|---|---|
| Penalizes the sum of the absolute values of the weights | Penalizes the sum of the squared weights |
| Produces a sparse solution | Produces a non-sparse solution |
| Can have multiple solutions | Has one solution |
| Has built-in feature selection | Has no built-in feature selection |
| Is robust to outliers | Is not robust to outliers |
| Generates models that are simple and interpretable but cannot learn complex patterns | Gives better predictions when the output variable is a function of all input features |

Feature Selection on the Titanic Dataset

With the theory in place, you can now apply both regularization methods to a real dataset. The goal is to predict Titanic passenger survival using a subset of features selected by regularized regression.

Start by importing all the libraries you need:

PYTHON
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score
PYTHON
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.feature_selection import SelectFromModel

Load the Titanic dataset from Seaborn and inspect the first few rows:

PYTHON
titanic = sns.load_dataset('titanic')
titanic.head()
OUTPUT
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   NaN  Southampton     no   True

Check how many missing values exist in each column:

PYTHON
titanic.isnull().sum()
OUTPUT
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

The age column has 177 missing values and deck has 688 — too many to impute reliably. Drop both columns and then remove any remaining rows with nulls:

PYTHON
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
titanic = titanic.dropna()
titanic.isnull().sum()
OUTPUT
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

Select the seven features you will use for prediction:

PYTHON
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
data.head()
OUTPUT
   pclass     sex  sibsp  parch  embarked    who  alone
0       3    male      1      0         S    man  False
1       1  female      1      0         C  woman  False
2       3  female      0      0         S  woman   True
3       1  female      1      0         S  woman  False
4       3    male      0      0         S    man   True

Confirm there are no missing values in the feature set:

PYTHON
data.isnull().sum()
OUTPUT
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64

The categorical columns must be encoded as integers before fitting a regression model. Encode sex, embarked, who, and alone:

PYTHON
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
data.head()
OUTPUT
   pclass  sex  sibsp  parch  embarked    who  alone
0       3    0      1      0         S    man  False
1       1    1      1      0         C  woman  False
2       3    1      0      0         S  woman   True
3       1    1      1      0         S  woman  False
4       3    0      0      0         S    man   True
PYTHON
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
data.head()
OUTPUT
   pclass  sex  sibsp  parch  embarked  who  alone
0       3    0      1      0         0    0      0
1       1    1      1      0         1    1      0
2       3    1      0      0         0    1      1
3       1    1      1      0         0    1      0
4       3    0      0      0         0    0      1
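
Integer codes such as S=0, C=1, Q=2 imply an ordering that a linear model will treat as numeric. A common alternative, not used in this tutorial, is one-hot encoding. The sketch below applies `pd.get_dummies` to a tiny hypothetical frame (`df`) that mirrors two of the categorical columns.

```python
import pandas as pd

# Tiny hypothetical frame mirroring two of the categorical columns
df = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],
    "who": ["man", "woman", "child", "woman"],
})

# One indicator column per category; drop_first removes one redundant
# column per variable so the indicators are not perfectly collinear
encoded = pd.get_dummies(df, columns=["embarked", "who"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With one-hot encoding each category gets its own coefficient, at the cost of more columns for the selector to consider.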

Separate features from the target label and check the resulting shapes:

PYTHON
X = data.copy()
y = titanic['survived']
X.shape, y.shape
OUTPUT
((889, 7), (889,))

Split the data into training (70%) and test (30%) sets:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 43)

Estimating Linear Regression Coefficients

SelectFromModel wraps any estimator and keeps only the features whose importance (here, the absolute coefficient) meets a threshold. By default the threshold is the mean importance, except for L1-penalized estimators, where it is 1e-5. Wrap a plain LinearRegression to use its coefficients as importance scores:

PYTHON
sel = SelectFromModel(LinearRegression())

Fit the selector on the training data:

PYTHON
sel.fit(X_train, y_train)
OUTPUT
SelectFromModel(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False),
        max_features=None, norm_order=1, prefit=False, threshold=None)

Call get_support() to see which of the seven features were selected (True) and which were dropped:

PYTHON
sel.get_support()
OUTPUT
array([ True,  True, False, False, False,  True, False])

Three features passed the threshold. Inspect the raw coefficients to understand why:

PYTHON
sel.estimator_.coef_
OUTPUT
array([-0.13750402,  0.26606466, -0.07470416, -0.0668525 ,  0.04793674,
        0.23857799, -0.12929595])

The mean absolute coefficient is the default threshold for SelectFromModel. Calculate it to confirm:

PYTHON
mean = np.mean(np.abs(sel.estimator_.coef_))
mean
OUTPUT
0.13727657291370804

Features with an absolute coefficient above this mean are kept. Compare each coefficient's absolute value against that threshold:

PYTHON
np.abs(sel.estimator_.coef_)
OUTPUT
array([0.13750402, 0.26606466, 0.07470416, 0.0668525 , 0.04793674,
       0.23857799, 0.12929595])

The three features above the threshold of 0.137 are pclass, sex, and who. Retrieve their names directly:

PYTHON
features = X_train.columns[sel.get_support()]
features
OUTPUT
Index(['pclass', 'sex', 'who'], dtype='object')
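
The manual check above can be condensed into a few lines. This sketch uses synthetic data (`X_demo`, `y_demo` are hypothetical names) so it runs on its own; it reads the fitted `threshold_` attribute and also shows that the rule can be changed, here to the median.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([2.0, -1.5, 0.1, 0.0, 0.05, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.3, size=200)

selector = SelectFromModel(LinearRegression()).fit(X_demo, y_demo)

# Rebuild the mask by hand: keep importances at or above the threshold
importance = np.abs(selector.estimator_.coef_)
manual_mask = importance >= selector.threshold_

print("threshold_ :", round(float(selector.threshold_), 4))
print("mean |coef|:", round(float(importance.mean()), 4))
print("same mask  :", np.array_equal(manual_mask, selector.get_support()))

# A different rule: keep features at or above the median importance
median_selector = SelectFromModel(LinearRegression(), threshold="median")
median_selector.fit(X_demo, y_demo)
print("median rule keeps", int(median_selector.get_support().sum()), "features")
```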

Apply the selector to reduce both training and test sets to the three chosen features:

PYTHON
X_train_reg = sel.transform(X_train)
X_test_reg = sel.transform(X_test)
X_test_reg.shape
OUTPUT
(267, 3)

Define a helper function that trains a RandomForestClassifier and prints its test accuracy. You will reuse it for every feature set, and the %%time cell magic reports the wall time:

PYTHON
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf = clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))

Benchmark accuracy on the three-feature subset selected by Linear Regression:

PYTHON
%%time
run_randomForest(X_train_reg, X_test_reg, y_train, y_test)
OUTPUT
Accuracy:  0.8239700374531835
Wall time: 250 ms

Compare against accuracy on the full seven-feature set to check whether pruning cost any accuracy:

PYTHON
%%time
run_randomForest(X_train, X_test, y_train, y_test)
OUTPUT
Accuracy:  0.8239700374531835
Wall time: 252 ms

The accuracy is identical (82.4%) on both the pruned and full feature sets. Confirm the original training size:

PYTHON
X_train.shape
OUTPUT
(622, 7)

On this split, the three selected features give the Random Forest the same accuracy as all seven, so the remaining four add no measurable value here.
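
A single train/test split can hide or exaggerate small differences in accuracy. As a hedged sketch, cross-validation averages over several splits. The example uses a synthetic stand-in (`X_demo`, `y_demo`) built with `make_classification`, not the Titanic features, and a smaller forest to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a feature matrix and binary target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Five folds: each row is used for testing exactly once
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(3))
print(f"mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Comparing the mean and spread of the fold scores for the full and pruned feature sets gives a steadier picture than one split.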

Logistic Regression Coefficients with L1 Regularization

Now apply Lasso regularization via LogisticRegression with penalty='l1'. The C parameter is the inverse of regularization strength — a smaller C applies a stronger penalty and drives more coefficients to zero:

PYTHON
sel = SelectFromModel(LogisticRegression(penalty = 'l1', C = 0.05, solver = 'liblinear'))
sel.fit(X_train, y_train)
sel.get_support()
OUTPUT
array([ True,  True,  True, False, False,  True, False])

Lasso selected four features this time. Inspect the raw coefficients and notice that three of them are exactly zero:

PYTHON
sel.estimator_.coef_
OUTPUT
array([[-0.54045394,  0.78039608, -0.14081954,  0.        ,  0.        ,
         0.94106713,  0.        ]])

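parch, embarked, and alone received zero coefficients and were dropped. Because the estimator uses an L1 penalty, SelectFromModel's default threshold is 1e-5 rather than the mean, so every feature with a non-zero coefficient is kept, including sibsp. Transform the feature sets: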

PYTHON
X_train_l1 = sel.transform(X_train)
X_test_l1 = sel.transform(X_test)

Measure accuracy on the Lasso-selected feature subset:

PYTHON
%%time
run_randomForest(X_train_l1, X_test_l1, y_train, y_test)
OUTPUT
Accuracy:  0.8277153558052435
Wall time: 251 ms

The L1-selected model reaches 82.8% accuracy, one more correct prediction out of 267 test rows than the unregularized Linear Regression selection, using one additional feature (sibsp).
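
To see how C controls sparsity, you can sweep it and count the surviving coefficients. This is a sketch on synthetic data (`X_demo`, `y_demo` are hypothetical), with an arbitrary grid of C values; the Titanic results above used C=0.05 only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Smaller C means a stronger L1 penalty and fewer non-zero coefficients
c_values = [0.001, 0.01, 0.1, 1.0, 10.0]
nonzero_counts = []
for c in c_values:
    model = LogisticRegression(penalty="l1", C=c, solver="liblinear")
    model.fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))
    print(f"C={c:<6} non-zero coefficients: {nonzero_counts[-1]}")
```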

Feature Selection with L2 Regularization

Apply Ridge regularization via LogisticRegression with penalty='l2' and the same C=0.05:

PYTHON
sel = SelectFromModel(LogisticRegression(penalty = 'l2', C = 0.05, solver = 'liblinear'))
sel.fit(X_train, y_train)
sel.get_support()
OUTPUT
array([ True,  True, False, False, False,  True, False])

Ridge selected the same three features as the unregularized baseline. Check the coefficients — unlike Lasso, none reach exactly zero:

PYTHON
sel.estimator_.coef_
OUTPUT
array([[-0.55749685,  0.85692344, -0.30436065, -0.11841967,  0.2435823 ,
         1.00124155, -0.29875898]])

All seven features have non-zero coefficients, but only three exceed the mean absolute threshold. Apply the transform:

PYTHON
X_train_l2 = sel.transform(X_train)
X_test_l2 = sel.transform(X_test)

Benchmark accuracy on the Ridge-selected feature subset:

PYTHON
%%time
run_randomForest(X_train_l2, X_test_l2, y_train, y_test)
OUTPUT
Accuracy:  0.8239700374531835
Wall time: 250 ms

The Ridge-selected model achieves the same 82.4% accuracy as the full feature set, which suggests that the three features (pclass, sex, who) capture most of the signal the Random Forest can use on this split.

Conclusion

In this tutorial you applied Linear Regression, Lasso (L1), and Ridge (L2) regularized Logistic Regression to the Titanic dataset using scikit-learn's SelectFromModel. All three methods converged on pclass, sex, and who as the most important features, achieving 82.4% accuracy — identical to training on all seven features. The Lasso model added sibsp and achieved a marginal 82.8% accuracy by explicitly zeroing out the weakest coefficients.

Key takeaways:

  • Regression coefficients give a direct measure of feature importance when features are on comparable scales: larger absolute coefficients contribute more to the prediction.
  • L1 (Lasso) regularization can push coefficients to exactly zero, performing automatic feature elimination once the regularization strength (C) is chosen.
  • L2 (Ridge) regularization shrinks all coefficients without eliminating any, so it is better used for ranking features than for hard selection.
  • SelectFromModel provides a clean, pipeline-compatible API to extract the features that pass a coefficient threshold from any regularized estimator.
  • Matching accuracy on a pruned subset suggests the removed features added little for this model and split; it does not by itself rule out low coefficients caused by scaling.
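
The pipeline compatibility mentioned above can be sketched as follows. The step names (`scale`, `select`, `forest`), the synthetic data, and the added `StandardScaler` step are assumptions for illustration; the tutorial itself ran the selector and classifier as separate steps without scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

l1_model = LogisticRegression(penalty="l1", C=0.05, solver="liblinear")
pipe = Pipeline([
    ("scale", StandardScaler()),          # put coefficients on a comparable scale
    ("select", SelectFromModel(l1_model)),  # keep features with non-zero coefficients
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Scaling and selection are learned from the training rows only
pipe.fit(X_tr, y_tr)

n_kept = int(pipe.named_steps["select"].get_support().sum())
accuracy = pipe.score(X_te, y_te)
print(f"kept {n_kept} of {X_tr.shape[1]} features, test accuracy={accuracy:.3f}")
```

Wrapping the steps in one pipeline keeps the selector from seeing test rows and lets the whole chain be cross-validated as a single estimator.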
