Feature Engineering: Linear Model Assumptions

When you train a linear regression or logistic regression model, the mathematics behind it rests on a set of assumptions about your input features. If those assumptions are violated, the model can produce inaccurate predictions, unstable coefficients, or misleading confidence intervals — even if the code runs without errors.

This tutorial walks you through each assumption, shows you how to detect violations visually and numerically, and demonstrates how log transformations can bring your features back into compliance. You will work with the Boston housing dataset from scikit-learn alongside a small simulated dataset that shows you exactly what "correct" plots look like.

Prerequisites: Python 3.x, Pandas, NumPy, Matplotlib, Seaborn, SciPy, scikit-learn, Yellowbrick.

What Linear Models Assume

A linear model assumes that the target $Y$ can be expressed as a weighted sum of the input features:

Y \approx β_{0} + β_{1} X_{1} + β_{2} X_{2} + \dots + β_{n} X_{n}

Where:

$Y$ — the target variable you are trying to predict
$X_{1}, X_{2}, \dots, X_{n}$ — the independent input features
$β_{0}$ — the intercept (the predicted value when all features are zero)
$β_{1}, β_{2}, \dots, β_{n}$ — the regression coefficients (how much each feature moves the prediction)

For this equation to produce reliable results, four conditions must hold:

Assumption	What it means	How to check	Fix if violated
Linearity	Each feature $X$ has a straight-line relationship with $Y$	Scatter plots, residual plots	Log, power, or exponential transformation; discretization
Normality	Each feature $X$ follows a Gaussian (bell-curve) distribution	Histograms, Q-Q plots	Log, Box-Cox, or Yeo-Johnson transformation
No multicollinearity	Input features are not strongly correlated with each other	Correlation matrix, Variance Inflation Factor	Drop one of the correlated features; apply PCA
Homoscedasticity	The model's prediction errors have roughly the same spread across all feature values	Residual plots	Log, Box-Cox, or Yeo-Johnson transformation

When one or more of these conditions fail, you have two options: apply a mathematical transformation to the offending feature, or switch to a model that does not share these assumptions (such as a tree-based method).

Setting Up the Environment

Start by importing every library you will need for this tutorial in one block:

PYTHON

import pandas as pd
import numpy as np

# for plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for the Q-Q plots
import scipy.stats as stats

# the dataset for the demo
from sklearn.datasets import load_boston

# for linear regression
from sklearn.linear_model import LinearRegression

# to split and standarize the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# to evaluate the regression model
from sklearn.metrics import mean_squared_error

Load the Boston housing dataset and build a Pandas DataFrame that holds both the features and the target column MEDV (median house value in $1000s):

PYTHON

boston_dataset = load_boston()

# create a dataframe with the independent variables
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)

# add the target
boston['MEDV'] = boston_dataset.target

boston.head()

OUTPUT

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

Retrieve the list of feature names — these 13 columns are the independent variables you will analyze throughout this tutorial:

PYTHON

features = boston_dataset.feature_names
features

OUTPUT

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

Print the full dataset description to familiarize yourself with what each column represents before continuing. Your goal is to predict MEDV — the median value of owner-occupied homes:

PYTHON

print(boston_dataset.DESCR)

OUTPUT

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Creating a Simulated Reference Dataset

Before examining the real data, you need a reference point — a small dataset that perfectly satisfies all assumptions. You will use these "ideal" plots to train your eye before looking at messier real-world data.

The code below generates 200 observations where x follows a standard normal distribution and y is a linear function of x plus a small amount of random noise:

PYTHON

np.random.seed(29) # for reproducibility

n = 200
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2

toy_df = pd.DataFrame([x, y]).T
toy_df.columns = ['x', 'y']
toy_df.head()

OUTPUT

	x	y
0	-0.417482	-1.271561
1	0.706032	7.990600
2	1.915985	19.848687
3	-2.141755	-21.928903
4	0.719057	5.579070

Linearity Assumption

The linearity assumption says that each feature's relationship to the target must be a straight line — not a curve. You check this with two tools: scatter plots and residual plots.

Scatter Plots

A scatter plot draws the feature on the horizontal axis and the target on the vertical axis. When linearity holds, the cloud of points should cluster tightly around a straight regression line. sns.lmplot() draws this scatter plot and fits a regression line in one call.

Start with the simulated data to see what a perfectly linear relationship looks like:

PYTHON

# order 1 indicates that we want seaborn to estimate a linear model (the line in the plot below) between x and y
sns.lmplot(x="x", y="y", data=toy_df, order=1)

plt.ylabel('Target')
plt.xlabel('Independent variable')

The scatter plot below shows points tightly aligned along the regression line — the ideal shape for a linear relationship:

Scatter plot of simulated data showing a tight linear relationship between the independent variable and the target

Now plot LSTAT (percentage of lower-status population) against MEDV from the Boston dataset:

PYTHON

sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)

The plot shows a clear downward trend: as LSTAT increases, house prices fall. The relationship is approximately linear, apart from a cluster of points at the low end of LSTAT:

Scatter plot of LSTAT vs MEDV from the Boston housing dataset showing a negative linear relationship

Next, plot RM (average number of rooms per dwelling) against MEDV:

PYTHON

sns.lmplot(x="RM", y="MEDV", data=boston, order=1)

The linear trend is present but weaker — many data points deviate from the regression line, especially at low and high room counts:

Scatter plot of RM vs MEDV from the Boston housing dataset showing a positive but noisy correlation

Now plot CRIM (per capita crime rate) against MEDV:

PYTHON

sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)

The relationship is clearly non-linear. Nearly all data is packed near zero crime rates, and the regression line does not capture this pattern well:

Scatter plot of CRIM vs MEDV from the Boston housing dataset showing a non-linear relationship driven by extreme crime-rate outliers

A log transformation compresses the long right tail of CRIM and can restore a linear relationship. Apply np.log() to CRIM and re-plot:

PYTHON

boston['log_crim'] = np.log(boston['CRIM'])

# plot the transformed CRIM variable vs MEDV
sns.lmplot(x="log_crim", y="MEDV", data=boston, order=1)

The transformed scatter plot shows a much more evenly distributed cloud of points around the regression line, confirming that the log transformation improved the linear fit:

Scatter plot of log-transformed CRIM vs MEDV showing improved linear fit after transformation

Drop the temporary column before continuing:

PYTHON

# let's drop the added log transformed variable we don't need it for the rest of the demo

boston.drop(labels='log_crim', inplace=True, axis=1)

Residual Plots

A second way to check linearity is to look at the residuals — the differences between the model's predictions and the true target values.

The four-step process is:

Fit a LinearRegression model using a single feature
Generate predictions on the training data
Calculate the residual for each observation: $residual = y_{true} - y_{predicted}$
Plot the residuals against the feature values and check that they scatter randomly around zero

When linearity holds, residuals should look like random noise centered at zero. A curved or funnel-shaped residual pattern signals a violated assumption.

Start with the simulated data. Fit a linear model, generate predictions, and compute residuals:

PYTHON

# SIMULATED DATA

# step 1: make a linear model
# call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(toy_df['x'].to_frame(), toy_df['y'])

# step 2: obtain the predictions
# make the predictions
pred = linreg.predict(toy_df['x'].to_frame())

# step 3: calculate the residuals
error = toy_df['y'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=toy_df['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')

The predicted values align closely with the real target values, confirming a good linear fit on the simulated data:

Scatter plot of predictions vs real values for the simulated dataset showing a near-perfect diagonal alignment

Now plot the residuals against the independent variable x to verify that errors are randomly distributed around zero:

PYTHON

# step 4: observe the distribution of the errors

plt.scatter(y=error, x=toy_df['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')

The residuals are evenly scattered above and below zero across all values of x — exactly what you want to see:

Residuals vs independent variable x for the simulated dataset, showing random scatter centered around zero

A histogram of those same residuals should show a Gaussian bell curve centered at zero. Plot it with sns.distplot():

PYTHON

sns.distplot(error, bins=30)
plt.xlabel('Residuals')

The residuals form a symmetric bell curve centered at zero, confirming the linear model assumption is fully satisfied on the simulated data:

Histogram of residuals for the simulated dataset showing a symmetric Gaussian distribution centered at zero

Now repeat the same process for LSTAT from the Boston dataset. Fit the model, generate predictions, and plot them against the true MEDV values:

PYTHON

# step 1: call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])

# step 2: make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())

# step 3: calculate the residuals
error = boston['MEDV'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')

The model fits reasonably well in the middle range but consistently under-predicts the highest house prices — a sign that the linear relationship is imperfect:

Scatter plot of predictions vs actual MEDV values using LSTAT as the sole feature, showing under-prediction at high price ranges

Plot the residuals against LSTAT to see where the errors concentrate:

PYTHON

# step 4: observe the distribution of the errors

plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')

The residuals fan out at low and high values of LSTAT instead of staying flat around zero. This non-constant spread is called heteroscedasticity — a violation of the homoscedasticity assumption:

Residuals vs LSTAT plot showing larger errors at low and high LSTAT values, indicating non-constant variance

Now fit the model using log(LSTAT) instead and compare the predictions:

PYTHON

# step 1: call the linear model from sklearn
linreg = LinearRegression()

# fit the model
linreg.fit(np.log(boston['LSTAT']).to_frame(), boston['MEDV'])

# staep 2: make the predictions
pred = linreg.predict(np.log(boston['LSTAT']).to_frame())

# step 3: calculate the residuals
error = boston['MEDV'] - pred

# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')

The predictions track the true values more evenly across the full price range after the transformation:

Scatter plot of predictions vs actual MEDV using log-transformed LSTAT, showing a tighter and more consistent fit

Plot the residuals after the transformation:

PYTHON

# step 4: observe the distribution of the errors

plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')

The residuals are more centered around zero and spread more evenly across all LSTAT values compared to the untransformed model:

Residuals vs LSTAT after log transformation showing improved centering and more homogeneous spread around zero

You can see how the log transformation improved linearity and reduced the error pattern. Try the same experiment on RM and CRIM to see whether transformation helps those features too.

Multicollinearity

Multicollinearity means that two or more input features are strongly correlated with each other. When this happens, the model cannot reliably separate the individual effect of each feature — coefficients become unstable and the model's interpretability breaks down.

You detect multicollinearity by computing a correlation matrix: a table that shows the pairwise correlation between every pair of features, where values near +1 or -1 indicate a strong relationship.

Compute the correlation matrix for the Boston features and display it as a heatmap — a color-coded grid where darker shades represent stronger correlations:

PYTHON

correlation_matrix = boston[features].corr().round(2)
figure = plt.figure(figsize=(12, 12))
# annot = True to print the correlation values inside the squares
sns.heatmap(data=correlation_matrix, annot=True)

Each cell in the heatmap shows the correlation between the two features labeling its row and column. Strong positive correlations appear in a warm dark color and strong negative correlations in a cool dark color:

Correlation heatmap of all 13 Boston housing features, with each cell showing the pairwise correlation value

The heatmap reveals two pairs of highly correlated features. RAD (highway accessibility) and TAX (property tax rate) have a correlation of 0.91 — nearly perfectly correlated. NOX (nitric oxide concentration) and DIS (distance to employment centers) have a correlation of -0.71.

Plot the RAD–TAX pair in a scatter plot to see the collinearity visually:

PYTHON

sns.lmplot(x="RAD", y="TAX", data=boston, order=1)

The points follow the regression line closely, confirming the strong linear relationship between highway accessibility and property taxes:

Scatter plot of RAD vs TAX showing strong positive collinearity between the two features

Now plot NOX vs DIS:

PYTHON

sns.lmplot(x="NOX", y="DIS", data=boston, order=1)

The strong negative correlation between nitric oxide concentration and distance to employment centers is clearly visible — areas farther from job centers tend to have lower pollution:

Scatter plot of NOX vs DIS showing strong negative correlation between the two features

Both RAD–TAX and NOX–DIS violate the no-multicollinearity assumption. Before training a linear model, you would drop one feature from each correlated pair (or apply PCA to replace both with an uncorrelated component).

Normality

The normality assumption says that each input feature should follow a Gaussian distribution — the familiar symmetric bell curve. You check this with histograms and Q-Q plots (quantile–quantile plots).

Histograms

A histogram bins feature values and plots their frequency. A normally distributed feature produces a symmetric bell shape centered around the mean. Skewed or multi-modal shapes signal a violation.

Plot the simulated variable x, which you know is normally distributed by construction:

PYTHON

sns.distplot(toy_df['x'], bins=30)

The histogram shows the classic symmetric bell curve — your reference for what a normally distributed feature looks like:

Histogram of the simulated variable x showing a symmetric Gaussian bell curve

Now plot RM (average rooms per dwelling) from the Boston dataset:

PYTHON

sns.distplot(boston['RM'], bins=30)

RM shows a roughly normal distribution centered around six rooms — it passes the normality check:

Histogram of the RM feature from the Boston housing dataset showing an approximately normal distribution centered around six rooms

Plot LSTAT (percentage lower-status population):

PYTHON

sns.distplot(boston['LSTAT'], bins=30)

LSTAT is visibly right-skewed — most values are low, but a long tail stretches toward higher values. This violates the normality assumption:

Histogram of the LSTAT feature showing a right-skewed distribution with a long tail toward higher values

Apply a log transformation to LSTAT to compress the tail and compare:

PYTHON

# histogram of the log-transformed LSTAT for comparison
sns.distplot(np.log(boston['LSTAT']), bins=30)

After the log transformation, the distribution is less skewed and closer to a bell shape, though still not perfectly Gaussian:

Histogram of log-transformed LSTAT showing a reduced skew and a more symmetric distribution

Q-Q Plots

A Q-Q plot (quantile–quantile plot) is a more precise normality diagnostic. It plots the quantiles of your actual data (y-axis) against the theoretical quantiles of a perfect Gaussian distribution (x-axis). When the data is normally distributed, the points form a straight 45-degree line. Departures from that line reveal where the distribution diverges from normality.

Plot the Q-Q plot for the simulated variable x:

PYTHON

stats.probplot(toy_df['x'], dist="norm", plot=plt)
plt.show()

Every point lies on the 45-degree reference line — the ideal outcome for a normally distributed variable:

Q-Q plot of the simulated variable x showing nearly perfect alignment with the theoretical Gaussian quantiles along the 45-degree line

Now check RM:

PYTHON

stats.probplot(boston['RM'], dist="norm", plot=plt)
plt.show()

The center of the distribution follows the line well, but the tails deviate slightly — RM is close to normal but not perfect:

Q-Q plot of the RM feature showing good alignment in the center but slight deviation at both tails

Check LSTAT:

PYTHON

stats.probplot(boston['LSTAT'], dist="norm", plot=plt)
plt.show()

The points curve away from the reference line at both ends, confirming that LSTAT is not normally distributed:

Q-Q plot of the LSTAT feature showing significant curvature away from the reference line, indicating non-normality

Apply the log transformation to LSTAT and re-plot:

PYTHON

# and now for the log transformed LSTAT
stats.probplot(np.log(boston['LSTAT']), dist="norm", plot=plt)
plt.show()

After the transformation, the points track the reference line much more closely across the full range of quantiles:

Q-Q plot of log-transformed LSTAT showing improved alignment with the Gaussian reference line after transformation

For comparison, check CRIM, which you already know has an extreme right skew:

PYTHON

stats.probplot(boston['CRIM'], dist="norm", plot=plt)
plt.show()

The Q-Q plot shows a severe upward curve at the high end — CRIM is extremely non-normal:

Q-Q plot of the CRIM feature showing extreme departure from the Gaussian reference line due to a very heavy right tail

Apply the log transformation to CRIM:

PYTHON

stats.probplot(np.log(boston['CRIM']), dist="norm", plot=plt)
plt.show()

The transformation improves normality noticeably, though the distribution still deviates from the reference line — you might experiment with a Box-Cox or Yeo-Johnson transformation for a better result:

Q-Q plot of log-transformed CRIM showing improved but still imperfect alignment with the Gaussian reference line

Homoscedasticity

Homoscedasticity — also called homogeneity of variance — means that the model's prediction errors have the same spread regardless of the feature value. When errors grow larger (or smaller) as the feature increases, the data is heteroscedastic, which makes regression coefficients less reliable.

To check homoscedasticity, you fit a model using several features together, compute the residuals on the test set, and plot those residuals against each feature. Ideally the residual cloud stays at the same width (same variance) across all feature values.

Baseline Model on Raw Features

Split the data into training (70%) and test (30%) sets, using RM, LSTAT, and CRIM as features:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(
    boston[['RM', 'LSTAT', 'CRIM']],
    boston['MEDV'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

OUTPUT

((354, 3), (152, 3), (354,), (152,))

Scale the features with StandardScaler, which converts each feature to zero mean and unit variance. The standardization formula is:

z = \frac{x - u}{s}

Where:

$z$ — the standardized score (the scaled output)
$x$ — the original raw feature value
$u$ — the mean of the feature across the training set
$s$ — the standard deviation of the feature across the training set

PYTHON

scaler = StandardScaler()
scaler.fit(X_train)

PYTHON

StandardScaler(copy=True, with_mean=True, with_std=True)

Fit the linear regression model on the scaled training data and evaluate its mean squared error on both sets:

PYTHON

# call the model
linreg = LinearRegression()

# train the model
linreg.fit(scaler.transform(X_train), y_train)

# make predictions on the train set and calculate the mean squared error
print('Train set')
pred = linreg.predict(scaler.transform(X_train))
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))

# make predictions on the test set and calculate the mean squared error
print('Test set')
pred = linreg.predict(scaler.transform(X_test))
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
print()

OUTPUT

Train set
Linear Regression mse: 28.603232128198893
Test set
Linear Regression mse: 33.20006295308442

The test MSE is 33.2 — your baseline to beat after applying transformations. Now compute the residuals and check for homoscedasticity. Plot residuals against LSTAT:

PYTHON

# calculate the residuals
error = y_test - pred

# plot the residuals vs LSTAT
plt.scatter(x=X_test['LSTAT'], y=error)
plt.xlabel('LSTAT')
plt.ylabel('Residuals')

The residuals for LSTAT are reasonably spread around zero across most of the feature range:

Residuals vs LSTAT plot for the baseline model showing roughly homogeneous spread around zero for the test set

Plot residuals against RM:

PYTHON

plt.scatter(x=X_test['RM'], y=error)
plt.xlabel('RM')
plt.ylabel('Residuals')

The residuals for RM are not homogeneous — low and high room counts produce larger errors than the middle range, indicating heteroscedasticity:

Residuals vs RM plot for the baseline model showing larger errors at low and high values of RM

Plot residuals against CRIM:

PYTHON

plt.scatter(x=X_test['CRIM'], y=error)
plt.xlabel('CRIM')
plt.ylabel('Residuals')

PYTHON

Text(0, 0.5, 'Residuals')

The residuals for CRIM are heavily skewed — most values cluster at low crime rates, leaving a very long right tail with few observations and large errors:

Residuals vs CRIM plot for the baseline model showing heavily skewed distribution with most data points clustered near zero crime rate

Now use the Yellowbrick ResidualsPlot visualizer for a consolidated view of model residuals. To install the library, run:

pip install yellowbrick

ResidualsPlot shows the residuals for both training and test data on one chart, with the residual distribution histogram on the right side:

PYTHON

from yellowbrick.regressor import ResidualsPlot

linreg = LinearRegression()
linreg.fit(scaler.transform(X_train), y_train)
visualizer = ResidualsPlot(linreg)

visualizer.fit(scaler.transform(X_train), y_train)  # Fit the training data to the model
visualizer.score(scaler.transform(X_test), y_test)  # Evaluate the model on the test data
visualizer.poof()

The residuals fan out at higher predicted values and are not centered around zero, confirming heteroscedasticity. The test R² is 0.60:

Yellowbrick ResidualsPlot for the baseline linear regression model showing non-homogeneous residuals and a test R-squared of 0.60

Improved Model on Log-Transformed Features

Apply a log transformation to all three features — LSTAT, CRIM, and RM — and rebuild the model:

PYTHON

# log transform the variables
boston['LSTAT'] = np.log(boston['LSTAT'])
boston['CRIM'] = np.log(boston['CRIM'])
boston['RM'] = np.log(boston['RM'])

# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    boston[['RM', 'LSTAT', 'CRIM']],
    boston['MEDV'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

OUTPUT

((354, 3), (152, 3), (354,), (152,))

Re-fit the scaler on the new transformed training data:

PYTHON

# let's scale the features
scaler = StandardScaler()
scaler.fit(X_train)

PYTHON

StandardScaler(copy=True, with_mean=True, with_std=True)

Fit the model on the transformed and scaled features and evaluate the MSE:

PYTHON

# building the model

# call the model
linreg = LinearRegression()

# fit the model
linreg.fit(scaler.transform(X_train), y_train)

# make predictions and calculate the mean squared error over the train set
print('Train set')
pred = linreg.predict(scaler.transform(X_train))
print('Linear Regression mse: {}'.format(mean_squared_error(y_train, pred)))

# make predictions and calculate the mean squared error over the test set
print('Test set')
pred = linreg.predict(scaler.transform(X_test))
print('Linear Regression mse: {}'.format(mean_squared_error(y_test, pred)))
print()

OUTPUT

Train set
Linear Regression mse: 24.36853232810096
Test set
Linear Regression mse: 29.516553315892253

The test MSE dropped from 33.2 to 29.5 — a clear improvement from the transformations alone. Now check whether the residuals are more homoscedastic. Plot residuals vs LSTAT:

PYTHON

# calculate the residuals
error = y_test - pred

# residuals plot vs LSTAT
plt.scatter(x=X_test['LSTAT'], y=error)
plt.xlabel('LSTAT')
plt.ylabel('Residuals')

After transformation, the residuals for LSTAT are centered around zero and spread evenly across the full range of feature values:

Residuals vs LSTAT after log transformation showing improved centering and uniform spread across all LSTAT values

Plot residuals vs RM:

PYTHON

# residuals plot vs RM
plt.scatter(x=X_test['RM'], y=error)
plt.xlabel('RM')
plt.ylabel('Residuals')

The spread of the residuals is more uniform across the RM range compared to the baseline model:

Residuals vs RM after log transformation showing a more uniform spread of residuals across all RM values

Plot residuals vs CRIM:

PYTHON

# residuals plot vs CRIM
plt.scatter(x=X_test['CRIM'], y=error)
plt.xlabel('CRIM')
plt.ylabel('Residuals')

The log transformation eliminated the extreme skew in CRIM's residuals — they are now spread across a much wider and more balanced range of values:

Residuals vs CRIM after log transformation showing the skew removed and residuals distributed more evenly

Run the Yellowbrick residuals plot on the transformed model for a final consolidated view:

PYTHON

linreg = LinearRegression()
linreg.fit(scaler.transform(X_train), y_train)
visualizer = ResidualsPlot(linreg)

visualizer.fit(scaler.transform(X_train), y_train)  # Fit the training data to the model
visualizer.score(scaler.transform(X_test), y_test)  # Evaluate the model on the test data
visualizer.poof()

The residuals are substantially more homogeneous after transformation. Comparing R² values: the transformed model achieves a test R² of 0.65, up from 0.60 for the baseline — a meaningful gain from feature engineering alone.

Conclusion

In this tutorial you validated all four linear model assumptions on the Boston housing dataset and demonstrated how log transformations address violations. Starting with raw features, the model achieved a test MSE of 33.2 and R² of 0.60. After applying log transformations to LSTAT, CRIM, and RM, the test MSE dropped to 29.5 and R² improved to 0.65 — without changing the model architecture at all.

Key takeaways:

Scatter plots and residual plots reveal whether the linearity assumption holds for each feature. A curved or funnel-shaped residual pattern is a clear warning sign.
Q-Q plots are more sensitive than histograms for detecting non-normality, especially in the tails of the distribution.
A correlation above roughly 0.8 between two features signals multicollinearity — keep one and drop the other before training a linear model.
Log transformation is the most practical first remedy for right-skewed features: it compresses the long tail, improves linearity, and often reduces heteroscedasticity simultaneously.
Feature engineering at the assumption-checking stage — before any hyperparameter tuning — can produce measurable improvements in model performance.

Next steps:

Read Feature Engineering: Handling Outliers to learn how extreme values affect these same assumption checks and how to treat them before modeling.
Explore Feature Engineering: Variable Magnitude to understand how feature scaling choices interact with the assumptions covered here.
Study Lasso and Ridge Regularisation for Feature Selection to see how regularization handles multicollinearity in practice when you cannot simply drop features.

Feature Engineering: Linear Model Assumptions

Topics You Will Master

What Linear Models Assume

Setting Up the Environment

Creating a Simulated Reference Dataset

Linearity Assumption

Scatter Plots

Residual Plots

Multicollinearity

Normality

Histograms

Q-Q Plots

Homoscedasticity

Baseline Model on Raw Features

Improved Model on Log-Transformed Features

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments