#linear regression#scikit-learn#python#machine learning#regression#boston dataset

Linear Regression with Python

Learn how linear regression works and how to implement it with scikit-learn on the Boston housing dataset. Covers simple and multiple regression, feature selection by correlation, and model evaluation with R², MAE, and MSE.

May 17, 2026 at 9:00 PM19 min readFollowFollow (Hindi)

Topics You Will Master

What linear regression is and when to use it over other supervised methods
How simple and multiple linear regression differ mathematically
How to load and explore the Boston housing dataset with Pandas
How to select features using Pearson correlation thresholds
How to fit a LinearRegression model and generate predictions with scikit-learn
How to evaluate a regression model using R², MAE, and MSE
How to plot learning curves to diagnose underfitting and overfitting
Best For

Python developers and data scientists who understand supervised learning basics and want to build their first regression model from data exploration through to evaluation.

Expected Outcome

A trained linear regression model predicting Boston house prices from correlated features, evaluated with R², MAE, and MSE, with feature-selection experiments you can interpret and extend.

Linear regression is one of the oldest and most widely used predictive techniques in statistics and machine learning. At its core, it answers a simple question: given a set of measurable inputs, what continuous output value does the data suggest? Unlike classification, which predicts categories, regression predicts a number — for example, the selling price of a house.

In this tutorial you will build a regression pipeline on the Boston housing dataset. You will explore the data, select features by correlation, train a LinearRegression model with scikit-learn, and evaluate it using three standard metrics. You will also plot learning curves to understand how the model generalises as more training data is added.

Prerequisites: Python 3.x, scikit-learn, Pandas, NumPy, Matplotlib, Seaborn.

What is Regression?

Regression is a statistical method that attempts to determine the strength and character of the relationship between one dependent variable (the value you want to predict, usually written as ) and one or more independent variables (the inputs, written as ).

The diagram below shows the core idea: raw data goes into a regression model, which learns the relationship and produces a continuous prediction.

Regression overview diagram showing data flowing into a regression model to produce a continuous prediction

Where Regression Applies

Linear regression is used across many domains because it is fast to train, easy to interpret, and often gives a strong baseline. A few practical examples follow.

Stock price prediction uses historical prices and news events as input features to forecast future stock values.

Stock prediction slide showing that price y depends on recent history, news events, and related commodities

Tweet popularity prediction estimates how many people will retweet a post based on the number of followers, hashtag popularity, and text features.

Tweet popularity slide showing that retweet count depends on follower count, hashtag, and text features

House price prediction uses property attributes such as number of rooms, distance to employment centres, and neighbourhood crime rate to estimate sale price — the exact task you will work through in this tutorial.

House price prediction case study diagram showing regression pipeline from house attributes to predicted price

Simple and Multiple Linear Regression

Linear regression comes in two forms depending on how many input features you use.

Simple Linear Regression

Simple linear regression uses a single independent variable to predict the dependent variable. The relationship is described by a straight line:

Where:

  • — the dependent variable (house price)
  • — the intercept (baseline value when )
  • — the coefficient for the independent variable
  • — the independent variable (e.g. number of rooms)
  • — the random error term

Multiple Linear Regression

Multiple linear regression extends the formula to more than one independent variable, allowing the model to capture more complex patterns in the data. Instead of fitting a line in 2D space, you fit a hyperplane in higher-dimensional space. When you add too many features, you can also fit more complex polynomial relationships.

The diagram below illustrates how adding a second input dimension (e.g. number of bathrooms alongside house size) lifts the problem from a 2D line to a 3D plane fit.

Multiple linear regression diagram showing a 3D hyperplane fit over house size and number of bathrooms inputs

Theory: Best-Fit Line, Cost Function, and Gradient Descent

The Best-Fit Line and Residuals

The best-fit line (also called the regression line) is the line for which the error between predicted values and observed values is minimised. These errors are called residuals — the vertical distances between each observed data point and the line.

The Cost Function

To find the best-fit line algorithmically, you minimise a cost function — a measure of how wrong your current parameter estimates are. The standard choice is Mean Squared Error (MSE), which penalises large errors more heavily than small ones.

The diagram below shows the hypothesis function, the MSE cost function , and the optimisation goal:

Hypothesis, cost function, and gradient descent goal diagram showing the linear model and MSE formula with theta parameters

Gradient Descent

Gradient descent is the optimisation algorithm used to minimise the cost function. Starting from random parameter values for and , gradient descent iteratively takes steps in the direction of steepest descent on the cost surface until it reaches the minimum. The step size is controlled by the learning rate — too large and the algorithm overshoots; too small and training is very slow.

The diagram below shows both the goodness-of-fit metric being defined for each candidate line and the gradient descent surface being navigated to reach optimal parameters:

Gradient descent diagram showing the cost surface over intercept and slope, with the algorithm finding the minimum

Bias-Variance Tradeoff

Every supervised model must balance two sources of error. Bias refers to simplifying assumptions a model makes to learn the target function — high bias produces underfitting. Variance refers to how sensitive the model is to fluctuations in the training data — high variance produces overfitting. The goal is to reach the sweet spot where both are low.

The diagram below shows the classic bias-variance tradeoff curve where MSE = bias² + variance, and the optimal model complexity sits at the "sweet spot":

Bias-variance tradeoff diagram showing MSE decomposed into bias squared and variance curves as model complexity increases

Loading the Boston Dataset

Start by importing all libraries you will need for this tutorial. Keeping all imports in one block at the top makes dependencies immediately visible.

PYTHON
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

Load the dataset and inspect its structure:

PYTHON
boston = load_boston()
type(boston)
sklearn.utils.Bunch
boston.keys()
PYTHON
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Print the full dataset description to understand what each feature represents:

PYTHON
print(boston.DESCR)
OUTPUT
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

Inspect the feature names and the first five target values (median house prices in thousands of dollars):

PYTHON
boston.feature_names
OUTPUT
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
PYTHON
boston.target[: 5]
OUTPUT
array([24. , 21.6, 34.7, 33.4, 36.2])

Inspect the raw array shape to confirm 506 samples and 13 features:

PYTHON
data = boston.data
type(data)
numpy.ndarray
data.shape
OUTPUT
(506, 13)

Building the DataFrame

A DataFrame is a two-dimensional tabular data structure provided by Pandas. Converting the NumPy array into a DataFrame gives you labelled columns and makes exploration much easier.

PYTHON
data = pd.DataFrame(data = data, columns= boston.feature_names)
data.head()
OUTPUT
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTAT
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.98
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.14
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.03
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.94
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.33

Add the target (house price) as a new column so all data lives in one place:

PYTHON
data['Price'] = boston.target
data.head()
OUTPUT
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATPrice
00.0063218.02.310.00.5386.57565.24.09001.0296.015.3396.904.9824.0
10.027310.07.070.00.4696.42178.94.96712.0242.017.8396.909.1421.6
20.027290.07.070.00.4697.18561.14.96712.0242.017.8392.834.0334.7
30.032370.02.180.00.4586.99845.86.06223.0222.018.7394.632.9433.4
40.069050.02.180.00.4587.14754.26.06223.0222.018.7396.905.3336.2

Use describe() to get a statistical summary of every feature including count, mean, standard deviation, and percentiles:

PYTHON
data.describe()
OUTPUT
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATPrice
count506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000506.000000
mean3.61352411.36363611.1367790.0691700.5546956.28463468.5749013.7950439.549407408.23715418.455534356.67403212.65306322.532806
std8.60154523.3224536.8603530.2539940.1158780.70261728.1488612.1057108.707259168.5371162.16494691.2948647.1410629.197104
min0.0063200.0000000.4600000.0000000.3850003.5610002.9000001.1296001.000000187.00000012.6000000.3200001.7300005.000000
25%0.0820450.0000005.1900000.0000000.4490005.88550045.0250002.1001754.000000279.00000017.400000375.3775006.95000017.025000
50%0.2565100.0000009.6900000.0000000.5380006.20850077.5000003.2074505.000000330.00000019.050000391.44000011.36000021.200000
75%3.67708312.50000018.1000000.0000000.6240006.62350094.0750005.18842524.000000666.00000020.200000396.22500016.95500025.000000
max88.976200100.00000027.7400001.0000000.8710008.780000100.00000012.12650024.000000711.00000022.000000396.90000037.97000050.000000

data.info() gives a concise structural summary — column types, non-null counts, and memory usage — useful for spotting missing values quickly:

PYTHON
data.info()
PYTHON
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  Price    506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB

The dataset has no missing values, which you can confirm explicitly:

PYTHON
data.isnull().sum()
OUTPUT
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
Price      0
dtype: int64

Exploratory Data Analysis

Before selecting features, you should understand how they relate to each other and to the target. A scatterplot matrix (pairplot) lets you see all pairwise relationships and identify obvious linear correlations at a glance.

PYTHON
sns.pairplot(data)
plt.show()

The pairplot below shows every feature plotted against every other feature. Diagonal cells show the marginal distribution of each variable:

Pairplot scatter matrix showing pairwise relationships between all 14 Boston dataset features including the Price target

You can also inspect the marginal distribution of each individual feature with distribution plots to spot skew and outliers:

PYTHON
rows = 2
cols = 7
fig, ax = plt.subplots(nrows= rows, ncols= cols, figsize = (16,4))
col = data.columns
index = 0
for i in range(rows):
    for j in range(cols):
        sns.distplot(data[col[index]], ax = ax[i][j])
        index = index + 1
plt.tight_layout()
plt.show()

The grid of distribution plots below shows the density of each feature — several are heavily right-skewed (e.g. CRIM, ZN), while RM and Price are roughly bell-shaped:

Grid of 14 distribution plots for every Boston dataset feature, showing the density shape and spread of each variable

Correlation Matrix

A correlation matrix quantifies the pairwise linear relationships between all variables using Pearson's r — a coefficient that ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation). You will use this matrix to select which features to feed into the model.

PYTHON
corrmat = data.corr()
corrmat
OUTPUT
CRIMZNINDUSCHASNOXRMAGEDISRADTAXPTRATIOBLSTATPrice
CRIM1.000000-0.2004690.406583-0.0558920.420972-0.2192470.352734-0.3796700.6255050.5827640.289946-0.3850640.455621-0.388305
ZN-0.2004691.000000-0.533828-0.042697-0.5166040.311991-0.5695370.664408-0.311948-0.314563-0.3916790.175520-0.4129950.360445
INDUS0.406583-0.5338281.0000000.0629380.763651-0.3916760.644779-0.7080270.5951290.7207600.383248-0.3569770.603800-0.483725
CHAS-0.055892-0.0426970.0629381.0000000.0912030.0912510.086518-0.099176-0.007368-0.035587-0.1215150.048788-0.0539290.175260
NOX0.420972-0.5166040.7636510.0912031.000000-0.3021880.731470-0.7692300.6114410.6680230.188933-0.3800510.590879-0.427321
RM-0.2192470.311991-0.3916760.091251-0.3021881.000000-0.2402650.205246-0.209847-0.292048-0.3555010.128069-0.6138080.695360
AGE0.352734-0.5695370.6447790.0865180.731470-0.2402651.000000-0.7478810.4560220.5064560.261515-0.2735340.602339-0.376955
DIS-0.3796700.664408-0.708027-0.099176-0.7692300.205246-0.7478811.000000-0.494588-0.534432-0.2324710.291512-0.4969960.249929
RAD0.625505-0.3119480.595129-0.0073680.611441-0.2098470.456022-0.4945881.0000000.9102280.464741-0.4444130.488676-0.381626
TAX0.582764-0.3145630.720760-0.0355870.668023-0.2920480.506456-0.5344320.9102281.0000000.460853-0.4418080.543993-0.468536
PTRATIO0.289946-0.3916790.383248-0.1215150.188933-0.3555010.261515-0.2324710.4647410.4608531.000000-0.1773830.374044-0.507787
B-0.3850640.175520-0.3569770.048788-0.3800510.128069-0.2735340.291512-0.444413-0.441808-0.1773831.000000-0.3660870.333461
LSTAT0.455621-0.4129950.603800-0.0539290.590879-0.6138080.602339-0.4969960.4886760.5439930.374044-0.3660871.000000-0.737663
Price-0.3883050.360445-0.4837250.175260-0.4273210.695360-0.3769550.249929-0.381626-0.468536-0.5077870.333461-0.7376631.000000

A heatmap makes it much easier to spot high-correlation pairs visually. Warm colours indicate positive correlation and cool colours indicate negative correlation. Run the block below to render the annotated heatmap:

PYTHON
fig, ax = plt.subplots(figsize = (18, 10))
sns.heatmap(corrmat, annot = True, annot_kws={'size': 12})
plt.show()

The output heatmap shows the full Pearson correlation matrix for all 14 features. Notice that RM (average number of rooms) has the strongest positive correlation with Price (0.70) and LSTAT (lower-status population percentage) has the strongest negative correlation (−0.74).

Feature Selection by Correlation Threshold

Rather than using all 13 features, you will select only those with an absolute Pearson correlation above a chosen threshold against the target price. The helper function below returns a DataFrame of features that pass the threshold:

PYTHON
corrmat.index.values
OUTPUT
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT', 'Price'], dtype=object)
PYTHON
def getCorrelatedFeature(corrdata, threshold):
    feature = []
    value = []

    for i, index in enumerate(corrdata.index):
        if abs(corrdata[index])> threshold:
            feature.append(index)
            value.append(corrdata[index])

    df = pd.DataFrame(data = value, index = feature, columns=['Corr Value'])
    return df

Apply the function at a threshold of 0.50 to find features with a moderate-to-strong relationship with Price:

PYTHON
threshold = 0.50
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
OUTPUT
Corr Value
RM0.695360
PTRATIO-0.507787
LSTAT-0.737663
Price1.000000

Three features pass the 0.50 threshold: RM (positive), PTRATIO (negative), and LSTAT (negative). Retrieve their names as an array:

PYTHON
corr_value.index.values
OUTPUT
array(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype=object)

Subset the DataFrame to these correlated features and inspect the first few rows:

PYTHON
correlated_data = data[corr_value.index]
correlated_data.head()
OUTPUT
RMPTRATIOLSTATPrice
06.57515.34.9824.0
16.42117.89.1421.6
27.18517.84.0334.7
36.99818.72.9433.4
47.14718.75.3336.2

Plot the pairwise relationships within this reduced feature set to confirm the linear trends are visible:

PYTHON
sns.pairplot(correlated_data)
plt.tight_layout()

Training and Evaluating the Model

With the features selected, split the data into training (80 %) and test (20 %) sets. Shuffling the data when splitting removes any ordering bias in the dataset:

PYTHON
X = correlated_data.drop(labels=['Price'], axis = 1)
y = correlated_data['Price']
X.head()
OUTPUT
RMPTRATIOLSTAT
06.57515.34.98
16.42117.89.14
27.18517.84.03
36.99818.72.94
47.14718.75.33
PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train.shape, X_test.shape
OUTPUT
((404, 3), (102, 3))

You have 404 samples for training and 102 for testing. Now fit the model and generate predictions on the test set:

PYTHON
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
y_predict = model.predict(X_test)
df = pd.DataFrame(data = [y_predict, y_test])
df.T
OUTPUT
01
027.60903122.6
122.09903450.0
226.52925523.0
312.5079868.3
422.25487921.2
.........
9728.27122824.7
9818.46741914.1
9918.55807018.7
10024.68196428.1
10120.82687919.8

102 rows × 2 columns

Regression Evaluation Metrics

Good regression models require a quantitative measure of prediction quality. Three metrics are widely used, each with a different sensitivity to large errors.

Mean Absolute Error (MAE) is the average of the absolute differences between predictions and actuals — easy to interpret because it is in the same units as the target:

Mean Squared Error (MSE) squares each error before averaging, so large errors are penalised much more heavily than small ones:

Root Mean Squared Error (RMSE) takes the square root of MSE so the metric is back in the same units as the target, making it more interpretable:

In practice, MAE is easiest to interpret as the average error, MSE punishes large mistakes more, and RMSE gives you a single number in the units of your target variable. All three are loss functions — you want to minimise them.

The R² Score

The coefficient of determination, , measures how much of the variance in the target the model explains. It is defined as:

Where:

  • — total sum of squares (variance of the actual values)
  • — residual sum of squares (variance unexplained by the model)
  • ranges from 0 (model explains nothing) to 1 (model explains everything); a negative value means the model is worse than predicting the mean

The diagram below illustrates the difference between a bad model (R² near 0) and a good model (R² near 1):

R2 score diagram showing a bad model with similar errors and a good model where predicted squared errors are much smaller than baseline errors

Now compute all three metrics for the first model:

PYTHON
from sklearn.metrics import r2_score
PYTHON
correlated_data.columns
OUTPUT
Index(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype='object')
PYTHON
score = r2_score(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
print('r2_score: ', score)
print('mae: ', mae)
print('mse: ', mse)
OUTPUT
r2_score:  0.48816420156925056
mae:  4.404434993909258
mse:  41.67799012221684

An R² of 0.49 means the three-feature model explains about half the variance in house prices. You will now experiment with different correlation thresholds to see whether adding or removing features improves this.

Feature Selection Experiments

The helper function below records metrics for each experiment so you can compare all threshold configurations in one table:

PYTHON
total_features = []
total_features_name = []
selected_correlation_value = []
r2_scores = []
mae_value = []
mse_value = []
PYTHON
def performance_metrics(features, th, y_true, y_pred):
    score = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    total_features.append(len(features)-1)
    total_features_name.append(str(features))
    selected_correlation_value.append(th)
    r2_scores.append(score)
    mae_value.append(mae)
    mse_value.append(mse)

    metrics_dataframe = pd.DataFrame(data= [total_features_name, total_features, selected_correlation_value, r2_scores, mae_value, mse_value],
                                    index = ['features name', '#feature', 'corr_value', 'r2_score', 'MAE', 'MSE'])
    return metrics_dataframe.T

Record the baseline result (threshold = 0.50, three features):

PYTHON
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
OUTPUT
features name#featurecorr_valuer2_scoreMAEMSE
0['RM' 'PTRATIO' 'LSTAT' 'Price']30.50.4881644.4044341.678

The following scatter plots show the linear relationship between each selected feature and house price, with regression lines fitted:

PYTHON
rows = 2
cols = 2
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize = (16, 4))
ax[0, 0].set_title("House price with respect  to RM")
ax[0, 1].set_title("House price with respect to PTRATIO")
ax[1, 0].set_title("House price with respect to LSTAT")
ax[1, 1].set_title("House price with respect to PRICE")
col = correlated_data.columns
index = 0

for i in range(rows):
    for j in range(cols):
        sns.regplot(x = correlated_data[col[index]], y = correlated_data['Price'], ax = ax[i][j])
        index = index + 1
fig.tight_layout()

The four-panel plot below shows regression lines for each of the three selected features against Price, confirming positive correlation for RM and negative correlations for PTRATIO and LSTAT:

Four-panel scatter plot with regression lines showing house price vs RM, PTRATIO, LSTAT, and Price

The helper function below retrains the model from scratch for any given correlated feature subset, making it easy to compare different threshold configurations:

PYTHON
corrmat['Price']
OUTPUT
CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
Price      1.000000
Name: Price, dtype: float64
PYTHON
def get_y_predict(corr_data):
    X = corr_data.drop(labels = ['Price'], axis = 1)
    y = corr_data['Price']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    return y_predict

Threshold 0.60 — RM and LSTAT only: Tightening the threshold drops PTRATIO and keeps only the two most correlated features:

PYTHON
threshold = 0.60
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
OUTPUT
Corr Value
RM0.695360
LSTAT-0.737663
Price1.000000
PYTHON
correlated_data = data[corr_value.index]
correlated_data.head()
OUTPUT
RMLSTATPrice
06.5754.9824.0
16.4219.1421.6
27.1854.0334.7
36.9982.9433.4
47.1475.3336.2
PYTHON
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
OUTPUT
features name#featurecorr_valuer2_scoreMAEMSE
0['RM' 'PTRATIO' 'LSTAT' 'Price']30.50.4881644.4044341.678
1['RM' 'LSTAT' 'Price']20.60.5409084.1424437.3831

Dropping PTRATIO and keeping only RM and LSTAT improves R² from 0.49 to 0.54 — fewer but more informative features can outperform a larger set.

Threshold 0.70 — LSTAT only: Raising the threshold further leaves just LSTAT:

PYTHON
corrmat['Price']
OUTPUT
CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
Price      1.000000
Name: Price, dtype: float64
PYTHON
threshold = 0.70
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
OUTPUT
Corr Value
LSTAT-0.737663
Price1.000000
PYTHON
correlated_data = data[corr_value.index]
correlated_data.head()
OUTPUT
LSTATPrice
04.9824.0
19.1421.6
24.0334.7
32.9433.4
45.3336.2
PYTHON
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
OUTPUT
features name#featurecorr_valuer2_scoreMAEMSE
0['RM' 'PTRATIO' 'LSTAT' 'Price']30.50.4881644.4044341.678
1['RM' 'LSTAT' 'Price']20.60.5409084.1424437.3831
2['LSTAT' 'Price']10.70.4309574.8640146.3363

Using LSTAT alone drops R² back to 0.43, confirming that the combination of RM and LSTAT at threshold 0.60 was the sweet spot. Try RM alone as a further comparison:

PYTHON
correlated_data = data[['RM', 'Price']]
correlated_data.head()
OUTPUT
RMPrice
06.57524.0
16.42121.6
27.18534.7
36.99833.4
47.14736.2
PYTHON
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
OUTPUT
features name#featurecorr_valuer2_scoreMAEMSE
0['RM' 'PTRATIO' 'LSTAT' 'Price']30.50.4881644.4044341.678
1['RM' 'LSTAT' 'Price']20.60.5409084.1424437.3831
2['LSTAT' 'Price']10.70.4309574.8640146.3363
3['RM' 'Price']10.70.4239444.3247446.9074

Threshold 0.40 — six features: Lowering the threshold to 0.40 brings in more features (INDUS, NOX, RM, TAX, PTRATIO, LSTAT):

PYTHON
threshold = 0.40
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
OUTPUT
Corr Value
INDUS-0.483725
NOX-0.427321
RM0.695360
TAX-0.468536
PTRATIO-0.507787
LSTAT-0.737663
Price1.000000
PYTHON
correlated_data = data[corr_value.index]
correlated_data.head()
OUTPUT
INDUSNOXRMTAXPTRATIOLSTATPrice
02.310.5386.575296.015.34.9824.0
17.070.4696.421242.017.89.1421.6
27.070.4697.185242.017.84.0334.7
32.180.4586.998222.018.72.9433.4
42.180.4587.147222.018.75.3336.2
PYTHON
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
OUTPUT
features name#featurecorr_valuer2_scoreMAEMSE
0['RM' 'PTRATIO' 'LSTAT' 'Price']30.50.4881644.4044341.678
1['RM' 'LSTAT' 'Price']20.60.5409084.1424437.3831
2['LSTAT' 'Price']10.70.4309574.8640146.3363
3['RM' 'Price']10.70.4239444.3247446.9074
4['INDUS' 'NOX' 'RM' 'TAX' 'PTRATIO' 'LSTAT' 'P...60.40.4762034.394542.6519

Adding weakly correlated features at threshold 0.40 does not improve over the two-feature model, which confirms that feature quality matters more than feature quantity for linear regression.

Normalisation and Standardisation

When features have very different scales (e.g. TAX ranges into the hundreds while NOX is between 0 and 1), models that rely on distances or gradient-based updates can be affected. Standardisation rescales each feature to have zero mean and unit variance (Gaussian distribution):

Normalisation rescales samples to have unit norm and is particularly useful when using kernel-based similarity measures. Scikit-learn provides several ready-made scalers:

NameSklearn_class
Standard scalerStandard scaler
MinMaxScalerMinMax Scaler
MaxAbs ScalerMaxAbs Scaler
Robust scalerRobust scaler
Quantile Transformer_NormalQuantile Transformer(output_distribution ='normal')
Quantile Transformer_UniformQuantile Transformer(output_distribution = 'uniform')
PowerTransformer-Yeo-JohnsonPowerTransformer(method = 'yeo-johnson')
NormalizerNormalizer

The LinearRegression estimator in scikit-learn accepts a normalize parameter that applies feature-wise standardisation before fitting. Try it and observe whether the R² changes:

PYTHON
model = LinearRegression(normalize=True)
model.fit(X_train, y_train)
LinearRegression(normalize=True)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)
OUTPUT
0.48816420156925067

The R² is essentially identical, which is expected for ordinary least squares — linear regression coefficients adjust to scale, so normalisation does not change the fit quality here.

Plotting Learning Curves

Learning curves show how model performance changes as you add more training data. They help you diagnose two common problems: if training and validation scores are both low you have underfitting (high bias); if the training score is high but the validation score is low you have overfitting (high variance).

Import the cross-validation utilities needed:

PYTHON
from sklearn.model_selection import learning_curve, ShuffleSplit

Define a reusable function that plots both the training score and the cross-validation score as a function of training set size:

PYTHON
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 10)):

    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")

    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")

    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

X = correlated_data.drop(labels = ['Price'], axis = 1)
y = correlated_data['Price']

title = "Learning Curves (Linear Regression) " + str(X.columns.values)

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = LinearRegression()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
plt.show()

Conclusion

In this tutorial you trained a linear regression model on the Boston housing dataset. You explored the data with pairplots and distribution plots, selected features using Pearson correlation thresholds, and compared five different feature configurations. The two-feature model using RM and LSTAT (threshold 0.60) achieved the best R² of 0.54 and MAE of 4.14, outperforming both more restrictive and more relaxed feature sets. You also verified that feature normalisation has no effect on ordinary least squares fit quality, and built reusable learning-curve tooling to diagnose bias and variance.

Key takeaways:

  • Linear regression fits a line (or hyperplane) by minimising the mean squared error between predictions and actual values using gradient descent.
  • Feature selection by Pearson correlation is a fast and interpretable way to identify which inputs are worth including — but always validate empirically.
  • R² measures how much target variance your model explains; MAE and RMSE measure average prediction error in interpretable units.
  • More features do not always improve regression — weakly correlated features can add noise rather than signal.
  • Learning curves reveal whether your model is underfitting or overfitting and guide decisions about adding more data or regularisation.

Next steps:

  • Read Logistic Regression with Python to see how the linear model is adapted for binary classification problems.
  • Explore Random Forest Regressor to see how an ensemble of trees can capture non-linear relationships that linear regression misses.
  • Experiment with all 13 Boston features (threshold = 0.0) and add Ridge or Lasso regularisation to compare against the correlation-filtered models built here.

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments