Imagine you have two features in your dataset: AGE, which ranges from 0 to 100, and TAX (a property-tax rate), which ranges from 187 to 711. When a machine learning algorithm tries to learn from both features at the same time, the much larger values in TAX can crowd out the signal in AGE — not because TAX is more important, but simply because its numbers are bigger. This is the problem of variable magnitude — the difference in scale and range between features.
Fortunately, feature scaling solves this. Scaling transforms each feature so that all variables operate on a comparable numerical range. This tutorial walks you through three standard scaling techniques available in scikit-learn and shows exactly what each one does to your data.
Prerequisites: Python 3.x, pandas, NumPy, scikit-learn.
Why Variable Magnitude Matters
Before writing any code, it is worth understanding the three specific ways that unscaled features cause problems.
Distance metrics. Algorithms such as K-Nearest Neighbors (KNN), K-Means clustering, and Support Vector Machines (SVMs) measure similarity between data points using Euclidean distance:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Where:
- $d(p, q)$: the Euclidean distance between data points $p$ and $q$
- $n$: the number of features
- $p_i, q_i$: the values of feature $i$ for data points $p$ and $q$
When one feature has values in the hundreds and another in the tenths, squaring the differences means the large-valued feature dominates the sum. The smaller feature barely influences the result, even if it contains useful information.
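To make the dominance concrete, here is a minimal sketch with two made-up observations, each described by a NOX-like and a TAX-like value. The numbers are illustrative assumptions, not rows from the dataset. It computes how much of the squared Euclidean distance each feature contributes.

```python
import numpy as np

# Two hypothetical observations: (NOX-like value, TAX-like value)
a = np.array([0.40, 300.0])
b = np.array([0.80, 310.0])  # the small feature doubles, the large one barely moves

# Per-feature squared differences that enter the Euclidean sum
sq_diff = (a - b) ** 2
distance = np.sqrt(sq_diff.sum())

# Fraction of the squared distance contributed by each feature
share = sq_diff / sq_diff.sum()

print("distance:", distance)
print("share per feature:", share)
```

Even though the first feature doubles while the second changes by about 3%, the second feature accounts for more than 99% of the squared distance.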
Gradient descent. Many models, including linear regression, logistic regression, and neural networks, are trained by minimizing a loss function using gradient descent:
$$w_j \leftarrow w_j - \alpha \frac{\partial L}{\partial w_j}$$
Where:
- $w_j$: the model weight (coefficient) for feature $j$
- $\alpha$: the learning rate, a small positive number controlling step size
- $L$: the loss function measuring prediction error
When features have very different magnitudes, the loss surface becomes elongated and skewed. Gradient descent then takes many small, inefficient steps to find the minimum. Scaling the features makes the loss surface more symmetric and training converges faster.
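One way to quantify how elongated the loss surface is: for a linear model with squared-error loss, the curvature is governed by the matrix $X^\top X / m$ (with $m$ samples), and the ratio of its largest to smallest eigenvalue (the condition number) controls how slowly plain gradient descent converges. The sketch below uses synthetic features whose scales are assumed to resemble NOX and TAX, and it centers the features on the assumption that the model has an intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500

# Synthetic, independent features on very different scales (illustrative only)
x_small = rng.normal(0.5, 0.1, m)      # NOX-like
x_large = rng.normal(400.0, 170.0, m)  # TAX-like
X = np.column_stack([x_small, x_large])


def loss_condition_number(X):
    """Condition number of the squared-error curvature matrix X^T X / m
    after centering (assumes the model includes an intercept)."""
    Xc = X - X.mean(axis=0)
    H = Xc.T @ Xc / len(Xc)
    return np.linalg.cond(H)


# Standardize by hand: subtract the mean, divide by the standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cond_raw = loss_condition_number(X)
cond_scaled = loss_condition_number(X_scaled)

print("condition number, raw features:   ", cond_raw)
print("condition number, scaled features:", cond_scaled)
```

On the raw features the condition number is several orders of magnitude larger than on the standardized ones, where it stays close to 1. A value near 1 corresponds to a nearly round loss surface on which a single learning rate works well in every direction.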
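Regularization. Penalty-based methods such as Ridge and Lasso regression add a term to the loss function that shrinks large coefficients. For Ridge the penalized loss is:
$$L_{\text{reg}} = \text{MSE} + \lambda \sum_{j=1}^{n} w_j^2$$
Lasso has the same form with $|w_j|$ in place of $w_j^2$.
Where:
- $\text{MSE}$: mean squared error between the predicted and actual target values
- $\lambda$: the regularization strength, a hyperparameter you choose
- $w_j$: the model coefficient for feature $j$
- $n$: the total number of features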
If features are unscaled, the coefficients will have vastly different magnitudes depending on each feature's raw scale: a feature measured in small numbers needs a large coefficient to have the same effect as a feature measured in large numbers. The penalty term then shrinks coefficients unevenly, penalizing features according to their units rather than how predictive they actually are.
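The unit dependence can be checked directly. The sketch below is an illustration on synthetic data, not part of the Boston walkthrough: it fits the same single feature twice, once in metres and once in kilometres, and compares each Ridge coefficient with its unpenalized counterpart. The feature, the target, and `alpha=100.0` are all assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n_samples = 200

# Hypothetical feature measured in metres, and a target that depends on it
x_m = rng.normal(0.0, 1000.0, n_samples)
y = 0.002 * x_m + rng.normal(0.0, 0.5, n_samples)

X_m = x_m.reshape(-1, 1)
X_km = X_m / 1000.0  # identical information, expressed in kilometres


def shrink_ratio(X, y, alpha=100.0):
    """Ridge coefficient divided by the unpenalized (OLS) coefficient.
    1.0 means no shrinkage; smaller values mean stronger shrinkage."""
    ols_coef = LinearRegression().fit(X, y).coef_[0]
    ridge_coef = Ridge(alpha=alpha).fit(X, y).coef_[0]
    return ridge_coef / ols_coef


ratio_m = shrink_ratio(X_m, y)
ratio_km = shrink_ratio(X_km, y)

# The kilometre version needs a coefficient 1000 times larger,
# so the same alpha shrinks it far more.
print("shrink ratio (metres):    ", ratio_m)
print("shrink ratio (kilometres):", ratio_km)
```

The metre version is left almost untouched while the kilometre version loses roughly a third of its coefficient, even though both columns carry exactly the same information.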
Important
Always fit your scaling transformers only on the training set and then transform both the training and test sets. Fitting on the entire dataset causes data leakage — test-set statistics influence the scaler, which means your model has indirectly seen test data before evaluation.
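One common way to enforce this rule is to wrap the scaler and the model in a scikit-learn pipeline, so that the scaler is refit on the training portion of every cross-validation fold. The following is a minimal sketch on synthetic data from `make_regression`; the choice of `Ridge` and its `alpha` is an assumption made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Boston dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler only on whatever data .fit() receives
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# Inside cross-validation the scaler is refit on each fold's training part
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit on the full training set; the test set is only transformed
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

# The learned means come from the training rows alone
fitted_scaler = model.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
```

Because the scaler lives inside the pipeline, there is no separate `fit` call that could accidentally be given the full dataset.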
Comparison of Scaling Methods
The three scalers in this tutorial each use a different formula. The table below summarises them side by side.
| Scaler | Mathematical Formula | Key Characteristic | Outlier Robustness |
|---|---|---|---|
| StandardScaler | $z = \frac{x - \mu}{\sigma}$ | Centered at 0, unit variance | Sensitive (outliers affect $\mu$ and $\sigma$) |
| MinMaxScaler | $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$ | Bounded in a custom range (typically $[0, 1]$) | Highly sensitive (outliers skew the bounds) |
| RobustScaler | $x' = \frac{x - Q_2}{Q_3 - Q_1}$ | Scales using median ($Q_2$) and IQR ($Q_3 - Q_1$) | Highly robust (outliers do not influence the median/IQR) |
Where:
- $x$: the raw feature value for a single observation
- $\mu$: the mean of the feature computed from the training set
- $\sigma$: the standard deviation of the feature computed from the training set
- $x_{\min}, x_{\max}$: the minimum and maximum feature values in the training set
- $Q_1, Q_2, Q_3$: the 25th, 50th (median), and 75th percentiles of the feature in the training set
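The three formulas are short enough to write by hand. As a sanity check, the sketch below applies each one with NumPy to a small made-up column and compares the result with the corresponding scikit-learn scaler at its default settings.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One illustrative feature column (made-up values)
x = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])

# StandardScaler: subtract the mean, divide by the population std (ddof=0)
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# MinMaxScaler: subtract the minimum, divide by the range
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# RobustScaler: subtract the median, divide by the interquartile range
q1, q2, q3 = np.percentile(x, [25, 50, 75], axis=0)
rb_manual = (x - q2) / (q3 - q1)

# Compare each hand-written formula with the library result
matches = {
    "StandardScaler": np.allclose(z_manual, StandardScaler().fit_transform(x)),
    "MinMaxScaler": np.allclose(mm_manual, MinMaxScaler().fit_transform(x)),
    "RobustScaler": np.allclose(rb_manual, RobustScaler().fit_transform(x)),
}
print(matches)
```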
Setting Up the Dataset
The examples in this tutorial use the Boston House Prices dataset from scikit-learn, which contains 13 numeric features describing neighborhoods in Boston and a target variable — median house price. It is a good test case for scaling because its features span very different ranges.
Start by importing the libraries you will need throughout this tutorial.
# to read the dataset into a dataframe and perform operations on it
import pandas as pd
# to perform basic array operations
import numpy as np
# to split and standardize the dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
# Boston house dataset for the demo
# (load_boston was removed in scikit-learn 1.2, so this import needs an older release)
from sklearn.datasets import load_boston
Load the dataset and build a pandas DataFrame so you can inspect column names and values easily.
boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)
boston.head()
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
Use describe() to compute summary statistics for every column. The numbers immediately confirm that features operate on vastly different scales.
boston.describe().round(2)
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 | 506.00 |
| mean | 3.61 | 11.36 | 11.14 | 0.07 | 0.55 | 6.28 | 68.57 | 3.80 | 9.55 | 408.24 | 18.46 | 356.67 | 12.65 |
| std | 8.60 | 23.32 | 6.86 | 0.25 | 0.12 | 0.70 | 28.15 | 2.11 | 8.71 | 168.54 | 2.16 | 91.29 | 7.14 |
| min | 0.01 | 0.00 | 0.46 | 0.00 | 0.38 | 3.56 | 2.90 | 1.13 | 1.00 | 187.00 | 12.60 | 0.32 | 1.73 |
| 25% | 0.08 | 0.00 | 5.19 | 0.00 | 0.45 | 5.89 | 45.02 | 2.10 | 4.00 | 279.00 | 17.40 | 375.38 | 6.95 |
| 50% | 0.26 | 0.00 | 9.69 | 0.00 | 0.54 | 6.21 | 77.50 | 3.21 | 5.00 | 330.00 | 19.05 | 391.44 | 11.36 |
| 75% | 3.68 | 12.50 | 18.10 | 0.00 | 0.62 | 6.62 | 94.07 | 5.19 | 24.00 | 666.00 | 20.20 | 396.22 | 16.96 |
| max | 88.98 | 100.00 | 27.74 | 1.00 | 0.87 | 8.78 | 100.00 | 12.13 | 24.00 | 711.00 | 22.00 | 396.90 | 37.97 |
Notice that NOX has a mean of 0.55 and a standard deviation of 0.12, while TAX has a mean of 408.24 and a standard deviation of 168.54. These two features are on completely different scales.
You can make the magnitude problem even clearer by computing the range — the difference between the maximum and minimum value — for every feature.
# calculate the range (difference between maximum and minimum value) of the variables
# then we sort the range and display
(boston.max() - boston.min()).sort_values(ascending=False)
TAX 524.00000
B 396.58000
ZN 100.00000
AGE 97.10000
CRIM 88.96988
LSTAT 36.24000
INDUS 27.28000
RAD 23.00000
DIS 10.99690
PTRATIO 9.40000
RM 5.21900
CHAS 1.00000
NOX 0.48600
dtype: float64
TAX has a range of 524 while NOX has a range of only 0.49 — more than a thousand times smaller. Any distance- or gradient-based algorithm trained on these raw values would be dominated by TAX and B, regardless of their true predictive importance.
Before applying any scaler, split the data into a training set (70 %) and a test set (30 %). The scalers will be fit only on the training portion.
# split the data into training and testing set
X_train, X_test, y_train, y_test = train_test_split(
    boston_dataset.data,
    boston_dataset.target,
    test_size=0.3,
    random_state=0)
X_train.shape, X_test.shape
((354, 13), (152, 13))
The training set contains 354 samples and the test set contains 152 samples, each with 13 features.
Applying the Three Scalers
Each scaler follows the same three-step pattern: instantiate the scaler, call .fit() on the training data to learn the scaling parameters, then call .transform() on both the training and test sets.
Standard Scaling
Standard scaling — also called Z-score normalization — subtracts the mean and divides by the standard deviation. The result is a feature centered at 0 with a standard deviation of 1. This is the most common choice when your features follow a roughly normal (bell-curve) distribution.
# call standard scaler
scaler = StandardScaler()
# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)
# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
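After fitting, the learned parameters are stored on the scaler as `mean_` and `scale_`, and `inverse_transform` maps scaled values back to the original units. The sketch below uses synthetic training and test arrays (stand-ins with NOX-like and TAX-like scales, not the Boston split) to show that the test set is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for a train/test split with two features
X_tr = rng.normal(loc=[0.5, 400.0], scale=[0.1, 170.0], size=(300, 2))
X_te = rng.normal(loc=[0.5, 400.0], scale=[0.1, 170.0], size=(100, 2))

scaler = StandardScaler().fit(X_tr)

# Parameters learned from the training data only
print("mean_: ", scaler.mean_)
print("scale_:", scaler.scale_)

# transform() applies the training mean and std to any new data
X_te_scaled = scaler.transform(X_te)
manual = (X_te - scaler.mean_) / scaler.scale_

# inverse_transform() recovers the original units
round_trip = scaler.inverse_transform(X_te_scaled)

print(np.allclose(manual, X_te_scaled), np.allclose(round_trip, X_te))
print("test-set means after scaling:", X_te_scaled.mean(axis=0))
```

The scaled test-set means are close to 0 but not exactly 0, which is expected: the test set is shifted by the training mean, not by its own.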
MinMax Scaling
MinMax scaling, also called min-max normalization, compresses every feature into a fixed range, usually $[0, 1]$. It preserves the original distribution shape but is sensitive to outliers because extreme values shift the min and max used in the formula.
# call min max scaler
scaler = MinMaxScaler()
# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)
# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
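Because the minimum and maximum are learned from the training set, test values that fall outside the training range map outside $[0, 1]$. The sketch below shows this on made-up numbers and also shows the `clip=True` option (available in scikit-learn 0.24 and later), which truncates transformed values to the target range. Whether clipping is appropriate depends on the model; it is shown here only as one option.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test values for a single feature
X_tr = np.array([[10.0], [20.0], [30.0]])
X_te = np.array([[5.0], [25.0], [40.0]])  # 5 and 40 lie outside the training range

# Default behaviour: test values may leave the [0, 1] interval
scaler = MinMaxScaler().fit(X_tr)
scaled = scaler.transform(X_te)

# With clip=True, transformed values are truncated to the feature range
clipping_scaler = MinMaxScaler(clip=True).fit(X_tr)
clipped = clipping_scaler.transform(X_te)

print("learned min/max:", scaler.data_min_, scaler.data_max_)
print("scaled: ", scaled.ravel())
print("clipped:", clipped.ravel())
```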
Robust Scaling
Robust scaling centers each feature using the median ($Q_2$) and scales it using the interquartile range (IQR = $Q_3 - Q_1$). Because the median and IQR are largely unaffected by a few extreme values, this scaler handles datasets that contain outliers far better than the other two methods.
# call robust scaler
scaler = RobustScaler()
# fit the scaler to the train set, it will learn the parameters
scaler.fit(X_train)
# transform train and test sets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
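The difference in outlier handling is easiest to see on a tiny example. The sketch below scales a made-up column containing five ordinary values and one extreme value with each of the three scalers, then measures how much spread the five ordinary values retain afterwards.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Five ordinary values plus one extreme outlier (made-up data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

spread = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x)
    inliers = scaled[:5, 0]  # the five ordinary values after scaling
    # Distance between the largest and smallest ordinary value
    spread[type(scaler).__name__] = inliers.max() - inliers.min()

for name, value in spread.items():
    print(f"{name}: inlier spread = {value:.4f}")
```

With StandardScaler and MinMaxScaler the outlier stretches the scale so much that the ordinary values are squeezed into a very narrow band, while RobustScaler keeps them well separated.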
Verifying the Effect of Scaling
To confirm that scaling is working correctly, compare the mean and standard deviation of the training features before and after applying StandardScaler. First, check the raw (unscaled) statistics.
# fit the scaler
scaler = StandardScaler()
scaler.fit(X_train)
# convert arrays to dataframes
X_train_df = pd.DataFrame(X_train, columns=boston.columns)
X_train_scaled_df = pd.DataFrame(scaler.transform(X_train), columns=boston.columns)
print('Mean and Std before scaling')
print('Mean:')
print(X_train_df.mean(axis=0).round(2))
print()
print('Std:')
print(X_train_df.std(axis=0).round(2))
Mean and Std before scaling
Mean:
CRIM 3.36
ZN 11.91
INDUS 11.30
CHAS 0.07
NOX 0.56
RM 6.29
AGE 68.99
DIS 3.76
RAD 9.37
TAX 403.43
PTRATIO 18.45
B 358.94
LSTAT 12.67
dtype: float64
Std:
CRIM 8.34
ZN 23.75
INDUS 6.99
CHAS 0.25
NOX 0.12
RM 0.70
AGE 28.02
DIS 2.08
RAD 8.69
TAX 170.84
PTRATIO 2.22
B 88.66
LSTAT 7.17
dtype: float64
The means range from 0.07 (CHAS) to 403.43 (TAX), and the standard deviations range from 0.12 (NOX) to 170.84 (TAX). Now apply StandardScaler and check the same statistics after scaling.
scaler = StandardScaler()
scaler.fit(X_train)
# convert arrays to dataframes
X_train_df = pd.DataFrame(X_train, columns=boston.columns)
X_train_scaled_df = pd.DataFrame(scaler.transform(X_train), columns=boston.columns)
print('Mean and Std after scaling')
print('Mean:')
print(X_train_scaled_df.mean(axis=0).round(2))
print()
print('Std:')
print(X_train_scaled_df.std(axis=0).round(2))
Mean and Std after scaling
Mean:
CRIM -0.0
ZN 0.0
INDUS 0.0
CHAS 0.0
NOX -0.0
RM -0.0
AGE -0.0
DIS 0.0
RAD -0.0
TAX -0.0
PTRATIO 0.0
B 0.0
LSTAT -0.0
dtype: float64
Std:
CRIM 1.0
ZN 1.0
INDUS 1.0
CHAS 1.0
NOX 1.0
RM 1.0
AGE 1.0
DIS 1.0
RAD 1.0
TAX 1.0
PTRATIO 1.0
B 1.0
LSTAT 1.0
dtype: float64
After standard scaling every feature has a mean of 0 and a standard deviation of 1, regardless of its original range. TAX and NOX are now on exactly the same footing.
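One detail sits behind the printed 1.0 values. StandardScaler divides by the population standard deviation (`ddof=0`), whereas the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The printed values are therefore very slightly above 1 before rounding. The sketch below illustrates this on a synthetic column with the same number of rows as the training set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_rows = 354  # same row count as the training set above

# Synthetic TAX-like column (illustrative values only)
X_demo = rng.normal(400.0, 170.0, size=(n_rows, 1))
scaled = pd.DataFrame(StandardScaler().fit_transform(X_demo),
                      columns=["feature"])

std_sample = scaled["feature"].std()            # pandas default, ddof=1
std_population = scaled["feature"].std(ddof=0)  # matches StandardScaler

print("pandas default std:", std_sample)
print("std with ddof=0:   ", std_population)
print("expected ratio:    ", np.sqrt(n_rows / (n_rows - 1)))
```

The gap shrinks as the number of rows grows, so it matters only when checking results to many decimal places or working with very small samples.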
Now verify the MinMaxScaler output by checking that every feature's minimum and maximum are exactly 0 and 1.
scaler = MinMaxScaler()
scaler.fit(X_train)
# convert arrays to dataframes
X_train_df = pd.DataFrame(X_train, columns=boston.columns)
X_train_scaled_df = pd.DataFrame(scaler.transform(X_train), columns=boston.columns)
print('Min and Max after scaling')
print('Min:')
print(X_train_scaled_df.min(axis=0).round(2))
print()
print('Max:')
print(X_train_scaled_df.max(axis=0).round(2))
Min and Max after scaling
Min:
CRIM 0.0
ZN 0.0
INDUS 0.0
CHAS 0.0
NOX 0.0
RM 0.0
AGE 0.0
DIS 0.0
RAD 0.0
TAX 0.0
PTRATIO 0.0
B 0.0
LSTAT 0.0
dtype: float64
Max:
CRIM 1.0
ZN 1.0
INDUS 1.0
CHAS 1.0
NOX 1.0
RM 1.0
AGE 1.0
DIS 1.0
RAD 1.0
TAX 1.0
PTRATIO 1.0
B 1.0
LSTAT 1.0
dtype: float64
Every feature now has a minimum of 0.0 and a maximum of 1.0, confirming that MinMaxScaler has compressed all features into the $[0, 1]$ interval.
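The same kind of check works for RobustScaler: after scaling, each training feature should have a median of 0 and an interquartile range of 1. The sketch below runs the check on a synthetic two-column DataFrame; the column names `NOX_like` and `TAX_like` are hypothetical stand-ins, not the actual Boston columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a training set with two differently scaled features
X_demo = pd.DataFrame({
    "NOX_like": rng.normal(0.55, 0.12, 354),
    "TAX_like": rng.normal(408.0, 168.0, 354),
})

scaler = RobustScaler()
scaled = pd.DataFrame(scaler.fit_transform(X_demo), columns=X_demo.columns)

# Median and interquartile range of each scaled feature
medians = scaled.median()
iqr = scaled.quantile(0.75) - scaled.quantile(0.25)

print("Median after scaling:")
print(medians.round(2))
print()
print("IQR after scaling:")
print(iqr.round(2))
```

Unlike the other two scalers, RobustScaler does not bound the output or fix its standard deviation, so the median and IQR are the statistics to check.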
Conclusion
In this tutorial you explored why variable magnitude creates problems for distance-based, gradient-driven, and regularized machine learning models. You then applied all three standard scikit-learn scalers — StandardScaler, MinMaxScaler, and RobustScaler — to the Boston Housing dataset and verified the before-and-after statistics to confirm each scaler's effect.
Key takeaways:
- Features at different scales unfairly bias distance metrics, slow down gradient descent, and distort regularization penalties; scaling mitigates all three problems.
- StandardScaler produces zero mean and unit variance; use it when your features approximate a normal distribution.
- MinMaxScaler compresses features into $[0, 1]$; use it for neural networks or algorithms that expect bounded inputs, but remove outliers first.
- RobustScaler uses the median and IQR instead of mean and standard deviation; use it when your dataset contains outliers you cannot or do not want to remove.
- Always call .fit() on the training set only, then .transform() on both sets; fitting on the full dataset is data leakage.
Next steps:
- Learn how outliers interact with scaling in Feature Engineering Series 5: Outliers; understanding outliers helps you decide whether RobustScaler or pre-removal is the better choice.
- Read Feature Engineering Series 4: Linear Model Assumptions to see why scaling is one of several preprocessing steps required before fitting a linear model.
- Explore Feature Selection: Constant, Quasi-Constant, and Duplicate Features to remove uninformative features before you scale, keeping your pipeline lean.
- Apply what you have learned end-to-end in a Linear Regression walkthrough where scaling directly impacts the model's coefficient estimates.
