Feature Engineering: Outlier Detection

An outlier is a data point that sits far away from the rest of the observations in a dataset. It is suspicious because it may have been generated by a different process — for example, a measurement error, a data entry mistake, or a genuinely rare event.

Not all outliers are bad. In fraud detection, an unusual transaction is exactly what you are looking for. But in many regression and classification problems, outliers caused by noise or errors will distort your model. This tutorial shows you how to find them systematically.

In this tutorial you will work with two real datasets: the Boston Housing dataset (loaded from Scikit-learn) and the Titanic passenger dataset (loaded from a public CSV). You will build two reusable detection functions — one for normally distributed variables and one for skewed variables — and apply them to five features.

Prerequisites: Python 3.x, Pandas, NumPy, Matplotlib, Seaborn, SciPy, Scikit-learn.

Why Outliers Matter for Machine Learning

Not every model reacts to outliers in the same way, so the first question to ask is: does my model care?

Models sensitive to outliers:

Linear Regression and Logistic Regression — extreme values pull the regression line or decision boundary toward them, distorting every prediction.
AdaBoost — assigns higher weights to misclassified samples; outliers repeatedly receive high weights, forcing the ensemble to over-focus on noise.
Neural Networks — can be misled when outliers dominate gradient updates during training.

Models robust to outliers:

Decision Trees and Random Forests — these models split data using thresholds, so an extreme value only affects the single node where it lands; it does not shift the entire model.

If you are building a linear model or a distance-based model (such as K-Nearest Neighbours or K-Means clustering), outlier treatment is not optional — it is a prerequisite.
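To make the sensitivity of linear models concrete, the sketch below fits a least-squares line to ten synthetic points that lie exactly on y = 2x + 1, then corrupts a single observation and refits. The data are made up purely for illustration, and NumPy's polyfit stands in for a full regression library.

```python
import numpy as np

# Ten synthetic points that lie exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y_clean = 2 * x + 1

# Same data, but the last observation is corrupted into an outlier
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

# A degree-1 polyfit returns [slope, intercept] of the least-squares line
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier:  {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

One corrupted point out of ten is enough to pull the fitted slope well away from the true value of 2, which is the effect described for linear models above.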

Outlier Detection Theory

The right detection method depends on the shape of your variable's distribution. Before picking a method, you need to know whether the variable is approximately normal or skewed.

Normally Distributed Variables — Z-Score Rule

For a variable that follows a Gaussian (normal) distribution — bell-shaped and symmetric — data points beyond three standard deviations from the mean are flagged as outliers:

$$\text{Boundary} = μ \pm 3 σ$$

Where:

$μ$ — the mean (average) of the feature
$σ$ — the standard deviation of the feature; measures how spread out the values are
$\pm 3 σ$ — three standard deviations in either direction; roughly 99.7 % of all data in a normal distribution falls within this range, so anything outside is considered extreme
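The 99.7 % figure can be checked empirically. The following is a minimal sketch on a synthetic normal sample (the mean of 50 and standard deviation of 10 are arbitrary choices, not values from the tutorial's datasets): it computes the two boundaries and measures the share of points that fall between them.

```python
import numpy as np

# Large synthetic sample from a normal distribution (seeded for repeatability)
rng = np.random.default_rng(42)
sample = rng.normal(loc=50.0, scale=10.0, size=100_000)

mu = sample.mean()
# NumPy's std uses the population formula (ddof=0) by default, pandas uses ddof=1;
# at this sample size the difference is negligible
sigma = sample.std()

# Boundaries from the three-standard-deviation rule
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Fraction of the sample that falls inside the boundaries
inside = np.mean((sample >= lower) & (sample <= upper))
print(f"boundaries: ({lower:.1f}, {upper:.1f})")
print(f"fraction inside: {inside:.4f}")
```

The printed fraction should land close to 0.997, leaving roughly 0.3 % of the sample outside the boundaries.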

Skewed Variables — IQR Proximity Rule

Many real-world variables are not bell-shaped — they have a long tail to one side. For these skewed distributions, standard deviations are unreliable because they are sensitive to extreme values themselves. Instead, you use the Interquartile Range (IQR):

$$IQR = Q_{3} - Q_{1}$$

Where:

$Q_{1}$ — the 25th percentile; 25 % of values fall below this point
$Q_{3}$ — the 75th percentile; 75 % of values fall below this point
$IQR$ — the range of the middle 50 % of the data; robust to extreme values

Using the IQR, you calculate upper and lower boundaries. There are two common multipliers:

Standard outlier boundaries (catches moderate outliers):

$$\text{Lower} = Q_{1} - 1.5 \times IQR, \quad \text{Upper} = Q_{3} + 1.5 \times IQR$$

Extreme outlier boundaries (catches only the most extreme values):

$$\text{Lower} = Q_{1} - 3 \times IQR, \quad \text{Upper} = Q_{3} + 3 \times IQR$$

Here $Q_{1}$, $Q_{3}$, and $IQR$ are as defined above, and the distance multiplier is either $1.5$, the standard threshold used by boxplots, or $3$, which targets only the most extreme observations.
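A small worked example may help fix the arithmetic. The sketch below applies both multipliers to a ten-value sample invented for illustration, using NumPy's default (linear interpolation) percentile method.

```python
import numpy as np

# Small illustrative sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")

# Compare the standard (1.5) and extreme (3) multipliers
for distance in (1.5, 3):
    lower = q1 - distance * iqr
    upper = q3 + distance * iqr
    flagged = values[(values < lower) | (values > upper)]
    print(f"{distance} x IQR: lower={lower}, upper={upper}, flagged={flagged}")
```

Here Q1 = 3.25 and Q3 = 7.75, so IQR = 4.5. The standard boundaries are -3.5 and 14.5, the extreme boundaries are -10.25 and 21.25, and the value 100 is flagged under both.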

Understanding the Boxplot

Before writing any code, it helps to understand what a boxplot shows. The diagram below labels every component: the box spans the IQR, the line inside the box is the median, and each whisker extends to the most extreme observation that still lies within 1.5 × IQR of the box. Any point beyond a whisker is drawn individually and treated as an outlier:

Anatomy of a boxplot diagram showing the minimum, lower whisker, Q1, median, Q3, upper whisker, maximum, and interquartile range label
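One detail is worth checking numerically: with the usual Matplotlib and Seaborn defaults, a whisker is drawn at the most extreme observation that still lies inside the 1.5 × IQR boundary, not at the boundary itself. The sketch below reproduces the boxplot's components with NumPy alone on a made-up sample; all variable names are illustrative.

```python
import numpy as np

# Made-up sample with one extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Boundaries ("fences") from the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends: the most extreme observations that are still inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Points that a boxplot would draw individually beyond the whiskers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"box: Q1={q1}, median={median}, Q3={q3}")
print(f"fences: ({lower_fence}, {upper_fence})")
print(f"whiskers: ({lower_whisker}, {upper_whisker})")
print(f"outliers: {outliers}")
```

In this sample the upper fence is 14.5, but the upper whisker stops at 9 because that is the largest value below the fence; only 100 is plotted as an outlier.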

Setting Up the Environment

Start by importing every library you will need for this tutorial.

PYTHON

# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to perform basic array operations
import numpy as np

# for plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plots
import scipy.stats as stats

# boston house dataset for the demo
from sklearn.datasets import load_boston

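Load the Boston Housing dataset and inspect its variable descriptions, which tell you exactly what each column measures. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so the Boston examples in this tutorial require an older scikit-learn release: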

PYTHON

from sklearn.datasets import load_boston
print(load_boston().DESCR)

OUTPUT

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1980.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

For this tutorial you only need three columns — RM (average rooms per dwelling), LSTAT (% lower-status population), and CRIM (per-capita crime rate). Select them and preview the first five rows:

PYTHON

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)[[
                          'RM', 'LSTAT', 'CRIM'
                      ]]

boston.head()

OUTPUT

	RM	LSTAT	CRIM
0	6.575	4.98	0.00632
1	6.421	9.14	0.02731
2	7.185	4.03	0.02729
3	6.998	2.94	0.03237
4	7.147	5.33	0.06905

Load the Titanic dataset and keep only the Age and Fare columns. Drop any rows that have missing values in those columns so that the boundaries, counts, and percentages computed later all refer to the same set of passengers with complete data:

PYTHON

titanic = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv',
                      usecols=['Age', 'Fare'])

titanic.dropna(subset=['Age', 'Fare'], inplace=True)
titanic.head()

OUTPUT

	Age	Fare
0	22.0	7.2500
1	38.0	71.2833
2	26.0	7.9250
3	35.0	53.1000
4	35.0	8.0500

Identifying Variable Distributions

Before applying any detection method, you need to know the shape of your variable's distribution. You will use three complementary plots: a histogram (shows the overall shape), a Q-Q plot (compares the data to a theoretical normal distribution — if the points lie on the diagonal line, the variable is approximately normal), and a boxplot (shows the IQR, whiskers, and any outliers directly).

The helper function below draws all three plots side by side for any variable in a given DataFrame:

PYTHON

def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

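    # histogram with a kernel density estimate overlaid
    # (sns.distplot is deprecated; histplot is its replacement)
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30, kde=True)
    plt.title('Histogram')

    # Q-Q plot against a theoretical normal distribution
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel(variable + ' quantiles')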

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()
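The plots can be complemented with a single number. The sketch below computes the sample skewness (the third standardized moment) with NumPy on two synthetic samples: values near 0 suggest a symmetric shape, while clearly positive values indicate a long right tail. The helper name sample_skewness is hypothetical and is not part of the tutorial's code.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: about 0 for symmetric data, > 0 for a right tail."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.std(values) ** 3

rng = np.random.default_rng(0)
symmetric = rng.normal(loc=6.0, scale=0.7, size=5_000)   # bell-shaped synthetic sample
right_skewed = rng.exponential(scale=3.0, size=5_000)    # synthetic sample with a long right tail

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)
print(f"symmetric sample skewness:    {skew_symmetric:.2f}")
print(f"right-skewed sample skewness: {skew_right:.2f}")
```

A number like this does not replace the histogram and Q-Q plot, but it gives a quick cross-check when you have many variables to screen.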

Normally Distributed Variables

Plot the diagnostic charts for RM — the average number of rooms per dwelling:

PYTHON

diagnostic_plots(boston, 'RM')

Diagnostic plots for RM — histogram shows a near-symmetric bell curve, Q-Q plot points track close to the diagonal, and boxplot shows a few outlier dots above and below the whiskers

The histogram and Q-Q plot confirm that RM approximates a Gaussian distribution well — the points in the Q-Q plot track the diagonal line closely. The boxplot reveals a small number of dots beyond both whiskers, suggesting outliers at both tails.

Now inspect Age from the Titanic dataset — the age of each passenger:

PYTHON

diagnostic_plots(titanic, 'Age')

Diagnostic plots for passenger Age — histogram is roughly bell-shaped with a slight skew toward younger ages, Q-Q plot shows minor deviation at low values, and boxplot shows a few outlier dots above the upper whisker

The Age variable approximates a Gaussian distribution fairly well. There is a slight deviation from normality at the lower end of the Q-Q plot, caused by the concentration of younger passengers. The boxplot indicates a small number of outliers above the upper whisker — very old passengers.

Skewed Variables

Plot the diagnostic charts for LSTAT — the percentage of lower-status population per town:

PYTHON

diagnostic_plots(boston, 'LSTAT')

Diagnostic plots for LSTAT — histogram shows a right-skewed distribution with a long tail, Q-Q plot points deviate significantly from the diagonal at high values, and boxplot shows outlier dots above the upper whisker

LSTAT is clearly not normally distributed — the histogram has a long tail to the right and the Q-Q plot deviates from the diagonal. The boxplot confirms outliers only at the right tail. For this variable you will use the IQR method.

Examine CRIM — the per-capita crime rate by town:

PYTHON

diagnostic_plots(boston, 'CRIM')

Diagnostic plots for CRIM — histogram shows extreme right skew with almost all mass near zero, Q-Q plot shows severe deviation at high quantiles, and boxplot shows many outlier dots clustered far above the upper whisker

CRIM is heavily right-skewed. Almost all towns have very low crime rates, but a subset have extremely high ones. The boxplot shows a large cluster of outlier points far above the upper whisker.

Finally, look at Fare from the Titanic dataset — the ticket price paid by each passenger:

PYTHON

diagnostic_plots(titanic, 'Fare')

Diagnostic plots for Fare — histogram shows extreme right skew concentrated near zero with a very long tail, Q-Q plot deviates strongly at the upper end, and boxplot shows numerous outlier dots up to around 500

Fare is also extremely right-skewed, with most passengers paying low fares and a handful paying very high amounts. The IQR method is the correct choice here.

Outlier Detection for Normally Distributed Variables

The function below calculates the upper and lower boundaries using the Z-score rule — mean plus or minus three standard deviations:

PYTHON

def find_normal_boundaries(df, variable):

    # calculate the boundaries outside which lie the outliers for a Gaussian distribution

    upper_boundary = df[variable].mean() + 3 * df[variable].std()
    lower_boundary = df[variable].mean() - 3 * df[variable].std()

    return upper_boundary, lower_boundary

Apply it to RM to get the numerical boundaries:

PYTHON

upper_boundary, lower_boundary = find_normal_boundaries(boston, 'RM')
upper_boundary, lower_boundary

OUTPUT

(8.392485817597757, 4.176782957105816)

Values above ~8.4 rooms or below ~4.2 rooms per dwelling are rare enough to be considered outliers. Now count how many houses fall outside these boundaries:

PYTHON

print('Total number of houses: {}'.format(len(boston)))

print('Houses with more than 8.4 rooms (right end outliers): {}'.format(
    len(boston[boston['RM'] > upper_boundary])))

print('Houses with less than 4.2 rooms (left end outliers): {}'.format(
    len(boston[boston['RM'] < lower_boundary])))
print()
# format the fractions as percentages so the labels match the values
print('Percentage of right end outliers: {:.2%}'.format(
    len(boston[boston['RM'] > upper_boundary]) / len(boston)))

print('Percentage of left end outliers: {:.2%}'.format(
    len(boston[boston['RM'] < lower_boundary]) / len(boston)))

OUTPUT

Total number of houses: 506
Houses with more than 8.4 rooms (right end outliers): 4
Houses with less than 4.2 rooms (left end outliers): 4

Percentage of right end outliers: 0.79%
Percentage of left end outliers: 0.79%

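Only 4 houses sit at each tail, about 0.8 % each or 1.6 % combined. That is somewhat more than the roughly 0.3 % a perfectly normal variable would place beyond three standard deviations, which suggests RM has slightly heavier tails than a true Gaussian, but the rule still flags only the rarest values.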
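Counting is often followed by inspecting the flagged rows themselves. A minimal sketch, assuming a small made-up DataFrame rather than the Boston data: the boundaries are turned into a boolean mask that selects the outlying rows. The column name rooms and the helper flag_normal_outliers are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_normal_outliers(df, variable, n_std=3):
    """Return a boolean Series that is True where a value lies outside mean +/- n_std * std."""
    mean, std = df[variable].mean(), df[variable].std()
    lower, upper = mean - n_std * std, mean + n_std * std
    return (df[variable] < lower) | (df[variable] > upper)

# Synthetic data: 200 well-behaved values plus two planted extremes at the end
rng = np.random.default_rng(1)
rooms = np.concatenate([rng.normal(6.0, 0.5, size=200), [12.0, 0.5]])
toy = pd.DataFrame({'rooms': rooms})

mask = flag_normal_outliers(toy, 'rooms')
print(toy[mask])
print(f"flagged rows: {mask.sum()} of {len(toy)}")
```

Keeping the mask separate from the data makes it easy to review the flagged rows before deciding what to do with them.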

Now calculate boundaries for Age in the Titanic dataset:

PYTHON

# calculate boundaries for Age in the titanic

upper_boundary, lower_boundary = find_normal_boundaries(titanic, 'Age')
upper_boundary, lower_boundary

OUTPUT

(73.27860964406095, -13.88037434994331)

The upper boundary of 73 years is meaningful. The lower boundary is negative, which is impossible for age — so you only apply the upper boundary. Count the passengers above 73:

PYTHON

# let's look at the number and percentage of outliers

print('Total passengers: {}'.format(len(titanic)))

print('Passengers older than 73: {}'.format(
    len(titanic[titanic['Age'] > upper_boundary])))
print()
print('Percentage of passengers older than 73: {:.2%}'.format(
    len(titanic[titanic['Age'] > upper_boundary]) / len(titanic)))

OUTPUT

Total passengers: 714
Passengers older than 73: 2

Percentage of passengers older than 73: 0.28%

Only 2 passengers were older than 73 — a tiny fraction of the 714 passengers, confirming they are genuine outliers by the Z-score definition.

Outlier Detection for Skewed Variables

For skewed features, you use the IQR proximity rule. The function below accepts a distance argument so you can switch between 1.5 × IQR (standard outliers) and 3 × IQR (extreme outliers):

PYTHON

def find_skewed_boundaries(df, variable, distance):

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary

Apply the standard 1.5 × IQR rule to LSTAT:

PYTHON

upper_boundary, lower_boundary = find_skewed_boundaries(boston, 'LSTAT', 1.5)
upper_boundary, lower_boundary

OUTPUT

(31.962500000000006, -8.057500000000005)

The upper boundary is ~32. The lower boundary is negative, which LSTAT (a percentage) cannot reach, so only the upper boundary is meaningful. Count the houses with unusually high values:

PYTHON

print('Total houses: {}'.format(len(boston)))

print('Houses with LSTAT bigger than 32: {}'.format(
    len(boston[boston['LSTAT'] > upper_boundary])))
print()
print('Percentage of houses with LSTAT bigger than 32: {:.2%}'.format(
    len(boston[boston['LSTAT'] > upper_boundary])/len(boston)))

OUTPUT

Total houses: 506
Houses with LSTAT bigger than 32: 7

Percentage of houses with LSTAT bigger than 32: 1.38%

Seven houses (1.4 %) have an unusually high lower-status population percentage — consistent with the right-tail outliers visible in the boxplot earlier.
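How much the multiplier matters is easy to see on synthetic data. The sketch below draws a right-skewed (log-normal) sample and counts the points beyond the upper boundary for both multipliers. The sample and the helper upper_iqr_boundary are illustrative assumptions, not the tutorial's datasets or functions.

```python
import numpy as np
import pandas as pd

def upper_iqr_boundary(series, distance):
    """Upper boundary of the IQR proximity rule: Q3 + distance * IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return q3 + distance * (q3 - q1)

# Synthetic right-skewed sample (seeded for repeatability)
rng = np.random.default_rng(7)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2_000))

counts = {}
for distance in (1.5, 3):
    boundary = upper_iqr_boundary(skewed, distance)
    counts[distance] = int((skewed > boundary).sum())
    print(f"{distance} x IQR: boundary={boundary:.2f}, flagged={counts[distance]}")
```

The wider multiplier can never flag more points than the narrower one, so moving from 1.5 to 3 is a way to restrict attention to the far end of the tail.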

Now use the stricter 3 × IQR rule on CRIM to find only the most extreme crime-rate values:

PYTHON

upper_boundary, lower_boundary = find_skewed_boundaries(boston, 'CRIM', 3)
upper_boundary, lower_boundary

OUTPUT

(14.462195000000001, -10.7030675)

Count the houses above this extreme boundary:

PYTHON

print('Total houses: {}'.format(len(boston)))

print('Houses with CRIM bigger than 14.5: {}'.format(
    len(boston[boston['CRIM'] > upper_boundary])))
print()
print('Percentage of houses with CRIM bigger than 14.5: {:.2%}'.format(
    len(boston[boston['CRIM'] > upper_boundary]) / len(boston)))

OUTPUT

Total houses: 506
Houses with CRIM bigger than 14.5: 30

Percentage of houses with CRIM bigger than 14.5: 5.93%

Even with the stricter 3 × IQR threshold, about 6 % of the dataset exceeds the boundary. This reflects the heavily skewed nature of CRIM — a small number of high-crime towns sit very far from the rest.

Finally, identify extreme Fare values in the Titanic dataset using IQR × 3:

PYTHON

upper_boundary, lower_boundary = find_skewed_boundaries(titanic, 'Fare', 3)
upper_boundary, lower_boundary

OUTPUT

(109.35, -67.925)

Count the passengers who paid fares above this boundary:

PYTHON

print('Total passengers: {}'.format(len(titanic)))

print('Passengers who paid more than 109: {}'.format(
    len(titanic[titanic['Fare'] > upper_boundary])))
print()
print('Percentage of passengers who paid more than 109: {:.2%}'.format(
    len(titanic[titanic['Fare'] > upper_boundary])/len(titanic)))

OUTPUT

Total passengers: 714
Passengers who paid more than 109: 44

Percentage of passengers who paid more than 109: 6.16%

About 6 % of passengers paid fares above the extreme boundary. As with the other right-skewed variables, the lower boundary is negative and is ignored. These 44 passengers likely held first-class cabins and represent a genuinely different segment of the passenger population.

Conclusion

In this tutorial you built a systematic outlier detection workflow and applied it to five features across two real-world datasets. For RM and Age, which are approximately normally distributed, you used the Z-score rule and found very few outliers (under 2 %). For LSTAT, CRIM, and Fare, which are right-skewed, you used the IQR proximity rule and found between 1 % and 6 % of observations beyond the upper boundary: 1.5 × IQR for LSTAT and 3 × IQR for CRIM and Fare.

Key takeaways:

Always inspect the variable's distribution first — histogram and Q-Q plot together tell you whether to use the Z-score rule or the IQR rule.
The Z-score rule ( $μ \pm 3 σ$ ) is appropriate only for approximately normal variables; applying it to skewed data gives misleading boundaries.
The IQR rule is robust to extreme values because it uses percentiles rather than the mean and standard deviation.
For variables that cannot be negative, such as age, a percentage, or a price, a negative lower boundary flags nothing; only the upper boundary is informative in those cases.
The percentage of outliers found is usually small (a few percent); if it is large, as with the roughly 6 % found for CRIM and Fare, revisit the multiplier or the distribution assumption.

Next steps:

After removing or capping outliers, scaling your features is the logical next step — read Variable Magnitude to see how standardization and min-max scaling work.
If you have not yet addressed linear model assumptions such as normality of residuals and homoscedasticity, read Linear Model Assumptions.
For handling rare category labels in categorical features — another common preprocessing challenge — see Rare Labels.
To see how tree-based models that are naturally robust to outliers work, explore Decision Tree in Python.
