Feature Engineering Tutorial Series 5: Outliers

Published by georgiannacambel

An outlier is a data point which is significantly different from the remaining data. "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism." [D. Hawkins. Identification of Outliers, Chapman and Hall, 1980.]

Should outliers be removed?

Depending on the context, outliers either deserve special attention or should be completely ignored. Take the example of revenue forecasting: if unusual spikes of revenue are observed, it's probably a good idea to pay extra attention to them and figure out what caused the spike. In the same way, an unusual transaction on a credit card is usually a sign of fraudulent activity, which is what the credit card issuer wants to prevent. So in instances like these, it is useful to look for and investigate further outlier values.

If, however, outliers are introduced due to mechanical error, measurement error or anything else that can't be generalised, it is a good idea to remove them before feeding the data to the modeling algorithm. Why? Because some algorithms are sensitive to outliers.

Which machine learning models are sensitive to outliers?

Some machine learning models are more sensitive to outliers than others. For instance, AdaBoost may treat outliers as "hard" cases and put tremendous weight on them, thereby producing a model with poor generalisation.

Linear models, in particular Linear Regression, can also be sensitive to outliers.

Decision trees tend to ignore the presence of outliers when creating the branches of their trees. Typically, trees make decisions by asking whether variable x >= a certain value; the outlier will therefore fall on one side of the split and be treated the same as the remaining values, regardless of its magnitude.

A recent research article suggests that Neural Networks could also be sensitive to outliers, provided the number of outliers is high and the deviation is also high. I would argue that if the number of outliers is high (>15% as suggested in the article), then they are no longer outliers, and rather a fair representation of that variable.

How can outliers be identified?

Outlier analysis and anomaly detection is a huge field of research devoted to optimising methods and creating new algorithms to reliably identify outliers. There are a huge number of ways to detect outliers in different situations. These are mostly targeted at identifying outliers when they are the observations we actually want to focus on, for example fraudulent credit card activity.

In this blog we will focus on identifying outliers introduced by mechanical or measurement error: those that are a genuinely rare case in the population and could be ignored. We will see how to identify such outliers.

Extreme Value Analysis

The most basic form of outlier detection is Extreme Value Analysis of 1-dimensional data. The key for this method is to determine the statistical tails of the underlying distribution of the variable, and then find the values that sit at the very end of the tails.

If the variable is Normally distributed (Gaussian), then the values that lie outside the mean plus or minus 3 times the standard deviation of the variable are considered outliers (a short sketch follows the formula below):

  • outliers = mean +/- 3 * std
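
A quick illustration of this rule (a minimal sketch on synthetic data, not part of the original notebook) shows that only about 0.3% of truly Gaussian values fall outside these boundaries:

import numpy as np

# synthetic, Normally distributed variable
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100_000)

upper_boundary = x.mean() + 3 * x.std()
lower_boundary = x.mean() - 3 * x.std()

# fraction of values flagged as outliers (expected to be ~0.003)
outliers = (x < lower_boundary) | (x > upper_boundary)
print(outliers.mean())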

If the distribution of the variable is skewed, a general approach is to calculate the quantiles, and then the inter-quantile range (IQR), as follows:

  • IQR = 75th quantile - 25th quantile

An outlier will sit outside the following upper and lower boundaries:

  • Upper boundary = 75th quantile + (IQR * 1.5)
  • Lower boundary = 25th quantile - (IQR * 1.5)

or, for extreme cases (see the sketch after these formulas):

  • Upper boundary = 75th quantile + (IQR * 3)
  • Lower boundary = 25th quantile - (IQR * 3)
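
Both variants are straightforward to compute with pandas quantiles. The sketch below is a minimal illustration on an artificially skewed sample (not part of the original notebook); the full helper functions used on the real datasets appear later in this post:

import numpy as np
import pandas as pd

# small, artificially skewed (exponential) sample
rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(scale=10, size=1_000))

IQR = s.quantile(0.75) - s.quantile(0.25)

# standard boundaries (1.5 * IQR) and extreme boundaries (3 * IQR)
upper_15, lower_15 = s.quantile(0.75) + 1.5 * IQR, s.quantile(0.25) - 1.5 * IQR
upper_3, lower_3 = s.quantile(0.75) + 3 * IQR, s.quantile(0.25) - 3 * IQR

print('1.5 * IQR boundaries:', lower_15, upper_15)
print('3 * IQR boundaries:  ', lower_3, upper_3)
print('values beyond 1.5 * IQR:', ((s < lower_15) | (s > upper_15)).sum())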

In this blog

We will:

  • Identify outliers in Normally distributed variables using the mean and standard deviation.
  • Identify outliers in skewed variables using the inter-quantile range (IQR).

Datasets for this notebook:

In this demo, we will use the Titanic dataset and the Boston house prices dataset from Scikit-learn.

Let's Start!

We will start by importing the necessary libraries.

# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to perform basic array operations
import numpy as np

# for plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plots
import scipy.stats as stats

# boston house dataset for the demo
from sklearn.datasets import load_boston

We will first load the Boston house prices dataset from sklearn. load_boston() loads and returns the Boston house-prices dataset, and its DESCR attribute gives the full description of the dataset.

from sklearn.datasets import load_boston
print(load_boston().DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now we will create a pandas dataframe which includes the independent variables. We will only use 3 variables, RM, LSTAT and CRIM, for this demo.

boston_dataset = load_boston()
boston = pd.DataFrame(boston_dataset.data,
                      columns=boston_dataset.feature_names)[[
                          'RM', 'LSTAT', 'CRIM'
                      ]]


boston.head()
      RM  LSTAT     CRIM
0  6.575   4.98  0.00632
1  6.421   9.14  0.02731
2  7.185   4.03  0.02729
3  6.998   2.94  0.03237
4  7.147   5.33  0.06905
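
A note for readers running recent library versions: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2. If the import above fails, one possible workaround (a sketch that assumes the raw StatLib file keeps its original layout of two lines per record) is to build the same dataframe directly from the source:

import numpy as np
import pandas as pd

# the raw file stores each record across two lines; stitch them back together
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])

feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = pd.DataFrame(data, columns=feature_names)[['RM', 'LSTAT', 'CRIM']]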

Now we will load two variables, Age and Fare, from the titanic dataset. Rows with missing values in these columns are removed for this demo using dropna().

titanic = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv',
                      usecols=['Age', 'Fare'])

titanic.dropna(subset=['Age', 'Fare'], inplace=True)
titanic.head()
    Age     Fare
0  22.0   7.2500
1  38.0  71.2833
2  26.0   7.9250
3  35.0  53.1000
4  35.0   8.0500

Identify variable distribution

In Normally distributed variables, outliers are those values that lie beyond the mean plus or minus 3 times the standard deviation. If the variables are skewed, however, we find outliers using the inter-quantile range. In order to decide which method to utilise to detect outliers, we first need to know the distribution of the variable.

We can use histograms and Q-Q plots to determine if the variable is normally distributed. We can also use boxplots to directly visualise the outliers. Boxplots are a standard way of displaying the distribution of a variable utilising the first quartile, the median, the third quartile and the whiskers.

Looking at a boxplot, you can easily identify:

  • The median, indicated by the line within the box.
  • The inter-quantile range (IQR), the box itself.
  • The quantiles: the 25th (Q1) is the lower end of the box and the 75th (Q3) the upper end.
  • The whiskers, which extend to:
      - top whisker: Q3 + 1.5 x IQR
      - bottom whisker: Q1 - 1.5 x IQR
[Figure: example boxplot showing the median, the IQR box and the whiskers]

Any value sitting outside the whiskers is considered an outlier. Let's look at the examples below.

We will create a function diagnostic_plots() which plots the histogram, Q-Q plot and box plot for the specified variable of the given dataframe.

def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.distplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('{} quantiles'.format(variable))

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()
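
One small caveat about the plotting code: sns.distplot has been deprecated since seaborn 0.11. If you are on a newer version of seaborn, the histogram line in the function above can be swapped for the roughly equivalent call below:

# newer seaborn API: histogram with a KDE overlay
sns.histplot(df[variable], bins=30, kde=True)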

Normally distributed variables

Let's start with the variable RM from the boston house dataset. RM is the average number of rooms per dwelling.

diagnostic_plots(boston, 'RM')

From the histogram and the Q-Q plot, we see that the variable RM approximates a Gaussian distribution quite well. In the boxplot, we see that the variable could have outliers, as there are many dots sitting outside the whiskers, at both tails of the distribution.

Now let's inspect the variable Age from the titanic dataset. It refers to the age of the passengers on board.

diagnostic_plots(titanic, 'Age')

From the histogram and the Q-Q plot, we see that the variable approximates a Gaussian distribution fairly well, although there is a deviation towards the smaller values of age. In the boxplot, we can see that the variable could have outliers, as there are many dots sitting outside the top whisker, at the right end of the distribution.

Skewed variables

Now we will plot the variable LSTAT from the boston house dataset. It gives the % lower status of the population.

diagnostic_plots(boston, 'LSTAT')

LSTAT is not normally distributed; it is skewed, with a tail to the right. According to the boxplot, there are some outliers at the right end of the distribution of the variable.

Let's see the distribution of the variable CRIM from the boston house dataset. CRIM is the per capita crime rate by town.

diagnostic_plots(boston, 'CRIM')

CRIM is heavily skewed, with a tail to the right. There seems to be quite a few outliers as well at the right end of the distribution, according to the boxplot.

Now for the variable Fare from the titanic dataset. Fare is the price paid for the ticket by the passengers.

diagnostic_plots(titanic, 'Fare')

Fare is also very skewed, and shows some unusual values at the right end of its distribution.

Now we will identify outliers using the mean and the standard deviation for the variables RM and Age from the boston and titanic datasets respectively. Then we will use the inter-quantile range to identify outliers for the variables LSTAT, CRIM and Fare from the boston and titanic datasets.

Outlier detection for Normally distributed variables

We will create a function find_normal_boundaries() which will find the upper and lower boundaries for normally distributed variables.

def find_normal_boundaries(df, variable):

    # calculate the boundaries outside which lie the outliers for a Gaussian distribution

    upper_boundary = df[variable].mean() + 3 * df[variable].std()
    lower_boundary = df[variable].mean() - 3 * df[variable].std()

    return upper_boundary, lower_boundary

Now we will calculate boundaries for RM using the function created above.

upper_boundary, lower_boundary = find_normal_boundaries(boston, 'RM')
upper_boundary, lower_boundary
(8.392485817597757, 4.176782957105816)

From this we can conclude that values bigger than 8.4 or smaller than 4.2 occur very rarely for the variable RM, so we can consider them outliers. Now we will inspect the number and percentage of outliers for RM.

print('Total number of houses: {}'.format(len(boston)))

print('Houses with more than 8.4 rooms (right end outliers): {}'.format(
    len(boston[boston['RM'] > upper_boundary])))

print('Houses with less than 4.2 rooms (left end outliers): {}'.format(
    len(boston[boston['RM'] < lower_boundary])))
print()
print('% right end outliers: {}'.format(
    len(boston[boston['RM'] > upper_boundary]) / len(boston)))

print('% left end outliers: {}'.format(
    len(boston[boston['RM'] < lower_boundary]) / len(boston)))
Total number of houses: 506
Houses with more than 8.4 rooms (right end outliers): 4
Houses with less than 4.2 rooms (left end outliers): 4

% right end outliers: 0.007905138339920948
% left end outliers: 0.007905138339920948

Using Extreme Value Analysis we identified outliers at both ends of the distribution of RM. The percentage of outliers is small (~1.6% considering the 2 tails together), which makes sense, because outliers are precisely that: rare values, rare occurrences.

Let's move on to Age in the titanic dataset.

# calculate boundaries for Age in the titanic

upper_boundary, lower_boundary = find_normal_boundaries(titanic, 'Age')
upper_boundary, lower_boundary
(73.27860964406095, -13.88037434994331)

The upper boundary is 73 years, which means that there were very few passengers older than 73, if any, on the Titanic. The lower boundary is negative; because negative ages do not exist, it only makes sense to look for outliers utilising the upper boundary.

# lets look at the number and percentage of outliers

print('Total passengers: {}'.format(len(titanic)))

print('Passengers older than 73: {}'.format(
    len(titanic[titanic['Age'] > upper_boundary])))
print()
print('% of passengers older than 73: {}'.format(
    len(titanic[titanic['Age'] > upper_boundary]) / len(titanic)))
Total passengers: 714
Passengers older than 73: 2

% of passengers older than 73: 0.0028011204481792717

There were 2 passengers older than 73 on board the Titanic. They could be considered outliers, because the majority of the passengers were much younger.

Outlier detection for skewed variables

We will create a function find_skewed_boundaries() to find the upper and lower boundaries for skewed variables. The distance passed as an argument gives us the option to use 1.5 times or 3 times the IQR to calculate the boundaries.

def find_skewed_boundaries(df, variable, distance):

    IQR = df[variable].quantile(0.75) - df[variable].quantile(0.25)

    lower_boundary = df[variable].quantile(0.25) - (IQR * distance)
    upper_boundary = df[variable].quantile(0.75) + (IQR * distance)

    return upper_boundary, lower_boundary

We will look for outliers in LSTAT from the boston house dataset using the interquantile proximity rule IQR * 1.5, the standard metric.

upper_boundary, lower_boundary = find_skewed_boundaries(boston, 'LSTAT', 1.5)
upper_boundary, lower_boundary
(31.962500000000006, -8.057500000000005)

Now let's look at the number and percentage of outliers for LSTAT.

print('Total houses: {}'.format(len(boston)))

print('Houses with LSTAT bigger than 32: {}'.format(
    len(boston[boston['LSTAT'] > upper_boundary])))
print()
print('% of houses with LSTAT bigger than 32: {}'.format(
    len(boston[boston['LSTAT'] > upper_boundary])/len(boston)))
Total houses: 506
Houses with LSTAT bigger than 32: 7

% of houses with LSTAT bigger than 32: 0.01383399209486166

The upper boundary shows a value ~32. The lower boundary is negative, however the variable LSTAT does not take negative values. So to calculate the outliers for LSTAT we will only use the upper boundary. This coincides with what we observed in the boxplot earlier. Outliers sit only at the right tail of LSTAT's distribution.

We observe 7 houses, about 1.4% of the dataset, with extremely high values for LSTAT.

Now we will look for outliers in CRIM using the interquantile proximity rule with IQR * 3, since this time we are looking for extreme values.

upper_boundary, lower_boundary = find_skewed_boundaries(boston, 'CRIM', 3)
upper_boundary, lower_boundary
(14.462195000000001, -10.7030675)

Let's look at the number and percentage of outliers for CRIM.

print('Total houses: {}'.format(len(boston)))

print('Houses with CRIM bigger than 14: {}'.format(
    len(boston[boston['CRIM'] > upper_boundary])))
print()
print('% of houses with CRIM bigger than 14: {}'.format(
    len(boston[boston['CRIM'] > upper_boundary]) / len(boston)))
Total houses: 506
Houses with CRIM bigger than 14: 30

% of houses with CRIM bigger than 14: 0.05928853754940711

Using 3 times the inter-quantile range to find outliers, we find that ~6% of the houses are in areas with unusually high crime rates. For CRIM as well, the lower boundary is negative, so it only makes sense to use the upper boundary to calculate outliers, as the variable takes only positive values. This coincides with what we observed in CRIM's boxplot earlier in this blog.

Lastly, we will identify outliers in Fare from the titanic dataset. We will look again for extreme values using IQR * 3.

upper_boundary, lower_boundary = find_skewed_boundaries(titanic, 'Fare', 3)
upper_boundary, lower_boundary
(109.35, -67.925)

Let's look at the number and percentage of passengers who paid extremely high Fares.

print('Total passengers: {}'.format(len(titanic)))

print('Passengers who paid more than 109: {}'.format(
    len(titanic[titanic['Fare'] > upper_boundary])))
print()
print('% of passengers who paid more than 109: {}'.format(
    len(titanic[titanic['Fare'] > upper_boundary])/len(titanic)))
Total passengers: 714
Passengers who paid more than 109: 44

% of passengers who paid more than 109: 0.06162464985994398

For Fare, as well as for all the other variables in this notebook which show a tail to the right, the lower boundary is negative. So we used the upper boundary to determine the outliers. We observe that 6% of the values of the dataset fall above the boundary.

Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data. Outliers in the input data can skew and mislead the training process, resulting in longer training times, less accurate models and ultimately poorer results. Even before predictive models are trained, outliers can lead to misleading representations, and in turn misleading interpretations, of the collected data. Outliers can skew summary statistics such as the mean and standard deviation, as well as plots such as histograms and scatterplots, compressing the body of the data. On the other hand, outliers can represent data instances that are relevant to the problem, such as anomalies in the case of fraud detection and computer security.
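
Finally, if after inspection the flagged values are judged to come from errors rather than from genuine rare observations, the boundaries computed above can be used to trim them. The snippet below is a minimal sketch (not part of the original notebook) that reuses the find_skewed_boundaries() helper defined earlier to drop the Fare outliers from the titanic dataframe:

# drop Fare outliers using the IQR * 3 boundaries
upper_boundary, lower_boundary = find_skewed_boundaries(titanic, 'Fare', 3)

titanic_trimmed = titanic[(titanic['Fare'] >= lower_boundary) &
                          (titanic['Fare'] <= upper_boundary)]

print('rows before trimming:', len(titanic))
print('rows after trimming:', len(titanic_trimmed))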