Missing Values and Their Mechanisms

In real-world data science projects, you will rarely work with complete datasets. Incomplete data is an unavoidable problem across almost all industries, and how you choose to handle these missing records can have a massive impact on your model's accuracy and conclusions.

Missing values occur when no data value is stored for an observation within a variable. To build robust machine learning pipelines, you must understand the mechanisms by which missing values are introduced. Depending on why the data is missing, you will choose completely different imputation strategies.

In this tutorial, you will explore the three primary missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). You will learn how to detect and quantify missingness using Python, analyze patterns in the Titanic and peer-to-peer lending datasets, and choose appropriate processing techniques.

Prerequisites: Python 3.x, Pandas, NumPy.

Missing Data Mechanisms

Statistical analysis defines three distinct categories of missingness based on why the values are absent:

Missing Completely at Random (MCAR): The probability of a value being missing is identical for all observations. The missingness is completely random and independent of any observed or unobserved features. Disregarding these rows does not bias model parameters.
Missing at Random (MAR): The probability of a value being missing depends on other observed variables in the dataset. For instance, if men are less likely to report their weight than women, the missingness of weight depends on the observed gender column.
Missing Not at Random (MNAR): There is a systematic reason why values are missing. The probability of missingness is directly related to the variable itself. For example, individuals with higher depression levels might be less likely to complete a mental health survey.

Setup and Libraries

Import the data processing libraries and configure Pandas to display all columns:

PYTHON

import pandas as pd
import numpy as np

# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)

Load the Titanic dataset into a Pandas DataFrame:

PYTHON

data = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv")
data.head()

OUTPUT

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Count the total number of missing values in each column:

PYTHON

data.isnull().sum()

OUTPUT

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Calculate the percentage of missing values in each column:

PYTHON

data.isnull().mean()

OUTPUT

PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

The output shows that Age is missing approximately 20% of its values, Cabin is missing 77%, and Embarked is missing only 0.2%.

Identifying MNAR (Missing Not at Random)

In the Titanic dataset, both Age and Cabin are missing systematically (MNAR). For passengers who did not survive, details like their age or cabin numbers could not be recovered. The survivors, on the other hand, were able to provide this information.

To test this hypothesis, create a binary column cabin_null indicating if the Cabin value is missing:

PYTHON

data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data['cabin_null'].head()

OUTPUT

0    1
1    0
2    1
3    0
4    1
Name: cabin_null, dtype: int32

Calculate the proportion of missing Cabin records grouped by whether the passenger survived (1) or died (0):

PYTHON

data.groupby(['Survived'])['cabin_null'].mean()

OUTPUT

Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

You can achieve the same analysis using a more compact syntax:

PYTHON

data['Cabin'].isnull().groupby(data['Survived']).mean()

OUTPUT

Survived
0    0.876138
1    0.602339
Name: Cabin, dtype: float64

The results show that 87.6% of passengers who died have missing Cabin values, compared to only 60.2% of survivors, supporting the hypothesis of systematic data loss (MNAR).

Repeat the same grouped null analysis for the Age variable:

PYTHON

data['age_null'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['age_null'].mean()

OUTPUT

Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64

Verify using the simplified syntax:

PYTHON

data['Age'].isnull().groupby(data['Survived']).mean()

OUTPUT

Survived
0    0.227687
1    0.152047
Name: Age, dtype: float64

The data shows a higher rate of missing ages among those who died (22.8%) compared to survivors (15.2%), suggesting systematic missingness.

Identifying MCAR (Missing Completely at Random)

The Embarked variable has only 2 missing entries. Display the rows containing these missing values:

PYTHON

data[data['Embarked'].isnull()]

OUTPUT

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	cabin_null	age_null
61	62	1	1	Icard, Miss. Amelie	female	38.0	0	0	113572	80.0	B28	NaN	0	0
829	830	1	1	Stone, Mrs. George Nelson (Martha Evelyn)	female	62.0	0	0	113572	80.0	B28	NaN	0	0

These two passengers survived, meaning they could have been asked for their embarkation port. The missingness is likely due to clerical error or data loss during collection, meaning it is Missing Completely at Random (MCAR).

Identifying MAR (Missing at Random)

To illustrate MAR, load the employment and time_employed columns from the peer-to-peer lending dataset:

PYTHON

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv', usecols=['employment', 'time_employed'])
data.head()

OUTPUT

	employment	time_employed
0	Teacher	<=5 years
1	Accountant	<=5 years
2	Statistician	<=5 years
3	Other	<=5 years
4	Bus driver	>5 years

Check the percentage of missing values in both columns:

PYTHON

data.isnull().mean()

OUTPUT

employment       0.0611
time_employed    0.0529
dtype: float64

Inspect the unique categories present in the employment column:

PYTHON

# number of different employments
print('Number of employment categories: {}'.format(
    len(data['employment'].unique())))

# Examples of employments
data['employment'].unique()

OUTPUT

Number of employment categories: 12

array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

Inspect the unique values present in the time_employed column:

PYTHON

data['time_employed'].unique()

OUTPUT

array(['5 years', nan], dtype=object)

Calculate the proportion of missing values in the time_employed column for borrowers who declared their employment:

PYTHON

# customers who declared employment
t = data[~data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()

OUTPUT

0.0005325380764724678

Calculate the proportion of missing values in time_employed for borrowers who did not declare their employment:

PYTHON

# customers who did not declare employment
t = data[data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()

OUTPUT

0.8576104746317512

The rate of missingness in time_employed is extremely low (0.05%) for employed customers, but extremely high (85.8%) for customers without declared employment. This strong dependency on an observed variable (employment) confirms that time_employed is Missing at Random (MAR).

Conclusion

In this tutorial, you explored the three types of missing data mechanisms: MCAR, MAR, and MNAR. Using the Titanic and lending datasets, you calculated null proportions, sliced missing rows, and grouped missingness against observed variables to prove systematic data loss. You learned why understanding these mechanisms is a critical prerequisite to choosing the correct imputation strategy.

Key takeaways:

MCAR: Values are missing completely at random. The missingness is a random subset of the dataset and does not introduce bias when rows are dropped.
MAR: Missingness depends on other observed columns. To prevent bias during imputation, you must control for these related features in your pipeline.
MNAR: Values are missing due to systematic reasons. Imputing MNAR features requires creating indicator variables or using domain-specific models to prevent loss of critical predictive signals.

Next steps:

Learn how high cardinality impacts machine learning models in Cardinality in Machine Learning.
Master dataset wrangling operations with Pandas Crash Course.

Missing Values and Their Mechanisms

Topics You Will Master

Missing Data Mechanisms

Setup and Libraries

Identifying MNAR (Missing Not at Random)

Identifying MCAR (Missing Completely at Random)

Identifying MAR (Missing at Random)

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments