In real-world data science projects, you will rarely work with complete datasets. Incomplete data is an unavoidable problem across almost all industries, and how you choose to handle these missing records can have a massive impact on your model's accuracy and conclusions.
Missing values occur when no data value is stored for an observation within a variable. To build robust machine learning pipelines, you must understand the mechanisms by which missing values are introduced. Depending on why the data is missing, you will choose completely different imputation strategies.
In this tutorial, you will explore the three primary missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). You will learn how to detect and quantify missingness using Python, analyze patterns in the Titanic and peer-to-peer lending datasets, and choose appropriate processing techniques.
Prerequisites: Python 3.x, Pandas, NumPy.
Missing Data Mechanisms
Statistical analysis defines three distinct categories of missingness based on why the values are absent:
- Missing Completely at Random (MCAR): The probability of a value being missing is identical for all observations. The missingness is completely random and independent of any observed or unobserved features. Disregarding these rows does not bias model parameters.
- Missing at Random (MAR): The probability of a value being missing depends on other observed variables in the dataset. For instance, if men are less likely to report their weight than women, the missingness of weight depends on the observed gender column.
- Missing Not at Random (MNAR): There is a systematic reason why values are missing. The probability of missingness is directly related to the variable itself. For example, individuals with higher depression levels might be less likely to complete a mental health survey.
Setup and Libraries
Import the data processing libraries and configure Pandas to display all columns:
import pandas as pd
import numpy as np
# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)
Load the Titanic dataset into a Pandas DataFrame:
data = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv")
data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Count the total number of missing values in each column:
data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Calculate the percentage of missing values in each column:
data.isnull().mean()
PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin 0.771044
Embarked 0.002245
dtype: float64
The output shows that Age is missing approximately 20% of its values, Cabin is missing 77%, and Embarked is missing only 0.2%.
Identifying MNAR (Missing Not at Random)
In the Titanic dataset, both Age and Cabin are missing systematically (MNAR). For passengers who did not survive, details like their age or cabin numbers could not be recovered. The survivors, on the other hand, were able to provide this information.
To test this hypothesis, create a binary column cabin_null indicating if the Cabin value is missing:
data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data['cabin_null'].head()
0 1
1 0
2 1
3 0
4 1
Name: cabin_null, dtype: int32
Calculate the proportion of missing Cabin records grouped by whether the passenger survived (1) or died (0):
data.groupby(['Survived'])['cabin_null'].mean()
Survived
0 0.876138
1 0.602339
Name: cabin_null, dtype: float64
You can achieve the same analysis using a more compact syntax:
data['Cabin'].isnull().groupby(data['Survived']).mean()
Survived
0 0.876138
1 0.602339
Name: Cabin, dtype: float64
The results show that 87.6% of passengers who died have missing Cabin values, compared to only 60.2% of survivors, supporting the hypothesis of systematic data loss (MNAR).
Repeat the same grouped null analysis for the Age variable:
data['age_null'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['age_null'].mean()
Survived
0 0.227687
1 0.152047
Name: age_null, dtype: float64
Verify using the simplified syntax:
data['Age'].isnull().groupby(data['Survived']).mean()
Survived
0 0.227687
1 0.152047
Name: Age, dtype: float64
The data shows a higher rate of missing ages among those who died (22.8%) compared to survivors (15.2%), suggesting systematic missingness.
Identifying MCAR (Missing Completely at Random)
The Embarked variable has only 2 missing entries. Display the rows containing these missing values:
data[data['Embarked'].isnull()]
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null | age_null | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |
These two passengers survived, meaning they could have been asked for their embarkation port. The missingness is likely due to clerical error or data loss during collection, meaning it is Missing Completely at Random (MCAR).
Identifying MAR (Missing at Random)
To illustrate MAR, load the employment and time_employed columns from the peer-to-peer lending dataset:
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv', usecols=['employment', 'time_employed'])
data.head()
| employment | time_employed | |
|---|---|---|
| 0 | Teacher | <=5 years |
| 1 | Accountant | <=5 years |
| 2 | Statistician | <=5 years |
| 3 | Other | <=5 years |
| 4 | Bus driver | >5 years |
Check the percentage of missing values in both columns:
data.isnull().mean()
employment 0.0611
time_employed 0.0529
dtype: float64
Inspect the unique categories present in the employment column:
# number of different employments
print('Number of employment categories: {}'.format(
len(data['employment'].unique())))
# Examples of employments
data['employment'].unique()
Number of employment categories: 12
array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
'Civil Servant', 'Dentist'], dtype=object)
Inspect the unique values present in the time_employed column:
data['time_employed'].unique()
array(['5 years', nan], dtype=object)
Calculate the proportion of missing values in the time_employed column for borrowers who declared their employment:
# customers who declared employment
t = data[~data['employment'].isnull()]
# percentage of missing data in time employed
t['time_employed'].isnull().mean()
0.0005325380764724678
Calculate the proportion of missing values in time_employed for borrowers who did not declare their employment:
# customers who did not declare employment
t = data[data['employment'].isnull()]
# percentage of missing data in time employed
t['time_employed'].isnull().mean()
0.8576104746317512
The rate of missingness in time_employed is extremely low (0.05%) for employed customers, but extremely high (85.8%) for customers without declared employment. This strong dependency on an observed variable (employment) confirms that time_employed is Missing at Random (MAR).
Conclusion
In this tutorial, you explored the three types of missing data mechanisms: MCAR, MAR, and MNAR. Using the Titanic and lending datasets, you calculated null proportions, sliced missing rows, and grouped missingness against observed variables to prove systematic data loss. You learned why understanding these mechanisms is a critical prerequisite to choosing the correct imputation strategy.
Key takeaways:
- MCAR: Values are missing completely at random. The missingness is a random subset of the dataset and does not introduce bias when rows are dropped.
- MAR: Missingness depends on other observed columns. To prevent bias during imputation, you must control for these related features in your pipeline.
- MNAR: Values are missing due to systematic reasons. Imputing MNAR features requires creating indicator variables or using domain-specific models to prevent loss of critical predictive signals.
Next steps:
- Learn how high cardinality impacts machine learning models in Cardinality in Machine Learning.
- Master dataset wrangling operations with Pandas Crash Course.
