#aarya tadvalkar#kgp talkie#MAR#MCAR#missing data mechanisms#missing values#MNAR#Titanic Dataset

Missing Values and Their Mechanisms

Understand MCAR, MAR, and MNAR missing data mechanisms and their impact on machine learning. Covers detection, analysis, and treatment strategies using Python.

May 19, 2026 at 7:30 PM6 min readFollowFollow (Hindi)

Topics You Will Master

Three missing data mechanisms: MCAR, MAR, and MNAR
How each mechanism affects model training and predictions
Detecting and visualizing missingness patterns in Python
Choosing appropriate imputation strategies per mechanism
Best For

Data scientists working with real-world incomplete datasets who need principled imputation strategies.

Expected Outcome

A clear understanding of why data is missing and which imputation method to apply for each case.

In real-world data science projects, you will rarely work with complete datasets. Incomplete data is an unavoidable problem across almost all industries, and how you choose to handle these missing records can have a massive impact on your model's accuracy and conclusions.

Missing values occur when no data value is stored for an observation within a variable. To build robust machine learning pipelines, you must understand the mechanisms by which missing values are introduced. Depending on why the data is missing, you will choose completely different imputation strategies.

In this tutorial, you will explore the three primary missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). You will learn how to detect and quantify missingness using Python, analyze patterns in the Titanic and peer-to-peer lending datasets, and choose appropriate processing techniques.

Prerequisites: Python 3.x, Pandas, NumPy.

Missing Data Mechanisms

Statistical analysis defines three distinct categories of missingness based on why the values are absent:

  • Missing Completely at Random (MCAR): The probability of a value being missing is identical for all observations. The missingness is completely random and independent of any observed or unobserved features. Disregarding these rows does not bias model parameters.
  • Missing at Random (MAR): The probability of a value being missing depends on other observed variables in the dataset. For instance, if men are less likely to report their weight than women, the missingness of weight depends on the observed gender column.
  • Missing Not at Random (MNAR): There is a systematic reason why values are missing. The probability of missingness is directly related to the variable itself. For example, individuals with higher depression levels might be less likely to complete a mental health survey.

Setup and Libraries

Import the data processing libraries and configure Pandas to display all columns:

PYTHON
import pandas as pd
import numpy as np

# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)

Load the Titanic dataset into a Pandas DataFrame:

PYTHON
data = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv")
data.head()
OUTPUT
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS

Count the total number of missing values in each column:

PYTHON
data.isnull().sum()
OUTPUT
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Calculate the percentage of missing values in each column:

PYTHON
data.isnull().mean()
OUTPUT
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64

The output shows that Age is missing approximately 20% of its values, Cabin is missing 77%, and Embarked is missing only 0.2%.


Identifying MNAR (Missing Not at Random)

In the Titanic dataset, both Age and Cabin are missing systematically (MNAR). For passengers who did not survive, details like their age or cabin numbers could not be recovered. The survivors, on the other hand, were able to provide this information.

To test this hypothesis, create a binary column cabin_null indicating if the Cabin value is missing:

PYTHON
data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data['cabin_null'].head()
OUTPUT
0    1
1    0
2    1
3    0
4    1
Name: cabin_null, dtype: int32

Calculate the proportion of missing Cabin records grouped by whether the passenger survived (1) or died (0):

PYTHON
data.groupby(['Survived'])['cabin_null'].mean()
OUTPUT
Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

You can achieve the same analysis using a more compact syntax:

PYTHON
data['Cabin'].isnull().groupby(data['Survived']).mean()
OUTPUT
Survived
0    0.876138
1    0.602339
Name: Cabin, dtype: float64

The results show that 87.6% of passengers who died have missing Cabin values, compared to only 60.2% of survivors, supporting the hypothesis of systematic data loss (MNAR).

Repeat the same grouped null analysis for the Age variable:

PYTHON
data['age_null'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['age_null'].mean()
OUTPUT
Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64

Verify using the simplified syntax:

PYTHON
data['Age'].isnull().groupby(data['Survived']).mean()
OUTPUT
Survived
0    0.227687
1    0.152047
Name: Age, dtype: float64

The data shows a higher rate of missing ages among those who died (22.8%) compared to survivors (15.2%), suggesting systematic missingness.


Identifying MCAR (Missing Completely at Random)

The Embarked variable has only 2 missing entries. Display the rows containing these missing values:

PYTHON
data[data['Embarked'].isnull()]
OUTPUT
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedcabin_nullage_null
616211Icard, Miss. Ameliefemale38.00011357280.0B28NaN00
82983011Stone, Mrs. George Nelson (Martha Evelyn)female62.00011357280.0B28NaN00

These two passengers survived, meaning they could have been asked for their embarkation port. The missingness is likely due to clerical error or data loss during collection, meaning it is Missing Completely at Random (MCAR).


Identifying MAR (Missing at Random)

To illustrate MAR, load the employment and time_employed columns from the peer-to-peer lending dataset:

PYTHON
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv', usecols=['employment', 'time_employed'])
data.head()
OUTPUT
employmenttime_employed
0Teacher<=5 years
1Accountant<=5 years
2Statistician<=5 years
3Other<=5 years
4Bus driver>5 years

Check the percentage of missing values in both columns:

PYTHON
data.isnull().mean()
OUTPUT
employment       0.0611
time_employed    0.0529
dtype: float64

Inspect the unique categories present in the employment column:

PYTHON
# number of different employments
print('Number of employment categories: {}'.format(
    len(data['employment'].unique())))

# Examples of employments
data['employment'].unique()
OUTPUT
Number of employment categories: 12

array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

Inspect the unique values present in the time_employed column:

PYTHON
data['time_employed'].unique()
OUTPUT
array(['5 years', nan], dtype=object)

Calculate the proportion of missing values in the time_employed column for borrowers who declared their employment:

PYTHON
# customers who declared employment
t = data[~data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()
OUTPUT
0.0005325380764724678

Calculate the proportion of missing values in time_employed for borrowers who did not declare their employment:

PYTHON
# customers who did not declare employment
t = data[data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()
OUTPUT
0.8576104746317512

The rate of missingness in time_employed is extremely low (0.05%) for employed customers, but extremely high (85.8%) for customers without declared employment. This strong dependency on an observed variable (employment) confirms that time_employed is Missing at Random (MAR).

Conclusion

In this tutorial, you explored the three types of missing data mechanisms: MCAR, MAR, and MNAR. Using the Titanic and lending datasets, you calculated null proportions, sliced missing rows, and grouped missingness against observed variables to prove systematic data loss. You learned why understanding these mechanisms is a critical prerequisite to choosing the correct imputation strategy.

Key takeaways:

  • MCAR: Values are missing completely at random. The missingness is a random subset of the dataset and does not introduce bias when rows are dropped.
  • MAR: Missingness depends on other observed columns. To prevent bias during imputation, you must control for these related features in your pipeline.
  • MNAR: Values are missing due to systematic reasons. Imputing MNAR features requires creating indicator variables or using domain-specific models to prevent loss of critical predictive signals.

Next steps:

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments