Feature Engineering Series Tutorial 1: Missing Values and Their Mechanisms

Published by georgiannacambel

Missing data, or missing values, occur when no data / no value is stored for certain observations within a variable.

Incomplete data is an unavoidable problem in most data sources, and may have a significant impact on the conclusions that can be derived from the data.

Why is data missing?

The source of missing data can be very different. These are just a few examples:

  • A value is missing because it was forgotten, lost or not stored properly
  • For a certain observation, the value does not exist
  • The value can't be known or identified

In many organisations, information is collected into a form by a person talking with a client on the phone, or alternatively, by customers filling forms online. Often, the person entering the data does not complete all the fields in the form. Many of the fields are not compulsory, which may lead to missing values.

The reasons for omitting the information can vary: perhaps the person does not want to disclose some information, for example income; or they do not know the answer; or the answer is not applicable in their circumstances; or, on the contrary, the person in the organisation wants to spare the customer some time and therefore omits asking questions they think are less relevant.

There are other cases where the value for a certain variable simply does not exist. For example, for the variable 'total debt as percentage of total income' (very common in financial data), if the person has no income the percentage cannot be calculated, as it would require dividing by zero, and therefore the value will be missing.
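To make this concrete, here is a minimal sketch with invented borrower figures: a zero income is treated as missing before the division, so the NaN propagates into the ratio instead of producing a meaningless value.

```python
import numpy as np
import pandas as pd

# Hypothetical borrowers; the last one has no income
df = pd.DataFrame({'debt': [5000, 12000, 3000],
                   'income': [40000, 60000, 0]})

# The ratio does not exist when income is 0, so we treat a zero income
# as missing before dividing; the NaN then propagates into the ratio
df['debt_to_income'] = df['debt'] / df['income'].replace(0, np.nan) * 100
print(df)
```

The third borrower ends up with a NaN in debt_to_income, which is exactly the kind of "value does not exist" missingness described above.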

It is important to understand the mechanisms by which missing data are introduced into a dataset. Depending on the mechanism, we may choose to process the missing values differently. In addition, by knowing the source of missing data, we may choose to take action to control that source and reduce the amount of missing information in future data collection.

Missing Data Mechanisms

There are 3 mechanisms that lead to missing data, 2 of them involve missing data randomly or almost-randomly, and the third one involves a systematic loss of data.

Missing Completely at Random (MCAR):

A variable is missing completely at random (MCAR) if the probability of being missing is the same for all the observations. When data is MCAR, there is absolutely no relationship between the data missing and any other values, observed or missing, within the dataset. In other words, those missing data points are a random subset of the data. There is nothing systematic going on that makes some data more likely to be missing than others. If values are missing completely at random, then disregarding those cases does not bias the inferences made.
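A minimal simulation makes the definition concrete. The variable and the 10% missingness probability below are invented for illustration: every observation has the same chance of being dropped, independent of any value in the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A hypothetical variable: 1000 observed measurements
df = pd.DataFrame({'measurement': rng.normal(50, 10, size=1000)})

# MCAR: every observation has the same 10% chance of being missing,
# regardless of any value in the dataset
mcar_mask = rng.random(1000) < 0.10
df.loc[mcar_mask, 'measurement'] = np.nan

# The share of missing values is close to the 10% we injected
print(df['measurement'].isnull().mean())
```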

Missing at Random (MAR):

MAR occurs when there is a relationship between the propensity of missing values and the observed data. In other words, the probability of an observation being missing depends on available information (i.e., other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who do not disclose their weight, but as men are more prone to disclose it, there will be more missing values for women than for men.

In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender to control the bias in weight for the missing observations.
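The weight example above can be simulated in a few lines. The group sizes and missingness probabilities (10% for men, 40% for women) are invented for illustration; the key point is that missingness depends on an observed variable, gender.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical survey: 500 men and 500 women reporting their weight
df = pd.DataFrame({
    'gender': ['male'] * 500 + ['female'] * 500,
    'weight': rng.normal(70, 12, size=1000),
})

# MAR: the chance of a missing weight depends on an OBSERVED variable,
# gender -- 10% for men, 40% for women (illustrative numbers)
p_missing = np.where(df['gender'] == 'male', 0.10, 0.40)
df.loc[rng.random(1000) < p_missing, 'weight'] = np.nan

# Missingness differs clearly between the two observed groups
print(df['weight'].isnull().groupby(df['gender']).mean())
```

Because gender is fully observed, it can be used to model or control the missingness in weight, which is what distinguishes MAR from MNAR.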

Missing Not at Random (MNAR):

Data is missing not at random (MNAR) when there is a mechanism or a reason why the missing values are introduced in the dataset. For example, MNAR would occur if people failed to fill in a depression survey because of their level of depression. Here, the missingness is related to the outcome itself, depression. Similarly, when a financial company asks customers for bank and identity documents in order to prevent identity fraud, fraudsters impersonating someone else will typically not upload the documents, precisely because they do not have them. Therefore, there is a systematic relationship between the missing documents and the target we want to predict: fraud.
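The depression example can also be simulated. In this sketch the latent score is invented and scaled to 0-1, and the probability of skipping the survey is set equal to the score itself, so the value drives its own missingness:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical depression survey: a latent score on a 0-1 scale
true_score = rng.random(1000)

# MNAR: the chance of skipping the survey grows with the score itself,
# so the (unobserved) value drives its own missingness
observed = np.where(rng.random(1000) < true_score, np.nan, true_score)
df = pd.DataFrame({'depression_score': observed})

# The observed mean underestimates the true mean,
# because high scores go missing more often
print(true_score.mean(), df['depression_score'].mean())
```

Unlike MCAR or MAR, no other observed variable can fully explain this missingness, which is why MNAR is the hardest mechanism to correct for.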

Understanding the mechanism by which data is missing is important to decide which methods to use to impute the missing values.

In this Blog:

In the following cells we will:

  • Learn how to detect and quantify missing values
  • Try to identify the 3 different mechanisms of missing data introduction

We will use the toy Loan dataset and the Titanic dataset.

Let's start!

Here we have imported the necessary libraries.

  • pandas is used to read the dataset into a dataframe and perform operations on it
  • numpy is used to perform basic array operations
import pandas as pd
import numpy as np

# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)

We will begin by reading the dataset using read_csv() and displaying the first 5 rows using head(). In pandas, missing values are represented as NaN; see for example the first row of the variable Cabin.

data = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv")
data.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

We will now see the total number of missing values for each column using isnull() and sum(). There are 177 missing values for Age, 687 for Cabin and 2 for Embarked.

data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Alternatively, we can use the mean() method after isnull() to get the percentage of missing values for each variable. As we can see, almost 20% of the values in Age are missing. Cabin, which indicates the cabin in which the passenger was travelling, has 77% missing values, and Embarked, which specifies the port where the passenger boarded the Titanic, has approximately 0.2% missing values.

data.isnull().mean()
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
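The counts and percentages can also be combined into a single summary table. A minimal sketch using a small invented frame (the toy values below are not from the Titanic data, but the same pattern applies directly to it):

```python
import numpy as np
import pandas as pd

# A toy frame standing in for the Titanic data (invented values)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# Combine counts and percentages into one summary and keep only
# the variables that actually contain missing values
summary = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean(),
})
print(summary[summary['n_missing'] > 0])
```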

Mechanisms of Missing Data

Missing data Not At Random (MNAR): Systematic missing values

In the Titanic dataset, the missing values of the variables Age and Cabin were introduced systematically. For many of the people who did not survive, the age or the cabin they were traveling in could not be established, because they could no longer be asked for that information. The people who survived, on the other hand, could be asked.

Can we infer this by looking at the data?

In a situation like this, we could expect a greater number of missing values for people who did not survive.

Let's have a look.

We will create a new column cabin_null as a binary variable: its value will be 1 if the corresponding value of Cabin is NaN, and 0 otherwise. The np.where() function evaluates a condition element-wise and returns one value where the condition is true and another where it is false. Here we use it to flag the rows where Cabin is NaN.

data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data['cabin_null'].head()
0    1
1    0
2    1
3    0
4    1
Name: cabin_null, dtype: int32

Now we will evaluate the percentage of missing values in Cabin for the people who survived and the people who did not survive. The variable Survived takes the value 1 if the passenger survived and 0 otherwise. The groupby() function is used to split the data into groups based on some criteria; here we split the passengers into survivors and non-survivors. Then we compute the percentage of missing values in Cabin for each of the 2 groups.

data.groupby(['Survived'])['cabin_null'].mean()
Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64

Another way of doing the same thing in fewer lines of code is shown below.

data['Cabin'].isnull().groupby(data['Survived']).mean()
Survived
0    0.876138
1    0.602339
Name: Cabin, dtype: float64

We observe that the percentage of missing values is higher for people who did not survive (87%), compared to people who survived (60%). This finding supports our hypothesis that the data is missing because after people died, the information could not be retrieved.

Note: Having said this, to truly underpin whether the data is missing not at random, we would need to get extremely familiar with the way data was collected. Analysing datasets can only show us the right direction or help us build assumptions.

Let's do the same for the variable Age. First we will create a binary variable age_null which indicates whether the value of age is missing or not. Then we will compute the mean for the 2 survival groups.

data['age_null'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['age_null'].mean()
Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64

The same thing with simpler code is shown below.

data['Age'].isnull().groupby(data['Survived']).mean()
Survived
0    0.227687
1    0.152047
Name: Age, dtype: float64

Again, we observe a higher proportion of missing data for the people who did not survive the tragedy. The analysis therefore suggests that there is a systematic loss of data: people who did not survive tend to have more missing information. Presumably, the method chosen to gather the information contributed to the generation of these missing data.

Missing data Completely At Random (MCAR)

In the dataset the variable Embarked also contains missing values. We will slice the dataframe data to show only the rows where Embarked has value NaN.

data[data['Embarked'].isnull()]
     PassengerId  Survived  Pclass                                       Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked  cabin_null  age_null
61            62         1       1                        Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN           0         0
829          830         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN           0         0

These 2 women were traveling together; Miss Amelie was the maid of Mrs. Stone.

There does not seem to be any indication that the missing information in the variable Embarked depends on any other variable. And the fact that these women survived means that they could have been asked for this information.

Very likely the values were lost at the time of building the dataset.

If these values are MCAR, the probability of the data being missing for these 2 women is the same as the probability of values being missing for any other person on the Titanic. Of course, this will be hard, if at all possible, to prove. But I hope this serves as a demonstration.

Missing data at Random (MAR)

For this example we will use the loan book toy dataset from a fictitious peer-to-peer lending company.

We will look at the variables employment and time_employed, which records the years in employment; both are declared by the borrowers at the time of applying for a loan.

In this example, missing data in employment are associated with missing data in time_employed.

We will begin by reading the dataset and loading only the columns of interest using the usecols parameter.

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv', usecols=['employment', 'time_employed'])
data.head()
     employment time_employed
0       Teacher     <=5 years
1    Accountant     <=5 years
2  Statistician     <=5 years
3         Other     <=5 years
4    Bus driver      >5 years

Now we will check the percentage of missing data. We see that both the variables have approximately the same percentage of missing observations.

data.isnull().mean()
employment       0.0611
time_employed    0.0529
dtype: float64

Now we will inspect the different categories of employment. We will also display all the categories i.e. all the unique() values in employment.

# number of different employments
print('Number of employment categories: {}'.format(
    len(data['employment'].unique())))

# Examples of employments
data['employment'].unique()
Number of employment categories: 12
array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)

We can see that employment contains missing information (NaN) alongside several different employment categories. Now we will inspect the time_employed variable, starting by displaying all its unique() values.

data['time_employed'].unique()
array(['<=5 years', '>5 years', nan], dtype=object)

A customer can't enter a value for time in employment if they are not employed: they could be students, retired, self-employed, or working in the house. So we can see how these 2 variables are related to each other.
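One way to see this relationship directly is to cross-tabulate the two null indicators. This is a sketch assuming the loan dataset loaded above; pd.crosstab with normalize='index' shows, for each employment-missingness group, the share of missing time_employed values:

```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv',
    usecols=['employment', 'time_employed'])

# Cross-tabulate the two null indicators; if the variables were
# unrelated, both rows would show similar proportions
ct = pd.crosstab(data['employment'].isnull(),
                 data['time_employed'].isnull(),
                 normalize='index')
print(ct)
```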

Let's calculate the proportion of missing data in the time_employed variable for the customers who have declared employment. We will first slice data to get all the rows of customers who declared an employment and store them in t. Then we will find the percentage of missing data in time_employed for t.

# customers who declared employment
t = data[~data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()
0.0005325380764724678

Similarly, we will calculate the proportion of missing data in the time_employed variable for the customers who have not declared employment.

# customers who did not declare employment
t = data[data['employment'].isnull()]

# percentage of missing data in time employed
t['time_employed'].isnull().mean()
0.8576104746317512

The number of borrowers who reported an occupation but have missing values in time_employed is minimal. In contrast, the customers who did not report an occupation mostly show missing values in the time_employed variable.

This further supports that the missing values in employment are related to the missing values in time_employed.

Hence this is an example of MAR.
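As a closing aside, the two slicing steps above can be condensed into a single groupby over the employment null indicator, in the same style as the Titanic examples earlier:

```python
import pandas as pd

data = pd.read_csv(
    'https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv',
    usecols=['employment', 'time_employed'])

# Both proportions in one step: group the time_employed null indicator
# by whether employment itself is missing
result = (data['time_employed'].isnull()
              .groupby(data['employment'].isnull())
              .mean())
print(result)
```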