Feature Engineering Series Tutorial 1: Missing Values and Their Mechanisms
Missing data, or missing values, occur when no value is stored for a given observation of a variable.
Incomplete data is an unavoidable problem in most data sources, and may have a significant impact on the conclusions that can be derived from the data.
Why is data missing?
The sources of missing data can be very diverse. These are just a few examples:
- A value is missing because it was forgotten, lost or not stored properly
- For a certain observation, the value does not exist
- The value can't be known or identified
In many organisations, information is collected into a form by a person talking with a client on the phone, or alternatively, by customers filling forms online. Often, the person entering the data does not complete all the fields in the form. Many of the fields are not compulsory, which may lead to missing values.
The reasons for omitting information vary: perhaps the person does not want to disclose something, for example their income; perhaps they do not know the answer, or the answer does not apply to their circumstances; or, on the contrary, the person in the organisation wants to spare the customer some time and omits asking questions they think are less relevant.
There are other cases where the value for a certain variable simply does not exist. For example, in the variable 'total debt as percentage of total income' (very common in financial data), if the person has no income, the ratio cannot be computed (we would be dividing by zero), and therefore the value will be missing.
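To make this concrete, here is a minimal, hypothetical sketch in pandas; the toy columns `debt` and `income` are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# hypothetical toy data: the last person has no income
df = pd.DataFrame({'debt': [5000, 12000, 3000],
                   'income': [40000, 80000, 0]})

# debt as a percentage of income; dividing by zero yields inf
df['debt_pct'] = 100 * df['debt'] / df['income']

# the ratio is undefined for zero income, so we store it as NaN
df['debt_pct'] = df['debt_pct'].replace([np.inf, -np.inf], np.nan)
print(df)
```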
It is important to understand how missing data is introduced into a dataset, that is, the mechanisms by which missing information arises. Depending on the mechanism, we may choose to process the missing values differently. In addition, by knowing the source of the missing data, we may choose to take action to control that source and decrease the amount of missing information going forward during data collection.
Missing Data Mechanisms
There are 3 mechanisms that lead to missing data. Two of them involve data that is missing randomly or almost randomly, and the third involves a systematic loss of data.
Missing Completely at Random (MCAR):
A variable is missing completely at random (MCAR) if the probability of being missing is the same for all observations. When data is MCAR, there is no relationship between the missingness and any other value, observed or missing, in the dataset. In other words, the missing data points are a random subset of the data; there is nothing systematic that makes some observations more likely to be missing than others. If values are missing completely at random, disregarding those cases would not bias the inferences we make.
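Here is a minimal simulated sketch of MCAR; the variable `weight` and the 10% missing rate are made up for illustration. Each value has the same fixed probability of being dropped, independent of everything else:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# toy data: 1000 observations of a single variable
df = pd.DataFrame({'weight': rng.normal(70, 10, size=1000)})

# MCAR: every observation has the same 10% chance of being missing,
# independent of its value and of any other variable
mcar_mask = rng.random(1000) < 0.10
df.loc[mcar_mask, 'weight'] = np.nan

print(df['weight'].isnull().mean())  # roughly 0.10
```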
Missing at Random (MAR):
MAR occurs when there is a relationship between the propensity of a value to be missing and the observed data. In other words, the probability of an observation being missing depends on available information (i.e., other variables in the dataset). For example, if men are more likely to disclose their weight than women, weight is MAR. The weight information will be missing at random for those men and women who do not disclose it, but as men are more prone to disclose it, there will be more missing values for women than for men.
In a situation like the above, if we decide to proceed with the variable with missing values (in this case weight), we might benefit from including gender in our models to control for the bias that the missing weight observations introduce.
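A minimal simulated sketch of the weight example above (all numbers are made up): the probability of `weight` being missing depends only on the observed variable `gender`, not on the weight itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n = 1000
df = pd.DataFrame({
    'gender': rng.choice(['male', 'female'], size=n),
    'weight': rng.normal(70, 10, size=n),
})

# MAR: women are assumed less likely to disclose their weight than men,
# so missingness depends on gender (observed), not on the weight value
p_missing = np.where(df['gender'] == 'female', 0.30, 0.05)
df.loc[rng.random(n) < p_missing, 'weight'] = np.nan

# more missing values for women than for men
print(df['weight'].isnull().groupby(df['gender']).mean())
```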
Missing Not at Random (MNAR):
Data is missing not at random (MNAR) when there is a mechanism or a reason why the missing values are introduced in the dataset. For example, MNAR would occur if people failed to fill in a depression survey because of their level of depression: here, the missingness is related to the outcome we care about, depression. Similarly, when a financial company asks customers for bank and identity documents in order to prevent identity fraud, fraudsters impersonating someone else will typically not upload the documents, precisely because they do not have them. There is therefore a systematic relationship between the missing documents and the target we want to predict: fraud.
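And a minimal simulated sketch of the depression example (again with made-up numbers): the probability of the score being missing grows with the score itself, which is exactly what makes MNAR hard to verify from the observed data alone.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n = 1000
df = pd.DataFrame({'depression_score': rng.uniform(0, 10, size=n)})

# MNAR: the higher the (unobserved) score, the less likely the person
# is to answer, so missingness depends on the missing value itself
p_missing = df['depression_score'] / 10 * 0.6
df.loc[rng.random(n) < p_missing, 'depression_score'] = np.nan

print(df['depression_score'].isnull().mean())
```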
Understanding the mechanism by which data is missing is important to decide which methods to use to impute the missing values.
In this Blog:
In the following cells we will:
- Learn how to detect and quantify missing values
- Try to identify the 3 different mechanisms of missing data introduction
We will use the toy Loan dataset and the Titanic dataset.
Let's start!
Here we import the necessary libraries:

- `pandas` is used to read the dataset into a dataframe and perform operations on it
- `numpy` is used to perform basic array operations

```python
import pandas as pd
import numpy as np

# to display the total number of columns present in the dataset
pd.set_option('display.max_columns', None)
```
We will begin by reading the dataset using `read_csv()` and displaying the first 5 rows using `head()`. In Python, missing values are stored as `NaN`; see, for example, the first row of the variable `Cabin`.
```python
data = pd.read_csv("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv")
data.head()
```
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
We will now see the total number of missing values for each column using `isnull()` and `sum()`. There are 177 missing values for `Age`, 687 for `Cabin` and 2 for `Embarked`.
```python
data.isnull().sum()
```
```
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
```
Alternatively, we can chain the `mean()` method after `isnull()` to get the percentage of missing values for each variable. As we can see, 20% of the values in `Age` are missing. `Cabin`, which indicates the cabin in which the passenger was travelling, has 77% missing values, and `Embarked`, which specifies the port where the passenger boarded the Titanic, has approximately 0.2% missing values.
```python
data.isnull().mean()
```
```
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
```
Mechanisms of Missing Data
Missing data Not At Random (MNAR): Systematic missing values
In the Titanic dataset, the missing values of the variables `Age` and `Cabin` were introduced systematically. For many of the people who did not survive, their age or the cabin they were travelling in could not be established; the people who survived, on the other hand, could be asked for that information.
Can we infer this by looking at the data?
In a situation like this, we could expect a greater number of missing values for people who did not survive.
Let's have a look.
We will create a new binary variable `cabin_null`: it takes the value 1 if `Cabin` is `NaN` and 0 otherwise. The `np.where()` function evaluates a condition element-wise and returns values depending on the outcome; here we use it to flag the rows where `Cabin` is `NaN`.
```python
data['cabin_null'] = np.where(data['Cabin'].isnull(), 1, 0)
data['cabin_null'].head()
```
```
0    1
1    0
2    1
3    0
4    1
Name: cabin_null, dtype: int32
```
Now we will evaluate the percentage of missing values in `Cabin` for the people who survived and the people who did not. The variable `Survived` takes the value 1 if the passenger survived and 0 otherwise. The `groupby()` function splits the data into groups based on some criteria; here we split the passengers into two groups, survivors and non-survivors, and compute the percentage of missing values in `Cabin` for each group.
```python
data.groupby(['Survived'])['cabin_null'].mean()
```
```
Survived
0    0.876138
1    0.602339
Name: cabin_null, dtype: float64
```
Another way of doing the same thing with fewer lines of code is shown below.
```python
data['Cabin'].isnull().groupby(data['Survived']).mean()
```
```
Survived
0    0.876138
1    0.602339
Name: Cabin, dtype: float64
```
We observe that the percentage of missing values is higher for the people who did not survive (almost 88%) than for the people who survived (60%). This finding supports our hypothesis that the data is missing because, after people died, the information could not be retrieved.
Note: having said this, to truly determine whether the data is missing not at random, we would need to become extremely familiar with the way the data was collected. Analysing a dataset alone can only point us in the right direction or help us build assumptions.
Let's do the same for the variable `Age`. First, we will create a binary variable `age_null` that indicates whether the value of `Age` is missing, and then we will compute its mean for the two survival groups.
```python
data['age_null'] = np.where(data['Age'].isnull(), 1, 0)
data.groupby(['Survived'])['age_null'].mean()
```
```
Survived
0    0.227687
1    0.152047
Name: age_null, dtype: float64
```
The same result with more concise code is shown below.
```python
data['Age'].isnull().groupby(data['Survived']).mean()
```
```
Survived
0    0.227687
1    0.152047
Name: Age, dtype: float64
```
Again, we observe a higher proportion of missing data for the people who did not survive the tragedy. The analysis therefore suggests a systematic loss of data: people who did not survive tend to have more missing information. Presumably, the method chosen to gather the information contributed to the generation of these missing data.
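As an optional, rough sanity check (not part of the original analysis, and assuming `scipy` is installed), a chi-square test of independence can tell us whether the difference in missingness between the two survival groups is larger than we would expect by chance:

```python
from scipy.stats import chi2_contingency

# contingency table: survival vs. whether Age is missing
contingency = pd.crosstab(data['Survived'], data['age_null'])
chi2, p_value, dof, expected = chi2_contingency(contingency)

# a small p-value suggests the missingness is related to survival
print(p_value)
```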
Missing data Completely At Random (MCAR)
In the dataset, the variable `Embarked` also contains missing values. We will slice the dataframe `data` to show only the rows where `Embarked` is `NaN`.
```python
data[data['Embarked'].isnull()]
```
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cabin_null | age_null |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | 0 | 0 |
These 2 women were travelling together: Miss Icard was the maid of Mrs. Stone.
There is no indication that the missing information in the variable `Embarked` depends on any other variable, and the fact that these women survived means that they could have been asked for this information. Very likely, the values were lost when the dataset was built. If these values are MCAR, the probability of the data being missing for these 2 women is the same as the probability of it being missing for any other person on the Titanic. Of course, this will be hard, if not impossible, to prove; but I hope this serves as a demonstration.
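With only 2 missing rows no formal test is meaningful, but purely for illustration we can check descriptively whether the missingness of `Embarked` lines up with any obvious group in the data:

```python
# proportion of missing Embarked per passenger class and per sex;
# with only 2 missing values these numbers are anecdotal at best
print(data['Embarked'].isnull().groupby(data['Pclass']).mean())
print(data['Embarked'].isnull().groupby(data['Sex']).mean())
```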
Missing data at Random (MAR)
For this example we will use the toy loan book dataset from a fictitious peer-to-peer lending company.
We will look at the variables `employment` and `time_employed` (the number of years in employment), both declared by the borrowers at the time of applying for a loan. In this example, missing data in `employment` is associated with missing data in `time_employed`.
We will begin by reading the dataset and loading only the columns of interest using the `usecols` parameter.
```python
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv',
                   usecols=['employment', 'time_employed'])
data.head()
```
|   | employment | time_employed |
|---|---|---|
| 0 | Teacher | <=5 years |
| 1 | Accountant | <=5 years |
| 2 | Statistician | <=5 years |
| 3 | Other | <=5 years |
| 4 | Bus driver | >5 years |
Now we will check the percentage of missing data. We see that both variables have approximately the same percentage of missing observations.
```python
data.isnull().mean()
```
```
employment       0.0611
time_employed    0.0529
dtype: float64
```
Now we will inspect the different categories of `employment` by displaying all its `unique()` values.
```python
# number of different employment categories
print('Number of employment categories: {}'.format(
    len(data['employment'].unique())))

# examples of employment categories
data['employment'].unique()
```
```
Number of employment categories: 12

array(['Teacher', 'Accountant', 'Statistician', 'Other', 'Bus driver',
       'Secretary', 'Software developer', 'Nurse', 'Taxi driver', nan,
       'Civil Servant', 'Dentist'], dtype=object)
```
We can see that `employment` contains missing information (`NaN`) alongside several different employment categories. Now we will inspect the `time_employed` variable, starting by displaying all its `unique()` values.
```python
data['time_employed'].unique()
```
```
array(['<=5 years', '>5 years', nan], dtype=object)
```
Customers can't enter a value for employment time if they are not employed: they could be students, retired, self-employed, or working in the home. This is how the 2 variables are related to each other.
Let's calculate the proportion of missing data in `time_employed` for the customers who declared employment. We will first slice `data` to get all the rows of customers who declared employment and store them in `t`, and then compute the percentage of missing data in `time_employed` for `t`.
```python
# customers who declared employment
t = data[~data['employment'].isnull()]

# percentage of missing data in time_employed
t['time_employed'].isnull().mean()
```
```
0.0005325380764724678
```
Similarly, we will calculate the proportion of missing data in `time_employed` for the customers who did not declare employment.
```python
# customers who did not declare employment
t = data[data['employment'].isnull()]

# percentage of missing data in time_employed
t['time_employed'].isnull().mean()
```
```
0.8576104746317512
```
The number of borrowers who reported an occupation but have missing values in `time_employed` is minimal. In contrast, the customers who did not report an occupation are mostly missing the `time_employed` value as well. This further supports that the missing values in `employment` are related to the missing values in `time_employed`.
Hence this is an example of MAR.
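As a compact check, we can cross-tabulate the missingness indicators of the two variables in a single table:

```python
# rows: is employment missing?  columns: is time_employed missing?
# the bulk of the counts should fall on the diagonal if the two
# variables tend to be missing together
pd.crosstab(data['employment'].isnull(), data['time_employed'].isnull())
```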