Feature Engineering Series Tutorial 2: Cardinality in Machine Learning


Cardinality refers to the number of possible values that a feature can assume. For example, the variable “US State” has 50 possible values. Binary features, of course, can only assume one of two values (0 or 1).

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.
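As a quick illustration, cardinality can be quantified with pandas; below is a minimal sketch on a toy Series (not a real dataset):

import pandas as pd

# toy categorical variable with three distinct labels
city = pd.Series(['London', 'Manchester', 'Brighton', 'London', 'London'])

# cardinality = number of distinct labels
print(city.nunique())      # 3 -- nunique() ignores NaN by default
print(len(city.unique()))  # 3 -- unique() would also count NaN as a label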

Are multiple labels in a categorical variable a problem?

High cardinality may pose the following problems:

  • Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.
  • A big number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.
  • Some labels may be present only in the training set and not in the test set, causing machine learning algorithms to over-fit the training set.
  • Conversely, some labels may appear only in the test set, leaving the machine learning algorithm unable to produce a prediction for those new (unseen) observations.

In particular, tree methods can be biased towards variables with lots of labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.

Below we will see the effect of high cardinality of variables on the performance of different machine learning algorithms and how a quick fix to reduce the number of labels, without any sort of data insight, helps to boost the performance.

In this Blog:

We will:

  • Learn how to quantify cardinality
  • See examples of high and low cardinality variables
  • Understand the effect of cardinality while preparing train and test sets
  • See the effect of cardinality on Machine Learning Model performance

We will use the Titanic dataset.

Let's start!

We will first import all the necessary libraries.

# to read the dataset into a dataframe and perform operations on it 
import pandas as pd

# to perform basic array operations
import numpy as np

# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split

Now we will read the Titanic dataset using read_csv(). head() shows the first 5 rows of the dataframe. The categorical variables in this dataset are Name, Sex, Ticket, Cabin and Embarked.

Note: Ticket and Cabin contain both letters and numbers, so they could be treated as mixed variables. For this demonstration, we will treat them as categorical variables.

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')
data.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Let's inspect the cardinality of each categorical variable in the dataset.

print('Number of categories in the variable Name: {}'.format(
    len(data.Name.unique())))

print('Number of categories in the variable Gender: {}'.format(
    len(data.Sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
    len(data.Ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.Embarked.unique())))

print('Total number of passengers in the Titanic: {}'.format(len(data)))
Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891

While the variable Sex contains only 2 categories and Embarked contains 4 (unique() counts the missing value NaN as a category, alongside the 3 ports), the variables Ticket, Name and Cabin, as expected, contain a huge number of different labels (high cardinality).

To demonstrate the effect of high cardinality in train and test sets and machine learning performance, we will work with the variable Cabin. We will create a new variable with reduced cardinality.

We will begin by exploring the values in the variable Cabin. As we saw in the previous cell, there are 148 unique values. We will display these values using unique().

data.Cabin.unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)

Now we will reduce the cardinality of the variable. To do so, instead of using the entire cabin value, we will retain only the first letter in Cabin_reduced.

Rationale: The first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class status and proximity to the surface of the Titanic. Both are known to improve the probability of survival.

# let's capture the first letter of Cabin
data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]

data[['Cabin', 'Cabin_reduced']].head()
  Cabin Cabin_reduced
0   NaN             n
1   C85             C
2   NaN             n
3  C123             C
4   NaN             n

Now let's check the cardinality of Cabin_reduced. Note that astype(str) turns NaN into the string 'nan', which is why missing cabins received the letter 'n' above. We reduced the number of different labels from 148 to 9.

print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))

print('Number of categories in the variable Cabin_reduced: {}'.format(
    len(data.Cabin_reduced.unique())))
Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin_reduced: 9

Now we will split the data into training and testing sets with the help of train_test_split(). use_cols contains the variables of the feature space, i.e. the variables which provide the information necessary for prediction. Survived contains the values which have to be predicted. test_size=0.3 keeps 30% of the data for testing, while the remaining 70% will be used for training the model. random_state controls the shuffling applied to the data before the split.

use_cols = ['Cabin', 'Cabin_reduced', 'Sex']

X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols], 
    data['Survived'],  
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape
((623, 3), (268, 3))

As you can see from the previous cell the training set contains 623 rows and the test dataset contains 268 rows.

High cardinality leads to uneven distribution of categories in train and test sets

When a variable has high cardinality, some categories often land only in the training set, or only in the test set. If present only in the training set, they may lead to over-fitting. If present only in the test set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

We will find the number of labels in Cabin which are present only in the training set and are not present in the test dataset.

unique_to_train_set = [
    x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]

len(unique_to_train_set)
100

There are 100 cabins that are present only in the training set and not in the test set. Similarly, we will compute the number of labels present only in the test set and not in the training set.

unique_to_test_set = [
    x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]

len(unique_to_test_set)
28

This problem can be overcome by reducing the cardinality of the variable. Let's find the number of labels present only in the training set for the cabin variable with reduced cardinality, i.e. Cabin_reduced.

unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)
1

Now we will find the number of labels present only in the test set for Cabin with reduced cardinality i.e. Cabin_reduced.

unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)
0

Observe how, by reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and no labels in the test set that are not present in the training set.
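Since the two list comprehensions above follow the same pattern, they can be wrapped in a small helper. This is just a convenience sketch; labels_unique_to is our own function, not part of any library:

def labels_unique_to(left, right, column):
    """Labels of `column` present in `left` but not in `right`."""
    right_labels = right[column].unique()
    return [x for x in left[column].unique() if x not in right_labels]

# reproduce the counts computed above
print(len(labels_unique_to(X_train, X_test, 'Cabin')))          # 100
print(len(labels_unique_to(X_test, X_train, 'Cabin')))          # 28
print(len(labels_unique_to(X_train, X_test, 'Cabin_reduced')))  # 1
print(len(labels_unique_to(X_test, X_train, 'Cabin_reduced')))  # 0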

Effect of cardinality on Machine Learning Model Performance

In order to evaluate the effect of categorical variables in machine learning models, we will quickly replace the categories by numbers. We will re-map Cabin to numbers so we can use it to train ML models.

Note: This is neither the only nor the best way to encode categorical variables into numbers.
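For reference, scikit-learn offers encoders that handle this more gracefully. The sketch below uses OrdinalEncoder and assumes scikit-learn >= 1.1 (for the encoded_missing_value parameter); labels never seen during fitting get a sentinel value instead of breaking downstream code:

from sklearn.preprocessing import OrdinalEncoder

# learn the mapping on the train set only;
# unseen test labels become -1, missing values become -2
encoder = OrdinalEncoder(handle_unknown='use_encoded_value',
                         unknown_value=-1,
                         encoded_missing_value=-2)

cabin_train = encoder.fit_transform(X_train[['Cabin']])
cabin_test = encoder.transform(X_test[['Cabin']])

For this tutorial, though, we will stick with a simple dictionary mapping.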

Here itertools is just used to display the first 100 elements in the newly created dictionary.

import itertools 

cabin_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))
{'E17': 0, 'D33': 1, nan: 2, 'D26': 3, 'B58 B60': 4, 'C128': 5, 'D17': 6, 'A14': 7, 'F33': 8, 'B19': 9, 'D21': 10, 'C148': 11, 'C30': 12, 'D56': 13, 'E24': 14, 'E40': 15, 'E31': 16, 'E44': 17, 'E38': 18, 'D37': 19, 'E8': 20, 'C92': 21, 'E63': 22, 'C125': 23, 'F4': 24, 'E67': 25, 'C126': 26, 'B73': 27, 'E36': 28, 'C78': 29, 'E46': 30, 'C111': 31, 'E101': 32, 'D15': 33, 'E12': 34, 'G6': 35, 'A32': 36, 'B4': 37, 'A10': 38, 'A5': 39, 'C95': 40, 'E25': 41, 'C90': 42, 'D6': 43, 'A36': 44, 'D': 45, 'D50': 46, 'B96 B98': 47, 'C93': 48, 'E77': 49, 'C101': 50, 'D11': 51, 'C123': 52, 'C32': 53, 'B35': 54, 'C91': 55, 'T': 56, 'B101': 57, 'E58': 58, 'A23': 59, 'B77': 60, 'D28': 61, 'B82 B84': 62, 'B79': 63, 'C45': 64, 'C2': 65, 'B5': 66, 'C104': 67, 'B20': 68, 'A19': 69, 'B51 B53 B55': 70, 'B80': 71, 'B38': 72, 'B22': 73, 'B18': 74, 'C22 C26': 75, 'A16': 76, 'F2': 77, 'D47': 78, 'E121': 79, 'C23 C25 C27': 80, 'B28': 81, 'E10': 82, 'D36': 83, 'C46': 84, 'B39': 85, 'D30': 86, 'E33': 87, 'C50': 88, 'D20': 89, 'C124': 90, 'A34': 91, 'C110': 92, 'D19': 93, 'B86': 94, 'D35': 95, 'C99': 96, 'D46': 97, 'F38': 98, 'A24': 99}

Now we will replace the labels in Cabin using the dictionary cabin_dict created above. The numerical values will be stored in Cabin_mapped.

X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'Cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'Cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'Cabin']].head(10)
     Cabin_mapped    Cabin
857             0      E17
52              1      D33
386             2      NaN
124             3      D26
578             2      NaN
549             2      NaN
118             4  B58 B60
12              2      NaN
157             2      NaN
127             2      NaN

We can see that NaN takes the value 2 in the new variable, E17 takes the value 0, D33 takes the value 1, and so on. Now we will replace the letters in the Cabin_reduced variable with numbers following the same procedure as above.

# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

X_train[['Cabin_reduced', 'Cabin']].head(20)
     Cabin_reduced    Cabin
857              0      E17
52               1      D33
386              2      NaN
124              1      D26
578              2      NaN
549              2      NaN
118              3  B58 B60
12               2      NaN
157              2      NaN
127              2      NaN
653              2      NaN
235              2      NaN
785              2      NaN
241              2      NaN
351              4     C128
862              1      D17
851              2      NaN
753              2      NaN
532              2      NaN
485              2      NaN

We see now that D33 and D26 correspond to the same number, 1, because we are capturing only the first letter. They both start with D.

Now we will map the categorical variable Sex to numbers.

X_train.loc[:, 'Sex'] = X_train.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Sex'] = X_test.loc[:, 'Sex'].map({'male': 0, 'female': 1})

X_train.Sex.head()
857    0
52     1
386    0
124    0
578    1
Name: Sex, dtype: int64

Next we will check if there are any missing values in these variables in the training as well as testing dataset.

X_train[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
Cabin_mapped     0
Cabin_reduced    0
Sex              0
dtype: int64
X_test[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
Cabin_mapped     30
Cabin_reduced     0
Sex               0
dtype: int64

In the test set, there are now 30 missing values for the high-cardinality variable Cabin_mapped. These were introduced while encoding the categories into numbers.

Why?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, we will fill those missing values with 0.
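For reference, a simple way to avoid these NaNs in the first place is to encode through dict.get with a sentinel default. This is just a sketch; train_dict below rebuilds the train-set mapping for Cabin, and the sentinel value -1 is an arbitrary choice:

# rebuild the train-set mapping for Cabin (illustrative only)
train_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique())}

# dict.get falls back to -1 for labels absent from the training mapping
cabin_mapped_safe = X_test['Cabin'].map(lambda label: train_dict.get(label, -1))
print((cabin_mapped_safe == -1).sum())  # expect 30: the rows that were NaN above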

Let's check the number of different categories in the encoded variables.

len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())
(121, 9)

From here we can conclude that, of the original 148 cabin categories in the dataset, only 121 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.

Let's go ahead and evaluate the effect of labels on machine learning algorithms.

Random Forests

We will build the model on data with high cardinality for cabin and then predict using that model.

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Random Forests roc-auc: 0.8617329342096702
Test set
Random Forests roc-auc: 0.8078571428571428

We observe that the performance of the Random Forests on the training set is quite superior to its performance on the test set. This indicates that the model is over-fitting: it does a great job of predicting the outcome on the dataset it was trained on, but it lacks the power to generalise to unseen data.

Now we will build the model on data with low cardinality for cabin and then predict using that model.

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'Sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Random Forests roc-auc: 0.8199550985878832
Test set
Random Forests roc-auc: 0.8332142857142857

We can see now that the Random Forests model no longer over-fits to the training set. In addition, it is much better at generalising its predictions.

Note: We can also mitigate the effect of high cardinality by adjusting the hyper-parameters of the random forests, but that goes beyond the scope of this blog. Here, I want to show you that, given the same model with identical hyper-parameters, high cardinality may cause it to over-fit.
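For the curious, such an adjustment could look like the sketch below; max_depth=4 is an arbitrary illustrative value, not a tuned one:

# same model as before, but with the depth of the trees capped
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=39)
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))
print('Test roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:, 1])))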

AdaBoost

We will build the model on data with high cardinality for cabin and then predict using that model.

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Adaboost roc-auc: 0.8399546647578144
Test set
Adaboost roc-auc: 0.809375

Now we will build the model on data with low cardinality for cabin and then predict using that model.

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Adaboost roc-auc: 0.8195863430294354
Test set
Adaboost roc-auc: 0.8332142857142857

The AdaBoost model trained on the variable with high cardinality is also over-fitting to the training set, whereas the AdaBoost trained on the low-cardinality variable is not over-fitting and therefore does a better job of generalising the predictions.

In addition, building an AdaBoost on a dataset with fewer categories in Cabin is a) simpler and b) more robust: should a new cabin appear in the test set, taking just the first letter of the cabin means the model will most likely have seen that letter during training and will know how to handle it.

Logistic Regression

We will build the model on data with high cardinality for cabin and then predict using that model.

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Logistic regression roc-auc: 0.8094564109238411
Test set
Logistic regression roc-auc: 0.7591071428571431

Now we will build the model on data with low cardinality for cabin and then predict using that model.

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Logistic regression roc-auc: 0.7672664367367301
Test set
Logistic regression roc-auc: 0.7957738095238095

We can draw the same conclusion for Logistic Regression: reducing the cardinality improves the performance and generalisation of the algorithm.

Gradient Boosted Classifier

We will build the model on data with high cardinality for cabin and then predict using that model.

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Gradient Boosted Trees roc-auc: 0.8731860480249887
Test set
Gradient Boosted Trees roc-auc: 0.816845238095238

Now we will build the model on data with low cardinality for cabin and then predict using that model.

# model built on data with fewer categories in the Cabin variable

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Gradient Boosted Trees roc-auc: 0.8204756946703976
Test set
Gradient Boosted Trees roc-auc: 0.8332142857142857

We can see that all four algorithms generalise better, achieving a higher roc-auc on the test set, when the cardinality of the variable is low.
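As a closing convenience, the whole comparison can be reproduced in one compact loop. This is a sketch reusing the variables defined above; clone comes from sklearn.base and returns a fresh, unfitted copy of a model:

from sklearn.base import clone

models = {
    'Random Forests': RandomForestClassifier(n_estimators=200, random_state=39),
    'AdaBoost': AdaBoostClassifier(n_estimators=200, random_state=44),
    'Logistic Regression': LogisticRegression(random_state=44, solver='lbfgs'),
    'Gradient Boosted Trees': GradientBoostingClassifier(n_estimators=300, random_state=44),
}

for name, model in models.items():
    for cabin in ['Cabin_mapped', 'Cabin_reduced']:
        m = clone(model)
        m.fit(X_train[[cabin, 'Sex']], y_train)
        pred_test = m.predict_proba(X_test[[cabin, 'Sex']].fillna(0))
        print('{} with {}: test roc-auc {:.3f}'.format(
            name, cabin, roc_auc_score(y_test, pred_test[:, 1])))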