Feature Engineering Series Tutorial 2: Cardinality in Machine Learning
Cardinality refers to the number of possible values that a feature can assume. For example, the variable “US State” is one that has 50 possible values. The binary features, of course, could only assume one of two values (0 or 1).
The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.
Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.
The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.
Are multiple labels in a categorical variable a problem?
High cardinality may pose the following problems:
- Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.
- A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fit.
- Some of the labels may only be present in the training data set, but not in the test set, therefore machine learning algorithms may over-fit to the training set.
- Contrarily, some labels may appear only in the test set, therefore leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.
In particular, tree methods can be biased towards variables with lots of labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.
Below we will see the effect of high cardinality of variables on the performance of different machine learning algorithms and how a quick fix to reduce the number of labels, without any sort of data insight, helps to boost the performance.
In this Blog:
We will:
- Learn how to quantify cardinality
- See examples of high and low cardinality variables
- Understand the effect of cardinality while preparing train and test sets
- See the effect of cardinality on Machine Learning Model performance
We will use the Titanic dataset.
Let's start!
We will first import all the necessary libraries.
# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to perform basic array operations
import numpy as np

# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split
Now we will read the Titanic dataset using read_csv(). head() shows the first 5 rows of the dataframe. The categorical variables in this dataset are Name, Sex, Ticket, Cabin and Embarked.
Note: Ticket and Cabin contain both letters and numbers, so they could be treated as mixed variables. For this demonstration, we will treat them as categorical variables.
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Let's inspect the cardinality of each categorical variable in the dataset.
print('Number of categories in the variable Name: {}'.format(
    len(data.Name.unique())))
print('Number of categories in the variable Gender: {}'.format(
    len(data.Sex.unique())))
print('Number of categories in the variable Ticket: {}'.format(
    len(data.Ticket.unique())))
print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))
print('Number of categories in the variable Embarked: {}'.format(
    len(data.Embarked.unique())))
print('Total number of passengers in the Titanic: {}'.format(len(data)))

Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891
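While the variable Sex contains only 2 categories and Embarked contains 4 (low cardinality), the variables Ticket, Name and Cabin, as expected, contain a huge number of different labels (high cardinality). Note that unique() also returns NaN, so for variables with missing values, such as Cabin, the missing value is counted as one of the labels.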
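The counts above can also be obtained for several columns at once. The following is a minimal sketch on a small made-up dataframe (the values are illustrative and not taken from the Titanic file). It uses pandas' nunique(), which excludes NaN by default and counts it as a label when dropna=False, the latter matching len(unique()).

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Cabin': [np.nan, 'C85', 'C123', np.nan],
})

# Cardinality counting NaN as its own label (same as len(col.unique()))
with_nan = toy.nunique(dropna=False)

# Cardinality of the observed labels only (NaN excluded, the default)
without_nan = toy.nunique()

print(pd.DataFrame({'with_nan': with_nan, 'without_nan': without_nan}))
```

Which of the two counts is more useful depends on whether missing values will later be treated as a category of their own.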
To demonstrate the effect of high cardinality in train and test sets and machine learning performance, we will work with the variable Cabin. We will create a new variable with reduced cardinality.
We will begin by exploring the values in the variable Cabin. As we saw in the previous cell there are 148 unique values. We will display these values using unique().
data.Cabin.unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33', 'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101', 'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4', 'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30', 'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36', 'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42', 'C148'], dtype=object)
Now we will reduce the cardinality of the variable. To do so, instead of using the entire cabin value, we will retain only its first letter in a new variable, Cabin_reduced.
Rationale: The first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class status and proximity to the surface of the Titanic. Both are known to influence the probability of survival.
# let's capture the first letter of Cabin
data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]

data[['Cabin', 'Cabin_reduced']].head()
| | Cabin | Cabin_reduced |
|---|---|---|
0 | NaN | n |
1 | C85 | C |
2 | NaN | n |
3 | C123 | C |
4 | NaN | n |
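Now let's check the cardinality of Cabin_reduced. We reduced the number of different labels from 148 to 9. Because astype(str) turns missing values into the string 'nan', passengers without a cabin end up with the label n.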
print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))
print('Number of categories in the variable Cabin_reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin_reduced: 9
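As an aside, the n label is a side effect of converting NaN to a string before slicing. The sketch below, on a few made-up cabin values, shows one possible alternative that is not used in the rest of this post: slice first, then give missing values an explicit label. The label 'Missing' is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# A few made-up cabin values, including missing ones
cabin = pd.Series([np.nan, 'C85', 'C123', 'F G73', np.nan], name='Cabin')

# .str[0] leaves NaN as NaN, so missing values can be labelled explicitly
# ('Missing' is a hypothetical label chosen for this illustration)
deck = cabin.str[0].fillna('Missing')

print(deck.tolist())
```

The number of labels is the same either way; the only difference is that the missing-value category gets a readable name.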
Now we will split the data into training and testing sets with the help of train_test_split(). use_cols contains the variables of the feature space, i.e. the variables that provide the information used for prediction. Survived contains the values to be predicted. Setting test_size=0.3 keeps 30% of the data for testing and uses the remaining 70% to train the model. random_state controls the shuffling applied to the data before the split.
use_cols = ['Cabin', 'Cabin_reduced', 'Sex']

X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape
((623, 3), (268, 3))
As you can see from the previous cell the training set contains 623 rows and the test dataset contains 268 rows.
High cardinality leads to uneven distribution of categories in train and test sets
When a variable has high cardinality, often some categories land only in the training set, or only in the testing set. If present only in the training set, they may lead to over-fitting. If present only on the testing set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.
We will find the number of labels in Cabin which are present only in the training set and are not present in the test dataset.
unique_to_train_set = [
    x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]
len(unique_to_train_set)
100
There are 100 cabins that are present only in the training set, and not in the testing set. Similarly, we will compute the number of labels present only in the test set and not in the training set.
unique_to_test_set = [
    x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]
len(unique_to_test_set)
28
This problem can be overcome by reducing the cardinality of the variable. Let's find the number of labels present only in the training set for Cabin with reduced cardinality, i.e. Cabin_reduced.
unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]
len(unique_to_train_set)
1
Now we will find the number of labels present only in the test set for Cabin with reduced cardinality, i.e. Cabin_reduced.
unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]
len(unique_to_test_set)
0
Observe how, after reducing the cardinality, only 1 label in the training set is absent from the test set, and no label in the test set is absent from the training set.
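The same comparison can be written with Python sets. This is a minimal sketch on made-up values; the helper name exclusive_labels is hypothetical. It drops NaN before comparing, because NaN does not compare equal to itself, so a membership test against a NumPy array may report NaN as exclusive to each set even when both contain missing values, which could make the counts for Cabin above one higher than the number of real cabin labels.

```python
import numpy as np
import pandas as pd

def exclusive_labels(train_col, test_col):
    """Return (labels only in train, labels only in test), ignoring NaN."""
    train_labels = set(train_col.dropna().unique())
    test_labels = set(test_col.dropna().unique())
    return train_labels - test_labels, test_labels - train_labels

# Made-up cabin values for illustration only
train_cabin = pd.Series(['C85', np.nan, 'E46', 'B28'])
test_cabin = pd.Series([np.nan, 'C85', 'G6'])

only_train, only_test = exclusive_labels(train_cabin, test_cabin)
print(sorted(only_train), sorted(only_test))
```

Set difference also avoids recomputing unique() for every element, which matters little here but helps on larger columns.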
Effect of cardinality on Machine Learning Model Performance
In order to evaluate the effect of categorical variables in machine learning models, we will quickly replace the categories by numbers. We will re-map Cabin to numbers so we can use it to train ML models.
Note: This is neither the only nor the best way to encode categorical variables into numbers.
Here itertools is just used to display the first 100 elements in the newly created dictionary.
import itertools

cabin_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))
{'E17': 0, 'D33': 1, nan: 2, 'D26': 3, 'B58 B60': 4, 'C128': 5, 'D17': 6, 'A14': 7, 'F33': 8, 'B19': 9, 'D21': 10, 'C148': 11, 'C30': 12, 'D56': 13, 'E24': 14, 'E40': 15, 'E31': 16, 'E44': 17, 'E38': 18, 'D37': 19, 'E8': 20, 'C92': 21, 'E63': 22, 'C125': 23, 'F4': 24, 'E67': 25, 'C126': 26, 'B73': 27, 'E36': 28, 'C78': 29, 'E46': 30, 'C111': 31, 'E101': 32, 'D15': 33, 'E12': 34, 'G6': 35, 'A32': 36, 'B4': 37, 'A10': 38, 'A5': 39, 'C95': 40, 'E25': 41, 'C90': 42, 'D6': 43, 'A36': 44, 'D': 45, 'D50': 46, 'B96 B98': 47, 'C93': 48, 'E77': 49, 'C101': 50, 'D11': 51, 'C123': 52, 'C32': 53, 'B35': 54, 'C91': 55, 'T': 56, 'B101': 57, 'E58': 58, 'A23': 59, 'B77': 60, 'D28': 61, 'B82 B84': 62, 'B79': 63, 'C45': 64, 'C2': 65, 'B5': 66, 'C104': 67, 'B20': 68, 'A19': 69, 'B51 B53 B55': 70, 'B80': 71, 'B38': 72, 'B22': 73, 'B18': 74, 'C22 C26': 75, 'A16': 76, 'F2': 77, 'D47': 78, 'E121': 79, 'C23 C25 C27': 80, 'B28': 81, 'E10': 82, 'D36': 83, 'C46': 84, 'B39': 85, 'D30': 86, 'E33': 87, 'C50': 88, 'D20': 89, 'C124': 90, 'A34': 91, 'C110': 92, 'D19': 93, 'B86': 94, 'D35': 95, 'C99': 96, 'D46': 97, 'F38': 98, 'A24': 99}
Now we will replace the labels in Cabin using the dictionary cabin_dict created above. The numerical values will be stored in Cabin_mapped.
X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'Cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'Cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'Cabin']].head(10)
| | Cabin_mapped | Cabin |
|---|---|---|
857 | 0 | E17 |
52 | 1 | D33 |
386 | 2 | NaN |
124 | 3 | D26 |
578 | 2 | NaN |
549 | 2 | NaN |
118 | 4 | B58 B60 |
12 | 2 | NaN |
157 | 2 | NaN |
127 | 2 | NaN |
We can see that NaN takes the value 2 in the new variable, E17 takes the value 0, D33 takes the value 1, and so on. Now we will replace the letters in the Cabin_reduced variable with numbers following the same procedure as above.
# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

X_train[['Cabin_reduced', 'Cabin']].head(20)
| | Cabin_reduced | Cabin |
|---|---|---|
857 | 0 | E17 |
52 | 1 | D33 |
386 | 2 | NaN |
124 | 1 | D26 |
578 | 2 | NaN |
549 | 2 | NaN |
118 | 3 | B58 B60 |
12 | 2 | NaN |
157 | 2 | NaN |
127 | 2 | NaN |
653 | 2 | NaN |
235 | 2 | NaN |
785 | 2 | NaN |
241 | 2 | NaN |
351 | 4 | C128 |
862 | 1 | D17 |
851 | 2 | NaN |
753 | 2 | NaN |
532 | 2 | NaN |
485 | 2 | NaN |
We see now that D33 and D26 correspond to the same number, 1, because we are capturing only the first letter. They both start with D.
Now we will map the categorical variable Sex to numbers.
X_train.loc[:, 'Sex'] = X_train.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Sex'] = X_test.loc[:, 'Sex'].map({'male': 0, 'female': 1})

X_train.Sex.head()

857    0
52     1
386    0
124    0
578    1
Name: Sex, dtype: int64
Next we will check if there are any missing values in these variables in the training as well as testing dataset.
X_train[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()

Cabin_mapped     0
Cabin_reduced    0
Sex              0
dtype: int64

X_test[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()

Cabin_mapped     30
Cabin_reduced     0
Sex               0
dtype: int64
In the test set, there are now 30 missing values for the high-cardinality variable Cabin_mapped. These were introduced while encoding the categories into numbers.
Why?
Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, we will fill those missing values with 0. Note that 0 is also the code assigned to cabin E17 in the dictionary above, so the models will treat these unseen cabins as if they were E17.
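To make the mechanism concrete, here is a minimal sketch on made-up values showing how a label absent from the training set becomes NaN after map(), and one possible way to give such labels their own code instead of reusing 0. The reserved codes UNSEEN and MISSING and the helper encode are hypothetical choices for this illustration, not part of the workflow in this post.

```python
import numpy as np
import pandas as pd

# Made-up values: 'G6' appears in the test data but never in train
train_cabin = pd.Series(['E17', 'D33', np.nan, 'D26'])
test_cabin = pd.Series(['D33', 'G6', np.nan])

# Build the mapping from the training labels only, skipping NaN
mapping = {label: code
           for code, label in enumerate(train_cabin.dropna().unique())}

UNSEEN = -1   # hypothetical reserved code for labels absent from train
MISSING = -2  # hypothetical reserved code for missing values

def encode(col):
    """Map labels to codes, keeping unseen and missing values distinct."""
    codes = col.map(mapping)              # unseen labels and NaN become NaN
    codes = codes.fillna(UNSEEN)          # provisionally mark all as unseen
    codes = codes.mask(col.isna(), MISSING)  # then restore true missing values
    return codes.astype(int)

print(encode(train_cabin).tolist())
print(encode(test_cabin).tolist())
```

With reserved codes, an unseen cabin no longer shares a number with a cabin that was observed during training.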
Let's check the number of different categories in the encoded variables
len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())
(121, 9)
From here we can conclude that from the original 148 cabins in the dataset, only 121 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.
Let's go ahead and evaluate the effect of labels on machine learning algorithms.
Random Forests
We will build the model on data with high cardinality for cabin and then predict using that model.
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8617329342096702
Test set
Random Forests roc-auc: 0.8078571428571428
We observe that the performance of the Random Forests on the training set is considerably better than its performance on the test set (ROC-AUC of about 0.862 versus 0.808). This indicates that the model is over-fitting, which means that it does a great job at predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction to unseen data.
Now we will build the model on data with low cardinality for cabin and then predict using that model.
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'Sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Random Forests roc-auc: 0.8199550985878832
Test set
Random Forests roc-auc: 0.8332142857142857
We can see now that the Random Forests no longer over-fits to the training set. In addition, the model generalises better: the test-set ROC-AUC rises from about 0.808 to 0.833.
Note: We can reduce the effect of high cardinality by adjusting the hyper-parameters of the random forests. That goes beyond the scope of this blog. Here, I want to show you that, given the same model with identical hyper-parameters, high cardinality may cause the model to over-fit.
AdaBoost
We will build the model on data with high cardinality for cabin and then predict using that model.
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8399546647578144
Test set
Adaboost roc-auc: 0.809375
Now we will build the model on data with low cardinality for cabin and then predict using that model.
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Adaboost roc-auc: 0.8195863430294354
Test set
Adaboost roc-auc: 0.8332142857142857
The AdaBoost model trained on the variable with high cardinality also over-fits to the training set, whereas the AdaBoost model trained on the low-cardinality variable does not, and therefore does a better job of generalising the predictions.
In addition, an AdaBoost model built on a variable with fewer categories in Cabin is a) simpler and b) more robust: should a new cabin appear in the test set, taking just its first letter will most likely yield a category the model already saw during training, so the model will know how to handle it.
Logistic Regression
We will build the model on data with high cardinality for cabin and then predict using that model.
# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.8094564109238411
Test set
Logistic regression roc-auc: 0.7591071428571431
Now we will build the model on data with low cardinality for cabin and then predict using that model.
# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Logistic regression roc-auc: 0.7672664367367301
Test set
Logistic regression roc-auc: 0.7957738095238095
We can draw the same conclusion for Logistic Regression: reducing the cardinality improves the performance and generalisation of the algorithm.
Gradient Boosted Classifier
We will build the model on data with high cardinality for cabin and then predict using that model.
# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.8731860480249887
Test set
Gradient Boosted Trees roc-auc: 0.816845238095238
Now we will build the model on data with low cardinality for cabin and then predict using that model.
# model built on data with fewer categories in the Cabin variable
# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

Train set
Gradient Boosted Trees roc-auc: 0.8204756946703976
Test set
Gradient Boosted Trees roc-auc: 0.8332142857142857
We can see that all four algorithms perform better on the test set when the cardinality of the variable is low, even though their training-set scores are lower.
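The fit-and-score steps repeated in each section above could be wrapped in a small helper. The following is a sketch only: the helper name train_test_auc is hypothetical, and the data is a synthetic stand-in (not the Titanic set), so the scores it prints say nothing about the results reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_test_auc(model, X_train, X_test, y_train, y_test):
    """Fit the model and return (train ROC-AUC, test ROC-AUC)."""
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return auc_train, auc_test

# Synthetic stand-in data: one binary feature tied to the target and
# one integer-coded feature with many distinct values
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    'binary_feature': rng.integers(0, 2, n),
    'coded_feature': rng.integers(0, 100, n),
})
y = (X['binary_feature'] + rng.normal(0, 0.8, n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    'Random Forests': RandomForestClassifier(n_estimators=50, random_state=39),
    'Logistic regression': LogisticRegression(solver='lbfgs', max_iter=1000),
}

# Evaluate every model with the same split and the same helper
results = {name: train_test_auc(model, X_tr, X_te, y_tr, y_te)
           for name, model in models.items()}

for name, (auc_train, auc_test) in results.items():
    print(f'{name}: train={auc_train:.3f}, test={auc_test:.3f}')
```

Passing the high-cardinality and the reduced column sets through the same helper keeps the comparison between them limited to the one thing that changes, the encoding of Cabin.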