When preparing categorical variables for machine learning models, one of the most critical characteristics to evaluate is the number of unique categories or labels.
The number of unique labels within a categorical variable is known as its cardinality. A variable with many distinct labels is said to have high cardinality. High-cardinality features (like zip codes, cities, or IP addresses) present particular challenges for machine learning algorithms, especially tree-based models.
In this tutorial, you will explore the concept of cardinality using the Titanic dataset. You will learn how to quantify cardinality, identify the challenges it introduces during train-test splits, observe its impact on the performance of algorithms like Random Forests, AdaBoost, Logistic Regression, and Gradient Boosting, and implement a deck-based grouping technique to reduce cardinality and prevent overfitting.
Prerequisites: Python 3.x, pandas, NumPy, scikit-learn.
High Cardinality Challenges
High cardinality in categorical variables can introduce several problems in machine learning models:
- Tree Bias: Tree-based algorithms tend to favor variables with many levels over those with few levels, regardless of their actual predictive power.
- Overfitting and Noise: Variables with too many labels tend to add noise while carrying little information, which makes the model prone to memorizing the training set.
- Train-Test Inconsistency: Some labels might only be present in the training set, causing overfitting. Conversely, new labels might appear only in the test set, leaving the model unable to compute predictions on unseen data.
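The last two problems can be illustrated with a minimal sketch in plain Python. The helper names (`fit_label_means`, `predict`) and the toy labels are hypothetical and are not part of the tutorial's code. The "model" here simply memorizes the mean target for each label, which is roughly what a flexible learner can end up doing when nearly every label is unique.

```python
from collections import defaultdict


def fit_label_means(labels, targets):
    """Memorize the mean target per label (a toy stand-in for a model)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, target in zip(labels, targets):
        sums[label] += target
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(label_means, labels, default=None):
    """Look up each label; labels never seen in training get `default`."""
    return [label_means.get(label, default) for label in labels]


# Hypothetical toy data: every training label is unique (maximal cardinality).
train_labels = ["C85", "C123", "E46", "G6"]
train_targets = [1, 1, 0, 0]
test_labels = ["C85", "B78", "D33"]

label_means = fit_label_means(train_labels, train_targets)
train_predictions = predict(label_means, train_labels)
test_predictions = predict(label_means, test_labels)

print(train_predictions)  # reproduces the training targets exactly
print(test_predictions)   # unseen labels have no learned value
```

The training targets are reproduced perfectly, yet two of the three test labels cannot be scored at all, because nothing was learned about them.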
We will demonstrate these effects below using the Titanic dataset to predict passenger survival.
Setup and Libraries
Before processing the dataset, import the necessary data manipulation, model building, and evaluation libraries:
# to read the dataset into a dataframe and perform operations on it
import pandas as pd
# to perform basic array operations
import numpy as np
# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
# to evaluate the models
from sklearn.metrics import roc_auc_score
# to separate data into train and test
from sklearn.model_selection import train_test_split
Load the Titanic dataset using Pandas:
data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Inspect the cardinality (the number of unique categories) of each categorical variable in the dataset:
print('Number of categories in the variable Name: {}'.format(
len(data.Name.unique())))
print('Number of categories in the variable Gender: {}'.format(
len(data.Sex.unique())))
print('Number of categories in the variable Ticket: {}'.format(
len(data.Ticket.unique())))
print('Number of categories in the variable Cabin: {}'.format(
len(data.Cabin.unique())))
print('Number of categories in the variable Embarked: {}'.format(
len(data.Embarked.unique())))
print('Total number of passengers in the Titanic: {}'.format(len(data)))
Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891
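The printout confirms that Sex (2 categories) and Embarked (4 categories) have low cardinality, whereas Name (891 categories), Ticket (681 categories), and Cabin (148 categories) exhibit high cardinality. Because unique() treats NaN as a value, the count for a variable with missing data, such as Cabin, includes one entry for the missing values.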
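As a rough, library-free sketch of the same check, the snippet below counts distinct non-missing labels per column and expresses that count as a share of the rows. The function name `cardinality_report` and the toy columns are assumptions made for illustration, with `None` standing in for NaN. In pandas, `Series.nunique()` gives a comparable count because it excludes NaN by default, whereas `unique()` includes it.

```python
def cardinality_report(columns):
    """Return {column: (distinct non-missing labels, share of rows)}."""
    report = {}
    for name, values in columns.items():
        distinct = {v for v in values if v is not None}
        report[name] = (len(distinct), len(distinct) / len(values))
    return report


# Hypothetical toy columns; None stands in for a missing value.
toy_columns = {
    "Sex": ["male", "female", "female", "male", "male"],
    "Cabin": [None, "C85", None, "C123", "E46"],
    "Ticket": ["A/5 21171", "PC 17599", "113803", "373450", "330877"],
}

report = cardinality_report(toy_columns)
for name, (n_labels, share) in report.items():
    print(f"{name}: {n_labels} labels, {share:.0%} of rows")
```

A share close to 100%, as for the toy Ticket column, signals that almost every row carries its own label.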
To demonstrate the impact of cardinality, we will focus on the Cabin column. Display the unique values of Cabin to see their original structure:
data.Cabin.unique()
array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
'C148'], dtype=object)
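We will reduce the cardinality of Cabin by extracting only the first letter. The first letter indicates the deck (e.g., A, B, C) where the cabin was located, which acts as a rough proxy for social class and for how close the cabin was to the open upper decks. Because astype(str) converts NaN to the string 'nan', passengers with no recorded cabin end up in a category of their own, 'n':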
# let's capture the first letter of Cabin
data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]
data[['Cabin', 'Cabin_reduced']].head()
| | Cabin | Cabin_reduced |
|---|---|---|
| 0 | NaN | n |
| 1 | C85 | C |
| 2 | NaN | n |
| 3 | C123 | C |
| 4 | NaN | n |
Verify the new cardinality of the reduced Cabin_reduced variable:
print('Number of categories in the variable Cabin: {}'.format(
len(data.Cabin.unique())))
print('Number of categories in the variable Cabin_reduced: {}'.format(
len(data.Cabin_reduced.unique())))
Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin_reduced: 9
Extracting the first letter reduces the cardinality from 148 categories to just 9: eight deck letters plus 'n' for missing cabins.
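Missing cabins show up as 'n' in the table above because the string form of NaN is 'nan'. If an explicit label is preferred, one possible variant is sketched below in plain Python. The helper `deck_of` and the label 'Missing' are hypothetical choices and are not used elsewhere in this tutorial.

```python
import math


def deck_of(cabin, missing_label="Missing"):
    """Return the deck letter of a cabin string, or an explicit missing label."""
    if cabin is None or (isinstance(cabin, float) and math.isnan(cabin)):
        return missing_label
    return cabin[0]


nan = float("nan")
implicit_label = str(nan)[0]  # what string conversion followed by [0] yields
print(implicit_label)

cabins = [nan, "C85", "B57 B59 B63 B66", "F G73", "T"]
decks = [deck_of(c) for c in cabins]
print(decks)
```

Entries that list several cabins, such as 'B57 B59 B63 B66', are assigned the deck of the first cabin, which matches what taking the first character does in the tutorial's code.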
Split the dataset into training (70%) and testing (30%) sets:
use_cols = ['Cabin', 'Cabin_reduced', 'Sex']
X_train, X_test, y_train, y_test = train_test_split(
data[use_cols],
data['Survived'],
test_size=0.3,
random_state=0)
X_train.shape, X_test.shape
((623, 3), (268, 3))
Inconsistent Category Distributions
When variables have high cardinality, categories frequently land in the training set but not in the testing set, or vice-versa. Count the number of categories in Cabin that only exist in the training set:
unique_to_train_set = [
x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]
len(unique_to_train_set)
100
There are 100 cabins that only appear in the training dataset. Now count the categories that only exist in the testing dataset:
unique_to_test_set = [
x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]
len(unique_to_test_set)
28
Repeat the same analysis on the Cabin_reduced variable to evaluate the impact of reduced cardinality:
unique_to_train_set = [
x for x in X_train['Cabin_reduced'].unique()
if x not in X_test['Cabin_reduced'].unique()
]
len(unique_to_train_set)
1
Count the reduced categories that appear only in the testing set:
unique_to_test_set = [
x for x in X_test['Cabin_reduced'].unique()
if x not in X_train['Cabin_reduced'].unique()
]
len(unique_to_test_set)
0
By reducing cardinality, only 1 category remains unique to the training set, and no categories are unique to the test set.
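The same comparison can be written with set differences, which also makes the handling of missing values explicit. This is a sketch under stated assumptions: `exclusive_labels` and the toy lists are hypothetical, and `None` stands in for NaN. The explicit handling matters because NaN does not compare equal to itself, so a membership test such as `x not in array` may treat NaN as absent from the other split. The totals reported in this tutorial are consistent with that: 121 training categories plus 28 test-only categories gives 149, one more than the 148 categories overall.

```python
def exclusive_labels(train_values, test_values):
    """Return (labels only in train, labels only in test), ignoring missing."""
    train_set = {v for v in train_values if v is not None}
    test_set = {v for v in test_values if v is not None}
    return train_set - test_set, test_set - train_set


# Hypothetical toy split; None stands in for NaN.
train_cabins = ["E17", "D33", None, "D26", None, "B58 B60"]
test_cabins = [None, "D33", "C85", None, "A6"]

only_train, only_test = exclusive_labels(train_cabins, test_cabins)
print(sorted(only_train))
print(sorted(only_test))
```

With pandas, a comparable expression would be `set(X_train['Cabin'].dropna()) - set(X_test['Cabin'].dropna())`.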
Encoding and Mapping Categorical Variables
To train models, we must map our string categories to integer values. Create an integer mapping dictionary for the original Cabin feature:
import itertools
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))
{'E17': 0, 'D33': 1, nan: 2, 'D26': 3, 'B58 B60': 4, 'C128': 5, 'D17': 6, 'A14': 7, 'F33': 8, 'B19': 9, 'D21': 10, 'C148': 11, 'C30': 12, 'D56': 13, 'E24': 14, 'E40': 15, 'E31': 16, 'E44': 17, 'E38': 18, 'D37': 19, 'E8': 20, 'C92': 21, 'E63': 22, 'C125': 23, 'F4': 24, 'E67': 25, 'C126': 26, 'B73': 27, 'E36': 28, 'C78': 29, 'E46': 30, 'C111': 31, 'E101': 32, 'D15': 33, 'E12': 34, 'G6': 35, 'A32': 36, 'B4': 37, 'A10': 38, 'A5': 39, 'C95': 40, 'E25': 41, 'C90': 42, 'D6': 43, 'A36': 44, 'D': 45, 'D50': 46, 'B96 B98': 47, 'C93': 48, 'E77': 49, 'C101': 50, 'D11': 51, 'C123': 52, 'C32': 53, 'B35': 54, 'C91': 55, 'T': 56, 'B101': 57, 'E58': 58, 'A23': 59, 'B77': 60, 'D28': 61, 'B82 B84': 62, 'B79': 63, 'C45': 64, 'C2': 65, 'B5': 66, 'C104': 67, 'B20': 68, 'A19': 69, 'B51 B53 B55': 70, 'B80': 71, 'B38': 72, 'B22': 73, 'B18': 74, 'C22 C26': 75, 'A16': 76, 'F2': 77, 'D47': 78, 'E121': 79, 'C23 C25 C27': 80, 'B28': 81, 'E10': 82, 'D36': 83, 'C46': 84, 'B39': 85, 'D30': 86, 'E33': 87, 'C50': 88, 'D20': 89, 'C124': 90, 'A34': 91, 'C110': 92, 'D19': 93, 'B86': 94, 'D35': 95, 'C99': 96, 'D46': 97, 'F38': 98, 'A24': 99}
Map the string values to integers inside both train and test sets:
X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'Cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'Cabin'].map(cabin_dict)
X_train[['Cabin_mapped', 'Cabin']].head(10)
| | Cabin_mapped | Cabin |
|---|---|---|
| 857 | 0 | E17 |
| 52 | 1 | D33 |
| 386 | 2 | NaN |
| 124 | 3 | D26 |
| 578 | 2 | NaN |
| 549 | 2 | NaN |
| 118 | 4 | B58 B60 |
| 12 | 2 | NaN |
| 157 | 2 | NaN |
| 127 | 2 | NaN |
Repeat the mapping process on the Cabin_reduced variable:
# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}
# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_train[['Cabin_reduced', 'Cabin']].head(20)
| | Cabin_reduced | Cabin |
|---|---|---|
| 857 | 0 | E17 |
| 52 | 1 | D33 |
| 386 | 2 | NaN |
| 124 | 1 | D26 |
| 578 | 2 | NaN |
| 549 | 2 | NaN |
| 118 | 3 | B58 B60 |
| 12 | 2 | NaN |
| 157 | 2 | NaN |
| 127 | 2 | NaN |
| 653 | 2 | NaN |
| 235 | 2 | NaN |
| 785 | 2 | NaN |
| 241 | 2 | NaN |
| 351 | 4 | C128 |
| 862 | 1 | D17 |
| 851 | 2 | NaN |
| 753 | 2 | NaN |
| 532 | 2 | NaN |
| 485 | 2 | NaN |
Map the categorical Sex column to binary values:
X_train.loc[:, 'Sex'] = X_train.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Sex'] = X_test.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_train.Sex.head()
857 0
52 1
386 0
124 0
578 1
Name: Sex, dtype: int64
Check for missing values in the processed training set columns:
X_train[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
Cabin_mapped 0
Cabin_reduced 0
Sex 0
dtype: int64
Check for missing values in the same columns of the test set:
X_test[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
Cabin_mapped 30
Cabin_reduced 0
Sex 0
dtype: int64
The high-cardinality Cabin_mapped feature introduces 30 missing values in the test set. This happens because cabins that appear only in the test set have no entry in the dictionary built from the training set, so map() returns NaN for them.
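One way to keep unseen labels distinguishable is to reserve a dedicated code for them at encoding time. The sketch below is plain Python, not the tutorial's method. The helper names and the reserved value `-1` are assumptions chosen for illustration.

```python
def build_codes(train_labels):
    """Assign an integer code to each label in order of first appearance."""
    codes = {}
    for label in train_labels:
        if label not in codes:
            codes[label] = len(codes)
    return codes


def encode(labels, codes, unseen_code=-1):
    """Map labels to codes; labels absent from training get `unseen_code`."""
    return [codes.get(label, unseen_code) for label in labels]


# Hypothetical toy labels, with 'Missing' as an explicit missing-value label.
train_labels = ["E17", "D33", "Missing", "D26", "Missing"]
test_labels = ["D33", "C85", "Missing", "A6"]

codes = build_codes(train_labels)
encoded_test = encode(test_labels, codes)
print(codes)
print(encoded_test)
```

Because the reserved code never collides with a code learned from the training set, the model cannot confuse an unseen cabin with a known one.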
Count the unique categories of each encoded feature in the training set:
len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())
(121, 9)
Model Evaluation
We will now train several machine learning models to observe the impact of cardinality on model performance and generalization.
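The comparisons that follow all rely on ROC-AUC, computed with roc_auc_score. As a reminder of what the number means, the sketch below computes it directly as the share of (positive, negative) pairs in which the positive example receives the higher score. The function `roc_auc` is a hypothetical, quadratic-time illustration and is not how scikit-learn implements the metric.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a model that scores well above its test value on the training set is ranking memorized examples rather than generalizing.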
Random Forests
Train a Random Forest classifier using the high-cardinality Cabin_mapped feature:
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)
# train the model
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)
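# make predictions on train and test set
# note: cabins unseen in training were mapped to NaN; fillna(0) gives them
# code 0, which the Cabin mapping above already assigns to 'E17'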
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))
print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Random Forests roc-auc: 0.8617329342096702
Test set
Random Forests roc-auc: 0.8078571428571428
The model scores noticeably higher on the training set (0.86) than on the test set (0.81), a gap that points to overfitting.
Train the Random Forest classifier using the reduced cardinality Cabin_reduced feature:
# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)
# train the model
rf.fit(X_train[['Cabin_reduced', 'Sex']], y_train)
# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'Sex']])
print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Random Forests roc-auc: 0.8199550985878832
Test set
Random Forests roc-auc: 0.8332142857142857
With the reduced feature, the training score falls to 0.82 while the test score rises to 0.83. Reducing cardinality mitigates overfitting and leads to better generalization.
AdaBoost
Train an AdaBoost classifier using the high-cardinality feature:
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
# train the model
ada.fit(X_train[['Cabin_mapped', 'Sex']], y_train)
# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))
print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Adaboost roc-auc: 0.8399546647578144
Test set
Adaboost roc-auc: 0.809375
Train the AdaBoost classifier using the reduced cardinality feature:
# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)
# train the model
ada.fit(X_train[['Cabin_reduced', 'Sex']], y_train)
# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))
print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Adaboost roc-auc: 0.8195863430294354
Test set
Adaboost roc-auc: 0.8332142857142857
Similar to Random Forests, the AdaBoost model trained on the reduced variable generalizes better and is less prone to overfitting.
Logistic Regression
Train a Logistic Regression model using the high-cardinality feature:
# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')
# train the model
logit.fit(X_train[['Cabin_mapped', 'Sex']], y_train)
# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))
print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Logistic regression roc-auc: 0.8094564109238411
Test set
Logistic regression roc-auc: 0.7591071428571431
Train the Logistic Regression model using the reduced cardinality feature:
# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')
# train the model
logit.fit(X_train[['Cabin_reduced', 'Sex']], y_train)
# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))
print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Logistic regression roc-auc: 0.7672664367367301
Test set
Logistic regression roc-auc: 0.7957738095238095
Gradient Boosted Classifier
Train a Gradient Boosting classifier using the high-cardinality feature:
# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)
# train the model
gbc.fit(X_train[['Cabin_mapped', 'Sex']], y_train)
# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))
print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Gradient Boosted Trees roc-auc: 0.8731860480249887
Test set
Gradient Boosted Trees roc-auc: 0.816845238095238
Train the Gradient Boosting classifier using the reduced cardinality feature:
# model built on data with few categories in the Cabin_reduced variable
# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)
# train the model
gbc.fit(X_train[['Cabin_reduced', 'Sex']], y_train)
# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))
print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
Train set
Gradient Boosted Trees roc-auc: 0.8204756946703976
Test set
Gradient Boosted Trees roc-auc: 0.8332142857142857
Across all four models, reducing the cardinality of the variable raises the test ROC-AUC by roughly 0.02 to 0.04 and closes the gap between training and test performance.
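The scores reported above can be collected into a small summary to make the comparison explicit. The snippet below only rearranges numbers already printed in this tutorial, rounded to four decimals. The dictionary layout and the key names are assumptions made for illustration.

```python
# ROC-AUC values copied from the outputs above, rounded to four decimals.
# Each entry is (train, test).
scores = {
    "Random Forests": {"mapped": (0.8617, 0.8079), "reduced": (0.8200, 0.8332)},
    "AdaBoost": {"mapped": (0.8400, 0.8094), "reduced": (0.8196, 0.8332)},
    "Logistic Regression": {"mapped": (0.8095, 0.7591), "reduced": (0.7673, 0.7958)},
    "Gradient Boosting": {"mapped": (0.8732, 0.8168), "reduced": (0.8205, 0.8332)},
}

summary = {}
for model, by_feature in scores.items():
    train_m, test_m = by_feature["mapped"]
    train_r, test_r = by_feature["reduced"]
    summary[model] = {
        "gap_mapped": train_m - test_m,   # positive: better on train than test
        "gap_reduced": train_r - test_r,
        "test_gain": test_r - test_m,     # change in test ROC-AUC after grouping
    }

for model, row in summary.items():
    print(f"{model:20s} gap_mapped={row['gap_mapped']:+.3f} "
          f"gap_reduced={row['gap_reduced']:+.3f} "
          f"test_gain={row['test_gain']:+.3f}")
```

For every model the train-test gap is positive with Cabin_mapped and no longer positive with Cabin_reduced, while the test score improves. These figures come from a single split with fixed random seeds, so the exact sizes of the gains should be read with some caution.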
Conclusion
In this tutorial, you explored how high-cardinality categorical variables affect the performance of machine learning algorithms. Using the Titanic dataset, you observed how high cardinality leads to uneven category distributions across train-test splits, produces missing values when test-set categories are encoded with a dictionary built from the training set, and encourages overfitting. By grouping cabin values into their corresponding decks, you reduced cardinality from 148 to 9 categories and improved generalization for all four classifiers.
Key takeaways:
- High Cardinality: Features with too many categories often cause models to overfit by memorizing noise instead of learning meaningful patterns.
- Train-Test Splitting Issues: High cardinality makes it likely that some categories appear only in the training split or only in the test split, which leads to unmapped values at encoding time and unreliable predictions.
- Grouping and Dimensionality Reduction: Grouping categories based on domain knowledge (such as extracting the first letter from cabin identifiers) is a powerful way to reduce cardinality and boost model performance.
Next steps:
- Explore structured methods for missing data identification in Missing Values and Their Mechanisms.
- Refresh your dataset preprocessing capabilities in Pandas Crash Course.
