Cardinality in Machine Learning

When preparing categorical variables for machine learning models, one of the most critical characteristics to evaluate is the number of unique categories or labels.

The number of unique labels within a categorical variable is known as its cardinality. A high number of labels within a variable is referred to as high cardinality. High-cardinality features (like zip codes, cities, or IP addresses) present unique challenges for machine learning algorithms, particularly decision trees.

In this tutorial, you will explore the concept of cardinality using the Titanic dataset. You will learn how to quantify cardinality, identify the challenges it introduces during train-test splits, observe its impact on the performance of algorithms like Random Forests, AdaBoost, Logistic Regression, and Gradient Boosting, and implement a deck-based grouping technique to reduce cardinality and prevent overfitting.

Prerequisites: Python 3.x, Pandas, Scikit-learn.

High Cardinality Challenges

High cardinality in categorical variables can introduce several problems in machine learning models:

Tree Bias: Tree-based algorithms tend to favor variables with many levels over those with few levels, regardless of their actual predictive power.
Overfitting and Noise: Variables with too many labels tend to add noise while carrying little information, making the model prone to memorizing the training set.
Train-Test Inconsistency: Some labels might only be present in the training set, causing overfitting. Conversely, new labels might appear only in the test set, leaving the model unable to compute predictions on unseen data.

We will demonstrate these effects below using the Titanic dataset to predict passenger survival.

Setup and Libraries

Before processing the dataset, import the necessary data manipulation, model building, and evaluation libraries:

PYTHON

# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to perform basic array operations
import numpy as np

# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split

Load the Titanic dataset using Pandas:

PYTHON

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')
data.head()

OUTPUT

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Inspect the cardinality (the number of unique categories) of each categorical variable in the dataset:

PYTHON

print('Number of categories in the variable Name: {}'.format(
    len(data.Name.unique())))

print('Number of categories in the variable Gender: {}'.format(
    len(data.Sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
    len(data.Ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
    len(data.Embarked.unique())))

print('Total number of passengers in the Titanic: {}'.format(len(data)))

OUTPUT

Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891

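The printout confirms that Sex (2 categories) and Embarked (4 categories) have low cardinality, while Name (891 categories), Ticket (681 categories), and Cabin (148 categories) have high cardinality. Note that unique() returns NaN as one of its values, so the counts for Cabin and Embarked each include one entry for missing data.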
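
The per-column counts above could also be gathered in a single call with `DataFrame.nunique`. The sketch below uses a small hand-made frame (hypothetical values, not the Titanic file) to show one thing worth keeping in mind: `nunique()` skips NaN by default, so `dropna=False` is needed to match `len(... .unique())`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few Titanic columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
    'Cabin': ['C85', np.nan, 'E46', np.nan],
})

# unique() keeps NaN as its own entry, so NaN is counted as a category
with_nan = {col: len(toy[col].unique()) for col in toy.columns}

# nunique() ignores NaN by default; dropna=False reproduces the unique() count
without_nan = toy.nunique()
including_nan = toy.nunique(dropna=False)

print(with_nan)
print(without_nan.to_dict())
print(including_nan.to_dict())
```

Whether NaN should count as a category depends on how missing values will be handled later, so it helps to be explicit about which of the two counts is being reported.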

To demonstrate the impact of cardinality, we will focus on the Cabin column. Display the unique values of Cabin to see their original structure:

PYTHON

data.Cabin.unique()

OUTPUT

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
       'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
       'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
       'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
       'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
       'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
       'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
       'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
       'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
       'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
       'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
       'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
       'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
       'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
       'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
       'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
       'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
       'C148'], dtype=object)

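We will reduce the cardinality of Cabin by extracting only the first letter. The first letter indicates the deck (e.g., A, B, C) where the cabin was located, which acts as a proxy for social class and proximity to the ship's surface. Because astype(str) converts missing values to the string 'nan', passengers without a recorded cabin end up in their own category, n: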

PYTHON

# let's capture the first letter of Cabin
data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]

data[['Cabin', 'Cabin_reduced']].head()

OUTPUT

	Cabin	Cabin_reduced
0	NaN	n
1	C85	C
2	NaN	n
3	C123	C
4	NaN	n

Verify the new cardinality of the reduced Cabin_reduced variable:

PYTHON

print('Number of categories in the variable Cabin: {}'.format(
    len(data.Cabin.unique())))

print('Number of categories in the variable Cabin_reduced: {}'.format(
    len(data.Cabin_reduced.unique())))

OUTPUT

Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin_reduced: 9

Extracting the first letter reduces the cardinality from 148 categories down to just 9.
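
The `n` category in the output above is a side effect of converting NaN to text before slicing. A possible variant, shown here as a sketch on hypothetical toy values, labels missing cabins explicitly first. The marker `'Missing'` (and therefore the letter `M`) is an assumption for illustration; it does not clash with the deck letters that appear in the Cabin values listed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Cabin values, including multi-cabin entries and NaN
cabin = pd.Series(['C85', np.nan, 'F G73', 'B57 B59 B63 B66', np.nan, 'T'])

# Label missing cabins explicitly, then keep the first character (the deck)
deck = cabin.fillna('Missing').str[0]

print(deck.tolist())
print(deck.nunique())
```

The resulting cardinality is the same as with the original approach; only the name of the missing-value category changes.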

Split the dataset into training (70%) and testing (30%) sets:

PYTHON

use_cols = ['Cabin', 'Cabin_reduced', 'Sex']

X_train, X_test, y_train, y_test = train_test_split(
    data[use_cols],
    data['Survived'],
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape

OUTPUT

((623, 3), (268, 3))

Inconsistent Category Distributions

When variables have high cardinality, categories frequently land in the training set but not in the testing set, or vice-versa. Count the number of categories in Cabin that only exist in the training set:

PYTHON

unique_to_train_set = [
    x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]

len(unique_to_train_set)

OUTPUT

100

There are 100 cabins that only appear in the training dataset. Now count the categories that only exist in the testing dataset:

PYTHON

unique_to_test_set = [
    x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]

len(unique_to_test_set)

OUTPUT

Repeat the same analysis on the Cabin_reduced variable to evaluate the impact of reduced cardinality:

PYTHON

unique_to_train_set = [
    x for x in X_train['Cabin_reduced'].unique()
    if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)

OUTPUT

1

Count the reduced categories that appear only in the testing set:

PYTHON

unique_to_test_set = [
    x for x in X_test['Cabin_reduced'].unique()
    if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)

OUTPUT

0

After reducing cardinality, only 1 category remains unique to the training set, and no categories are unique to the test set.
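
The list comprehensions above test membership with `in` against a NumPy array, which compares by equality. Since NaN is not equal to itself, a missing value in the raw Cabin column can be counted as "unique" to both sets. A minimal sketch of an alternative, assuming missing values should be left out of the comparison, uses set differences on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical train and test Cabin values
train_cabin = pd.Series(['C85', np.nan, 'E46', 'B28', np.nan])
test_cabin = pd.Series(['C85', np.nan, 'D56'])

# Drop missing values first, then compare the label sets directly
train_labels = set(train_cabin.dropna().unique())
test_labels = set(test_cabin.dropna().unique())

only_in_train = sorted(train_labels - test_labels)
only_in_test = sorted(test_labels - train_labels)

print(only_in_train)
print(only_in_test)
```

Set operations also avoid recomputing `unique()` for every element, which matters little here but adds up on larger columns.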

Encoding and Mapping Categorical Variables

To train models, we must map our string categories to integer values. Create an integer mapping dictionary for the original Cabin feature:

PYTHON

import itertools

cabin_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))

OUTPUT

{'E17': 0, 'D33': 1, nan: 2, 'D26': 3, 'B58 B60': 4, 'C128': 5, 'D17': 6, 'A14': 7, 'F33': 8, 'B19': 9, 'D21': 10, 'C148': 11, 'C30': 12, 'D56': 13, 'E24': 14, 'E40': 15, 'E31': 16, 'E44': 17, 'E38': 18, 'D37': 19, 'E8': 20, 'C92': 21, 'E63': 22, 'C125': 23, 'F4': 24, 'E67': 25, 'C126': 26, 'B73': 27, 'E36': 28, 'C78': 29, 'E46': 30, 'C111': 31, 'E101': 32, 'D15': 33, 'E12': 34, 'G6': 35, 'A32': 36, 'B4': 37, 'A10': 38, 'A5': 39, 'C95': 40, 'E25': 41, 'C90': 42, 'D6': 43, 'A36': 44, 'D': 45, 'D50': 46, 'B96 B98': 47, 'C93': 48, 'E77': 49, 'C101': 50, 'D11': 51, 'C123': 52, 'C32': 53, 'B35': 54, 'C91': 55, 'T': 56, 'B101': 57, 'E58': 58, 'A23': 59, 'B77': 60, 'D28': 61, 'B82 B84': 62, 'B79': 63, 'C45': 64, 'C2': 65, 'B5': 66, 'C104': 67, 'B20': 68, 'A19': 69, 'B51 B53 B55': 70, 'B80': 71, 'B38': 72, 'B22': 73, 'B18': 74, 'C22 C26': 75, 'A16': 76, 'F2': 77, 'D47': 78, 'E121': 79, 'C23 C25 C27': 80, 'B28': 81, 'E10': 82, 'D36': 83, 'C46': 84, 'B39': 85, 'D30': 86, 'E33': 87, 'C50': 88, 'D20': 89, 'C124': 90, 'A34': 91, 'C110': 92, 'D19': 93, 'B86': 94, 'D35': 95, 'C99': 96, 'D46': 97, 'F38': 98, 'A24': 99}

Map the string values to integers inside both train and test sets:

PYTHON

X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'Cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'Cabin'].map(cabin_dict)

X_train[['Cabin_mapped', 'Cabin']].head(10)

OUTPUT

	Cabin_mapped	Cabin
857	0	E17
52	1	D33
386	2	NaN
124	3	D26
578	2	NaN
549	2	NaN
118	4	B58 B60
12	2	NaN
157	2	NaN
127	2	NaN

Repeat the mapping process on the Cabin_reduced variable:

PYTHON

# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

X_train[['Cabin_reduced', 'Cabin']].head(20)

OUTPUT

	Cabin_reduced	Cabin
857	0	E17
52	1	D33
386	2	NaN
124	1	D26
578	2	NaN
549	2	NaN
118	3	B58 B60
12	2	NaN
157	2	NaN
127	2	NaN
653	2	NaN
235	2	NaN
785	2	NaN
241	2	NaN
351	4	C128
862	1	D17
851	2	NaN
753	2	NaN
532	2	NaN
485	2	NaN

Map the categorical Sex column to binary values:

PYTHON

X_train.loc[:, 'Sex'] = X_train.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Sex'] = X_test.loc[:, 'Sex'].map({'male': 0, 'female': 1})

X_train.Sex.head()

OUTPUT

857    0
52     1
386    0
124    0
578    1
Name: Sex, dtype: int64

Verify missing values in the processed training set columns:

PYTHON

X_train[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()

OUTPUT

Cabin_mapped     0
Cabin_reduced    0
Sex              0
dtype: int64

Verify missing values in the test set columns:

PYTHON

X_test[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()

OUTPUT

Cabin_mapped     30
Cabin_reduced     0
Sex               0
dtype: int64

The high-cardinality Cabin_mapped feature introduces 30 missing values in the test set. This is because the categories unique to the test set were not present in the dictionary constructed from the training set, mapping them to NaN.
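
The modelling code later in this tutorial replaces these NaNs with `fillna(0)`, but 0 is already the code of a real cabin in the dictionary printed earlier (`'E17'`), so unseen cabins are silently treated as that cabin. One possible alternative, sketched below on hypothetical toy data, reserves a separate code for unseen labels. The helper name `encode_with_unknown` and the value `-1` are assumptions for illustration, and missing cabins are assumed to have been labelled `'Missing'` beforehand.

```python
import pandas as pd

def encode_with_unknown(train, test, unknown_code=-1):
    """Integer-encode labels seen in train; unseen test labels get unknown_code."""
    mapping = {label: code for code, label in enumerate(train.unique())}
    train_encoded = train.map(mapping).astype(int)
    # Labels absent from the mapping become NaN, then the reserved code
    test_encoded = test.map(mapping).fillna(unknown_code).astype(int)
    return train_encoded, test_encoded, mapping

# Hypothetical values; 'A99' never appears in the training data
train = pd.Series(['E17', 'D33', 'Missing', 'D26', 'Missing'])
test = pd.Series(['D33', 'Missing', 'A99'])

train_enc, test_enc, mapping = encode_with_unknown(train, test)
print(mapping)
print(test_enc.tolist())
```

A reserved code keeps unseen labels distinguishable from every training label, although the model still has no training examples for that code.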

Show the count of unique classes in each feature:

PYTHON

len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())

OUTPUT

(121, 9)

Model Evaluation

We will now train several machine learning models to observe the impact of cardinality on model performance and generalization.

Random Forests

Train a Random Forest classifier using the high-cardinality Cabin_mapped feature:

PYTHON

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Random Forests roc-auc: 0.8617329342096702
Test set
Random Forests roc-auc: 0.8078571428571428

The model scores higher on the training set (0.86) than on the test set (0.81), which indicates overfitting.

Train the Random Forest classifier using the reduced cardinality Cabin_reduced feature:

PYTHON

# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'Sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Random Forests roc-auc: 0.8199550985878832
Test set
Random Forests roc-auc: 0.8332142857142857

Reducing cardinality mitigates overfitting, leading to better generalization and a higher ROC-AUC score on the test set.
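
The fit, predict, and score steps above are repeated for every model and feature set in this tutorial. A small helper can wrap that pattern. This is only a sketch: the function name `train_test_auc` is hypothetical, and the data below is synthetic stand-in data rather than the Titanic file, so the printed scores are not comparable to the ones reported here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_test_auc(model, X_train, X_test, y_train, y_test, features):
    """Fit model on the given feature columns and return (train AUC, test AUC)."""
    model.fit(X_train[features], y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train[features])[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test[features])[:, 1])
    return train_auc, test_auc

# Synthetic stand-in data (hypothetical): 9 integer-coded decks and a binary Sex
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    'Cabin_reduced': rng.integers(0, 9, size=n),
    'Sex': rng.integers(0, 2, size=n),
})
# Target loosely tied to Sex so the model has some signal to learn
y = pd.Series((rng.random(n) < 0.25 + 0.5 * X['Sex']).astype(int))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=39)
train_auc, test_auc = train_test_auc(
    rf_demo, X_tr, X_te, y_tr, y_te, ['Cabin_reduced', 'Sex'])

print('train roc-auc: {:.3f}, test roc-auc: {:.3f}'.format(train_auc, test_auc))
```

Returning both scores from one call makes it easier to compare the train-test gap for the high-cardinality and reduced features side by side.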

AdaBoost

Train an AdaBoost classifier using the high-cardinality feature:

PYTHON

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Adaboost roc-auc: 0.8399546647578144
Test set
Adaboost roc-auc: 0.809375

Train the AdaBoost classifier using the reduced cardinality feature:

PYTHON

# call the model
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Adaboost roc-auc: 0.8195863430294354
Test set
Adaboost roc-auc: 0.8332142857142857

Similar to Random Forests, the AdaBoost model trained on the reduced variable generalizes better and is less prone to overfitting.

Logistic Regression

Train a Logistic Regression model using the high-cardinality feature:

PYTHON

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Logistic regression roc-auc: 0.8094564109238411
Test set
Logistic regression roc-auc: 0.7591071428571431

Train the Logistic Regression model using the reduced cardinality feature:

PYTHON

# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Logistic regression roc-auc: 0.7672664367367301
Test set
Logistic regression roc-auc: 0.7957738095238095

Gradient Boosted Classifier

Train a Gradient Boosting classifier using the high-cardinality feature:

PYTHON

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Gradient Boosted Trees roc-auc: 0.8731860480249887
Test set
Gradient Boosted Trees roc-auc: 0.816845238095238

Train the Gradient Boosting classifier using the reduced cardinality feature:

PYTHON

# model built on data with fewer categories in the Cabin variable

# call the model
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))

OUTPUT

Train set
Gradient Boosted Trees roc-auc: 0.8204756946703976
Test set
Gradient Boosted Trees roc-auc: 0.8332142857142857

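Reducing the cardinality of the variable raises the test ROC-AUC of every model evaluated, by roughly 0.02 to 0.04. With the high-cardinality feature, each model scores 0.03 to 0.06 higher on the training set than on the test set; with the reduced feature, that gap disappears.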
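
To compare the models at a glance, the ROC-AUC values printed in the outputs above can be collected into one table. The sketch below copies those values by hand, rounded to four decimals, and derives the train-test gap for each feature set plus the change in test score. The column names are chosen for illustration.

```python
import pandas as pd

# ROC-AUC values copied from the outputs above, rounded to 4 decimals
scores = pd.DataFrame(
    {
        'train_high': [0.8617, 0.8400, 0.8095, 0.8732],
        'test_high': [0.8079, 0.8094, 0.7591, 0.8168],
        'train_reduced': [0.8200, 0.8196, 0.7673, 0.8205],
        'test_reduced': [0.8332, 0.8332, 0.7958, 0.8332],
    },
    index=['Random Forests', 'AdaBoost', 'Logistic Regression', 'Gradient Boosting'],
)

# Train-test gap: a large positive value points to overfitting
scores['gap_high'] = scores['train_high'] - scores['test_high']
scores['gap_reduced'] = scores['train_reduced'] - scores['test_reduced']

# Change in test ROC-AUC after reducing cardinality
scores['test_gain'] = scores['test_reduced'] - scores['test_high']

print(scores[['gap_high', 'gap_reduced', 'test_gain']].round(4))
```

These figures come from a single train-test split, so small differences between models should be read with caution.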

Conclusion

In this tutorial, you explored how high-cardinality categorical variables affect the performance of machine learning algorithms. Using the Titanic dataset, you observed how high cardinality leads to uneven category distribution across train-test splits, introduces null values upon encoding, and results in model overfitting. By grouping cabin values into their corresponding decks, you reduced cardinality from 148 to 9 categories and improved test-set ROC-AUC for all four classifiers.

Key takeaways:

High Cardinality: Features with too many categories often cause models to overfit by memorizing noise instead of learning meaningful patterns.
Train-Test Inconsistency: High cardinality causes categories to appear only in the training split or only in the testing split, leading to encoding and prediction errors.
Grouping to Reduce Cardinality: Grouping categories based on domain knowledge (such as extracting the first letter from cabin identifiers) is a powerful way to reduce cardinality and improve generalization.

Next steps:

Explore structured methods for missing data identification in Missing Values and Their Mechanisms.
Refresh your dataset preprocessing capabilities in Pandas Crash Course.
