# Feature Engineering Series Tutorial 2: Cardinality in Machine Learning

Cardinality refers to the number of possible values that a feature can assume. For example, the variable “US State” has 50 possible values. A binary feature, by contrast, can assume only one of two values (0 or 1).

The values of a categorical variable are selected from a group of categories, also called labels. For example, in the variable gender the categories or labels are male and female, whereas in the variable city the labels can be London, Manchester, Brighton and so on.

Different categorical variables contain different numbers of labels or categories. The variable gender contains only 2 labels, but a variable like city or postcode can contain a huge number of different labels.

The number of different labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality.
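As a minimal sketch of this definition, cardinality can be counted with pandas `nunique()`. The small DataFrame below is made up for illustration and is not part of the Titanic data.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'female'],
    'city': ['London', 'Manchester', 'Brighton', 'London', 'Leeds'],
})

# Number of distinct labels per column, i.e. the cardinality
cardinality = toy.nunique()
print(cardinality)
```

Here `gender` has 2 labels (low cardinality) and `city` has 4. On real data, a column like city would typically have far more.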

### Are multiple labels in a categorical variable a problem?

High cardinality may pose the following problems:

• Variables with too many labels tend to dominate over those with only a few labels, particularly in tree-based algorithms.
• A large number of labels within a variable may introduce noise with little, if any, information, making machine learning models prone to over-fitting.
• Some of the labels may be present only in the training set and not in the test set, so machine learning algorithms may over-fit to the training set.
• Conversely, some labels may appear only in the test set, leaving the machine learning algorithms unable to perform a calculation over the new (unseen) observation.

In particular, tree methods can be biased towards variables with lots of labels (variables with high cardinality). Thus, their performance may be affected by high cardinality.

Below we will see the effect of high cardinality of variables on the performance of different machine learning algorithms and how a quick fix to reduce the number of labels, without any sort of data insight, helps to boost the performance.

## In this Blog:

We will:

• Learn how to quantify cardinality
• See examples of high and low cardinality variables
• Understand the effect of cardinality while preparing train and test sets
• See the effect of cardinality on Machine Learning Model performance

We will use the Titanic dataset.

## Let’s start!

We will first import all the necessary libraries.

```# to read the dataset into a dataframe and perform operations on it
import pandas as pd

# to perform basic array operations
import numpy as np

# to build machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# to evaluate the models
from sklearn.metrics import roc_auc_score

# to separate data into train and test
from sklearn.model_selection import train_test_split
```

Now we will read the Titanic dataset using `read_csv()`. `head()` shows the first 5 rows of the dataframe. The categorical variables in this dataset are `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked`.

Note: `Ticket` and `Cabin` contain both letters and numbers, so they could be treated as mixed variables. For this demonstration, we will treat them as categorical variables.

```data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv')
data.head()
```

Let’s inspect the cardinality of each categorical variable in the dataset.

```print('Number of categories in the variable Name: {}'.format(
len(data.Name.unique())))

print('Number of categories in the variable Gender: {}'.format(
len(data.Sex.unique())))

print('Number of categories in the variable Ticket: {}'.format(
len(data.Ticket.unique())))

print('Number of categories in the variable Cabin: {}'.format(
len(data.Cabin.unique())))

print('Number of categories in the variable Embarked: {}'.format(
len(data.Embarked.unique())))

print('Total number of passengers in the Titanic: {}'.format(len(data)))
```
```Number of categories in the variable Name: 891
Number of categories in the variable Gender: 2
Number of categories in the variable Ticket: 681
Number of categories in the variable Cabin: 148
Number of categories in the variable Embarked: 4
Total number of passengers in the Titanic: 891
```

While the variable `Sex` contains only 2 categories and `Embarked` contains 4 (low cardinality), the variables `Ticket`, `Name` and `Cabin`, as expected, contain a huge number of different labels (high cardinality).
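One detail worth keeping in mind when reading these counts: `len(x.unique())` counts a missing value (`NaN`) as a label, while `nunique()` skips it by default. The sketch below uses a made-up Series to show the difference, which may explain counts such as the 4 labels reported for `Embarked` if that column contains missing values.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value (not the real Titanic data)
embarked = pd.Series(['S', 'C', 'Q', 'S', np.nan])

with_nan = len(embarked.unique())           # NaN is counted as a label
without_nan = embarked.nunique()            # NaN is ignored (dropna=True by default)
explicit = embarked.nunique(dropna=False)   # NaN is counted again

print(with_nan, without_nan, explicit)
```

Whichever convention is used, it helps to apply it consistently when comparing variables.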

To demonstrate the effect of high cardinality in train and test sets and machine learning performance, we will work with the variable `Cabin`. We will create a new variable with reduced cardinality.

We will begin by exploring the values in the variable `Cabin`. As we saw in the previous cell, there are 148 unique values. We will display these values using `unique()`.

```data.Cabin.unique()
```
```array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6',
'C23 C25 C27', 'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33',
'F G73', 'E31', 'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101',
'F E69', 'D47', 'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4',
'A32', 'B4', 'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35',
'C87', 'B77', 'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19',
'B49', 'D', 'C22 C26', 'C106', 'C65', 'E36', 'C54',
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40',
'T', 'C128', 'D37', 'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44',
'A34', 'C104', 'C111', 'C92', 'E38', 'D21', 'E12', 'E63', 'A14',
'B37', 'C30', 'D20', 'B79', 'E25', 'D46', 'B73', 'C95', 'B38',
'B39', 'B22', 'C86', 'C70', 'A16', 'C101', 'C68', 'A10', 'E68',
'B41', 'A20', 'D19', 'D50', 'D9', 'A23', 'B50', 'A26', 'D48',
'E58', 'C126', 'B71', 'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63',
'C62 C64', 'E24', 'C90', 'C45', 'E8', 'B101', 'D45', 'C46', 'D30',
'E121', 'D11', 'E77', 'F38', 'B3', 'D6', 'B82 B84', 'D17', 'A36',
'B102', 'B69', 'E49', 'C47', 'D28', 'E17', 'A24', 'C50', 'B42',
'C148'], dtype=object)```

Now we will reduce the cardinality of the variable. To do so, instead of using the entire cabin value, we will retain only the first letter in `Cabin_reduced`.

Rationale: The first letter indicates the deck on which the cabin was located, and is therefore an indication of both social class status and proximity to the surface of the Titanic. Both are known to improve the probability of survival.

```# let's capture the first letter of Cabin
data['Cabin_reduced'] = data['Cabin'].astype(str).str[0]
```

Now let’s check the cardinality of `Cabin_reduced`. We reduced the number of different labels from 148 to 9.

```print('Number of categories in the variable Cabin: {}'.format(
len(data.Cabin.unique())))

print('Number of categories in the variable Cabin_reduced: {}'.format(
len(data.Cabin_reduced.unique())))
```
```Number of categories in the variable Cabin: 148
Number of categories in the variable Cabin_reduced: 9
```
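Missing cabins also end up as a category of their own, and the count of 9 includes it. As a hedged alternative sketch, the missing values can be filled with an explicit placeholder before taking the first letter, so the result does not depend on how missing values are converted to strings. The placeholder `'Missing'` and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# A few cabin values in the style of the dataset, plus one missing entry
cabin = pd.Series([np.nan, 'C85', 'B57 B59 B63 B66', 'F G73', 'T'])

# Assumed placeholder label for missing cabins; its first letter is 'M'
reduced = cabin.fillna('Missing').str[0]

print(reduced.tolist())
```

Cabins listing several rooms, such as 'B57 B59 B63 B66', still collapse to a single deck letter.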

Now we will split the data into training and testing sets with the help of `train_test_split()`. `use_cols` contains the variables of the feature space, i.e. the variables that provide the information necessary for prediction. `Survived` contains the values to be predicted. Setting `test_size=0.3` keeps 30% of the data for testing and uses the remaining 70% for training the model. `random_state` controls the shuffling applied to the data before the split.

```use_cols = ['Cabin', 'Cabin_reduced', 'Sex']

X_train, X_test, y_train, y_test = train_test_split(
data[use_cols],
data['Survived'],
test_size=0.3,
random_state=0)

X_train.shape, X_test.shape
```
`((623, 3), (268, 3))`

As you can see from the previous cell the training set contains 623 rows and the test dataset contains 268 rows.
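To make the roles of `test_size` and `random_state` more concrete, here is a small sketch on a made-up 10-row DataFrame (the column names `x` and `y` are placeholders). It checks that 30% of the rows go to the test set and that repeating the call with the same `random_state` reproduces the same split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration only
toy = pd.DataFrame({'x': range(10), 'y': [0, 1] * 5})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['x']], toy['y'], test_size=0.3, random_state=0)

# Same random_state, so the shuffling (and hence the split) is repeated
X_tr2, X_te2, _, _ = train_test_split(
    toy[['x']], toy['y'], test_size=0.3, random_state=0)

same_split = X_te.index.equals(X_te2.index)
print(X_tr.shape, X_te.shape, same_split)
```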

### High cardinality leads to uneven distribution of categories in train and test sets

When a variable has high cardinality, some categories often land only in the training set, or only in the testing set. If present only in the training set, they may lead to over-fitting. If present only in the testing set, the machine learning algorithm will not know how to handle them, as it has not seen them during training.

We will find the number of labels in `Cabin` which are present only in the training set and are not present in the test dataset.

```unique_to_train_set = [
x for x in X_train.Cabin.unique() if x not in X_test.Cabin.unique()
]

len(unique_to_train_set)
```
`100`

There are 100 cabins that are present only in the training set, and not in the testing set. Similarly, we will compute the number of labels present only in the test set and not in the training set.

```unique_to_test_set = [
x for x in X_test.Cabin.unique() if x not in X_train.Cabin.unique()
]

len(unique_to_test_set)
```
`28`

This problem can be overcome by reducing the cardinality of the variable. Let’s find out the number of labels present only in the training set for Cabin with reduced cardinality, i.e. `Cabin_reduced`.

```unique_to_train_set = [
x for x in X_train['Cabin_reduced'].unique()
if x not in X_test['Cabin_reduced'].unique()
]

len(unique_to_train_set)
```
`1`

Now we will find the number of labels present only in the test set for Cabin with reduced cardinality i.e. `Cabin_reduced`.

```unique_to_test_set = [
x for x in X_test['Cabin_reduced'].unique()
if x not in X_train['Cabin_reduced'].unique()
]

len(unique_to_test_set)
```
`0`

Observe how, by reducing the cardinality, there is now only 1 label in the training set that is not present in the test set, and there are no labels in the test set that are absent from the training set.
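The same comparison can be sketched with Python sets, which avoids rebuilding the array of unique values for every label. The two Series below are made up for illustration. Missing values are dropped first, since `NaN` does not compare equal to itself and can be miscounted in membership checks.

```python
import pandas as pd

# Toy train and test columns, invented for illustration only
train_cabin = pd.Series(['C85', 'E46', 'B28', 'C85', 'D33'])
test_cabin = pd.Series(['C85', 'G6', 'D33'])

train_labels = set(train_cabin.dropna())
test_labels = set(test_cabin.dropna())

# Set difference: labels seen on one side only
only_in_train = train_labels - test_labels
only_in_test = test_labels - train_labels

print(sorted(only_in_train), sorted(only_in_test))
```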

### Effect of cardinality on Machine Learning Model Performance

In order to evaluate the effect of categorical variables in machine learning models, we will quickly replace the categories by numbers. We will re-map `Cabin` to numbers so we can use it to train ML models.

Note: This is neither the only nor the best way to encode categorical variables into numbers.

Here `itertools` is just used to display the first 100 elements in the newly created dictionary.

```import itertools

cabin_dict = {k: i for i, k in enumerate(X_train['Cabin'].unique(), 0)}
print(dict(itertools.islice(cabin_dict.items(), 100)))
```
```{'E17': 0, 'D33': 1, nan: 2, 'D26': 3, 'B58 B60': 4, 'C128': 5, 'D17': 6, 'A14': 7, 'F33': 8, 'B19': 9, 'D21': 10, 'C148': 11, 'C30': 12, 'D56': 13, 'E24': 14, 'E40': 15, 'E31': 16, 'E44': 17, 'E38': 18, 'D37': 19, 'E8': 20, 'C92': 21, 'E63': 22, 'C125': 23, 'F4': 24, 'E67': 25, 'C126': 26, 'B73': 27, 'E36': 28, 'C78': 29, 'E46': 30, 'C111': 31, 'E101': 32, 'D15': 33, 'E12': 34, 'G6': 35, 'A32': 36, 'B4': 37, 'A10': 38, 'A5': 39, 'C95': 40, 'E25': 41, 'C90': 42, 'D6': 43, 'A36': 44, 'D': 45, 'D50': 46, 'B96 B98': 47, 'C93': 48, 'E77': 49, 'C101': 50, 'D11': 51, 'C123': 52, 'C32': 53, 'B35': 54, 'C91': 55, 'T': 56, 'B101': 57, 'E58': 58, 'A23': 59, 'B77': 60, 'D28': 61, 'B82 B84': 62, 'B79': 63, 'C45': 64, 'C2': 65, 'B5': 66, 'C104': 67, 'B20': 68, 'A19': 69, 'B51 B53 B55': 70, 'B80': 71, 'B38': 72, 'B22': 73, 'B18': 74, 'C22 C26': 75, 'A16': 76, 'F2': 77, 'D47': 78, 'E121': 79, 'C23 C25 C27': 80, 'B28': 81, 'E10': 82, 'D36': 83, 'C46': 84, 'B39': 85, 'D30': 86, 'E33': 87, 'C50': 88, 'D20': 89, 'C124': 90, 'A34': 91, 'C110': 92, 'D19': 93, 'B86': 94, 'D35': 95, 'C99': 96, 'D46': 97, 'F38': 98, 'A24': 99}
```

Now we will replace the labels in `Cabin` using the dictionary `cabin_dict` created above. The numerical values will be stored in `Cabin_mapped`.

```X_train.loc[:, 'Cabin_mapped'] = X_train.loc[:, 'Cabin'].map(cabin_dict)
X_test.loc[:, 'Cabin_mapped'] = X_test.loc[:, 'Cabin'].map(cabin_dict)

```

We can see that NaN takes the value 2 in the new variable, E17 takes the value 0, D33 takes the value 1, and so on. Now we will replace the letters in the `Cabin_reduced` variable with numbers following the same procedure as above.

```# create replace dictionary
cabin_dict = {k: i for i, k in enumerate(X_train['Cabin_reduced'].unique(), 0)}

# replace labels by numbers with dictionary
X_train.loc[:, 'Cabin_reduced'] = X_train.loc[:, 'Cabin_reduced'].map(cabin_dict)
X_test.loc[:, 'Cabin_reduced'] = X_test.loc[:, 'Cabin_reduced'].map(cabin_dict)

```

We see now that D33 and D26 correspond to the same number, 1, because we are capturing only the first letter. They both start with D.

Now we will map the categorical variable `Sex` to numbers.

```X_train.loc[:, 'Sex'] = X_train.loc[:, 'Sex'].map({'male': 0, 'female': 1})
X_test.loc[:, 'Sex'] = X_test.loc[:, 'Sex'].map({'male': 0, 'female': 1})

X_train.Sex.head()
```
```857    0
52     1
386    0
124    0
578    1
Name: Sex, dtype: int64```

Next we will check if there are any missing values in these variables in the training as well as testing dataset.

```X_train[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
```
```Cabin_mapped     0
Cabin_reduced    0
Sex              0
dtype: int64```
```X_test[['Cabin_mapped', 'Cabin_reduced', 'Sex']].isnull().sum()
```
```Cabin_mapped     30
Cabin_reduced     0
Sex               0
dtype: int64```

In the test set, there are now 30 missing values for the highly cardinal variable `Cabin_mapped`. These were introduced while encoding the categories into numbers.

Why?

Many categories exist only in the test set. Thus, when we created our encoding dictionary using only the train set, we did not generate a number to replace those labels present only in the test set. As a consequence, they were encoded as NaN. We will see in future notebooks how to tackle this problem. For now, we will fill those missing values with 0.
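Note that filling with 0 gives unseen cabins the same code as the first label in the training dictionary. One possible alternative, sketched here on made-up data, is to reserve a sentinel code such as -1 for unseen labels. The sentinel is an assumption made for illustration, not the approach used in this tutorial.

```python
import pandas as pd

# Toy columns, invented for illustration only
train = pd.Series(['E17', 'D33', 'C85'])
test = pd.Series(['D33', 'G6', 'C85'])   # 'G6' never appears in train

# Encoding dictionary built from the training labels only
mapping = {label: i for i, label in enumerate(train.unique())}

# Labels missing from the dictionary become NaN
mapped = test.map(mapping)
n_unseen = int(mapped.isnull().sum())

# Assumed choice: -1 cannot collide with a real code such as 0 ('E17')
encoded = mapped.fillna(-1).astype(int)

print(n_unseen, encoded.tolist())
```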

Let’s check the number of different categories in the encoded variables.

```len(X_train.Cabin_mapped.unique()), len(X_train.Cabin_reduced.unique())
```
`(121, 9)`

From here we can conclude that from the original 148 cabins in the dataset, only 121 are present in the training set. We also see how we reduced the number of different categories to just 9 in our previous step.

Let’s go ahead and evaluate the effect of labels on machine learning algorithms.

### Random Forests

We will build the model on data with high cardinality for cabin and then predict using that model.

```# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Random Forests roc-auc: 0.8617329342096702
Test set
Random Forests roc-auc: 0.8078571428571428
```

We observe that the performance of the Random Forests on the training set is quite superior to its performance in the test set. This indicates that the model is over-fitting, which means that it does a great job at predicting the outcome on the dataset it was trained on, but it lacks the power to generalise the prediction for unseen data.

Now we will build the model on data with low cardinality for cabin and then predict using that model.

```# call the model
rf = RandomForestClassifier(n_estimators=200, random_state=39)

# train the model
rf.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = rf.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = rf.predict_proba(X_test[['Cabin_reduced', 'Sex']])

print('Train set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Random Forests roc-auc: 0.8199550985878832
Test set
Random Forests roc-auc: 0.8332142857142857
```

We can see now that the Random Forests model no longer over-fits to the training set. In addition, the model is much better at generalising the predictions.
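One simple way to put a number on this is the gap between train and test ROC-AUC. The sketch below only does arithmetic on the scores printed above; `auc_gap` is a hypothetical helper name introduced for illustration.

```python
# ROC-AUC values copied from the Random Forests outputs above
high_card = {'train': 0.8617329342096702, 'test': 0.8078571428571428}
low_card = {'train': 0.8199550985878832, 'test': 0.8332142857142857}


def auc_gap(scores):
    """Train minus test ROC-AUC; a large positive gap hints at over-fitting."""
    return scores['train'] - scores['test']


gap_high = auc_gap(high_card)
gap_low = auc_gap(low_card)

print(round(gap_high, 4), round(gap_low, 4))
```

The gap is roughly 0.054 with the high-cardinality variable and slightly negative with the reduced one.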

Note: We can overcome the effect of high cardinality by adjusting the hyper-parameters of the random forests. That goes beyond the scope of this blog. Here, I want to show you that for the same model, with identical hyper-parameters, high cardinality may cause the model to over-fit.

### AdaBoost

We will build the model on data with high cardinality for cabin and then predict using that model.

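```# call the model
# NOTE: the model definition was missing from this cell; the hyper-parameters below are assumed
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```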
```Train set
Test set
```

Now we will build the model on data with low cardinality for cabin and then predict using that model.

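```# call the model
# NOTE: the model definition was missing from this cell; the hyper-parameters below are assumed
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=200, random_state=44)

# train the model
ada.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = ada.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = ada.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Adaboost roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```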
```Train set
Test set
```

The AdaBoost model trained on the variable with high cardinality also over-fits to the training set, whereas the AdaBoost model trained on the low-cardinality variable does not over-fit and therefore generalises better.

In addition, building an AdaBoost model on a Cabin variable with fewer categories is a) simpler, and b) more robust: should a new cabin appear in the test set, taking just the first letter means the ML model will know how to handle it, because that letter was seen during training.

### Logistic Regression

We will build the model on data with high cardinality for cabin and then predict using that model.

```# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Logistic regression roc-auc: 0.8094564109238411
Test set
Logistic regression roc-auc: 0.7591071428571431
```

Now we will build the model on data with low cardinality for cabin and then predict using that model.

```# call the model
logit = LogisticRegression(random_state=44, solver='lbfgs')

# train the model
logit.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = logit.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = logit.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Logistic regression roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Logistic regression roc-auc: 0.7672664367367301
Test set
Logistic regression roc-auc: 0.7957738095238095
```

We can draw the same conclusion for Logistic Regression: reducing the cardinality improves the performance and generalisation of the algorithm.

### Gradient Boosted Trees

We will build the model on data with high cardinality for cabin and then predict using that model.

```# call the model
# NOTE: the model definition was missing from this cell; the hyper-parameters below are assumed
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_mapped', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_mapped', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_mapped', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Test set
```

Now we will build the model on data with low cardinality for cabin and then predict using that model.

```# model built on data with fewer categories in Cabin variable

# call the model
# NOTE: the model definition was missing from this cell; the hyper-parameters below are assumed
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=300, random_state=44)

# train the model
gbc.fit(X_train[['Cabin_reduced', 'Sex']], y_train)

# make predictions on train and test set
pred_train = gbc.predict_proba(X_train[['Cabin_reduced', 'Sex']])
pred_test = gbc.predict_proba(X_test[['Cabin_reduced', 'Sex']].fillna(0))

print('Train set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_train, pred_train[:,1])))
print('Test set')
print('Gradient Boosted Trees roc-auc: {}'.format(roc_auc_score(y_test, pred_test[:,1])))
```
```Train set
Test set
```

We can see that all the algorithms give better performance when the cardinality of the variables is low.

#### Aarya

Hi, I am Aarya Tadvalkar! Currently, I am pursuing Computer Engineering. I have a keen interest in Machine Learning and Data Science. I am always enthusiastic about learning new things and expanding my knowledge!
