Feature Engineering: Rare Labels

When we work with categorical features in real-world datasets, we will almost always encounter rare labels: category values that appear in only a small fraction of our rows. A label that shows up in fewer than 5 % of all records gives our model very little data to learn from. That limited exposure creates two problems: the model may overfit to the noise around that label, or the label may be missing entirely from either the training or the test split.

In this blog, we work with three categorical columns from the Ames housing dataset, Neighborhood, Exterior1st, and Exterior2nd, and the numeric target SalePrice. We measure category frequencies and visualise their relationship with the target. Then we group rare categories into a single bucket and see how rare labels split unevenly between training and test sets.

Prerequisites: Python 3.x, pandas, NumPy, Matplotlib, scikit-learn.

Setting Up: Imports and Data

Import the libraries we need for this blog. pandas handles tabular data, NumPy supports array operations, Matplotlib draws the charts, and train_test_split from scikit-learn divides the dataset into training and test portions.

PYTHON

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Load only the four columns needed for this demo.

The variable definitions are:

Neighborhood: the physical location within Ames city limits
Exterior1st: the primary exterior covering material
Exterior2nd: the secondary exterior covering material (when more than one material is used)
SalePrice: the final sale price of the house (the prediction target)

PYTHON

use_cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/houseprice.csv', usecols=use_cols)
data.head()

OUTPUT

	Neighborhood	Exterior1st	Exterior2nd	SalePrice
0	CollgCr	VinylSd	VinylSd	208500
1	Veenker	MetalSd	MetalSd	181500
2	CollgCr	VinylSd	VinylSd	223500
3	Crawfor	Wd Sdng	Wd Shng	140000
4	NoRidge	VinylSd	VinylSd	250000

Identifying Rare Labels

Counting Unique Categories

Before we can identify rare labels, we need to know how many unique categories each variable contains, a property called cardinality. High cardinality means more distinct values; many of them may be rare.

Count the distinct categories in each variable using nunique():

PYTHON

# these are the loaded categorical variables
cat_cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd']

for col in cat_cols:
    print('variable: ', col, ' number of labels: ', data[col].nunique())

print('total houses: ', len(data))

OUTPUT

variable:  Neighborhood  number of labels:  25
variable:  Exterior1st  number of labels:  15
variable:  Exterior2nd  number of labels:  16
total houses:  1460

Neighborhood has 25 distinct values across 1 460 houses, meaning many individual categories will represent only a tiny fraction of the data.

Visualising Category Frequency

To flag which categories are rare, plot the percentage of houses that belong to each category and draw a horizontal threshold line at 5 %. Any bar below that line is a rare label.

PYTHON

total_houses = len(data)

# for each categorical variable
for col in cat_cols:

    # count the number of houses per category and divide by total houses,
    # i.e. the fraction of houses per category

    temp_df = pd.Series(data[col].value_counts() / total_houses)

    # plot percentages as bar chart
    fig = temp_df.sort_values(ascending=False).plot.bar()
    fig.set_xlabel(col)

    # add a line at 5 % to flag the threshold for rare categories
    fig.axhline(y=0.05, color='red')
    fig.set_ylabel('Percentage of houses')
    plt.show()

The bar chart below shows category frequencies for Neighborhood. The red horizontal line marks the 5 % threshold. Bars below it are rare labels:

Bar plot showing the percentage of houses per neighborhood category with a red threshold line at 5%

The same frequency chart for Exterior1st shows that only VinylSd, HdBoard, MetalSd, Wd Sdng, and Plywood appear in more than 5 % of houses:

Bar plot showing the percentage of houses per Exterior1st category with a red threshold line at 5%

For Exterior2nd, even more categories fall below the 5 % threshold:

Bar plot showing the percentage of houses per Exterior2nd category with a red threshold line at 5%

Across all three variables, a handful of dominant categories account for the majority of observations while many categories sit well below 5 %. These infrequent categories are the rare labels that can cause overfitting.
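
The charts flag rare labels visually, but it can also help to list them by name. The snippet below is a minimal sketch of that idea, not part of the original walkthrough: list_rare_labels is a hypothetical helper, and the toy DataFrame is made-up data standing in for one categorical column. It relies on value_counts(normalize=True), which returns the fraction of rows per category.

```python
import pandas as pd

# Hypothetical toy data standing in for one categorical column (not the real dataset)
toy = pd.DataFrame({
    "Exterior1st": ["VinylSd"] * 60 + ["HdBoard"] * 30 + ["Stone"] * 6 + ["CBlock"] * 4
})


def list_rare_labels(df, var, threshold=0.05):
    """Return the labels of `var` whose share of rows is below `threshold`."""
    # fraction of rows per category, sorted from most to least frequent
    freq = df[var].value_counts(normalize=True)
    return freq[freq < threshold].index.tolist()


rare = list_rare_labels(toy, "Exterior1st")
print(rare)
```

On the real data we would call the same helper as list_rare_labels(data, col) for each column in cat_cols.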

Why Rare Labels Matter: Relationship with the Target

Knowing a category is rare is one thing; understanding its relationship with the target variable SalePrice is another. We need to know whether the rare label carries a genuine signal or just noise.

Helper Functions

The function below computes two things for every category in a given column: the percentage of houses that belong to it, and the mean SalePrice for those houses.

PYTHON

def calculate_mean_target_per_category(df, var):

    # total number of houses
    total_houses = len(df)

    # percentage of houses per category
    temp_df = pd.Series(df[var].value_counts() / total_houses).reset_index()
    temp_df.columns = [var, 'perc_houses']

    # add the mean SalePrice
    temp_df = temp_df.merge(df.groupby([var])['SalePrice'].mean().reset_index(),
                            on=var,
                            how='left')

    return temp_df

The second function draws a dual-axis chart: grey bars show the percentage of houses per category (left axis), while a green line shows the mean SalePrice per category (right axis):

PYTHON

def plot_categories(df, var):

    fig, ax = plt.subplots(figsize=(8, 4))
    plt.xticks(df.index, df[var], rotation=90)

    ax2 = ax.twinx()
    ax.bar(df.index, df["perc_houses"], color='lightgrey')
    ax2.plot(df.index, df["SalePrice"], color='green')
    ax.axhline(y=0.05, color='red')
    ax.set_ylabel('percentage of houses per category')
    ax.set_xlabel(var)
    ax2.set_ylabel('Average Sale Price per category')
    plt.show()

Neighborhood vs. SalePrice

Apply the helper functions to the Neighborhood column first:

PYTHON

temp_df = calculate_mean_target_per_category(data, 'Neighborhood')
temp_df

OUTPUT

	Neighborhood	perc_houses	SalePrice
0	NAmes	0.154110	145847.080000
1	CollgCr	0.102740	197965.773333
2	OldTown	0.077397	128225.300885
3	Edwards	0.068493	128219.700000
4	Somerst	0.058904	225379.837209
5	Gilbert	0.054110	192854.506329
6	NridgHt	0.052740	316270.623377
7	Sawyer	0.050685	136793.135135
8	NWAmes	0.050000	189050.068493
9	SawyerW	0.040411	186555.796610
10	BrkSide	0.039726	124834.051724
11	Crawfor	0.034932	210624.725490
12	Mitchel	0.033562	156270.122449
13	NoRidge	0.028082	335295.317073
14	Timber	0.026027	242247.447368
15	IDOTRR	0.025342	100123.783784
16	ClearCr	0.019178	212565.428571
17	SWISU	0.017123	142591.360000
18	StoneBr	0.017123	310499.000000
19	Blmngtn	0.011644	194870.882353
20	MeadowV	0.011644	98576.470588
21	BrDale	0.010959	104493.750000
22	Veenker	0.007534	238772.727273
23	NPkVill	0.006164	142694.444444
24	Blueste	0.001370	137500.000000

About 15 % of houses are in NAmes with a mean SalePrice of $145 847, giving the model plenty of data. In contrast, StoneBr has a mean price above $310 000 but appears in fewer than 2 % of records: a strong-looking signal supported by very few observations.

Now plot the full picture for Neighborhood:

PYTHON

plot_categories(temp_df, 'Neighborhood')

The dual-axis chart below shows bars for category frequency (grey, left axis) and a green line for mean SalePrice (right axis). The red line is the 5 % threshold:

Bar and line plot illustrating the relationship between house percentage per neighborhood category and average SalePrice

NridgHt houses sell at a high average price, while Sawyer houses tend to be cheaper. StoneBr shows an average SalePrice above $300 000, yet fewer than 2 % of the dataset's houses are located there. Because we have only a handful of StoneBr observations to learn from, the model may over- or under-estimate the neighbourhood's true effect on price.

Now plot the remaining two categorical variables using the same helper functions:

PYTHON

for col in cat_cols:

    # Neighborhood already plotted above
    if col != 'Neighborhood':

        # apply helper functions
        temp_df = calculate_mean_target_per_category(data, col)
        plot_categories(temp_df, col)

The chart for Exterior1st shows a wide swing in mean SalePrice across rare categories (those whose bars fall below the red line), suggesting noisy estimates rather than reliable signals:

Relationship between Exterior1st categories and average SalePrice

The chart for Exterior2nd makes the problem even clearer. Most of its categories fall below the 5 % threshold, and the green price line oscillates sharply among them:

Relationship between Exterior2nd categories and average SalePrice

For Exterior2nd, most categories appear in fewer than 5 % of houses, and the mean SalePrice swings up and down across those rare categories. This is a sign of noisy estimates: so few observations back each label that we cannot say whether a high or low mean price reflects a true pattern or random variation. These rare labels might carry real predictive power, or they might only add noise.

Note: Adding standard deviation bars or an interquartile range to this chart would show exactly how variable SalePrice is within each category, giving us an even clearer picture of estimation uncertainty.
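
As a rough sketch of what that note suggests, the snippet below computes the count, mean, standard deviation, and standard error of SalePrice per category with groupby(...).agg(...). The helper name summarise_target_per_category and the toy prices are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data: one well-populated category and one sparse one
toy = pd.DataFrame({
    "Neighborhood": ["NAmes"] * 4 + ["StoneBr"] * 2,
    "SalePrice": [140000, 150000, 145000, 149000, 250000, 370000],
})


def summarise_target_per_category(df, var, target="SalePrice"):
    """Count, mean and spread of the target for each category of `var`."""
    summary = df.groupby(var)[target].agg(["count", "mean", "std"])
    # standard error of the mean: it grows as the number of observations shrinks
    summary["std_err"] = summary["std"] / summary["count"] ** 0.5
    return summary.sort_values("count", ascending=False)


summary = summarise_target_per_category(toy, "Neighborhood")
print(summary)
```

A category with only one row gets a std of NaN, which is itself a useful reminder that no spread can be estimated from a single observation.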

Grouping Rare Labels

The Grouping Strategy

One standard approach to handling rare labels is to merge all infrequent categories into a single umbrella category, typically labelled 'rare' or 'Other'. This consolidation lets the model learn the collective influence of infrequent categories on the target, rather than trying to learn an unreliable estimate from each tiny group individually.

The function below replaces every category that appears in fewer than 5 % of rows with the string 'rare':

PYTHON

def group_rare_labels(df, var):

    total_houses = len(df)

    # calculate % of houses per category
    temp_df = pd.Series(df[var].value_counts() / total_houses)

    # create mapping dict: replace categories below 5% threshold with 'rare'

    grouping_dict = {
        k: ('rare' if k not in temp_df[temp_df >= 0.05].index else k)
        for k in temp_df.index
    }

    # replace rare categories
    tmp = df[var].map(grouping_dict)

    return tmp

Applying the Grouping to Neighborhood

Apply group_rare_labels to Neighborhood and inspect the result:

PYTHON

data['Neighborhood_grouped'] = group_rare_labels(data, 'Neighborhood')

data[['Neighborhood', 'Neighborhood_grouped']].head(10)

OUTPUT

	Neighborhood	Neighborhood_grouped
0	CollgCr	CollgCr
1	Veenker	rare
2	CollgCr	CollgCr
3	Crawfor	rare
4	NoRidge	rare
5	Mitchel	rare
6	Somerst	Somerst
7	NWAmes	NWAmes
8	OldTown	OldTown
9	BrkSide	rare

Neighbourhoods like Veenker, Crawfor, and NoRidge, which individually appeared in fewer than 5 % of rows, are now mapped to 'rare'.

Plot the grouped variable to see how the 'rare' bucket compares with the remaining common categories:

PYTHON

temp_df = calculate_mean_target_per_category(data, 'Neighborhood_grouped')
plot_categories(temp_df, 'Neighborhood_grouped')

The chart below shows the consolidated Neighborhood_grouped variable. The 'rare' bar on the left now represents all infrequent neighbourhoods as a single group, and its associated mean SalePrice reflects their combined average effect:

Grouped Neighborhood categories showing consolidated rare labels mapped to a single rare category versus average SalePrice

The 'rare' category now captures the overall influence of all infrequent neighbourhoods on SalePrice. Compare this with the original ungrouped chart:

PYTHON

# plot original Neighborhood for comparison
temp_df = calculate_mean_target_per_category(data, 'Neighborhood')
plot_categories(temp_df, 'Neighborhood')

The original Neighborhood chart for comparison. Notice how many bars fall below the red 5 % line, each with its own noisy price estimate:

Original Neighborhood category distribution and average SalePrice for comparison

Only 9 neighbourhoods are common enough to stand on their own. All the others are now folded into 'rare', which provides a single, more stable estimate of their combined influence.
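
How many categories survive depends directly on the threshold we pick. The sketch below, which uses a hypothetical helper count_labels_kept and made-up toy data, counts how many labels would stay as their own category at a few candidate thresholds. It applies the same >= comparison as group_rare_labels.

```python
import pandas as pd

# Hypothetical toy column with shares of 60 %, 25 %, 8 %, 4 % and 3 %
toy = pd.Series(["A"] * 60 + ["B"] * 25 + ["C"] * 8 + ["D"] * 4 + ["E"] * 3)


def count_labels_kept(series, thresholds=(0.01, 0.05, 0.10)):
    """For each threshold, count the labels frequent enough to avoid the 'rare' bucket."""
    freq = series.value_counts(normalize=True)
    return {t: int((freq >= t).sum()) for t in thresholds}


kept = count_labels_kept(toy)
print(kept)
```

Running this on data['Neighborhood'] before settling on a threshold shows how aggressively each choice would collapse the variable.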

Applying the Grouping to the Remaining Variables

Apply the same grouping to Exterior1st and Exterior2nd and visualise the result:

PYTHON

for col in cat_cols[1:]:

    # apply helper functions to remaining variables
    data[col+'_grouped'] = group_rare_labels(data, col)
    temp_df = calculate_mean_target_per_category(data, col+'_grouped')
    plot_categories(temp_df, col+'_grouped')

For Exterior1st_grouped, all rare exterior types are consolidated into 'rare', and the remaining common categories show a more stable price relationship:

Grouped Exterior1st categories with rare labels consolidated versus average SalePrice

For Exterior2nd_grouped, the same consolidation applies. The erratic price swings from the ungrouped version are replaced by a cleaner, more interpretable chart:

Grouped Exterior2nd categories with rare labels consolidated versus average SalePrice

Notice an interesting pattern: in both Exterior1st_grouped and Exterior2nd_grouped, houses with rare exterior types tend to have a higher average SalePrice than houses with common exterior types (except for VinylSd). The rare categories seem to share something in common, perhaps signalling premium or non-standard materials. Grouping them together lets the model learn this shared signal from a larger combined pool of observations.

Note: Ideally we would also plot the standard deviation or interquartile range of SalePrice within each group to quantify how much the price varies inside that bucket.

Rare Labels and Train/Test Splits

Why the Split Creates a Problem

When we split a dataset into training and test sets, rare labels often end up in only one of the two splits. A label that lands only in the training set wastes model capacity: the model learns a special case it will never encounter at inference time. A label that lands only in the test set is even worse: the model has never seen it before and cannot make a meaningful prediction.

Split the data into 70 % training and 30 % test, keeping only the three categorical columns as features and SalePrice as the target:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(data[cat_cols],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=2910)

X_train.shape, X_test.shape

OUTPUT

((1022, 3), (438, 3))

The training set contains 1 022 rows and the test set contains 438 rows.
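
To get a feel for how likely it is that a rare label falls entirely on one side, we can compute the probability exactly. The sketch below assumes a purely random, unstratified split in which every set of 438 test rows is equally likely; the function name prob_label_in_one_split is hypothetical. The default sizes are the 1 460 total rows and 1 022 training rows from this post.

```python
from math import comb


def prob_label_in_one_split(k, n_total=1460, n_train=1022):
    """Chance that all k rows of a label land only in train, or only in test.

    Assumes a uniformly random split with no stratification.
    """
    n_test = n_total - n_train
    # all k rows in train: the test rows are drawn from the other n_total - k rows
    p_train_only = comb(n_total - k, n_test) / comb(n_total, n_test)
    # all k rows in test: the train rows are drawn from the other n_total - k rows
    p_test_only = comb(n_total - k, n_train) / comb(n_total, n_train)
    return p_train_only, p_test_only


for k in (1, 2, 5, 10):
    p_train_only, p_test_only = prob_label_in_one_split(k)
    print(f"k={k:>2}  train only: {p_train_only:.3f}  test only: {p_test_only:.3f}")
```

A label with a single row is always missing from one side (0.7 + 0.3 = 1). As k grows, both probabilities shrink roughly like 0.7^k and 0.3^k, which is why the problem is concentrated in the rarest labels.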

Categories Present Only in Training

Find the Exterior1st categories that appear in the training set but not in the test set:

PYTHON

unique_to_train_set = [
    x for x in X_train['Exterior1st'].unique() if x not in X_test['Exterior1st'].unique()
]

print(unique_to_train_set)

OUTPUT

['Stone', 'BrkComm', 'ImStucc', 'CBlock']

Four categories are present in the training set but absent from the test set.

Categories Present Only in Testing

Now find the labels that appear in the test set but are missing from training:

PYTHON

unique_to_test_set = [
    x for x in X_test['Exterior1st'].unique() if x not in X_train['Exterior1st'].unique()
]

print(unique_to_test_set)

OUTPUT

['AsphShn']

In this case, one rare label appears only in the test set. The model trained on X_train has never seen 'AsphShn' and will not know how to encode or score it correctly at inference time. This is a direct consequence of having rare labels: with so few observations, a single random split is enough to remove a category from one side entirely.
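
Both checks can be done in one pass with set differences. The helper below, compare_labels, is a hypothetical convenience function and the two toy Series are made up for illustration. On the real data we would pass X_train['Exterior1st'] and X_test['Exterior1st'].

```python
import pandas as pd


def compare_labels(train_col, test_col):
    """Return (labels only in train, labels only in test) as sorted lists."""
    train_labels = set(train_col.dropna().unique())
    test_labels = set(test_col.dropna().unique())
    return sorted(train_labels - test_labels), sorted(test_labels - train_labels)


# Hypothetical toy splits
toy_train = pd.Series(["VinylSd", "HdBoard", "Stone", "VinylSd"])
toy_test = pd.Series(["VinylSd", "AsphShn", "HdBoard"])

train_only, test_only = compare_labels(toy_train, toy_test)
print("only in train:", train_only)
print("only in test: ", test_only)
```

Sets make each membership test constant-time, whereas the list comprehensions above call unique() again for every element. For a handful of categories the difference is negligible, so this is mainly a readability choice.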

Conclusion

In this blog, we explored rare labels in the Ames housing dataset. We measured category frequencies for Neighborhood, Exterior1st, and Exterior2nd, and visualised each category's relationship with SalePrice. We grouped all categories below the 5 % threshold into a single 'rare' bucket, and confirmed that rare labels split unevenly between training and test sets.

Key takeaways:

A rare label is a category that appears in fewer than 5 % of rows; the threshold is a rule of thumb and can be adjusted for our dataset.
Rare labels produce unreliable mean-target estimates because so few observations back them up. The model ends up learning from noise rather than signal.
Grouping rare categories into 'rare' lets the model learn their collective influence on the target from a larger, more stable pool of observations.
Without grouping, rare labels often land entirely in one split, causing the model to either overfit (train-only label) or fail at inference (test-only label).
Always compute rare-label thresholds on the training set only to avoid data leakage; never use test-set frequencies to decide which labels to group.
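
The last takeaway can be sketched as a two-step routine: learn the frequent labels from the training column, then apply that fixed list to any other column. The function names fit_frequent_labels and apply_rare_grouping are hypothetical, and the toy Series are made up. This is one possible arrangement under those assumptions, not the grouping function used earlier in this post, which computed frequencies on the full dataset.

```python
import pandas as pd


def fit_frequent_labels(train_col, threshold=0.05):
    """Learn which labels are frequent enough, using the training column only."""
    freq = train_col.value_counts(normalize=True)
    return set(freq[freq >= threshold].index)


def apply_rare_grouping(col, frequent_labels, rare_label="rare"):
    """Keep frequent labels as they are; replace everything else with `rare_label`."""
    return col.where(col.isin(frequent_labels), rare_label)


# Hypothetical toy splits: 'Stone' is rare in train, 'AsphShn' is unseen in train
toy_train = pd.Series(["VinylSd"] * 30 + ["HdBoard"] * 9 + ["Stone"] * 1)
toy_test = pd.Series(["VinylSd", "HdBoard", "AsphShn", "Stone"])

frequent = fit_frequent_labels(toy_train)
train_grouped = apply_rare_grouping(toy_train, frequent)
test_grouped = apply_rare_grouping(toy_test, frequent)
print(test_grouped.tolist())
```

Because anything outside the learned list becomes 'rare', a label that never appeared in training is handled the same way as one that was merely infrequent. Note that missing values would also be mapped to 'rare' in this version, so we may want to impute or mask them first.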

Next steps:

Continue the series with Linear Model Assumptions to learn how categorical encoding choices affect the assumptions underlying linear models.
Read Outlier Detection and Treatment to apply similar frequency-based thinking to numeric variables with extreme values.
Explore Variable Magnitude and Scaling to understand why the scale of our numeric features can matter as much as the encoding of our categorical ones.
