
Feature Engineering: Rare Labels

Learn what rare labels are in categorical variables, why they cause overfitting and train/test mismatches, and how to group them safely in Python.

May 19, 2026 at 5:15 PM

Topics You Will Master

What a rare label is and why it hurts model performance
How to measure category frequency and spot rare categories with a 5 % threshold
How to visualise the relationship between category frequency and a numeric target
How to group rare categories into a single 'rare' bucket using pandas
Why rare labels cause uneven category distribution across train and test splits

Best For

Python developers and data scientists who are building feature engineering pipelines and want to handle categorical variables robustly before training a model.

Expected Outcome

A working pandas pipeline that identifies rare labels in any categorical column and consolidates them into a single 'rare' category, plus a clear understanding of why this step matters for model reliability.

When you work with categorical features in real-world datasets, you will almost always encounter rare labels — category values that appear in only a small fraction of your rows. A label that shows up in fewer than 5 % of all records gives your model very little data to learn from. That limited exposure creates problems: the model may overfit to the noise around that label, or it may never see the label at all in one of the train/test splits.

In this tutorial you will work with three categorical columns from the Ames housing dataset — Neighborhood, Exterior1st, and Exterior2nd — and the numeric target SalePrice. You will measure category frequencies, visualise their relationship with the target, group rare categories into a unified bucket, and observe how rare labels split unevenly between training and test sets.

Prerequisites: Python 3.x, pandas, NumPy, Matplotlib, scikit-learn.

Setting Up: Imports and Data

Import the libraries you need for this tutorial. pandas handles tabular data, NumPy supports array operations, Matplotlib draws the charts, and train_test_split from scikit-learn divides the dataset into training and test portions.

PYTHON
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Load only the four columns needed for this demo.

The variable definitions are:

  • Neighborhood — the physical location within Ames city limits
  • Exterior1st — the primary exterior covering material
  • Exterior2nd — the secondary exterior covering material (when more than one material is used)
  • SalePrice — the final sale price of the house (the prediction target)

PYTHON
use_cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/houseprice.csv', usecols=use_cols)
data.head()
OUTPUT
  Neighborhood Exterior1st Exterior2nd  SalePrice
0      CollgCr     VinylSd     VinylSd     208500
1      Veenker     MetalSd     MetalSd     181500
2      CollgCr     VinylSd     VinylSd     223500
3      Crawfor     Wd Sdng     Wd Shng     140000
4      NoRidge     VinylSd     VinylSd     250000

Identifying Rare Labels

Counting Unique Categories

Before you can identify rare labels, you need to know how many unique categories each variable contains — a property called cardinality. High cardinality means more distinct values; many of them may be rare.

Count the distinct categories in each variable using nunique():

PYTHON
# these are the loaded categorical variables
cat_cols = ['Neighborhood', 'Exterior1st', 'Exterior2nd']

for col in cat_cols:
    print('variable: ', col, ' number of labels: ', data[col].nunique())

print('total houses: ', len(data))
OUTPUT
variable:  Neighborhood  number of labels:  25
variable:  Exterior1st  number of labels:  15
variable:  Exterior2nd  number of labels:  16
total houses:  1460

Neighborhood has 25 distinct values across 1 460 houses, meaning many individual categories will represent only a tiny fraction of the data.

Visualising Category Frequency

To flag which categories are rare, plot the percentage of houses that belong to each category and draw a horizontal threshold line at 5 %. Any bar below that line is a rare label.

PYTHON
total_houses = len(data)

# for each categorical variable
for col in cat_cols:

    # count the number of houses per category and divide by total houses
    # aka percentage of houses per category

    temp_df = pd.Series(data[col].value_counts() / total_houses)

    # make plot with the above percentages
    fig = temp_df.sort_values(ascending=False).plot.bar()
    fig.set_xlabel(col)

    # add a line at 5 % to flag the threshold for rare categories
    fig.axhline(y=0.05, color='red')
    fig.set_ylabel('Percentage of houses')
    plt.show()

The bar chart below shows category frequencies for Neighborhood. The red horizontal line marks the 5 % threshold — bars below it are rare labels:

Bar plot showing the percentage of houses per neighborhood category with a red threshold line at 5%

The same frequency chart for Exterior1st shows that only five categories (VinylSd, HdBoard, MetalSd, Wd Sdng, and Plywood) appear in more than 5 % of houses:

Bar plot showing the percentage of houses per Exterior1st category with a red threshold line at 5%

For Exterior2nd, even more categories fall below the 5 % threshold:

Bar plot showing the percentage of houses per Exterior2nd category with a red threshold line at 5%

Across all three variables, a handful of dominant categories account for the majority of observations while many categories sit well below 5 %. These infrequent categories are the rare labels that can cause overfitting.
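The same check can be done numerically instead of by eye. The following is a minimal sketch on a small made-up column; the `toy` DataFrame and the `find_rare_labels` helper are illustrative names, not part of the tutorial's code, and the 5 % cut-off is the same rule of thumb used above.

```python
import pandas as pd

def find_rare_labels(df, var, threshold=0.05):
    """Return the categories of df[var] whose relative frequency is below threshold."""
    # value_counts(normalize=True) returns fractions that sum to 1
    freq = df[var].value_counts(normalize=True)
    return sorted(freq[freq < threshold].index)

# hypothetical toy data: 100 rows, two common labels and two rare ones
toy = pd.DataFrame({'material': ['A'] * 60 + ['B'] * 36 + ['C'] * 3 + ['D'] * 1})

rare = find_rare_labels(toy, 'material')
print(rare)  # ['C', 'D']
```

On the housing data, the equivalent call would be `find_rare_labels(data, 'Neighborhood')`, assuming `data` is loaded as shown earlier.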

Why Rare Labels Matter: Relationship with the Target

Knowing a category is rare is one thing; understanding its relationship with the target variable SalePrice is another. You need to know whether the rare label carries a genuine signal or just noise.

Helper Functions

The function below computes two things for every category in a given column: the percentage of houses that belong to it, and the mean SalePrice for those houses.

PYTHON
def calculate_mean_target_per_category(df, var):

    # total number of houses
    total_houses = len(df)

    # percentage of houses per category
    temp_df = pd.Series(df[var].value_counts() / total_houses).reset_index()
    temp_df.columns = [var, 'perc_houses']

    # add the mean SalePrice
    temp_df = temp_df.merge(df.groupby([var])['SalePrice'].mean().reset_index(),
                            on=var,
                            how='left')

    return temp_df

The second function draws a dual-axis chart: grey bars show the percentage of houses per category (left axis), while a green line shows the mean SalePrice per category (right axis):

PYTHON
def plot_categories(df, var):

    fig, ax = plt.subplots(figsize=(8, 4))
    plt.xticks(df.index, df[var], rotation=90)

    ax2 = ax.twinx()
    ax.bar(df.index, df["perc_houses"], color='lightgrey')
    ax2.plot(df.index, df["SalePrice"], color='green', label='mean SalePrice')
    ax.axhline(y=0.05, color='red')
    ax.set_ylabel('percentage of houses per category')
    ax.set_xlabel(var)
    ax2.set_ylabel('Average Sale Price per category')
    plt.show()

Neighborhood vs. SalePrice

Apply the helper functions to the Neighborhood column first:

PYTHON
temp_df = calculate_mean_target_per_category(data, 'Neighborhood')
temp_df
OUTPUT
   Neighborhood  perc_houses      SalePrice
0         NAmes     0.154110  145847.080000
1       CollgCr     0.102740  197965.773333
2       OldTown     0.077397  128225.300885
3       Edwards     0.068493  128219.700000
4       Somerst     0.058904  225379.837209
5       Gilbert     0.054110  192854.506329
6       NridgHt     0.052740  316270.623377
7        Sawyer     0.050685  136793.135135
8        NWAmes     0.050000  189050.068493
9       SawyerW     0.040411  186555.796610
10      BrkSide     0.039726  124834.051724
11      Crawfor     0.034932  210624.725490
12      Mitchel     0.033562  156270.122449
13      NoRidge     0.028082  335295.317073
14       Timber     0.026027  242247.447368
15       IDOTRR     0.025342  100123.783784
16      ClearCr     0.019178  212565.428571
17        SWISU     0.017123  142591.360000
18      StoneBr     0.017123  310499.000000
19      Blmngtn     0.011644  194870.882353
20      MeadowV     0.011644   98576.470588
21       BrDale     0.010959  104493.750000
22      Veenker     0.007534  238772.727273
23      NPkVill     0.006164  142694.444444
24      Blueste     0.001370  137500.000000

About 15 % of houses are in NAmes, with a mean SalePrice of roughly 146 000. StoneBr, by contrast, has a mean SalePrice of about 310 000 but appears in fewer than 2 % of records: a strong-looking signal supported by very few observations.

Now plot the full picture for Neighborhood:

PYTHON
plot_categories(temp_df, 'Neighborhood')

The dual-axis chart below shows bars for category frequency (grey, left axis) and a green line for mean SalePrice (right axis). The red line is the 5 % threshold:

Bar and line plot illustrating the relationship between house percentage per neighborhood category and average SalePrice

NridgHt houses sell at a high average price, while Sawyer houses tend to be cheaper. StoneBr shows an average SalePrice above $300 000, yet fewer than 2 % of the dataset's houses are located there. Because you have only a handful of StoneBr observations to learn from, the model may over- or under-estimate the neighbourhood's true effect on price.
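To make "a handful" concrete, the fractions in the table can be converted back into row counts. This is plain arithmetic on three values copied from the Neighborhood output above; the dictionary names are illustrative only.

```python
total_houses = 1460  # from the cardinality output earlier in the tutorial

# fractions copied from the Neighborhood table above
perc_houses = {'NAmes': 0.154110, 'StoneBr': 0.017123, 'Blueste': 0.001370}

# fraction * total rows, rounded to the nearest whole house
n_houses = {label: round(frac * total_houses) for label, frac in perc_houses.items()}
print(n_houses)  # {'NAmes': 225, 'StoneBr': 25, 'Blueste': 2}
```

A mean computed from 225 houses is far more trustworthy than one computed from 25, and the Blueste mean rests on just 2 sales.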

Now plot the remaining two categorical variables using the same helper functions:

PYTHON
for col in cat_cols:

    # we plotted this variable already
    if col != 'Neighborhood':

        # using the functions we created
        temp_df = calculate_mean_target_per_category(data, col)
        plot_categories(temp_df, col)

The chart for Exterior1st shows a wide swing in mean SalePrice across rare categories (the bars below the red line, towards the right of the chart), suggesting noisy estimates rather than reliable signals:

Relationship between Exterior1st categories and average SalePrice

The chart for Exterior2nd makes the problem even clearer: most of its categories fall below the 5 % threshold, and the green price line oscillates sharply among them:

Relationship between Exterior2nd categories and average SalePrice

For Exterior2nd, most categories appear in fewer than 5 % of houses, and the mean SalePrice swings up and down erratically across those rare categories. This volatility is a sign of noisy estimates: because so few observations back each label, you cannot confidently say whether a high or low mean price reflects a true pattern or just random variation. These rare labels might carry genuine predictive power, or they might simply be introducing noise — and with only a handful of samples you cannot tell which.

Note: Adding standard deviation bars or an interquartile range to this chart would show exactly how variable SalePrice is within each category, giving you an even clearer picture of estimation uncertainty.
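One possible way to get those spread measures is sketched below. It is not part of the original tutorial: `summarise_target_per_category` is a hypothetical helper, and the `toy` rows are invented purely so the block runs on its own.

```python
import pandas as pd

def summarise_target_per_category(df, var, target='SalePrice'):
    """Per-category count, mean, standard deviation and IQR of the target."""
    grouped = df.groupby(var)[target]
    summary = grouped.agg(['count', 'mean', 'std'])
    # interquartile range: 75th percentile minus 25th percentile
    summary['iqr'] = grouped.quantile(0.75) - grouped.quantile(0.25)
    return summary.sort_values('count', ascending=False)

# invented rows for illustration only (not real Ames data)
toy = pd.DataFrame({
    'Exterior2nd': ['VinylSd'] * 4 + ['Stone'] * 2 + ['CBlock'],
    'SalePrice': [200000, 210000, 190000, 220000, 150000, 350000, 105000],
})

summary = summarise_target_per_category(toy, 'Exterior2nd')
print(summary)
```

Note that a category with a single observation has an undefined standard deviation (pandas reports NaN), which is itself a useful warning sign for a rare label.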

Grouping Rare Labels

The Grouping Strategy

One standard approach to handling rare labels is to merge all infrequent categories into a single umbrella category, typically labelled 'rare' or 'Other'. This consolidation lets the model learn the collective influence of infrequent categories on the target, rather than trying to learn an unreliable estimate from each tiny group individually.

The function below replaces every category that appears in fewer than 5 % of rows with the string 'rare':

PYTHON
def group_rare_labels(df, var):

    total_houses = len(df)

    # first we will calculate the % of houses for each category
    temp_df = pd.Series(df[var].value_counts() / total_houses)

    # then we will create a dictionary to replace the rare labels with the string 'rare' if they are present in less than 5% of houses

    grouping_dict = {
        k: ('rare' if k not in temp_df[temp_df >= 0.05].index else k)
        for k in temp_df.index
    }

    # now we will replace the rare categories
    tmp = df[var].map(grouping_dict)

    return tmp
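As an alternative to building a replacement dictionary, the same grouping can be sketched with `Series.where` and `Series.isin`. This variant is an assumption-labelled illustration, not the tutorial's own function: `group_rare_labels_where`, its `threshold` and `rare_label` parameters, and the `toy` data are all hypothetical names.

```python
import pandas as pd

def group_rare_labels_where(df, var, threshold=0.05, rare_label='rare'):
    """Replace categories below the frequency threshold with rare_label."""
    freq = df[var].value_counts(normalize=True)
    frequent = freq[freq >= threshold].index
    # keep values that are frequent; replace everything else with rare_label
    return df[var].where(df[var].isin(frequent), rare_label)

# hypothetical toy data: 100 rows, two common labels and two rare ones
toy = pd.DataFrame({'material': ['A'] * 60 + ['B'] * 36 + ['C'] * 3 + ['D'] * 1})

grouped = group_rare_labels_where(toy, 'material')
print(grouped.value_counts().to_dict())  # {'A': 60, 'B': 36, 'rare': 4}
```

One behavioural difference worth checking on your own data: missing values would also be replaced by 'rare' here, whereas the dictionary-based version leaves them as NaN.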

Applying the Grouping to Neighborhood

Apply group_rare_labels to Neighborhood and inspect the result:

PYTHON
data['Neighborhood_grouped'] = group_rare_labels(data, 'Neighborhood')

data[['Neighborhood', 'Neighborhood_grouped']].head(10)
OUTPUT
  Neighborhood Neighborhood_grouped
0      CollgCr              CollgCr
1      Veenker                 rare
2      CollgCr              CollgCr
3      Crawfor                 rare
4      NoRidge                 rare
5      Mitchel                 rare
6      Somerst              Somerst
7       NWAmes               NWAmes
8      OldTown              OldTown
9      BrkSide                 rare

Neighbourhoods like Veenker, Crawfor, and NoRidge — which individually appeared in fewer than 5 % of rows — are now mapped to 'rare'.

Plot the grouped variable to see how the 'rare' bucket compares with the remaining common categories:

PYTHON
temp_df = calculate_mean_target_per_category(data, 'Neighborhood_grouped')
plot_categories(temp_df, 'Neighborhood_grouped')

The chart below shows the consolidated Neighborhood_grouped variable. The 'rare' bar on the left now represents all infrequent neighbourhoods as a single group, and its associated mean SalePrice reflects their combined average effect:

Grouped Neighborhood categories showing consolidated rare labels mapped to a single rare category versus average SalePrice

The 'rare' category now captures the overall influence of all infrequent neighbourhoods on SalePrice. Compare this with the original ungrouped chart:

PYTHON
# let's plot the original Neighborhood for comparison
temp_df = calculate_mean_target_per_category(data, 'Neighborhood')
plot_categories(temp_df, 'Neighborhood')

The original Neighborhood chart for comparison — notice how many bars fall below the red 5 % line, each with its own noisy price estimate:

Original Neighborhood category distribution and average SalePrice for comparison

Only 9 neighbourhoods are common enough to stand on their own. All the others are now folded into 'rare', which provides a single, more stable estimate of their combined influence.

Applying the Grouping to the Remaining Variables

Apply the same grouping to Exterior1st and Exterior2nd and visualise the result:

PYTHON
for col in cat_cols[1:]:

    # reuse the functions defined above
    data[col+'_grouped'] = group_rare_labels(data, col)
    temp_df = calculate_mean_target_per_category(data, col+'_grouped')
    plot_categories(temp_df, col+'_grouped')

For Exterior1st_grouped, all rare exterior types are consolidated into 'rare', and the remaining common categories show a more stable price relationship:

Grouped Exterior1st categories with rare labels consolidated versus average SalePrice

For Exterior2nd_grouped, the same consolidation applies — the erratic price swings from the ungrouped version are replaced by a cleaner, more interpretable chart:

Grouped Exterior2nd categories with rare labels consolidated versus average SalePrice

Notice an interesting pattern: in both Exterior1st_grouped and Exterior2nd_grouped, houses with rare exterior types tend to have a higher average SalePrice than houses with common exterior types (except for VinylSd). The rare categories seem to share something in common — perhaps they signal premium or non-standard materials. Grouping them together lets the model learn this shared signal from a larger combined pool of observations.

Note: Ideally you would also plot the standard deviation or interquartile range of SalePrice within each group to quantify how much the price varies inside that bucket.

Rare Labels and Train/Test Splits

Why the Split Creates a Problem

When you split a dataset into training and test sets, rare labels often end up in only one of the two splits. A label that lands only in the training set wastes model capacity — the model learns a special case it will never encounter at inference time. A label that lands only in the test set is even worse: the model has never seen it before and cannot make a meaningful prediction.

Split the data into 70 % training and 30 % test, keeping only the three categorical columns as features and SalePrice as the target:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(data[cat_cols],
                                                    data['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=2910)

X_train.shape, X_test.shape
OUTPUT
((1022, 3), (438, 3))

The training set contains 1 022 rows and the test set contains 438 rows.

Categories Present Only in Training

Find the Exterior1st categories that appear in the training set but not in the test set:

PYTHON
unique_to_train_set = [
    x for x in X_train['Exterior1st'].unique() if x not in X_test['Exterior1st'].unique()
]

print(unique_to_train_set)
OUTPUT
['Stone', 'BrkComm', 'ImStucc', 'CBlock']

Four categories are present in the training set but absent from the test set.
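The same comparison can also be written as a set difference, which works in either direction with one helper. This is a small sketch on made-up series; `labels_only_in`, `train_col`, and `test_col` are illustrative names standing in for the real `X_train` and `X_test` columns.

```python
import pandas as pd

def labels_only_in(left, right):
    """Return labels present in `left` but absent from `right`, sorted."""
    return sorted(set(left.dropna().unique()) - set(right.dropna().unique()))

# made-up columns standing in for X_train['Exterior1st'] and X_test['Exterior1st']
train_col = pd.Series(['VinylSd', 'Stone', 'HdBoard', 'VinylSd', 'CBlock'])
test_col = pd.Series(['VinylSd', 'HdBoard', 'AsphShn'])

print(labels_only_in(train_col, test_col))  # ['CBlock', 'Stone']
print(labels_only_in(test_col, train_col))  # ['AsphShn']
```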

Categories Present Only in Testing

Now find the labels that appear in the test set but are missing from training:

PYTHON
unique_to_test_set = [
    x for x in X_test['Exterior1st'].unique() if x not in X_train['Exterior1st'].unique()
]

print(unique_to_test_set)
OUTPUT
['AsphShn']

In this case, there is 1 rare value present in the test set only. The model trained on X_train has never seen 'AsphShn' and will not know how to encode or score it correctly at inference time. This is a direct consequence of having rare labels: with so few observations, a single random split is enough to remove a category from one side entirely.
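One way to guard against unseen labels is to learn the list of frequent categories from the training set only and then apply it to both splits, so that anything not on the list (including labels never seen in training) falls into 'rare'. The sketch below assumes that approach; `fit_frequent_labels` and `apply_rare_grouping` are hypothetical helpers, and the two series are invented for illustration.

```python
import pandas as pd

def fit_frequent_labels(train_col, threshold=0.05):
    """Learn which labels are frequent, using training data only."""
    freq = train_col.value_counts(normalize=True)
    return set(freq[freq >= threshold].index)

def apply_rare_grouping(col, frequent, rare_label='rare'):
    """Map every label not in `frequent` (rare or unseen) to rare_label."""
    return col.where(col.isin(frequent), rare_label)

# invented data: 'Stone' is rare in training, 'AsphShn' never appears in training
train_col = pd.Series(['VinylSd'] * 24 + ['HdBoard'] * 15 + ['Stone'])
test_col = pd.Series(['VinylSd', 'HdBoard', 'AsphShn', 'Stone'])

frequent = fit_frequent_labels(train_col)
print(sorted(frequent))                                  # ['HdBoard', 'VinylSd']
print(apply_rare_grouping(test_col, frequent).tolist())  # ['VinylSd', 'HdBoard', 'rare', 'rare']
```

Because the threshold is computed from `train_col` alone, no information from the test set influences which labels are grouped.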

Conclusion

In this tutorial you explored rare labels in the Ames housing dataset. You measured category frequencies for Neighborhood, Exterior1st, and Exterior2nd, visualised each category's relationship with SalePrice, grouped all categories below the 5 % frequency threshold into a unified 'rare' bucket, and confirmed that rare labels split unevenly between training and test sets.

Key takeaways:

  • A rare label is a category that appears in fewer than 5 % of rows; the threshold is a rule of thumb and can be adjusted for your dataset.
  • Rare labels produce unreliable mean-target estimates because so few observations back them up — the model learns from noise rather than signal.
  • Grouping rare categories into 'rare' lets the model learn their collective influence on the target from a larger, more stable pool of observations.
  • Without grouping, rare labels often land entirely in one split, causing the model to either overfit (train-only label) or fail at inference (test-only label).
  • Always compute rare-label thresholds on the training set only to avoid data leakage; never use test-set frequencies to decide which labels to group.
