Feature Selection: Fisher Score & Chi2

When you build a classification model, not every feature in your dataset is useful. Some features are statistically independent of the target variable — they carry no signal — and including them adds noise, slows training, and can hurt accuracy. Feature selection is the process of identifying and keeping only the features that matter.

For categorical features, the Chi-Squared ( $χ^{2}$ ) test is one of the most reliable filter methods. It is a statistical test that measures whether two categorical variables are independent of each other. A low p-value (a measure of statistical significance — closer to zero means stronger evidence against independence) tells you that a feature is likely related to the target class and worth keeping. A high p-value suggests the feature and target are independent, making the feature a good candidate for removal.

Fisher Score is a related supervised method that ranks each feature individually by how well it separates the classes — features that produce tight within-class clusters and wide between-class distances receive a higher score.
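As a rough illustration of the idea, the sketch below computes a Fisher Score per feature with NumPy. It assumes one common formulation (between-class scatter of the class means divided by the pooled within-class variance); the function name `fisher_score` and the toy arrays are hypothetical and are not part of this tutorial's pipeline.

```python
import numpy as np

def fisher_score(X, y):
    """Return one Fisher Score per column of X (higher = better class separation)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        n_k = rows.shape[0]
        # Spread of this class mean around the overall mean, weighted by class size
        between += n_k * (rows.mean(axis=0) - overall_mean) ** 2
        # Population variance inside the class, weighted by class size
        within += n_k * rows.var(axis=0)
    # Guard against division by zero for features that are constant within every class
    return between / np.where(within == 0, np.finfo(float).eps, within)

# Toy data: column 0 separates the two classes, column 1 does not
X_toy = np.array([[1.0, 5.0], [1.2, 1.0], [0.8, 3.0],
                  [3.0, 4.0], [3.2, 2.0], [2.8, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

scores = fisher_score(X_toy, y_toy)
print(scores)
```

The first column should receive a much higher score than the second, because its class means are far apart relative to the spread inside each class.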

In this tutorial, you will apply the Chi-Squared test to the Titanic dataset to rank its categorical features, then check that ranking by training Random Forest classifiers on progressively larger feature subsets.


The Chi-Squared test statistic is defined as:

$$χ^{2} = \sum_{i} \frac{(O_{i} - E_{i})^{2}}{E_{i}}$$

Where:

$χ^{2}$ — the Chi-Squared test statistic; larger values indicate stronger dependence between the feature and the target
$O_{i}$ — the observed frequency count for category $i$ in the data
$E_{i}$ — the expected frequency count for category $i$ under the null hypothesis that the feature and target are independent
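To make the formula concrete, here is a small worked example that evaluates it by hand on a 2x2 contingency table, with the sum running over every cell of the table. The counts are invented for illustration and do not come from the Titanic data.

```python
import numpy as np

# Hypothetical contingency table: rows = feature category, columns = target class
observed = np.array([[30, 10],
                     [20, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts under independence: (row total * column total) / grand total
expected = row_totals * col_totals / grand_total

# Sum (O - E)^2 / E over every cell of the table
chi2_stat = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(round(chi2_stat, 4))
```

Here the expected counts are 20, 20, 30, and 30, so each cell deviates by 10 and the statistic is 100/20 + 100/20 + 100/30 + 100/30 ≈ 16.67.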

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib.

Loading the Required Dataset

Import all the libraries needed for data analysis, modeling, and statistical testing up front:

PYTHON

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

PYTHON

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score

Load the Titanic dataset from Seaborn's example datasets (load_dataset downloads it on first use, so an internet connection is required):

PYTHON

titanic = sns.load_dataset('titanic')
titanic.head()

OUTPUT

	survived	pclass	sex	age	sibsp	parch	fare	embarked	class	who	adult_male	deck	embark_town	alive	alone
0	0	3	male	22.0	1	0	7.2500	S	Third	man	True	NaN	Southampton	no	False
1	1	1	female	38.0	1	0	71.2833	C	First	woman	False	C	Cherbourg	yes	False
2	1	3	female	26.0	0	0	7.9250	S	Third	woman	False	NaN	Southampton	yes	True
3	1	1	female	35.0	1	0	53.1000	S	First	woman	False	C	Southampton	yes	False
4	0	3	male	35.0	0	0	8.0500	S	Third	man	True	NaN	Southampton	no	True

Inspect the dataset for missing values so you know which columns need cleaning before modeling:

PYTHON

titanic.isnull().sum()

OUTPUT

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

The age column is missing 177 values and deck is missing 688 of 891, far too many to impute reliably. age is also continuous, so it falls outside the scope of the Chi-Squared test anyway. Drop both columns, then drop the remaining rows that contain NaN values:

PYTHON

titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
titanic = titanic.dropna()
titanic.isnull().sum()

OUTPUT

survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

The Chi-Square test operates only on categorical (or discrete non-negative integer) features, so filter the dataset down to those columns only:

PYTHON

data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
data.head()

OUTPUT

	pclass	sex	sibsp	parch	embarked	who	alone
0	3	male	1	0	S	man	False
1	1	female	1	0	C	woman	False
2	3	female	0	0	S	woman	True
3	1	female	1	0	S	woman	False
4	3	male	0	0	S	man	True

Confirm there are no missing values in your filtered feature set:

PYTHON

data.isnull().sum()

OUTPUT

pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64

Encoding Categorical Features

Scikit-learn's chi2 function requires non-negative numeric inputs, so string columns must be mapped to numeric codes; the boolean alone flag is converted to 0/1 as well so that every column holds explicit integers. Encode sex as 0 for male and 1 for female:

PYTHON

sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
data.head()

OUTPUT

	pclass	sex	sibsp	parch	embarked	who	alone
0	3	0	1	0	S	man	False
1	1	1	1	0	C	woman	False
2	3	1	0	0	S	woman	True
3	1	1	1	0	S	woman	False
4	3	0	0	0	S	man	True

Apply the same mapping approach to the embarked port codes, the who passenger categories, and the boolean alone flag:

PYTHON

ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)

PYTHON

who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)

PYTHON

alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
data.head()

OUTPUT

	pclass	sex	sibsp	parch	embarked	who	alone
0	3	0	1	0	0	0	0
1	1	1	1	0	1	1	0
2	3	1	0	0	0	1	1
3	1	1	1	0	0	1	0
4	3	0	0	0	0	0	1
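The mapping steps in this section can also be written as a single loop that checks for categories missing from a dictionary, since `Series.map` silently turns unmapped values into NaN. This is a minimal sketch on a small hand-made frame; the `toy` data and the `mappings` dictionary are illustrative names, not the tutorial's actual variables.

```python
import pandas as pd

# Hand-made stand-in for the string and boolean columns of the Titanic frame
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female'],
    'embarked': ['S', 'C', 'Q'],
    'who': ['man', 'woman', 'child'],
    'alone': [False, False, True],
})

# One dictionary per column, using the same codes as the cells above
mappings = {
    'sex': {'male': 0, 'female': 1},
    'embarked': {'S': 0, 'C': 1, 'Q': 2},
    'who': {'man': 0, 'woman': 1, 'child': 2},
    'alone': {False: 0, True: 1},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    encoded[column] = toy[column].map(mapping)
    # map() yields NaN for any value absent from the dictionary, so fail loudly
    if encoded[column].isnull().any():
        raise ValueError(f'unmapped value in column {column!r}')

print(encoded)
```

Failing on unmapped values is a design choice: a typo in a dictionary key would otherwise surface only later, as a NaN error inside chi2.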

Running the Chi-Squared Test

Separate your encoded features from the survival target, then check shapes to confirm alignment:

PYTHON

X = data.copy()
y = titanic['survived']
X.shape, y.shape

OUTPUT

((889, 7), (889,))

Split into training and test sets (80 / 20), then call chi2 on the training data to compute the test statistic and p-value for each feature:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
f_score = chi2(X_train, y_train)
f_score

OUTPUT

(array([ 22.65169202, 152.91534343,   0.52934285,  10.35663782, 16.13255653, 161.42431175,  13.4382363 ]), array([1.94189138e-06, 3.99737147e-35, 4.66883271e-01, 1.29009955e-03, 5.90599986e-05, 5.52664700e-37, 2.46547298e-04]))

chi2 returns a tuple: the first array contains the $χ^{2}$ statistics and the second contains the corresponding p-values. Align the p-values with their feature names and sort them in ascending order so the most significant features appear first:

PYTHON

p_values = pd.Series(f_score[1], index = X_train.columns)
p_values.sort_values(ascending = True, inplace = True)
p_values

OUTPUT

who         5.526647e-37
sex         3.997371e-35
pclass      1.941891e-06
embarked    5.906000e-05
alone       2.465473e-04
parch       1.290100e-03
sibsp       4.668833e-01
dtype: float64

who and sex have p-values near zero, which is extremely strong evidence of dependence on survival. sibsp has a p-value of 0.47, far above the conventional 0.05 threshold: the test finds no evidence that it depends on survival, which makes it the first candidate for removal.
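Turning these p-values into a keep/drop decision can be sketched as a simple threshold filter. The p-values below are copied from the output above; the cutoff `alpha = 0.05` is the conventional significance level and is an assumption you may want to tighten or relax.

```python
import pandas as pd

# p-values copied from the chi2 output in this section
p_values = pd.Series({
    'who': 5.526647e-37,
    'sex': 3.997371e-35,
    'pclass': 1.941891e-06,
    'embarked': 5.906000e-05,
    'alone': 2.465473e-04,
    'parch': 1.290100e-03,
    'sibsp': 4.668833e-01,
})

alpha = 0.05  # assumed significance level, not prescribed by the data

selected = p_values[p_values < alpha].index.tolist()
dropped = p_values[p_values >= alpha].index.tolist()

print('keep:', selected)
print('drop:', dropped)
```

With this cutoff only sibsp is dropped; the other six features pass the filter.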

The bar chart below confirms this visually: at the chart's scale, every feature except sibsp has a p-value indistinguishable from zero, while sibsp stands clearly apart near 0.47:

Bar plot of Chi-Squared test p-values for each Titanic feature, showing sibsp as the only non-significant feature
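The plotting code for this chart is not shown; a chart of this kind could be produced roughly as follows. This is a sketch rather than the original figure code: the p-values are copied from the output above, and the dashed reference line at 0.05 is an added assumption.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts; remove this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd

# p-values copied from the chi2 output, already sorted in ascending order
p_values = pd.Series({
    'who': 5.526647e-37,
    'sex': 3.997371e-35,
    'pclass': 1.941891e-06,
    'embarked': 5.906000e-05,
    'alone': 2.465473e-04,
    'parch': 1.290100e-03,
    'sibsp': 4.668833e-01,
})

fig, ax = plt.subplots(figsize=(8, 4))
p_values.plot.bar(ax=ax)
# Reference line at the conventional 0.05 significance level (an assumption)
ax.axhline(0.05, linestyle='--', color='grey')
ax.set_xlabel('feature')
ax.set_ylabel('p-value')
ax.set_title('Chi-Squared test p-values per feature')
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call `fig.savefig(...)` with a path of your choice.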

Evaluating Feature Subsets with a Random Forest Classifier

To confirm that the statistical rankings translate into real accuracy gains, you will train a Random Forest classifier — an ensemble of decision trees that votes on the final prediction — on progressively larger subsets of features, ordered by their p-values from lowest to highest.

Define a helper function that fits a 100-tree Random Forest and prints the test accuracy:

PYTHON

def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
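If you would rather collect the accuracies than read them off printed output, a variant of the helper that returns the score makes it easy to loop over nested subsets. The sketch below runs on synthetic stand-in data so that it is self-contained; `fit_and_score`, the `demo` frame, and the `ranked` list are hypothetical names, not part of the tutorial's code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def fit_and_score(X_train, X_test, y_train, y_test):
    """Fit a 100-tree Random Forest and return its test accuracy."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Synthetic stand-in data: the target depends on 'a' and 'b' but not on 'c'
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'a': rng.integers(0, 3, size=200),
    'b': rng.integers(0, 2, size=200),
    'c': rng.integers(0, 4, size=200),
})
target = (demo['a'] + demo['b'] > 1).astype(int)
Xtr, Xte, ytr, yte = train_test_split(demo, target, test_size=0.2, random_state=0)

ranked = ['a', 'b', 'c']  # hypothetical ranking, most significant first
results = {}
for k in range(1, len(ranked) + 1):
    subset = ranked[:k]
    results[tuple(subset)] = fit_and_score(Xtr[subset], Xte[subset], ytr, yte)

for subset, accuracy in results.items():
    print(subset, round(accuracy, 3))
```

With the Titanic data, `ranked` would be the index of the sorted p-value series, and the loop would reproduce the nested experiments in the following sections.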

Two Features: `who` and `sex`

Start with only the two most significant features to establish a baseline:

PYTHON

X_train_2 = X_train[['who', 'sex']]
X_test_2 = X_test[['who', 'sex']]

PYTHON

%%time
run_randomForest(X_train_2, X_test_2, y_train, y_test)

OUTPUT

Accuracy:  0.7191011235955056
Wall time: 687 ms

With just two features the model reaches 71.9% accuracy — a reasonable starting point given the limited information.

Three Features: `who`, `sex`, and `pclass`

Add pclass (the third most significant feature by p-value) and retrain:

PYTHON

X_train_3 = X_train[['who', 'sex', 'pclass']]
X_test_3 = X_test[['who', 'sex', 'pclass']]

PYTHON

%%time
run_randomForest(X_train_3, X_test_3, y_train, y_test)

OUTPUT

Accuracy:  0.7415730337078652
Wall time: 649 ms

Adding pclass lifts accuracy to 74.2%, which suggests that passenger class carries signal beyond what who and sex already capture.

Four Features: `who`, `sex`, `pclass`, and `embarked`

PYTHON

X_train_4 = X_train[['who', 'sex', 'pclass', 'embarked']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'embarked']]

PYTHON

%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)

OUTPUT

Accuracy:  0.7584269662921348
Wall time: 609 ms

This is the best result so far at 75.8%. Including the embarkation port further improves the model.

Four Features: `who`, `sex`, `pclass`, and `alone`

Test whether swapping embarked for alone (the fifth-ranked feature) produces a different outcome:

PYTHON

X_train_4 = X_train[['who', 'sex', 'pclass', 'alone']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'alone']]

PYTHON

%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)

OUTPUT

Accuracy:  0.7528089887640449
Wall time: 710 ms

The alone variant scores 75.3%, slightly below the 75.8% reached with embarked. On a 178-row test set that gap is a single passenger, so treat the two as roughly tied rather than as proof that embarked carries more survival signal.

Five Features: `who`, `sex`, `pclass`, `embarked`, and `alone`

PYTHON

X_train_5 = X_train[['who', 'sex', 'pclass', 'embarked', 'alone']]
X_test_5 = X_test[['who', 'sex', 'pclass', 'embarked', 'alone']]

PYTHON

%%time
run_randomForest(X_train_5, X_test_5, y_train, y_test)

OUTPUT

Accuracy:  0.7528089887640449
Wall time: 413 ms

Adding alone to the four-feature set that includes embarked provides no gain: accuracy is 75.3%, slightly below the 75.8% of that four-feature set and identical to the alone variant. This is a sign of diminishing returns as you approach the low-signal features.

Full Dataset: All Seven Features

Finally, evaluate using the complete set, including the statistically non-significant sibsp:

PYTHON

%%time
run_randomForest(X_train, X_test, y_train, y_test)

OUTPUT

Accuracy:  0.7359550561797753
Wall time: 576 ms

Using all seven features drops accuracy back to 73.6%. The two features added in this step, parch and the non-significant sibsp, appear to contribute more noise than signal, which is a useful reminder that more features do not always mean a better model.
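Because the test set holds only 178 passengers (20% of 889), each prediction is worth about 0.56 percentage points, so the gaps between the subsets above come down to a handful of passengers. Cross-validation gives a steadier comparison. The sketch below shows the pattern on synthetic data generated with `make_classification`; the arrays and the choice of the first four columns as a "subset" are placeholders, not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for an encoded feature matrix with seven columns
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=3, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Compare the full matrix with an arbitrary four-column subset
candidates = {
    'all seven columns': X_demo,
    'first four columns': X_demo[:, :4],
}
cv_scores = {}
for name, features in candidates.items():
    cv_scores[name] = cross_val_score(clf, features, y_demo, cv=5, scoring='accuracy')
    print(name, round(cv_scores[name].mean(), 3), '+/-', round(cv_scores[name].std(), 3))
```

Comparing the fold-to-fold standard deviation with the difference in means gives a quick sense of whether one subset is really better than another.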

Conclusion

In this tutorial, you applied the Chi-Squared ( $χ^{2}$ ) test to rank categorical features by their statistical dependence on survival in the Titanic dataset. Features like who, sex, and pclass showed p-values near zero, which is extremely strong evidence of a real relationship with the target. sibsp (p-value ≈ 0.47) was the only feature that failed to reach significance. A systematic experiment with Random Forest classifiers was broadly consistent with the ranking: four features (who, sex, pclass, embarked) achieved 75.8% accuracy, while using all seven features dropped accuracy to 73.6%.

Key takeaways:

The Chi-Squared test quantifies independence between a categorical feature and the target class. A p-value below 0.05 is the conventional threshold for statistical significance.
Fisher Score and $χ^{2}$ both evaluate features individually — they find the best single features, not necessarily the best subset. Treat their rankings as a strong starting point, not a definitive answer.
Encoding categorical features as non-negative numbers is a prerequisite for chi2; the function raises an error on string inputs and on negative values.
Removing a statistically insignificant feature can raise model accuracy — noise reduction often matters more than marginal information gain.
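One caveat on the encoding takeaway: scikit-learn's chi2 treats each column as non-negative counts, so for a nominal feature with more than two categories the statistic depends on which integer each category happens to receive. A contingency-table test does not have that dependence. The sketch below contrasts the two on a tiny invented feature; the `port` and `target` arrays and both codings are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import chi2

# Hypothetical nominal feature (three ports) and binary target
port = np.array(['S'] * 6 + ['C'] * 6 + ['Q'] * 6)
target = np.array([0, 0, 0, 0, 0, 1,
                   1, 1, 1, 1, 0, 0,
                   0, 0, 0, 1, 1, 1])

# Two equally arbitrary integer codings of the same categories
coding_a = {'S': 0, 'C': 1, 'Q': 2}
coding_b = {'S': 2, 'C': 0, 'Q': 1}
x_a = pd.Series(port).map(coding_a).to_numpy().reshape(-1, 1)
x_b = pd.Series(port).map(coding_b).to_numpy().reshape(-1, 1)

# sklearn's chi2 sums the codes per class, so the result changes with the coding
stat_a = chi2(x_a, target)[0][0]
stat_b = chi2(x_b, target)[0][0]

# The contingency-table test only looks at category counts per class
table = pd.crosstab(port, target)
stat_table, p_table, dof, _ = chi2_contingency(table)

print(round(stat_a, 3), round(stat_b, 3))
print(round(stat_table, 3), dof)
```

The two codings give different chi2 statistics (0.9 and 2.025) for identical data, while the contingency-table statistic (3.15 with 2 degrees of freedom) is independent of the labels. For nominal features with several categories, one-hot encoding before chi2 or a contingency-table test is therefore a safer cross-check.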

Next steps:

Extend your feature selection knowledge to continuous numerical variables with Feature Selection Using Univariate ANOVA Test.
Learn information-theory-based selection in Feature Selection Based on Mutual Information.
Explore filter methods that remove constant, quasi-constant, and duplicate features first in Filtering Method: Constant, Quasi-Constant, and Duplicate Feature Removal.
