
Feature Selection: Fisher Score & Chi2

Apply Fisher Score and Chi-squared tests for feature selection on the Titanic dataset in Python. Covers categorical feature scoring with scikit-learn chi2.

May 23, 2026 at 4:30 PM

Topics You Will Master

  • What the Chi-Squared ($\chi^2$) test is and when to use it for feature selection
  • How Fisher Score ranks features by their ability to separate classes
  • How to apply chi2 with SelectKBest from Scikit-learn on real categorical data
  • How to validate feature rankings by comparing Random Forest classifier accuracy across different feature subsets

Best For

Python developers and data scientists who work with classification problems containing categorical features and want a principled, statistics-based way to decide which features to keep.

Expected Outcome

A ranked list of Titanic features by statistical significance, plus a Random Forest classifier trained on the best subset — demonstrating that four well-chosen features outperform the full seven-feature set.

When you build a classification model, not every feature in your dataset is useful. Some features are statistically independent of the target variable — they carry no signal — and including them adds noise, slows training, and can hurt accuracy. Feature selection is the process of identifying and keeping only the features that matter.

For categorical features, the Chi-Squared ($\chi^2$) test is one of the most widely used filter methods. It is a statistical test of whether two categorical variables are independent of each other. A low p-value (the probability of seeing a statistic at least this large if the feature and target were truly independent) is evidence that the feature is related to the target class and worth keeping. A high p-value means the data give no evidence of dependence, making the feature a good candidate for removal.

Fisher Score is a related supervised method that ranks each feature individually by how well it separates the classes — features that produce tight within-class clusters and wide between-class distances receive a higher score.
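
One common formulation of the Fisher Score for a single feature divides the between-class scatter (class size times the squared distance of each class mean from the overall mean) by the within-class scatter (class size times the class variance). The sketch below assumes that formulation; the helper name fisher_score and the toy arrays are illustrative and are not part of scikit-learn.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher Score: between-class scatter / within-class scatter.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        n_k = rows.shape[0]
        between += n_k * (rows.mean(axis=0) - overall_mean) ** 2
        within += n_k * rows.var(axis=0)
    # Avoid division by zero when a feature is constant inside every class.
    return between / np.where(within == 0, np.finfo(float).eps, within)

# Toy data: column 0 tracks the class label, column 1 does not.
X_toy = np.array([[0.9, 5.0], [1.1, 1.0], [1.0, 3.0],
                  [3.0, 1.0], [2.9, 5.0], [3.1, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
scores = fisher_score(X_toy, y_toy)
print(scores)
```

Column 0 has tight clusters around two well-separated class means, so it receives a much larger score than column 1, whose class means coincide.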

In this tutorial, you will score the categorical features of the Titanic dataset with scikit-learn's chi2, then validate those rankings by training Random Forest classifiers on progressively larger feature subsets.

The Chi-Square test statistic is defined as:

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Where:

  • $\chi^2$: the Chi-Squared test statistic; larger values indicate stronger dependence between the feature and the target
  • $O_i$: the observed frequency count for category $i$ in the data
  • $E_i$: the expected frequency count for category $i$ under the null hypothesis that the feature and target are independent
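
As a worked example of these definitions, the sketch below sums (observed - expected)² / expected over every cell of a small contingency table. The counts are made up for illustration and are not Titanic figures; expected counts come from the row and column totals under the independence assumption.

```python
import numpy as np

# Made-up 2x2 table: rows are two feature categories, columns are two classes.
observed = np.array([[10, 30],
                     [30, 10]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)

# Under independence, each expected cell is row total * column total / grand total.
expected = row_totals @ col_totals / observed.sum()

chi_squared = ((observed - expected) ** 2 / expected).sum()
print(expected)      # every cell is 20.0
print(chi_squared)   # 20.0
```

Every cell deviates from its expected count of 20 by 10, so each contributes 100 / 20 = 5 and the statistic is 20.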

Prerequisites: Python 3.x, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib.

Loading the Required Dataset

Import all the libraries needed for data analysis, modeling, and statistical testing up front:

PYTHON
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
PYTHON
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score

Load the Titanic dataset directly from Seaborn's built-in datasets:

PYTHON
titanic = sns.load_dataset('titanic')
titanic.head()
OUTPUT
   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True

Inspect the dataset for missing values so you know which columns need cleaning before modeling:

PYTHON
titanic.isnull().sum()
OUTPUT
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
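
Raw counts are easier to judge as a share of the rows: 688 of 891 is about 77%, while 177 of 891 is about 20%. A minimal sketch of that calculation, using a made-up toy frame in place of the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'deck': [np.nan, 'C', np.nan, np.nan],
    'fare': [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives booleans, so the column mean is the fraction of missing rows.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data the same expression would be titanic.isnull().mean().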

The deck column is missing 688 of 891 values, too many to impute reliably, and age is missing 177. Neither is among the categorical features scored later, so drop both columns, then drop any remaining rows with NaN values:

PYTHON
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
titanic = titanic.dropna()
titanic.isnull().sum()
OUTPUT
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64

The Chi-Square test operates only on categorical (or discrete non-negative integer) features, so filter the dataset down to those columns only:

PYTHON
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
data.head()
OUTPUT
   pclass     sex  sibsp  parch embarked    who  alone
0       3    male      1      0        S    man  False
1       1  female      1      0        C  woman  False
2       3  female      0      0        S  woman   True
3       1  female      1      0        S  woman  False
4       3    male      0      0        S    man   True

Confirm there are no missing values in your filtered feature set:

PYTHON
data.isnull().sum()
OUTPUT
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64

Encoding Categorical Features

Scikit-learn's chi2 function expects non-negative numeric inputs, so string columns must be mapped to numeric codes; the boolean alone flag is mapped to 0 and 1 as well so that every column has an explicit integer encoding. Encode sex as 0 for male and 1 for female:

PYTHON
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
data.head()
OUTPUT
   pclass  sex  sibsp  parch embarked    who  alone
0       3    0      1      0        S    man  False
1       1    1      1      0        C  woman  False
2       3    1      0      0        S  woman   True
3       1    1      1      0        S  woman  False
4       3    0      0      0        S    man   True

Apply the same approach to the embarked port codes, the who passenger categories, and the boolean alone flag:

PYTHON
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
PYTHON
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
PYTHON
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
data.head()
OUTPUT
   pclass  sex  sibsp  parch  embarked  who  alone
0       3    0      1      0         0    0      0
1       1    1      1      0         1    1      0
2       3    1      0      0         0    1      1
3       1    1      1      0         0    1      0
4       3    0      0      0         0    0      1
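
One thing to watch with dictionary mapping: Series.map returns NaN for any category missing from the dictionary, which can slip through unnoticed. A minimal sketch of a stricter variant follows; encode_column is a hypothetical helper name, not a pandas function, and the toy frame is made up.

```python
import pandas as pd

def encode_column(series, mapping):
    """Map categories to integer codes and fail loudly on unmapped values.

    Hypothetical helper for illustration; not part of pandas.
    """
    encoded = series.map(mapping)
    unmapped = series[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped categories in {series.name!r}: {list(unmapped)}")
    return encoded.astype(int)

toy = pd.DataFrame({'embarked': ['S', 'C', 'Q', 'S']})
toy['embarked'] = encode_column(toy['embarked'], {'S': 0, 'C': 1, 'Q': 2})
print(toy['embarked'].tolist())  # [0, 1, 2, 0]
```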

Running the Chi-Squared Test

Separate your encoded features from the survival target, then check shapes to confirm alignment:

PYTHON
X = data.copy()
y = titanic['survived']
X.shape, y.shape
OUTPUT
((889, 7), (889,))

Split into training and test sets (80 / 20), then call chi2 on the training data to compute the test statistic and p-value for each feature:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
f_score = chi2(X_train, y_train)
f_score
OUTPUT
(array([ 22.65169202, 152.91534343,   0.52934285,  10.35663782, 16.13255653, 161.42431175,  13.4382363 ]), array([1.94189138e-06, 3.99737147e-35, 4.66883271e-01, 1.29009955e-03, 5.90599986e-05, 5.52664700e-37, 2.46547298e-04]))
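
It can help to know what chi2 computes. As implemented in current scikit-learn releases, it does not build a contingency table per feature. Instead it treats each column as non-negative counts, sums the column within each class to get the observed values, and compares them with the totals expected from the class proportions. The sketch below reproduces that calculation by hand on made-up toy data, assuming that behavior; chi2_by_hand is an illustrative name.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist
from sklearn.feature_selection import chi2

def chi2_by_hand(X, y):
    """Recompute sklearn-style chi2 scores (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot class membership, shape (n_samples, n_classes).
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                                   # per-class sums of each feature
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))   # class share * feature total
    stat = ((observed - expected) ** 2 / expected).sum(axis=0)
    p_value = chi2_dist.sf(stat, df=len(classes) - 1)
    return stat, p_value

# Made-up encoded features and binary target.
X_toy = np.array([[3, 0], [1, 1], [3, 0], [1, 1], [2, 0], [3, 1]])
y_toy = np.array([0, 1, 0, 1, 1, 0])

stat_manual, p_manual = chi2_by_hand(X_toy, y_toy)
stat_sklearn, p_sklearn = chi2(X_toy, y_toy)
print(np.allclose(stat_manual, stat_sklearn), np.allclose(p_manual, p_sklearn))
```

Because the column values themselves are summed, the integer codes chosen for a multi-category feature such as who or embarked influence its score, which is worth keeping in mind when reading the rankings.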

chi2 returns a tuple: the first array contains the statistics and the second contains the corresponding p-values. Align the p-values with their feature names and sort them in ascending order so the most significant features appear first:

PYTHON
p_values = pd.Series(f_score[1], index = X_train.columns)
p_values.sort_values(ascending = True, inplace = True)
p_values
OUTPUT
who         5.526647e-37
sex         3.997371e-35
pclass      1.941891e-06
embarked    5.906000e-05
alone       2.465473e-04
parch       1.290100e-03
sibsp       4.668833e-01
dtype: float64

who and sex have p-values near zero, which is extremely strong evidence of dependence on survival. sibsp has a p-value of 0.47, which is not statistically significant: the test finds no evidence that it depends on survival, so it is the first candidate for removal.
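
Rather than reading the ranking by eye, SelectKBest can keep the k highest-scoring features directly. A minimal sketch on made-up toy data, where one column mirrors the target and the other is constant:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (made up): 'signal' equals the target, 'constant' never changes.
X_toy = pd.DataFrame({
    'signal':   [1, 0, 1, 0, 1, 0, 1, 0],
    'constant': [1, 1, 1, 1, 1, 1, 1, 1],
})
y_toy = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)  # ['signal']

X_reduced = selector.transform(X_toy)  # array with only the selected column
```

In this tutorial's setting the equivalent call would be SelectKBest(chi2, k=4).fit(X_train, y_train), followed by transform on both X_train and X_test.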

The bar chart below shows the same pattern: every feature except sibsp has a p-value too small to see at this scale, while sibsp stands clearly apart near 0.47:

Bar plot of Chi-Squared test p-values for each Titanic feature, showing sibsp as the only non-significant feature
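
A chart like this can be produced with the pandas plotting API. The sketch below is one possible version, not necessarily the code behind the original figure; it rebuilds the p-value Series from the output shown earlier so that it runs on its own, and the 0.05 reference line is an addition for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# p-values copied from the sorted output above.
p_values = pd.Series(
    [5.526647e-37, 3.997371e-35, 1.941891e-06, 5.906000e-05,
     2.465473e-04, 1.290100e-03, 4.668833e-01],
    index=['who', 'sex', 'pclass', 'embarked', 'alone', 'parch', 'sibsp'],
)

ax = p_values.plot.bar()
# Reference line at the conventional significance threshold.
ax.axhline(0.05, color='red', linestyle='--', label='p = 0.05')
ax.set_ylabel('p-value')
ax.set_title('Chi-squared p-values by feature')
ax.legend()
plt.tight_layout()
```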

Evaluating Feature Subsets with a Random Forest Classifier

To confirm that the statistical rankings translate into real accuracy gains, you will train a Random Forest classifier — an ensemble of decision trees that votes on the final prediction — on progressively larger subsets of features, ordered by their p-values from lowest to highest.

Define a helper function that fits a 100-tree Random Forest and prints the test accuracy:

PYTHON
def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
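
The subset-by-subset experiments that follow can also be written as a loop. The sketch below is one way to do it, assuming a helper that returns the accuracy instead of printing it; accuracy_by_top_k is a hypothetical name, and the demo uses synthetic stand-in data rather than the Titanic set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_by_top_k(X_train, X_test, y_train, y_test, ranked_features):
    """Return {k: test accuracy} using the first k features of the ranking."""
    results = {}
    for k in range(1, len(ranked_features) + 1):
        cols = list(ranked_features[:k])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train[cols], y_train)
        results[k] = accuracy_score(y_test, clf.predict(X_test[cols]))
    return results

# Synthetic stand-in data (not Titanic): only column 'a' determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 3, size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = (X_demo['a'] > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

results = accuracy_by_top_k(Xtr, Xte, ytr, yte, ['a', 'b', 'c'])
print(results)
```

With the tutorial's variables, the ranking argument would be p_values.index, which is already sorted from most to least significant.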

Two Features: who and sex

Start with only the two most significant features to establish a baseline:

PYTHON
X_train_2 = X_train[['who', 'sex']]
X_test_2 = X_test[['who', 'sex']]
PYTHON
%%time
run_randomForest(X_train_2, X_test_2, y_train, y_test)
OUTPUT
Accuracy:  0.7191011235955056
Wall time: 687 ms

With just two features the model reaches 71.9% accuracy — a reasonable starting point given the limited information.

Three Features: who, sex, and pclass

Add pclass (the third most significant feature by p-value) and retrain:

PYTHON
X_train_3 = X_train[['who', 'sex', 'pclass']]
X_test_3 = X_test[['who', 'sex', 'pclass']]
PYTHON
%%time
run_randomForest(X_train_3, X_test_3, y_train, y_test)
OUTPUT
Accuracy:  0.7415730337078652
Wall time: 649 ms

Adding pclass lifts accuracy to 74.2%, confirming that passenger class carries meaningful signal beyond what who and sex already capture.

Four Features: who, sex, pclass, and embarked

PYTHON
X_train_4 = X_train[['who', 'sex', 'pclass', 'embarked']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'embarked']]
PYTHON
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
OUTPUT
Accuracy:  0.7584269662921348
Wall time: 609 ms

This is the best result so far at 75.8%. Including the embarkation port further improves the model.

Four Features: who, sex, pclass, and alone

Test whether swapping embarked for alone (the fifth-ranked feature) produces a different outcome:

PYTHON
X_train_4 = X_train[['who', 'sex', 'pclass', 'alone']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'alone']]
PYTHON
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
OUTPUT
Accuracy:  0.7528089887640449
Wall time: 710 ms

The alone variant scores 75.3%, one fewer correct prediction out of 178 test passengers than the embarked variant. On this split the two features are close to interchangeable, with embarked marginally ahead.

Five Features: who, sex, pclass, embarked, and alone

PYTHON
X_train_5 = X_train[['who', 'sex', 'pclass', 'embarked', 'alone']]
X_test_5 = X_test[['who', 'sex', 'pclass', 'embarked', 'alone']]
PYTHON
%%time
run_randomForest(X_train_5, X_test_5, y_train, y_test)
OUTPUT
Accuracy:  0.7528089887640449
Wall time: 413 ms

Adding alone to the who, sex, pclass, embarked set provides no gain: accuracy slips from 75.8% to 75.3%, the same score as the four-feature alone variant. This is a sign of diminishing returns as you approach the low-signal features.

Full Dataset: All Seven Features

Finally, evaluate using the complete set, including the statistically non-significant sibsp:

PYTHON
%%time
run_randomForest(X_train, X_test, y_train, y_test)
OUTPUT
Accuracy:  0.7359550561797753
Wall time: 576 ms

Using all seven features drops accuracy back to 73.6%. The full set adds parch and the non-significant sibsp, and on this split the extra columns hurt rather than help: 131 correct predictions out of 178, versus 135 for the best four-feature subset. More features do not always mean a better model.
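
The gaps between subsets amount to a handful of test passengers on a single 80 / 20 split, so a cross-validated estimate gives a steadier comparison. A minimal sketch using cross_val_score follows; cv_accuracy is a hypothetical helper name and the demo data is synthetic, not the Titanic set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, columns, folds=5):
    """Mean and standard deviation of accuracy across cross-validation folds."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X[columns], y, cv=folds, scoring='accuracy')
    return scores.mean(), scores.std()

# Synthetic stand-in data: only column 'a' determines the label.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(0, 3, size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = (X_demo['a'] > 0).astype(int)

mean_acc, std_acc = cv_accuracy(X_demo, y_demo, ['a', 'b'])
print(f'{mean_acc:.3f} +/- {std_acc:.3f}')
```

Calling cv_accuracy(X, y, [...]) once per candidate subset on the encoded Titanic features would show whether the ordering seen above holds across folds.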

Conclusion

In this tutorial, you applied the Chi-Squared ($\chi^2$) test to rank categorical features by their statistical dependence on survival in the Titanic dataset. Features like who, sex, and pclass showed p-values near zero, which is extremely strong evidence of a real relationship with the target. sibsp (p-value ≈ 0.47) was the only feature that was not statistically significant. A systematic experiment with Random Forest classifiers was consistent with the rankings: four features (who, sex, pclass, embarked) achieved 75.8% accuracy, while using all seven features dropped accuracy to 73.6%, a difference of four correct predictions out of 178 test passengers.

Key takeaways:

  • The Chi-Squared test quantifies independence between a categorical feature and the target class. A p-value below 0.05 is the conventional threshold for statistical significance.
  • Fisher Score and $\chi^2$ both evaluate features individually: they find the best single features, not necessarily the best subset. Treat their rankings as a strong starting point, not a definitive answer.
  • Encoding categorical features as non-negative numbers is a prerequisite for chi2; the function cannot process string inputs and rejects negative values.
  • Removing a statistically insignificant feature can raise model accuracy — noise reduction often matters more than marginal information gain.
