Feature Selection using Fisher Score and Chi2 (χ2) Test | Titanic Dataset | Machine Learning | KGP Talkie

Published by KGP Talkie on

Feature Selection using Fisher Score and Chi2 (χ2) Test

Watch Full Playlist: https://www.youtube.com/playlist?list=PLc2rvfiptPSQYzmDIFuq2PqN2n28ZjxDH

What is Fisher Score and Chi2 ( χ2) Test

Fisher score is one of the most widely used supervised feature selection methods. However, it selects each feature independently according to their scores under the Fisher criterion, which leads to a suboptimal subset of features.

Chi Square (χ2) Test

A chi-squared test, also written as X2

test, is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution.

chi-square test measures dependence between stochastic variables, so using this function weeds out the features that are the most likely to be independent of class and therefore irrelevant for classification.

Importing required libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score

Loading the required dataset

titanic = sns.load_dataset('titanic')
titanic.head()
survivedpclasssexagesibspparchfareembarkedclasswhoadult_maledeckembark_townalivealone
003male22.0107.2500SThirdmanTrueNaNSouthamptonnoFalse
111female38.01071.2833CFirstwomanFalseCCherbourgyesFalse
213female26.0007.9250SThirdwomanFalseNaNSouthamptonyesTrue
311female35.01053.1000SFirstwomanFalseCSouthamptonyesFalse
403male35.0008.0500SThirdmanTrueNaNSouthamptonnoTrue
titanic.isnull().sum()
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Let's drop labels age and dect from the dataset.

titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
titanic = titanic.dropna()
titanic.isnull().sum()
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
data.head()
pclasssexsibspparchembarkedwhoalone
03male10SmanFalse
11female10CwomanFalse
23female00SwomanTrue
31female10SwomanFalse
43male00SmanTrue
data.isnull().sum()
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
data.head()
pclasssexsibspparchembarkedwhoalone
03010SmanFalse
11110CwomanFalse
23100SwomanTrue
31110SwomanFalse
43000SmanTrue
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
data.head()
pclasssexsibspparchembarkedwhoalone
03010000
11110110
23100011
31110010
43000001

Do F_Score

X = data.copy()
y = titanic['survived']
X.shape, y.shape
((889, 7), (889,))

Let's train, test and split the dataset with test size equals to 0.2.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
f_score = chi2(X_train, y_train)
f_score
(array([ 22.65169202, 152.91534343,   0.52934285,  10.35663782,
         16.13255653, 161.42431175,  13.4382363 ]),
 array([1.94189138e-06, 3.99737147e-35, 4.66883271e-01, 1.29009955e-03,
        5.90599986e-05, 5.52664700e-37, 2.46547298e-04]))
p_values = pd.Series(f_score[1], index = X_train.columns)
p_values.sort_values(ascending = True, inplace = True)
p_values
who         5.526647e-37
sex         3.997371e-35
pclass      1.941891e-06
embarked    5.906000e-05
alone       2.465473e-04
parch       1.290100e-03
sibsp       4.668833e-01
dtype: float64
p_values.plot.bar()
plt.title('pvalues with respect to features')
X_train_2 = X_train[['who', 'sex']]
X_test_2 = X_test[['who', 'sex']]

Now, we will do the Random classification to predict the value of y.

def run_randomForest(X_train, X_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
%%time
run_randomForest(X_train_2, X_test_2, y_train, y_test)
Accuracy:  0.7191011235955056
Wall time: 687 ms
X_train_3 = X_train[['who', 'sex', 'pclass']]
X_test_3 = X_test[['who', 'sex', 'pclass']]
%%time
run_randomForest(X_train_3, X_test_3, y_train, y_test)
Accuracy:  0.7415730337078652
Wall time: 649 ms
X_train_4 = X_train[['who', 'sex', 'pclass', 'embarked']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'embarked']]
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
Accuracy:  0.7584269662921348
Wall time: 609 ms
X_train_4 = X_train[['who', 'sex', 'pclass', 'alone']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'alone']]
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
Accuracy:  0.7528089887640449
Wall time: 710 ms
X_train_5 = X_train[['who', 'sex', 'pclass', 'embarked', 'alone']]
X_test_5 = X_test[['who', 'sex', 'pclass', 'embarked', 'alone']]    

Let's find out the accuracy and training time of the trined dataset.

%%time
run_randomForest(X_train_5, X_test_5, y_train, y_test)
Accuracy:  0.7528089887640449
Wall time: 413 ms

Let's find out the accuracy and training time of the original dataset.

%%time
run_randomForest(X_train, X_test, y_train, y_test)
Accuracy:  0.7359550561797753
Wall time: 576 ms