## Feature Selection using Fisher Score and Chi2 (χ2) Test

### What is Fisher Score and Chi2 (χ2) Test

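The Fisher score is one of the most widely used supervised feature selection methods. It ranks each feature by how well it separates the classes: a feature scores highly when its class means are far apart relative to the spread within each class. However, it scores each feature independently under the Fisher criterion, so it ignores redundancy and interactions between features and can lead to a suboptimal subset of features.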
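To make the criterion concrete, here is a minimal NumPy sketch. It assumes one common formulation of the Fisher score (between-class scatter of the feature means divided by the within-class variance, both weighted by class size). The function name `fisher_score` and the toy data are hypothetical and do not come from any library.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score (illustrative sketch, not a library API).

    Assumes every feature has non-zero within-class variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_class = X[y == label]
        n_class = X_class.shape[0]
        # Spread of this class mean around the overall mean.
        between += n_class * (X_class.mean(axis=0) - overall_mean) ** 2
        # Spread of the samples inside this class (population variance).
        within += n_class * X_class.var(axis=0)
    return between / within

# Toy data: feature 0 separates the two classes well, feature 1 does not.
X_toy = np.array([[1, 5], [2, 3], [8, 4], [9, 6]])
y_toy = np.array([0, 0, 1, 1])
scores = fisher_score(X_toy, y_toy)
print(scores)  # feature 0 scores far higher than feature 1
```

Because each column is scored on its own, two highly correlated features would both receive high scores, which is the independence limitation described above.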

### Chi Square (χ2) Test

A chi-squared test, also written as a χ2 test, is any statistical hypothesis test where the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true.

The chi-square test measures dependence between stochastic variables, so using it weeds out the features that are most likely to be independent of the class and therefore irrelevant for classification.
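As a small worked example of the idea, the sketch below computes the classic chi-squared statistic for a contingency table of observed counts, comparing each cell with the count expected if rows and columns were independent. The helper name `chi_square_statistic` and the counts are made up for illustration.

```python
def chi_square_statistic(table):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are two groups, columns are two outcomes.
table = [[20, 30],
         [30, 20]]
stat = chi_square_statistic(table)
print(stat)  # 4.0
```

For a 2x2 table there is one degree of freedom, and the 5% critical value is about 3.84, so a statistic of 4.0 would lead us to reject independence at that level.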

Importing required libraries

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest, SelectPercentile
from sklearn.metrics import accuracy_score
```

Loading the required dataset

```python
titanic = sns.load_dataset('titanic')
titanic.head()
```
```python
titanic.isnull().sum()
```
```
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```

Let’s drop the `age` and `deck` columns from the dataset, then remove the few remaining rows that contain missing values.

```python
titanic.drop(labels = ['age', 'deck'], axis = 1, inplace = True)
titanic = titanic.dropna()
titanic.isnull().sum()
```
```
survived       0
pclass         0
sex            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
```
```python
data = titanic[['pclass', 'sex', 'sibsp', 'parch', 'embarked', 'who', 'alone']].copy()
data.head()
```
```python
data.isnull().sum()
```
```
pclass      0
sex         0
sibsp       0
parch       0
embarked    0
who         0
alone       0
dtype: int64
```
```python
sex = {'male': 0, 'female': 1}
data['sex'] = data['sex'].map(sex)
data.head()
```
```python
ports = {'S': 0, 'C': 1, 'Q': 2}
data['embarked'] = data['embarked'].map(ports)
```
```python
who = {'man': 0, 'woman': 1, 'child': 2}
data['who'] = data['who'].map(who)
```
```python
alone = {True: 1, False: 0}
data['alone'] = data['alone'].map(alone)
data.head()
```

### Compute the Chi2 Scores and p-values

```python
X = data.copy()
y = titanic['survived']
X.shape, y.shape
```
```
((889, 7), (889,))
```

Let’s split the dataset into training and test sets, with the test size set to `0.2`.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
f_score = chi2(X_train, y_train)
f_score
```
```
(array([ 22.65169202, 152.91534343,   0.52934285,  10.35663782,
         16.13255653, 161.42431175,  13.4382363 ]),
 array([1.94189138e-06, 3.99737147e-35, 4.66883271e-01, 1.29009955e-03,
        5.90599986e-05, 5.52664700e-37, 2.46547298e-04]))
```
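`chi2` returns a tuple of two arrays in column order: the chi-squared statistics first, then the corresponding p-values. To show where the statistics come from, the NumPy sketch below mirrors the computation as we understand scikit-learn performs it: each feature's values are summed per class (observed) and compared with the sums expected if the feature were spread across classes in proportion to class frequency. The function name `chi2_scores` and the toy data are hypothetical, and the sketch omits the p-values.

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-squared statistic per feature (illustrative sketch).

    Assumes non-negative feature values, as sklearn's chi2 requires.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # One-hot indicator matrix: Y[i, k] is 1 when sample i belongs to class k.
    Y = (y[:, None] == classes[None, :]).astype(float)
    observed = Y.T @ X                  # per-class sum of each feature
    feature_totals = X.sum(axis=0)      # overall sum of each feature
    class_prob = Y.mean(axis=0)         # fraction of samples in each class
    expected = np.outer(class_prob, feature_totals)
    return ((observed - expected) ** 2 / expected).sum(axis=0)

X_toy = np.array([[1, 0], [2, 0], [0, 3], [0, 1]])
y_toy = np.array([0, 0, 1, 1])
scores = chi2_scores(X_toy, y_toy)
print(scores)  # [3. 4.]
```

A larger statistic (and a smaller p-value) means the feature's values are distributed across the classes less like an independent feature would be.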
```python
# chi2 returns (statistics, p-values); the p-values are the second array
p_values = pd.Series(f_score[1], index = X_train.columns)
p_values.sort_values(ascending = True, inplace = True)
p_values
```
```
who         5.526647e-37
sex         3.997371e-35
pclass      1.941891e-06
embarked    5.906000e-05
alone       2.465473e-04
parch       1.290100e-03
sibsp       4.668833e-01
dtype: float64
```
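One way to read this ranking is to keep only the features whose p-value falls below a chosen significance level. The sketch below rebuilds the series from the values printed above and applies a threshold of `0.05`, which is an assumed, conventional choice rather than something the data dictates.

```python
import pandas as pd

# p-values copied from the output above
p_values = pd.Series({
    'who': 5.526647e-37,
    'sex': 3.997371e-35,
    'pclass': 1.941891e-06,
    'embarked': 5.906000e-05,
    'alone': 2.465473e-04,
    'parch': 1.290100e-03,
    'sibsp': 4.668833e-01,
})

alpha = 0.05  # assumed significance level
significant = p_values[p_values < alpha].index.tolist()
print(significant)
# ['who', 'sex', 'pclass', 'embarked', 'alone', 'parch']
```

Only `sibsp` fails the threshold here, so on its own the test gives little evidence that it depends on survival.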
```python
p_values.plot.bar()
plt.title('pvalues with respect to features')
```
```python
X_train_2 = X_train[['who', 'sex']]
X_test_2 = X_test[['who', 'sex']]
```
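Picking columns by hand works, but `SelectKBest` (already imported above) can do the same ranking and selection in one step. The sketch below uses a small made-up array so that it runs on its own; the feature names `f0`, `f1`, `f2` are placeholders.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: f2 is constant, so it carries no information about the class.
feature_names = np.array(['f0', 'f1', 'f2'])
X_toy = np.array([[1, 0, 1],
                  [2, 0, 1],
                  [0, 3, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])

# Keep the k features with the highest chi2 statistic.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X_toy, y_toy)

# get_support() returns a boolean mask over the original columns.
selected = feature_names[selector.get_support()].tolist()
print(selected)         # ['f0', 'f1']
print(X_reduced.shape)  # (4, 2)
```

On the Titanic data, fitting `SelectKBest(chi2, k=2)` on `X_train` and `y_train` should, given the scores shown earlier, keep `who` and `sex`. The fitted selector's `transform` method can then be applied to `X_test` so that both sets keep the same columns.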

Now we will train a Random Forest classifier to predict the value of `y`.

```python
def run_randomForest(X_train, X_test, y_train, y_test):
    # Train a random forest on the given features and report test accuracy
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))
```
```python
%%time
run_randomForest(X_train_2, X_test_2, y_train, y_test)
```
```
Accuracy:  0.7191011235955056
Wall time: 687 ms
```
```python
X_train_3 = X_train[['who', 'sex', 'pclass']]
X_test_3 = X_test[['who', 'sex', 'pclass']]
```
```python
%%time
run_randomForest(X_train_3, X_test_3, y_train, y_test)
```
```
Accuracy:  0.7415730337078652
Wall time: 649 ms
```
```python
X_train_4 = X_train[['who', 'sex', 'pclass', 'embarked']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'embarked']]
```
```python
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
```
```
Accuracy:  0.7584269662921348
Wall time: 609 ms
```
```python
X_train_4 = X_train[['who', 'sex', 'pclass', 'alone']]
X_test_4 = X_test[['who', 'sex', 'pclass', 'alone']]
```
```python
%%time
run_randomForest(X_train_4, X_test_4, y_train, y_test)
```
```
Accuracy:  0.7528089887640449
Wall time: 710 ms
```
```python
X_train_5 = X_train[['who', 'sex', 'pclass', 'embarked', 'alone']]
X_test_5 = X_test[['who', 'sex', 'pclass', 'embarked', 'alone']]
```

Let’s find out the `accuracy` and `training time` when the model uses only the five selected features.

```python
%%time
run_randomForest(X_train_5, X_test_5, y_train, y_test)
```
```
Accuracy:  0.7528089887640449
Wall time: 413 ms
```

Let’s find out the `accuracy` and `training time` of the original dataset.

```python
%%time
run_randomForest(X_train, X_test, y_train, y_test)
```
```
Accuracy:  0.7359550561797753
Wall time: 576 ms
```

