# Feature Engineering Tutorial Series 6: Variable magnitude

### Does the magnitude of the variable matter?

In Linear Regression models, the scale of variables used to estimate the output matters. Linear models are of the type **y = w x + b**, where the regression coefficient w represents the expected change in y for a one unit change in x (the predictor). Thus, the magnitude of w is partly determined by the magnitude of the units being used for x. If x is a distance variable, just changing the scale from kilometers to miles will cause a change in the magnitude of the coefficient.
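The kilometers-versus-miles point can be checked with a few lines of arithmetic. The sketch below fits a one-variable least-squares line in plain Python. The distance and outcome values are made up for illustration, and `fit_line` is a hypothetical helper, not a library function.

```python
# Hypothetical data: a distance predictor and some outcome y (illustrative values only).
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

KM_PER_MILE = 1.609344


def fit_line(x, y):
    """Return (w, b) of the least-squares line y = w * x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# The same distances, expressed in miles instead of kilometers.
distance_miles = [d / KM_PER_MILE for d in distance_km]

w_km, b_km = fit_line(distance_km, y)
w_miles, b_miles = fit_line(distance_miles, y)

print('coefficient per km:  ', w_km)
print('coefficient per mile:', w_miles)
print('ratio:', w_miles / w_km)  # matches the km-per-mile conversion factor
```

The fitted line makes the same predictions in both cases. Only the coefficient changes, by exactly the conversion factor between the two units.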

In addition, when we estimate the outcome y from multiple predictors x1, x2, …, xn, predictors with greater numeric ranges tend to dominate over those with smaller numeric ranges.

Gradient descent converges faster when all the predictors (x1 to xn) are within a similar scale, therefore having features in a similar scale is useful for Neural Networks as well.
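To make the convergence claim concrete, here is a minimal sketch of batch gradient descent on a least-squares problem with two made-up predictors on very different scales. Everything in it is an assumption for illustration: the data, the helper names `steps_to_converge` and `min_max`, the conservative step-size rule, and the choice to scale to [-1, 1] rather than [0, 1] so that the sketch can omit an intercept.

```python
def steps_to_converge(X, y, tol=1e-6, max_steps=1_000_000):
    """Run batch gradient descent on a least-squares loss and count the steps."""
    n, d = len(X), len(X[0])
    # Conservative step size: 1 / (sum of the mean squared feature values).
    # This keeps the updates stable for both scaled and unscaled inputs.
    step = 1.0 / sum(sum(row[j] ** 2 for row in X) / n for j in range(d))
    w = [0.0] * d
    for k in range(max_steps):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(X, y)]
        loss = 0.5 * sum(r * r for r in residuals) / n
        if loss < tol:
            return k
        grad = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                for j in range(d)]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return max_steps


def min_max(column, low=-1.0, high=1.0):
    """Min-max scale a list of values to the interval [low, high]."""
    c_min, c_max = min(column), max(column)
    return [low + (high - low) * (v - c_min) / (c_max - c_min) for v in column]


# Hypothetical predictors: x1 spans [-1, 1], x2 spans [-100, 100].
x1 = [-1.0, -1.0, 1.0, 1.0]
x2 = [-100.0, 100.0, -100.0, 100.0]
y = [2.0 * a + 0.03 * b for a, b in zip(x1, x2)]  # noise-free target

X_raw = [list(row) for row in zip(x1, x2)]
X_scaled = [list(row) for row in zip(min_max(x1), min_max(x2))]

steps_raw = steps_to_converge(X_raw, y)
steps_scaled = steps_to_converge(X_scaled, y)
print('steps on raw features:   ', steps_raw)
print('steps on scaled features:', steps_scaled)
```

With the raw features, the step size has to be small enough for the large-magnitude predictor, so the weight of the small-magnitude predictor moves very slowly. Once both predictors share a scale, a single step size suits both weights and far fewer steps are needed.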

In Support Vector Machines, feature scaling can decrease the time required to find the support vectors.

Finally, methods using Euclidean distances or distances in general are also affected by the magnitude of the features, as Euclidean distance is sensitive to variations in the magnitude or scales of the predictors. Therefore feature scaling is required for methods that utilise distance calculations like k-nearest neighbours (KNN) and k-means clustering.
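As a small illustration of this sensitivity, the sketch below measures how much each feature contributes to the squared Euclidean distance between two passengers, before and after min-max scaling. The two passengers and the per-feature minimum and maximum are borrowed from the Titanic data used later in this post. The helper names are hypothetical.

```python
import math

# Two passengers described by (Pclass, Age, Fare).
passenger_a = (3, 22.0, 7.25)
passenger_b = (1, 38.0, 71.2833)

# Per-feature minimum and maximum of the full dataset (assumed known here).
feature_min = (1.0, 0.42, 0.0)
feature_max = (3.0, 80.0, 512.3292)


def squared_diffs(p, q):
    """Squared difference per feature; their sum is the squared Euclidean distance."""
    return [(a - b) ** 2 for a, b in zip(p, q)]


def min_max_scale(p):
    """Rescale each feature of one passenger to the [0, 1] interval."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(p, feature_min, feature_max)]


names = ('Pclass', 'Age', 'Fare')
raw = squared_diffs(passenger_a, passenger_b)
scaled = squared_diffs(min_max_scale(passenger_a), min_max_scale(passenger_b))

for label, diffs in (('raw', raw), ('scaled', scaled)):
    total = sum(diffs)
    shares = {n: round(d / total, 3) for n, d in zip(names, diffs)}
    print(label, 'distance:', round(math.sqrt(total), 3), 'shares:', shares)
```

On the raw values almost all of the distance comes from `Fare`, simply because its numbers are larger. After scaling, the difference in `Pclass` carries most of the distance instead, so which neighbours count as "close" can change completely.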

In short:

#### Magnitude matters because:

- The regression coefficient is directly influenced by the scale of the variable
- Variables with bigger magnitude / value range dominate over the ones with smaller magnitude / value range
- Gradient descent converges faster when features are on similar scales
- Feature scaling helps decrease the time to find support vectors for SVMs
- Euclidean distances are sensitive to feature magnitude.

#### The machine learning models affected by the magnitude of the feature are:

- Linear and Logistic Regression
- Neural Networks
- Support Vector Machines
- KNN
- K-means clustering
- Linear Discriminant Analysis (LDA)
- Principal Component Analysis (PCA)

#### Machine learning models insensitive to feature magnitude are the ones based on Trees:

- Classification and Regression Trees
- Random Forests
- Gradient Boosted Trees
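Why tree-based models are insensitive to feature magnitude can be seen from a single split. The sketch below is a toy decision stump written for illustration, not scikit-learn's implementation. It picks the threshold with the lowest weighted Gini impurity on five `Fare` values from the Titanic data used later in this post, once on the raw values and once after min-max scaling.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, left_indices) for the split with the lowest weighted Gini."""
    n = len(values)
    cuts = sorted(set(values))
    best = None
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2.0  # midpoint between neighbouring values
        left = [i for i in range(n) if values[i] <= threshold]
        right = [i for i in range(n) if values[i] > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / n
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]


fare = [7.25, 71.2833, 7.925, 53.1, 8.05]
survived = [0, 1, 1, 1, 0]

f_min, f_max = min(fare), max(fare)
fare_scaled = [(f - f_min) / (f_max - f_min) for f in fare]

threshold_raw, left_raw = best_split(fare, survived)
threshold_scaled, left_scaled = best_split(fare_scaled, survived)

print('raw threshold:   ', threshold_raw, 'left rows:', left_raw)
print('scaled threshold:', threshold_scaled, 'left rows:', left_scaled)
```

The threshold value changes with the scale, but the rows sent to each side of the split do not, because min-max scaling preserves the ordering of the values. The same reasoning applies to every split in a tree.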

## In this Blog

We will study the effect of feature magnitude on the performance of different machine learning algorithms.

We will use the Titanic dataset.

## Let’s Start!

We will start by importing the necessary libraries.

    # to read the dataset into a dataframe and perform operations on it
    import pandas as pd

    # to perform basic array operations
    import numpy as np

    # import several machine learning algorithms
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.neighbors import KNeighborsClassifier

    # to scale the features
    from sklearn.preprocessing import MinMaxScaler

    # to evaluate performance and separate into
    # train and test set
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

### Load data with numerical variables only

We will start by loading only the variables having numeric values from the `titanic` dataset.

    data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/titanic.csv',
                       usecols=['Pclass', 'Age', 'Fare', 'Survived'])
    data.head()

| | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 7.2500 |
| 1 | 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 7.9250 |
| 3 | 1 | 1 | 35.0 | 53.1000 |
| 4 | 0 | 3 | 35.0 | 8.0500 |

Now we will have a look at the values of those variables to get an idea of the feature magnitudes. `describe` provides descriptive statistics including those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

    data.describe()

| | Survived | Pclass | Age | Fare |
|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 714.000000 | 891.000000 |
| mean | 0.383838 | 2.308642 | 29.699118 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 49.693429 |
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 |
| 25% | 0.000000 | 2.000000 | 20.125000 | 7.910400 |
| 50% | 0.000000 | 3.000000 | 28.000000 | 14.454200 |
| 75% | 1.000000 | 3.000000 | 38.000000 | 31.000000 |
| max | 1.000000 | 3.000000 | 80.000000 | 512.329200 |

We can see that `Fare` varies between 0 and 512, `Age` between 0 and 80, and `Pclass` between 1 and 3. So the variables have different magnitudes.

Let’s calculate the range of each variable. The range of a set of data is the difference between the largest and smallest values.

    for col in ['Pclass', 'Age', 'Fare']:
        print(col, 'range: ', data[col].max() - data[col].min())

Output:

    Pclass range: 2
    Age range: 79.58
    Fare range: 512.3292

The ranges of values that the variables take are quite different.

Now we will split the data into training and testing sets with the help of `train_test_split()`. We will use the variables `Pclass`, `Age` and `Fare` as the feature space and `Survived` as the target. Setting `test_size=0.3` keeps 30% of the data for testing and uses the remaining 70% for training the model. `random_state` controls the shuffling applied to the data before the split. The `titanic` dataset contains missing values, so for this demo we will fill them with 0s using `fillna()`.

    X_train, X_test, y_train, y_test = train_test_split(
        data[['Pclass', 'Age', 'Fare']].fillna(0),
        data.Survived,
        test_size=0.3,
        random_state=0)

    X_train.shape, X_test.shape

Output:

    ((623, 3), (268, 3))

The training dataset contains 623 rows while the test dataset contains 268 rows.

### Feature Scaling

For this demonstration, we will scale the features between 0 and 1, using the `MinMaxScaler` from scikit-learn. To learn more about this scaling, visit the scikit-learn website.

The transformation is given by:

**X_rescaled = (X – X.min) / (X.max – X.min)**

And to transform the re-scaled features back to their original magnitude:

**X = X_rescaled * (X.max – X.min) + X.min**
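The two formulas can be traced by hand on a handful of values. The sketch below applies them in plain Python to the first five `Fare` values shown earlier. It only illustrates the arithmetic and is not how `MinMaxScaler` is implemented internally (scikit-learn exposes the reverse step as `inverse_transform`).

```python
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # first five Fare values shown earlier

x_min, x_max = min(fares), max(fares)

# Forward transformation: X_rescaled = (X - X.min) / (X.max - X.min)
rescaled = [(x - x_min) / (x_max - x_min) for x in fares]

# Inverse transformation: X = X_rescaled * (X.max - X.min) + X.min
restored = [r * (x_max - x_min) + x_min for r in rescaled]

print('rescaled:', [round(r, 4) for r in rescaled])
print('restored:', [round(x, 4) for x in restored])
```

The smallest value maps to 0, the largest to 1, and the inverse transformation recovers the original fares up to floating-point rounding.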

We will first initialize the `scaler`. Then we will fit the `scaler` to the training dataset. Using this fitted `scaler`, we will transform `X_train` as well as `X_test`.

    # call the scaler
    scaler = MinMaxScaler()

    # fit the scaler
    scaler.fit(X_train)

    # re-scale the datasets
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

Let’s have a look at the scaled training dataset.

    print('Mean: ', X_train_scaled.mean(axis=0))
    print('Standard Deviation: ', X_train_scaled.std(axis=0))
    print('Minimum value: ', X_train_scaled.min(axis=0))
    print('Maximum value: ', X_train_scaled.max(axis=0))

Output:

    Mean: [0.64365971 0.30131421 0.06335433]
    Standard Deviation: [0.41999093 0.21983527 0.09411705]
    Minimum value: [0. 0. 0.]
    Maximum value: [1. 1. 1.]

Now the maximum value of every feature is 1 and the minimum value is 0, as expected. So they are on a similar scale.
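One consequence of fitting the scaler on the training set only is that the guarantee of values between 0 and 1 holds for the training data, not necessarily for the test data. The sketch below shows this with made-up splits of a few `Fare` values. The split itself and the `transform` helper are assumptions for illustration.

```python
train_fares = [7.25, 71.2833, 7.925, 53.1]  # hypothetical training values
test_fares = [8.05, 512.3292, 0.0]          # hypothetical test values

# "fit": learn the minimum and maximum from the training data only
fare_min, fare_max = min(train_fares), max(train_fares)


def transform(values):
    """Apply the min-max formula using the statistics learned from the training data."""
    return [(v - fare_min) / (fare_max - fare_min) for v in values]


train_scaled = transform(train_fares)
test_scaled = transform(test_fares)

print('train:', [round(v, 3) for v in train_scaled])
print('test: ', [round(v, 3) for v in test_scaled])
```

Test values above the training maximum end up greater than 1, and values below the training minimum end up negative. This is the price of keeping information from the test set out of the preprocessing step.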

### Logistic Regression

Let’s evaluate the effect of feature scaling on a Logistic Regression. We will first build the model using unscaled variables and then the scaled variables.

    # model built on unscaled variables

    # call the model
    logit = LogisticRegression(
        random_state=44,
        C=1000,  # inverse of regularization strength (larger C to avoid regularization)
        solver='lbfgs')  # algorithm to use in the optimization problem

    # train the model
    logit.fit(X_train, y_train)

    # evaluate performance
    print('Train set')
    pred = logit.predict_proba(X_train)
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = logit.predict_proba(X_test)
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    Logistic Regression roc-auc: 0.7134823539619531
    Test set
    Logistic Regression roc-auc: 0.7080952380952381

Let’s look at the coefficients. `coef_` gives the coefficients of the features in the decision function.

    logit.coef_

Output:

    array([[-0.92585764, -0.01822689, 0.00233577]])

    # model built on scaled variables

    # call the model
    logit = LogisticRegression(
        random_state=44,
        C=1000,  # inverse of regularization strength (larger C to avoid regularization)
        solver='lbfgs')  # algorithm to use in the optimization problem

    # train the model using the re-scaled data
    logit.fit(X_train_scaled, y_train)

    # evaluate performance
    print('Train set')
    pred = logit.predict_proba(X_train_scaled)
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = logit.predict_proba(X_test_scaled)
    print('Logistic Regression roc-auc: {}'.format(
        roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    Logistic Regression roc-auc: 0.7134931997136721
    Test set
    Logistic Regression roc-auc: 0.7080952380952381

Let’s look at the coefficients.

    logit.coef_

Output:

    array([[-1.85170244, -1.45782986, 1.19540159]])

We observe that the performance of logistic regression is practically unchanged when the features are scaled (compare the roc-auc values on the train and test sets for the models with and without feature scaling).

However, the coefficients differ considerably between the two models, because the magnitude of each variable affects its coefficient. After scaling, the coefficients of all 3 variables are of comparable size, whereas before scaling we would have been inclined to think that `Pclass` alone was driving the survival outcome.
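The two sets of coefficients can be reconciled with simple arithmetic. Under min-max scaling, a coefficient on the scaled feature should be roughly the coefficient on the original feature multiplied by that feature's range, with the shift absorbed by the intercept. The check below uses the coefficients printed above. The ranges are an assumption: they are taken from the full-dataset statistics (with missing `Age` values filled with 0), whereas the scaler was fit on the training split.

```python
# Coefficients reported above for (Pclass, Age, Fare).
coef_unscaled = [-0.92585764, -0.01822689, 0.00233577]
coef_scaled = [-1.85170244, -1.45782986, 1.19540159]

# Assumed ranges (max - min): Pclass 1..3, Age 0..80 after fillna(0), Fare 0..512.3292.
feature_range = [2.0, 80.0, 512.3292]

names = ['Pclass', 'Age', 'Fare']
for name, w, w_s, rng in zip(names, coef_unscaled, coef_scaled, feature_range):
    print(name, '| unscaled coef * range =', round(w * rng, 4),
          '| scaled coef =', round(w_s, 4))
```

The products agree closely with the coefficients of the model trained on scaled features, which suggests that both models describe nearly the same decision function and explains why the roc-auc barely moves.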

### Support Vector Machines

Let’s evaluate the effect of feature scaling on Support Vector Machines. We will first build the model using unscaled variables and then the scaled variables.

    # model built on unscaled variables

    # call the model
    SVM_model = SVC(random_state=44, probability=True, gamma='auto')

    # train the model
    SVM_model.fit(X_train, y_train)

    # evaluate performance
    print('Train set')
    pred = SVM_model.predict_proba(X_train)
    print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = SVM_model.predict_proba(X_test)
    print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    SVM roc-auc: 0.9016995292943755
    Test set
    SVM roc-auc: 0.6768154761904762

    # model built on scaled variables

    # call the model
    SVM_model = SVC(random_state=44, probability=True, gamma='auto')

    # train the model
    SVM_model.fit(X_train_scaled, y_train)

    # evaluate performance
    print('Train set')
    pred = SVM_model.predict_proba(X_train_scaled)
    print('SVM roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = SVM_model.predict_proba(X_test_scaled)
    print('SVM roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    SVM roc-auc: 0.7047081408212403
    Test set
    SVM roc-auc: 0.6988690476190476

Feature scaling improved the performance of the support vector machine. After feature scaling the model is no longer over-fitting to the training set (compare the train roc-auc of 0.90 for the model on unscaled features with 0.70 for the model on scaled features). In addition, the roc-auc for the testing set increased as well (0.68 vs 0.70).
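One plausible reading of the over-fitting on unscaled features comes from the kernel itself. `SVC` uses an RBF kernel by default, K(x, x') = exp(-gamma * ||x - x'||²), and `gamma='auto'` corresponds to 1 / n_features. The sketch below evaluates that kernel for two passengers from the data, before and after min-max scaling. The minimum and maximum values are assumed from the full-dataset statistics (with missing `Age` filled with 0), and the helper names are hypothetical.

```python
import math

gamma = 1.0 / 3.0  # gamma='auto' means 1 / n_features; there are 3 features here


def rbf_kernel(p, q, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-gamma * sq_dist)


# Two passengers described by (Pclass, Age, Fare).
passenger_a = (3, 22.0, 7.25)
passenger_b = (1, 38.0, 71.2833)

# Assumed per-feature minimum and maximum used for min-max scaling.
feature_min = (1.0, 0.0, 0.0)
feature_max = (3.0, 80.0, 512.3292)


def min_max_scale(p):
    """Rescale each feature of one passenger to the [0, 1] interval."""
    return [(v - lo) / (hi - lo) for v, lo, hi in zip(p, feature_min, feature_max)]


k_raw = rbf_kernel(passenger_a, passenger_b, gamma)
k_scaled = rbf_kernel(min_max_scale(passenger_a), min_max_scale(passenger_b), gamma)

print('kernel on raw features:   ', k_raw)
print('kernel on scaled features:', k_scaled)
```

On the raw features the squared distance is so large that the kernel value is numerically zero, so each training point looks similar only to itself and the model can effectively memorise the training set. On the scaled features the kernel takes moderate values and neighbouring points can inform each other.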

### K-Nearest Neighbours

Let’s evaluate the effect of feature scaling on K-Nearest Neighbours. We will first build the model using unscaled variables and then the scaled variables.

    # model built on unscaled features

    # call the model
    KNN = KNeighborsClassifier(n_neighbors=5)

    # train the model
    KNN.fit(X_train, y_train)

    # evaluate performance
    print('Train set')
    pred = KNN.predict_proba(X_train)
    print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = KNN.predict_proba(X_test)
    print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    KNN roc-auc: 0.8131141849360215
    Test set
    KNN roc-auc: 0.6947901111664178

    # model built on scaled features

    # call the model
    KNN = KNeighborsClassifier(n_neighbors=5)

    # train the model
    KNN.fit(X_train_scaled, y_train)

    # evaluate performance
    print('Train set')
    pred = KNN.predict_proba(X_train_scaled)
    print('KNN roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = KNN.predict_proba(X_test_scaled)
    print('KNN roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    KNN roc-auc: 0.826928785995703
    Test set
    KNN roc-auc: 0.7232453957192633

We observe for KNN as well that feature scaling improved the performance of the model. The model built on scaled features generalises better, with a roc-auc of 0.72 on the testing set vs 0.69 for the model built on unscaled features.

Both KNN models are over-fitting to the train set. Thus, we would need to change the parameters of the model or use fewer features to try to decrease the over-fitting, which is beyond the scope of this demonstration.

### Random Forests

Let’s evaluate the effect of feature scaling on Random Forests. We will first build the model using unscaled variables and then the scaled variables.

    # model built on unscaled features

    # call the model
    rf = RandomForestClassifier(n_estimators=200, random_state=39)

    # train the model
    rf.fit(X_train, y_train)

    # evaluate performance
    print('Train set')
    pred = rf.predict_proba(X_train)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = rf.predict_proba(X_test)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    Random Forests roc-auc: 0.9916108110453136
    Test set
    Random Forests roc-auc: 0.7614285714285715

    # model built on scaled features

    # call the model
    rf = RandomForestClassifier(n_estimators=200, random_state=39)

    # train the model
    rf.fit(X_train_scaled, y_train)

    # evaluate performance
    print('Train set')
    pred = rf.predict_proba(X_train_scaled)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = rf.predict_proba(X_test_scaled)
    print('Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    Random Forests roc-auc: 0.9916541940521898
    Test set
    Random Forests roc-auc: 0.7610714285714285

As expected, Random Forests shows virtually no change in performance regardless of whether it is trained on scaled or unscaled features. This model in particular is over-fitting to the training set, so we would need to do some work to reduce the over-fitting, which is beyond the scope of this demonstration.

### AdaBoost

Let’s evaluate the effect of feature scaling on AdaBoost. We will first build the model using unscaled variables and then the scaled variables.

    # train adaboost on non-scaled features

    # call the model
    ada = AdaBoostClassifier(n_estimators=200, random_state=44)

    # train the model
    ada.fit(X_train, y_train)

    # evaluate model performance
    print('Train set')
    pred = ada.predict_proba(X_train)
    print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = ada.predict_proba(X_test)
    print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    AdaBoost roc-auc: 0.8477364916162339
    Test set
    AdaBoost roc-auc: 0.7733630952380953

    # train adaboost on scaled features

    # call the model
    ada = AdaBoostClassifier(n_estimators=200, random_state=44)

    # train the model
    ada.fit(X_train_scaled, y_train)

    # evaluate model performance
    print('Train set')
    pred = ada.predict_proba(X_train_scaled)
    print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))
    print('Test set')
    pred = ada.predict_proba(X_test_scaled)
    print('AdaBoost roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

Output:

    Train set
    AdaBoost roc-auc: 0.8477364916162339
    Test set
    AdaBoost roc-auc: 0.7733630952380953

As expected, AdaBoost shows no change in performance regardless of whether it is trained on scaled or unscaled features.

Machine learning is like making a mixed fruit juice. To get the best juice, we need to mix the fruits not by their size but in the right proportions. Apples and strawberries are not comparable until we bring them to a common basis. Similarly, for many machine learning algorithms we need to scale the features so that they stand on an equal footing and no feature influences the model simply because of its large magnitude. Feature scaling is therefore one of the most important steps in pre-processing the data before building a machine learning model, and it can make the difference between a weak model and a better one.