Feature Selection and CNN
This project builds a convolutional neural network (CNN) to predict whether a particular bank customer is satisfied or not. The dataset contains 369 features (371 columns including ID and TARGET). Install TensorFlow with pip install tensorflow (or pip install tensorflow-gpu for GPU).
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
print(tf.__version__)
2.1.0
You can use this command to get the data directly from GitHub.
!git clone https://github.com/laxmimerit/Data-Files-for-Feature-Selection.git
After downloading the data, read it using read_csv(). To see the first 5 rows of the data use data.head().
data = pd.read_csv('train.csv')
data.head()
| | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.170000 | 0 |
| 1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.030000 | 0 |
| 2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.770000 | 0 |
| 3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.970000 | 0 |
| 4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
5 rows x 371 columns
The dataset has 76020 rows and 371 columns.
data.shape
(76020, 371)
Create a feature space X containing only the columns that provide information for prediction. ID and TARGET do not contribute to prediction, so they are removed using drop(). After dropping these 2 columns, the column count reduces to 369.
X = data.drop(labels=['ID', 'TARGET'], axis = 1)
X.shape
(76020, 369)
Create a variable y containing the values to predict, i.e. TARGET.
y = data['TARGET']
Split the data into training and testing sets with train_test_split(). test_size = 0.2 reserves 20% for testing and 80% for training. random_state controls the shuffling applied before the split. stratify = y means the split is done in a stratified fashion, using y as the class labels.
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0, stratify = y)
The training dataset consists of 60816 rows (80%) and the testing dataset consists of 15204 rows (20%).
X_train.shape, X_test.shape
((60816, 369), (15204, 369))
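The effect of stratify can be seen on a small example. The sketch below uses made-up labels (90% class 0, 10% class 1), not the bank data, and the names X_toy and y_toy are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 900 samples of class 0 and 100 of class 1 (not the bank data).
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, the share of class 1 is the same in both splits.
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(len(y_tr), len(y_te))
print(train_pos_rate, test_pos_rate)
```

Without stratify, the two rates would usually differ slightly from split to split, which matters more when one class is rare.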
Remove Constant, Quasi Constant and Duplicate Features
Feature selection is the process of reducing the number of input variables when developing a predictive model.
Constant features show a single value across all observations in the dataset. They provide no information that helps an ML model predict the target.
Quasi-constant features, as the name suggests, are almost constant: they take the same value for a very large share of the observations, so their variance is low. Such features are rarely useful for making predictions.
Duplicate features are features whose values repeat those of another feature in the dataset.
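The variance threshold is set to 0.01: VarianceThreshold(0.01) removes every column whose training-set variance falls below that value and keeps the rest. For a binary feature, a variance of about 0.01 corresponds to one value appearing in roughly 99% of rows. The selector is fit on the training data only, and the test data is only transformed.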
variance_filter = VarianceThreshold(0.01)  # named to avoid shadowing Python's built-in filter()
X_train = variance_filter.fit_transform(X_train)
X_test = variance_filter.transform(X_test)
X_train.shape, X_test.shape
((60816, 273), (15204, 273))
Removing the constant and quasi-constant features drops 96 columns from the dataset.
369-273
96
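As a rough illustration of how the variance filter behaves, the sketch below applies it to three made-up columns: one constant, one quasi-constant, and one informative. The data and the names X_toy and selector are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: 1000 rows, three columns (not the bank data).
rng = np.random.default_rng(0)
n = 1000
constant = np.zeros(n)                # variance 0
quasi = np.zeros(n)
quasi[:5] = 1.0                       # 0.5% ones, variance about 0.005
informative = rng.normal(size=n)      # variance close to 1

X_toy = np.column_stack([constant, quasi, informative])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_toy)

kept_mask = selector.get_support()    # True for the columns that were kept
print(X_toy.var(axis=0))
print(kept_mask)
print(X_reduced.shape)
```

Only the informative column survives, because the other two have variance below 0.01.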
To remove duplicate features, the data is transposed using .T, since pandas provides duplicated() to check for duplicate rows, not duplicate columns. After transposing, each feature becomes a row, so the shape of X_train_T is the reverse of the shape of X_train.
X_train_T = X_train.T
X_test_T = X_test.T
X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)
X_train_T.shape
(273, 60816)
.duplicated() returns a boolean Series denoting duplicate rows. 17 features are duplicated.
X_train_T.duplicated().sum()
17
In the slice below, a value of True marks a feature that duplicates an earlier one.
duplicated_features = X_train_T.duplicated()
duplicated_features[70:90]
70 False
71 False
72 True
73 False
74 True
75 False
76 False
77 False
78 False
79 False
80 False
81 False
82 False
83 False
84 False
85 False
86 False
87 False
88 False
89 False
dtype: bool
The features with False values are not duplicated and should be retained. Inverting the boolean list changes False to True and vice versa.
features_to_keep = [not index for index in duplicated_features]
features_to_keep[70:90]
[True, True, False, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]
With the values inverted, the features marked True are retained. The data is transposed again to restore the original shape. Applied to X_train:
X_train = X_train_T[features_to_keep].T
X_train.shape
(60816, 256)
Applied to X_test:
X_test = X_test_T[features_to_keep].T
X_test.shape
(15204, 256)
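The transpose-and-deduplicate steps above can be sketched on a tiny made-up DataFrame. The frame df_toy and its column names are hypothetical, and the boolean mask is inverted with ~ as an alternative to the list comprehension.

```python
import pandas as pd

# Made-up frame: column "c" repeats column "a".
df_toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "c": [1, 2, 3, 4],
})

# Transpose so each feature becomes a row, then flag repeated rows.
# The first occurrence of a repeated row is marked False, later ones True.
is_duplicate = df_toy.T.duplicated()

# Keep only the columns that are not flagged as duplicates.
deduplicated = df_toy.loc[:, ~is_duplicate.to_numpy()]

print(is_duplicate.to_dict())
print(list(deduplicated.columns))
```

Because duplicated() keeps the first occurrence by default, column "a" is retained and only "c" is dropped.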
X_train.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 |
| 1 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 85472.340000 |
| 2 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 317769.240000 |
| 3 | 2.0 | 30.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 76209.960000 |
| 4 | 2.0 | 23.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 302754.000000 |
5 rows x 256 columns
Bring the data into the same range. StandardScaler() standardizes features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train
array([[ 3.80478472e-02, -5.56029626e-01, -5.27331414e-02, ...,
-1.87046327e-02, -1.97720391e-02, 3.12133758e-03],
[ 3.80478472e-02, -7.87181903e-01, -5.27331414e-02, ...,
-1.87046327e-02, -1.97720391e-02, -1.83006062e-01],
[ 3.80478472e-02, -7.87181903e-01, -5.27331414e-02, ...,
-1.87046327e-02, -1.97720391e-02, 1.17499225e+00],
...,
[ 3.80478472e-02, 5.99731758e-01, -5.27331414e-02, ...,
-1.87046327e-02, -1.97720391e-02, -2.41865113e-01],
[ 3.80478472e-02, -1.70775831e-01, -5.27331414e-02, ...,
-1.87046327e-02, -1.97720391e-02, 3.12133758e-03],
[ 3.80478472e-02, 2.91528722e-01, 7.65192053e+00, ...,
-1.87046327e-02, -1.97720391e-02, 3.12133758e-03]])
X_train.shape, X_test.shape
((60816, 256), (15204, 256))
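What StandardScaler computes can be checked by hand on a small made-up array. This is a minimal sketch with hypothetical names (train_toy, test_toy, scaler_toy); it also shows that the test data is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_toy = np.array([[2.0, 400.0]])

scaler_toy = StandardScaler()
train_scaled = scaler_toy.fit_transform(train_toy)  # learns mean and std from train
test_scaled = scaler_toy.transform(test_toy)        # reuses the train statistics

# The same result computed manually: subtract the mean, divide by the std.
manual = (train_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0)

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, while the test row can fall outside that range because it was not used to fit the scaler.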
The data is 2-dimensional (samples, features), but Conv1D() expects 3-dimensional input of shape (samples, steps, channels), so reshape() is applied to add a channel dimension of size 1.
X_train = X_train.reshape(60816, 256,1)
X_test = X_test.reshape(15204, 256, 1)
X_train.shape, X_test.shape
((60816, 256, 1), (15204, 256, 1))
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
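The reshape can also be written without hard-coding the row and column counts. The sketch below uses a small made-up array (X_2d is a hypothetical name) and shows two equivalent ways to add the trailing channel axis.

```python
import numpy as np

X_2d = np.zeros((5, 4))  # 5 made-up rows with 4 features each

# Derive the sizes from the array itself and add a channel axis of size 1.
X_3d = X_2d.reshape(X_2d.shape[0], X_2d.shape[1], 1)

# Equivalent: append a new axis at the end.
X_3d_alt = X_2d[..., np.newaxis]

print(X_3d.shape, X_3d_alt.shape)
```

Taking the sizes from the array means the same line keeps working if the number of rows or selected features changes.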
Building the CNN
A Sequential() model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
Conv1D() is a 1D convolution layer, effective for deriving features from a fixed-length segment of the overall dataset, where the location of the feature in the segment is less important. In the first Conv1D() layer, the model learns 32 filters with a convolutional window size of 3. The input_shape argument gives the shape of a single sample, here (256, 1), and is passed to the first layer so that Keras knows what input to expect. The ReLU activation function outputs the input directly if it is positive, otherwise zero.

BatchNormalization() allows each layer of a network to learn a little more independently of other layers. It normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation, keeping the mean output close to 0 and the standard deviation close to 1.
MaxPool1D() downsamples the input representation by taking the maximum value over the window defined by pool_size, which is 2 in the first Max Pool layer.
Dropout() randomly sets the outgoing edges of hidden units to 0 at each update of the training phase. The value passed in dropout specifies the probability at which outputs of the layer are dropped out.
Flatten() converts the data into a 1-dimensional array for inputting it to the next layer.
Dense() is the regular deeply connected neural network layer. The output layer has 1 neuron because a single value is predicted. The Sigmoid function is used because it outputs values between 0 and 1, which facilitates binary prediction.
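To make the layer descriptions above concrete, here is a small NumPy sketch of a single convolution filter, ReLU, max pooling, and the sigmoid function. The signal and kernel values are made up, the bias term is left out, and this is an illustration of the arithmetic only, not how Keras implements the layers internally.

```python
import numpy as np

signal = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])  # made-up 1D input
kernel = np.array([1.0, 0.0, -1.0])                   # made-up window of size 3

# One filter slid over the input without padding: output length is 6 - 3 + 1 = 4.
conv = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 3 + 1)])

# ReLU: negative values become 0, positive values pass through.
relu = np.maximum(conv, 0.0)

# Max pooling with pool_size=2: keep the maximum of each pair
# (this reshape trick assumes an even length).
pooled = relu.reshape(-1, 2).max(axis=1)

# Sigmoid squashes any real number into the interval (0, 1).
sigmoid = 1.0 / (1.0 + np.exp(-pooled.sum()))

print(conv)
print(relu)
print(pooled)
print(sigmoid)
```

A real Conv1D layer learns many such kernels at once (32 in the first layer here) and adds a bias to each.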
model = Sequential()
model.add(Conv1D(32, 3, activation='relu', input_shape = (256,1)))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.3))
model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.5))
model.add(Conv1D(128, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 254, 32) 128
_________________________________________________________________
batch_normalization (BatchNo (None, 254, 32) 128
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 127, 32) 0
_________________________________________________________________
dropout (Dropout) (None, 127, 32) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 125, 64) 6208
_________________________________________________________________
batch_normalization_1 (Batch (None, 125, 64) 256
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 62, 64) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 62, 64) 0
_________________________________________________________________
conv1d_2 (Conv1D) (None, 60, 128) 24704
_________________________________________________________________
batch_normalization_2 (Batch (None, 60, 128) 512
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 30, 128) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 30, 128) 0
_________________________________________________________________
flatten (Flatten) (None, 3840) 0
_________________________________________________________________
dense (Dense) (None, 256) 983296
_________________________________________________________________
dropout_3 (Dropout) (None, 256) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 257
=================================================================
Total params: 1,015,489
Trainable params: 1,015,041
Non-trainable params: 448
_________________________________________________________________
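The output shapes and parameter counts in the summary can be reproduced with a little arithmetic. The sketch below assumes the defaults used in this model (no padding, stride 1, pooling that discards a trailing odd element); the helper function names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    # Without padding and with stride 1, each window position yields one output.
    return length - kernel_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    # Each filter has kernel_size * in_channels weights plus one bias.
    return (kernel_size * in_channels + 1) * filters

def dense_params(inputs, units):
    # Each unit has one weight per input plus one bias.
    return (inputs + 1) * units

length, channels = 256, 1
conv_counts, bn_counts = [], []
for filters in (32, 64, 128):
    conv_counts.append(conv1d_params(channels, filters, 3))
    bn_counts.append(4 * filters)              # gamma, beta, moving mean, moving variance
    length = conv1d_out_len(length, 3) // 2    # Conv1D followed by MaxPool1D(2)
    channels = filters

flat = length * channels
dense_counts = [dense_params(flat, 256), dense_params(256, 1)]

total = sum(conv_counts) + sum(bn_counts) + sum(dense_counts)
non_trainable = sum(bn_counts) // 2            # the moving mean and variance

print(conv_counts, bn_counts, flat, dense_counts)
print(total, total - non_trainable, non_trainable)
```

Almost all of the parameters sit in the first Dense layer, which connects the 3840 flattened values to 256 units.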
Compiling and fitting the model uses an Adam optimizer with a 0.00005 learning rate. Training runs for 10 epochs. validation_data evaluates loss and metrics at the end of each epoch without training on that data. With metrics = ['accuracy'] the model is evaluated on accuracy.
model.compile(optimizer=Adam(learning_rate=0.00005), loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), verbose=1)
Train on 60816 samples, validate on 15204 samples
Epoch 5/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1630 - accuracy: 0.9604 - val_loss: 0.1641 - val_accuracy: 0.9605
Epoch 6/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1599 - accuracy: 0.9603 - val_loss: 0.1595 - val_accuracy: 0.9605
Epoch 7/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1576 - accuracy: 0.9604 - val_loss: 0.1590 - val_accuracy: 0.9604
Epoch 8/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1556 - accuracy: 0.9604 - val_loss: 0.1610 - val_accuracy: 0.9605
Epoch 9/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1536 - accuracy: 0.9604 - val_loss: 0.1558 - val_accuracy: 0.9603
Epoch 10/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1542 - accuracy: 0.9604 - val_loss: 0.1602 - val_accuracy: 0.9599
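To make the loss and accuracy columns in the log concrete, the sketch below computes binary cross-entropy and accuracy by hand for four made-up predictions. The function name, the toy arrays, and the clipping constant eps are assumptions for illustration, not values taken from the model.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Clip probabilities away from 0 and 1 so the logarithm stays finite.
    y_prob = np.clip(y_prob, eps, 1 - eps)
    per_sample = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(per_sample))

y_true_toy = np.array([0, 0, 1, 0])              # made-up labels
y_prob_toy = np.array([0.1, 0.2, 0.7, 0.05])     # made-up sigmoid outputs

loss = binary_crossentropy(y_true_toy, y_prob_toy)

# Accuracy: threshold the probabilities at 0.5 and compare with the labels.
predicted = (y_prob_toy > 0.5).astype(int)
accuracy = float(np.mean(predicted == y_true_toy))

print(round(loss, 4), accuracy)
```

The loss rewards confident correct probabilities, so it can keep improving even when the thresholded accuracy stays the same.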
history.history is a dictionary holding the training and validation accuracy and loss recorded at the end of each epoch.
history.history
{'accuracy': [0.95417327, 0.9592706, 0.95992833, 0.96033937, 0.96037227, 0.9603065, 0.9604052, 0.960438, 0.9603887, 0.9604052], 'loss': [0.21693714527215763, 0.17656464240582592, 0.16882949567384484, 0.16588703954582057, 0.16303560407957227, 0.15994301885150822, 0.15763013028843298, 0.15563193596928912, 0.1535658989747522, 0.1542411554370529], 'val_accuracy': [0.9600763, 0.9600763, 0.96033937, 0.9604052, 0.9604709, 0.9604709, 0.9604052, 0.9604709, 0.9602736, 0.959879], 'val_loss': [0.17092196812710614, 0.1765108920851371, 0.16735200087523436, 0.1662461552617033, 0.16413307644895303, 0.1594827836499469, 0.15897791552088097, 0.16101698756464938, 0.15578439738331923, 0.16016060526129197]}
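One simple use of these values is to find the epoch with the lowest validation loss. The sketch below copies the val_loss list from the output above into a plain Python list; the variable names are hypothetical.

```python
# val_loss values copied from the history.history output above.
val_loss = [
    0.17092196812710614, 0.1765108920851371, 0.16735200087523436,
    0.1662461552617033, 0.16413307644895303, 0.1594827836499469,
    0.15897791552088097, 0.16101698756464938, 0.15578439738331923,
    0.16016060526129197,
]

# Index of the smallest value; epochs are numbered from 1.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1

print(best_epoch, val_loss[best_index])
```

With a live History object, the same lookup would read history.history['val_loss'] instead of the copied list.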
The charts below plot model accuracy and model loss: training accuracy vs validation accuracy, and training loss vs validation loss.
def plot_learningCurve(history, epoch):
    # Plot training & validation accuracy values
    epoch_range = range(1, epoch+1)
    plt.plot(epoch_range, history.history['accuracy'])
    plt.plot(epoch_range, history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
    # Plot training & validation loss values
    plt.plot(epoch_range, history.history['loss'])
    plt.plot(epoch_range, history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
plot_learningCurve(history, 10)

In the loss plot, training loss falls steadily and validation loss trends downward with small fluctuations, with no sign of divergence between the two.

The model reached 96% accuracy. Convolutional neural networks with appropriate feature selection can build an effective model for this dataset. Feature selection enables the machine learning algorithm to train faster, reduces model complexity, and can improve accuracy when the right subset is chosen.
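The tutorial does not report the class balance of TARGET, so it may be worth comparing the accuracy with a majority-class baseline before reading too much into it. The sketch below uses made-up labels and scores (96% class 0) purely to show the comparison; none of the numbers come from the bank data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels: 96 samples of class 0 and 4 of class 1.
y_true_toy = np.array([0] * 96 + [1] * 4)

# Baseline: always predict the majority class.
always_zero = np.zeros_like(y_true_toy)
baseline_accuracy = float((always_zero == y_true_toy).mean())

# Made-up model scores: positives score higher, but every score is below 0.5.
scores_toy = np.where(y_true_toy == 1, 0.4, 0.1)
model_accuracy = float(((scores_toy > 0.5).astype(int) == y_true_toy).mean())

# ROC AUC measures how well the scores rank positives above negatives.
auc = roc_auc_score(y_true_toy, scores_toy)

print(baseline_accuracy, model_accuracy, auc)
```

In this toy case the thresholded accuracy equals the baseline even though the scores rank the classes perfectly, which is why a ranking metric such as ROC AUC can be a useful companion to accuracy when one class is rare.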
Conclusion
In this tutorial you built a 1D CNN to predict bank customer satisfaction from 369 raw features. After removing constant, quasi-constant, and duplicate features, the dataset shrank to 256 columns. Trained on 60,816 samples for 10 epochs, the model achieved ~96% accuracy on the held-out test set, with training and validation curves tracking closely throughout.
Key takeaways:
- Feature selection (variance thresholding and duplicate removal) cut 369 features to 256 while the model still reached about 96% accuracy. Smaller inputs mean faster training and less overfitting risk.
- 1D CNNs can classify structured tabular data by treating each row's feature vector as a sequence with a single channel (here 256 steps by 1 channel); recurrent layers are not required for this task.
- StandardScaler is an important step before feeding tabular data to a CNN. Unnormalized large-magnitude features would dominate the convolutional filters.
Next steps:
- Apply the same 1D CNN approach to IMDB Sentiment Classification to see how convolutional filters work on text sequences.
- Try Human Activity Recognition with Accelerometer Data for another 1D sequence classification problem.
- Experiment with adding more convolutional blocks, a higher learning rate, or a learning rate schedule to try to push accuracy above 96%.