Customer Satisfaction Prediction with CNN

Feature Selection and CNN

In this blog, we will build a Convolutional Neural Network (CNN) that predicts whether a bank customer is satisfied. The dataset has 371 columns: an ID, the TARGET label, and 369 features. Install TensorFlow with pip install tensorflow. From TensorFlow 2.1 onward this package also includes GPU support, so the separate tensorflow-gpu package is not needed.

PYTHON

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
print(tf.__version__)

OUTPUT

2.1.0

We can use this command to get the data directly from GitHub. The leading ! runs it as a shell command from inside a notebook.

PLAINTEXT

!git clone https://github.com/laxmimerit/Data-Files-for-Feature-Selection.git

After downloading the data, read it using read_csv(). To see the first 5 rows of the data use data.head().

PYTHON

data = pd.read_csv('train.csv')
data.head()

OUTPUT

	ID	var3	var15	imp_op_var39_comer_ult1	imp_op_var39_comer_ult3	...	var38
0	1	2	23	0.0	0.0	...	39205.170000
1	3	2	34	0.0	0.0	...	49278.030000
2	4	2	23	0.0	0.0	...	67333.770000
3	8	2	37	195.0	195.0	...	64007.970000
4	10	2	39	0.0	0.0	...	117310.979016

5 rows x 371 columns

The dataset has 76020 rows and 371 columns.

PYTHON

data.shape

OUTPUT

(76020, 371)

Create a feature space X with only the columns that help us predict. ID is just an identifier and TARGET is the label we want to predict, so we drop both with drop(). After we drop these 2 columns, 369 remain.

PYTHON

X = data.drop(labels=['ID', 'TARGET'], axis = 1)
X.shape

OUTPUT

(76020, 369)

Create a variable y containing the values to predict, i.e. TARGET.

PYTHON

y = data['TARGET']

Split the data into training and testing sets with train_test_split(). test_size = 0.2 reserves 20% for testing and 80% for training. random_state controls the shuffling applied before the split. stratify = y keeps the class balance in both sets, using y as the class labels.

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0, stratify = y)

The training dataset consists of 60816 rows (80%) and the testing dataset consists of 15204 rows (20%).

PYTHON

X_train.shape, X_test.shape

OUTPUT

((60816, 369), (15204, 369))
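To see what stratify does, here is a small sketch on made-up labels. The toy arrays and their 90/10 class split are assumptions for illustration, not the bank data. It compares the share of the positive class in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 90 zeros and 10 ones (not the real TARGET column).
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, both splits keep the 10% share of the positive class.
print(y_tr.mean(), y_te.mean())
```

Without stratify, a rare class can end up over- or under-represented in the test set purely by chance, which makes the evaluation noisier.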

Remove Constant, Quasi Constant and Duplicate Features

Feature selection is the process of cutting down the number of input variables when we build a model.

Constant features show the same single value in every row. They give the model no help in predicting the target.
Quasi-constant features are almost constant. They have the same value for a very large share of the rows, so they have little variance. Such features are rarely useful for predictions.
Duplicate features are columns whose values match another column in every row, so they add no new information.

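We set the variance threshold to 0.01. VarianceThreshold() compares each column's variance on the training set against this absolute value (it is not a percentage) and removes every column whose variance is below it. This drops constant features (variance 0) and quasi-constant features in a single pass. We fit VarianceThreshold() on the training data only, and just transform the test data.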

PYTHON

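# Named selector rather than filter to avoid shadowing Python's built-in filter()
selector = VarianceThreshold(threshold=0.01)
X_train = selector.fit_transform(X_train)   # learn which columns to keep from the training data
X_test = selector.transform(X_test)         # apply the same column mask to the test data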

X_train.shape, X_test.shape

OUTPUT

((60816, 273), (15204, 273))

This step removes 96 constant and quasi-constant features from the dataset.

PYTHON

369-273

OUTPUT

96
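As a minimal sketch of how the threshold behaves, the example below builds a made-up matrix with one constant column, one quasi-constant column, and one column with real variation. The toy data and the name selector are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy matrix with 1000 rows and three columns.
rng = np.random.default_rng(0)
constant = np.zeros(1000)          # variance 0
quasi = np.zeros(1000)
quasi[:5] = 1.0                    # 99.5% zeros, variance about 0.005
varied = rng.normal(size=1000)     # variance about 1

X_toy = np.column_stack([constant, quasi, varied])

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X_toy)

print(X_toy.var(axis=0))        # per-column variance
print(selector.get_support())   # boolean mask of the columns that were kept
print(X_reduced.shape)
```

Only the third column survives, because the first two have a variance below 0.01.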

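To remove duplicate features, we transpose the data with .T, because pandas has a built-in way to check for duplicate rows (duplicated()) but not for duplicate columns. VarianceThreshold returns NumPy arrays, so we also wrap the transposed arrays in DataFrames. After we transpose, the shape of X_train_T is the reverse of X_train.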

PYTHON

X_train_T = X_train.T
X_test_T = X_test.T

X_train_T = pd.DataFrame(X_train_T)
X_test_T = pd.DataFrame(X_test_T)

X_train_T.shape

OUTPUT

(273, 60816)

.duplicated() returns a boolean Series denoting duplicate rows. 17 features are duplicated.

PYTHON

X_train_T.duplicated().sum()

OUTPUT

17

The listing below shows a slice of the result. Features marked True are duplicates of an earlier feature.

PYTHON

duplicated_features = X_train_T.duplicated()
duplicated_features[70:90]

OUTPUT

70    False
71    False
72     True
73    False
74     True
75    False
76    False
77    False
78    False
79    False
80    False
81    False
82    False
83    False
84    False
85    False
86    False
87    False
88    False
89    False
dtype: bool

The features with False are not duplicated, so we keep them. Inverting the boolean list swaps False and True.

PYTHON

features_to_keep = [not index for index in duplicated_features]
features_to_keep[70:90]

OUTPUT

[True, True, False, True, False, True, True, True, True, True, True, True, True, True, True, True, True, True, True, True]

With the values inverted, the features marked True are retained. The data is transposed again to restore the original shape. Applied to X_train:

PYTHON

X_train = X_train_T[features_to_keep].T
X_train.shape

OUTPUT

(60816, 256)

Applied to X_test:

PYTHON

X_test = X_test_T[features_to_keep].T
X_test.shape

OUTPUT

(15204, 256)
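The transpose-and-deduplicate steps above can be sketched on a tiny made-up DataFrame. The column names a, b, and c are hypothetical and chosen only to make the duplicate easy to spot.

```python
import pandas as pd

# Hypothetical toy frame: column "c" repeats column "a" exactly.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0, 1, 0, 1],
    "c": [1, 2, 3, 4],
})

# Transpose so each original column becomes a row, then flag repeated rows.
is_duplicate = df.T.duplicated()
print(is_duplicate.tolist())

# Keep only the first occurrence of each column.
df_unique = df.loc[:, ~is_duplicate.to_numpy()]
print(df_unique.columns.tolist())
```

By default duplicated() marks every repeat after the first occurrence, so the first copy of each duplicated feature is the one that stays.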

PYTHON

X_train.head()

OUTPUT

	0	1	...	272
0	2.0	26.0	...	117310.979016
1	2.0	23.0	...	85472.340000
2	2.0	23.0	...	317769.240000
3	2.0	30.0	...	76209.960000
4	2.0	23.0	...	302754.000000

5 rows x 256 columns

Bring the data into the same range. StandardScaler() standardizes features by removing the mean and scaling to unit variance.

PYTHON

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train

OUTPUT

array([[ 3.80478472e-02, -5.56029626e-01, -5.27331414e-02, ...,
        -1.87046327e-02, -1.97720391e-02,  3.12133758e-03],
       [ 3.80478472e-02, -7.87181903e-01, -5.27331414e-02, ...,
        -1.87046327e-02, -1.97720391e-02, -1.83006062e-01],
       [ 3.80478472e-02, -7.87181903e-01, -5.27331414e-02, ...,
        -1.87046327e-02, -1.97720391e-02,  1.17499225e+00],
       ...,
       [ 3.80478472e-02,  5.99731758e-01, -5.27331414e-02, ...,
        -1.87046327e-02, -1.97720391e-02, -2.41865113e-01],
       [ 3.80478472e-02, -1.70775831e-01, -5.27331414e-02, ...,
        -1.87046327e-02, -1.97720391e-02,  3.12133758e-03],
       [ 3.80478472e-02,  2.91528722e-01,  7.65192053e+00, ...,
        -1.87046327e-02, -1.97720391e-02,  3.12133758e-03]])
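As a quick check of what StandardScaler() computes, the sketch below reproduces it by hand on a made-up 3 x 2 array. The toy values are assumptions chosen so the two columns sit on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 6000.0]])

scaler_toy = StandardScaler()
X_scaled = scaler_toy.fit_transform(X_toy)

# Same result by hand: subtract the column mean, then divide by the
# column standard deviation (population form, ddof=0).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates because of its units.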

PYTHON

X_train.shape, X_test.shape

OUTPUT

((60816, 256), (15204, 256))

The data is 2-dimensional, with shape (samples, features). Conv1D() expects 3-dimensional input of shape (samples, steps, channels), so we use reshape() to add a channel axis of size 1.

PYTHON

X_train = X_train.reshape(60816, 256,1)
X_test = X_test.reshape(15204, 256, 1)
X_train.shape, X_test.shape

OUTPUT

((60816, 256, 1), (15204, 256, 1))
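The same reshape can be written without hard-coding the row and column counts. This is a small sketch on a made-up array, where X_toy is a hypothetical stand-in for the scaled feature matrix.

```python
import numpy as np

# Hypothetical toy batch: 4 samples with 6 features each.
X_toy = np.arange(24, dtype=float).reshape(4, 6)

# Add a trailing channel axis; -1 lets NumPy infer the feature count.
X_toy_3d = X_toy.reshape(X_toy.shape[0], -1, 1)

print(X_toy.shape, X_toy_3d.shape)
```

Reading the sizes from the array itself means the line keeps working if the feature selection step ends up keeping a different number of columns.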

PYTHON

y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

Building the CNN

A Sequential() model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

Conv1D() is a 1D convolution layer. It is effective for deriving features from fixed-length segments of the input, where the location of the feature within the segment is less important. In the first Conv1D() layer, the model learns 32 filters with a convolutional window (kernel) size of 3. input_shape gives the shape of a single sample, here (256, 1), and is passed to the first layer of the model. The ReLU activation function outputs its input directly if it is positive, and zero otherwise.

ReLU activation function graph showing zero output for negative inputs and linear output for positive values
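In NumPy terms, ReLU is a one-liner. The helper name relu below is our own, written only to illustrate the definition.

```python
import numpy as np

def relu(x):
    """Return x where x is positive and 0 elsewhere."""
    return np.maximum(x, 0.0)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))
```

The negative entries become 0, while 1.5 and 3.0 pass through unchanged.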

BatchNormalization() allows each layer of a network to learn a little more independently of other layers. It normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation, keeping the mean output close to 0 and the standard deviation close to 1.
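The normalization step can be sketched in NumPy as follows. This is a simplified training-time version on a 2D array of shape (batch, channels). The function name batch_norm_train is hypothetical, and eps = 1e-3 mirrors the Keras default. The real layer also learns a scale (gamma) and shift (beta) per channel and keeps moving averages of the mean and variance for use at inference time.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each channel over the batch, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical activations: 8 samples, 3 channels, far from zero mean.
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=(8, 3))

out = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```

With gamma = 1 and beta = 0, each channel comes out with a mean of about 0 and a standard deviation just under 1 (the small gap comes from eps).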

MaxPool1D() downsamples the input representation by taking the maximum value over the window defined by pool_size, which is 2 in the first Max Pool layer.
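A minimal NumPy sketch of 1D max pooling, assuming non-overlapping windows (stride equal to pool_size) and no padding, which are the MaxPool1D() defaults. The helper max_pool_1d and the sample signal are made up for illustration.

```python
import numpy as np

def max_pool_1d(x, pool_size=2):
    """Max over non-overlapping windows; a trailing remainder is dropped."""
    usable = (len(x) // pool_size) * pool_size
    return x[:usable].reshape(-1, pool_size).max(axis=1)

signal = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0, 7.0])
print(max_pool_1d(signal))
```

The windows are (1, 3), (2, 8), and (5, 4), and the leftover 7 is discarded. The same rounding down is why a length of 125 becomes 62 after pooling in this model.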

Dropout() randomly sets the outgoing edges of hidden units to 0 at each update of the training phase. The value passed in dropout specifies the probability at which outputs of the layer are dropped out.
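Here is a small sketch of dropout at training time. The function name dropout_train is our own. Like the Keras layer, it scales the surviving values by 1 / (1 - rate) so that the expected sum of the outputs stays the same. At inference time the layer simply passes its input through.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero each entry with probability `rate`; rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones(10_000)
out = dropout_train(x, rate=0.3, rng=rng)

print((out == 0).mean())   # share of dropped units, close to 0.3
print(out.mean())          # stays close to 1 because of the rescaling
```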

Flatten() converts the data into a 1-dimensional array for inputting it to the next layer.

Dense() is the regular deeply connected neural network layer. The output layer has 1 neuron because a single value is predicted. The Sigmoid function is used because it outputs values between 0 and 1, which facilitates binary prediction.

PYTHON

model = Sequential()
model.add(Conv1D(32, 3, activation='relu', input_shape = (256,1)))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.3))

model.add(Conv1D(64, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.5))

model.add(Conv1D(128, 3, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.5))

model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(1, activation='sigmoid'))

PYTHON

model.summary()

OUTPUT

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d (Conv1D)              (None, 254, 32)           128
_________________________________________________________________
batch_normalization (BatchNo (None, 254, 32)           128
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 127, 32)           0
_________________________________________________________________
dropout (Dropout)            (None, 127, 32)           0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 125, 64)           6208
_________________________________________________________________
batch_normalization_1 (Batch (None, 125, 64)           256
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 62, 64)            0
_________________________________________________________________
dropout_1 (Dropout)          (None, 62, 64)            0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 60, 128)           24704
_________________________________________________________________
batch_normalization_2 (Batch (None, 60, 128)           512
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 30, 128)           0
_________________________________________________________________
dropout_2 (Dropout)          (None, 30, 128)           0
_________________________________________________________________
flatten (Flatten)            (None, 3840)              0
_________________________________________________________________
dense (Dense)                (None, 256)               983296
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 257
=================================================================
Total params: 1,015,489
Trainable params: 1,015,041
Non-trainable params: 448
_________________________________________________________________
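The shapes and parameter counts in the summary can be re-derived with plain arithmetic. The sketch below assumes a kernel size of 3 with no padding, a pool size of 2, and four parameters per channel in each BatchNormalization layer (gamma, beta, moving mean, moving variance). The helper name conv1d_block is hypothetical.

```python
def conv1d_block(length, in_ch, filters, kernel=3, pool=2):
    """Return output length and parameter counts for Conv1D + BN + MaxPool1D."""
    conv_len = length - kernel + 1                 # no padding, stride 1
    conv_params = (kernel * in_ch + 1) * filters   # weights plus one bias per filter
    bn_params = 4 * filters                        # gamma, beta, moving mean, moving variance
    return conv_len // pool, conv_params, bn_params

length, channels, total = 256, 1, 0
for filters in (32, 64, 128):
    length, conv_params, bn_params = conv1d_block(length, channels, filters)
    total += conv_params + bn_params
    channels = filters

flat = length * channels        # size of the Flatten output
total += (flat + 1) * 256       # Dense(256): weights plus biases
total += (256 + 1) * 1          # Dense(1)

print(length, flat, total)
```

This prints 30, 3840, and 1015489, matching the final pooled length, the Flatten size, and the total parameter count in the summary. The 448 non-trainable parameters are the moving means and variances: 2 x (32 + 64 + 128).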

Compiling and fitting the model uses an Adam optimizer with a 0.00005 learning rate. Training runs for 10 epochs. validation_data evaluates loss and metrics at the end of each epoch without training on that data. With metrics = ['accuracy'] the model is evaluated on accuracy.

PYTHON

# learning_rate replaces the deprecated lr argument
model.compile(optimizer=Adam(learning_rate=0.00005), loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), verbose=1)

OUTPUT

Train on 60816 samples, validate on 15204 samples

Epoch 5/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1630 - accuracy: 0.9604 - val_loss: 0.1641 - val_accuracy: 0.9605
Epoch 6/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1599 - accuracy: 0.9603 - val_loss: 0.1595 - val_accuracy: 0.9605
Epoch 7/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1576 - accuracy: 0.9604 - val_loss: 0.1590 - val_accuracy: 0.9604
Epoch 8/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1556 - accuracy: 0.9604 - val_loss: 0.1610 - val_accuracy: 0.9605
Epoch 9/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1536 - accuracy: 0.9604 - val_loss: 0.1558 - val_accuracy: 0.9603
Epoch 10/10
60816/60816 [==============================] - 111s 2ms/sample - loss: 0.1542 - accuracy: 0.9604 - val_loss: 0.1602 - val_accuracy: 0.9599

history gives a summary of all the accuracies and losses calculated after each epoch.

PYTHON

history.history

OUTPUT

{'accuracy': [0.95417327, 0.9592706, 0.95992833, 0.96033937, 0.96037227, 0.9603065, 0.9604052, 0.960438, 0.9603887, 0.9604052], 'loss': [0.21693714527215763, 0.17656464240582592, 0.16882949567384484, 0.16588703954582057, 0.16303560407957227, 0.15994301885150822, 0.15763013028843298, 0.15563193596928912, 0.1535658989747522, 0.1542411554370529], 'val_accuracy': [0.9600763, 0.9600763, 0.96033937, 0.9604052, 0.9604709, 0.9604709, 0.9604052, 0.9604709, 0.9602736, 0.959879], 'val_loss': [0.17092196812710614, 0.1765108920851371, 0.16735200087523436, 0.1662461552617033, 0.16413307644895303, 0.1594827836499469, 0.15897791552088097, 0.16101698756464938, 0.15578439738331923, 0.16016060526129197]}
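Because history.history is a plain dictionary of lists, it is easy to query. As a small sketch, the snippet below copies the val_loss values printed above and finds the epoch with the lowest validation loss.

```python
import numpy as np

# val_loss values copied from the history.history output above.
val_loss = [0.17092196812710614, 0.1765108920851371, 0.16735200087523436,
            0.1662461552617033, 0.16413307644895303, 0.1594827836499469,
            0.15897791552088097, 0.16101698756464938, 0.15578439738331923,
            0.16016060526129197]

best_epoch = int(np.argmin(val_loss)) + 1   # epochs are numbered from 1
print(best_epoch, val_loss[best_epoch - 1])
```

The lowest validation loss is at epoch 9, which agrees with the val_loss of 0.1558 in the training log.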

The charts below plot model accuracy and model loss: training accuracy vs validation accuracy, and training loss vs validation loss.

PYTHON

def plot_learningCurve(history, epoch):
  # Plot training & validation accuracy values
  epoch_range = range(1, epoch+1)
  plt.plot(epoch_range, history.history['accuracy'])
  plt.plot(epoch_range, history.history['val_accuracy'])
  plt.title('Model accuracy')
  plt.ylabel('Accuracy')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Val'], loc='upper left')
  plt.show()

  # Plot training & validation loss values
  plt.plot(epoch_range, history.history['loss'])
  plt.plot(epoch_range, history.history['val_loss'])
  plt.title('Model loss')
  plt.ylabel('Loss')
  plt.xlabel('Epoch')
  plt.legend(['Train', 'Val'], loc='upper left')
  plt.show()

plot_learningCurve(history, 10)

Line chart showing training and validation accuracy converging near 96% over 10 epochs

The loss plot shows the same trend, with both curves trending downward and staying close together:

Line chart showing training and validation loss both decreasing from ~0.22 to ~0.16 over 10 epochs

The model reached about 96% accuracy on the test set. Convolutional neural networks combined with appropriate feature selection can build an effective model for this dataset. Feature selection lets the model train faster, reduces model complexity, and can improve accuracy when the right subset is chosen.
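An accuracy figure is easier to interpret next to a trivial baseline. The sketch below uses made-up labels with an assumed 95/5 class split (not the real TARGET distribution) to show how a "model" that always predicts the most frequent class would score. On the real data, the same comparison could be run with y_test in place of y_toy.

```python
import numpy as np

# Hypothetical labels with an assumed 95/5 class split (not the real TARGET).
y_toy = np.array([0] * 950 + [1] * 50)

# A "model" that always predicts the most frequent class.
majority_class = np.bincount(y_toy).argmax()
baseline_pred = np.full_like(y_toy, majority_class)

baseline_accuracy = (baseline_pred == y_toy).mean()
print(majority_class, baseline_accuracy)
```

If the baseline on y_test turns out close to the model's accuracy, a metric that focuses on the minority class, such as recall or ROC AUC, would likely say more about how well the model separates the two groups.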

Conclusion

In this blog, we built a 1D CNN to predict bank customer satisfaction from 369 raw features. We removed constant, quasi-constant, and duplicate features, which shrank the dataset to 256 columns. We trained on 60,816 samples for 10 epochs, and the model reached about 96% accuracy on the held-out test set. The training and validation curves tracked closely throughout.

Key takeaways:

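Feature selection (variance thresholding and duplicate removal) cut 369 features to 256. The removed columns were constant, nearly constant, or exact copies of other columns, so they carried little or no extra information. Smaller inputs mean faster training and less risk of overfitting.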
1D CNNs can classify structured tabular data by treating each feature as a step in a sequence. We do not need recurrent layers for this task.
StandardScaler is a must before we feed tabular data to a CNN. Without it, large-value features would dominate the filters.

Next steps:

Apply the same 1D CNN approach to IMDB Sentiment Classification to see how convolutional filters work on text sequences.
Try Human Activity Recognition with Accelerometer Data for another 1D sequence classification problem.
Experiment with adding more convolutional blocks, a higher learning rate, or a learning rate schedule to push accuracy above 96%.