Classification using CNN
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. In this project, a convolutional neural network (CNN) is built to predict whether a transaction is genuine or fraudulent.
Dataset
The Credit Card Fraud Detection dataset from Kaggle contains anonymized credit card transactions labeled as fraudulent or genuine. It can be downloaded from Kaggle.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. Transactions occurred over two days: 492 frauds out of 284,807 total transactions. The dataset is highly unbalanced, with the positive class (frauds) accounting for 0.172% of all transactions.
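To make the imbalance concrete, the short calculation below uses only the counts quoted above. It also hints at why accuracy alone can mislead on this data: a hypothetical classifier that labels every transaction as genuine (an assumption used purely for illustration) would score very high accuracy while catching no fraud at all.

```python
# Counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492
genuine_transactions = total_transactions - fraud_transactions

# Share of the positive (fraud) class.
fraud_rate = fraud_transactions / total_transactions

# Accuracy of a hypothetical "always predict genuine" baseline.
baseline_accuracy = genuine_transactions / total_transactions

print(f"genuine transactions: {genuine_transactions}")
print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-genuine accuracy: {baseline_accuracy:.4%}")
```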
It contains only numerical input variables which are the result of a PCA transformation. The only features not transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: 1 for fraud and 0 otherwise.
TensorFlow Installation
TensorFlow is used to build the model. Install it with the commands below. Use the second command if your machine has a GPU.
!pip install tensorflow
!pip install tensorflow-gpu
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv1D, MaxPool1D
from tensorflow.keras.optimizers import Adam
print(tf.__version__)
2.1.0
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Read the dataset using read_csv() into a pandas dataframe.
data = pd.read_csv('creditcard.csv')
data.head()
| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
5 rows x 31 columns
The dataset has 284807 rows and 31 columns.
data.shape
(284807, 31)
Check for null values in the data.
data.isnull().sum()
Time 0
V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
Amount 0
Class 0
dtype: int64
No null values are present, so check data.info() for column types. All values are either float or int.
data.info()
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time 284807 non-null float64
V1 284807 non-null float64
V2 284807 non-null float64
V3 284807 non-null float64
V4 284807 non-null float64
V5 284807 non-null float64
V6 284807 non-null float64
V7 284807 non-null float64
V8 284807 non-null float64
V9 284807 non-null float64
V10 284807 non-null float64
V11 284807 non-null float64
V12 284807 non-null float64
V13 284807 non-null float64
V14 284807 non-null float64
V15 284807 non-null float64
V16 284807 non-null float64
V17 284807 non-null float64
V18 284807 non-null float64
V19 284807 non-null float64
V20 284807 non-null float64
V21 284807 non-null float64
V22 284807 non-null float64
V23 284807 non-null float64
V24 284807 non-null float64
V25 284807 non-null float64
V26 284807 non-null float64
V27 284807 non-null float64
V28 284807 non-null float64
Amount 284807 non-null float64
Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
value_counts() returns a Series containing counts of unique values. This data has 2 classes, 0 and 1. The data with label 0 vastly outnumbers label 1, making it highly unbalanced.
data['Class'].value_counts()
0 284315
1 492
Name: Class, dtype: int64
Balance Dataset
non_fraud holds all genuine transactions (['Class']==0) and fraud holds all fraudulent transactions (['Class']==1). The shape attribute shows that non_fraud has 284315 rows and fraud has 492 rows.
non_fraud = data[data['Class']==0]
fraud = data[data['Class']==1]
non_fraud.shape, fraud.shape
((284315, 31), (492, 31))
To balance the data, 492 transactions are selected randomly from non_fraud.
non_fraud = non_fraud.sample(fraud.shape[0])
non_fraud.shape
(492, 31)
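Because no seed is passed to sample(), a different subset of genuine transactions is drawn on every run. The sketch below repeats the under-sampling step on a small synthetic frame (made-up values, not the Kaggle data) with random_state set so the draw is reproducible. The toy column values, the variable names, and the seed 42 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 1000 genuine rows, 10 fraud rows.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Amount": rng.uniform(1, 500, size=1010),
    "Class": [0] * 1000 + [1] * 10,
})

toy_fraud = toy[toy["Class"] == 1]
toy_non_fraud = toy[toy["Class"] == 0]

# Draw as many genuine rows as there are fraud rows; the seed fixes the draw.
toy_non_fraud_sample = toy_non_fraud.sample(n=len(toy_fraud), random_state=42)

# Combine both classes and shuffle so the labels are not ordered.
balanced = pd.concat([toy_fraud, toy_non_fraud_sample], ignore_index=True)
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["Class"].value_counts())
```

With a fixed seed, rerunning the notebook selects the same genuine transactions, which makes later accuracy numbers comparable between runs.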
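A new balanced dataset is created by concatenating fraud and non_fraud with pd.concat(). (DataFrame.append(), which older versions of this code used, was removed in pandas 2.0.) With ignore_index=True, the resulting axis is labeled 0, 1, ..., n - 1.
data = pd.concat([fraud, non_fraud], ignore_index=True)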
data.head()
| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 406.0 | -2.312227 | 1.951992 | -1.609851 | 3.997906 | -0.522188 | -1.426545 | -2.537387 | 1.391657 | -2.770089 | ... | 0.517232 | -0.035049 | -0.465211 | 0.320198 | 0.044519 | 0.177840 | 0.261145 | -0.143276 | 0.00 | 1 |
| 1 | 472.0 | -3.043541 | -3.157307 | 1.088463 | 2.288644 | 1.359805 | -1.064823 | 0.325574 | -0.067794 | -0.270953 | ... | 0.661696 | 0.435477 | 1.375966 | -0.293803 | 0.279798 | -0.145362 | -0.252773 | 0.035764 | 529.00 | 1 |
| 2 | 4462.0 | -2.303350 | 1.759247 | -0.359745 | 2.330243 | -0.821628 | -0.075788 | 0.562320 | -0.399147 | -0.238253 | ... | -0.294166 | -0.932391 | 0.172726 | -0.087330 | -0.156114 | -0.542628 | 0.039566 | -0.153029 | 239.93 | 1 |
| 3 | 6986.0 | -4.397974 | 1.358367 | -2.592844 | 2.679787 | -1.128131 | -1.706536 | -3.496197 | -0.248778 | -0.247768 | ... | 0.573574 | 0.176968 | -0.436207 | -0.053502 | 0.252405 | -0.657488 | -0.827136 | 0.849573 | 59.00 | 1 |
| 4 | 7519.0 | 1.234235 | 3.019740 | -4.304597 | 4.732795 | 3.624201 | -1.357746 | 1.713445 | -0.496358 | -1.282858 | ... | -0.379068 | -0.704181 | -0.656805 | -1.632653 | 1.488901 | 0.566797 | -0.010016 | 0.146793 | 1.00 | 1 |
5 rows x 31 columns
data['Class'].value_counts()
1 492
0 492
Name: Class, dtype: int64
Separate the feature space and class label. X holds the features and y holds the class labels.
X = data.drop('Class', axis = 1)
y = data['Class']
Split the data into training and testing sets using train_test_split(). test_size = 0.2 reserves 20% for testing. stratify = y ensures both classes are proportionally represented in each split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0, stratify = y)
There are 787 samples for training and 197 samples for testing.
X_train.shape, X_test.shape
((787, 30), (197, 30))
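As a quick check of what stratify does, the sketch below runs the same split on placeholder arrays with the same shape and class balance as the data above (all-zero features, which are an assumption standing in for the real values) and counts the labels on each side.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the balanced data: 984 rows, 30 features, 492 labels per class.
X_demo = np.zeros((984, 30))
y_demo = np.array([1] * 492 + [0] * 492)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```

Both splits stay close to a 50/50 class ratio, so the validation accuracy is not skewed by an unlucky draw.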
StandardScaler() standardizes the features by removing the mean and scaling to unit variance. The scaler is fit only on the training dataset, then applied to both training and testing data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
X_train.shape
(787, 30)
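The fit/transform split can be sketched by hand with NumPy. The example below uses random placeholder data (an assumption, not the real features) and shows that the mean and standard deviation come from the training split only and are then reused unchanged on the test split.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

# "fit": statistics are computed from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# "transform": the same statistics are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The scaled training data has mean 0 and unit variance exactly, while the test data is only approximately standardized. That is expected: fitting the scaler on the test set would leak information about it into training.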
The data is 2-dimensional (samples, features), but Conv1D expects 3-dimensional input of shape (samples, steps, channels). reshape() adds a trailing channel dimension of size 1.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)
X_train.shape, X_test.shape
((787, 30, 1), (197, 30, 1))
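The reshape only adds a trailing axis of length 1 and leaves the values untouched. A minimal sketch on a tiny placeholder array (not the real data) shows two equivalent NumPy spellings of the same operation:

```python
import numpy as np

X_demo = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features

a = X_demo.reshape(X_demo.shape[0], X_demo.shape[1], 1)  # as in the tutorial
b = X_demo[..., np.newaxis]                              # index with a new axis
c = np.expand_dims(X_demo, axis=-1)                      # explicit helper

print(a.shape)
print(np.array_equal(a, b) and np.array_equal(a, c))
```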
Build CNN
A Sequential() model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
Conv1D() is a 1D convolution layer. It is effective for deriving features from fixed-length segments of the input when the exact position of a feature within the segment matters little. The first Conv1D() layer learns 32 filters with a convolution window (kernel) of size 2. input_shape specifies the shape of a single input sample; it is passed to the first layer so that the model knows its input dimensions before training. The ReLU activation function is used: a piecewise linear function that outputs the input directly if it is positive and zero otherwise.
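To illustrate what a size-2 window does to a 30-step input, here is a minimal NumPy sketch of a stride-1, unpadded 1D convolution followed by ReLU. It is not the Keras implementation; the function name conv1d_valid_relu and the random, untrained kernels are assumptions for illustration.

```python
import numpy as np

def conv1d_valid_relu(x, kernels, bias):
    """Slide each kernel over x with stride 1 and no padding, then apply ReLU.

    x:       (steps, in_channels)
    kernels: (kernel_size, in_channels, filters)
    bias:    (filters,)
    """
    kernel_size, _, filters = kernels.shape
    out_steps = x.shape[0] - kernel_size + 1
    out = np.empty((out_steps, filters))
    for t in range(out_steps):
        window = x[t:t + kernel_size]  # (kernel_size, in_channels)
        # Sum over the window and channel axes, one value per filter.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
sample = rng.normal(size=(30, 1))      # one transaction: 30 features, 1 channel
kernels = rng.normal(size=(2, 1, 32))  # window of 2, 32 filters (untrained)
bias = np.zeros(32)

features = conv1d_valid_relu(sample, kernels, bias)
print(features.shape)  # (29, 32)
```

The output has 30 - 2 + 1 = 29 steps and one channel per filter.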
BatchNormalization() lets each layer learn somewhat more independently of the others. To stabilize training, it normalizes the output of the previous layer by subtracting the batch mean and dividing by the batch standard deviation, which keeps the mean output close to 0 and the output standard deviation close to 1.
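The normalization step can be sketched in NumPy as follows. This covers training-mode behavior on a 2D batch only; the function name batch_norm_train is hypothetical, gamma and beta stand for the learned scale and offset, and the input values are random placeholders. At inference time the layer uses moving averages of the statistics instead, which this sketch omits.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps guards against division by zero
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
activations = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features

normalized = batch_norm_train(activations, gamma=np.ones(4), beta=np.zeros(4))
print(normalized.mean(axis=0).round(3))
print(normalized.std(axis=0).round(3))
```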
Dropout() is used to randomly set the outgoing edges of hidden units to 0 at each update of the training phase. The value passed in dropout specifies the probability at which outputs of the layer are dropped out.
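A minimal NumPy sketch of training-time dropout is shown below. The function name dropout_train is hypothetical. Kept values are scaled by 1 / (1 - rate) so that the expected sum of the activations is unchanged; at inference time no units are dropped.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero each element with probability `rate`; rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones(10_000)

dropped = dropout_train(activations, rate=0.5, rng=rng)
print("fraction zeroed:", np.mean(dropped == 0))
print("mean after dropout:", dropped.mean())
```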
Flatten() is used to convert the data into a 1-dimensional array for inputting it to the next layer.
Dense() is the regular fully connected neural network layer. The output layer is also a dense layer, with 1 neuron, because this is a binary classification problem and a single value is predicted. The sigmoid function is used because its output lies between 0 and 1, which can be read as the probability of the positive class.
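The sketch below shows how a sigmoid output becomes a class label and how binary cross-entropy scores it. The raw scores and labels are made up, and the 0.5 decision threshold is an assumption; a fraud system may well prefer a lower threshold to catch more fraud.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean negative log-likelihood of the true labels under probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

raw_scores = np.array([-3.0, -0.5, 0.0, 2.0])  # made-up pre-activation outputs
y_true = np.array([0, 0, 1, 1])                # made-up labels

probs = sigmoid(raw_scores)
preds = (probs >= 0.5).astype(int)  # assumed threshold of 0.5
loss = binary_crossentropy(y_true, probs)

print(probs.round(3))
print(preds)
print(round(loss, 4))
```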
epochs = 20
model = Sequential()
model.add(Conv1D(32, 2, activation='relu', input_shape = X_train[0].shape))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(Conv1D(64, 2, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 29, 32) 96
_________________________________________________________________
batch_normalization (BatchNo (None, 29, 32) 128
_________________________________________________________________
dropout (Dropout) (None, 29, 32) 0
_________________________________________________________________
conv1d_1 (Conv1D) (None, 28, 64) 4160
_________________________________________________________________
batch_normalization_1 (Batch (None, 28, 64) 256
_________________________________________________________________
dropout_1 (Dropout) (None, 28, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1792) 0
_________________________________________________________________
dense (Dense) (None, 64) 114752
_________________________________________________________________
dropout_2 (Dropout) (None, 64) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 65
=================================================================
Total params: 119,457
Trainable params: 119,265
Non-trainable params: 192
_________________________________________________________________
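The parameter counts in the summary above can be reproduced with a little arithmetic. The helper names below are hypothetical; the formulas are the usual ones: weights plus one bias per filter or unit, and four vectors per batch-normalization layer (scale and offset are trainable, the moving mean and variance are not).

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases

def batchnorm_params(channels):
    return 4 * channels  # gamma, beta, moving mean, moving variance

def dense_params(inputs, units):
    return inputs * units + units  # weights + biases

steps_after_conv1 = 30 - 2 + 1                # 29
steps_after_conv2 = steps_after_conv1 - 2 + 1  # 28
flattened = steps_after_conv2 * 64             # 1792

counts = {
    "conv1d": conv1d_params(2, 1, 32),
    "batch_normalization": batchnorm_params(32),
    "conv1d_1": conv1d_params(2, 32, 64),
    "batch_normalization_1": batchnorm_params(64),
    "dense": dense_params(flattened, 64),
    "dense_1": dense_params(64, 1),
}

total = sum(counts.values())
non_trainable = 2 * 32 + 2 * 64  # moving mean and variance of both BN layers
trainable = total - non_trainable

for name, n in counts.items():
    print(f"{name}: {n}")
print(total, trainable, non_trainable)
```

Almost all of the parameters sit in the first dense layer, because Flatten() hands it 1,792 inputs.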
The model is compiled and fit using the Adam optimizer with a learning rate of 0.0001 for 20 epochs. An epoch is one pass over the entire training data. validation_data is used to evaluate the loss and metrics at the end of each epoch.
model.compile(optimizer=Adam(learning_rate=0.0001), loss = 'binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=epochs, validation_data=(X_test, y_test), verbose=1)
Train on 787 samples, validate on 197 samples
Epoch 15/20 787/787 [==============================] - 0s 397us/sample - loss: 0.2179 - accuracy: 0.9365 - val_loss: 0.2355 - val_accuracy: 0.8985
Epoch 16/20 787/787 [==============================] - 0s 359us/sample - loss: 0.2070 - accuracy: 0.9276 - val_loss: 0.2271 - val_accuracy: 0.8985
Epoch 17/20 787/787 [==============================] - 0s 379us/sample - loss: 0.2030 - accuracy: 0.9314 - val_loss: 0.2206 - val_accuracy: 0.8985
Epoch 18/20 787/787 [==============================] - 0s 329us/sample - loss: 0.2192 - accuracy: 0.9276 - val_loss: 0.2189 - val_accuracy: 0.9036
Epoch 19/20 787/787 [==============================] - 0s 368us/sample - loss: 0.1896 - accuracy: 0.9352 - val_loss: 0.2180 - val_accuracy: 0.8985
Epoch 20/20 787/787 [==============================] - 0s 399us/sample - loss: 0.2067 - accuracy: 0.9199 - val_loss: 0.2183 - val_accuracy: 0.8934
Visualize the results.
def plot_learningCurve(history, epoch):
    # Plot training & validation accuracy values
    epoch_range = range(1, epoch+1)
    plt.plot(epoch_range, history.history['accuracy'])
    plt.plot(epoch_range, history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
    # Plot training & validation loss values
    plt.plot(epoch_range, history.history['loss'])
    plt.plot(epoch_range, history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
plot_learningCurve(history, epochs)
Training accuracy is consistently higher than validation accuracy in the final epochs, which suggests the model is overfitting. Adding a MaxPool layer and increasing the number of epochs may improve validation accuracy.
Adding MaxPool
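Before the model itself, here is a minimal NumPy sketch of what a max-pooling layer with a pool size of 2 does to a sequence. The function name max_pool1d is hypothetical, and the sketch assumes non-overlapping windows with no padding, so a leftover step at the end is dropped.

```python
import numpy as np

def max_pool1d(x, pool_size=2):
    """x: (steps, channels). Keep the maximum of each non-overlapping window."""
    out_steps = x.shape[0] // pool_size
    trimmed = x[:out_steps * pool_size]  # drop any leftover step
    return trimmed.reshape(out_steps, pool_size, x.shape[1]).max(axis=1)

toy_sequence = np.array([[1.0], [3.0], [2.0], [5.0], [4.0]])
print(max_pool1d(toy_sequence).ravel())  # [3. 5.]

# Trace of the step dimension through the model that follows.
steps = 30
steps = steps - 2 + 1  # Conv1D, window 2      -> 29
steps = steps // 2     # MaxPool1D, pool size 2 -> 14
steps = steps - 2 + 1  # Conv1D, window 2      -> 13
steps = steps // 2     # MaxPool1D, pool size 2 -> 6
print(steps * 64)      # inputs to the first dense layer: 384
```

Under these assumptions the first dense layer receives 384 inputs instead of 1,792, so the model has far fewer weights to fit to only 787 training samples.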
epochs = 50
model = Sequential()
model.add(Conv1D(32, 2, activation='relu', input_shape = X_train[0].shape))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.2))
model.add(Conv1D(64, 2, activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool1D(2))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer=Adam(learning_rate=0.0001), loss = 'binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=epochs, validation_data=(X_test, y_test), verbose=1)
Train on 787 samples, validate on 197 samples
Epoch 45/50 787/787 [==============================] - 0s 211us/sample - loss: 0.2494 - accuracy: 0.9187 - val_loss: 0.2509 - val_accuracy: 0.9137
Epoch 46/50 787/787 [==============================] - 0s 212us/sample - loss: 0.2390 - accuracy: 0.9136 - val_loss: 0.2498 - val_accuracy: 0.9137
Epoch 47/50 787/787 [==============================] - 0s 225us/sample - loss: 0.2490 - accuracy: 0.9111 - val_loss: 0.2466 - val_accuracy: 0.9137
Epoch 48/50 787/787 [==============================] - 0s 210us/sample - loss: 0.2435 - accuracy: 0.9149 - val_loss: 0.2443 - val_accuracy: 0.9137
Epoch 49/50 787/787 [==============================] - 0s 192us/sample - loss: 0.2413 - accuracy: 0.9136 - val_loss: 0.2453 - val_accuracy: 0.9137
Epoch 50/50 787/787 [==============================] - 0s 194us/sample - loss: 0.2445 - accuracy: 0.9123 - val_loss: 0.2449 - val_accuracy: 0.9137
Visualize the results again.
plot_learningCurve(history, epochs)
The results are better after re-training with these changes, showing a tighter train/validation gap.
Conclusion
In this tutorial you built two 1D CNN variants to detect credit card fraud from a highly imbalanced dataset. After under-sampling to balance the 492 fraud and 492 genuine transactions, the baseline model without MaxPool1D reached ~90% test accuracy but showed clear overfitting; adding MaxPool1D and extending to 50 epochs pushed validation accuracy to ~91.4% with a tighter train/val gap.
Key takeaways:
- Under-sampling to balance classes is a fast starting point for imbalanced data, but it discards the vast majority of the genuine transactions. Oversampling methods such as SMOTE are a common alternative that keeps all of that data.
- MaxPool1D reduces spatial resolution between convolution blocks, acting as a regularizer that helps prevent overfitting on small, balanced datasets.
- The fraud detection task rewards high recall for class 1 (fraud) over raw accuracy. Always examine precision/recall alongside the overall accuracy metric.
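As a small illustration of the last point, the sketch below computes accuracy, precision, and recall by hand. The labels and predictions are made up for the example and are not outputs of the models above.

```python
import numpy as np

# Made-up labels and predictions: 4 fraud cases (1) and 6 genuine cases (0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # fraud caught
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # fraud missed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # genuine flagged as fraud
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # genuine passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.70 precision=0.67 recall=0.50
```

Here accuracy is 70% even though half of the fraud cases are missed, which is exactly the kind of gap that recall exposes.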
Next steps:
- Apply the same 1D CNN approach to Bank Customer Satisfaction, which has a larger dataset and a similar tabular-to-CNN pipeline.
- Compare this CNN approach against the ANN baseline in Building Your First ANN with TensorFlow 2.0.
- Experiment with SMOTE oversampling instead of under-sampling to retain all 284,315 genuine transactions in training.