Breast Cancer Detection Using CNN in Python
Breast cancer is the most commonly occurring cancer in women and the second most common cancer overall. There were over 2 million new cases in 2018, making it a significant health problem today.

The key challenge in breast cancer detection is to classify tumors as malignant or benign. A malignant tumor is made up of cancer cells that can invade and destroy nearby tissue and spread to other parts of the body. A benign tumor, unlike a cancerous (malignant) one, does not spread to other parts of the body and is far less dangerous. Deep neural network techniques can be used to improve the accuracy of early diagnosis significantly.
Deep Learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks.
A Convolutional Neural Network (ConvNet/CNN) is a Deep Learning algorithm which can take in an input image, assign importance (learnable weights and biases) to various aspects/objects in the image and be able to differentiate one from the other. The pre-processing required in a ConvNet is much lower as compared to other classification algorithms.
What is Dropout
Dropout is a technique where randomly selected neurons are ignored during training: they are "dropped out" at random. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and any weight updates are not applied to those neurons on the backward pass.
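Here is a minimal NumPy sketch of the idea (this "inverted dropout" scaling by 1/(1 - rate) is also what the Keras Dropout layer applies at training time):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dropout with rate 0.5 on one layer's activations: each neuron is kept
# with probability 1 - rate, and survivors are scaled by 1 / (1 - rate)
# so the expected activation stays the same.
activations = np.array([0.4, 1.2, 0.7, 2.1, 0.9])
rate = 0.5
mask = rng.random(activations.shape) >= rate   # True = neuron is kept
print(activations * mask / (1.0 - rate))       # dropped neurons become 0
```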
What is Batch Normalization
- It is a technique designed to automatically standardize the inputs to a layer in a deep learning neural network. For example, if we have four features with very different units, after batch normalization they end up on a similar scale (see the sketch after this list).
- By normalizing the outputs of neurons, the activation function only receives inputs close to zero.
- Batch normalization also helps keep the gradient from vanishing.
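A minimal NumPy sketch of what batch normalization computes at training time, standardizing each feature over the current mini-batch (the learnable scale/shift parameters gamma and beta, and the moving averages used at inference time, are omitted for brevity):

```python
import numpy as np

# Two features on very different scales within one mini-batch.
batch = np.array([[150.0, 0.1],
                  [120.0, 0.3],
                  [180.0, 0.2]])
mean = batch.mean(axis=0)
var = batch.var(axis=0)
eps = 1e-3                                   # small constant for stability
print((batch - mean) / np.sqrt(var + eps))   # both features centered near zero
```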
We are going to use TensorFlow 2.3 to build the model. You can install TensorFlow by running this command:

```python
!pip install tensorflow-gpu==2.3.0-rc0
```
Importing the necessary libraries that we will use for model building:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout, BatchNormalization
from tensorflow.keras.layers import Conv1D, MaxPool1D
from tensorflow.keras.optimizers import Adam

print(tf.__version__)
```

```
2.3.0
```
We will use the following libraries:

- pandas for loading and manipulating the data.
- NumPy for working with arrays; it also has functions for linear algebra, Fourier transforms, and matrices.
- pyplot from matplotlib for visualizing the results.
- Seaborn, a Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```
```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
```
load_breast_cancer loads and returns the breast cancer Wisconsin dataset, a classic and very easy binary classification dataset.

```python
cancer = datasets.load_breast_cancer()
```
We can view the full description of the dataset, including all attribute details, with cancer.DESCR.

```python
print(cancer.DESCR)
```
```
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features. For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign
```
We will be using a pandas DataFrame to present the data. Creating a DataFrame with the feature data and named columns lets us store and inspect all the inputs in one place; the target labels are kept separately in y below.

```python
X = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
X.head()
```
|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | radius error | texture error | perimeter error | area error | smoothness error | compactness error | concavity error | concave points error | symmetry error | fractal dimension error | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
```python
y = cancer.target
y
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, ...])
```

```python
cancer.target_names
```

```
array(['malignant', 'benign'], dtype='<U9')
```

```python
X.shape
```

```
(569, 30)
```
Splitting the dataset by hand is impractical, and the split needs to be random. To help with this task, we will use the scikit-learn function train_test_split. We will use 80% of our dataset for training and 20% for testing, stratified so both splits keep the same class proportions.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
```
```python
X_train.shape
```

```
(455, 30)
```

```python
X_test.shape
```

```
(114, 30)
```
StandardScaler removes the mean and scales each feature to unit variance. Note that we fit the scaler on the training set only and then apply the same transformation to the test set, so no information from the test set leaks into training.

```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
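As a quick optional check (this snippet is not part of the original pipeline), we can verify that every scaled training feature now has mean roughly 0 and standard deviation roughly 1:

```python
# After StandardScaler, each training feature should be standardized.
print(X_train.mean(axis=0).round(6))   # all values ~0
print(X_train.std(axis=0).round(6))    # all values ~1
```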
Conv1D layers expect three-dimensional input of shape (samples, steps, channels), so we reshape the scaled feature matrices to add a single channel dimension:

```python
X_train = X_train.reshape(455, 30, 1)
X_test = X_test.reshape(114, 30, 1)
```
Now, let's go ahead and build our CNN model.

A Sequential() model is the easiest way to build a model in Keras. It allows you to build a model layer by layer, where each layer has weights that correspond to the layer that follows it. We use the add() function to add layers to our model.
Conv1D() is a 1D convolution layer. This layer is very effective for deriving features from a fixed-length segment of the overall dataset, where the exact location of a feature within the segment is not so important. In the first Conv1D() layer, we learn a total of 32 filters with a convolutional window of size 2. The input_shape parameter specifies the shape of the input; it is required for the first layer of any neural network. We will be using the ReLU activation function: the rectified linear activation function, or ReLU for short, is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise.
The Rectified Linear Unit (ReLU) is the most commonly used activation function in deep learning models. The function returns 0 if it receives any negative input, but for any positive value x it returns that value back, so it can be written as f(x) = max(0, x).
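A one-line sketch of ReLU with NumPy, just to make the definition concrete:

```python
import numpy as np

# ReLU: f(x) = max(0, x), applied element-wise.
x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(np.maximum(0, x))   # [0.  0.  0.  1.5 3. ]
```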
Each convolution shrinks the output a little; to control this shrinkage, we use a concept called padding. It has two types (compared in the sketch after this list):

- valid: no padding is applied, so the output is shorter than the input.
- same: the input is padded with zeros so the output has the same length as the input.
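Here is a small illustrative sketch (not part of the model code below) showing how the two padding modes affect the output length of a Conv1D layer:

```python
import tensorflow as tf

# With kernel_size=3, 'valid' shrinks 30 steps to 28,
# while 'same' pads the input so all 30 steps are kept.
x = tf.random.normal((1, 30, 1))   # (batch, steps, channels)
valid = tf.keras.layers.Conv1D(8, 3, padding='valid')(x)
same = tf.keras.layers.Conv1D(8, 3, padding='same')(x)
print(valid.shape)   # (1, 28, 8)
print(same.shape)    # (1, 30, 8)
```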
Flattening converts the data into a one-dimensional array for input to the next layer. We flatten the output of the convolutional layers to create a single long feature vector.
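As a small sketch of what Flatten does (using NumPy shapes only), the (28 steps x 64 filters) output of the second convolutional block unrolls into a single vector of 1,792 values, matching the Flatten row in the model summary further below:

```python
import numpy as np

feature_maps = np.zeros((1, 28, 64))   # (batch, steps, filters)
flat = feature_maps.reshape(1, -1)     # unroll into one long vector
print(flat.shape)                      # (1, 1792)
```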
The sigmoid function takes a value as input and outputs another value between 0 and 1. It is non-linear and easy to work with when constructing a neural network model. The good part about this function is that it is continuously differentiable over different values of z and has a fixed output range.
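A minimal NumPy sketch of the sigmoid, f(z) = 1 / (1 + e^(-z)):

```python
import numpy as np

# Sigmoid squashes any input into the (0, 1) range,
# so the output can be read as a probability.
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(1 / (1 + np.exp(-z)))   # approx. [0.018 0.269 0.5 0.731 0.982]
```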
```python
epochs = 50

model = Sequential()

# First convolutional block: 32 filters with a window of size 2
model.add(Conv1D(filters=32, kernel_size=2, activation='relu', input_shape=(30, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))

# Second convolutional block: 64 filters
model.add(Conv1D(filters=64, kernel_size=2, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))

# Flatten and classify
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))   # sigmoid output for binary classification
```
```python
model.summary()
```

```
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv1d (Conv1D)              (None, 29, 32)            96
_________________________________________________________________
batch_normalization (BatchNo (None, 29, 32)            128
_________________________________________________________________
dropout (Dropout)            (None, 29, 32)            0
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 28, 64)            4160
_________________________________________________________________
batch_normalization_1 (Batch (None, 28, 64)            256
_________________________________________________________________
dropout_1 (Dropout)          (None, 28, 64)            0
_________________________________________________________________
flatten (Flatten)            (None, 1792)              0
_________________________________________________________________
dense (Dense)                (None, 64)                114752
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65
=================================================================
Total params: 119,457
Trainable params: 119,265
Non-trainable params: 192
_________________________________________________________________
```
compile() defines the loss function, the optimizer, and the metrics. That's all. It has nothing to do with the weights, and you can compile a model as many times as you want without causing any problem to pretrained weights.

```python
model.compile(optimizer=Adam(learning_rate=0.00005), loss='binary_crossentropy', metrics=['accuracy'])
```
fit() trains the model for a fixed number of epochs (iterations over the dataset).

```python
history = model.fit(X_train, y_train, epochs=epochs, validation_data=(X_test, y_test), verbose=1)
```

```
...
Epoch 46/50
15/15 [==============================] - 0s 6ms/step - loss: 0.1054 - accuracy: 0.9560 - val_loss: 0.1064 - val_accuracy: 0.9649
Epoch 47/50
15/15 [==============================] - 0s 6ms/step - loss: 0.1373 - accuracy: 0.9473 - val_loss: 0.1074 - val_accuracy: 0.9649
Epoch 48/50
15/15 [==============================] - 0s 7ms/step - loss: 0.1078 - accuracy: 0.9538 - val_loss: 0.1068 - val_accuracy: 0.9649
Epoch 49/50
15/15 [==============================] - 0s 6ms/step - loss: 0.0896 - accuracy: 0.9648 - val_loss: 0.1060 - val_accuracy: 0.9649
Epoch 50/50
15/15 [==============================] - 0s 6ms/step - loss: 0.0927 - accuracy: 0.9648 - val_loss: 0.1047 - val_accuracy: 0.9649
```
```python
def plot_learningCurve(history, epoch):
    # Plot training & validation accuracy values
    epoch_range = range(1, epoch + 1)
    plt.plot(epoch_range, history.history['accuracy'])
    plt.plot(epoch_range, history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()

    # Plot training & validation loss values
    plt.plot(epoch_range, history.history['loss'])
    plt.plot(epoch_range, history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Val'], loc='upper left')
    plt.show()
```
fit() returns a history object that contains all the information collected during training.

```python
history.history
```

```
{'accuracy': [0.6197802424430847, 0.7494505643844604, 0.795604407787323,
  0.8461538553237915, 0.8395604491233826, 0.8593406677246094,
  0.8901098966598511, 0.8791208863258362, 0.8813186883926392,
  0.9098901152610779, 0.903296709060669, 0.9230769276618958, ...]}
```
```python
plot_learningCurve(history, epochs)
```
In the model accuracy graph, the validation accuracy stays at or above the training accuracy throughout, which suggests our model is not overfitting. In the model loss graph, the validation loss is likewise lower than the training loss; as long as the validation loss does not climb above the training loss, we can keep training the model.
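As a final check, we can also query the trained model directly. This is a minimal usage sketch, assuming the model, the test split, and the cancer object from the steps above are still in scope:

```python
# Confirm the final performance on the held-out test set.
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test loss: {loss:.4f}, test accuracy: {acc:.4f}")

# Classify a single sample: sigmoid output > 0.5 means class 1 ('benign').
probs = model.predict(X_test[:1])
label = int(probs[0, 0] > 0.5)
print(cancer.target_names[label])
```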
We have successfully created a program to detect breast cancer using a deep neural network, and we are able to classify tumors effectively with our CNN technique.