Support Vector Machines are one of the most mathematically elegant algorithms in machine learning — they find the single decision boundary that is as far as possible from every class. That "maximum margin" property makes SVMs unusually robust to new data and highly effective in high-dimensional spaces such as text classification, image recognition, and bioinformatics.
In this tutorial you will build SVM classifiers on the Wisconsin Breast Cancer dataset, a 569-sample, 30-feature dataset where the goal is to classify tumours as malignant or benign. You will train three kernel variants — linear, polynomial, and sigmoid — compare their accuracy, and visualise each result with a confusion matrix heatmap.
Prerequisites: Python 3.x, scikit-learn, NumPy, Pandas, Matplotlib, Seaborn.
What is a Support Vector Machine?
A Support Vector Machine (SVM) is a supervised binary classification algorithm. Given labelled training points in an n-dimensional space, SVM finds an (n-1)-dimensional hyperplane, a flat decision boundary, that best separates the two classes. For two-dimensional data this hyperplane is a straight line; for three dimensions it is a plane; in higher dimensions it generalises to a hyperplane.
The key insight is that many valid separating hyperplanes exist for any linearly separable dataset. SVM's answer to this ambiguity is to choose the one hyperplane that maximises the margin — the gap between the boundary and the closest training points on either side.
Support Vectors, Hyperplane, and Margin
Three concepts are central to understanding how SVM operates:
- Support Vectors are the training points that lie closest to the decision boundary. Only these points determine where the boundary is placed; all other training examples have no influence on it.
- Hyperplane is the decision surface that separates samples of different classes. SVM positions it to be equidistant from the nearest points of each class.
- Margin is the perpendicular distance between the hyperplane and the nearest support vectors on each side. A larger margin generally means better generalisation — SVM maximises this quantity during training.
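The three concepts above can be inspected on a fitted model. The following is a minimal sketch on a hypothetical six-point toy dataset (not part of the tutorial data); the very large C value is an assumption used to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters in 2-D.
X_toy = np.array([[0, 0], [1, 0], [0, 1],
                  [3, 3], [4, 3], [3, 4]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM (no violations tolerated).
toy_clf = SVC(kernel="linear", C=1e6)
toy_clf.fit(X_toy, y_toy)

w = toy_clf.coef_[0]        # normal vector of the separating hyperplane
b = toy_clf.intercept_[0]   # offset of the hyperplane

# Distance between the two margin lines w.x + b = -1 and w.x + b = +1.
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", toy_clf.support_vectors_)
print("w =", w, " b =", b)
print("margin width =", margin_width)
```

Only a subset of the six points is reported in support_vectors_; moving any of the remaining points (without crossing the margin) would leave w and b unchanged.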
How SVM Finds the Best Boundary
Rather than testing candidate hyperplanes one by one, SVM solves a convex optimisation problem whose solution is the hyperplane with the greatest distance to the nearest data points on both sides. A boundary that is far from all training examples is less likely to misclassify slightly shifted test points, which is why maximum-margin classifiers tend to generalise well. Among all hyperplanes that separate the classes without error, SVM picks the unique one that maximises this safety margin.
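For reference, the maximum-margin idea is usually written as the following optimisation problem. This is the standard textbook hard-margin formulation, stated here under the assumption that the labels are encoded as y_i in {-1, +1} and that the m training points x_i are linearly separable; the symbols w (normal vector) and b (offset) are notation introduced for this sketch.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, m
```

The width of the margin is 2 / ||w||, so minimising ||w|| is the same as maximising the margin. The points for which the constraint holds with equality are the support vectors.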
Linear and Non-Linear Separation
Not all datasets can be separated by a straight line or flat plane.
SVM handles two situations:
- Linear separation — the classes can be divided by a flat hyperplane directly in the original feature space. A linear kernel is used here and is computationally the cheapest option.
- Non-linear separation — the classes overlap or curve in the original space and no flat boundary can separate them cleanly.
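The difference between the two situations can be seen on synthetic data. The sketch below uses scikit-learn's make_circles to generate two concentric rings, a hypothetical dataset chosen only because no straight line can separate it, and compares a linear kernel with an RBF kernel on training accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line.
X_circ, y_circ = make_circles(n_samples=200, noise=0.05,
                              factor=0.3, random_state=0)

# Fit one model per kernel and score it on the same data it was fitted on.
# (Training accuracy is enough here; the point is separability, not generalisation.)
linear_acc = SVC(kernel="linear").fit(X_circ, y_circ).score(X_circ, y_circ)
rbf_acc = SVC(kernel="rbf").fit(X_circ, y_circ).score(X_circ, y_circ)

print("linear kernel training accuracy:", linear_acc)
print("RBF kernel training accuracy:   ", rbf_acc)
```

The linear kernel should score far lower than the RBF kernel on this data, because the flat boundary cannot wrap around the inner ring.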
The Kernel Trick
When data is not linearly separable, SVM uses a kernel function to implicitly map the input into a higher-dimensional feature space where a linear boundary does exist. The "trick" is that this mapping is never computed explicitly — the kernel evaluates the inner product in the high-dimensional space directly, keeping computation tractable even in infinite dimensions.
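A small worked example may make the trick concrete. The sketch below assumes a degree-2 polynomial kernel without a constant term, k(x, z) = (x . z)^2, on 2-D inputs; the helper names phi and poly2_kernel are illustrative and not part of scikit-learn. It checks that the kernel value equals the inner product of the explicitly mapped vectors.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(u, v):
    """Kernel that yields the same inner product without building phi."""
    return np.dot(u, v) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # map to 3-D first, then take the dot product
implicit = poly2_kernel(x, z)       # stay in 2-D and apply the kernel

print("explicit mapping:", explicit)
print("kernel shortcut: ", implicit)
print("equal:", np.isclose(explicit, implicit))
```

Here the explicit map is only three-dimensional, so the saving is small. For the RBF kernel the implicit space is infinite-dimensional and the explicit route is not available at all.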
Four kernels are commonly used in practice:
- Linear kernel: the standard dot product between two feature vectors. It is the fastest kernel and works well when the number of features is large relative to the number of samples, as in text classification.
- Polynomial kernel: a generalisation of the linear kernel that captures curved decision boundaries. The degree parameter controls how complex the boundary can be; a higher degree increases flexibility but also the risk of overfitting.
- Radial Basis Function (RBF) kernel: scores similarity as a decaying exponential of the squared Euclidean distance between two samples, which corresponds to an implicit infinite-dimensional feature space. RBF is the default kernel in scikit-learn's SVC and performs well across a wide range of problems.
- Sigmoid kernel: originates from neural network theory. It behaves similarly to a two-layer perceptron activation and was historically popular, but is less commonly recommended today.
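The kernels listed above can be written out as formulas and checked against scikit-learn's pairwise kernel helpers. This is a sketch on two tiny hypothetical arrays; the gamma, coef0, and degree values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel,
)

A = np.array([[1.0, 2.0], [0.0, -1.0]])   # two sample vectors
B = np.array([[2.0, 0.5]])                # one comparison vector
gamma, coef0, degree = 0.5, 1.0, 3        # illustrative values only

dot = A @ B.T                                                  # pairwise dot products
sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)   # squared distances

# Hand-written formulas for each kernel.
manual = {
    "linear": dot,
    "poly": (gamma * dot + coef0) ** degree,
    "rbf": np.exp(-gamma * sq_dist),
    "sigmoid": np.tanh(gamma * dot + coef0),
}

# The same kernels computed by scikit-learn.
library = {
    "linear": linear_kernel(A, B),
    "poly": polynomial_kernel(A, B, degree=degree, gamma=gamma, coef0=coef0),
    "rbf": rbf_kernel(A, B, gamma=gamma),
    "sigmoid": sigmoid_kernel(A, B, gamma=gamma, coef0=coef0),
}

for name in manual:
    print(name, np.allclose(manual[name], library[name]))
```

The same gamma, coef0, and degree names appear as parameters of SVC, which is why they matter when tuning the polynomial and sigmoid kernels.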
Building the SVM Model
The following sections load the dataset, scale the features, split into train/test sets, and train three SVM variants.
Imports
Group all library imports at the top so every dependency is visible at a glance:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
Loading the Breast Cancer Dataset
Scikit-learn ships with the Wisconsin Breast Cancer dataset ready to use. Load it and inspect its keys:
cancer = datasets.load_breast_cancer()
cancer.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(cancer.DESCR)
The printed description gives a full summary of the dataset's 569 samples, 30 features, and class distribution:
.. _breast_cancer_dataset:
Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------
**Data Set Characteristics:**
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
worst/largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 0 is Mean Radius, field
10 is Radius SE, field 20 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
Confirm the two target class names and inspect the first five feature names:
cancer.target_names
array(['malignant', 'benign'], dtype='<U9')
cancer.feature_names[: 5]
array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
'mean smoothness'], dtype='<U23')
Check the total number of features:
cancer.feature_names.shape
(30,)
Extract the feature matrix X and label vector y, then verify their shapes:
X = cancer.data
y = cancer.target
X.shape, y.shape
((569, 30), (569,))
Print the first two rows of X to see the raw numeric values:
X[: 2]
array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],
[2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,
8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,
3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,
1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,
1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02]])
Check the first ten labels — 0 is malignant, 1 is benign:
y[: 10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
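Raw arrays are hard to read, so it can help to wrap the data in a Pandas DataFrame. The sketch below is one possible way to do this; the column name diagnosis is a hypothetical label introduced here, not something the dataset defines.

```python
import pandas as pd
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# One column per feature, named after cancer.feature_names.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# Translate the numeric labels (0/1) into their class names.
df["diagnosis"] = cancer.target_names[cancer.target]

# Count how many samples fall into each class.
class_counts = df["diagnosis"].value_counts()
print(df.iloc[:5, :4])
print(class_counts)
```

The counts should agree with the class distribution reported in the dataset description (212 malignant, 357 benign).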
Feature Standardisation
SVM is sensitive to the scale of input features because it measures distances between data points. StandardScaler — which transforms each feature to have zero mean and unit variance — is therefore essential before training. Without scaling, features with larger numeric ranges (such as area, which can exceed 2000) would dominate the margin calculation over features with values near zero.
Apply StandardScaler to the full feature matrix:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.shape
(569, 30)
The scaled matrix has the same shape as X, but every feature now has mean 0 and standard deviation 1.
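The claim that every scaled feature has mean 0 and standard deviation 1 can be checked directly. The sketch below also reproduces the transformation by hand, assuming the population standard deviation (NumPy's default), which is what StandardScaler uses.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data

# Library version.
X_scaled = StandardScaler().fit_transform(X)

# Manual version: subtract each column's mean, divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("matches manual formula:", np.allclose(X_scaled, X_manual))
print("largest |column mean|: ", np.abs(X_scaled.mean(axis=0)).max())
print("column std range:      ", X_scaled.std(axis=0).min(), X_scaled.std(axis=0).max())
```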
Splitting the Data and Training Kernels
Split the scaled data into 80 % training and 20 % test sets, using stratification to preserve the class ratio in both splits:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 1, stratify = y)
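One caveat: fitting the scaler on the full matrix before splitting lets the test rows influence the scaling statistics. A common alternative, sketched below, is to split first and wrap the scaler and classifier in a scikit-learn pipeline so the scaler is fitted on the training rows only. The variable names X_tr, X_te, y_tr, and y_te are chosen here to avoid clashing with the tutorial's own variables.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)

# Split the raw data first, with the same settings as the tutorial.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# The pipeline fits StandardScaler on X_tr only, then applies it to X_te.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_tr, y_tr)

pipe_acc = pipe.score(X_te, y_te)
print("test accuracy with train-only scaling:", pipe_acc)
```

On a dataset of this size the difference from scaling everything up front is usually small, but the pipeline form also plugs directly into cross-validation tools.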
With the data prepared, you can train each kernel variant in turn.
Linear Kernel SVM
A linear kernel is the best starting point when you have many features or expect the classes to be roughly separable without a curved boundary. It is also the fastest kernel to train. Import svm from scikit-learn and fit the classifier:
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))
print('Confusion Matrix')
mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
xticklabels=cancer.target_names,
yticklabels=cancer.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Accuracy: 0.9649122807017544
Precision: 0.9594594594594594
Recall: 0.9861111111111112
Confusion Matrix
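The linear kernel achieves 96.5 % accuracy. Because scikit-learn treats label 1 (benign) as the positive class by default, the recall of 0.986 means that 71 of the 72 benign tumours in the test set are identified correctly. The clinically riskier error, a malignant tumour predicted as benign, shows up in the precision instead: 0.959 corresponds to 3 of the 42 malignant test samples being labelled benign.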
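By default, precision_score and recall_score report on label 1, which is benign in this dataset. To look at the malignant class directly, you can pass pos_label=0 or print a per-class report. The sketch below repeats the tutorial's preprocessing so that it runs on its own; exact numbers may vary slightly between scikit-learn versions.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, cancer.target, test_size=0.2, random_state=1, stratify=cancer.target
)

clf = svm.SVC(kernel="linear").fit(X_train, y_train)
y_predict = clf.predict(X_test)

# Rows are true labels, columns are predicted labels (0 = malignant, 1 = benign).
mat = metrics.confusion_matrix(y_test, y_predict)

# Recall for the malignant class: share of malignant tumours that were caught.
malignant_recall = metrics.recall_score(y_test, y_predict, pos_label=0)

print(mat)
print("malignant recall:", malignant_recall)
print(metrics.classification_report(y_test, y_predict,
                                    target_names=cancer.target_names))
```

The malignant recall equals the top-left cell of the confusion matrix divided by the sum of its first row.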
Counting Class Distribution in the Test Set
You can use np.unique() with return_counts=True to count how many samples of each class appear in the test set. This is useful for sanity-checking whether stratification worked correctly:
element, count = np.unique(y_test, return_counts=True)
element, count
(array([0, 1]), array([42, 72], dtype=int64))
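The test set contains 42 malignant and 72 benign samples, close to the original 212:357 class ratio (about 37 % malignant in both cases). Now re-run the linear kernel on the raw (unscaled) data to observe the effect of skipping standardisation. Note that this split reuses the names X_train and X_test, so the scaled split is overwritten from here on: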
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
Accuracy: 0.9649122807017544
The accuracy happens to be the same on this dataset, but you should always scale before training SVMs to avoid scale-dominated margins in general.
Polynomial Kernel SVM
A polynomial kernel introduces a degree parameter that controls the complexity of the decision boundary. It is a non-stationary kernel, meaning its value depends on the inputs themselves rather than only on the distance between them, and it works best when all features are normalised to a similar range. Here you train with degree=5 and a high gamma value:
clf = svm.SVC(kernel='poly', degree = 5, gamma = 100)
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))
print('Confusion Matrix')
mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
xticklabels=cancer.target_names,
yticklabels=cancer.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Accuracy: 0.631578947368421
Precision: 0.631578947368421
Recall: 1.0
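The polynomial kernel with these hyper-parameters achieves only 63.2 % accuracy. A recall of 1.0 with a precision of 0.63 means the model predicts benign for every test sample: 72 of the 114 test samples are benign, and 72/114 = 0.632. The accuracy therefore equals the majority-class share, and the model has not learned a meaningful boundary. Two things likely contribute: the extreme degree and gamma values, and the fact that X_train was reassigned to the unscaled split in the previous section, so as written this model is fitted on raw features. Tuning degree and gamma with cross-validation, on standardised features, would be necessary to make the polynomial kernel competitive on this dataset.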
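A cross-validated search over the polynomial kernel's hyper-parameters might look like the sketch below. The grid values are arbitrary starting points rather than recommendations, and coef0 is an additional SVC parameter (the constant term of the polynomial) that the tutorial does not tune; it is included here as an assumption because it often matters for this kernel.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="poly"))])

# Illustrative grid; widen or refine it for real use.
param_grid = {
    "svc__degree": [2, 3],
    "svc__gamma": ["scale", 0.01],
    "svc__coef0": [0.0, 1.0],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("best parameters:      ", search.best_params_)
print("best CV accuracy:     ", search.best_score_)
print("held-out test accuracy:", search.score(X_te, y_te))
```

Because the test set is only touched once at the end, the final score is not biased by the search itself.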
Sigmoid Kernel SVM
The sigmoid kernel is inspired by the activation function used in neural networks. It was historically popular for SVMs but has largely been superseded by RBF in practice. To use it, set kernel='sigmoid' in SVC:
clf = svm.SVC(kernel='sigmoid', gamma = 200, C = 10000)
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))
print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))
print('Confusion Matrix')
mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
xticklabels=cancer.target_names,
yticklabels=cancer.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
Accuracy: 0.631578947368421
Precision: 0.631578947368421
Recall: 1.0
The sigmoid kernel produces the same accuracy and recall as the polynomial kernel under these settings: both predict the majority class for every test sample and fail to separate malignant from benign. The linear kernel remains the strongest performer on this dataset without additional tuning.
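To separate the effect of the kernel from the effect of extreme hyper-parameters, it can be useful to compare all kernels under scikit-learn's default settings. The sketch below does this with 5-fold cross-validation on standardised features; it is a rough baseline under those assumptions, not a tuned comparison.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Default C, gamma, and degree for every kernel; scaling is refitted per fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()

for kernel, score in scores.items():
    print(f"{kernel:>8}: mean CV accuracy = {score:.3f}")
```

If the polynomial and sigmoid kernels score well above 63 % here, that points to the hyper-parameter values and the unscaled inputs, rather than the kernels themselves, as the cause of the poor results above.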
Conclusion
In this tutorial you trained Support Vector Machine classifiers on the Wisconsin Breast Cancer dataset using three kernel functions. The linear kernel delivered 96.5 % accuracy with high recall (0.986), making it the most effective choice here. Both the polynomial and sigmoid kernels achieved only 63.2 % accuracy under the chosen hyper-parameters, demonstrating that kernel choice and tuning matter significantly — the same algorithm can produce wildly different results depending on its configuration.
Key takeaways:
- SVM finds the decision boundary that maximises the margin between classes, making it more robust than algorithms that simply minimise training error.
- Feature standardisation is essential before training SVMs because the algorithm measures distances between points — unscaled features distort those distances.
- The linear kernel is the right starting point for high-dimensional datasets; switch to RBF when you suspect non-linear class boundaries.
- Polynomial and sigmoid kernels require careful tuning of degree, gamma, and C to be competitive; cross-validation is necessary to find good values.
- A high recall with low precision (as seen in the polynomial and sigmoid runs) is a warning sign that the model is predicting the majority class rather than learning the true boundary.
Next steps:
- Explore K-Nearest Neighbors to see how a distance-based classifier compares to SVM's margin-based approach on the same data.
- Read Logistic Regression with Python to understand a probabilistic binary classifier that is faster to train and easier to interpret than SVM.
- Try the RBF kernel (kernel='rbf') with a grid search over C and gamma using GridSearchCV to see how much accuracy improves over the polynomial results above.
