SVM with Python: Support Vector Machines

Support Vector Machines are one of the most mathematically elegant algorithms in machine learning — they find the single decision boundary that is as far as possible from every class. That "maximum margin" property makes SVMs unusually robust to new data and highly effective in high-dimensional spaces such as text classification, image recognition, and bioinformatics.

In this tutorial you will build SVM classifiers on the Wisconsin Breast Cancer dataset, a 569-sample, 30-feature dataset where the goal is to classify tumours as malignant or benign. You will train three kernel variants — linear, polynomial, and sigmoid — compare their accuracy, and visualise each result with a confusion matrix heatmap.

Prerequisites: Python 3.x, scikit-learn, NumPy, Pandas, Matplotlib, Seaborn.

What is a Support Vector Machine?

A Support Vector Machine (SVM) is a supervised binary classification algorithm. Given labelled training points in an $N$-dimensional space, SVM finds an $(N - 1)$-dimensional hyperplane (a flat decision boundary) that best separates the two classes. For two-dimensional data this hyperplane is a straight line; for three dimensions it is a plane; in higher dimensions it generalises to a hyperplane.

The key insight is that many valid separating hyperplanes exist for any linearly separable dataset. SVM's answer to this ambiguity is to choose the one hyperplane that maximises the margin — the gap between the boundary and the closest training points on either side.

Support Vectors, Hyperplane, and Margin

Three concepts are central to understanding how SVM operates:

Support Vectors are the training points that lie closest to the decision boundary. Only these points determine where the boundary is placed; all other training examples have no influence on it.
Hyperplane is the decision surface that separates samples of different classes. SVM positions it to be equidistant from the nearest points of each class.
Margin is the perpendicular distance between the hyperplane and the nearest support vectors on each side. A larger margin generally means better generalisation — SVM maximises this quantity during training.
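To make these three terms concrete, here is a minimal sketch on a hypothetical four-point toy dataset that is not part of this tutorial's data. It assumes a linear kernel, for which scikit-learn exposes the weight vector through coef_. The variable names are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two points per class, all on the x-axis.
X_toy = np.array([[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
y_toy = np.array([0, 0, 1, 1])

toy_clf = SVC(kernel="linear", C=1.0).fit(X_toy, y_toy)

# Support vectors: the training points closest to the boundary.
support_vectors = toy_clf.support_vectors_

# Hyperplane: w . x + b = 0 (coef_ is only available for the linear kernel).
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]

# Distance from the hyperplane to the nearest point on each side is 1 / ||w||,
# so the full gap between the two classes is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

print("Support vectors:\n", support_vectors)
print("w =", w, "b =", b)
print("Margin width =", margin_width)
```

On this toy data the two inner points, (-1, 0) and (1, 0), should come back as the support vectors. Moving the outer points further from the boundary would leave the hyperplane unchanged.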

How SVM Finds the Best Boundary

SVM does not test hyperplanes one by one. Training solves a convex optimisation problem whose solution is the hyperplane with the greatest distance to the nearest data points on both sides. A boundary that is far from all training examples is less likely to misclassify slightly shifted test points, which is why maximum-margin classifiers tend to generalise well. Among all hyperplanes that separate the classes without error, this maximum-margin hyperplane is unique.
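Stated formally, and assuming the classes are linearly separable with labels $y_i \in \{-1, +1\}$ (a notational convention introduced here for illustration, not the 0/1 labels used later in this tutorial), the maximum-margin hyperplane $w \cdot x + b = 0$ is the solution of:

```latex
\min_{w,\, b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The closest points satisfy the constraint with equality, so the distance from the boundary to each side is $1 / \lVert w \rVert$, and minimising $\lVert w \rVert$ maximises the margin. scikit-learn's SVC solves the soft-margin variant of this problem, in which the C parameter controls how heavily margin violations are penalised.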

Linear and Non-Linear Separation

Not all datasets can be separated by a straight line or flat plane.

SVM handles two situations:

Linear separation — the classes can be divided by a flat hyperplane directly in the original feature space. A linear kernel is used here and is computationally the cheapest option.
Non-linear separation — the classes overlap or curve in the original space and no flat boundary can separate them cleanly.

The Kernel Trick

When data is not linearly separable, SVM uses a kernel function to implicitly map the input into a higher-dimensional feature space where a linear boundary does exist. The "trick" is that this mapping is never computed explicitly — the kernel evaluates the inner product in the high-dimensional space directly, keeping computation tractable even in infinite dimensions.
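The following sketch illustrates the idea for one small case, the degree-2 kernel K(x, z) = (x . z)^2 on two-dimensional inputs. The function names phi and poly2_kernel are hypothetical helpers written for this example; scikit-learn does not expose an explicit feature map like this.

```python
import numpy as np

def phi(v):
    """Explicit feature map for K(x, z) = (x . z)**2 on 2-D inputs."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """The same inner product, computed without building phi(x) or phi(z)."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)      # kernel evaluated in the original 2-D space

print("Explicit mapping:", explicit)
print("Kernel shortcut: ", implicit)
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors. For the RBF kernel the explicit map would be infinite-dimensional, so the shortcut is the only practical option.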

Four kernels are commonly used in practice:

Linear kernel: equivalent to a standard dot product between two feature vectors. It is the fastest kernel and works well when the number of features is large relative to the number of samples, as in text classification tasks.
Polynomial kernel: a generalisation of the linear kernel that captures curved decision boundaries. The degree parameter controls how complex the boundary can be; a higher degree increases flexibility but also the risk of overfitting.
Radial Basis Function (RBF) kernel: corresponds to an infinite-dimensional feature space and scores similarity as a decaying exponential of the squared Euclidean distance between two samples. RBF is the default kernel of scikit-learn's SVC and performs well across a wide range of problems.
Sigmoid kernel: originates from neural network theory and is built on the tanh function, so it resembles the activation of a two-layer perceptron. It was historically popular but is less commonly recommended today.
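For reference, these are the kernel definitions given in the scikit-learn documentation, where $\gamma$ corresponds to the gamma argument of SVC, $r$ to coef0, and $d$ to degree:

```latex
\begin{aligned}
K_{\text{linear}}(x, x')  &= \langle x, x' \rangle \\
K_{\text{poly}}(x, x')    &= \left(\gamma \langle x, x' \rangle + r\right)^{d} \\
K_{\text{rbf}}(x, x')     &= \exp\!\left(-\gamma \lVert x - x' \rVert^{2}\right) \\
K_{\text{sigmoid}}(x, x') &= \tanh\!\left(\gamma \langle x, x' \rangle + r\right)
\end{aligned}
```

Because $\gamma$ multiplies the inner product or the squared distance directly, its sensible range depends on the scale of the features, which is one reason standardisation matters for the non-linear kernels.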

Building the SVM Model

The following sections load the dataset, scale the features, split into train/test sets, and train three SVM variants.

Imports

Group all library imports at the top so every dependency is visible at a glance:

PYTHON

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Loading the Breast Cancer Dataset

Scikit-learn ships with the Wisconsin Breast Cancer dataset ready to use. Load it and inspect its keys:

PYTHON

cancer = datasets.load_breast_cancer()
cancer.keys()
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
print(cancer.DESCR)

The printed description gives a full summary of the dataset's 569 samples, 30 features, and class distribution:

OUTPUT

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

Confirm the two target class names and inspect the first five feature names:

PYTHON

cancer.target_names
# array(['malignant', 'benign'], dtype='<U9')
cancer.feature_names[: 5]

OUTPUT

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness'], dtype='<U23')

Check the total number of features:

PYTHON

cancer.feature_names.shape

OUTPUT

(30,)

Extract the feature matrix X and label vector y, then verify their shapes:

PYTHON

X = cancer.data
y = cancer.target
X.shape, y.shape

OUTPUT

((569, 30), (569,))

Print the first two rows of X to see the raw numeric values:

PYTHON

X[: 2]

OUTPUT

array([[1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
        3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
        8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
        3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
        1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, 1.326e+03, 8.474e-02, 7.864e-02,
        8.690e-02, 7.017e-02, 1.812e-01, 5.667e-02, 5.435e-01, 7.339e-01,
        3.398e+00, 7.408e+01, 5.225e-03, 1.308e-02, 1.860e-02, 1.340e-02,
        1.389e-02, 3.532e-03, 2.499e+01, 2.341e+01, 1.588e+02, 1.956e+03,
        1.238e-01, 1.866e-01, 2.416e-01, 1.860e-01, 2.750e-01, 8.902e-02]])

Check the first ten labels — 0 is malignant, 1 is benign:

PYTHON

y[: 10]

OUTPUT

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
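The first ten labels are all malignant, which says nothing about the overall balance. As a quick sanity check against the class distribution listed in DESCR, a short sketch using np.bincount can count every label. The variable name counts is illustrative.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# np.bincount counts occurrences of each non-negative integer label,
# so index 0 is malignant and index 1 is benign.
counts = np.bincount(cancer.target)

for name, n in zip(cancer.target_names, counts):
    print(f"{name}: {n} samples ({n / counts.sum():.1%})")
```

The counts should agree with the 212 malignant and 357 benign samples stated in the dataset description.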

Feature Standardisation

SVM is sensitive to the scale of input features because it measures distances between data points. StandardScaler — which transforms each feature to have zero mean and unit variance — is therefore essential before training. Without scaling, features with larger numeric ranges (such as area, which can exceed 2000) would dominate the margin calculation over features with values near zero.

Apply StandardScaler to the full feature matrix:

PYTHON

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled.shape

OUTPUT

(569, 30)

The scaled matrix has the same shape as X, and every feature now has mean 0 and standard deviation 1.
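One way to check the claim that every feature ends up with mean 0 and standard deviation 1 is to reproduce the transformation by hand. The sketch below assumes the default StandardScaler settings, which use the population standard deviation (ddof=0). X_manual is a name introduced for this comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_breast_cancer().data

# Manual standardisation: subtract each column's mean, divide by its std.
# NumPy's std defaults to ddof=0, which matches StandardScaler.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)
X_scaled = StandardScaler().fit_transform(X)

print("Matches StandardScaler:", np.allclose(X_manual, X_scaled))
print("Largest |mean| after scaling:", np.abs(X_scaled.mean(axis=0)).max())
print("Std range after scaling:", X_scaled.std(axis=0).min(), X_scaled.std(axis=0).max())
```

The means will not be exactly zero because of floating-point rounding, but they should be vanishingly small.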

Splitting the Data and Training Kernels

Split the scaled data into 80 % training and 20 % test sets, using stratification to preserve the class ratio in both splits:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2, random_state = 1, stratify = y)
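The split above happens after the scaler has already seen every row, so the test set influences the means and variances used for scaling. A common alternative, sketched here and not used in the rest of this tutorial, is to split the raw data first and wrap the scaler and classifier with make_pipeline, so the scaler is fitted on the training rows only. The split parameters mirror the ones above; the variable names with a _raw suffix and the choice of a linear kernel are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)

# Split the raw features first so the test rows stay unseen by the scaler.
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# fit() learns the scaling statistics from X_train_raw only; score() reuses
# those statistics to transform X_test_raw before predicting.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train_raw, y_train_raw)

test_accuracy = model.score(X_test_raw, y_test_raw)
print("Test accuracy:", test_accuracy)
```

On a dataset this small the difference from scaling everything up front is usually minor, but the pipeline form also plugs directly into cross-validation utilities.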

With the data prepared, you can train each kernel variant in turn.

Linear Kernel SVM

A linear kernel is the best starting point when you have many features or expect the classes to be roughly separable without a curved boundary. It is also the fastest kernel to train. Import svm from scikit-learn and fit the classifier:

PYTHON

from sklearn import svm

PYTHON

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))

print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))

print('Confusion Matrix')

mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
           xticklabels=cancer.target_names,
           yticklabels=cancer.target_names)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

OUTPUT

Accuracy:  0.9649122807017544
Precision:  0.9594594594594594
Recall:  0.9861111111111112
Confusion Matrix

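The linear kernel achieves 96.5 % accuracy. Because scikit-learn treats label 1 (benign) as the positive class by default, the recall of 0.986 means the model correctly identifies nearly all benign tumours. In a medical context the costlier error is a malignant tumour predicted as benign; under this labelling that is a false positive, so it is reflected in the precision of 0.959 rather than in the recall.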
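The reported scores pin down the confusion-matrix counts for a test set of 42 malignant and 72 benign samples (the class counts this tutorial reports for the split in the next section). The counts below are reconstructed from those scores rather than read from the heatmap, so treat them as an inference. The sketch rebuilds matching label arrays and also computes recall for the malignant class with pos_label=0, which precision_score and recall_score do not report by default.

```python
import numpy as np
from sklearn import metrics

# Counts inferred from the reported accuracy, precision and recall
# (0 = malignant, 1 = benign; benign is the positive class by default).
tn, fp, fn, tp = 39, 3, 1, 71

# Rebuild label arrays that produce exactly these counts.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

accuracy = metrics.accuracy_score(y_true, y_pred)
precision_benign = metrics.precision_score(y_true, y_pred)
recall_benign = metrics.recall_score(y_true, y_pred)

# Recall for the malignant class: the share of cancers the model catches.
recall_malignant = metrics.recall_score(y_true, y_pred, pos_label=0)

print("Accuracy:          ", accuracy)
print("Precision (benign):", precision_benign)
print("Recall (benign):   ", recall_benign)
print("Recall (malignant):", recall_malignant)
```

Under these inferred counts, 3 of the 42 malignant tumours are predicted as benign, which is the figure a clinician would care about most.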

Counting Class Distribution in the Test Set

You can use np.unique() with return_counts=True to count how many samples of each class appear in the test set. This is useful for sanity-checking whether stratification worked correctly:

PYTHON

element, count = np.unique(y_test, return_counts=True)
element, count

OUTPUT

(array([0, 1]), array([42, 72], dtype=int64))

The test set contains 42 malignant and 72 benign samples, closely matching the original 212:357 class ratio (about 37 % malignant in both). Now re-run the linear kernel on the raw (unscaled) data to observe the effect of skipping standardisation:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1, stratify = y)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))

OUTPUT

Accuracy:  0.9649122807017544

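The accuracy happens to be the same on this dataset, but you should always scale before training SVMs so that features with large numeric ranges do not dominate the margin. Note that this cell overwrites X_train, X_test, y_train, and y_test with the unscaled split, so the polynomial and sigmoid examples below train on raw features unless you re-run the scaled split first.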

Polynomial Kernel SVM

A polynomial kernel introduces a degree parameter that controls the complexity of the decision boundary. It is a non-stationary kernel: its value depends on where the two inputs lie in feature space, not only on the distance between them, so it works best when all features are normalised to a similar range. Here you train with degree=5 and a very high gamma of 100:

PYTHON

clf = svm.SVC(kernel='poly', degree = 5, gamma = 100)
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))

print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))

print('Confusion Matrix')

mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
           xticklabels=cancer.target_names,
           yticklabels=cancer.target_names)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

OUTPUT

Accuracy:  0.631578947368421
Precision:  0.631578947368421
Recall:  1.0

The polynomial kernel with these hyper-parameters achieves only 63.2 % accuracy. A recall of 1.0 combined with a precision of 0.632 (72 of 114) means the model predicts benign for every test sample, so its accuracy simply equals the share of benign samples and it has not learned a meaningful boundary. The extreme gamma and the unscaled features left in X_train by the previous section are likely contributors. Tuning degree, gamma, and C with cross-validation on scaled features would be necessary to make the polynomial kernel competitive on this dataset.
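To see why these scores point to a majority-class predictor, it helps to compute what a "model" that always answers benign would score on a test set with this tutorial's class counts (42 malignant, 72 benign). This is a minimal sketch; y_always_benign is a hypothetical stand-in for such a predictor, not output from the SVM itself.

```python
import numpy as np
from sklearn import metrics

# Test-set labels with the class counts from this tutorial's split:
# 42 malignant (0) and 72 benign (1).
y_true = np.array([0] * 42 + [1] * 72)

# A predictor that ignores its input and always answers benign.
y_always_benign = np.ones_like(y_true)

baseline_accuracy = metrics.accuracy_score(y_true, y_always_benign)
baseline_precision = metrics.precision_score(y_true, y_always_benign)
baseline_recall = metrics.recall_score(y_true, y_always_benign)

print("Accuracy: ", baseline_accuracy)
print("Precision:", baseline_precision)
print("Recall:   ", baseline_recall)
```

Any classifier whose scores coincide with this baseline has learned nothing beyond the class frequencies, so comparing against it is a cheap first check before trusting an accuracy figure.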

Sigmoid Kernel SVM

The sigmoid kernel is inspired by the activation function used in neural networks. It was historically popular for SVMs but has largely been superseded by RBF in practice. To use it, set kernel='sigmoid' in SVC:

PYTHON

clf = svm.SVC(kernel='sigmoid', gamma = 200, C = 10000)
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

print('Accuracy: ', metrics.accuracy_score(y_test, y_predict))

print('Precision: ', metrics.precision_score(y_test, y_predict))
print('Recall: ', metrics.recall_score(y_test, y_predict))

print('Confusion Matrix')

mat = metrics.confusion_matrix(y_test, y_predict)
sns.heatmap(mat, square = True, annot = True, fmt = 'd', cbar = False,
           xticklabels=cancer.target_names,
           yticklabels=cancer.target_names)

plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

OUTPUT

Accuracy:  0.631578947368421
Precision:  0.631578947368421
Recall:  1.0

The sigmoid kernel produces the same accuracy and recall pattern as the polynomial kernel under these settings — both are predicting the majority class and failing to separate malignant from benign. The linear kernel remains the strongest performer on this dataset without additional tuning.

Conclusion

In this tutorial you trained Support Vector Machine classifiers on the Wisconsin Breast Cancer dataset using three kernel functions. The linear kernel delivered 96.5 % accuracy with high recall (0.986), making it the most effective choice here. Both the polynomial and sigmoid kernels achieved only 63.2 % accuracy under the chosen hyper-parameters, demonstrating that kernel choice and tuning matter significantly — the same algorithm can produce wildly different results depending on its configuration.

Key takeaways:

SVM finds the decision boundary that maximises the margin between classes, which tends to make it more robust to small shifts in the data than a boundary that merely separates the training set.
Feature standardisation is essential before training SVMs because the algorithm measures distances between points — unscaled features distort those distances.
The linear kernel is the right starting point for high-dimensional datasets; switch to RBF when you suspect non-linear class boundaries.
Polynomial and sigmoid kernels require careful tuning of degree, gamma, and C to be competitive; cross-validation is necessary to find good values.
A high recall with low precision (as seen in the polynomial and sigmoid runs) is a warning sign that the model is predicting the majority class rather than learning the true boundary.

Next steps:

Explore K-Nearest Neighbors to see how a distance-based classifier compares to SVM's margin-based approach on the same data.
Read Logistic Regression with Python to understand a probabilistic binary classifier that is faster to train and easier to interpret than SVM.
Try the RBF kernel (kernel='rbf') with a grid search over C and gamma using GridSearchCV to see how much accuracy improves over the polynomial results above.
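As a starting point for the last suggestion, here is a hedged sketch of an RBF grid search. The grid values are arbitrary illustrative choices, not recommendations from this tutorial, and the parameter prefix svc__ comes from the step name that make_pipeline generates automatically. Scaling is placed inside the pipeline so that each cross-validation fold fits its own scaler.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical search grid; widen or refine it for a real project.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01, 0.1],
}

# 5-fold cross-validation on the training set only; the test set is held
# back for a single final evaluation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Swapping kernel="rbf" for kernel="poly" and adding svc__degree to the grid would apply the same procedure to the polynomial kernel.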
