K-Means Clustering in Python

K-Means clustering is one of the most widely used unsupervised machine learning algorithms. Unlike supervised methods, it finds structure in data without any labels — making it ideal for customer segmentation, anomaly detection, and exploratory analysis.

The algorithm groups data into $k$ clusters by iteratively assigning each point to its nearest centroid and then recomputing each centroid as the mean of its assigned points. The word "means" in K-Means refers to this averaging step.

In this tutorial you will apply K-Means to a synthetic two-dimensional dataset and then to the Iris dataset. You will choose the right value of $k$ using the elbow method and visualize the resulting cluster assignments.

Prerequisites: Python 3.x, scikit-learn, Pandas, NumPy, Matplotlib, Seaborn.

Types of Clustering

Machine learning is broadly divided into supervised learning (labelled data), unsupervised learning (no labels), and semi-supervised learning (a mix of both). Clustering falls entirely in the unsupervised category.

Within clustering there are two assignment styles:

Hard clustering — each data point belongs to exactly one cluster.
Soft clustering — each data point has a probability of belonging to every cluster.
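
The difference between the two styles is easiest to see in the shape of the output. The sketch below is illustrative only: it uses a synthetic dataset from make_blobs (an assumption, not the tutorial's data), K-Means for hard labels, and a Gaussian Mixture Model for soft probabilities.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data only: three synthetic blobs, not the tutorial's data.csv.
points, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Hard clustering: exactly one integer label per point.
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Soft clustering: one probability per point for every cluster.
gmm = GaussianMixture(n_components=3, random_state=0).fit(points)
soft_probs = gmm.predict_proba(points)

print(hard_labels[:5])           # five integer labels
print(soft_probs[:5].round(3))   # five rows of three probabilities
```

Each row of soft_probs sums to 1, whereas hard_labels holds a single cluster id per point.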

Types of Clustering Algorithms

There are four major families of clustering algorithms, each with a different mathematical idea of what makes a good cluster.

Connectivity-Based Clustering

Connectivity-based (hierarchical) clustering groups points that are spatially close to one another. The main assumption is that nearby points are more similar than distant ones. These methods are sensitive to outliers, which can appear as spurious clusters or cause real clusters to merge incorrectly.

The diagram below shows two groups found by a connectivity-based algorithm, with a boundary line drawn between them:

Connectivity-based clustering diagram showing two colored point clouds separated by a boundary line
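
As a rough illustration of this family, the sketch below runs scikit-learn's AgglomerativeClustering on two synthetic blobs. The dataset, the choice of Ward linkage, and the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative data only: two synthetic blobs.
points, _ = make_blobs(n_samples=100, centers=2, random_state=0)

# Merge the closest groups bottom-up until two clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(points)

print(np.bincount(labels))   # number of points in each of the two clusters
```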

Centroid-Based Clustering

In centroid-based clustering, each cluster is represented by a central point called a centroid. The centroid is the mean position of all points in the cluster and does not need to be an actual data point. K-Means is the canonical centroid-based algorithm.

The diagram below shows two clusters, each with a centroid marker at its center of mass:

Centroid-based clustering diagram showing two clusters with centroid markers labeled Cluster 1, Cluster 2, and Centroids
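
A tiny worked example makes the point that a centroid need not coincide with any data point. The four coordinates below are made up for illustration.

```python
import numpy as np

# Four hypothetical points at the corners of a square.
cluster_points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])

# The centroid is the per-feature mean of the points.
centroid = cluster_points.mean(axis=0)

# Check whether the centroid coincides with any of the points.
is_data_point = any(np.array_equal(centroid, p) for p in cluster_points)

print(centroid)        # [1. 1.]
print(is_data_point)   # False
```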

Distribution-Based Clustering

Distribution-based methods model each cluster as a statistical distribution — most commonly a Gaussian. The Gaussian Mixture Model (GMM) fitted with the Expectation-Maximization algorithm is the best-known example. These models have strong theoretical foundations but can overfit when the number of clusters is too high.

The comparison below shows Original Data, K-Means, and EM clustering applied to the same "mouse" shaped dataset, illustrating how EM can follow curved boundaries:

Distribution-based clustering comparison showing original mouse-shaped data alongside k-means and EM clustering results

Density-Based Clustering

Density-based methods — such as DBSCAN — define clusters as dense regions of points separated by sparse regions. They can discover arbitrarily shaped clusters and naturally label sparse points as noise, unlike K-Means which always forces every point into a cluster.

The diagram below contrasts DBSCAN and K-Means on the same non-convex dataset, showing that DBSCAN correctly separates the irregular shapes while K-Means misassigns the overlapping region:

Comparison of DBSCAN and K-Means clustering on a non-convex dataset showing that DBSCAN correctly identifies irregular cluster shapes
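
A comparison along these lines can be sketched with scikit-learn's make_moons generator. The dataset and the eps and min_samples values below are assumptions chosen for illustration, and they would normally need tuning for other data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Illustrative non-convex data: two interleaving half-moons.
points, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
points = StandardScaler().fit_transform(points)

# K-Means must place every point in one of the two clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# DBSCAN labels points in sparse regions as noise, using the label -1.
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(points)

n_noise = int((dbscan_labels == -1).sum())
n_dbscan_clusters = len(set(dbscan_labels.tolist()) - {-1})
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)
```

Plotting points colored by each label array shows how the two methods divide the moons.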

Dataset and Problem Understanding

Start by importing the libraries you will need throughout this tutorial:

PYTHON

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the synthetic dataset using pd.read_csv. The dataset has two numeric features, x and y, and a ground-truth cluster column you will use only for comparison:

PYTHON

data = pd.read_csv('data.csv', index_col = 0)
data.head()

OUTPUT

	x	y	cluster
0	-8.482852	-5.603349	2
1	-7.751632	-8.405334	2
2	-10.967098	-9.032782	2
3	-11.999447	-7.606734	2
4	-1.736810	10.478015	1

Check the distribution of the ground-truth cluster labels with value_counts():

PYTHON

data['cluster'].value_counts()

OUTPUT

1    67
0    67
2    66
Name: cluster, dtype: int64

The dataset contains three roughly balanced groups: 67 points in clusters 0 and 1 each, and 66 in cluster 2.
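
If you do not have data.csv to hand, a stand-in with the same columns can be generated. This is a sketch under assumptions: make_blobs, the cluster_std value, and the random seed are choices made here, so the coordinates will differ from the output shown above.

```python
import pandas as pd
from sklearn.datasets import make_blobs

# Hypothetical stand-in for data.csv: 200 points around three centers.
features, labels = make_blobs(
    n_samples=200, centers=3, cluster_std=1.5, random_state=42
)

data = pd.DataFrame({"x": features[:, 0], "y": features[:, 1], "cluster": labels})

# make_blobs spreads 200 samples over 3 centers as 67, 67 and 66.
print(data["cluster"].value_counts())
```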

Plot a scatter chart to see the natural groupings in the data. The %matplotlib inline magic you set earlier tells Jupyter to render the chart inside the notebook:

PYTHON

plt.scatter(data['x'], data['y'], c = data['cluster'], cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster')
plt.show()

The scatter plot below shows three clearly separated blobs colored by their ground-truth label:

Scatter plot titled Formation of cluster showing three distinct color-coded clusters of data points on X-Y axes

Training K-Means on the Dataset

The K-Means algorithm is a simple, efficient method capable of clustering data in just a few iterations. It is an unsupervised technique — it discovers structure without using the cluster label column.

Import KMeans and StandardScaler from scikit-learn:

PYTHON

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Separate the features from the label:

PYTHON

X = data[['x', 'y']]
y = data['cluster']

Standardize X before training. When features have different units or variances, K-Means — which relies on Euclidean distance — will be dominated by the larger-scale feature. StandardScaler transforms each feature to have mean 0 and standard deviation 1:

PYTHON

scaler = StandardScaler()
X = scaler.fit_transform(X)
X[: 5]

OUTPUT

array([[-1.01200363, -0.60606415],
       [-0.86550679, -1.04265203],
       [-1.5097118 , -1.14041707],
       [-1.71653856, -0.91821912],
       [ 0.33953731,  1.89963378]])
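
To see what StandardScaler computes, the sketch below applies it to a small made-up array whose second feature is on a much larger scale, and compares the result with the formula (value - mean) / standard deviation. The array values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales.
raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

scaled = StandardScaler().fit_transform(raw)

# Manual version: subtract the column mean, divide by the population std (ddof=0).
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))   # True
print(scaled.std(axis=0))            # both columns now have unit spread
```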

Write the standardized values back into the DataFrame so subsequent plots use the scaled coordinates:

PYTHON

data[['x', 'y']] = X

Fitting with k = 2

Start with k=2 to see how K-Means behaves when given fewer clusters than the true number. You will correct this after applying the elbow method:

PYTHON

k=2
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

OUTPUT

KMeans(n_clusters=2, random_state=42)

Retrieve the two centroid coordinates the algorithm converged on:

PYTHON

center = kmeans.cluster_centers_
center

OUTPUT

array([[-1.30618271, -0.87560626],
       [ 0.64334372,  0.43126875]])

Plot the cluster assignments alongside the centroids, represented as red stars:

PYTHON

plt.scatter(data['x'], data['y'], c = kmeans.labels_, cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster with centroids')
# Draw each centroid as a large red star
for cx, cy in center:
    plt.plot(cx, cy, '*r', markersize=18)
plt.show()

With only two clusters, the model merges what are actually three blobs. The two red star centroids are visible, but you can already see that the upper-right region contains points that should belong to a third group:

Scatter plot titled Formation of cluster with centroids showing two color-coded clusters and two red star markers at their centroid positions

Plot the ground-truth labels to confirm that three clusters exist in the data:

PYTHON

plt.scatter(data['x'], data['y'], c = data['cluster'], cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of 3-cluster')
plt.show()

The true three-cluster structure is now visible, confirming that k=2 was too low:

Scatter plot titled Formation of 3-cluster showing three distinct color-coded point clouds on X-Y axes

Choosing the Right Value of K

The K-Means convergence process follows four steps.

Assuming inputs $x_{1}, x_{2}, \dots, x_{n}$ :

Step 1 — Pick $k$ random points as initial cluster centers (centroids).
Step 2 — Assign each $x_{i}$ to the nearest centroid by Euclidean distance.
Step 3 — Recompute each centroid as the mean of all points assigned to it.
Step 4 — Repeat Steps 2 and 3 until no assignment changes.
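
The four steps above can be sketched in plain NumPy. This is a toy version for understanding only, not scikit-learn's implementation: the function name kmeans_sketch, the choice to start from randomly picked data points, and the demo data are all assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Toy K-Means following Steps 1-4; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 4: stop once no assignment changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:   # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Made-up demo data: three tight groups of 30 points each.
demo_rng = np.random.default_rng(1)
demo = np.vstack([
    demo_rng.normal(loc=c, scale=0.3, size=(30, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])
centroids, labels = kmeans_sketch(demo, k=3)
print(centroids.round(2))
```

Because the starting centroids are random, a different seed can converge to a different, sometimes worse, grouping.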

The diagram below illustrates this convergence sequence from random initialization (Iteration 1) through successive refinements until the clusters stabilize:

Six-panel diagram showing K-Means centroid movement across iterations 1, 2, 3, 6, 9, and convergence with three colored clusters and cross-shaped centroid markers

When you already know the right number of clusters, use that value directly. When you do not, use the Elbow Method.

The Elbow Method

The elbow method plots SSE (Sum of Squared Errors, also called inertia) against different values of $k$ and looks for the point where the curve bends sharply — the "elbow." Beyond that point, adding more clusters yields diminishing returns.

$$SSE = \sum_{i=1}^{k} \sum_{x_{j} \in S_{i}} ∥ x_{j} - μ_{i} ∥^{2}$$

Where:

$k$ — number of clusters
$S_{i}$ — the set of points assigned to cluster $i$
$x_{j}$ — a data point in cluster $i$
$μ_{i}$ — the centroid (mean) of cluster $i$
$∥ \cdot ∥^{2}$ — squared Euclidean distance
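
To connect the formula to code, the sketch below evaluates the double sum by hand and compares it with the inertia_ value scikit-learn reports. The make_blobs data is an assumption used only to have something to fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data only.
points, _ = make_blobs(n_samples=200, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# Outer sum over clusters i, inner sum over the points x_j in S_i.
sse = 0.0
for i, mu_i in enumerate(kmeans.cluster_centers_):
    S_i = points[kmeans.labels_ == i]
    sse += np.sum((S_i - mu_i) ** 2)

print(sse, kmeans.inertia_)   # the two values should agree up to rounding
```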

The diagram below shows an example elbow curve annotated with an arrow pointing to the optimal $k$ :

Elbow method diagram showing SSE dropping steeply from k=1 to k=3 then flattening, with a red circle and arrow marking the elbow at k=3

Computing SSE for the Synthetic Dataset

Fit K-Means for every $k$ from 1 to 9 and record kmeans.inertia_ — scikit-learn's name for SSE — at each step:

PYTHON

SSE = []
index = range(1,10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)

OUTPUT

400.0000000000001
156.41078579574975
44.057048453292815
36.726387118666075
31.01642761531463
25.39959047192452
22.547184727743304
19.92343817847746
17.295836404824982

The inertia drops sharply from $k = 1$ to $k = 3$ , then the rate of decrease slows considerably. Plot the curve to see the elbow clearly:

PYTHON

plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('SSE with respect to K')
plt.show()

The elbow in the SSE curve below falls at $k = 3$ , confirming the three natural groups visible in the scatter plot:

Line chart titled SSE with respect to K showing a steep drop in inertia from k=1 to k=3 followed by a gradual flattening, indicating the optimal k is 3
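
Reading the elbow off a chart is subjective. One simple numeric heuristic, offered here as an assumption and not as part of the elbow method itself, is to draw a straight line from the first to the last point of the curve and pick the $k$ where the curve lies furthest below that line. The SSE values are the ones printed above, rounded to two decimals.

```python
import numpy as np

# SSE values from the output above (rounded), for k = 1..9.
ks = np.arange(1, 10)
sse = np.array([400.00, 156.41, 44.06, 36.73, 31.02, 25.40, 22.55, 19.92, 17.30])

# Straight line joining the first and last points of the curve.
chord = sse[0] + (sse[-1] - sse[0]) * (ks - ks[0]) / (ks[-1] - ks[0])

# The elbow candidate is the k where the curve sags furthest below that line.
elbow_k = int(ks[np.argmax(chord - sse)])
print(elbow_k)   # 3
```

Treat the result as a cross-check on the plot, since the answer depends on the range of $k$ you scan.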

Fitting with k = 3

Now re-train with the correct number of clusters and plot the result with three centroid markers:

PYTHON

k=3
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
center = kmeans.cluster_centers_

plt.scatter(data['x'], data['y'], c = kmeans.labels_, cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of clusters with 3-centroids')
# Draw each centroid as a large red star
for cx, cy in center:
    plt.plot(cx, cy, '*r', markersize=18)
plt.show()

With k=3, each blob is assigned its own cluster and the three red star centroids sit at the geometric center of each group:

Scatter plot titled Formation of clusters with 3-centroids showing three color-coded clusters with red star centroid markers at the center of each group
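
Because cluster ids are arbitrary (K-Means may call a group 0 where the ground truth calls it 2), comparing labels with == is misleading. A permutation-invariant score such as the adjusted Rand index is one way to check agreement. The sketch below uses a make_blobs stand-in for the tutorial's data, which is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the synthetic dataset and its ground-truth labels.
features, true_labels = make_blobs(n_samples=200, centers=3, random_state=42)
scaled = StandardScaler().fit_transform(features)

predicted = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(scaled)

# 1.0 means identical groupings regardless of how the cluster ids are numbered.
ari = adjusted_rand_score(true_labels, predicted)
print(round(ari, 3))
```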

Applying K-Means to the Iris Dataset

To see how K-Means performs on a real-world dataset, apply the same workflow to the Iris dataset, which contains four measurements of 150 iris flowers across three species.

Import the datasets module from scikit-learn:

PYTHON

from sklearn import datasets

Load and scale the features:

PYTHON

iris = datasets.load_iris()
X = iris.data
scaler = StandardScaler()
X = scaler.fit_transform(X)
iris.target_names

OUTPUT

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The Iris dataset contains three species, so you expect the elbow method to suggest $k = 3$ . Compute SSE for $k$ from 1 to 9:

PYTHON

SSE = []
index = range(1,10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)

OUTPUT

600.0
222.36170496502302
139.82049635974974
114.41256181896094
90.92751382392049
80.0224959955744
71.81624598106144
62.28749580350205
54.8110520315013

Plot SSE versus $k$ to locate the elbow:

PYTHON

plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('Variation of SSE with respect to K')
plt.show()

The elbow for the Iris dataset is less sharp than for the synthetic data: the largest drop is from $k = 1$ to $k = 2$ , and the curve flattens after $k = 3$ . Reading the elbow at $k = 3$ matches the three species in the data:

Line chart titled Variation of SSE with respect to K for the Iris dataset showing inertia dropping steeply from k=1 to k=3 then flattening
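
A natural follow-up is to fit K-Means with $k = 3$ on the scaled Iris features and cross-tabulate the clusters against the species. This is a sketch, with the names X_iris, kmeans_iris, and table chosen here for illustration. K-Means on Iris typically isolates setosa well and mixes some versicolor and virginica, so do not expect a perfect one-to-one match.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris.data)

kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_iris)

# Rows are the true species, columns are the cluster ids K-Means assigned.
species = pd.Series(iris.target_names[iris.target], name="species")
clusters = pd.Series(kmeans_iris.labels_, name="cluster")
table = pd.crosstab(species, clusters)
print(table)
```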

Conclusion

In this tutorial you applied K-Means clustering to a synthetic three-blob dataset and to the Iris dataset. You used the elbow method, plotting SSE (inertia) against values of $k$ from 1 to 9, to identify a cluster count of 3 in both cases. After scaling the features with StandardScaler, the model trained on the synthetic data separated all three blobs, with each centroid marker sitting at the center of its cluster.

Key takeaways:

K-Means is an unsupervised algorithm that iterates between assigning points to their nearest centroid and recomputing centroids as cluster means until convergence.
Always standardize features before running K-Means — the algorithm uses Euclidean distance, so unscaled features with large variance will dominate assignments.
The elbow method finds the right $k$ by plotting SSE against cluster count and selecting the value where marginal gains begin to flatten.
Inertia (kmeans.inertia_) is scikit-learn's name for SSE. The lower it is, the tighter the clusters, but inertia always decreases as $k$ grows, so a very low value at a high $k$ signals overfitting, not a better model.
K-Means assumes convex, roughly equal-sized clusters; for irregular shapes consider density-based methods like DBSCAN.

Next steps:

Explore K-Nearest Neighbors to see how distance-based reasoning applies to a supervised classification problem.
Read Ensemble Learning for a broader view of combining multiple models to improve prediction quality.
Experiment with different values of init in KMeans ('k-means++' vs 'random') to see how centroid initialization affects convergence speed and final cluster quality.
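
As a starting point for the last experiment, the sketch below fits K-Means once with each initialization strategy and records the iteration count and final inertia. The make_blobs data and the n_init=1 setting (a single run per strategy, so the effect of initialization is not averaged away) are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative data only.
points, _ = make_blobs(n_samples=200, centers=3, random_state=42)

results = {}
for init in ("k-means++", "random"):
    model = KMeans(n_clusters=3, init=init, n_init=1, random_state=0).fit(points)
    # n_iter_ is the number of iterations run; inertia_ is the final SSE.
    results[init] = (model.n_iter_, model.inertia_)
    print(init, model.n_iter_, round(model.inertia_, 2))
```

Repeating the loop over several random_state values gives a fairer picture than a single run.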
