#clustering#k-means clustering#machine learning#pandas#python#seaborn#sklearn

K-Means Clustering in Python

Learn how K-Means clustering works and implement it in Python with scikit-learn. Covers centroid initialization, the elbow method for choosing K, inertia, and cluster visualization.

May 17, 2026 at 8:15 PM10 min readFollowFollow (Hindi)

Topics You Will Master

How the K-Means algorithm assigns data points to clusters and moves centroids to convergence
The four types of clustering algorithms and when each is appropriate
How to select the optimal number of clusters using the elbow method and SSE (inertia)
How to implement KMeans with scikit-learn, including feature scaling with StandardScaler
How to visualize cluster assignments and centroids with matplotlib
Best For

Python developers and data scientists who understand supervised learning basics and want to explore unsupervised techniques for grouping unlabelled data.

Expected Outcome

A trained K-Means model applied to a synthetic dataset and the Iris dataset, with optimal K chosen via the elbow method and cluster assignments visualized as scatter plots with centroids.

K-Means clustering is one of the most widely used unsupervised machine learning algorithms. Unlike supervised methods, it finds structure in data without any labels — making it ideal for customer segmentation, anomaly detection, and exploratory analysis.

The algorithm groups data into clusters by iteratively assigning each point to its nearest centroid and then recomputing each centroid as the mean of its assigned points. The word "means" in K-Means refers to this averaging step.

In this tutorial you will apply K-Means to a synthetic two-dimensional dataset and then to the Iris dataset. You will choose the right value of using the elbow method and visualize the resulting cluster assignments.

Prerequisites: Python 3.x, scikit-learn, Pandas, NumPy, Matplotlib, Seaborn.

Types of Clustering

Machine learning is broadly divided into supervised learning (labelled data), unsupervised learning (no labels), and semi-supervised learning (a mix of both). Clustering falls entirely in the unsupervised category.

Within clustering there are two assignment styles:

  • Hard clustering — each data point belongs to exactly one cluster.
  • Soft clustering — each data point has a probability of belonging to every cluster.

Types of Clustering Algorithms

There are four major families of clustering algorithms, each with a different mathematical idea of what makes a good cluster.

Connectivity-Based Clustering

Connectivity-based (hierarchical) clustering groups points that are spatially close to one another. The main assumption is that nearby points are more similar than distant ones. These methods are sensitive to outliers, which can appear as spurious clusters or cause real clusters to merge incorrectly.

The diagram below shows how a connectivity-based algorithm separates two groups by drawing a decision boundary between them:

Connectivity-based clustering diagram showing two colored point clouds separated by a boundary line

Centroid-Based Clustering

In centroid-based clustering, each cluster is represented by a central point called a centroid. The centroid is the mean position of all points in the cluster and does not need to be an actual data point. K-Means is the canonical centroid-based algorithm.

The diagram below shows two clusters, each with a centroid marker at its center of mass:

Centroid-based clustering diagram showing two clusters with centroid markers labeled Cluster 1, Cluster 2, and Centroids

Distribution-Based Clustering

Distribution-based methods model each cluster as a statistical distribution — most commonly a Gaussian. The Gaussian Mixture Model (GMM) fitted with the Expectation-Maximization algorithm is the best-known example. These models have strong theoretical foundations but can overfit when the number of clusters is too high.

The comparison below shows Original Data, K-Means, and EM clustering applied to the same "mouse" shaped dataset, illustrating how EM can follow curved boundaries:

Distribution-based clustering comparison showing original mouse-shaped data alongside k-means and EM clustering results

Density-Based Clustering

Density-based methods — such as DBSCAN — define clusters as dense regions of points separated by sparse regions. They can discover arbitrarily shaped clusters and naturally label sparse points as noise, unlike K-Means which always forces every point into a cluster.

The diagram below contrasts DBSCAN and K-Means on the same non-convex dataset, showing that DBSCAN correctly separates the irregular shapes while K-Means misassigns the overlapping region:

Comparison of DBSCAN and K-Means clustering on a non-convex dataset showing that DBSCAN correctly identifies irregular cluster shapes

Dataset and Problem Understanding

Start by importing the libraries you will need throughout this tutorial:

PYTHON
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load the synthetic dataset using pd.read_csv. The dataset has two numeric features, x and y, and a ground-truth cluster column you will use only for comparison:

PYTHON
data = pd.read_csv('data.csv', index_col = 0)
data.head()
OUTPUT
xycluster
0-8.482852-5.6033492
1-7.751632-8.4053342
2-10.967098-9.0327822
3-11.999447-7.6067342
4-1.73681010.4780151

Check the distribution of the ground-truth cluster labels with value_counts():

PYTHON
data['cluster'].value_counts()
OUTPUT
1    67
0    67
2    66
Name: cluster, dtype: int64

The dataset contains three roughly balanced groups: 67 points in clusters 0 and 1 each, and 66 in cluster 2.

Plot a scatter chart to see the natural groupings in the data. The %matplotlib inline magic you set earlier tells Jupyter to render the chart inside the notebook:

PYTHON
plt.scatter(data['x'], data['y'], c = data['cluster'], cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster')
plt.show()

The scatter plot below shows three clearly separated blobs colored by their ground-truth label:

Scatter plot titled Formation of cluster showing three distinct color-coded clusters of data points on X-Y axes

Training K-Means on the Dataset

The K-Means algorithm is a simple, efficient method capable of clustering data in just a few iterations. It is an unsupervised technique — it discovers structure without using the cluster label column.

Import KMeans and StandardScaler from scikit-learn:

PYTHON
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Separate the features from the label:

PYTHON
X = data[['x', 'y']]
y = data['cluster']

Standardize X before training. When features have different units or variances, K-Means — which relies on Euclidean distance — will be dominated by the larger-scale feature. StandardScaler transforms each feature to have mean 0 and standard deviation 1:

PYTHON
scaler = StandardScaler()
X = scaler.fit_transform(X)
X[: 5]
OUTPUT
array([[-1.01200363, -0.60606415],
       [-0.86550679, -1.04265203],
       [-1.5097118 , -1.14041707],
       [-1.71653856, -0.91821912],
       [ 0.33953731,  1.89963378]])

Write the standardized values back into the DataFrame so subsequent plots use the scaled coordinates:

PYTHON
data[['x', 'y']] = X

Fitting with k = 2

Start with k=2 to see how K-Means behaves when given fewer clusters than the true number. You will correct this after applying the elbow method:

PYTHON
k=2
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
PYTHON
KMeans(n_clusters=2, random_state=42)

Retrieve the two centroid coordinates the algorithm converged on:

PYTHON
center = kmeans.cluster_centers_
center
OUTPUT
array([[-1.30618271, -0.87560626],
       [ 0.64334372,  0.43126875]])

Plot the cluster assignments alongside the centroids, represented as red stars:

PYTHON
plt.scatter(data['x'], data['y'], c = kmeans.labels_, cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster with centroids')
for i, point in enumerate(center):
    plt.plot(center[i][0], center[i][1], '*r--', linewidth=2, markersize=18)

With only two clusters, the model merges what are actually three blobs. The two red star centroids are visible, but you can already see that the upper-right region contains points that should belong to a third group:

Scatter plot titled Formation of cluster with centroids showing two color-coded clusters and two red star markers at their centroid positions

Plot the ground-truth labels to confirm that three clusters exist in the data:

PYTHON
plt.scatter(data['x'], data['y'], c = data['cluster'], cmap = 'viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of 3-cluster')
plt.show()

The true three-cluster structure is now visible, confirming that k=2 was too low:

Scatter plot titled Formation of 3-cluster showing three distinct color-coded point clouds on X-Y axes

Choosing the Right Value of K

The K-Means convergence process follows four steps.

Assuming inputs :

  • Step 1 — Pick random points as initial cluster centers (centroids).
  • Step 2 — Assign each to the nearest centroid by Euclidean distance.
  • Step 3 — Recompute each centroid as the mean of all points assigned to it.
  • Step 4 — Repeat Steps 2 and 3 until no assignment changes.

The diagram below illustrates this convergence sequence from random initialization (Iteration 1) through successive refinements until the clusters stabilize:

Six-panel diagram showing K-Means centroid movement across iterations 1, 2, 3, 6, 9, and convergence with three colored clusters and cross-shaped centroid markers

When you already know the right number of clusters, use that value directly. When you do not, use the Elbow Method.

The Elbow Method

The elbow method plots SSE (Sum of Squared Errors, also called inertia) against different values of and looks for the point where the curve bends sharply — the "elbow." Beyond that point, adding more clusters yields diminishing returns.

Where:

  • — number of clusters
  • — the set of points assigned to cluster
  • — a data point in cluster
  • — the centroid (mean) of cluster
  • — squared Euclidean distance

The diagram below shows an example elbow curve annotated with an arrow pointing to the optimal :

Elbow method diagram showing SSE dropping steeply from k=1 to k=3 then flattening, with a red circle and arrow marking the elbow at k=3

Computing SSE for the Synthetic Dataset

Fit K-Means for every from 1 to 9 and record kmeans.inertia_ — scikit-learn's name for SSE — at each step:

PYTHON
SSE = []
index = range(1,10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)
OUTPUT
400.0000000000001
156.41078579574975
44.057048453292815
36.726387118666075
31.01642761531463
25.39959047192452
22.547184727743304
19.92343817847746
17.295836404824982

The inertia drops sharply from to , then the rate of decrease slows considerably. Plot the curve to see the elbow clearly:

PYTHON
plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SEE')
plt.title('SSE with respect to K')
plt.show()

The elbow in the SSE curve below falls at , confirming the three natural groups visible in the scatter plot:

Line chart titled SSE with respect to K showing a steep drop in inertia from k=1 to k=3 followed by a gradual flattening, indicating the optimal k is 3

Fitting with k = 3

Now re-train with the correct number of clusters and plot the result with three centroid markers:

PYTHON
k=3
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
KMeans(n_clusters=3, random_state=42)
center = kmeans.cluster_centers_
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of clusters with 3-centroids')

plt.scatter(data['x'], data['y'], c = kmeans.labels_, cmap = 'viridis')
for i, point in enumerate(center):
plt.plot(center[i][0], center[i][1], '*r--', linewidth=2, markersize=18)

With k=3, each blob is assigned its own cluster and the three red star centroids sit at the geometric center of each group:

Scatter plot titled Formation of clusters with 3-centroids showing three color-coded clusters with red star centroid markers at the center of each group

Applying K-Means to the Iris Dataset

To see how K-Means performs on a real-world dataset, apply the same workflow to the Iris dataset, which contains four measurements of 150 iris flowers across three species.

Import datasets from scikit-learn and load the Iris data:

PYTHON
from sklearn import datasets

Load and scale the features:

PYTHON
iris = datasets.load_iris()
X = iris.data
scaler = StandardScaler()
X = scaler.fit_transform(X)
iris.target_names
OUTPUT
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The Iris dataset contains three species, so you expect the elbow method to suggest . Compute SSE for from 1 to 9:

PYTHON
SSE = []
index = range(1,10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)
OUTPUT
600.0
222.36170496502302
139.82049635974974
114.41256181896094
90.92751382392049
80.0224959955744
71.81624598106144
62.28749580350205
54.8110520315013

Plot SSE versus to locate the elbow:

PYTHON
plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('Variation of SSE with respect to K')
plt.show()

The elbow curve for the Iris dataset bends most noticeably around , which aligns exactly with the three true species in the data:

Line chart titled Variation of SSE with respect to K for the Iris dataset showing inertia dropping steeply from k=1 to k=3 then flattening

Conclusion

In this tutorial you applied K-Means clustering to a synthetic three-blob dataset and to the Iris dataset. You used the elbow method — plotting SSE (inertia) against values of from 1 to 9 — to identify the optimal cluster count of 3 in both cases. After scaling the features with StandardScaler, the trained model clearly separated all three groups, with centroid markers confirming each cluster's geometric center.

Key takeaways:

  • K-Means is an unsupervised algorithm that iterates between assigning points to their nearest centroid and recomputing centroids as cluster means until convergence.
  • Always standardize features before running K-Means — the algorithm uses Euclidean distance, so unscaled features with large variance will dominate assignments.
  • The elbow method finds the right by plotting SSE against cluster count and selecting the value where marginal gains begin to flatten.
  • Inertia (kmeans.inertia_) is scikit-learn's name for SSE — the lower it is, the tighter the clusters, but excessively low inertia with high signals overfitting.
  • K-Means assumes convex, roughly equal-sized clusters; for irregular shapes consider density-based methods like DBSCAN.

Next steps:

  • Explore K-Nearest Neighbors to see how distance-based reasoning applies to a supervised classification problem.
  • Read Ensemble Learning for a broader view of combining multiple models to improve prediction quality.
  • Experiment with different values of init in KMeans ('k-means++' vs 'random') to see how centroid initialization affects convergence speed and final cluster quality.

Find this tutorial useful?

Subscribe to our YouTube channels for more practical production walk-throughs.

Discussion & Comments