# K-Means Clustering in Python | Machine Learning | KGP Talkie

## What is K-Means Clustering?

Machine Learning can broadly be classified into three types:

- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning

The `K-means` algorithm identifies `k` centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible (that is, keeping each point's distance to its centroid small). The 'means' in `K-means` refers to averaging the data; that is, finding the centroid.

## Types of clustering

- Hard clustering
- Soft clustering

## Types of Clustering Algorithms

#### Connectivity-based clustering

The main idea behind this type of `clustering` is that data points that are `closer` together in the data space are more `related (similar)` to each other than to data points farther away. These methods are not very robust to outliers, which might show up as additional `clusters` or even cause other `clusters` to merge.

#### Centroid-based clustering

In this type of clustering, clusters are represented by a central `vector`, or `centroid`. This `centroid` might not necessarily be a member of the dataset.
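As a small illustration of that last point, the sketch below computes the centroid of three made-up points (the values are chosen only for this example) and checks whether it coincides with any of them.

```python
import numpy as np

# Three made-up 2-D points forming one cluster (illustrative values only)
points = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])

# The centroid is the coordinate-wise mean of the cluster's points
centroid = points.mean(axis=0)
print(centroid)  # [1. 1.]

# Check whether the centroid coincides with any actual data point
is_member = bool((points == centroid).all(axis=1).any())
print(is_member)  # False
```

Here the centroid `(1, 1)` lies inside the triangle formed by the points but is not one of them.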

#### Distribution-based clustering

These models have a strong theoretical foundation; however, they often suffer from overfitting. Gaussian mixture models, fitted with the expectation-maximization algorithm, are a well-known distribution-based clustering method.

#### Density-based clustering

`Density-based methods` search the data space for areas of varied density of data points.

## Dataset and Problem Understanding

First, we import the libraries we need to work with the given dataset: `pandas`, which offers data structures and operations for manipulating numerical tables; `seaborn` and `matplotlib` for data visualization; and `numpy` for working with arrays.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

Fetching the data using `pandas`.

```python
data = pd.read_csv('data.csv', index_col=0)
data.head()
```

|   | x | y | cluster |
|---|---|---|---|
| 0 | -8.482852 | -5.603349 | 2 |
| 1 | -7.751632 | -8.405334 | 2 |
| 2 | -10.967098 | -9.032782 | 2 |
| 3 | -11.999447 | -7.606734 | 2 |
| 4 | -1.736810 | 10.478015 | 1 |

The table above shows the top five rows, obtained with the DataFrame's `head()` method. You can find out which categories exist and how many instances (rows) belong to each category by using the `value_counts()` method.

```python
data['cluster'].value_counts()
```

```
1    67
0    67
2    66
Name: cluster, dtype: int64
```

So, cluster `1` and cluster `0` contain 67 data points each, and cluster `2` contains `66` data points.

Now we can draw a `scatter plot` of the dataset and visualize the clusters formed.

We will use the `matplotlib` library, which we have already imported above, to plot the dataset. Before you can plot anything, you need to specify which backend `Matplotlib` should use. The simplest option is to use Jupyter's magic command `%matplotlib inline`. This tells Jupyter to set up Matplotlib so it uses Jupyter's own backend.

```python
plt.scatter(data['x'], data['y'], c=data['cluster'], cmap='viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster')
plt.show()
```

## K-Means for clustering

The `K-Means algorithm` is a simple algorithm capable of clustering this kind of dataset very quickly and efficiently, often in just a few iterations. It is an `unsupervised machine learning technique`.

Let's train a `K-Means` clusterer on this dataset. It will try to find each `blob's center` and assign each instance to the closest blob.

First, we import `KMeans` from `sklearn.cluster`.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = data[['x', 'y']]
y = data['cluster']
```

We `standardize` the dataset before training the algorithm because the variables can be in incomparable units (e.g. one in cm, another in kg). K-Means relies on distances, so a variable with a larger scale would otherwise dominate the result. It is also good practice to standardize the data when the variables have quite different variances.

```python
scaler = StandardScaler()
X = scaler.fit_transform(X)
X[:5]
```

```
array([[-1.01200363, -0.60606415],
       [-0.86550679, -1.04265203],
       [-1.5097118 , -1.14041707],
       [-1.71653856, -0.91821912],
       [ 0.33953731,  1.89963378]])
```
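To make the standardization step more concrete, here is a minimal sketch on a made-up feature matrix (the cm and kg values are invented for illustration). It compares `StandardScaler` with the manual calculation: subtract each column's mean and divide by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: column 0 in cm, column 1 in kg (illustrative only)
raw = np.array([[150.0, 50.0],
                [160.0, 65.0],
                [170.0, 80.0],
                [180.0, 95.0]])

scaled = StandardScaler().fit_transform(raw)

# Same calculation by hand: subtract the column mean, then divide by the
# column standard deviation (population version, ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither unit dominates the distance calculation.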

Now we write the standardized values back into our dataset and train the algorithm.

```python
data[['x', 'y']] = X
```

For the time being, we take the `number of clusters k` to be two.

Note that you have to specify the number of clusters `k` that the algorithm must find. In this dataset, it is pretty obvious from looking at the data that `k` should be set to `3`, but we choose `2` for now. In general, it is not that easy to find the number of clusters.

Later we will see a method for finding the optimal number of clusters for a dataset.

```python
k = 2
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
```

```
KMeans(n_clusters=2, random_state=42)
```

Let's take a look at the two centroids that the algorithm found:

```python
center = kmeans.cluster_centers_
center
```

```
array([[-1.30618271, -0.87560626],
       [ 0.64334372,  0.43126875]])
```

Now we plot the dataset and each centroid.

```python
plt.scatter(data['x'], data['y'], c=kmeans.labels_, cmap='viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of cluster with centroids')
# Mark each centroid with a large red star
for point in center:
    plt.plot(point[0], point[1], '*r--', linewidth=2, markersize=18)
plt.show()
```

With the number of clusters set to 2, the `centroids` (marked as red stars above) are pretty obvious.

For comparison, we plot the dataset again, this time colored by the original `cluster` column:

```python
plt.scatter(data['x'], data['y'], c=data['cluster'], cmap='viridis')
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of 3-cluster')
plt.show()
```

Here we can easily see `3 clusters`. So, now we will see how to find the right value of K (the number of clusters).

## How do I choose the right value of k?

First, you need to understand how the algorithm works.

Assume we have inputs x_1, x_2, x_3, …, x_n.

**Step 1:** Pick K random points as cluster centers, called centroids.

**Step 2:** Assign each x_i to the nearest cluster by calculating its distance to each centroid.

**Step 3:** Find the new cluster centers by taking the average of the assigned points.

**Step 4:** Repeat Steps 2 and 3 until none of the cluster assignments change.
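The four steps above can be sketched in plain NumPy. This is only an illustrative sketch, not scikit-learn's implementation: the helper name `kmeans_sketch`, its parameters, and the demo data are all made up for this example, and it assumes the initial centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain NumPy version of Steps 1-4 (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop once no assignment changes between iterations
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated made-up blobs, 50 points each
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                    rng.normal(10, 0.5, size=(50, 2))])
centroids, labels = kmeans_sketch(X_demo, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should end up near the two blob centers. In practice the result depends on the random starting points, which is one reason library implementations typically try several initializations.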

**Most importantly, when should we stop increasing K?**

If we already know the value of K, we simply use it. Otherwise, we use the `Elbow Method`.

### Error Sum of Squares (SSE)

It is the sum of the `squared` differences between each observation and its group's mean. We can use it as a measure of variation within a `cluster`. If all cases within a cluster are identical, the `SSE` is equal to `0`. We run the algorithm for different values of K (say, K = 1 to 10), plot the K values against the `SSE (Sum of Squared Errors)`, and select the value of K at the elbow point, as shown in the figure.
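The SSE definition at the start of this section can be written compactly. The notation is an assumption introduced here, not taken from the original text: $C_j$ is the set of points assigned to cluster $j$, and $\mu_j$ is that cluster's centroid.

```latex
\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

If every point in a cluster equals its centroid, each term is zero, which matches the statement that identical cases give an SSE of 0.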

So we try values of k from 1 to 9 and use the `elbow method` as explained above.

```python
SSE = []
index = range(1, 10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)
```

```
400.0000000000001
156.41078579574975
44.057048453292815
36.726387118666075
31.01642761531463
25.39959047192452
22.547184727743304
19.92343817847746
17.295836404824982
```

Here the `inertia` metric is the sum of squared distances between each instance and its closest centroid, which is the SSE described above.
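A quick way to see what `inertia_` measures is to recompute it by hand. The sketch below uses synthetic data from `make_blobs` as a stand-in (an assumption made so the example runs without `data.csv`). It compares `inertia_` with the sum of squared distances from each point to its assigned centroid. The names `X_demo`, `km`, and `manual_sse` are made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (the article's data.csv is not assumed to be available)
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)

# Look up the centroid assigned to every point, then sum the squared distances
assigned_centers = km.cluster_centers_[km.labels_]
manual_sse = ((X_demo - assigned_centers) ** 2).sum()

print(manual_sse, km.inertia_)
```

The two printed numbers should agree up to floating-point rounding.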

```python
plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('SSE with respect to K')
plt.show()
```

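From the graph, we can observe that `SSE` decreases as `K` increases. The drop is steep up to K = 3 (from about 156.4 at K = 2 to about 44.1 at K = 3) and much flatter afterwards, which puts the elbow at K = 3. As mentioned before, the value of k should therefore be `3`. Now let's do it again with k=3.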

```python
k = 3
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)
```

```
KMeans(n_clusters=3, random_state=42)
```

```python
center = kmeans.cluster_centers_
plt.xlabel('X-values')
plt.ylabel('Y-values')
plt.title('Formation of clusters with 3-centroids')
plt.scatter(data['x'], data['y'], c=kmeans.labels_, cmap='viridis')
# Mark each centroid with a large red star
for point in center:
    plt.plot(point[0], point[1], '*r--', linewidth=2, markersize=18)
plt.show()
```
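One rough way to read the elbow off the numbers instead of the plot is to compare how much the SSE drops at each step. The sketch below copies the SSE values printed earlier for this dataset and picks the K after which the drop shrinks the most. This ratio rule is an illustrative heuristic chosen here, not a scikit-learn feature or part of the original analysis, and the names `drops`, `ratios`, and `elbow_k` are made up for the example.

```python
# SSE values copied from the printed output earlier (K = 1..9)
sse = [400.0000000000001, 156.41078579574975, 44.057048453292815,
       36.726387118666075, 31.01642761531463, 25.39959047192452,
       22.547184727743304, 19.92343817847746, 17.295836404824982]

# drops[i] is the SSE reduction when going from K = i+1 to K = i+2
drops = [sse[i] - sse[i + 1] for i in range(len(sse) - 1)]
for i, d in enumerate(drops):
    print(f"K={i + 1} -> K={i + 2}: SSE drops by {d:.2f}")

# ratios[i] compares the drop into K = i+2 with the drop that follows it;
# a large ratio means the curve flattens sharply right after K = i+2
ratios = [drops[i] / drops[i + 1] for i in range(len(drops) - 1)]
elbow_k = ratios.index(max(ratios)) + 2
print("Elbow at K =", elbow_k)
```

For these values the heuristic agrees with the visual reading of the plot. On noisier curves it can be misleading, so treat it as a cross-check rather than a replacement for looking at the graph.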

## Let’s go ahead and explore it in a little bit more detail

Now we apply the same elbow analysis to the Iris dataset.

```python
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
scaler = StandardScaler()
X = scaler.fit_transform(X)
iris.target_names
```

```
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
```

```python
SSE = []
index = range(1, 10)
for i in index:
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(X)
    SSE.append(kmeans.inertia_)
    print(kmeans.inertia_)
```

```
600.0
222.36170496502302
139.82049635974974
114.41256181896094
90.92751382392049
80.0224959955744
71.81624598106144
62.28749580350205
54.8110520315013
```

```python
plt.plot(index, SSE)
plt.xlabel('K')
plt.ylabel('SSE')
plt.title('Variation of SSE with respect to K')
plt.show()
```