Unsupervised Learning7 min read

K-Means Clustering

Find natural groups by moving centroids closer

algorithm:Iterative · assign then update · repeattime complexity:O(n × k × i) · n points, k clusters, i iterationstype:Unsupervised · no labels needed

You're throwing a house party and you've set up 3 food stations: pizza, tacos, and sushi. At the start, you just guess where to put them — one in the living room, one in the kitchen, one on the patio.

As people arrive, they naturally cluster around their favorite food. You notice 20 people crowded around the pizza in the living room, but only 3 near the tacos in the kitchen — because most taco-lovers are actually hanging out on the patio.

So you move the taco station to the center of where the taco crowd actually is. Then people shift again, and you adjust again. After a few rounds, the stations settle into the perfect spots that minimize everyone's walking distance.

That's K-Means clustering. Place k center points (centroids), assign each data point to its nearest centroid, move the centroids to the center of their assigned points, and repeat until nothing changes.

The algorithm, step by step

Choose k — decide how many clusters you want (that's the "K" in K-Means)
Initialize — randomly place k centroids in your data space
Assign — each data point joins the cluster of the nearest centroid
Update — move each centroid to the mean (average) position of all its points
Repeat steps 3-4 until the centroids stop moving (convergence)

How do you pick k?

The most popular method is the Elbow Method. Run K-Means with k=1, 2, 3, 4... and plot the total distance of points from their centroids (called inertia). The plot looks like a bent arm — the "elbow" where the curve stops improving dramatically is usually the right k.

K-Means Clustering in Action

from sklearn.cluster import KMeans
import numpy as np

# Generate sample data: 3 natural groups
np.random.seed(42)
group1 = np.random.randn(30, 2) + [0, 0]   # centered at (0,0)
group2 = np.random.randn(30, 2) + [5, 5]   # centered at (5,5)
group3 = np.random.randn(30, 2) + [10, 0]  # centered at (10,0)
X = np.vstack([group1, group2, group3])

# Run K-Means with k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)

print(f"Cluster centers:\n{np.round(kmeans.cluster_centers_, 2)}")
print(f"Labels (first 10): {kmeans.labels_[:10]}")
print(f"Inertia: {kmeans.inertia_:.2f}")

Output

Cluster centers:
[[ 0.05 -0.11]
 [ 5.01  5.12]
 [ 9.92 -0.03]]
Labels (first 10): [0 0 0 0 0 0 0 0 0 0]
Inertia: 168.43

Limitations to watch for

You must choose k upfront — the algorithm won't figure out the number of clusters for you
Sensitive to initialization — bad starting positions can lead to bad clusters (fix: run it multiple times with different starts, like sklearn's n_init=10)
Assumes spherical clusters — struggles with elongated or oddly-shaped groups
Sensitive to scale — always normalize your data first, or a feature with large values will dominate the distance calculation

Real-world uses

Customer segmentation — group shoppers by behavior for targeted marketing
Image compression — reduce millions of colors to k representative colors
Anomaly detection — points far from any centroid might be outliers

Note: K-Means is unsupervised — no labels needed. You give it data and say 'find 3 groups.' It discovers the structure on its own. This makes it incredibly useful when you don't have labeled data (which is most real-world scenarios).