Unsupervised Learning7 min read

K-Means Clustering

Find natural groups by moving centroids closer
algorithm:Iterative Β· assign then update Β· repeattime complexity:O(n Γ— k Γ— i) Β· n points, k clusters, i iterationstype:Unsupervised Β· no labels needed

You're throwing a house party and you've set up 3 food stations: pizza, tacos, and sushi. At the start, you just guess where to put them β€” one in the living room, one in the kitchen, one on the patio.

As people arrive, they naturally cluster around their favorite food. You notice 20 people crowded around the pizza in the living room, but only 3 near the tacos in the kitchen β€” because most taco-lovers are actually hanging out on the patio.

So you move the taco station to the center of where the taco crowd actually is. Then people shift again, and you adjust again. After a few rounds, the stations settle into the perfect spots that minimize everyone's walking distance.

That's K-Means clustering. Place k center points (centroids), assign each data point to its nearest centroid, move the centroids to the center of their assigned points, and repeat until nothing changes.

The algorithm, step by step

  1. Choose k β€” decide how many clusters you want (that's the "K" in K-Means)
  2. Initialize β€” randomly place k centroids in your data space
  3. Assign β€” each data point joins the cluster of the nearest centroid
  4. Update β€” move each centroid to the mean (average) position of all its points
  5. Repeat steps 3-4 until the centroids stop moving (convergence)

How do you pick k?

The most popular method is the Elbow Method. Run K-Means with k=1, 2, 3, 4... and plot the total distance of points from their centroids (called inertia). The plot looks like a bent arm β€” the "elbow" where the curve stops improving dramatically is usually the right k.

K-Means Clustering in Action

from sklearn.cluster import KMeans
import numpy as np
# Generate sample data: 3 natural groups
np.random.seed(42)
group1 = np.random.randn(30, 2) + [0, 0] # centered at (0,0)
group2 = np.random.randn(30, 2) + [5, 5] # centered at (5,5)
group3 = np.random.randn(30, 2) + [10, 0] # centered at (10,0)
X = np.vstack([group1, group2, group3])
# Run K-Means with k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print(f"Cluster centers:\n{np.round(kmeans.cluster_centers_, 2)}")
print(f"Labels (first 10): {kmeans.labels_[:10]}")
print(f"Inertia: {kmeans.inertia_:.2f}")
Output
Cluster centers:
[[ 0.05 -0.11]
 [ 5.01  5.12]
 [ 9.92 -0.03]]
Labels (first 10): [0 0 0 0 0 0 0 0 0 0]
Inertia: 168.43

Limitations to watch for

  • You must choose k upfront β€” the algorithm won't figure out the number of clusters for you
  • Sensitive to initialization β€” bad starting positions can lead to bad clusters (fix: run it multiple times with different starts, like sklearn's n_init=10)
  • Assumes spherical clusters β€” struggles with elongated or oddly-shaped groups
  • Sensitive to scale β€” always normalize your data first, or a feature with large values will dominate the distance calculation

Real-world uses

  • Customer segmentation β€” group shoppers by behavior for targeted marketing
  • Image compression β€” reduce millions of colors to k representative colors
  • Anomaly detection β€” points far from any centroid might be outliers
Note: K-Means is unsupervised β€” no labels needed. You give it data and say 'find 3 groups.' It discovers the structure on its own. This makes it incredibly useful when you don't have labeled data (which is most real-world scenarios).

Key Metrics

πŸ“ Assignment Step
Compute distance from every point to every centroid
O(n Γ— k) Each point checks k centroids
πŸ”„ Update Step
Recompute centroid as mean of assigned points
O(n Γ— d) Average d dimensions for n points
⏱️ Total Runtime
Usually converges in 10-50 iterations
O(n Γ— k Γ— i Γ— d) i iterations, d dimensions
πŸ’Ύ Space
Very memory-efficient
O(n + k) Store points + centroids

Quick check

In K-Means, what does the 'update' step do?
Challenge

Continue reading