K-Means Clustering
You're throwing a house party and you've set up 3 food stations: pizza, tacos, and sushi. At the start, you just guess where to put them: one in the living room, one in the kitchen, one on the patio.
As people arrive, they naturally cluster around their favorite food. You notice 20 people crowded around the pizza in the living room, but only 3 near the tacos in the kitchen, because most taco-lovers are actually hanging out on the patio.
So you move the taco station to the center of where the taco crowd actually is. Then people shift again, and you adjust again. After a few rounds, the stations settle into the perfect spots that minimize everyone's walking distance.
That's K-Means clustering. Place k center points (centroids), assign each data point to its nearest centroid, move the centroids to the center of their assigned points, and repeat until nothing changes.
The algorithm, step by step
- Choose k: decide how many clusters you want (that's the "K" in K-Means)
- Initialize: randomly place k centroids in your data space
- Assign: each data point joins the cluster of the nearest centroid
- Update: move each centroid to the mean (average) position of all its points
- Repeat the Assign and Update steps until the centroids stop moving (convergence)
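The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the max_iters and seed parameters, and the toy points are all assumptions made for the example, and the centroids are initialized by sampling k of the data points (one common way to "randomly place" them).

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Converged: no point changed cluster, so the centroids would not move.
        if new_labels == labels:
            break
        labels = new_labels
        # Update: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data (hypothetical): two tight groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))  # one centroid near (1.17, 1.5), the other at (8.5, 8.5)
```

On data this cleanly separated, any starting pair ends up in the same place. On messier data the result can depend on the seed, which is why libraries usually run several restarts.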
How do you pick k?
The most popular method is the Elbow Method. Run K-Means with k=1, 2, 3, 4... and plot the sum of squared distances from each point to its centroid (called inertia). The plot looks like a bent arm: the "elbow", where adding another cluster stops improving inertia dramatically, is usually a good choice for k.
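Here is a rough sketch of the Elbow Method on made-up data with three obvious groups. To keep the numbers reproducible, this kmeans helper swaps random initialization for a deterministic farthest-point start (begin at the first point, then keep adding the point farthest from the centroids chosen so far). That swap, the helper names, and the data are all assumptions for illustration.

```python
import math


def kmeans(points, k, max_iters=100):
    # Farthest-point initialization: a deterministic stand-in for random starts.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iters):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


def inertia(points, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid.
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Three small square blobs with corners at (0, 0), (10, 0) and (5, 9).
corners = [(0, 0), (10, 0), (5, 9)]
points = [(x + dx, y + dy) for x, y in corners for dy in (0, 1) for dx in (0, 1)]

inertias = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    inertias[k] = inertia(points, centroids, labels)
    print(k, round(inertias[k], 2))
```

Inertia falls steeply up to k=3 and barely moves afterwards, so the elbow sits at k=3, matching the three blobs the data was built from. Real data rarely bends this sharply, so treat the elbow as a hint rather than a rule.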
Limitations to watch for
- You must choose k upfront: the algorithm won't figure out the number of clusters for you
- Sensitive to initialization: bad starting positions can lead to bad clusters (fix: run it multiple times with different starts, like sklearn's n_init=10)
- Assumes spherical clusters: struggles with elongated or oddly-shaped groups
- Sensitive to scale: always normalize your data first, or a feature with large values will dominate the distance calculation
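To make the scale problem concrete, here is a small sketch with made-up shoppers described by income (in dollars) and age (in years). It compares each shopper's nearest neighbor before and after z-score standardization, which is one common way to normalize. The data and the helper names standardize and nearest_other are assumptions for illustration.

```python
import math
import statistics

# Hypothetical shoppers: (annual income in dollars, age in years).
shoppers = [(30000, 25), (31000, 65), (34000, 27), (35000, 63)]


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1 (z-scores).

    Assumes no column is constant; a zero standard deviation would divide by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_other(rows, i):
    """Index of the row closest to rows[i], excluding rows[i] itself."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))


raw_nearest = nearest_other(shoppers, 0)
scaled_nearest = nearest_other(standardize(shoppers), 0)
print(raw_nearest, scaled_nearest)  # 1 2
```

On the raw numbers, the 25-year-old's nearest neighbor is the 65-year-old with a similar income, because a $1,000 gap dwarfs a 40-year gap. After standardizing, the nearest neighbor is the 27-year-old, and K-Means would group the shoppers accordingly.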
Real-world uses
- Customer segmentation: group shoppers by behavior for targeted marketing
- Image compression: reduce millions of colors to k representative colors
- Anomaly detection: points far from any centroid might be outliers
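As a quick illustration of the anomaly detection idea, the sketch below flags points whose distance to the nearest centroid exceeds a cutoff. The centroids are assumed to come from an earlier K-Means run, and the points and the threshold of 2.0 are hypothetical. In practice the cutoff would be tuned to the data, for example from the distribution of distances.

```python
import math

# Assumed output of an earlier K-Means fit (hypothetical values).
centroids = [(1.0, 1.0), (8.0, 8.0)]
points = [(1.2, 0.9), (0.8, 1.1), (8.1, 7.9), (7.7, 8.3), (4.5, 4.5)]
threshold = 2.0  # hypothetical cutoff; tune it for real data


def nearest_centroid_distance(point, centroids):
    """Distance from a point to the closest centroid."""
    return min(math.dist(point, c) for c in centroids)


# A point is flagged when even its closest centroid is farther than the cutoff.
outliers = [
    p for p in points if nearest_centroid_distance(p, centroids) > threshold
]
print(outliers)  # [(4.5, 4.5)]
```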