PCA (Principal Component Analysis)
Imagine you take a photo of a mountain range. The original image is 10 megapixels: millions of tiny dots of color. Now compress it to a thumbnail. You lose some detail, but you can still clearly see the mountains, the sky, the trees. The important shapes are preserved; only the fine grain is gone.
PCA does something similar to data. If you have a dataset with 100 features (columns), PCA can squash it down to, say, 10 features while keeping most of the meaningful information. The new features are linear combinations of the originals, chosen to capture the maximum amount of variance in the data.
Why reduce dimensions?
- Speed: fewer features mean faster training. Going from 1,000 features to 50 can cut training time dramatically.
- Noise reduction: the low-variance directions PCA discards are often mostly noise, so dropping them can improve model performance.
- Visualization: you can't plot 100 dimensions, but you can plot 2 or 3. PCA lets you see your data.
- Curse of dimensionality: in high-dimensional space, all points end up far apart and nearly equidistant, so models struggle. Reducing dimensions helps.
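To make the curse of dimensionality concrete, here is a minimal sketch, assuming NumPy and uniformly random points (the function name `distance_contrast` and the sample sizes are illustrative choices, not part of any library). It compares the nearest and farthest distances from one query point as the number of dimensions grows.

```python
import numpy as np

def distance_contrast(n_dims, n_points=1000, seed=0):
    """Return the nearest/farthest distance ratio from one random query point.

    A ratio near 0 means some neighbours are much closer than others;
    a ratio near 1 means every point is about equally far away.
    """
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

for d in (2, 10, 100, 1000):
    print(f"{d:>4} dims: nearest/farthest = {distance_contrast(d):.2f}")
```

The ratio should climb toward 1 as dimensions are added, which is why "nearest neighbour" becomes less meaningful in very wide datasets.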
How PCA works (intuitively)
Think of a cloud of data points in 3D space. PCA asks: "If I had to flatten this 3D cloud onto a 2D sheet of paper, what angle should I hold the paper to capture the most spread?"
- Find the direction with the most variance (most spread). That's PC1 (Principal Component 1).
- Find the next direction with the most variance, perpendicular to PC1. That's PC2.
- Keep going for as many components as you need.
Each component is a new axis, a blend of the original features, oriented to capture as much of the remaining variance as possible.
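The steps above can be sketched in a few lines of NumPy. This is one common way to compute PCA (eigen-decomposition of the covariance matrix), not the only one; the function name `pca` and the synthetic 3D cloud are assumptions made for illustration.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    # Center each feature so variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    # Rows of `components` are PC1, PC2, ... (mutually perpendicular unit vectors).
    components = eigenvectors[:, order[:n_components]].T
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ components.T, components, explained_ratio

# Hypothetical 3D cloud: wide along one axis, medium along another, thin along the third.
rng = np.random.default_rng(42)
cloud = rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.3])

projected, components, explained_ratio = pca(cloud, n_components=2)
print(projected.shape)   # the 3D cloud flattened onto a 2D "sheet of paper"
print(explained_ratio)   # share of total variance captured by PC1 and PC2
```

Because the cloud is deliberately thin along one axis, dropping the third component should lose very little of the spread.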
PCA: From High to Low Dimensions
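As a concrete case, here is a minimal sketch that reduces the classic Iris dataset from 4 features to 2. It assumes scikit-learn is installed and uses its `load_iris` and `PCA`. The features are left unscaled, which is reasonable here only because all four are lengths in centimetres.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data              # 150 samples, 4 features (all in cm)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)       # each flower is now described by 2 numbers

print(X.shape, "->", X_2d.shape)
print(pca.explained_variance_ratio_)   # share of variance per component
total = pca.explained_variance_ratio_.sum()
print(f"variance kept: {total:.1%}")
```

On the raw measurements the two components should retain roughly 97.8% of the variance. Standardizing the features first typically gives a somewhat lower figure, around 96%.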
How many components to keep?
There's no single right answer, but a common rule: keep enough components to explain 90-95% of the total variance. Plot the cumulative explained variance and look for where the curve flattens out.
In the Iris example above, just 2 components captured 97.8% of the variance from 4 original features. That's remarkably efficient: we lost only 2.2% of the variance while cutting the number of dimensions in half.
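The 90-95% rule can be turned into a small helper. The sketch below assumes NumPy only; the name `components_for_variance` and the synthetic dataset (20 features driven by 3 hidden factors) are hypothetical, chosen so that most of the variance sits in the first few components.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to each component's variance.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variance_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(variance_ratio)
    # Index of the first point on the curve at or above the threshold, converted to a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

# Hypothetical data: 20 observed features driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 20))
X = factors @ mixing + 0.1 * rng.normal(size=(300, 20))

k, cumulative = components_for_variance(X, threshold=0.95)
print("components needed:", k)
print("cumulative variance:", np.round(cumulative[:5], 3))
```

Plotting `cumulative` against the component index gives the curve described above; the point where it flattens is where extra components stop adding much.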
Things to watch out for
- Scale your data first: PCA is sensitive to feature scales. A feature measured in millions will dominate one measured in decimals. Standardize (zero mean, unit variance) before fitting PCA.
- Interpretability: the new components are blends of original features, so they're harder to interpret. "PC1" doesn't have a clean real-world meaning like "height" or "income."
- Linear only: PCA captures linear relationships. For non-linear structure, look into t-SNE or UMAP (but these are mainly used for visualization rather than as model inputs).
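The scaling warning above can be checked with a small sketch, assuming NumPy and two made-up correlated features (an income in dollars and a score near 0.5). It compares the direction of PC1 before and after standardizing; the helper `first_component` is illustrative, not a library function.

```python
import numpy as np

def first_component(X):
    """Unit-length direction of maximum variance (PC1) for the rows of X."""
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return vt[0]

# Hypothetical data: both features follow the same hidden signal z.
rng = np.random.default_rng(1)
z = rng.normal(size=500)
income = 60_000 + 15_000 * z                          # spread in the tens of thousands
score = 0.5 + 0.1 * z + 0.05 * rng.normal(size=500)   # spread around a tenth
X = np.column_stack([income, score])

pc1_raw = first_component(X)

# Standardize: zero mean and unit variance for every feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = first_component(X_std)

print("PC1 weights, raw:         ", np.round(np.abs(pc1_raw), 3))
print("PC1 weights, standardized:", np.round(np.abs(pc1_std), 3))
```

On the raw data, PC1 should point almost entirely along `income` simply because its numbers are larger. After standardizing, both features should contribute about equally.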