PCA (Principal Component Analysis)

Squash 100 features down to the ones that matter
Goal: Reduce dimensions · keep maximum variance
Method: Find directions of greatest spread
Result: Fewer features · faster training · less noise

Imagine you take a photo of a mountain range. The original image is 10 megapixels: millions of tiny dots of color. Now compress it to a thumbnail. You lose some detail, but you can still clearly see the mountains, the sky, the trees. The important shapes are preserved; only the fine grain is gone.

PCA does the same thing to data. If you have a dataset with 100 features (columns), PCA can squash it down to, say, 10 features while keeping most of the meaningful information. The new features are linear combinations of the originals, chosen to capture as much of the variation in the data as possible.

Why reduce dimensions?

  • Speed: fewer features mean faster training. Going from 1,000 features to 50 can cut training time dramatically.
  • Noise reduction: low-variance directions are often mostly noise. PCA discards them, which can improve model performance.
  • Visualization: you can't plot 100 dimensions, but you can plot 2 or 3. PCA lets you see your data.
  • Curse of dimensionality: in high-dimensional space, points are far apart and distances become less informative, so models struggle. Reducing dimensions helps.
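
To make the curse of dimensionality a little more concrete, here is a small sketch. It is an illustration under assumptions, not part of PCA itself: the helper name distance_contrast, the 200 uniform random points, and the fixed seed are all made up for the demo. It compares the nearest and farthest neighbor of one point as the number of dimensions grows.

```python
import numpy as np

def distance_contrast(n_dims, n_points=200, seed=0):
    """Nearest / farthest distance from one random point to all the others."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.min() / dists.max()

# A ratio near 0 means "some points are much closer than others".
# A ratio near 1 means "every point is roughly equally far away".
for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: nearest/farthest = {distance_contrast(d):.2f}")
```

As the dimension count rises, the ratio should creep toward 1: the nearest point is barely closer than the farthest one, which is why distance-based models have a hard time.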

How PCA works (intuitively)

Think of a cloud of data points in 3D space. PCA asks: "If I had to flatten this 3D cloud onto a 2D sheet of paper, what angle should I hold the paper to capture the most spread?"

  1. Find the direction with the most variance (most spread). That's PC1 (Principal Component 1).
  2. Find the next direction with the most variance, perpendicular to PC1. That's PC2.
  3. Keep going for as many components as you need.

Each component is a new axis, a blend of the original features, oriented to capture as much of the remaining variance as possible.
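
The three steps above can be sketched by hand with NumPy. This is a minimal version that assumes the classic covariance-plus-eigenvectors route; scikit-learn may use a different solver internally, and variable names such as X_manual are just for illustration. The last line checks the hand-rolled result against scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Center the data so variance is measured around the mean
X_centered = X - X.mean(axis=0)

# Covariance matrix of the features (4 x 4 for Iris)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors are candidate directions; eigenvalues are the variance along each
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns ascending order, so sort descending: PC1 first, then PC2, ...
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the top k directions and project the data onto them
k = 2
components = eigenvectors[:, :k]        # shape (4, 2)
X_manual = X_centered @ components      # shape (150, 2)

# PC1 and PC2 are perpendicular, so their dot product is (numerically) zero
pc_dot = components[:, 0] @ components[:, 1]
print(f"PC1 . PC2 = {pc_dot:.6f}")

# The sign of each component is arbitrary, so compare absolute values
X_sklearn = PCA(n_components=k).fit_transform(X)
matches = np.allclose(np.abs(X_manual), np.abs(X_sklearn))
print("Matches scikit-learn (up to sign):", matches)
```

The sign caveat matters in practice: flipping a component's direction describes the same axis, so two correct implementations can disagree on signs.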

PCA: From High to Low Dimensions

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import numpy as np

# Iris dataset: 4 features
X, y = load_iris(return_X_y=True)
print(f"Original shape: {X.shape}") # 150 samples, 4 features

# Reduce to 2 dimensions
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Reduced shape: {X_reduced.shape}")

# How much info did we keep?
print(f"\nVariance explained per component: {np.round(pca.explained_variance_ratio_, 3)}")
print(f"Total variance kept: {sum(pca.explained_variance_ratio_):.1%}")

# The first 2 points in the new 2D space
print(f"\nFirst 2 samples (2D): {np.round(X_reduced[:2], 2)}")
Output
Original shape: (150, 4)
Reduced shape: (150, 2)

Variance explained per component: [0.925 0.053]
Total variance kept: 97.8%

First 2 samples (2D): [[-2.68  0.32]
 [-2.71 -0.18]]

How many components to keep?

There's no single right answer, but a common rule: keep enough components to explain 90-95% of the total variance. Plot the cumulative explained variance and look for where the curve flattens out.
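
Here is one way the rule of thumb could look in code, again on Iris. It prints the cumulative variance instead of plotting it to keep the example dependency-free, and the 0.95 threshold is just an example value. The last part relies on scikit-learn treating a float n_components between 0 and 1 as a variance target.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Fit with all components so we can see the full variance breakdown
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for i, total in enumerate(cumulative, start=1):
    print(f"{i} component(s): {total:.1%} of variance")

# Smallest number of components that reaches the threshold
threshold = 0.95
n_keep = int(np.argmax(cumulative >= threshold)) + 1
print(f"Components needed for {threshold:.0%}: {n_keep}")

# Let scikit-learn pick: a float in (0, 1) is read as a variance target
pca_auto = PCA(n_components=threshold).fit(X)
print(f"PCA(n_components={threshold}) kept: {pca_auto.n_components_}")
```

Both approaches should agree on the number of components. The manual version is handy when you want to inspect the whole curve before committing to a threshold.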

In the Iris example above, just 2 components captured 97.8% of the variance from 4 original features. That's remarkably efficient: we gave up only 2.2% of the variance while cutting the dimensions in half.
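
One way to see what "variance lost" means is to map the 2D points back to 4D and measure how far they land from the originals. The sketch below uses inverse_transform for the round trip; the names X_restored and lost_fraction are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Map the 2D points back into the original 4D feature space
X_restored = pca.inverse_transform(X_reduced)

# Squared error of the round trip, relative to the total spread around the mean
reconstruction_error = np.sum((X - X_restored) ** 2)
total_spread = np.sum((X - X.mean(axis=0)) ** 2)
lost_fraction = reconstruction_error / total_spread

kept_fraction = pca.explained_variance_ratio_.sum()
print(f"Variance kept: {kept_fraction:.1%}")
print(f"Variance lost: {lost_fraction:.1%}")
```

The two numbers add up to 100%: the variance PCA did not keep is exactly the part of the data that the restored points fail to reproduce.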

Things to watch out for

  • Scale your data first: PCA is sensitive to feature scales. A feature measured in millions will dominate one measured in decimals. Standardize first unless all features already share the same units and a comparable range (as the Iris measurements, all in centimeters, do).
  • Interpretability: the new components are blends of original features, so they're harder to interpret. "PC1" doesn't have a clean real-world meaning like "height" or "income."
  • Linear only: PCA captures linear relationships. For non-linear structure, look into t-SNE or UMAP (these are used mainly for visualization rather than as model inputs).
Note: PCA isn't just for speed. By removing noisy dimensions, it can actually improve model accuracy. A model trained on 50 well-chosen PCA components can outperform one trained on 500 raw features full of noise.
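
The scaling warning is easy to demonstrate. In this sketch, one Iris feature is converted from centimeters to micrometers (a made-up unit change, purely for illustration), and PCA is run with and without a StandardScaler in front of it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Hypothetical unit change: express the second feature in micrometers, not cm
X_mixed = X.copy()
X_mixed[:, 1] *= 10_000

# Without scaling, PC1 points almost entirely along the inflated feature
pca_raw = PCA(n_components=2).fit(X_mixed)
raw_weights = np.abs(pca_raw.components_[0])
print("PC1 weights, unscaled:", np.round(raw_weights, 3))

# With standardization, every feature has unit variance before PCA sees it
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2))
scaled_pca.fit(X_mixed)
pca_scaled = scaled_pca.named_steps["pca"]
scaled_weights = np.abs(pca_scaled.components_[0])
print("PC1 weights, scaled:  ", np.round(scaled_weights, 3))
```

Putting the scaler and PCA in one pipeline also means the same scaling is reapplied automatically when you transform new data later.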

Key Metrics

Compute Covariance Matrix
Usually the most expensive step when n is larger than d
O(n × d²), where n = samples, d = features

Eigenvalue Decomposition
Can be expensive when d is very large (1000+)
O(d³), where d = number of features

Transform Data
Matrix multiplication, fast once eigenvectors are found
O(n × d × k), where k = components kept
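
The transform step really is just one matrix multiplication. The sketch below reproduces pca.transform by hand from the fitted mean_ and components_ attributes, assuming the default whiten=False setting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # n = 150 samples, d = 4 features
pca = PCA(n_components=2).fit(X)    # k = 2 components kept

# components_ has shape (k, d): one row per principal component
print("components_ shape:", pca.components_.shape)

# Transform = center with the training mean, then multiply (n x d) by (d x k)
X_manual = (X - pca.mean_) @ pca.components_.T

same = np.allclose(X_manual, pca.transform(X))
print("Same as pca.transform:", same)
```

The (n × d) by (d × k) product is where the O(n × d × k) cost comes from.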

Quick check

PCA reduced 4 features to 2, keeping 97.8% of variance. What happened to the other 2.2%?