Regularization
Imagine a teacher grading essays. Two students answer a question correctly, but:
- Student A writes a clear, simple 3-sentence answer.
- Student B writes a rambling 3-page essay that technically contains the right answer buried somewhere in the middle.
A good teacher gives bonus marks for simplicity. Both answers are correct, but the simpler one is easier to understand, more likely to generalize to new questions, and less likely to contain hidden mistakes.
Regularization is that bonus-for-simplicity applied to machine learning. It adds a penalty to the model's loss function that discourages overly complex solutions. The model is told: "Sure, you can have a complicated answer, but it'll cost you."
Why models overfit
Without regularization, a model will do whatever it takes to reduce training error, even if that means learning crazy, fragile patterns. A polynomial might develop enormous coefficients (like +50,000x³ - 49,999x²) just to perfectly hit every training point. It looks great on paper, but try it on new data and it'll go haywire.
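To make the polynomial example concrete, here is a minimal sketch using NumPy. The data set (a noisy straight line), the noise level, and the degrees 1 and 9 are all assumptions chosen for illustration, not something from a real experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the true relationship is a straight line, y = 2x, plus noise.
x_train = np.linspace(-1, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 2 * x_test + rng.normal(scale=0.2, size=x_test.size)

def fit_and_score(degree):
    """Plain least-squares polynomial fit with no regularization."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return coeffs, train_mse, test_mse

# Degree 1 matches the true shape; degree 9 has enough freedom to pass
# through all 10 training points, noise included.
simple_coeffs, simple_train, simple_test = fit_and_score(1)
wiggly_coeffs, wiggly_train, wiggly_test = fit_and_score(9)

for name, coeffs, train, test in [
    ("degree 1", simple_coeffs, simple_train, simple_test),
    ("degree 9", wiggly_coeffs, wiggly_train, wiggly_test),
]:
    print(f"{name}: largest |coefficient| = {np.max(np.abs(coeffs)):.1f}, "
          f"train MSE = {train:.4f}, test MSE = {test:.4f}")
```

The degree-9 fit drives the training error to essentially zero, and it typically does so with much larger coefficients and a worse test error than the straight line.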
How regularization works
The model normally minimizes Loss = Error on training data. With regularization, it minimizes:
Loss = Error on training data + λ × Complexity penalty
That λ (lambda) controls how much you penalize complexity. High λ → simpler model, more bias. Low λ → complex model, more variance.
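The regularized loss above can be written out directly. The sketch below assumes a linear model, mean squared error as the "error on training data", and the sum of squared weights as the complexity penalty. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Loss = error on training data + lam * complexity penalty."""
    error = np.mean((X @ w - y) ** 2)   # mean squared error
    penalty = np.sum(w ** 2)            # sum of squared weights
    return error + lam * penalty

def fit_ridge(X, y, lam):
    """Weights that minimize regularized_loss, in closed form.

    Setting the gradient to zero gives (X^T X / n + lam * I) w = X^T y / n.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

# Hypothetical data: 30 samples, 5 features, known true weights plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
true_w = np.array([4.0, -3.0, 2.0, 0.5, -1.0])
y = X @ true_w + rng.normal(scale=0.5, size=30)

lambdas = [0.0, 0.1, 1.0, 10.0]
weight_norms = [float(np.linalg.norm(fit_ridge(X, y, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, weight_norms):
    print(f"lambda = {lam:>4}: size of weight vector = {norm:.3f}")
```

As λ grows, the fitted weight vector gets smaller: the model trades some training accuracy for a simpler solution.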
L1 vs L2 Regularization
- L1 (Lasso): Penalty = sum of |weights|. Drives some weights all the way to zero, effectively removing features. Great for feature selection.
- L2 (Ridge): Penalty = sum of weights². Shrinks weights toward zero but doesn't eliminate them. Keeps all features but prevents any from dominating.
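- Elastic Net: Mix of L1 and L2 penalties. It can still drive some weights to zero like Lasso, while the L2 part keeps the solution more stable when features are correlated.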
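One way to see why L1 produces exact zeros while L2 only shrinks is to look at a single weight in isolation. The sketch below assumes the simplest possible setting: the unregularized best value for the weight is z, and the error term is 0.5 * (w - z)². The function names are hypothetical.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w), known as soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # any weight with |z| <= lam is set exactly to zero

def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*w**2: scales the weight down."""
    return z / (1.0 + 2.0 * lam)

def elastic_net_shrink(z, lam1, lam2):
    """Minimizer of 0.5*(w - z)**2 + lam1*abs(w) + lam2*w**2."""
    return l1_shrink(z, lam1) / (1.0 + 2.0 * lam2)

unregularized = [3.0, 0.4, -0.2, -1.5]
print([l1_shrink(z, 0.5) for z in unregularized])                # [2.5, 0.0, 0.0, -1.0]
print([l2_shrink(z, 0.5) for z in unregularized])                # [1.5, 0.2, -0.1, -0.75]
print([elastic_net_shrink(z, 0.5, 0.5) for z in unregularized])  # [1.25, 0.0, 0.0, -0.5]
```

L1 subtracts a fixed amount from every weight, so small weights hit zero and stay there. L2 divides every weight by the same factor, so small weights get smaller but never vanish.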
Regularization in Action: Ridge vs Lasso
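A small experiment can show the difference on a full regression problem. The sketch below is built on assumptions: synthetic data where only 3 of 8 features matter, a closed-form Ridge solver, and a basic coordinate-descent Lasso solver written from scratch with a fixed number of sweeps. The function names and the value of lam are hypothetical choices.

```python
import numpy as np

def soft_threshold(value, lam):
    """Shrink value toward zero by lam, clipping at zero."""
    return np.sign(value) * max(abs(value) - lam, 0.0)

def fit_ridge(X, y, lam):
    """Minimizes mean squared error + lam * sum(w**2), in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def fit_lasso(X, y, lam, n_sweeps=200):
    """Minimizes sum((y - Xw)**2) / (2n) + lam * sum(|w|) by coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution added back in.
            residual_without_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual_without_j / n
            w[j] = soft_threshold(rho, lam) / col_scale[j]
    return w

# Hypothetical data: features 0, 1 and 4 matter, the other five are pure noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 8))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=100)

ridge_w = fit_ridge(X, y, lam=0.3)
lasso_w = fit_lasso(X, y, lam=0.3)

print("true :", true_w)
print("ridge:", np.round(ridge_w, 3))
print("lasso:", np.round(lasso_w, 3))
print("weights at exactly zero: ridge =", int(np.sum(ridge_w == 0)),
      ", lasso =", int(np.sum(lasso_w == 0)))
```

Ridge keeps a small nonzero weight on every noise feature, while Lasso sets the noise features to exactly zero and keeps only the ones that carry signal.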
Choosing the right lambda
Lambda (the regularization strength) is a hyperparameter: you choose it, not the algorithm. Too small and you get overfitting. Too large and you get underfitting (the model is so penalized it can barely learn anything).
The standard approach: use cross-validation to try different lambda values and pick the one that gives the best validation performance. Keep the test set aside for the final evaluation only; if you tune lambda on the test set, the score you report will be too optimistic.
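The cross-validation loop can be sketched in a few lines. This version assumes Ridge regression, 5 folds, mean squared error as the score, and a hand-picked grid of candidate lambdas. The data and all names are hypothetical.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimizes mean squared error + lam * sum(w**2), in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Average validation MSE over n_folds train/validation splits."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, validate on fold i.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = fit_ridge(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical training data: few samples relative to the number of features.
# A separate test set, not shown here, would be held back for the final check.
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 15))
true_w = rng.normal(size=15)
y = X @ true_w + rng.normal(scale=1.0, size=40)

candidates = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in candidates}
best_lam = min(scores, key=scores.get)

for lam, score in scores.items():
    print(f"lambda = {lam:>7}: validation MSE = {score:.3f}")
print("chosen lambda:", best_lam)
```

Using the same seed for every candidate means each lambda is scored on identical splits, so the comparison is not affected by how the folds happened to be drawn.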
Regularization everywhere
Regularization isn't just for linear models. It shows up everywhere:
- Dropout in neural networks: randomly turns off neurons during training
- Max depth limits in decision trees: prevents trees from growing too deep
- Early stopping: stop training before the model memorizes the data
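As one illustration from the list above, dropout fits in a few lines. This sketch uses the "inverted dropout" variant, which rescales the surviving activations during training so nothing needs to change at prediction time. It is a standalone function on a NumPy array, not any particular framework's API, and the names are hypothetical.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability drop_prob."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return activations  # at prediction time every neuron stays on
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(4)
layer_output = np.ones(10_000)

train_out = dropout(layer_output, drop_prob=0.5, rng=rng, training=True)
eval_out = dropout(layer_output, drop_prob=0.5, rng=rng, training=False)

print("values seen in training:", np.unique(train_out))
print("mean in training:", train_out.mean())
print("mean at prediction time:", eval_out.mean())
```

Because a different random subset of neurons is silenced on every training step, the network cannot lean too heavily on any single neuron, which plays the same role as the complexity penalty in a linear model.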