Practical ML · 7 min read

Regularization

Penalise complexity so the model stays honest

L1 (Lasso): Drives some weights to zero · feature selection
L2 (Ridge): Shrinks all weights · prevents any from dominating
No regularization: Risk of overfitting · model memorizes noise

Imagine a teacher grading essays. Two students answer a question correctly, but:

Student A writes a clear, simple 3-sentence answer.
Student B writes a rambling 3-page essay that technically contains the right answer buried somewhere in the middle.

A good teacher gives bonus marks for simplicity. Both answers are correct, but the simpler one is easier to understand, more likely to generalize to new questions, and less likely to contain hidden mistakes.

Regularization is that bonus-for-simplicity applied to machine learning. It adds a penalty to the model's loss function that discourages overly complex solutions. The model is told: "Sure, you can have a complicated answer, but it'll cost you."

Why models overfit

Without regularization, a model will do whatever it takes to reduce training error — even if that means learning crazy, fragile patterns. A polynomial might develop enormous coefficients (like +50,000x³ - 49,999x²) just to perfectly hit every training point. It looks great on paper, but try it on new data and it'll go haywire.
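
A minimal sketch of this effect, assuming made-up data (ten noisy points from y = 2x) and NumPy's polyfit. The seed, noise level, and polynomial degrees are illustrative choices, not part of the original example:

```python
import numpy as np

# Hypothetical data: 10 noisy samples of the line y = 2x
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = 2 * x + rng.normal(scale=0.5, size=x.size)

# Degree 1 matches the true relationship.
# Degree 9 has enough freedom to pass through all 10 training points.
simple = np.polyfit(x, y, deg=1)
wiggly = np.polyfit(x, y, deg=9)

def train_mse(coeffs):
    """Mean squared error of a fitted polynomial on the training points."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print("degree 1: train MSE", train_mse(simple), "| largest |coef|", np.abs(simple).max())
print("degree 9: train MSE", train_mse(wiggly), "| largest |coef|", np.abs(wiggly).max())
```

The degree-9 fit should report a training error near zero alongside much larger coefficients, which is the pattern described above: a perfect score on the training points bought with an unstable curve.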

How regularization works

The model normally minimizes Loss = Error on training data. With regularization, it minimizes:

Loss = Error on training data + λ × Complexity penalty

That λ (lambda) controls how much you penalize complexity. High λ → simpler model, more bias. Low λ → complex model, more variance.
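
To make the role of λ concrete, here is a sketch that assumes squared error as the error term and the sum of squared weights as the complexity penalty (one common choice, picked here for illustration). For that combination the minimizer has a closed form, w = (XᵀX + λI)⁻¹Xᵀy. The data, the helper names, and the λ values are all made up, and no intercept is fitted:

```python
import numpy as np

# Hypothetical data: 5 features, only the first one matters
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=50)

def regularized_loss(w, lam):
    """Loss = error on training data + lambda * complexity penalty."""
    error = np.sum((X @ w - y) ** 2)
    penalty = np.sum(w ** 2)
    return error + lam * penalty

def fit(lam):
    """Weights that minimize regularized_loss for a given lambda."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    w = fit(lam)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda={lam:6.1f}  weights={np.round(w, 3)}  loss={regularized_loss(w, lam):.2f}")
```

As λ grows, the overall size of the weight vector shrinks: the model gets simpler and fits the training data less closely.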

L1 vs L2 Regularization

L1 (Lasso): Penalty = sum of |weights|. Drives some weights all the way to zero, effectively removing features. Great for feature selection.
L2 (Ridge): Penalty = sum of weights². Shrinks weights toward zero but doesn't eliminate them. Keeps all features but prevents any from dominating.
Elastic Net: Mix of L1 and L2. Can zero out features like Lasso while staying more stable than Lasso when features are correlated.
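
Why L1 reaches exactly zero while L2 does not is easiest to see for a single weight. Under the simplifying assumption that each weight can be tuned independently (as with orthonormal features), minimizing ½(w − w₀)² + λ|w| gives a soft-thresholding rule, whereas minimizing ½(w − w₀)² + ½λw² only rescales w₀. The sketch below applies both rules to hypothetical unregularized weights w₀. The function names are illustrative, and real solvers handle correlated features and may scale the penalty differently:

```python
import numpy as np

def l1_update(w0, lam):
    """Minimizer of 0.5*(w - w0)**2 + lam*|w| (soft-thresholding)."""
    return np.sign(w0) * np.maximum(np.abs(w0) - lam, 0.0)

def l2_update(w0, lam):
    """Minimizer of 0.5*(w - w0)**2 + 0.5*lam*w**2 (proportional shrinkage)."""
    return w0 / (1.0 + lam)

w0 = np.array([2.0, 0.1, -0.05])  # hypothetical unregularized weights
lam = 0.3

w_l1 = l1_update(w0, lam)
w_l2 = l2_update(w0, lam)
print("L1:", w_l1)  # weights smaller than lam in magnitude become exactly 0
print("L2:", w_l2)  # every weight shrinks by the same factor, none reach 0
```

L1 subtracts a fixed amount from every weight, so small weights hit zero and stay there. L2 divides by a constant, so a nonzero weight can get small but never vanishes.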

Regularization in Action: Ridge vs Lasso

from sklearn.linear_model import LinearRegression, Ridge, Lasso
import numpy as np

# Noisy data: y depends only on the first feature (y = 2 * x0 + noise)
np.random.seed(42)
X = np.random.randn(50, 5)  # 5 features, only 1 matters
y = 2 * X[:, 0] + np.random.randn(50) * 0.5

# No regularization: irrelevant features can still pick up noise-fitting weights
lr = LinearRegression().fit(X, y)
print("No reg weights:", np.round(lr.coef_, 3))

# L2 (Ridge) — shrinks all weights
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge weights: ", np.round(ridge.coef_, 3))

# L1 (Lasso) — pushes irrelevant weights to zero
lasso = Lasso(alpha=0.3).fit(X, y)
print("Lasso weights: ", np.round(lasso.coef_, 3))

Output

No reg weights: [ 2.024 -0.031  0.095 -0.078  0.141]
Ridge weights:  [ 1.923 -0.024  0.085 -0.068  0.121]
Lasso weights:  [ 1.779  0.     0.     0.     0.   ]

Choosing the right lambda

Lambda (the regularization strength; scikit-learn calls this parameter alpha) is a hyperparameter: you choose it, not the algorithm. Too small and you get overfitting. Too large and you get underfitting (the model is so penalized it can barely learn anything).

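The standard approach: use cross-validation to try different lambda values and pick the one that gives the best performance on the held-out validation folds. Keep the test set out of this loop so it still gives an honest final estimate.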
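
A sketch of that search with scikit-learn's cross_val_score, assuming synthetic data similar to the earlier example. The candidate grid and the 5-fold split are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Hypothetical data: 5 features, only the first one matters
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=50)

candidate_alphas = [0.001, 0.01, 0.1, 0.3, 1.0]  # illustrative grid
mean_scores = {}
for alpha in candidate_alphas:
    # 5-fold CV; scikit-learn reports negated MSE so that higher is better
    scores = cross_val_score(
        Lasso(alpha=alpha), X, y, cv=5, scoring="neg_mean_squared_error"
    )
    mean_scores[alpha] = scores.mean()

# Pick the alpha with the best average score across the held-out folds
best_alpha = max(mean_scores, key=mean_scores.get)
print("mean CV score per alpha:", mean_scores)
print("best alpha:", best_alpha)
```

scikit-learn also provides LassoCV and RidgeCV, which wrap this kind of loop. The explicit version is shown here only to make the procedure visible.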

Regularization everywhere

Regularization isn't just for linear models. It shows up everywhere:

Dropout in neural networks — randomly turns off neurons during training
Max depth limits in decision trees — prevents trees from growing too deep
Early stopping — stop training before the model memorizes the data
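
As one illustration from this list, here is a minimal sketch of early stopping for a linear model trained by plain gradient descent. The data, learning rate, and patience value are assumptions made for the example, not recommendations:

```python
import numpy as np

# Hypothetical data: 20 features, only the first one matters
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=60)
X_train, y_train = X[:30], y[:30]
X_val, y_val = X[30:], y[30:]

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(X.shape[1])
best_w, best_val = w.copy(), mse(w, X_val, y_val)
patience, bad_epochs, lr = 10, 0, 0.01

for epoch in range(5000):
    # One gradient descent step on the training loss
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad

    # Track validation loss and remember the best weights seen so far
    val = mse(w, X_val, y_val)
    if val < best_val:
        best_w, best_val, bad_epochs = w.copy(), val, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss has stopped improving

print("stopped after epoch", epoch, "| best validation MSE", round(best_val, 4))
```

The model that gets kept is best_w, the weights from the epoch with the lowest validation loss, rather than whatever the weights happen to be when training halts.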

Note: Look at the Lasso output: it set 4 out of 5 weights to exactly zero, keeping only the one feature that actually matters. This is why Lasso doubles as a feature selector, automatically telling you which inputs are useful. The surviving weight is shrunk too (1.779 versus 2.024 without regularization), which is the cost of the penalty.