Gradient Descent
Imagine you're blindfolded on a hilly landscape. You can't see anything, but you can feel the slope under your feet. Your goal: find the lowest valley.
What do you do? Simple: you feel which direction goes downhill and take a step that way. Then you feel again, step again. And again. Eventually, you reach the bottom of a valley.
That's gradient descent. The "landscape" is the loss function: a surface where height represents how wrong the model is. The "slope" is the gradient: the mathematical direction of steepest increase. And you step in the opposite direction (downhill) to reduce the error.
The gradient: your compass
The gradient is just a fancy word for the direction of steepest ascent, along with how steep that ascent is. If you're on a hill and the gradient points northeast (the steepest way up), then stepping southwest (the negative gradient) is your fastest route to a lower point.
In math terms, for each weight w in the network:
w_new = w_old - learning_rate * (dLoss/dw)
The learning rate controls how big each step is. Think of it as your stride length on that blindfolded hike.
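To make the update rule concrete, here is a minimal sketch in Python. The toy loss (w - 3)^2, the starting weight, and the function names are all assumptions made up for illustration; a real network has many weights and a far more complicated loss.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for illustration).
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the toy loss: dLoss/dw = 2 * (w - 3).
    return 2.0 * (w - 3.0)


def gradient_descent(w_start, learning_rate, steps):
    # Repeatedly apply: w_new = w_old - learning_rate * (dLoss/dw)
    w = w_start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


w_final = gradient_descent(w_start=0.0, learning_rate=0.1, steps=100)
print(w_final, loss(w_final))
```

Each pass through the loop is one "feel the slope, take a step" from the hike: the weight moves against the gradient and ends up very close to 3, where this toy loss is lowest.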
The learning rate: Goldilocks problem
This is often the most important knob to tune:
- Too large: You take giant steps and keep overshooting the valley, bouncing from hillside to hillside and never settling down. The loss oscillates wildly or even explodes.
- Too small: You take tiny baby steps. You'll eventually reach the bottom, but it'll take forever. Training a model could take weeks instead of hours.
- Just right: You converge smoothly to the minimum. A common approach is to start with a larger rate and decrease it over time (this is called a learning rate schedule).
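The three cases above can be tried out on a toy problem. The sketch below assumes a made-up loss (w - 3)^2, and the specific rates (1.1, 0.001, 0.3) and the simple decay formula are illustrative choices only; which values count as "too large" or "too small" depends entirely on the loss you're working with.

```python
def grad(w):
    # Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


def run(learning_rate, steps=20, w=0.0):
    # Record the distance from the minimum after every step.
    history = []
    for _ in range(steps):
        w = w - learning_rate * grad(w)
        history.append(abs(w - 3.0))
    return history


def run_with_decay(initial_rate, decay, steps=20, w=0.0):
    # Hypothetical schedule: the rate shrinks as initial_rate / (1 + decay * t).
    for t in range(steps):
        learning_rate = initial_rate / (1.0 + decay * t)
        w = w - learning_rate * grad(w)
    return abs(w - 3.0)


too_large = run(1.1)     # each step overshoots further than the last
too_small = run(0.001)   # barely moves in 20 steps
just_right = run(0.3)    # closes in on the minimum quickly
scheduled = run_with_decay(initial_rate=0.9, decay=0.1)

print("too large :", too_large[-1])
print("too small :", too_small[-1])
print("just right:", just_right[-1])
print("scheduled :", scheduled)
```

On this toy loss, the large rate ends up farther from the minimum than where it started, the small rate has hardly moved after 20 steps, and both the moderate rate and the decaying schedule land very close to w = 3.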
Three flavors of gradient descent
Batch Gradient Descent
Compute the gradient using all training examples before taking one step. Very accurate direction, but expensive for large datasets. Like surveying the entire landscape before each step.
Stochastic Gradient Descent (SGD)
Compute the gradient using one random example and immediately step. Very fast but noisy, like taking directions from random strangers. The path zigzags, but you still trend downhill.
Mini-batch Gradient Descent
The sweet spot. Use a small batch (commonly 32, 64, or 128 examples) to estimate the gradient. Less noisy than SGD, and much cheaper per step than batch. This is what most deep learning training uses in practice.
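The three flavors differ only in how many examples feed each gradient estimate, so one training loop can cover all of them. This is a rough sketch under made-up assumptions: a one-weight model y = w * x, synthetic data generated with a true weight of 2.0, and hypothetical helper names (make_data, gradient, train).

```python
import random


def make_data(n, seed=0):
    # Synthetic data (an assumption): y = 2 * x plus a little Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]


def gradient(w, batch):
    # Mean gradient of the squared error (w * x - y)^2 over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, learning_rate=0.1, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # visit examples in a fresh random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= learning_rate * gradient(w, batch)
    return w


data = make_data(256)
w_batch = train(data, batch_size=len(data))  # batch: 1 step per epoch, 50 in total
w_sgd = train(data, batch_size=1)            # SGD: 256 steps per epoch, 12800 in total
w_mini = train(data, batch_size=32)          # mini-batch: 8 steps per epoch, 400 in total

print(w_batch, w_sgd, w_mini)
```

All three runs see the same data for the same number of epochs. The only thing that changes is batch_size, which trades off how many updates you get per pass against how noisy each update is.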