Neural Networks8 min read

Gradient Descent

Roll downhill to find the lowest point
per step:O(n * d) for batch, O(d) for stochasticconvergence:Depends on learning rate and landscape shapevariants:Batch, Mini-batch, Stochastic (SGD)

Imagine you're blindfolded on a hilly landscape. You can't see anything, but you can feel the slope under your feet. Your goal: find the lowest valley.

What do you do? Simple β€” you feel which direction goes downhill, and take a step that way. Then you feel again, step again. And again. Eventually, you reach the bottom of a valley.

That's gradient descent. The "landscape" is the loss function β€” a surface where height represents how wrong the model is. The "slope" is the gradient β€” the mathematical direction of steepest increase. And you step in the opposite direction (downhill) to reduce the error.

The gradient: your compass

The gradient is just a fancy word for the direction of steepest ascent. If you're on a hill and the gradient points northeast and upward, then stepping southwest and downward (the negative gradient) is your fastest route to a lower point.

In math terms, for each weight w in the network:

w_new = w_old - learning_rate * (dLoss/dw)

The learning rate controls how big each step is. Think of it as your stride length on that blindfolded hike.

The learning rate: Goldilocks problem

This is the most important knob to tune:

  • Too large: You take giant steps and keep overshooting the valley β€” bouncing from hillside to hillside, never settling down. The loss oscillates wildly or even explodes.
  • Too small: You take tiny baby steps. You'll eventually reach the bottom, but it'll take forever. Training a model could take weeks instead of hours.
  • Just right: You converge smoothly to the minimum. Start larger and decrease over time (this is called a learning rate schedule).

Three flavors of gradient descent

Batch Gradient Descent

Compute the gradient using all training examples before taking one step. Very accurate direction, but expensive for large datasets. Like surveying the entire landscape before each step.

Stochastic Gradient Descent (SGD)

Compute the gradient using one random example and immediately step. Very fast but noisy β€” like taking directions from random strangers. The path zigzags, but you still trend downhill.

Mini-batch Gradient Descent

The sweet spot. Use a small batch (32, 64, 128 examples) to estimate the gradient. Less noisy than SGD, much faster than batch. This is what everyone actually uses in practice.

Gradient Descent: Finding the Best Line

import numpy as np
# Simple dataset: y β‰ˆ 2x + 1
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)
# Start with random weight and bias
w = 0.0
b = 0.0
lr = 0.01
for epoch in range(100):
# Forward pass: predictions
y_pred = w * X + b
# Loss: Mean Squared Error
loss = np.mean((y - y_pred) ** 2)
# Gradients (how loss changes w.r.t. w and b)
dw = -2 * np.mean(X * (y - y_pred))
db = -2 * np.mean(y - y_pred)
# Update (step downhill)
w -= lr * dw
b -= lr * db
if epoch % 20 == 0:
print(f"Epoch {epoch:3d}: w={w:.3f}, b={b:.3f}, loss={loss:.4f}")
print(f"\nLearned: y = {w:.2f}x + {b:.2f} (target: y = 2x + 1)")
Output
Epoch   0: w=0.140, b=0.040, loss=65.0000
Epoch  20: w=1.727, b=0.543, loss=0.4274
Epoch  40: w=1.926, b=0.795, loss=0.0656
Epoch  60: w=1.970, b=0.899, loss=0.0149
Epoch  80: w=1.987, b=0.949, loss=0.0039

Learned: y = 1.99x + 0.97  (target: y = 2x + 1)
Note: Gradient descent doesn't guarantee finding the global minimum β€” you might get stuck in a local minimum (a valley that's not the deepest). In practice, for neural networks with millions of parameters, local minima are rarely a problem because the loss landscape has many paths to good solutions. The real enemies are saddle points and plateaus.

Key Metrics

Batch GD
Uses all n examples β€” accurate but expensive
Slow per step O(n * d) per step
Stochastic GD
Uses 1 example β€” noisy but fast
Fast per step O(d) per step
Mini-batch GD
Best of both worlds β€” the standard in practice
Balanced O(batch * d) per step
Convergence
Too high β†’ diverge, too low β†’ crawl, just right β†’ smooth
Varies Depends on learning rate

Quick check

In the blindfolded hiker analogy, what does the 'learning rate' represent?
Challenge

Continue reading