Neural Networks8 min read

Backpropagation

The error flows backward so the network can learn
time:O(total weights) per example β€” same cost as forward passmemory:O(total activations) β€” must store intermediate valueskey insight:Chain rule of calculus applied layer by layer

You just got your exam back. You scored 60% β€” not great. Now you want to figure out what went wrong. You trace backward through your preparation: the final answers were wrong because your calculations were off, your calculations were off because you misunderstood a formula, and you misunderstood the formula because you skipped that chapter.

You've just done backpropagation in your head. You started with the error (bad grade), traced it backward through each stage, and figured out exactly what to fix and by how much.

That's precisely how neural networks learn. The error at the output flows backward through the network, telling each weight: "Here's how much you contributed to the mistake, and here's how you should change."

Step 1: Forward pass β€” make a prediction

First, data flows forward through the network: input layer, hidden layers, output layer. At the end, you get a prediction. Compare it to the actual answer, and you get an error (loss).

Step 2: Backward pass β€” assign blame

Now the error flows backward. At each layer, we calculate: "How much did each weight contribute to the error?" This is done using the chain rule from calculus.

Think of a chain of dominoes. The output error depends on the last layer's weights, which depend on the second-to-last layer's weights, which depend on... all the way back to the input. The chain rule lets us calculate the blame at each link in this chain.

Step 3: Update β€” fix the mistakes

Once we know each weight's share of the blame (its gradient), we nudge it in the opposite direction. Made the error bigger? Move this way. Made it smaller? Keep going that way. This is gradient descent in action.

The chain rule: the heart of backprop

Suppose you have a chain of functions: y = f(g(h(x))). The chain rule says:

dy/dx = (dy/df) * (df/dg) * (dg/dx)

In neural network terms: the gradient of the loss with respect to an early-layer weight is the product of all the gradients along the path from that weight to the output.

It's like a game of telephone, but with numbers. The error signal gets passed back, multiplied at each step by the local gradient. Weights that contributed more to the error get larger updates.

Backpropagation: Training a Tiny Network

import numpy as np
def sigmoid(x): return 1 / (1 + np.exp(-x))
def sigmoid_deriv(x): return x * (1 - x)
np.random.seed(1)
# XOR problem β€” impossible for a single perceptron!
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
# Network: 2 β†’ 4 hidden β†’ 1 output
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))
lr = 1.0
for epoch in range(5000):
# Forward pass
hidden = sigmoid(X @ W1 + b1)
output = sigmoid(hidden @ W2 + b2)
# Backward pass
output_error = y - output
output_delta = output_error * sigmoid_deriv(output)
hidden_error = output_delta @ W2.T
hidden_delta = hidden_error * sigmoid_deriv(hidden)
# Update weights
W2 += hidden.T @ output_delta * lr
b2 += np.sum(output_delta, axis=0) * lr
W1 += X.T @ hidden_delta * lr
b1 += np.sum(hidden_delta, axis=0) * lr
print("After training:")
for i in range(4):
print(f"{X[i]} β†’ {output[i][0]:.3f} (expected {y[i][0]})")
Output
After training:
[0 0] β†’ 0.018 (expected 0)
[0 1] β†’ 0.982 (expected 1)
[1 0] β†’ 0.982 (expected 1)
[1 1] β†’ 0.019 (expected 0)
Note: Backpropagation solved the XOR problem that killed the first AI hype wave. By allowing multi-layer networks to learn, it opened the door to everything we call 'deep learning' today. It was published in 1986, but didn't take off until GPUs made it practical around 2012.

Key Metrics

Forward Pass
Matrix multiply at each layer
O(weights) O(W)
Backward Pass
Same cost as forward β€” compute gradients layer by layer
O(weights) O(W)
Memory
Must store all intermediate values for gradient computation
Moderate-High O(activations)
Vanishing Gradients
Early layers may learn very slowly β€” solved by ReLU and residual connections
Risk in deep nets Gradients shrink exponentially

Quick check

What is the 'backward' in backpropagation referring to?
Challenge

Continue reading