Backpropagation

You just got your exam back. You scored 60% — not great. Now you want to figure out what went wrong. You trace backward through your preparation: the final answers were wrong because your calculations were off, your calculations were off because you misunderstood a formula, and you misunderstood the formula because you skipped that chapter.

You've just done backpropagation in your head. You started with the error (bad grade), traced it backward through each stage, and figured out exactly what to fix and by how much.

That's precisely how neural networks learn. The error at the output flows backward through the network, telling each weight: "Here's how much you contributed to the mistake, and here's how you should change."

Step 1: Forward pass — make a prediction

First, data flows forward through the network: input layer, hidden layers, output layer. At the end, you get a prediction. Compare it to the actual answer, and you get an error (loss).

Step 2: Backward pass — assign blame

Now the error flows backward. At each layer, we calculate: "How much did each weight contribute to the error?" This is done using the chain rule from calculus.

Think of a chain of dominoes. The output error depends on the last layer's weights, which depend on the second-to-last layer's weights, which depend on... all the way back to the input. The chain rule lets us calculate the blame at each link in this chain.

Step 3: Update — fix the mistakes

Once we know each weight's share of the blame (its gradient), we nudge it in the opposite direction. Made the error bigger? Move this way. Made it smaller? Keep going that way. This is gradient descent in action.

The chain rule: the heart of backprop

Suppose you have a chain of functions: y = f(g(h(x))). The chain rule says:

dy/dx = (dy/df) * (df/dg) * (dg/dx)

In neural network terms: the gradient of the loss with respect to an early-layer weight is the product of all the gradients along the path from that weight to the output.

It's like a game of telephone, but with numbers. The error signal gets passed back, multiplied at each step by the local gradient. Weights that contributed more to the error get larger updates.

Backpropagation: Training a Tiny Network

import numpy as np

def sigmoid(x): return 1 / (1 + np.exp(-x))
def sigmoid_deriv(x): return x * (1 - x)

np.random.seed(1)

# XOR problem — impossible for a single perceptron!
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

# Network: 2 → 4 hidden → 1 output
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))
lr = 1.0

for epoch in range(5000):
    # Forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    
    # Backward pass
    output_error = y - output
    output_delta = output_error * sigmoid_deriv(output)
    
    hidden_error = output_delta @ W2.T
    hidden_delta = hidden_error * sigmoid_deriv(hidden)
    
    # Update weights
    W2 += hidden.T @ output_delta * lr
    b2 += np.sum(output_delta, axis=0) * lr
    W1 += X.T @ hidden_delta * lr
    b1 += np.sum(hidden_delta, axis=0) * lr

print("After training:")
for i in range(4):
    print(f"{X[i]} → {output[i][0]:.3f} (expected {y[i][0]})")

Output

After training:
[0 0] → 0.018 (expected 0)
[0 1] → 0.982 (expected 1)
[1 0] → 0.982 (expected 1)
[1 1] → 0.019 (expected 0)

Key Metrics

Forward Pass

Matrix multiply at each layer

O(weights) O(W)

Backward Pass

Same cost as forward — compute gradients layer by layer

O(weights) O(W)

Memory

Must store all intermediate values for gradient computation

Moderate-High O(activations)

Vanishing Gradients

Early layers may learn very slowly — solved by ReLU and residual connections