Backpropagation
You just got your exam back. You scored 60%. Not great. Now you want to figure out what went wrong. You trace backward through your preparation: the final answers were wrong because your calculations were off, your calculations were off because you misunderstood a formula, and you misunderstood the formula because you skipped that chapter.
You've just done backpropagation in your head. You started with the error (bad grade), traced it backward through each stage, and figured out exactly what to fix and by how much.
That's precisely how neural networks learn. The error at the output flows backward through the network, telling each weight: "Here's how much you contributed to the mistake, and here's how you should change."
Step 1: Forward pass - make a prediction
First, data flows forward through the network: input layer, hidden layers, output layer. At the end, you get a prediction. Compare it to the actual answer, and you get an error (loss).
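To make the forward pass concrete, here is a minimal sketch of a tiny network with one input, one hidden unit, and one output. The sigmoid activation, the squared-error loss, and all the names and starting values (w1, b1, w2, b2, forward, squared_error) are assumptions chosen for illustration, not something the steps above prescribe.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Push one input through a 1-input, 1-hidden-unit, 1-output network."""
    h = sigmoid(w1 * x + b1)   # hidden layer activation
    y_hat = w2 * h + b2        # output layer (linear), the prediction
    return h, y_hat

def squared_error(y_hat, y):
    """Loss: how far the prediction is from the actual answer."""
    return 0.5 * (y_hat - y) ** 2

# Hypothetical input, target, and starting weights
x, y = 1.0, 1.0
w1, b1, w2, b2 = 0.0, 0.0, 1.0, 0.0

h, y_hat = forward(x, w1, b1, w2, b2)
loss = squared_error(y_hat, y)
print(h, y_hat, loss)  # 0.5 0.5 0.125
```

The network predicts 0.5 when the answer is 1.0, so the loss is greater than zero. That leftover loss is the error that the next step sends backward.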
Step 2: Backward pass - assign blame
Now the error flows backward. At each layer, we calculate: "How much did each weight contribute to the error?" This is done using the chain rule from calculus.
Think of a chain of dominoes. The output error depends on the last layer's output, which depends on that layer's weights and on the second-to-last layer's output, which in turn depends on that layer's weights and the layer before it, all the way back to the input. The chain rule lets us calculate the blame at each link in this chain.
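Here is one way the blame assignment could look in code, for a network small enough to work out by hand: one sigmoid hidden unit feeding one linear output, with no biases. The network shape, the squared-error loss, and the names (backward, loss_fn, d_w1, d_w2) are assumptions for this sketch. The last few lines compare the result against a brute-force estimate, nudging each weight slightly and watching how the loss moves, as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(x, y, w1, w2):
    """Forward pass only: returns the loss for the given weights."""
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, w1, w2):
    """Return (dL/dw1, dL/dw2) by walking the chain from output to input."""
    # Forward pass, keeping the intermediate values we need later
    h = sigmoid(w1 * x)
    y_hat = w2 * h
    # Backward pass: each line reuses the blame computed on the line before
    d_yhat = y_hat - y           # dL/dy_hat for the 0.5 * (y_hat - y)^2 loss
    d_w2 = d_yhat * h            # blame for the output weight
    d_h = d_yhat * w2            # blame passed back to the hidden activation
    d_z = d_h * h * (1.0 - h)    # through the sigmoid: sigmoid'(z) = h * (1 - h)
    d_w1 = d_z * x               # blame for the hidden weight
    return d_w1, d_w2

# Hypothetical example values
x, y, w1, w2 = 1.5, 1.0, 0.4, -0.3
grad_w1, grad_w2 = backward(x, y, w1, w2)

# Sanity check: estimate the same gradients with central finite differences
eps = 1e-6
num_w1 = (loss_fn(x, y, w1 + eps, w2) - loss_fn(x, y, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss_fn(x, y, w1, w2 + eps) - loss_fn(x, y, w1, w2 - eps)) / (2 * eps)
print(grad_w1, num_w1)
print(grad_w2, num_w2)
```

Each printed pair should agree to several decimal places. Notice that d_w1 is built from d_h, which is built from d_yhat: the blame for an early weight is assembled from the blame already computed for everything after it.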
Step 3: Update - fix the mistakes
Once we know each weight's share of the blame (its gradient), we nudge the weight in the opposite direction. If increasing the weight would make the error bigger, we decrease it. If increasing the weight would make the error smaller, we increase it. This is gradient descent in action.
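The update step is easiest to see with a single weight. The sketch below assumes a one-weight model (prediction = w * x), a squared-error loss, and a step size called learning_rate set to 0.1. All of these are illustrative choices rather than anything fixed by the method.

```python
def loss(w, x, y):
    """Squared error of a one-weight model: prediction = w * x."""
    return (w * x - y) ** 2

def grad(w, x, y):
    """Derivative of the loss with respect to w."""
    return 2.0 * (w * x - y) * x

# Hypothetical data point: the ideal weight is 3.0, because 3.0 * 2.0 == 6.0
x, y = 2.0, 6.0
w = 0.0               # start from a bad guess
learning_rate = 0.1   # how big each nudge is (an assumed value)

losses = []
for step in range(10):
    losses.append(loss(w, x, y))
    # Move against the gradient: a positive gradient means "increasing w
    # makes the error bigger", so we subtract
    w = w - learning_rate * grad(w, x, y)

print(w)
print(losses)
```

With these particular numbers the loss shrinks on every step and w closes in on 3.0. A step size that is too large can overshoot and make the loss grow instead, which is why the learning rate usually needs some tuning.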
The chain rule: the heart of backprop
Suppose you have a chain of functions: y = f(g(h(x))). The chain rule says:
dy/dx = (df/dg) * (dg/dh) * (dh/dx)
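A quick numerical check can make the chain rule feel less abstract. The sketch below picks three arbitrary functions for illustration (h squares its input, g takes the sine, f takes the exponential), multiplies their three local derivatives together, and compares the product with a brute-force slope estimate. Each local derivative is evaluated at the value that actually flows into that function.

```python
import math

# Three example functions (arbitrary choices for illustration)
def h(x): return x * x
def g(u): return math.sin(u)
def f(v): return math.exp(v)

# Their derivatives, each with respect to its own input
def dh(x): return 2.0 * x
def dg(u): return math.cos(u)
def df(v): return math.exp(v)

def y(x):
    return f(g(h(x)))

def dy_dx(x):
    """Chain rule: multiply the local derivatives, outermost to innermost."""
    u = h(x)   # value flowing into g
    v = g(u)   # value flowing into f
    return df(v) * dg(u) * dh(x)

x0 = 0.7
analytic = dy_dx(x0)

# Brute-force estimate: nudge x a tiny bit each way and measure the slope
eps = 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)
print(analytic, numeric)
```

The two printed numbers should match to many decimal places, even though one comes from multiplying three simple derivatives and the other from actually nudging x.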
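In neural network terms: the gradient of the loss with respect to an early-layer weight is the product of the local derivatives along the path from that weight to the loss. When a weight influences the loss through several paths, the products for the individual paths are added together.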
It's like a game of telephone, but with numbers. The error signal gets passed back, multiplied at each step by the local gradient. Weights that contributed more to the error get larger updates.
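The passing-back can be sketched as a loop. The example below assumes a deliberately simplified network: a chain of layers that each have a single weight and a tanh activation, with a squared-error loss. The names delta, grads, and weights are illustrative. The variable delta plays the role of the error signal, and at each layer it is multiplied by that layer's local gradient before being handed to the layer before it.

```python
import math

def forward(x, weights):
    """Return activations [a0, a1, ..., aN] for a chain of scalar tanh layers."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def loss_fn(x, y, weights):
    return 0.5 * (forward(x, weights)[-1] - y) ** 2

def backward(x, y, weights):
    """Return dL/dw for every layer, computed from the last layer to the first."""
    activations = forward(x, weights)
    delta = activations[-1] - y              # error signal at the output
    grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        a_in, a_out = activations[k], activations[k + 1]
        dz = delta * (1.0 - a_out ** 2)      # through tanh: tanh'(z) = 1 - tanh(z)^2
        grads[k] = dz * a_in                 # gradient for this layer's weight
        delta = dz * weights[k]              # signal handed to the previous layer
    return grads

# Hypothetical input, target, and weights for a three-layer chain
x, y = 0.5, 0.8
weights = [0.9, -1.2, 0.7]
grads = backward(x, y, weights)
print(grads)
```

Because delta is a running product of local gradients, the signal reaching the first layer has been scaled once per layer on the way back. Real networks do the same thing with matrices of weights in place of single numbers.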