Activation Functions
Think about light switches in a house. You've got different types:
- A classic switch: either fully on or fully off. Click. No in-between.
- A dimmer switch: smoothly goes from off to on. You can set it to 30%, 50%, 80%.
- A full-range dimmer: goes from fully negative to fully positive, with off in the middle. (Picture a dial centered on zero rather than a real bulb.)
- A one-way dimmer: off when pushed below zero, then linearly brighter as you push up.
These correspond to the four major activation functions in neural networks: step, sigmoid, tanh, and ReLU. Each transforms a neuron's raw output in a different way.
Why do we need activation functions at all?
Here's the critical insight. Without an activation function, each layer just does a linear transformation: multiply by weights, add bias. And a chain of linear functions is still just a linear function.
It doesn't matter if you stack 100 layers: without activation functions, the entire network collapses into a single layer. You'd only ever learn straight-line relationships.
Activation functions introduce non-linearity. They're what allow neural networks to bend and curve, learning complex patterns like "this region is a cat" instead of just "more pixels = more cat."
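To make the collapse concrete, here's a minimal sketch in plain Python. The two tiny layers and all of their weights are made-up values chosen only for illustration. It checks that two stacked linear layers give the same output as one merged layer, and that putting a ReLU between them breaks that equivalence.

```python
# All weights, biases, and the input below are hypothetical illustrative values.

def linear(W, b, x):
    """One layer with no activation: W @ x + b (W is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * c for a, c in zip(row, col)) for col in zip(*B)] for row in A]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, vi) for vi in v]

W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, 1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [1.0, -1.0]

# Merge the two layers: W2 @ (W1 @ x + b1) + b2 = (W2 @ W1) @ x + (W2 @ b1 + b2)
W = matmul(W2, W1)
b = linear(W2, b2, b1)  # computes W2 @ b1 + b2

x = [1.0, 2.0]
stacked = linear(W2, b2, linear(W1, b1, x))          # two layers, no activation
collapsed = linear(W, b, x)                          # one merged layer
with_relu = linear(W2, b2, relu(linear(W1, b1, x)))  # ReLU in between

print(stacked)    # [-1.5, 12.5]
print(collapsed)  # [-1.5, 12.5]
print(with_relu)  # [4.5, 9.5]
```

The first two results match because the merged weights and bias reproduce the stacked computation exactly. The third differs because ReLU zeroes out the negative intermediate value, which no single linear layer can imitate for every input.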
The four key activation functions
1. Step Function (the OG)
Output = 0 if input < 0, else 1. Binary: on or off. Used in the original perceptron. Problem: the gradient is zero everywhere except at the jump at 0, where it isn't defined at all, so backpropagation gets no signal to learn from.
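A quick way to see the flat-gradient problem is to estimate the slope numerically. This is only a sketch: the helper name numerical_slope and the step size h are my own choices, not anything standard.

```python
def step(x):
    """Step activation as defined above: 0 for negative input, 1 otherwise."""
    return 0.0 if x < 0 else 1.0

def numerical_slope(f, x, h=1e-4):
    """Central-difference estimate of the slope of f at x (h is an arbitrary small value)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Sample points away from the jump at 0: the slope is flat at every one of them.
slopes = [numerical_slope(step, x) for x in (-3.0, -0.5, 0.5, 3.0)]
print(slopes)  # [0.0, 0.0, 0.0, 0.0]
```

With a slope of zero at every point you can actually sample, nudging a weight up or down changes nothing in the output, so gradient-based training has no direction to move in.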
2. Sigmoid: the S-curve
sigmoid(x) = 1 / (1 + e^(-x))
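Smoothly maps any number to (0, 1). Great for outputs representing probabilities. Problem: for large positive or large negative inputs the curve flattens out and the gradient nearly vanishes. Even at its steepest point (x = 0) the gradient is only 0.25, so multiplying gradients across many layers shrinks them fast: the vanishing gradient problem.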
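Here's a small sketch of sigmoid and its gradient using only the standard library. It relies on the identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); the function name sigmoid_grad is just my label for it. Inputs are kept moderate because this naive form can overflow for very large negative x.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid, via the identity s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, -5.0, 0.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))

# The gradient peaks at x = 0, where it is exactly 0.25,
# and is already tiny (on the order of 1e-5) by x = 10 or x = -10.
```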
3. Tanh: the centered S-curve
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Like sigmoid, but outputs range from -1 to +1. The zero-centered output usually makes optimization easier. It still suffers from vanishing gradients at the extremes, where the curve flattens just like sigmoid's.
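Tanh really is a stretched and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks that identity numerically and prints the gradient, 1 - tanh(x)^2, to show the same flattening at the extremes. The helper names are mine.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    rescaled_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
    # The second and third columns agree up to floating-point rounding.
    print(x, math.tanh(x), rescaled_sigmoid, tanh_grad(x))
```

At x = 0 the gradient is 1.0, four times sigmoid's peak of 0.25, but it still shrinks toward zero as the input moves away from the center.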
4. ReLU: the modern workhorse
ReLU(x) = max(0, x)
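Dead simple: if negative, output 0. If positive, pass it through unchanged. It's fast to compute, doesn't saturate for positive values, and goes a long way toward fixing the vanishing gradient problem, since the gradient is exactly 1 for any positive input. That's a big part of why training deep networks became practical.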
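To see why that matters for depth, here's a rough back-of-the-envelope sketch. It compares the activation-gradient factor accumulated over 10 layers (an arbitrary depth I picked) for sigmoid at its best case of 0.25 per layer versus ReLU on positive inputs. It deliberately ignores the weights, which also scale gradients in a real network, and it assumes a gradient of 0 at exactly x = 0, which is a convention rather than a mathematical fact.

```python
def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Slope of ReLU: 1 for positive input, 0 for negative.
    At exactly 0 the slope is undefined; returning 0 there is an assumed convention."""
    return 1.0 if x > 0 else 0.0

depth = 10  # hypothetical number of layers

# Best case for sigmoid: every layer sits at x = 0, where the gradient is 0.25.
sigmoid_chain = 0.25 ** depth

# ReLU with a positive input at every layer: each factor is exactly 1.
relu_chain = relu_grad(3.0) ** depth

print(sigmoid_chain)  # 9.5367431640625e-07
print(relu_chain)     # 1.0
```

Even in sigmoid's best case the signal shrinks by about a million times over 10 layers, while the ReLU path passes it through untouched as long as the inputs stay positive.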