Activation Functions

Think about light switches in a house. You've got different types:

A classic switch: either fully on or fully off. Click. No in-between.
A dimmer switch: smoothly goes from off to on. You can set it to 30%, 50%, 80%.
A full-range dimmer: swings from negative brightness (picture inverted light) to positive brightness, with off in the middle.
A one-way dimmer: off when pushed below zero, then linearly brighter as you push up.

These correspond to the four major activation functions in neural networks: step, sigmoid, tanh, and ReLU. Each transforms a neuron's raw output in a different way.

Why do we need activation functions at all?

Here's the critical insight. Without an activation function, each layer just does a linear transformation: multiply by weights, add bias. And a chain of linear functions is still just a linear function.

It doesn't matter if you stack 100 layers: without activation functions, the entire network collapses into the equivalent of a single linear layer. You'd only ever learn straight-line relationships.
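To see the collapse concretely, here is a minimal sketch with two made-up linear layers. The shapes, the random weights, and the names W1, b1, W2, b2 are assumptions for illustration only. It checks that two stacked layers with no activation in between give the same result as one combined layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical linear layers: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_linear_layers(x):
    # No activation function between the layers
    return W2 @ (W1 @ x + b1) + b2

# The same mapping folded into a single layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

def one_linear_layer(x):
    return W_combined @ x + b_combined

x = rng.normal(size=3)
print(np.allclose(two_linear_layers(x), one_linear_layer(x)))  # True
```

The same folding works for any number of linear layers, which is why depth alone buys nothing without a non-linearity in between.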

Activation functions introduce non-linearity. They're what allow neural networks to bend and curve, learning complex patterns like "this region is a cat" instead of just "more pixels = more cat."

The four key activation functions

1. Step Function (the OG)

Output = 0 if input < 0, else 1. Binary: on or off. Used in the original perceptron. Problem: the gradient is zero everywhere except at 0, where the function jumps and the derivative is undefined, so backpropagation gets no signal to learn from.
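A quick way to check the flat-gradient claim is to estimate the slope numerically. This is a small sketch, and the helper name numerical_slope and the sample points are assumptions chosen for illustration.

```python
import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def numerical_slope(f, x, h=1e-4):
    # Central-difference estimate of the slope of f at x
    return (f(x + h) - f(x - h)) / (2 * h)

# Sample points on both sides of zero, away from the jump
points = np.array([-2.0, -0.5, 0.5, 2.0])
print(numerical_slope(step, points))  # [0. 0. 0. 0.]
```

Every slope is exactly zero, so a weight update based on this gradient would never move.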

2. Sigmoid: the S-curve

sigmoid(x) = 1 / (1 + e^(-x))

Smoothly maps any number to (0, 1). Great for outputs representing probabilities. Problem: for large positive or large negative inputs, the curve flattens and the gradient shrinks toward zero. This is the vanishing gradient problem.
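The saturation is easy to see by printing the gradient at a few inputs. This is a minimal sketch: the helper name sigmoid_grad and the sample inputs are illustrative choices, and the gradient uses the standard identity sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of sigmoid, expressed through sigmoid itself
    s = sigmoid(x)
    return s * (1 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The gradient peaks at 0.25 when x = 0 and drops below 0.0001 by x = 10. The curve is symmetric, so the same thing happens for large negative inputs.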

3. Tanh: the centered S-curve

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Like sigmoid, but outputs range from -1 to +1. The zero-centered output makes optimization easier. Still suffers from vanishing gradients at the extremes.
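The family resemblance is exact: tanh is a sigmoid that has been stretched and shifted, tanh(x) = 2 * sigmoid(2x) - 1. Here is a short sketch that checks the identity numerically. The sample grid is an arbitrary choice.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-4, 4, 9)

# Stretch sigmoid's (0, 1) range to (-1, 1) and double the input scale
rescaled_sigmoid = 2 * sigmoid(2 * x) - 1

print(np.allclose(np.tanh(x), rescaled_sigmoid))  # True
```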

4. ReLU: the modern workhorse

ReLU(x) = max(0, x)

Dead simple: if negative, output 0. If positive, pass it through unchanged. It's fast, doesn't saturate for positive values, and greatly reduces the vanishing gradient problem in deep networks. That is one of the reasons deep learning took off.
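Backpropagation multiplies one activation derivative per layer, so small derivatives compound quickly. The sketch below is a deliberate simplification: it ignores the weights entirely and assumes every layer sees the same input x = 1.0, with a hypothetical depth of 10.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    # Slope of ReLU: 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

depth = 10  # hypothetical number of layers
x = 1.0     # assumption: every layer sees the same positive input

# Product of the per-layer activation derivatives (weights ignored)
sigmoid_chain = sigmoid_grad(x) ** depth
relu_chain = relu_grad(x) ** depth

print(f"Sigmoid chain: {sigmoid_chain:.2e}")
print(f"ReLU chain:    {relu_chain:.1f}")
```

Under these assumptions the sigmoid product ends up smaller than one in a million, while the ReLU product stays at exactly 1.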

All Four Activation Functions

import numpy as np

def step(x):    return np.where(x >= 0, 1, 0)
def sigmoid(x): return 1 / (1 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0, x)

# Test with a range of values
values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])

print("Input:  ", values)
print("Step:   ", step(values))
print("Sigmoid:", np.round(sigmoid(values), 3))
print("Tanh:   ", np.round(tanh(values), 3))
print("ReLU:   ", relu(values))

print("\nDerivatives at x=1.0:")
x = 1.0
print(f"  Sigmoid: {sigmoid(x) * (1 - sigmoid(x)):.4f}")
print(f"  Tanh:    {1 - tanh(x)**2:.4f}")
print("  ReLU:    1 (always 1 for x > 0)")

Output

Input:   [-3. -1.  0.  1.  3.]
Step:    [0 0 1 1 1]
Sigmoid: [0.047 0.269 0.5   0.731 0.953]
Tanh:    [-0.995 -0.762  0.     0.762  0.995]
ReLU:    [0. 0. 0. 1. 3.]

Derivatives at x=1.0:
  Sigmoid: 0.1966
  Tanh:    0.4200
  ReLU:    1 (always 1 for x > 0)
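The derivative formulas used above, sigmoid(x) * (1 - sigmoid(x)) and 1 - tanh(x)**2, can be sanity-checked against a finite-difference estimate. This is a small sketch, and the helper name numerical_derivative and the step size h are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def numerical_derivative(f, x, h=1e-5):
    # Central-difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.0
analytic_sigmoid = sigmoid(x) * (1 - sigmoid(x))
analytic_tanh = 1 - np.tanh(x) ** 2

print(np.isclose(analytic_sigmoid, numerical_derivative(sigmoid, x)))  # True
print(np.isclose(analytic_tanh, numerical_derivative(np.tanh, x)))     # True
```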

Key Metrics

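ReLU: max(0, x)
Speed: fastest. Computing it is just a comparison, and it is the default choice for hidden layers.

Sigmoid: 1 / (1 + e^(-x))
Speed: moderate. Good for output probabilities, bad for hidden layers (vanishing gradients).

Tanh: (e^x - e^(-x)) / (e^x + e^(-x))
Speed: moderate. Zero-centered, and better than sigmoid for hidden layers.

Leaky ReLU: max(0.01x, x)
Speed: fast. Fixes the dying neuron problem (ReLU units that get stuck outputting 0 and so receive zero gradient) by keeping a small slope for negative inputs.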
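Leaky ReLU is the only function in this comparison without code earlier in the section, so here is a minimal sketch in the same style. The parameter name negative_slope is an assumption, and its default of 0.01 comes from the formula max(0.01x, x).

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Positive inputs pass through unchanged;
    # negative inputs are scaled by a small slope instead of being zeroed
    return np.where(x > 0, x, negative_slope * x)

values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("Leaky ReLU:", leaky_relu(values))
```

Because the slope on the negative side is 0.01 rather than 0, a unit that lands in negative territory still gets a small gradient and can recover.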