Activation Functions

ReLU, sigmoid, tanh: the on/off switches of neurons
computation: O(1) per neuron (simple math operations)
importance: Critical. Without them, networks can only learn linear patterns
most popular: ReLU for hidden layers, softmax for classification output

Think about light switches in a house. You've got different types:

  • A classic switch: either fully on or fully off. Click. No in-between.
  • A dimmer switch: smoothly goes from off to on. You can set it to 30%, 50%, 80%.
  • A full-range dimmer: goes from negative (think of it as inverted) to positive brightness, with off in the middle.
  • A one-way dimmer: off when pushed below zero, then linearly brighter as you push up.

These correspond to the four major activation functions in neural networks: step, sigmoid, tanh, and ReLU. Each transforms a neuron's raw output in a different way.

Why do we need activation functions at all?

Here's the critical insight. Without an activation function, each layer just does a linear transformation: multiply by weights, add bias. And a chain of linear functions is still just a linear function.

It doesn't matter if you stack 100 layers: without activation functions, the entire network collapses into the equivalent of a single linear layer. You'd only ever learn straight-line relationships.

Activation functions introduce non-linearity. They're what allow neural networks to bend and curve, learning complex patterns like "this region is a cat" instead of just "more pixels = more cat."
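
To make the collapse concrete, here is a minimal sketch with two tiny layers. The weights W1, b1, W2, b2 and the helper names are hypothetical, picked only so the arithmetic is easy to check by hand. Without an activation, the two layers match a single merged layer exactly; with a ReLU in between, they no longer do.

```python
import numpy as np

# Hypothetical toy weights, chosen only to keep the arithmetic easy to follow.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.0])

def two_linear_layers(x):
    """Two stacked layers with no activation in between."""
    return W2 @ (W1 @ x + b1) + b2

# Algebra: W2 @ (W1 @ x + b1) + b2 == (W2 @ W1) @ x + (W2 @ b1 + b2)
W_single = W2 @ W1
b_single = W2 @ b1 + b2

def one_linear_layer(x):
    """The same mapping written as a single layer."""
    return W_single @ x + b_single

def two_layers_with_relu(x):
    """Same weights, but with ReLU applied between the layers."""
    return W2 @ np.maximum(0, W1 @ x + b1) + b2

x = np.array([2.0, 1.0])
print(two_linear_layers(x))     # [0.]
print(one_linear_layer(x))      # [0.]
print(two_layers_with_relu(x))  # [1.]
```

With these particular weights the linear stack outputs 0 for every input, while the ReLU version computes |x1 - x2|, a function no single linear layer can represent.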

The four key activation functions

1. Step Function (the OG)

Output = 0 if input < 0, else 1. Binary: on or off. Used in the original perceptron. Problem: the gradient is zero everywhere except at the jump at 0, where it is undefined, so backpropagation gets no signal to learn from.
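
A quick way to see the flat gradient is to estimate the slope numerically. This is a rough sketch: numerical_gradient is a hypothetical helper using a central difference, and the sample points are arbitrary values away from the jump at 0.

```python
import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def numerical_gradient(f, x, h=1e-4):
    """Central-difference estimate of the slope of f at x (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Sample points on both sides of the jump, none of them at 0 itself
xs = np.array([-2.0, -0.5, 0.5, 2.0])
print(numerical_gradient(step, xs))  # [0. 0. 0. 0.]
```

A gradient-based weight update is proportional to this slope, so with a slope of 0 the weights would not move.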

2. Sigmoid: the S-curve

sigmoid(x) = 1 / (1 + e^(-x))

Smoothly maps any number to (0, 1). Great for outputs representing probabilities. Problem: for large positive or large negative inputs the curve flattens and the gradient shrinks toward zero. Multiplied across many layers, these tiny gradients cause the vanishing gradient problem.
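
To see how quickly the slope fades, the sketch below evaluates the sigmoid derivative, sigmoid(x) * (1 - sigmoid(x)), at a few arbitrary inputs. The helper name sigmoid_grad is an assumption for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), where s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1 - s)

# The slope peaks at x = 0 and shrinks as |x| grows
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  gradient = {sigmoid_grad(x):.6f}")
```

The gradient is 0.25 at x = 0, its largest possible value, and falls below 0.0001 by x = 10. The same happens on the negative side, because the derivative is symmetric around 0.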

3. Tanh: the centered S-curve

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Like sigmoid, but outputs range from -1 to +1. The zero-centered output makes optimization easier. Still suffers from vanishing gradients at the extremes.
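
Tanh is a stretched and shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks that identity numerically and compares average outputs over a set of inputs that is symmetric around 0. The input grid is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Inputs spread symmetrically around 0: -3, -2, ..., 3
xs = np.linspace(-3, 3, 7)

# tanh is sigmoid rescaled from (0, 1) to (-1, 1)
print(np.allclose(np.tanh(xs), 2 * sigmoid(2 * xs) - 1))  # True

# Average output: sigmoid is centered near 0.5, tanh near 0
sigmoid_mean = float(np.mean(sigmoid(xs)))
tanh_mean = float(np.mean(np.tanh(xs)))
print(f"mean sigmoid output: {sigmoid_mean:.3f}")
print(f"mean tanh output:    {abs(tanh_mean):.3f}")
```

For symmetric inputs, the sigmoid outputs should average about 0.5 and the tanh outputs about 0. That is what "zero-centered" means here: the next layer receives values balanced around zero rather than all positive.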

4. ReLU: the modern workhorse

ReLU(x) = max(0, x)

Dead simple: if negative, output 0. If positive, pass it through unchanged. It's fast, doesn't saturate for positive values, and greatly reduces the vanishing gradient problem in deep networks. That is a big part of why training deep networks became practical.
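
A rough way to see the difference is to multiply the activation slopes along a chain of layers, which is what backpropagation does. This sketch is deliberately simplified: it ignores the weight matrices, assumes a hypothetical depth of 10, and gives sigmoid its best case (every neuron at x = 0).

```python
import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 10  # hypothetical number of stacked layers

# Best case for sigmoid: every neuron sits at x = 0, where the slope peaks at 0.25
sigmoid_chain = sigmoid_grad(0.0) ** depth

# ReLU with every neuron active (positive input): the slope is exactly 1
relu_chain = relu_grad(1.0) ** depth

print(f"sigmoid, {depth} layers: {sigmoid_chain:.2e}")  # 9.54e-07
print(f"ReLU,    {depth} layers: {relu_chain:.2e}")     # 1.00e+00
```

Even in its best case, the sigmoid chain shrinks the gradient to under a millionth of its size after 10 layers, while active ReLU units pass it through at full strength.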

All Four Activation Functions

import numpy as np

# Each function works element-wise on NumPy arrays (and on plain floats).
def step(x):
    return np.where(x >= 0, 1, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

# Test with a range of values
values = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("Input:  ", values)
print("Step:   ", step(values))
print("Sigmoid:", np.round(sigmoid(values), 3))
print("Tanh:   ", np.round(tanh(values), 3))
print("ReLU:   ", relu(values))

# Analytic derivatives evaluated at a single point
print("\nDerivatives at x=1.0:")
x = 1.0
print(f"  Sigmoid: {sigmoid(x) * (1 - sigmoid(x)):.4f}")
print(f"  Tanh:    {1 - tanh(x)**2:.4f}")
print("  ReLU:    1 (always 1 for x > 0)")
Output
Input:   [-3. -1.  0.  1.  3.]
Step:    [0 0 1 1 1]
Sigmoid: [0.047 0.269 0.5   0.731 0.953]
Tanh:    [-0.995 -0.762  0.     0.762  0.995]
ReLU:    [0. 0. 0. 1. 3.]

Derivatives at x=1.0:
  Sigmoid: 0.1966
  Tanh:    0.4200
  ReLU:    1 (always 1 for x > 0)
Note: ReLU has a weakness: 'dying neurons.' If a neuron's input is always negative, it always outputs 0 and its gradient is always 0, so its weights stop updating and it may never recover. Variants like Leaky ReLU (which outputs 0.01x for negative inputs instead of 0) fix this by keeping a small gradient flowing.
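
Here is a minimal sketch of Leaky ReLU next to plain ReLU, using the 0.01 slope mentioned in the note. The parameter name alpha and the gradient helpers are assumptions for illustration, not any particular library's API.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Pass positives through; scale negatives by a small slope alpha."""
    return np.where(x > 0, x, alpha * x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, -1.0, 2.0])
print("ReLU output:        ", relu(x))
print("Leaky ReLU output:  ", leaky_relu(x))
print("ReLU gradient:      ", relu_grad(x))
print("Leaky ReLU gradient:", leaky_relu_grad(x))
```

For the negative inputs, ReLU's gradient is exactly 0, while Leaky ReLU's is 0.01. That is small, but it is enough for the weights to keep moving.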

Key Metrics

  • ReLU: max(0, x). Fastest, just a comparison. The default choice for hidden layers.
  • Sigmoid: 1 / (1 + e^(-x)). Moderate cost. Good for output probabilities, bad for hidden layers (vanishing gradients).
  • Tanh: (e^x - e^(-x)) / (e^x + e^(-x)). Moderate cost. Zero-centered, better than sigmoid for hidden layers.
  • Leaky ReLU: max(0.01x, x). Fast. Fixes the dying neuron problem with a small slope for negatives.

Quick check

Why would a neural network with no activation functions be useless for complex tasks?