Convolutions
The Magnifying Glass
Imagine holding a small magnifying glass over a photograph and sliding it across the image, one small patch at a time. At each position, you examine the tiny region under the glass and ask: "Is there an edge here? A corner? A texture?"
That's a convolution: a small filter (also called a kernel) slides across an image, performing a simple mathematical operation at each position. The result is a new image that highlights specific patterns.
How It Works
A convolution kernel is a tiny grid of numbers, often 3x3 or 5x5. You place it on top of a patch of the image, multiply each kernel value by the corresponding pixel value, and sum everything up. That sum becomes one pixel in the output.
Then you slide the kernel one pixel to the right and repeat. When you reach the end of a row, you move down one row and start again from the left. Once every position has been visited, you've scanned the entire image with your pattern detector.
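To make the multiply-and-sum step concrete, here is a minimal sketch of a single kernel position in plain Python. The patch values and the function name `multiply_and_sum` are made up for illustration; they do not come from any library.

```python
# One step of the sliding operation: multiply a 3x3 patch by a 3x3 kernel
# element by element, then add up the nine products.
patch = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def multiply_and_sum(patch, kernel):
    """Return the single output value for one kernel position."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )


# Row by row: (3 - 1) + (6 - 4) + (9 - 7) = 6
print(multiply_and_sum(patch, kernel))  # 6
```

That single number is one pixel of the output image; sliding the kernel produces the rest.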
The magic is in the kernel values. Different numbers detect different things:
- A kernel like [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]] detects vertical edges
- A kernel like [[-1, -1, -1], [0, 0, 0], [1, 1, 1]] detects horizontal edges
- A kernel of [[1, 1, 1], [1, 1, 1], [1, 1, 1]] / 9 creates a blur
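A quick way to see that these numbers behave as described is to apply each kernel to a few hand-made 3x3 patches. This is only a sketch: the patches and the helper name `response` are assumptions chosen for illustration.

```python
VERTICAL = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
HORIZONTAL = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
BLUR = [[1 / 9] * 3 for _ in range(3)]


def response(patch, kernel):
    """Multiply a 3x3 patch by a 3x3 kernel element-wise and sum the result."""
    return sum(patch[i][j] * kernel[i][j] for i in range(3) for j in range(3))


# Dark on the left, bright on the right: a vertical edge.
vertical_edge = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
# Dark on top, bright at the bottom: a horizontal edge.
horizontal_edge = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]
# Uniform brightness: no edge at all.
flat = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

print(response(vertical_edge, VERTICAL))      # 3: strong response
print(response(vertical_edge, HORIZONTAL))    # 0: wrong orientation
print(response(horizontal_edge, HORIZONTAL))  # 3: strong response
print(response(horizontal_edge, VERTICAL))    # 0: wrong orientation
print(response(flat, VERTICAL))               # 0: nothing to detect
print(response(vertical_edge, BLUR))          # about 0.33: the patch average
```

Each edge kernel responds strongly only to its own orientation, while the blur kernel simply reports the average brightness of the patch.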
Why Convolutions Are Perfect for Images
Convolutions have three properties that make them ideal for visual data:
1. Local Connectivity
Each output pixel depends only on a small local region of the input. This mirrors how vision works: to recognize an edge, you only need to look at neighboring pixels, not pixels on the other side of the image.
2. Weight Sharing
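The same kernel is used across the entire image. This means an edge detector that works in the top-left corner also works in the bottom-right. You don't need separate detectors for every position. This dramatically reduces the number of parameters: a 3x3 kernel has only 9 weights no matter how large the image is, whereas a fully connected layer needs a separate weight for every pairing of an input pixel with an output value.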
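The size of that saving can be worked out with a little arithmetic. The sketch below assumes a hypothetical 28x28 grayscale image, a single 3x3 kernel, no padding, a stride of 1, and ignores bias terms; none of these numbers come from the text above.

```python
height, width = 28, 28  # hypothetical grayscale input size
kernel_size = 3

# Convolution: one shared 3x3 kernel, reused at every position.
conv_weights = kernel_size * kernel_size

# Output size when the kernel must fit entirely inside the image
# (no padding, stride 1).
out_h = height - kernel_size + 1
out_w = width - kernel_size + 1

# Fully connected layer producing the same number of outputs:
# every output value has its own weight for every input pixel.
dense_weights = (height * width) * (out_h * out_w)

print(conv_weights)   # 9
print(out_h, out_w)   # 26 26
print(dense_weights)  # 529984
```

Under these assumptions the fully connected layer needs over half a million weights to do what the convolution does with nine.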
3. Translation Invariance
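Because the same filter slides everywhere, a cat in the top-left produces the same response as a cat in the bottom-right, just at a different position in the output. Strictly speaking, this makes the convolution itself translation equivariant: shift the input and the output shifts with it. Combined with pooling layers that summarize responses over a region, this lets the network recognize objects largely regardless of where they appear.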
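It helps to check what "the same response" means in practice. The sketch below uses a 1D signal instead of an image to keep it short (an assumption made for brevity; the helper `slide_1d` is hypothetical). Moving the feature moves the response by the same amount, and taking the maximum over all positions gives a value that no longer depends on where the feature was.

```python
def slide_1d(signal, kernel):
    """Slide a 1D kernel over a 1D signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [-1, 0, 1]  # a 1D edge detector

signal = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]   # feature near the left
shifted = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]  # same feature, 3 steps to the right

out = slide_1d(signal, kernel)
out_shifted = slide_1d(shifted, kernel)

print(out)          # [1, 1, -1, -1, 0, 0, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 1, -1, -1, 0]

# The response pattern is identical, just moved 3 steps to the right.
print(out_shifted[3:] == out[:5])  # True

# A position-free summary (the maximum) is the same for both inputs.
print(max(out) == max(out_shifted))  # True
```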
Convolution from Scratch
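The sliding procedure described earlier can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: it assumes a grayscale image stored as a list of rows, no padding, and a stride of 1, and the function name `convolve2d` is chosen here for illustration. Like the operation deep learning libraries call convolution, it does not flip the kernel, so mathematically it is a cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows); no padding, stride 1."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1

    output = []
    for row in range(out_h):          # move down one row at a time
        out_row = []
        for col in range(out_w):      # slide one pixel to the right
            total = 0
            for i in range(k_h):      # multiply and sum over the patch
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# A 4x6 image: dark left half, bright right half.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
vertical_edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

for out_row in convolve2d(image, vertical_edge_kernel):
    print(out_row)
# [0, 3, 3, 0]
# [0, 3, 3, 0]
```

The output is zero over the flat regions and peaks where the kernel straddles the boundary between dark and bright. It is also smaller than the input (2x4 instead of 4x6) because the kernel has to fit entirely inside the image.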
Beyond Simple Filters
In classic computer vision, engineers hand-designed kernels for specific tasks (Sobel for edges, Gaussian for blur). In deep learning, the network learns the kernel values automatically during training.
The network starts with random kernels and adjusts them through backpropagation until they detect the patterns most useful for the task. A network trained on faces might learn kernels for eyes, noses, and mouths. One trained on landscapes might learn kernels for sky gradients and tree textures.
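As a toy illustration of kernels being learned rather than designed, the sketch below recovers a vertical edge kernel by plain gradient descent. It is not a real network: there is one kernel, one random image, and a mean squared error loss, and every name and setting here (`SIZE`, `learning_rate`, the number of steps, the fixed seed) is an assumption chosen to keep the example small.

```python
import random

random.seed(0)
SIZE, K = 12, 3
OUT = SIZE - K + 1  # output side length (no padding, stride 1)


def convolve2d(image, kernel):
    """Slide a KxK kernel over a SIZExSIZE image."""
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(K) for j in range(K))
            for c in range(OUT)
        ]
        for r in range(OUT)
    ]


# A random image with pixel values in [-1, 1].
image = [[random.uniform(-1, 1) for _ in range(SIZE)] for _ in range(SIZE)]

# The "task": reproduce the output of a vertical edge detector.
target_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
target = convolve2d(image, target_kernel)

# Start from small random values, as a network would.
kernel = [[random.uniform(-0.1, 0.1) for _ in range(K)] for _ in range(K)]
learning_rate = 0.2
n = OUT * OUT

for step in range(500):
    prediction = convolve2d(image, kernel)
    error = [[prediction[r][c] - target[r][c] for c in range(OUT)]
             for r in range(OUT)]
    # Gradient of the mean squared error with respect to each kernel weight.
    for i in range(K):
        for j in range(K):
            grad = 2 / n * sum(error[r][c] * image[r + i][c + j]
                               for r in range(OUT) for c in range(OUT))
            kernel[i][j] -= learning_rate * grad

for kernel_row in kernel:
    print([round(w, 2) for w in kernel_row])
```

The learned values should end up close to the target kernel even though the loop was never told what an edge is. A real network does the same thing at scale, with many kernels per layer and gradients computed by backpropagation.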
This is the key insight: you don't design the filters; the data teaches the network what to look for.