Computer Vision · 10 min read

Convolutions

A sliding magnifying glass that detects patterns

Scope: Core Concept · Difficulty: Intermediate

The Magnifying Glass

Imagine holding a small magnifying glass over a photograph and sliding it across the image, one small patch at a time. At each position, you examine the tiny region under the glass and ask: "Is there an edge here? A corner? A texture?"

That's a convolution. It's a small filter (also called a kernel) that slides across an image, performing a simple mathematical operation at each position. The result is a new image that highlights specific patterns.

How It Works

A convolution kernel is a tiny grid of numbers — often 3x3 or 5x5. You place it on top of a patch of the image, multiply each kernel value by the corresponding pixel value, and sum everything up. That sum becomes one pixel in the output.
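To make the multiply-and-sum step concrete, here is a minimal sketch of a single kernel position. The patch values are made up for illustration (dark pixels on the left, bright on the right), and the kernel is the vertical edge detector discussed in this article.

```python
import numpy as np

# Hypothetical 3x3 patch of pixel values: dark on the left, bright on the right
patch = np.array([
    [0, 0, 10],
    [0, 0, 10],
    [0, 0, 10],
])

# Vertical edge kernel: negative weights on the left, positive on the right
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
])

# Multiply each kernel value by the pixel underneath it...
products = patch * kernel

# ...and sum everything up to get one output pixel
output_pixel = products.sum()

print(products)
print(output_pixel)  # 30
```

A large value like 30 means the patch matches the pattern the kernel is looking for. A flat patch (all pixels equal) would sum to 0.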

Then you slide the kernel one pixel to the right and repeat. Slide across the whole image, slide down a row, repeat. You've scanned the entire image with your pattern detector.

The magic is in the kernel values. Different numbers detect different things:

A kernel like [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]] detects vertical edges
A kernel like [[-1,-1,-1], [0,0,0], [1,1,1]] detects horizontal edges
A kernel of [[1,1,1], [1,1,1], [1,1,1]] / 9 creates a blur
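The three kernels above can be compared side by side on the same input. The sketch below is one possible way to do it, assuming NumPy 1.20 or newer for `sliding_window_view`. The helper name `apply_kernel` and the 5x5 test image (dark top, bright bottom) are made up for this example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_kernel(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    windows = sliding_window_view(image, kernel.shape)  # shape: (oh, ow, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Hypothetical image with a horizontal edge: dark rows on top, bright rows below
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
], dtype=float)

vertical = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
horizontal = np.array([[-1, -1, -1], [0, 0, 0], [1, 1, 1]], dtype=float)
blur = np.ones((3, 3)) / 9

v = apply_kernel(image, vertical)    # all zeros: there is no vertical edge
h = apply_kernel(image, horizontal)  # 30 wherever the window straddles the edge
b = apply_kernel(image, blur)        # the hard step becomes a gradual ramp

print(v)
print(h)
print(np.round(b, 1))
```

Only the horizontal kernel responds strongly here, because its negative and positive rows line up with the dark-to-bright transition.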

Why Convolutions Are Perfect for Images

Convolutions have three properties that make them ideal for visual data:

1. Local Connectivity

Each output pixel depends only on a small local region of the input. This mirrors how vision works — to recognize an edge, you only need to look at neighboring pixels, not pixels on the other side of the image.
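Local connectivity can be checked directly: changing a pixel outside the kernel's window leaves that output pixel untouched. This is a small sketch with a random 8x8 image. The helper `output_pixel` is a made-up name for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8)).astype(float)

kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

def output_pixel(img, y, x):
    """Response of the kernel placed with its top-left corner at (y, x)."""
    kh, kw = kernel.shape
    return np.sum(img[y:y + kh, x:x + kw] * kernel)

before = output_pixel(image, 0, 0)

# Change a pixel on the other side of the image
far = image.copy()
far[7, 7] += 100
after_far = output_pixel(far, 0, 0)

# Change a pixel inside the 3x3 window (the kernel weight at [1, 2] is 1)
near = image.copy()
near[1, 2] += 100
after_near = output_pixel(near, 0, 0)

print(after_far - before)   # 0.0
print(after_near - before)  # 100.0
```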

2. Weight Sharing

The same kernel is used across the entire image. This means an edge detector that works in the top-left corner also works in the bottom-right. You don't need separate detectors for every position. This dramatically reduces parameters compared to fully connected layers.

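3. Translation Equivariance

Because the same filter slides everywhere, a cat in the top-left produces the same response as a cat in the bottom-right, only at a different position in the output. Strictly speaking, this property is translation equivariance: shift the input and the response map shifts with it. Networks get closer to true translation invariance (recognizing objects regardless of where they appear) by adding pooling or global aggregation on top of the convolutions.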
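A quick experiment shows what "the same response" means in practice. In this sketch, a small bright blob stands in for the cat, and the helper `apply_kernel` is a made-up name (it assumes NumPy 1.20 or newer). The blob is kept away from the borders so that shifting with `np.roll` does not wrap anything around.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_kernel(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

# A 2x2 bright blob near the top-left of an otherwise dark image
image_a = np.zeros((8, 8))
image_a[2:4, 2:4] = 10.0

# The same blob moved 2 pixels down and 2 pixels right
image_b = np.roll(image_a, shift=(2, 2), axis=(0, 1))

out_a = apply_kernel(image_a, kernel)
out_b = apply_kernel(image_b, kernel)

# The response pattern is identical, just moved by the same offset
same_pattern_shifted = np.array_equal(
    np.roll(out_a, shift=(2, 2), axis=(0, 1)), out_b
)

# The strongest response has the same value but a different location
peak_a = np.unravel_index(np.argmax(out_a), out_a.shape)
peak_b = np.unravel_index(np.argmax(out_b), out_b.shape)

print(same_pattern_shifted)        # True
print(out_a.max() == out_b.max())  # True
print(peak_a, peak_b)
```

Taking the maximum over the whole response map, as the second check does, is a tiny version of what pooling layers do: it throws away the position and keeps only "how strongly did the pattern appear".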

Convolution from Scratch

import numpy as np

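def convolve2d(image, kernel):
    """Slide a kernel over a 2D image (no padding, stride 1).

    Like deep learning libraries, this does not flip the kernel, so it is
    technically cross-correlation rather than textbook convolution.
    """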
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    output = np.zeros((oh, ow))
    
    for y in range(oh):
        for x in range(ow):
            # Extract patch and multiply with kernel
            patch = image[y:y+kh, x:x+kw]
            output[y, x] = np.sum(patch * kernel)
    
    return output

# A simple 6x6 "image" with a vertical edge
image = np.array([
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
    [0, 0, 0, 10, 10, 10],
])

# Vertical edge detector
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

# Blur kernel
blur_kernel = np.ones((3, 3)) / 9

print("Original image:")
print(image)

print("\nVertical edge detection:")
edge_result = convolve2d(image, edge_kernel)
print(edge_result.astype(int))

print("\nBlurred image:")
blur_result = convolve2d(image, blur_kernel)
print(np.round(blur_result, 1))

Output

Original image:
[[ 0  0  0 10 10 10]
 [ 0  0  0 10 10 10]
 [ 0  0  0 10 10 10]
 [ 0  0  0 10 10 10]
 [ 0  0  0 10 10 10]
 [ 0  0  0 10 10 10]]

Vertical edge detection:
[[ 0 30 30  0]
 [ 0 30 30  0]
 [ 0 30 30  0]
 [ 0 30 30  0]]

Blurred image:
[[ 0.   3.3  6.7 10. ]
 [ 0.   3.3  6.7 10. ]
 [ 0.   3.3  6.7 10. ]
 [ 0.   3.3  6.7 10. ]]

Note: A 3x3 kernel has only 9 learnable parameters (plus an optional bias), yet it can scan an image of any size. A fully connected layer connecting every input pixel to every output pixel of a single-channel 224x224 image would need about 2.5 billion weights (50,176 x 50,176). A convolution covers the whole image with a tiny fraction of those parameters, which is why it is so efficient.
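The arithmetic behind that comparison is short enough to check by hand. This sketch assumes a single-channel image, an output the same size as the input, and no bias terms.

```python
height, width = 224, 224
pixels = height * width  # 50,176 input pixels

# Fully connected: one weight for every (input pixel, output pixel) pair
dense_weights = pixels * pixels

# Convolution: one 3x3 kernel shared across every position
conv_weights = 3 * 3

print(f"{dense_weights:,}")           # 2,517,630,976
print(conv_weights)                   # 9
print(dense_weights // conv_weights)  # how many times more weights the dense layer needs
```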

Beyond Simple Filters

In classic computer vision, engineers hand-designed kernels for specific tasks (Sobel for edges, Gaussian for blur). In deep learning, the network learns the kernel values automatically during training.

The network starts with random kernels and adjusts them through backpropagation until they detect the patterns most useful for the task. A network trained on faces might learn kernels for eyes, noses, and mouths. One trained on landscapes might learn kernels for sky gradients and tree textures.
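Here is a toy version of that learning process, a sketch rather than how a real framework does it. It assumes a single 3x3 kernel, random training images, and plain gradient descent on a squared error (the same gradient backpropagation would compute for this one-layer case). The "task" is made up: reproduce the responses of a vertical edge detector that the learner never sees directly.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# The hidden pattern detector that generated the training targets
target_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

# Training data: every 3x3 patch from 20 random 8x8 images
images = rng.random((20, 8, 8))
windows = sliding_window_view(images, (3, 3), axis=(1, 2))  # (20, 6, 6, 3, 3)
patches = windows.reshape(-1, 9)                            # (720, 9)
targets = patches @ target_kernel.ravel()                   # desired output pixels

# Start from a small random kernel
kernel = rng.normal(scale=0.1, size=9)
learning_rate = 0.2

for step in range(1000):
    predictions = patches @ kernel        # forward pass: multiply and sum
    error = predictions - targets
    gradient = 2 * patches.T @ error / len(targets)  # gradient of the mean squared error
    kernel -= learning_rate * gradient    # nudge the kernel to reduce the error

learned = kernel.reshape(3, 3)
print(np.round(learned, 2))
```

After training, the learned values should sit very close to the vertical edge kernel, even though nobody typed those numbers in. Real networks do the same thing with many kernels, many layers, and a loss tied to the actual task.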

This is the key insight: you don't design the filters — the data teaches the network what to look for.
