Computer Vision | 11 min read

Convolutional Neural Networks (CNNs)

Stack convolutions to see edges, then shapes, then objects

Scope: Core Concept | Difficulty: Intermediate

Building Vision Layer by Layer

Your visual cortex doesn't jump from raw light to "that's a golden retriever" in one step. It builds up understanding gradually:

Layer 1: Detects simple edges — lines and boundaries
Layer 2: Combines edges into shapes — circles, rectangles, curves
Layer 3: Combines shapes into parts — eyes, ears, wheels, windows
Deeper layers: Combines parts into whole objects — faces, cars, dogs, buildings

A Convolutional Neural Network (CNN) loosely mirrors this hierarchy. It stacks multiple convolutional layers, where each layer builds on the patterns detected by the previous one. In trained networks, the first layers tend to respond to edges, the middle layers to shapes and textures, and the deep layers to object parts and whole objects. Nobody programs these detectors by hand: the hierarchy is learned from data.
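One way to see why deeper layers can respond to larger structures is to track the receptive field: the patch of input pixels that can influence a single value in a layer's output. The sketch below is a rough illustration, assuming a hypothetical stack of 3x3 convolutions (stride 1) and 2x2 pooling layers (stride 2). The layer list and the `receptive_field` helper are made up for this example and do not describe any particular network.

```python
def receptive_field(layers):
    """Return the receptive field size (in input pixels) after each layer.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf = 1    # how many input pixels (per side) one output value can see
    jump = 1  # distance in input pixels between two neighboring outputs
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Hypothetical stack: conv3x3, conv3x3, pool2x2, conv3x3, pool2x2, conv3x3
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
rf_sizes = receptive_field(stack)
print(rf_sizes)  # [3, 5, 6, 10, 12, 20]
```

The first layer only sees a 3x3 patch, which is enough for an edge. After six layers, each value depends on a 20x20 patch, which is enough room for a shape or a part.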

The CNN Recipe

A typical CNN stacks three types of layers:

Convolutional layers — detect patterns using learned kernels
Pooling layers — shrink the image by summarizing small regions (e.g., max pooling takes the largest value in each 2x2 block). This reduces computation and adds some translation invariance.
Fully connected layers — at the end, flatten everything into a vector and classify it ("this is a cat" vs "this is a dog")
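The sizes flowing through these three layer types follow simple arithmetic. For an input of width n, a convolution with kernel size k, padding p, and stride s produces floor((n + 2p - k) / s) + 1 outputs per side. Here is a small sketch that traces the shapes. The 28x28 input, the kernel sizes, and the channel counts are hypothetical values chosen for illustration.

```python
def conv_out_size(n, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def pool_out_size(n, window=2, stride=2):
    """Output width/height of a pooling layer along one dimension."""
    return (n - window) // stride + 1

size, channels = 28, 1                      # hypothetical grayscale input
size = conv_out_size(size, kernel=5)        # 28 -> 24
channels = 6                                # assumed number of learned kernels
size = pool_out_size(size)                  # 24 -> 12
size = conv_out_size(size, kernel=5)        # 12 -> 8
channels = 16                               # assumed number of learned kernels
size = pool_out_size(size)                  # 8 -> 4

flat_len = size * size * channels           # length of the flattened vector
print(f"Final feature maps: {channels} x {size} x {size}")
print(f"Flattened vector length: {flat_len}")  # 256
```

The fully connected layers at the end see only this flattened vector, so their input size is fixed by the layers before them.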

The Famous CNNs

Each milestone CNN pushed the boundaries:

LeNet-5 (1998) — Yann LeCun

The pioneer. Used to read handwritten digits on bank checks. It was tiny by modern standards (two convolution-and-subsampling stages followed by fully connected layers), but it showed that convolutions work for vision.

AlexNet (2012) — The Deep Learning Revolution

Won the ImageNet competition by a landslide, dropping top-5 error from roughly 26% to roughly 16%. Two key ingredients: deeper architecture (8 learned layers) and GPU training. This single result kickstarted the modern deep learning era.

VGGNet (2014) — Depth Wins

Pushed to 19 layers. Showed that deeper networks learn better features. Simple philosophy: just keep stacking 3x3 convolutions.
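A commonly cited argument for stacking 3x3 convolutions: two stacked 3x3 layers cover the same 5x5 region as one 5x5 layer, and three cover a 7x7 region, but with fewer weights and more nonlinearities in between. The sketch below just does the counting. The channel count `C = 64` and the helper name `conv_weights` are assumptions for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

C = 64  # hypothetical channel count, same in and out

two_3x3 = 2 * conv_weights(3, C, C)    # covers a 5x5 region
one_5x5 = conv_weights(5, C, C)
three_3x3 = 3 * conv_weights(3, C, C)  # covers a 7x7 region
one_7x7 = conv_weights(7, C, C)

print(f"Two 3x3 layers:   {two_3x3:>7} weights vs one 5x5: {one_5x5}")
print(f"Three 3x3 layers: {three_3x3:>7} weights vs one 7x7: {one_7x7}")
```

With these numbers, two 3x3 layers need 73,728 weights against 102,400 for a single 5x5 layer, and three 3x3 layers need 110,592 against 200,704 for a single 7x7 layer.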

ResNet (2015) — Skip Connections

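152 layers! The key innovation: skip connections that let information bypass layers. Without them, very deep plain networks were hard to optimize: past a certain depth, accuracy got worse, even on the training data. Skip connections give both the signal and the gradients a direct path through the network. ResNet showed that with the right architecture, much deeper networks can be trained and often perform better.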
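The idea of a skip connection fits in a few lines: instead of computing F(x), the block computes x + F(x). Below is a minimal sketch, assuming small dense layers as a stand-in for convolutions and all-zero weights to mimic layers that have learned nothing yet. The function names and sizes are hypothetical, and this is not the actual ResNet block.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def plain_block(x, w1, w2):
    """Two stacked layers with no shortcut."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """Same two layers, but the input is added back before the final ReLU."""
    return relu(x + relu(x @ w1) @ w2)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))  # hypothetical batch: 4 samples, 16 features
w1 = np.zeros((16, 16))       # weights that contribute nothing
w2 = np.zeros((16, 16))

plain_out = plain_block(x, w1, w2)
res_out = residual_block(x, w1, w2)

print("Plain block output is all zeros:", np.allclose(plain_out, 0))
print("Residual block passes relu(x) through:", np.allclose(res_out, relu(x)))
```

With useless weights, the plain block wipes out its input, while the residual block still passes it along. That makes "do nothing" an easy default for each block, so adding more blocks is less likely to hurt.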

Building a Simple CNN Concept

import numpy as np

def relu(x):
    return np.maximum(0, x)

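def max_pool_2x2(feature_map):
    """Shrink by taking max of each 2x2 block (an odd trailing row/column is dropped)."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    # Stop at the last full 2x2 block so odd-sized inputs don't index out of bounds
    for i in range(0, h - h % 2, 2):
        for j in range(0, w - w % 2, 2):
            pooled[i//2, j//2] = np.max(feature_map[i:i+2, j:j+2])
    return pooled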

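def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    The kernel is not flipped, so strictly speaking this is cross-correlation,
    which is what deep learning libraries usually call convolution.
    """
    ih, iw = image.shape
    kh, kw = kernel.shape
    output = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(output.shape[0]):
        for x in range(output.shape[1]):
            # Multiply the patch by the kernel element-wise, then sum to one number
            output[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return output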

# Simulate a mini CNN pipeline
image = np.array([
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
], dtype=float)

# Vertical edge detector: positive where brightness increases from left to right
edge_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

print("Step 1: Convolution (edge detection)")
conv_out = conv2d(image, edge_kernel)
print(f"  Shape: {image.shape} → {conv_out.shape}")

print("\nStep 2: ReLU (remove negatives)")
relu_out = relu(conv_out)
print("  Negative values removed")

print("\nStep 3: Max Pooling (shrink)")
pool_out = max_pool_2x2(relu_out)
print(f"  Shape: {relu_out.shape} → {pool_out.shape}")
print(f"  Result:\n{pool_out.astype(int)}")

print("\nStep 4: Flatten for classification")
flat = pool_out.flatten()
print(f"  Shape: {pool_out.shape} → ({len(flat)},)")
print(f"  Vector: {flat.astype(int)}")

Output

Step 1: Convolution (edge detection)
  Shape: (8, 8) → (6, 6)

Step 2: ReLU (remove negatives)
  Negative values removed

Step 3: Max Pooling (shrink)
  Shape: (6, 6) → (3, 3)
  Result:
[[ 0 30  0]
 [ 0 10  0]
 [ 0  0  0]]

Step 4: Flatten for classification
  Shape: (3, 3) → (9,)
  Vector: [ 0 30  0  0 10  0  0  0  0]
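The pipeline stops at the flattened vector, but the last step of the recipe is a fully connected layer that turns it into class scores. Here is a rough sketch of that step, assuming two classes and random, untrained weights, so the probabilities it prints mean nothing yet. The `features` vector is a stand-in with the same length as `flat`, and `weights`, `bias`, and `softmax` are hypothetical names for this example.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.random(9)           # stand-in for the 9-element flattened vector
weights = rng.normal(size=(2, 9))  # hypothetical: 2 classes x 9 features, untrained
bias = np.zeros(2)

logits = weights @ features + bias  # one raw score per class
probs = softmax(logits)

print(f"Logits: {logits}")
print(f"Class probabilities: {probs}")
```

Training a CNN means adjusting `weights`, `bias`, and the convolution kernels together until these probabilities match the labels.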

Note: The AlexNet moment (2012) is often called the "Big Bang" of deep learning. Before it, computer vision relied on hand-crafted features designed by PhD researchers. AlexNet showed that a CNN trained end-to-end on raw pixels could crush the competition. Almost overnight, the entire field pivoted to deep learning. The paper has been cited over 120,000 times.

CNNs Beyond Image Classification

CNNs aren't just for labeling images. They power:

Object detection (YOLO, Faster R-CNN) — draw bounding boxes around every object in an image
Semantic segmentation — label every single pixel (this pixel is sky, this pixel is road, this pixel is car)
Face recognition — identify who's in a photo
Medical imaging — detect tumors in X-rays, classify skin lesions, analyze retinal scans
Self-driving cars — understand the road, detect pedestrians, read signs

More recently, Vision Transformers (ViT) have started challenging CNN dominance by applying the transformer architecture to images. But CNNs remain the foundation of computer vision and are still widely used, especially on mobile devices where their efficiency shines.

Quick check

What do deeper layers in a CNN typically detect?
