
Convolutional Neural Networks (CNNs)

Stack convolutions to see edges, then shapes, then objects

Building Vision Layer by Layer

Your visual cortex doesn't jump from raw light to "that's a golden retriever" in one step. It builds up understanding gradually:

  • Layer 1: Detects simple edges, such as lines and boundaries
  • Layer 2: Combines edges into shapes, such as circles, rectangles, and curves
  • Layer 3: Combines shapes into parts, such as eyes, ears, wheels, and windows
  • Deeper layers: Combine parts into whole objects, such as faces, cars, dogs, and buildings

A Convolutional Neural Network (CNN) loosely mirrors this hierarchy. It stacks multiple convolutional layers, where each layer builds on the patterns detected by the previous one. Early layers tend to respond to edges, middle layers to shapes and textures, and deep layers to object parts and whole objects. It's a hierarchy of pattern detectors, learned entirely from data.

The CNN Recipe

A typical CNN stacks three types of layers:

  1. Convolutional layers: detect patterns using learned kernels
  2. Pooling layers: shrink the image by summarizing small regions (e.g., max pooling takes the largest value in each 2x2 block). This reduces computation and adds some translation invariance.
  3. Fully connected layers: at the end, flatten everything into a vector and classify it ("this is a cat" vs "this is a dog")
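
To make the recipe concrete, the sketch below tracks how the spatial size changes as an image passes through a small stack of layers. It uses the standard output-size formula for a sliding window, floor((n + 2·padding − kernel) / stride) + 1. The 28x28 input and the filter counts are assumptions chosen for illustration, not a specific published architecture.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling window slides over n pixels."""
    return (n + 2 * padding - kernel) // stride + 1

# Hypothetical toy architecture: 28x28 grayscale input, two conv + pool stages.
size, channels = 28, 1
layers = [
    # (name, kernel size, stride, number of filters or None for pooling)
    ("conv 3x3, 8 filters", 3, 1, 8),
    ("max pool 2x2", 2, 2, None),
    ("conv 3x3, 16 filters", 3, 1, 16),
    ("max pool 2x2", 2, 2, None),
]

shapes = [(channels, size, size)]
for name, kernel, stride, filters in layers:
    size = conv_output_size(size, kernel, stride)
    if filters is not None:
        channels = filters  # a conv layer outputs one channel per filter
    shapes.append((channels, size, size))
    print(f"{name:22s} -> {channels} x {size} x {size}")

# The fully connected layer sees everything flattened into one vector.
flat_length = channels * size * size
print(f"{'flatten':22s} -> {flat_length}")
```

Under these assumptions the spatial size goes 28, 26, 13, 11, 5, so the classifier at the end receives a vector of 16 × 5 × 5 = 400 values.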

The Famous CNNs

Each milestone CNN pushed the boundaries:

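LeNet-5 (1998): Yann LeCun

The pioneer. Used to read handwritten digits on bank checks. Tiny by modern standards (about 5 layers with learned weights, the exact count depending on how pooling layers are counted), but it proved convolutions work for vision.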

AlexNet (2012): The Deep Learning Revolution

Won the 2012 ImageNet competition by a landslide, with a top-5 error of roughly 15-16% against roughly 26% for the runner-up. Two key ingredients: a deeper architecture (8 learned layers) and GPU training. This single result kickstarted the modern deep learning era.

VGGNet (2014): Depth Wins

Pushed to 19 layers. Showed that deeper networks learn better features. Simple philosophy: just keep stacking 3x3 convolutions.
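
One way to see why stacking 3x3 convolutions is attractive: a stack of small kernels covers the same input region as one large kernel, with fewer weights and more nonlinearities in between. The sketch below works through that arithmetic. The channel width of 64 is an assumption for illustration, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Input region seen by one output pixel after stacked stride-1 convs."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the view by (kernel - 1) pixels
    return rf

def conv_weights(kernel, channels):
    """Weights in one conv layer with `channels` inputs and outputs (no biases)."""
    return kernel * kernel * channels * channels

channels = 64  # assumed channel width, for illustration only
for num_layers, big_kernel in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(num_layers)
    stacked = num_layers * conv_weights(3, channels)
    single = conv_weights(big_kernel, channels)
    print(f"{num_layers} stacked 3x3 convs: receptive field {rf}x{rf}, "
          f"{stacked} weights (one {big_kernel}x{big_kernel} conv: {single})")
```

With these numbers, three 3x3 layers see a 7x7 region using 110,592 weights, while a single 7x7 layer needs 200,704.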

ResNet (2015): Skip Connections

152 layers! The key innovation: skip connections that let information bypass layers. Without them, very deep networks were hard to train: adding more layers eventually made accuracy worse, a problem commonly linked to vanishing gradients and optimization difficulty. ResNet showed that with the right architecture, much deeper networks can be trained and tend to beat shallower ones.
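
A skip connection can be sketched in a few lines. In the hedged example below, a block computes some transformation F(x) and then adds the input x back before the final activation. For brevity it uses small dense (matrix) layers instead of convolutions, which is a simplification of the real ResNet block; the function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def plain_block(x, w1, w2):
    """Two layers with no shortcut: the output depends only on the weights' path."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """Same two layers, but the input is added back before the final ReLU."""
    return relu(relu(x @ w1) @ w2 + x)

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(1, 4)))  # non-negative input so the final ReLU keeps it
w_zero = np.zeros((4, 4))            # extreme case: the layers contribute nothing

print("plain   :", plain_block(x, w_zero, w_zero))     # signal is lost
print("residual:", residual_block(x, w_zero, w_zero))  # input passes straight through
```

Even when the layers inside the block contribute nothing, the residual block still passes its input along unchanged, so an extra block can fall back to doing no harm. The plain block has no such fallback.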

Building a Simple CNN Concept

import numpy as np

def relu(x):
    """Replace negative values with zero."""
    return np.maximum(0, x)

def max_pool_2x2(feature_map):
    """Shrink by taking max of each 2x2 block (assumes even height and width)."""
    h, w = feature_map.shape
    pooled = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            pooled[i//2, j//2] = np.max(feature_map[i:i+2, j:j+2])
    return pooled

def conv2d(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    output = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(output.shape[0]):
        for x in range(output.shape[1]):
            output[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return output

# Simulate a mini CNN pipeline
image = np.array([
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [0, 0, 0, 0, 10, 10, 10, 10],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
    [10, 10, 10, 10, 0, 0, 0, 0],
], dtype=float)

# Vertical-edge kernel: responds where brightness increases from left to right
edge_kernel = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)

print("Step 1: Convolution (edge detection)")
conv_out = conv2d(image, edge_kernel)
print(f"  Shape: {image.shape} → {conv_out.shape}")

print("\nStep 2: ReLU (remove negatives)")
relu_out = relu(conv_out)
print("  Negative values removed")

print("\nStep 3: Max Pooling (shrink)")
pool_out = max_pool_2x2(relu_out)
print(f"  Shape: {relu_out.shape} → {pool_out.shape}")
print(f"  Result:\n{pool_out.astype(int)}")

print("\nStep 4: Flatten for classification")
flat = pool_out.flatten()
print(f"  Shape: {pool_out.shape} → ({len(flat)},)")
print(f"  Vector: {flat.astype(int)}")
Output
Step 1: Convolution (edge detection)
  Shape: (8, 8) → (6, 6)

Step 2: ReLU (remove negatives)
  Negative values removed

Step 3: Max Pooling (shrink)
  Shape: (6, 6) → (3, 3)
  Result:
[[ 0 30  0]
 [ 0 10  0]
 [ 0  0  0]]

Step 4: Flatten for classification
  Shape: (3, 3) → (9,)
  Vector: [ 0 30  0  0 10  0  0  0  0]
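
The pipeline above stops at the flattened vector. As a rough sketch of the final classification step, a fully connected layer multiplies that vector by a weight matrix and softmax turns the resulting scores into probabilities. The input values, the two class names, and the random weights below are all illustrative assumptions. In a real CNN the weights are learned during training, so the prediction here is meaningless beyond showing the mechanics.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative 9-element feature vector, same shape as a flattened 3x3 map.
flat = np.array([0, 30, 0, 0, 10, 0, 0, 0, 0], dtype=float)

classes = ["cat", "dog"]  # hypothetical labels
rng = np.random.default_rng(42)
# Untrained stand-in weights: one row of 9 weights per class.
weights = rng.normal(scale=0.1, size=(len(classes), flat.size))
bias = np.zeros(len(classes))

scores = weights @ flat + bias  # fully connected layer: one score per class
probs = softmax(scores)

for name, p in zip(classes, probs):
    print(f"{name}: {p:.3f}")
print("Predicted:", classes[int(np.argmax(probs))])
```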
Note: The AlexNet moment (2012) is often called the "Big Bang" of deep learning. Before it, computer vision relied on hand-crafted features designed by PhD researchers. AlexNet showed that a CNN trained end-to-end on raw pixels could crush the competition. Almost overnight, the entire field pivoted to deep learning. The paper has been cited over 120,000 times.

CNNs Beyond Image Classification

CNNs aren't just for labeling images. They power:

  • Object detection (YOLO, Faster R-CNN): draw bounding boxes around every object in an image
  • Semantic segmentation: label every single pixel (this pixel is sky, this pixel is road, this pixel is car)
  • Face recognition: identify who's in a photo
  • Medical imaging: detect tumors in X-rays, classify skin lesions, analyze retinal scans
  • Self-driving cars: understand the road, detect pedestrians, read signs
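
To illustrate how segmentation differs from classification, here is a minimal sketch of the last step of a segmentation model. Instead of one score per class for the whole image, the network produces one score per class per pixel, and each pixel takes the class with the highest score. The score values and the three-class label set are made up for illustration.

```python
import numpy as np

class_names = ["sky", "road", "car"]  # hypothetical label set

# Hypothetical network output: scores of shape (num_classes, height, width).
scores = np.zeros((3, 4, 4))
scores[0, :2, :] = 5.0     # top two rows score highest for "sky"
scores[1, 2:, :] = 5.0     # bottom two rows score well for "road"
scores[2, 2:, 1:3] = 9.0   # a small block in the bottom rows scores highest for "car"

# One class index per pixel: pick the best-scoring class along the class axis.
label_map = scores.argmax(axis=0)
print(label_map)

for row in label_map:
    print(" ".join(f"{class_names[i]:4s}" for i in row))
```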

More recently, Vision Transformers (ViT) have started challenging CNN dominance by applying the transformer architecture to images. But CNNs remain the foundation of computer vision and are still widely used, especially on mobile devices where their efficiency shines.
