Convolutional Neural Networks (CNNs)
Building Vision Layer by Layer
Your visual cortex doesn't jump from raw light to "that's a golden retriever" in one step. It builds up understanding gradually:
- Layer 1: Detects simple edges (lines and boundaries)
- Layer 2: Combines edges into shapes (circles, rectangles, curves)
- Layer 3: Combines shapes into parts (eyes, ears, wheels, windows)
- Deeper layers: Combine parts into whole objects (faces, cars, dogs, buildings)
A Convolutional Neural Network (CNN) loosely mirrors this hierarchy. It stacks multiple convolutional layers, where each layer builds on the patterns detected by the previous one. The first layers tend to respond to edges, the middle layers to shapes and textures, and the deep layers to object parts and whole objects. It's a hierarchy of pattern detectors, learned entirely from data.
The CNN Recipe
A typical CNN stacks three types of layers:
- Convolutional layers: detect patterns using learned kernels
- Pooling layers: shrink the image by summarizing small regions (e.g., max pooling takes the largest value in each 2x2 block). This reduces computation and adds some translation invariance.
- Fully connected layers: at the end, flatten everything into a vector and classify it ("this is a cat" vs "this is a dog")
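To make the first two ingredients concrete, here is a minimal NumPy sketch of a convolution followed by 2x2 max pooling. The function names (`conv2d_valid`, `max_pool_2x2`), the 6x6 test image, and the hand-picked vertical-edge kernel are all illustrative assumptions; a real CNN would learn its kernels from data and use an optimized library rather than Python loops.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1).

    Like most deep-learning code, this is technically cross-correlation:
    the kernel is not flipped before sliding.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the patch and the kernel, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h2, w2 = feature_map.shape[0] // 2, feature_map.shape[1] // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]  # drop an odd last row/column
    return trimmed.reshape(h2, 2, w2, 2).max(axis=(1, 3))

# Illustrative 6x6 image: dark left half (0), bright right half (1)
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked vertical-edge kernel (an assumption; real kernels are learned)
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, kernel)  # shape (4, 4), large where the edge is
pooled = max_pool_2x2(response)         # shape (2, 2), same edge, fewer values
print(response)
print(pooled)
```

The response map is nonzero only in the columns where the kernel straddles the dark-to-bright boundary, and pooling keeps that signal while quartering the number of values.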
The Famous CNNs
Each milestone CNN pushed the boundaries:
LeNet-5 (1998): Yann LeCun
The pioneer. Used to read handwritten digits on bank checks. Tiny by modern standards, with just a handful of layers, but it proved convolutions work for vision.
AlexNet (2012): The Deep Learning Revolution
Won the 2012 ImageNet competition (ILSVRC) by a landslide, with a top-5 error of about 15% against about 26% for the runner-up. Two key ingredients: a deeper architecture (8 layers) and GPU training. This result is widely credited with kickstarting the modern deep learning era.
VGGNet (2014): Depth Wins
Pushed depth to 16 and 19 layers (VGG-16 and VGG-19) and showed that added depth improved accuracy. Simple philosophy: just keep stacking small 3x3 convolutions.
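One commonly cited argument for stacking 3x3 convolutions is that a stack covers the same input region as a single larger kernel while using fewer weights. The arithmetic below is a rough sketch of that comparison; the helper names and the channel count of 64 are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in `num_layers` stacked conv layers, each mapping
    `channels` inputs to `channels` outputs (biases ignored)."""
    return num_layers * kernel_size * kernel_size * channels * channels

def stacked_receptive_field(kernel_size, num_layers):
    """Input region seen by one output value after `num_layers`
    stacked stride-1 convolutions of the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # illustrative channel count

# Three stacked 3x3 layers see a 7x7 region of the input...
print(stacked_receptive_field(3, 3))

# ...but need fewer weights than one 7x7 layer covering the same region.
three_3x3 = conv_weights(3, channels, num_layers=3)
one_7x7 = conv_weights(7, channels)
print(three_3x3, one_7x7)
```

The stack also applies a nonlinearity after each layer, so it can represent more complex functions than the single large kernel.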
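ResNet (2015): Skip Connections
Up to 152 layers! The key innovation: skip connections that let information bypass layers. Without them, very deep networks were hard to train: gradients struggle to propagate through many stacked layers, and accuracy actually got worse as plain networks grew deeper. ResNet showed that, with the right architecture, far deeper networks can be trained and can outperform shallower ones.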
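The idea of a skip connection can be sketched in a few lines: the block's input is added back to its output, so the layers only need to learn a correction on top of the identity. The sketch below uses small dense layers for brevity, which is a simplifying assumption; actual ResNet blocks use convolutions and batch normalization, and the function names here are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """Two layers with no shortcut: relu(w2 @ relu(w1 @ x))."""
    return relu(w2 @ relu(w1 @ x))

def residual_block(x, w1, w2):
    """Same two layers, but the input is added back before the final ReLU."""
    return relu(w2 @ relu(w1 @ x) + x)

x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))  # extreme case: layers that contribute nothing

plain_out = plain_block(x, zeros, zeros)        # the signal is wiped out
residual_out = residual_block(x, zeros, zeros)  # the input passes through

print(plain_out)
print(residual_out)
```

With all-zero weights the plain block destroys its input, while the residual block simply passes it along. This illustrates why adding more residual blocks should, at worst, leave the signal intact.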
Building a Simple CNN Concept
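As a rough illustration of how the pieces fit together, here is a minimal forward pass written in plain NumPy: convolution, ReLU, max pooling, flattening, and a fully connected layer with softmax. Everything specific in it is an assumption made for the sketch: the function names, the 8x8 input, the four 3x3 kernels, the two output classes, and the ReLU activation. The weights are random and untrained, so the output probabilities are meaningless; a real model would be built and trained with a deep learning framework.

```python
import numpy as np

def conv2d(image, kernels):
    """Apply each kernel to a single-channel image (no padding, stride 1)."""
    n, kh, kw = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((n, out_h, out_w))
    for k in range(n):
        for i in range(out_h):
            for j in range(out_w):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

def max_pool_2x2(maps):
    """Keep the largest value in each 2x2 block of every feature map."""
    n, h, w = maps.shape
    h2, w2 = h // 2, w // 2
    return maps[:, :h2 * 2, :w2 * 2].reshape(n, h2, 2, w2, 2).max(axis=(2, 4))

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

def forward(image, kernels, fc_weights, fc_bias):
    features = np.maximum(conv2d(image, kernels), 0.0)  # convolution + ReLU
    pooled = max_pool_2x2(features)                     # shrink each map
    flat = pooled.reshape(-1)                           # flatten to a vector
    return softmax(fc_weights @ flat + fc_bias)         # classify

rng = np.random.default_rng(0)
image = rng.random((8, 8))            # stand-in for a grayscale image
kernels = rng.normal(size=(4, 3, 3))  # 4 untrained 3x3 kernels

# Shapes: 8x8 -> conv 3x3 -> 6x6 -> pool 2x2 -> 3x3, so 4 * 3 * 3 = 36 features
fc_weights = rng.normal(size=(2, 36))  # 2 classes, e.g. "cat" vs "dog"
fc_bias = np.zeros(2)

probs = forward(image, kernels, fc_weights, fc_bias)
print(probs)  # two class probabilities
```

Training would adjust `kernels`, `fc_weights`, and `fc_bias` so that these probabilities match the labels in a dataset.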
CNNs Beyond Image Classification
CNNs aren't just for labeling images. They power:
- Object detection (YOLO, Faster R-CNN): draw bounding boxes around every object in an image
- Semantic segmentation: label every single pixel (this pixel is sky, this pixel is road, this pixel is car)
- Face recognition: identify who's in a photo
- Medical imaging: detect tumors in X-rays, classify skin lesions, analyze retinal scans
- Self-driving cars: understand the road, detect pedestrians, read signs
More recently, Vision Transformers (ViT) have started challenging CNN dominance by applying the transformer architecture to images. But CNNs remain the foundation of computer vision and are still widely used, especially on mobile devices where their efficiency shines.