Computer Vision | 9 min read

Images as Numbers

Every picture is just a grid of numbers

scope: Core Concept | difficulty: Beginner

Zoom In Far Enough

Open any photo on your phone and zoom in as far as you can. Keep going. Eventually, you'll see tiny colored squares — pixels. Every digital image, from selfies to satellite photos, is just a grid of pixels.

And each pixel? Just three numbers: one for Red, one for Green, one for Blue. Mix them together like colored light (not paint: adding more light makes a color brighter, not muddier), and you get every color your screen can show.

[255, 0, 0] = pure red
[0, 255, 0] = pure green
[0, 0, 255] = pure blue
[255, 255, 0] = yellow (red + green)
[255, 255, 255] = white (all colors)
[0, 0, 0] = black (no light)
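
The mixing in the list above is additive: channel values add up the way overlapping beams of light do. A minimal sketch of that idea, assuming 8-bit channels capped at 255 (the helper name `mix` is made up for illustration):

```python
import numpy as np

def mix(*colors):
    """Add RGB colors channel by channel, capping each channel at 255 (hypothetical helper)."""
    # Sum in plain ints so the addition can't overflow the 0-255 range
    total = np.sum(np.array(colors, dtype=int), axis=0)
    return np.clip(total, 0, 255).astype(np.uint8)

red = [255, 0, 0]
green = [0, 255, 0]
blue = [0, 0, 255]

yellow = mix(red, green)
white = mix(red, green, blue)
print(yellow)  # [255 255   0]
print(white)   # [255 255 255]
```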

A typical smartphone photo is about 4000 x 3000 pixels. That's 12 million pixels, each with 3 color values — 36 million numbers to describe a single picture.

Why This Matters for AI

Once an image is just a grid of numbers, computers can do math on it. And math is what neural networks do best. Brightness? Average the numbers. Edges? Subtract neighboring pixels. Blur? Average a neighborhood. Every image operation is just arithmetic on a grid.
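
The three operations named above can each be sketched in a line or two of NumPy. This is an illustration on a made-up 3x4 grayscale image, not how production libraries implement them; the blur here only averages horizontal neighbors to keep the example short.

```python
import numpy as np

# A tiny 3x4 grayscale image: dark on the left, bright on the right
img = np.array([
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
], dtype=np.uint8)

# Brightness: the mean of all pixel values
brightness = img.mean()

# Edges: difference between each pixel and its right-hand neighbor.
# Cast to int first so negative differences don't wrap around in uint8.
edges = np.diff(img.astype(int), axis=1)

# Blur: average each interior pixel with its left and right neighbors
as_float = img.astype(float)
blur = (as_float[:, :-2] + as_float[:, 1:-1] + as_float[:, 2:]) / 3

print(brightness)   # 105.0
print(edges[0])     # [  0 190   0]
print(blur.shape)   # (3, 2)
```

The large value in the middle of `edges` marks the column where the image jumps from dark to bright, which is exactly where a viewer would see an edge.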

Grayscale: The Simple Version

Color images have 3 channels (R, G, B). Grayscale images have just 1: a brightness value from 0 (black) to 255 (white). They're easier to work with, and many classic computer vision algorithms start by converting to grayscale.

Image as a Matrix (or Tensor)

To a computer, an image is a 3D array (also called a tensor):

Height x Width x Channels
A 28x28 grayscale image → shape (28, 28, 1)
A 1920x1080 color image → shape (1080, 1920, 3)

This is the fundamental data structure for all computer vision. When you feed an image to a neural network, you're really feeding it a tensor of numbers.
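
As a quick illustration, the two shapes above can be created as empty NumPy arrays. The variable names are made up for the example, and the (height, width, channels) order is the convention assumed throughout this lesson.

```python
import numpy as np

# Convention used here: (height, width, channels)
digit = np.zeros((28, 28, 1), dtype=np.uint8)       # 28x28 grayscale
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # 1920x1080 color

print(digit.shape)   # (28, 28, 1)
print(frame.shape)   # (1080, 1920, 3)

# Index order is [row, column, channel], so y comes before x
frame[0, 1919] = [255, 0, 0]        # top-right pixel set to red
height, width, channels = frame.shape
print(width, height, channels)      # 1920 1080 3
```

Note that height comes first, which is the reverse of how resolutions are usually written (1920x1080). Some frameworks order the axes differently (PyTorch, for example, typically expects channels first), so it is worth checking the convention before feeding an array to a model.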

Resolution and Information

More pixels = more detail = more numbers. A 4K image (3840x2160) has 8.3 million pixels — compared to a tiny 28x28 thumbnail (784 pixels) often used in AI experiments. The MNIST handwritten digit dataset uses 28x28 grayscale images. Simple, but enough for a neural network to recognize digits with 99%+ accuracy.
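
To make the comparison concrete, here is a small sketch that counts pixels and stored numbers for the image sizes mentioned so far. Treating the 4K frame as a 3-channel color image is an assumption made for the example.

```python
# (width, height, channels) for the image sizes mentioned in the text
sizes = {
    "MNIST digit": (28, 28, 1),
    "Smartphone photo": (4000, 3000, 3),
    "4K frame": (3840, 2160, 3),   # assumed to be RGB
}

counts = {}
for name, (width, height, channels) in sizes.items():
    pixels = width * height
    counts[name] = (pixels, pixels * channels)
    print(f"{name}: {pixels:,} pixels, {pixels * channels:,} numbers")

# MNIST digit: 784 pixels, 784 numbers
# Smartphone photo: 12,000,000 pixels, 36,000,000 numbers
# 4K frame: 8,294,400 pixels, 24,883,200 numbers
```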

Working with Images as Numbers

import numpy as np

# Create a tiny 4x4 RGB image
image = np.zeros((4, 4, 3), dtype=np.uint8)

# Paint some pixels
image[0, 0] = [255, 0, 0]    # Red
image[0, 1] = [0, 255, 0]    # Green
image[0, 2] = [0, 0, 255]    # Blue
image[0, 3] = [255, 255, 0]  # Yellow
image[1, :] = [128, 128, 128]  # Gray row
image[2, :] = [255, 255, 255]  # White row
image[3, :] = [0, 0, 0]       # Black row

print(f"Image shape: {image.shape}  (height, width, RGB)")
print(f"Total numbers: {image.size}")
print(f"\nTop-left pixel (Red): {image[0, 0]}")
print(f"Gray pixel: {image[1, 0]}")

# Convert to grayscale using the standard luma weights (ITU-R BT.601).
# Round before casting: astype(np.uint8) truncates, and floating-point
# error can turn a value like 127.99999999999999 into 127 instead of 128.
def to_grayscale(img):
    weighted = 0.299 * img[:,:,0] + 0.587 * img[:,:,1] + 0.114 * img[:,:,2]
    return np.round(weighted).astype(np.uint8)

gray = to_grayscale(image)
print(f"\nGrayscale shape: {gray.shape}")
print(f"Grayscale values:\n{gray}")

# Detect edges (simple difference between vertically neighboring rows)
print("\nBrightness change between rows:")
for i in range(len(gray) - 1):
    # Cast to int so negative differences don't wrap around in uint8
    diff = gray[i+1].astype(int) - gray[i].astype(int)
    print(f"  Row {i}→{i+1}: {diff}")

Output

Image shape: (4, 4, 3)  (height, width, RGB)
Total numbers: 48

Top-left pixel (Red): [255   0   0]
Gray pixel: [128 128 128]

Grayscale shape: (4, 4)
Grayscale values:
[[ 76 150  29 226]
 [128 128 128 128]
 [255 255 255 255]
 [  0   0   0   0]]

Brightness change between rows:
  Row 0→1: [ 52 -22  99 -98]
  Row 1→2: [127 127 127 127]
  Row 2→3: [-255 -255 -255 -255]

Note: Why 0-255? Each color channel is stored in 1 byte (8 bits), which can hold values from 0 to 255. That gives 256 levels per channel, and 256 x 256 x 256 = 16,777,216 (about 16.7 million) possible colors. That's generally considered enough: a commonly cited estimate is that the human eye can distinguish around 10 million colors.
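
The arithmetic in the note is easy to check, and the same 8-bit limit has a practical consequence worth seeing once. A minimal sketch (the variable names are illustrative):

```python
import numpy as np

bits_per_channel = 8
levels = 2 ** bits_per_channel       # 256 values: 0 through 255
colors = levels ** 3                 # one choice per channel for R, G and B
print(levels, colors)                # 256 16777216

# A uint8 value can't go above 255: array arithmetic wraps around instead
pixel = np.array([250], dtype=np.uint8)
wrapped = pixel + np.uint8(10)       # 260 doesn't fit in 8 bits, so it wraps to 4
print(wrapped)                       # [4]
```

This wraparound is why image code often casts to a wider type such as `int` or `float` before doing math, then clips back to the 0-255 range at the end.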

From Pixels to Understanding

A single pixel tells you almost nothing. It's just a color. But patterns of pixels form edges. Groups of edges form shapes. Collections of shapes form objects. This hierarchy — from numbers to pixels to edges to shapes to objects to scenes — is exactly what neural networks learn to build, layer by layer.

The journey from "grid of numbers" to "that's a photo of a golden retriever at the beach" is the story of computer vision. And it all starts with the humble pixel.

Quick check

How many numbers does a single pixel in a color image contain?
