Computer Vision · 9 min read

Images as Numbers

Every picture is just a grid of numbers
Scope: Core Concept · Difficulty: Beginner

Zoom In Far Enough

Open any photo on your phone and zoom in as far as you can. Keep going. Eventually, you'll see tiny colored squares: pixels. Every digital image, from selfies to satellite photos, is just a grid of pixels.

And each pixel? Just three numbers: one for Red, one for Green, one for Blue. Mix them together like colored lights (not like paint: adding more light makes the result brighter), and you get any color a screen can display.

  • [255, 0, 0] = pure red
  • [0, 255, 0] = pure green
  • [0, 0, 255] = pure blue
  • [255, 255, 0] = yellow (red + green)
  • [255, 255, 255] = white (all colors)
  • [0, 0, 0] = black (no light)
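The color list above can be reproduced with a few lines of NumPy. This is a minimal sketch of additive mixing; the helper name `mix` is hypothetical and exists only for illustration.

```python
import numpy as np

# Primary colors as [R, G, B] triples, each channel in 0..255.
red = np.array([255, 0, 0], dtype=np.uint8)
green = np.array([0, 255, 0], dtype=np.uint8)
blue = np.array([0, 0, 255], dtype=np.uint8)

def mix(a, b):
    """Add two colors channel by channel (hypothetical helper).

    The sum is computed in a wider integer type and clipped to 255,
    because a uint8 channel cannot hold anything larger.
    """
    total = a.astype(np.int32) + b.astype(np.int32)
    return np.clip(total, 0, 255).astype(np.uint8)

yellow = mix(red, green)
white = mix(yellow, blue)

print(yellow)  # [255 255   0]
print(white)   # [255 255 255]
```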

A typical smartphone photo is about 4000 x 3000 pixels. That's 12 million pixels, each with 3 color values, which means 36 million numbers to describe a single picture.
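The arithmetic is easy to check. The sketch below assumes one byte per color value and no compression, so the size is for the raw pixel data, not for a JPEG file on disk.

```python
# Dimensions of the example smartphone photo from the text.
width, height, channels = 4000, 3000, 3

pixels = width * height      # one entry per pixel
values = pixels * channels   # three numbers (R, G, B) per pixel

print(f"{pixels:,} pixels")   # 12,000,000 pixels
print(f"{values:,} numbers")  # 36,000,000 numbers

# Assumption: each value takes 1 byte and nothing is compressed.
megabytes = values / 1_000_000
print(f"{megabytes:.0f} MB of raw pixel data")  # 36 MB of raw pixel data
```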

Why This Matters for AI

Once an image is just a grid of numbers, computers can do math on it. And math is what neural networks do best. Brightness? Average the numbers. Edges? Subtract neighboring pixels. Blur? Average a neighborhood. Every image operation is just arithmetic on a grid.
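Here is a rough sketch of those three operations on a tiny made-up grayscale image. The blur is deliberately simplified to a horizontal three-pixel average; real blurs usually average a 2D neighborhood.

```python
import numpy as np

# A tiny 3x4 grayscale image: dark on the left, bright on the right.
img = np.array([
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
], dtype=np.uint8)

# Brightness: average all the numbers.
brightness = img.mean()

# Work in a signed type so differences can go negative.
signed = img.astype(np.int32)

# Edges: subtract each pixel from its right-hand neighbor.
# Large values mark the jump from dark to bright.
edges = signed[:, 1:] - signed[:, :-1]

# Blur (simplified): average each interior pixel with its left
# and right neighbors. The result is two columns narrower.
blur = (signed[:, :-2] + signed[:, 1:-1] + signed[:, 2:]) / 3

print(brightness)  # 105.0
print(edges[0])    # [  0 190   0]
print(blur[0])     # roughly 73.3 and 136.7
```

The edge row is zero wherever neighbors match and 190 at the boundary, while the blur smears that boundary across two pixels.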

Grayscale: The Simple Version

Color images have 3 channels (R, G, B). Grayscale images have just 1: a brightness value from 0 (black) to 255 (white). They're easier to work with, and many classic computer vision algorithms start with grayscale.

Image as a Matrix (or Tensor)

To a computer, an image is a 3D array (also called a tensor):

  • Height x Width x Channels
  • A 28x28 grayscale image → shape (28, 28, 1)
  • A 1920x1080 color image → shape (1080, 1920, 3)

This is the fundamental data structure for all computer vision. When you feed an image to a neural network, you're really feeding it a tensor of numbers.
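As a quick illustration, the two shapes listed above can be created as blank NumPy arrays. The variable names `digit` and `frame` are placeholders. Note that the index order is row first, then column, so a pixel at screen position (x, y) lives at `[y, x]`.

```python
import numpy as np

# Blank (all-black) images with the shapes from the list above.
digit = np.zeros((28, 28, 1), dtype=np.uint8)      # grayscale
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # 1920x1080 color

print(digit.shape)  # (28, 28, 1)
print(frame.shape)  # (1080, 1920, 3)
print(frame.size)   # 6220800 numbers in total

# Indexing is [row, column, channel], that is [y, x, channel].
frame[0, 1919] = [255, 0, 0]  # paint the top-right pixel red

print(frame[0, 1919])     # [255   0   0]
print(frame[0, 1919, 0])  # 255, the red channel of that pixel
```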

Resolution and Information

More pixels = more detail = more numbers. A 4K image (3840x2160) has about 8.3 million pixels, compared to the 784 pixels of a tiny 28x28 thumbnail of the kind often used in AI experiments. The MNIST handwritten digit dataset uses 28x28 grayscale images. Simple, but enough for a neural network to recognize digits with 99%+ accuracy.

Working with Images as Numbers

import numpy as np

# Create a tiny 4x4 RGB image
image = np.zeros((4, 4, 3), dtype=np.uint8)

# Paint some pixels
image[0, 0] = [255, 0, 0]      # Red
image[0, 1] = [0, 255, 0]      # Green
image[0, 2] = [0, 0, 255]      # Blue
image[0, 3] = [255, 255, 0]    # Yellow
image[1, :] = [128, 128, 128]  # Gray row
image[2, :] = [255, 255, 255]  # White row
image[3, :] = [0, 0, 0]        # Black row

print(f"Image shape: {image.shape} (height, width, RGB)")
print(f"Total numbers: {image.size}")
print(f"\nTop-left pixel (Red): {image[0, 0]}")
print(f"Gray pixel: {image[1, 0]}")

# Convert to grayscale using the standard luma weights (ITU-R BT.601)
def to_grayscale(img):
    weighted = 0.299 * img[:, :, 0] + 0.587 * img[:, :, 1] + 0.114 * img[:, :, 2]
    # Round before casting: astype(np.uint8) alone truncates toward zero
    return np.round(weighted).astype(np.uint8)

gray = to_grayscale(image)
print(f"\nGrayscale shape: {gray.shape}")
print(f"Grayscale values:\n{gray}")

# Detect edges (simple difference between neighboring rows).
# Cast to int first: uint8 cannot hold negative differences.
print("\nBrightness change between rows:")
for i in range(len(gray) - 1):
    diff = gray[i + 1].astype(int) - gray[i].astype(int)
    print(f"  Row {i}→{i+1}: {diff}")

Output
Image shape: (4, 4, 3) (height, width, RGB)
Total numbers: 48

Top-left pixel (Red): [255   0   0]
Gray pixel: [128 128 128]

Grayscale shape: (4, 4)
Grayscale values:
[[ 76 150  29 226]
 [128 128 128 128]
 [255 255 255 255]
 [  0   0   0   0]]

Brightness change between rows:
  Row 0→1: [ 52 -22  99 -98]
  Row 1→2: [127 127 127 127]
  Row 2→3: [-255 -255 -255 -255]
Note: Why 0-255? Each color channel is stored in 1 byte (8 bits), which can hold values from 0 to 255. That gives 256 levels per channel, and 256 x 256 x 256 = 16.7 million possible colors. That's more than enough: by common estimates, the human eye can distinguish only about 10 million colors.
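The numbers in the note can be verified directly. The second half of this sketch shows a practical consequence of the one-byte limit that the note does not spell out, so treat it as a supplementary illustration: arithmetic on `uint8` arrays wraps around instead of going above 255.

```python
import numpy as np

levels = 2 ** 8       # values one 8-bit channel can hold
colors = levels ** 3  # combinations across R, G and B

print(levels)         # 256
print(f"{colors:,}")  # 16,777,216

info = np.iinfo(np.uint8)
print(info.min, info.max)  # 0 255

# uint8 arithmetic wraps around: 250 + 10 = 260 does not fit in a byte.
bright = np.array([250], dtype=np.uint8)
wrapped = bright + np.uint8(10)

# Safer: compute in a wider type, clip to 0..255, then cast back.
safe = np.clip(bright.astype(np.int32) + 10, 0, 255).astype(np.uint8)

print(wrapped)  # [4]
print(safe)     # [255]
```

The same wraparound is why pixel values are usually converted to a wider or signed type before being added or subtracted.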

From Pixels to Understanding

A single pixel tells you almost nothing. It's just a color. But patterns of pixels form edges. Groups of edges form shapes. Collections of shapes form objects. This hierarchy, from numbers to pixels to edges to shapes to objects to scenes, is what neural networks learn to build, layer by layer.

The journey from "grid of numbers" to "that's a photo of a golden retriever at the beach" is the story of computer vision. And it all starts with the humble pixel.

Quick check

How many numbers does a single pixel in a color image contain?