Huffman Encoding
In Morse code, E is a single dot (·) because it's the most common letter in English. Z is dash-dash-dot-dot (−−··) because you almost never use it. Frequent characters get short codes; rare ones get long ones. Every character is still unambiguous.
Huffman encoding is the same idea, made mathematically optimal. It assigns binary codewords to characters so that the total number of bits needed to encode a message with the given character frequencies is minimized. And this optimality is provable, not just a heuristic that happens to work well.
Prefix-Free Codes
The key constraint: no codeword can be a prefix of another. If A is encoded as 01, no other character can start with 01. This lets us decode a stream of bits without needing delimiters — when you've hit a complete codeword, you know it. You never misread a short code as the beginning of a long one.
Every prefix-free code corresponds to a binary tree where each leaf is a character; in an optimal code the tree is full, meaning every internal node has exactly two children. The codeword is the path from root to leaf (left = 0, right = 1). The encoding cost equals the sum of (character frequency × leaf depth).
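The two properties above are easy to check directly. A minimal sketch, with a made-up code and frequency table chosen purely for illustration:

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another.

    After sorting, any prefix pair becomes an adjacent pair,
    so checking neighbors suffices.
    """
    words = sorted(code.values())
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

def encoding_cost(code, freq):
    """Sum of frequency x codeword length (codeword length = leaf depth)."""
    return sum(freq[ch] * len(cw) for ch, cw in code.items())

# Hypothetical example code and frequencies:
code = {"a": "0", "b": "10", "c": "11"}
freq = {"a": 10, "b": 3, "c": 2}

print(is_prefix_free(code))       # True
print(encoding_cost(code, freq))  # 10*1 + 3*2 + 2*2 = 20
```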
The Greedy Algorithm
The key insight: there is an optimal tree in which the two rarest characters sit at maximum depth as siblings. If a more frequent character occupied a deepest leaf instead, swapping it with one of the rarest would move more weight to a shallower leaf and less weight to a deeper one, which can only decrease the total cost (or leave it unchanged).
This gives a beautiful bottom-up construction:
1. Create a leaf node for every character. Set each node's weight = character frequency. Put all nodes in a min-heap.
2. While more than one node remains: extract the two lightest nodes A and B. Create a new internal node with weight = A.weight + B.weight, with A and B as children. Insert back into the heap.
3. The last remaining node is the root of the Huffman tree.
After building the tree, assign 0/1 as you traverse left/right to generate each character's codeword.
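The three steps above, plus the codeword assignment, can be sketched using Python's `heapq` module as the min-heap. The tuple-based node representation and the tie-breaking counter are implementation choices, not part of the algorithm:

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from {char: frequency}, return {char: codeword}."""
    tiebreak = count()  # breaks weight ties so heapq never compares nodes
    # A leaf is (char, None, None); an internal node is (None, left, right).
    heap = [(w, next(tiebreak), (ch, None, None)) for ch, w in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)  # two lightest nodes
        wb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (wa + wb, next(tiebreak), (None, a, b)))
    root = heap[0][2]

    # Traverse the tree: left edge appends "0", right edge appends "1".
    codes = {}
    def walk(node, path):
        ch, left, right = node
        if ch is not None:
            codes[ch] = path or "0"  # single-character alphabet edge case
            return
        walk(left, path + "0")
        walk(right, path + "1")
    walk(root, "")
    return codes
```

Note that the exact codewords depend on how weight ties are broken, so different implementations may emit different (but equally optimal) codes; the total encoded size is the same either way.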
An Example
Characters: a(5), b(9), c(12), d(13), e(16), f(45). The frequencies sum to 100, so the input is 100 characters long. The Huffman tree gives: f=0 (1 bit), c=100 (3 bits), d=101 (3 bits), a=1100 (4 bits), b=1101 (4 bits), e=111 (3 bits). Total encoded size = 45×1 + 12×3 + 13×3 + 5×4 + 9×4 + 16×3 = 45+36+39+20+36+48 = 224 bits. Storing each character with a fixed 3-bit code would take 300 bits. Huffman saves ~25%.
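The arithmetic above checks out mechanically, using the frequencies and codeword lengths from the example:

```python
# Frequencies and codeword lengths from the worked example above.
freq = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
bits = {"a": 4, "b": 4, "c": 3, "d": 3, "e": 3, "f": 1}

huffman_total = sum(freq[ch] * bits[ch] for ch in freq)
fixed_total = sum(freq.values()) * 3  # 3 bits per character, 100 characters

print(huffman_total, fixed_total)                      # 224 300
print(round(1 - huffman_total / fixed_total, 3))       # 0.253 (~25% saved)
```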
Decoding
To decode: read bits one at a time and traverse the tree. 0 → go left, 1 → go right. When you land on a leaf, output its character and return to the root. Repeat for the next bit. The prefix-free property guarantees this always produces the unique correct decoding.
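The decoding loop is a sketch's worth of code. Here the tree is rebuilt from a codeword table (the table below is the one from the example; the nested-dict node representation is an assumption for brevity):

```python
def build_tree(code):
    """Rebuild the code tree from {char: codeword}.

    Nodes are dicts keyed by "0"/"1"; a leaf carries a "char" entry.
    """
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["char"] = ch
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]       # 0 -> left, 1 -> right
        if "char" in node:     # landed on a leaf
            out.append(node["char"])
            node = root        # return to the root
    return "".join(out)

# Codewords from the example above: decode "0" + "1100" + "111".
code = {"f": "0", "c": "100", "d": "101", "e": "111", "a": "1100", "b": "1101"}
print(decode("01100111", build_tree(code)))  # fae
```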