Greedy Algorithms

Huffman Encoding

Give shorter Morse code to the letters you use most — save space on every message
Build tree: \(O(n \log n)\) · Space: \(O(n)\) · Optimality: provably optimal prefix-free code

In Morse code, E is a single dot (·) because it's the most common letter in English. Z is dash-dash-dot-dot (−−··) because you almost never use it. Frequent characters get short codes; rare ones get long ones. Every character is still unambiguous.

Huffman encoding is the same idea, made mathematically optimal. It assigns binary codewords to characters so that the total number of bits used to encode a typical message is minimized — and it proves this optimality, not just guesses at it.

Prefix-Free Codes

The key constraint: no codeword can be a prefix of another. If A is encoded as 01, no other character can start with 01. This lets us decode a stream of bits without needing delimiters — when you've hit a complete codeword, you know it. You never misread a short code as the beginning of a long one.

Every prefix-free code corresponds to a full binary tree where each leaf is a character. The codeword is the path from root to leaf (left = 0, right = 1). The encoding cost equals the sum of (character frequency × leaf depth).
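That cost formula is easy to sanity-check directly. A tiny sketch with made-up frequencies and a hand-built prefix-free code (the names and numbers here are illustrative, not from the article's example):

```python
# Cost of a prefix-free code = sum of (character frequency x leaf depth).
# Hypothetical frequencies and a hand-picked prefix-free code; each
# codeword's length equals its leaf's depth in the corresponding tree.
freq = {'x': 10, 'y': 3, 'z': 2}
code = {'x': '0', 'y': '10', 'z': '11'}

cost = sum(freq[ch] * len(code[ch]) for ch in freq)
print(cost)  # 10*1 + 3*2 + 2*2 = 20 bits
```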

The Greedy Algorithm

The key insight: there is always an optimal tree in which the two rarest characters are siblings at the deepest level. Why? If a rarest character sat higher than some more frequent one, swapping the two changes the total cost by \((f_{rare} - f_{freq}) \times (d_{deep} - d_{shallow}) \le 0\) — the rare character's frequency is smaller, so pushing it deeper and lifting the frequent one can only lower (or keep equal) the cost.

This gives a beautiful bottom-up construction:

1. Create a leaf node for every character. Set each node's weight = character frequency. Put all nodes in a min-heap.
2. While more than one node remains: extract the two lightest nodes A and B. Create a new internal node with weight = A.weight + B.weight, with A and B as children. Insert back into the heap.
3. The last remaining node is the root of the Huffman tree.

After building the tree, assign 0/1 as you traverse left/right to generate each character's codeword.
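The merge order is easy to watch directly. A few lines with Python's heapq trace just the weights, using the frequencies from the example below:

```python
import heapq

# Trace the greedy merges on frequencies 5, 9, 12, 13, 16, 45:
# repeatedly pop the two lightest weights and push back their sum.
weights = [5, 9, 12, 13, 16, 45]
heapq.heapify(weights)
while len(weights) > 1:
    a, b = heapq.heappop(weights), heapq.heappop(weights)
    print(f"merge {a} + {b} -> {a + b}")
    heapq.heappush(weights, a + b)
# merges: 5+9, 12+13, 14+16, 25+30, 45+55 -> root weight 100
```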

An Example

Characters: a(5), b(9), c(12), d(13), e(16), f(45). Frequencies sum to 100 characters of input. The Huffman tree gives: f=0 (1 bit), c=100 (3 bits), d=101 (3 bits), e=111 (3 bits), a=1100 (4 bits), b=1101 (4 bits). Total encoded size = 45×1 + 12×3 + 13×3 + 16×3 + 5×4 + 9×4 = 45+36+39+48+20+36 = 224 bits. Storing each character with a fixed 3-bit code would take 300 bits. Huffman saves ~25%.

Decoding

To decode: read bits one at a time and traverse the tree. 0 → go left, 1 → go right. When you land on a leaf, output its character and return to the root. Repeat for the next bit. The prefix-free property guarantees this always produces the unique correct decoding.
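A minimal decoder sketch, assuming the tree from the example above is encoded as nested (left, right) tuples with characters at the leaves:

```python
# The example tree: f hangs off the root's 0-branch; everything else
# sits under the 1-branch, matching f=0, c=100, d=101, a=1100, b=1101, e=111.
tree = ('f', (('c', 'd'), (('a', 'b'), 'e')))

def decode(bits, tree):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == '0' else node[1]   # 0 = left, 1 = right
        if isinstance(node, str):                 # leaf: emit, restart at root
            out.append(node)
            node = tree
    return ''.join(out)

print(decode('0100101', tree))  # f=0, c=100, d=101 -> "fcd"
```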

Huffman tree construction: greedily merge the two lightest nodes until one root remains
Note: There is always an optimal Huffman tree in which the two rarest characters are siblings at the deepest level. This one geometric fact, plus the greedy bottom-up merging, is all you need. The min-heap just makes the repeated "find the two lightest" steps fast.

Huffman Tree — Build & Generate Codes

import heapq

class Node:
    def __init__(self, ch, freq, left=None, right=None):
        self.ch, self.freq = ch, freq
        self.left, self.right = left, right
    def __lt__(self, other):                 # lets heapq order nodes by weight
        return self.freq < other.freq

def build_huffman(freq):
    heap = [Node(ch, f) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                     # merge the two lightest nodes
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, Node(None, a.freq + b.freq, a, b))
    return heap[0]                           # last remaining node is the root

def get_codes(root):
    codes = {}
    def dfs(node, code):
        if node.ch is not None:              # leaf: record its codeword
            codes[node.ch] = code or '0'     # single-character edge case
            return
        dfs(node.left, code + '0')
        dfs(node.right, code + '1')
    dfs(root, '')
    return codes

freq = {'a': 5, 'b': 9, 'c': 12, 'd': 13, 'e': 16, 'f': 45}
root = build_huffman(freq)
codes = get_codes(root)
for ch, code in sorted(codes.items()):
    print(f"  {ch}: {code} ({freq[ch]} occurrences)")

original_bits = sum(freq[c] * 3 for c in freq)             # 3-bit fixed code
huffman_bits = sum(freq[c] * len(codes[c]) for c in freq)
print(f"Fixed 3-bit: {original_bits} bits")
print(f"Huffman:     {huffman_bits} bits")
Output
  a: 1100 (5 occurrences)
  b: 1101 (9 occurrences)
  c: 100 (12 occurrences)
  d: 101 (13 occurrences)
  e: 111 (16 occurrences)
  f: 0 (45 occurrences)
Fixed 3-bit: 300 bits
Huffman:     224 bits

Complexity

🏗️ Build initial heap — one pass to create n leaf nodes, then heapify: \(O(n)\)
🔀 Greedy tree construction — n − 1 merges, each doing 2 heap extractions + 1 insertion at \(O(\log n)\): \(O(n \log n)\)
✍️ Encode a message — look up each character's code, where L = total characters in the message: \(O(L)\)
📖 Decode a message — each bit navigates one tree edge; tree depth ≤ n − 1 in the worst case: \(O(L \cdot depth)\)
📦 Space — n leaves + n − 1 internal nodes = 2n − 1 nodes total, linear in alphabet size: \(O(n)\)

Quick check

Why can the two rarest characters always be made siblings at the deepest level of an optimal Huffman tree?

Continue reading

Greedy Pattern
Always grab the biggest slice of pizza — sometimes that's optimal, sometimes it isn't
Bloom Filter
Probably yes, definitely no — a memory-sipping membership checker
Task Scheduling
Do homework by closest deadline first — miss nothing, earn the most