Huffman Encoding
In Morse code, E is a single dot (·) because it's the most common letter in English. Z is dash-dash-dot-dot (−−··) because you almost never use it. Frequent characters get short codes; rare ones get long ones. Every character is still unambiguous.
Huffman encoding is the same idea, made mathematically optimal. It assigns binary codewords to characters so that the total number of bits needed to encode a message with the given character frequencies is minimized. And this optimality is provable, not just a heuristic that happens to work well.
Prefix-Free Codes
The key constraint: no codeword can be a prefix of another. If A is encoded as 01, no other character can start with 01. This lets us decode a stream of bits without needing delimiters — when you've hit a complete codeword, you know it. You never misread a short code as the beginning of a long one.
Every prefix-free code corresponds to a binary tree where each leaf is a character; in an optimal code the tree is full, meaning every internal node has exactly two children. The codeword is the path from root to leaf (left = 0, right = 1). The encoding cost equals the sum of (character frequency × leaf depth).
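The two properties above are easy to check directly. A minimal sketch, with a made-up code and frequency table chosen purely for illustration:

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another.

    After sorting, any prefix pair becomes an adjacent pair,
    so checking neighbors suffices.
    """
    words = sorted(code.values())
    return all(not words[i + 1].startswith(words[i])
               for i in range(len(words) - 1))

def encoding_cost(code, freq):
    """Sum of frequency x codeword length (codeword length = leaf depth)."""
    return sum(freq[ch] * len(cw) for ch, cw in code.items())

# Hypothetical example code and frequencies:
code = {"a": "0", "b": "10", "c": "11"}
freq = {"a": 10, "b": 3, "c": 2}

print(is_prefix_free(code))       # True
print(encoding_cost(code, freq))  # 10*1 + 3*2 + 2*2 = 20
```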
The Greedy Algorithm
The key insight: there is an optimal tree in which the two rarest characters sit at maximum depth as siblings. If a more frequent character occupied a deepest leaf instead, swapping it with one of the rarest would move more weight to a shallower leaf and less weight to a deeper one, which can only decrease the total cost (or leave it unchanged).
This gives a beautiful bottom-up construction:
1. Create a leaf node for every character. Set each node's weight = character frequency. Put all nodes in a min-heap.
2. While more than one node remains: extract the two lightest nodes A and B. Create a new internal node with weight = A.weight + B.weight, with A and B as children. Insert back into the heap.
3. The last remaining node is the root of the Huffman tree.
After building the tree, assign 0/1 as you traverse left/right to generate each character's codeword.
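The three steps above, plus the codeword assignment, can be sketched using Python's `heapq` module as the min-heap. The tuple-based node representation and the tie-breaking counter are implementation choices, not part of the algorithm:

```python
import heapq
from itertools import count

def huffman_codes(freq):
    """Build a Huffman tree from {char: frequency}, return {char: codeword}."""
    tiebreak = count()  # breaks weight ties so heapq never compares nodes
    # A leaf is (char, None, None); an internal node is (None, left, right).
    heap = [(w, next(tiebreak), (ch, None, None)) for ch, w in freq.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)  # two lightest nodes
        wb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (wa + wb, next(tiebreak), (None, a, b)))
    root = heap[0][2]

    # Traverse the tree: left edge appends "0", right edge appends "1".
    codes = {}
    def walk(node, path):
        ch, left, right = node
        if ch is not None:
            codes[ch] = path or "0"  # single-character alphabet edge case
            return
        walk(left, path + "0")
        walk(right, path + "1")
    walk(root, "")
    return codes
```

Note that the exact codewords depend on how weight ties are broken, so different implementations may emit different (but equally optimal) codes; the total encoded size is the same either way.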
An Example
Characters: a(5), b(9), c(12), d(13), e(16), f(45). The frequencies sum to 100, so the input is 100 characters long. The Huffman tree gives: f=0 (1 bit), c=100 (3 bits), d=101 (3 bits), a=1100 (4 bits), b=1101 (4 bits), e=111 (3 bits). Total encoded size = 45×1 + 12×3 + 13×3 + 5×4 + 9×4 + 16×3 = 45+36+39+20+36+48 = 224 bits. Storing each character with a fixed 3-bit code would take 300 bits. Huffman saves ~25%.
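The arithmetic above checks out mechanically, using the frequencies and codeword lengths from the example:

```python
# Frequencies and codeword lengths from the worked example above.
freq = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
bits = {"a": 4, "b": 4, "c": 3, "d": 3, "e": 3, "f": 1}

huffman_total = sum(freq[ch] * bits[ch] for ch in freq)
fixed_total = sum(freq.values()) * 3  # 3 bits per character, 100 characters

print(huffman_total, fixed_total)                      # 224 300
print(round(1 - huffman_total / fixed_total, 3))       # 0.253 (~25% saved)
```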
Decoding
To decode: read bits one at a time and traverse the tree. 0 → go left, 1 → go right. When you land on a leaf, output its character and return to the root. Repeat for the next bit. The prefix-free property guarantees this always produces the unique correct decoding.
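The decoding loop is a sketch's worth of code. Here the tree is rebuilt from a codeword table (the table below is the one from the example; the nested-dict node representation is an assumption for brevity):

```python
def build_tree(code):
    """Rebuild the code tree from {char: codeword}.

    Nodes are dicts keyed by "0"/"1"; a leaf carries a "char" entry.
    """
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node["char"] = ch
    return root

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]       # 0 -> left, 1 -> right
        if "char" in node:     # landed on a leaf
            out.append(node["char"])
            node = root        # return to the root
    return "".join(out)

# Codewords from the example above: decode "0" + "1100" + "111".
code = {"f": "0", "c": "100", "d": "101", "e": "111", "a": "1100", "b": "1101"}
print(decode("01100111", build_tree(code)))  # fae
```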