Hash Tables

Hash Functions

Turn any key into a bucket number — instantly
  • Hash compute: \(O(k)\), where k = key length
  • Lookup (average): \(O(1)\)
  • Lookup (worst): \(O(n)\)
  • Space: \(O(n)\)

Imagine a locker room with 100 numbered lockers. Instead of searching every locker for your stuff, you have a magic formula that tells you exactly which locker number to use based on your name. That magic formula is a hash function.

A hash function takes any key — a string, a number, an object — and converts it to an integer index into an array. The goal: spread keys evenly across all available buckets so lookups stay close to \(O(1)\).
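In Python, for instance, the built-in hash() plus a modulo gives this key-to-index mapping. A minimal sketch (the capacity of 100 and the name bucket_for are just for illustration):

```python
capacity = 100  # number of "lockers" -- arbitrary for illustration

def bucket_for(key):
    # hash() returns an integer for any hashable key; the modulo folds
    # it into the range 0..capacity-1 so it can index an array.
    return hash(key) % capacity

# The same key always maps to the same bucket within one run.
# (Python randomizes string hashing per process, so the exact
# bucket number varies between runs.)
print(bucket_for('alice') == bucket_for('alice'))  # True
print(0 <= bucket_for('alice') < capacity)         # True
```

Note that in Python, `%` with a positive divisor always yields a non-negative result, so this works even when hash() returns a negative integer.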

What makes a good hash function?

  • Deterministic — same input always produces the same output
  • Uniform distribution — keys spread evenly, no hot buckets
  • Fast to compute — otherwise you lose the speed benefit
  • Avalanche effect — small input change, totally different output

A naive hash for strings might sum the ASCII values and take modulo array length: hash = (sum of char codes) % capacity. Simple, but it clusters anagrams in the same bucket. Better functions like djb2 or FNV-1a mix bits more aggressively.
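To see the anagram problem concretely, here is a sketch comparing the naive sum hash with djb2 (the function names and the capacity of 10 are illustrative choices):

```python
def sum_hash(key, capacity):
    # Naive: character order is ignored, so anagrams always collide.
    return sum(ord(ch) for ch in key) % capacity

def djb2(key, capacity):
    # djb2 (Dan Bernstein): repeated multiply-and-add mixes in
    # character position, so anagrams usually land apart.
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep h to 32 bits
    return h % capacity

print(sum_hash('act', 10), sum_hash('tac', 10))  # same bucket: 2 2
print(djb2('act', 10), djb2('tac', 10))          # different buckets
```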


Polynomial Hash & Collisions

def poly_hash(key, capacity):
    """Polynomial rolling hash: treats the string as digits in base 31."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % capacity
    return h

# Use a prime capacity such as 13. With capacity 10, base 31 would
# degenerate into a plain sum hash (31 ≡ 1 mod 10) and anagrams collide.
cap = 13
for word in ['cat', 'dog', 'act', 'tac']:
    print(f'{word} -> bucket {poly_hash(word, cap)}')

# The anagrams 'cat', 'act', and 'tac' all collide under a naive sum
# hash, but the polynomial hash separates them: position matters.
Output
cat -> bucket 8
dog -> bucket 12
act -> bucket 7
tac -> bucket 0

Collisions — when two keys land in the same bucket

No matter how good your hash function is, with enough keys you will get collisions — two keys mapping to the same bucket. This is unavoidable (the pigeonhole principle: if you have more keys than buckets, at least one bucket holds two keys).
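The pigeonhole argument is easy to demonstrate: drop 11 keys into 10 buckets and at least one bucket must hold two or more, no matter how the hash behaves. A small sketch:

```python
from collections import Counter

capacity = 10
keys = [f'key{i}' for i in range(11)]  # 11 keys, only 10 buckets

# Count how many keys land in each bucket.
counts = Counter(hash(k) % capacity for k in keys)

# By the pigeonhole principle, some bucket must hold >= 2 keys.
print(max(counts.values()) >= 2)  # True
```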

There are two classic ways to handle collisions:

  • Chaining — each bucket holds a linked list; colliding keys are appended to the list
  • Open Addressing — when a bucket is occupied, probe for the next empty slot (linear probing, quadratic probing, double hashing)
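A minimal chaining table might look like the sketch below (the class and method names are illustrative, not a standard API):

```python
class ChainedTable:
    """Hash table that resolves collisions by chaining (a list per bucket)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable(capacity=2)  # tiny capacity forces collisions
t.put('cat', 1)
t.put('act', 2)
t.put('tac', 3)
print(t.get('act'))  # 2
```

With only two buckets for three keys, at least one chain holds multiple entries, yet every key is still found by walking its (short) chain.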

Load Factor

The load factor \(\alpha = n / \text{capacity}\) (where n = number of stored keys) measures how full the table is. When \(\alpha\) gets too high:

  • Chains get longer (chaining)
  • Clusters form (open addressing)
  • Average lookup degrades toward \(O(n)\)

Most implementations resize (rehash into a bigger array) when \(\alpha\) exceeds 0.75 for chaining or 0.5 for open addressing. Rehashing copies every entry — an \(O(n)\) one-time cost — but keeps future operations \(O(1)\).
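A sketch of resize-on-threshold for a chaining table (the 0.75 threshold and the doubling policy are common conventions, not the only valid choices):

```python
class ResizingTable:
    MAX_LOAD = 0.75  # resize once load factor exceeds this

    def __init__(self):
        self.capacity = 4
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # overwrite: size unchanged
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / self.capacity > self.MAX_LOAD:
            self._rehash()

    def _rehash(self):
        # O(n) one-time cost: every entry is re-bucketed because
        # hash(key) % capacity changes when capacity changes.
        old_entries = [pair for bucket in self.buckets for pair in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in old_entries:
            self.buckets[hash(k) % self.capacity].append((k, v))

t = ResizingTable()
for i in range(10):
    t.put(f'key{i}', i)
print(t.capacity)  # 16 -- doubled twice: 4 -> 8 -> 16
```

The resize points depend only on the insert count, not on the hash values: the 4th insert pushes \(\alpha\) to 1.0 (resize to 8), and the 7th pushes it to 0.875 (resize to 16).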

Note: The worst-case \(O(n)\) lookup happens only when every key lands in the same bucket — a pathological hash function or adversarial input. With a good hash function and a reasonable load factor this is vanishingly rare in practice; many runtimes (Python included) randomize their string hashing precisely to defend against crafted adversarial keys.

Complexity

  • Hash compute — \(O(k)\) (key-length): k = number of characters in the key
  • Lookup (average) — \(O(1)\) (instant): hash directly to the bucket
  • Lookup (worst) — \(O(n)\) (linear): all keys in one bucket — bad hash or adversarial input
  • Insert (average) — \(O(1)\) (instant): compute hash, write to bucket
  • Delete (average) — \(O(1)\) (instant): compute hash, remove from bucket
  • Space — \(O(n)\) (linear): one entry per stored key plus array overhead

Word Frequency Counter

words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1
for word, count in sorted(freq.items()):
    print(f'{word}: {count}')
Output
apple: 3
banana: 2
cherry: 1

Quick check

What property guarantees that hash("cat") always returns the same bucket number?

Continue reading

Chaining
Collisions? Just grow a list at that bucket
Sets & Maps
Instant membership and key-value lookup — the everyday hash table
Frequency Patterns
Count once, answer anything — the hash map superpower
Consistent Hashing (System Design)
Add or remove servers without reshuffling everything
Bloom Filter
Ask 'is this here?' in \(O(1)\) — maybe yes, definitely no
Dictionaries & Sets (Python)
Look things up by name instead of by number