Hash Tables

Hash Functions

Turn any key into a bucket number — instantly
  • Hash compute: \(O(k)\), where k = key length
  • Lookup (average): \(O(1)\)
  • Lookup (worst): \(O(n)\)
  • Space: \(O(n)\)

Imagine a locker room with 100 numbered lockers. Instead of searching every locker for your stuff, you have a magic formula that tells you exactly which locker number to use based on your name. That magic formula is a hash function.

A hash function takes any key — a string, a number, an object — and converts it to an integer index into an array. The goal: spread keys evenly across all available buckets so lookups stay close to \(O(1)\).
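In Python, for instance, the built-in hash() plus a modulo gives this key-to-index mapping. A minimal sketch (the capacity of 100 and the name bucket_for are just for illustration):

```python
capacity = 100  # number of "lockers" -- arbitrary for illustration

def bucket_for(key):
    # hash() returns an integer for any hashable key; the modulo folds
    # it into the range 0..capacity-1 so it can index an array.
    return hash(key) % capacity

# The same key always maps to the same bucket within one run.
# (Python randomizes string hashing per process, so the exact
# bucket number varies between runs.)
print(bucket_for('alice') == bucket_for('alice'))  # True
print(0 <= bucket_for('alice') < capacity)         # True
```

Note that in Python, `%` with a positive divisor always yields a non-negative result, so this works even when hash() returns a negative integer.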

What makes a good hash function?

  • Deterministic — same input always produces the same output
  • Uniform distribution — keys spread evenly, no hot buckets
  • Fast to compute — otherwise you lose the speed benefit
  • Avalanche effect — small input change, totally different output

A naive hash for strings might sum the ASCII values and take modulo array length: hash = (sum of char codes) % capacity. Simple, but it clusters anagrams in the same bucket. Better functions like djb2 or FNV-1a mix bits more aggressively.
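To see the anagram problem concretely, here is a sketch comparing the naive sum hash with djb2 (the function names and the capacity of 10 are illustrative choices):

```python
def sum_hash(key, capacity):
    # Naive: character order is ignored, so anagrams always collide.
    return sum(ord(ch) for ch in key) % capacity

def djb2(key, capacity):
    # djb2 (Dan Bernstein): repeated multiply-and-add mixes in
    # character position, so anagrams usually land apart.
    h = 5381
    for ch in key:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep h to 32 bits
    return h % capacity

print(sum_hash('act', 10), sum_hash('tac', 10))  # same bucket: 2 2
print(djb2('act', 10), djb2('tac', 10))          # different buckets
```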


Polynomial Hash & Collisions

def poly_hash(key, capacity):
    """Polynomial rolling hash: treats the string as digits in base 31."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % capacity
    return h

# Use a prime capacity such as 13. With capacity 10, base 31 would
# degenerate into a plain sum hash (31 ≡ 1 mod 10) and anagrams collide.
cap = 13
for word in ['cat', 'dog', 'act', 'tac']:
    print(f'{word} -> bucket {poly_hash(word, cap)}')

# The anagrams 'cat', 'act', and 'tac' all collide under a naive sum
# hash, but the polynomial hash separates them: position matters.
Output
cat -> bucket 8
dog -> bucket 12
act -> bucket 7
tac -> bucket 0

Collisions — when two keys land in the same bucket

No matter how good your hash function is, with enough keys you will get collisions — two keys mapping to the same bucket. This is unavoidable (the pigeonhole principle: if you have more keys than buckets, at least one bucket holds two keys).
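The pigeonhole argument is easy to demonstrate: drop 11 keys into 10 buckets and at least one bucket must hold two or more, no matter how the hash behaves. A small sketch:

```python
from collections import Counter

capacity = 10
keys = [f'key{i}' for i in range(11)]  # 11 keys, only 10 buckets

# Count how many keys land in each bucket.
counts = Counter(hash(k) % capacity for k in keys)

# By the pigeonhole principle, some bucket must hold >= 2 keys.
print(max(counts.values()) >= 2)  # True
```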

There are two classic ways to handle collisions:

  • Chaining — each bucket holds a linked list; colliding keys are appended to the list
  • Open Addressing — when a bucket is occupied, probe for the next empty slot (linear probing, quadratic probing, double hashing)
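A minimal chaining table might look like the sketch below (the class and method names are illustrative, not a standard API):

```python
class ChainedTable:
    """Hash table that resolves collisions by chaining (a list per bucket)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable(capacity=2)  # tiny capacity forces collisions
t.put('cat', 1)
t.put('act', 2)
t.put('tac', 3)
print(t.get('act'))  # 2
```

With only two buckets for three keys, at least one chain holds multiple entries, yet every key is still found by walking its (short) chain.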

Load Factor

The load factor \(\alpha = n / \text{capacity}\) (where n = number of stored keys) measures how full the table is. When \(\alpha\) gets too high:

  • Chains get longer (chaining)
  • Clusters form (open addressing)
  • Average lookup degrades toward \(O(n)\)

Most implementations resize (rehash into a bigger array) when \(\alpha\) exceeds 0.75 for chaining or 0.5 for open addressing. Rehashing copies every entry — an \(O(n)\) one-time cost — but keeps future operations \(O(1)\).
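A sketch of resize-on-threshold for a chaining table (the 0.75 threshold and the doubling policy are common conventions, not the only valid choices):

```python
class ResizingTable:
    MAX_LOAD = 0.75  # resize once load factor exceeds this

    def __init__(self):
        self.capacity = 4
        self.size = 0
        self.buckets = [[] for _ in range(self.capacity)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % self.capacity]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # overwrite: size unchanged
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / self.capacity > self.MAX_LOAD:
            self._rehash()

    def _rehash(self):
        # O(n) one-time cost: every entry is re-bucketed because
        # hash(key) % capacity changes when capacity changes.
        old_entries = [pair for bucket in self.buckets for pair in bucket]
        self.capacity *= 2
        self.buckets = [[] for _ in range(self.capacity)]
        for k, v in old_entries:
            self.buckets[hash(k) % self.capacity].append((k, v))

t = ResizingTable()
for i in range(10):
    t.put(f'key{i}', i)
print(t.capacity)  # 16 -- doubled twice: 4 -> 8 -> 16
```

The resize points depend only on the insert count, not on the hash values: the 4th insert pushes \(\alpha\) to 1.0 (resize to 8), and the 7th pushes it to 0.875 (resize to 16).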

Note: The worst-case \(O(n)\) lookup happens only when every key lands in the same bucket — a pathological hash function or adversarial input. With a good hash function and a reasonable load factor this is vanishingly rare in practice; many runtimes (Python included) randomize their string hashing precisely to defend against crafted adversarial keys.

Complexity

  • Hash compute — \(O(k)\) (key-length): k = number of characters in the key
  • Lookup (average) — \(O(1)\) (instant): hash directly to the bucket
  • Lookup (worst) — \(O(n)\) (linear): all keys in one bucket — bad hash or adversarial input
  • Insert (average) — \(O(1)\) (instant): compute hash, write to bucket
  • Delete (average) — \(O(1)\) (instant): compute hash, remove from bucket
  • Space — \(O(n)\) (linear): one entry per stored key plus array overhead

Word Frequency Counter

words = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
freq = {}
for w in words:
    freq[w] = freq.get(w, 0) + 1
for word, count in sorted(freq.items()):
    print(f'{word}: {count}')
Output
apple: 3
banana: 2
cherry: 1

Quick check

What property guarantees that hash("cat") always returns the same bucket number?

Continue reading

Chaining
Collisions? Just grow a list at that bucket
Sets & Maps
Instant membership and key-value lookup — the everyday hash table
Frequency Patterns
Count once, answer anything — the hash map superpower
Consistent Hashing (System Design)
Add or remove servers without reshuffling everything
Bloom Filter
Ask 'is this here?' in \(O(1)\) — maybe yes, definitely no
Dictionaries & Sets (Python)
Look things up by name instead of by number