Advanced / Specialized5 min read

Bloom Filter

Ask 'is this here?' in \(O(1)\) — maybe yes, definitely no
insert: \(O(k)\)query: \(O(k)\)false positives: Possiblefalse negatives: Neverspace: \(O(m)\) bits

Imagine you're a bouncer at a huge party. You have a tiny checklist — not names, just checkboxes. When someone joins the party, you tick a few boxes based on their name. When someone shows up at the door, you check those same boxes. If any box is unchecked: they definitely were not invited. If all boxes are checked: they were probably invited — but you might make a mistake.

That's a Bloom filter. It's a bit array (all zeros at the start) plus k independent hash functions. It can tell you with certainty that something is not in the set, but it can only say something is probably in the set.

How it works

To insert an element: run it through all k hash functions. Each function gives an index into the bit array. Set those k bits to 1.

To query an element: run it through all k hash functions, get the k positions. If every position is 1 — probably in the set. If any position is 0 — definitely not in the set.

Bloom filter query decision flow — a 0 bit guarantees absence, but all 1s only means 'maybe'
Watch 3 hash functions map a word to 3 positions in a bit array

← → arrow keys to step · click elements to interact

Bloom Filter: Add & Query

class BloomFilter:
def __init__(self, size=20):
self.size = size
self.bits = [0] * size
def _hashes(self, item):
h1 = hash(item) % self.size
h2 = hash(item + '_2') % self.size
h3 = hash(item + '_3') % self.size
return [h1, h2, h3]
def add(self, item):
for h in self._hashes(item):
self.bits[h] = 1
def might_contain(self, item):
return all(self.bits[h] for h in self._hashes(item))
bf = BloomFilter()
bf.add('apple')
bf.add('banana')
print(bf.might_contain('apple')) # True
print(bf.might_contain('banana')) # True
print(bf.might_contain('cherry')) # False (probably)
Output
True
True
False

False positives — why they happen

Different words can set the same bits. After many insertions, most bits are 1. A new word that was never inserted might still have all its k positions set to 1 — because other words happened to set those bits. This is a false positive: the filter says "probably yes" but the answer is actually "no".

The false positive rate depends on the filter size (m bits), the number of hash functions (k), and the number of elements inserted (n). More bits and well-chosen k reduce false positives.

No false negatives — ever

If a word was inserted, its k bits were set to 1 and will remain 1. Those bits can never be reset to 0 (a basic Bloom filter has no delete operation). So a query for an inserted word will always find all k bits set — the filter will never say "no" for something that is actually in the set.

Real-world uses

  • Spell checkers — flag words probably not in the dictionary without storing the full dictionary
  • CDN / web caches — quickly decide if a URL is cached before doing an expensive lookup
  • Databases — avoid disk reads for keys that definitely don't exist (e.g. Cassandra, HBase, LevelDB)
  • Browsers — Chrome's Safe Browsing uses a Bloom filter to check malicious URLs locally
Note: Bloom filters use far less memory than a hash set. Storing 1 million URLs in a hash set might take 50 MB. A Bloom filter with 1% false positive rate for the same data takes under 2 MB.

Complexity

Insert elementRun k hash functions, set k bits — k is a small constant
Fast\(O(k)\)
Query elementRun k hash functions, check k bits
Fast\(O(k)\)
Delete elementResetting bits would invalidate other elements' entries
Not supported
Spacem is the bit array size — typically 8–10 bits per element
Very small\(O(m)\) bits

Bloom Filter: False Positive Rate

import math
def false_positive_rate(n, m, k):
"""n=items inserted, m=bits, k=hash functions"""
return (1 - math.exp(-k * n / m)) ** k
# 1000 items, 10000 bits, 7 hash functions
fpr = false_positive_rate(1000, 10000, 7)
print(f"{fpr:.4f}") # ~0.0082 (0.82% false positive rate)
# Optimal k for given m and n
def optimal_k(m, n):
return round((m / n) * math.log(2))
print(optimal_k(10000, 1000)) # 7
Output
0.0082
7
Challenge

Quick check

A Bloom filter says an element is NOT in the set. How confident can you be?

Continue reading

Hash Functions
Turn any key into a bucket number — instantly
Chaining
Collisions? Just grow a list at that bucket
CachingSystem Design
Remember answers so you don't compute them twice