Randomized Algorithms7 min read

Bloom Filter

Probably yes, definitely no — a memory-sipping membership checker
insert:\(O(k)\) — set k bitsquery:\(O(k)\) — check k bitsfalse positives:Possible (tunable)false negatives:Never

Imagine a nightclub bouncer who checks guests against a list — but the list is written in invisible ink, and occasionally the ink bleeds and marks a name that was never there. The bouncer might turn away someone innocent (false positive), but he will never let in someone on the banned list (no false negatives).

That's a Bloom Filter: a lightning-fast, memory-tiny way to ask "is this thing in my set?" It might say "probably yes" when the answer is actually no — but if it says "definitely not," you can take that to the bank.

How it works

A Bloom filter is just a bit array of m bits (all zeros at the start) paired with k independent hash functions. Each hash function maps any input to a position in that array.

To insert an element, run it through all k hash functions and flip those bit positions to 1. That's it.

To query, run the same k hashes. If every bit at those positions is 1 → "probably in the set." If any bit is 0 → "definitely NOT in the set." A 0 can only exist if that position was never flipped — meaning this element was never inserted.

Why false positives happen — but false negatives can't

Over time, other elements flip bits all over the array. By coincidence, the exact positions your query needs may all be 1 already — set by different elements. That's a false positive. But once a bit is set to 1, it stays 1 forever. So if you inserted something, its bits are guaranteed to stay on — false negatives are structurally impossible.

Tuning the false positive rate

Given n expected insertions and a target false positive rate p, the optimal parameters are:
• Bit array size: m = -n·ln(p) / (ln 2)²
• Number of hash functions: k = (m/n)·ln 2 ≈ 0.693·m/n

For a 1% false positive rate, you need only about 9.6 bits per element. A regular hash set storing 64-bit keys uses 64 bits per element minimum — about 7× more space.

Why no deletions?

If you clear a bit to "delete" an element, you might wipe out evidence of other elements that shared that bit position. You'd create false negatives — exactly what makes Bloom filters useful breaks. Extensions like Counting Bloom Filters replace bits with small counters to allow deletions, at the cost of more memory.

Real-world uses

Google Chrome's Safe Browsing keeps a local Bloom filter of malicious URLs. A false positive just triggers a follow-up server check — a small cost. A false negative (missing a real threat) would be catastrophic, and the filter makes that impossible.

Databases (Cassandra, RocksDB, HBase) use Bloom filters before touching disk: "is this key in this SSTable?" A false positive wastes one disk read. A false negative would return wrong data. The filter eliminates false negatives entirely while cutting most unnecessary disk I/O.

Note: The two rules of Bloom Filters: (1) 'Definitely not' is always trustworthy. (2) 'Probably yes' might be wrong. Never use a Bloom filter when false positives are unacceptable — use one when false positives are merely annoying.

Bloom Filter — core insert & query

import math
import mmh3 # pip install mmh3
class BloomFilter:
def __init__(self, capacity, fp_rate=0.01):
self.m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2)**2))
self.k = math.ceil((self.m / capacity) * math.log(2))
self.bits = bytearray(math.ceil(self.m / 8))
def _hashes(self, item):
return [mmh3.hash(item, seed) % self.m for seed in range(self.k)]
def add(self, item):
for pos in self._hashes(item):
self.bits[pos >> 3] |= (1 << (pos & 7))
def __contains__(self, item):
return all(
(self.bits[pos >> 3] >> (pos & 7)) & 1
for pos in self._hashes(item)
)
bf = BloomFilter(1000)
bf.add("alice")
bf.add("bob")
print("alice" in bf) # True (inserted)
print("charlie" in bf) # False (definitely not inserted)
print(f"Bit array size: {bf.m} bits = {bf.m // 8} bytes")
Output
True
False
Bit array size: 9586 bits = 1198 bytes

Complexity

➕ Insert an elementk is a small constant (typically 7–10); pure CPU work, no memory allocation
Set k bit positions\(O(k)\)
🔍 Query membershipShort-circuits on the first 0 bit found — often faster than \(O(k)\) in practice
Check k bit positions\(O(k)\)
⚠️ False positive rate~9.6 bits/element gives 1% FPR; ~14.4 bits/element gives 0.1% FPR
Tunable via m and k(1−e^(−kn/m))^k
✅ False negative rateSet bits never revert to 0; every inserted element always passes the query
Structurally impossibleP = 0
💾 Space vs hash setm ≈ −n·ln(p) / (ln 2\()^2\); storing actual keys costs 64+ bits each vs ~9.6 bits
~7× less at 1% FPR\(O(m)\) bits

Quick check

A Bloom filter says element X is NOT in the set. What can you conclude?

Continue reading

Skip List
Probabilistic BST alternative — \(O(\log n)\) expected
Las Vegas vs Monte Carlo
Correct-always vs probably-correct — two flavors of randomized algorithms