Randomized Algorithms7 min read

Bloom Filter

Probably yes, definitely no β€” a memory-sipping membership checker
insert:\(O(k)\) β€” set k bitsquery:\(O(k)\) β€” check k bitsfalse positives:Possible (tunable)false negatives:Never

Imagine a nightclub bouncer who checks guests against a list β€” but the list is written in invisible ink, and occasionally the ink bleeds and marks a name that was never there. The bouncer might turn away someone innocent (false positive), but he will never let in someone on the banned list (no false negatives).

That's a Bloom Filter: a lightning-fast, memory-tiny way to ask "is this thing in my set?" It might say "probably yes" when the answer is actually no β€” but if it says "definitely not," you can take that to the bank.

How it works

A Bloom filter is just a bit array of m bits (all zeros at the start) paired with k independent hash functions. Each hash function maps any input to a position in that array.

To insert an element, run it through all k hash functions and flip those bit positions to 1. That's it.

To query, run the same k hashes. If every bit at those positions is 1 β†’ "probably in the set." If any bit is 0 β†’ "definitely NOT in the set." A 0 can only exist if that position was never flipped β€” meaning this element was never inserted.

Why false positives happen β€” but false negatives can't

Over time, other elements flip bits all over the array. By coincidence, the exact positions your query needs may all be 1 already β€” set by different elements. That's a false positive. But once a bit is set to 1, it stays 1 forever. So if you inserted something, its bits are guaranteed to stay on β€” false negatives are structurally impossible.

Tuning the false positive rate

Given n expected insertions and a target false positive rate p, the optimal parameters are:
β€’ Bit array size: m = -nΒ·ln(p) / (ln 2)Β²
β€’ Number of hash functions: k = (m/n)Β·ln 2 β‰ˆ 0.693Β·m/n

For a 1% false positive rate, you need only about 9.6 bits per element. A regular hash set storing 64-bit keys uses 64 bits per element minimum β€” about 7Γ— more space.

Why no deletions?

If you clear a bit to "delete" an element, you might wipe out evidence of other elements that shared that bit position. You'd create false negatives β€” exactly what makes Bloom filters useful breaks. Extensions like Counting Bloom Filters replace bits with small counters to allow deletions, at the cost of more memory.

Real-world uses

Google Chrome's Safe Browsing keeps a local Bloom filter of malicious URLs. A false positive just triggers a follow-up server check β€” a small cost. A false negative (missing a real threat) would be catastrophic, and the filter makes that impossible.

Databases (Cassandra, RocksDB, HBase) use Bloom filters before touching disk: "is this key in this SSTable?" A false positive wastes one disk read. A false negative would return wrong data. The filter eliminates false negatives entirely while cutting most unnecessary disk I/O.

Note: The two rules of Bloom Filters: (1) 'Definitely not' is always trustworthy. (2) 'Probably yes' might be wrong. Never use a Bloom filter when false positives are unacceptable β€” use one when false positives are merely annoying.

Bloom Filter β€” core insert & query

import math
import mmh3 # pip install mmh3
class BloomFilter:
def __init__(self, capacity, fp_rate=0.01):
self.m = math.ceil(-capacity * math.log(fp_rate) / (math.log(2)**2))
self.k = math.ceil((self.m / capacity) * math.log(2))
self.bits = bytearray(math.ceil(self.m / 8))
def _hashes(self, item):
return [mmh3.hash(item, seed) % self.m for seed in range(self.k)]
def add(self, item):
for pos in self._hashes(item):
self.bits[pos >> 3] |= (1 << (pos & 7))
def __contains__(self, item):
return all(
(self.bits[pos >> 3] >> (pos & 7)) & 1
for pos in self._hashes(item)
)
bf = BloomFilter(1000)
bf.add("alice")
bf.add("bob")
print("alice" in bf) # True (inserted)
print("charlie" in bf) # False (definitely not inserted)
print(f"Bit array size: {bf.m} bits = {bf.m // 8} bytes")
Output
True
False
Bit array size: 9586 bits = 1198 bytes

Complexity

βž• Insert an element
k is a small constant (typically 7–10); pure CPU work, no memory allocation
Set k bit positions \(O(k)\)
πŸ” Query membership
Short-circuits on the first 0 bit found β€” often faster than \(O(k)\) in practice
Check k bit positions \(O(k)\)
⚠️ False positive rate
~9.6 bits/element gives 1% FPR; ~14.4 bits/element gives 0.1% FPR
Tunable via m and k (1βˆ’e^(βˆ’kn/m))^k
βœ… False negative rate
Set bits never revert to 0; every inserted element always passes the query
Structurally impossible P = 0
πŸ’Ύ Space vs hash set
m β‰ˆ βˆ’nΒ·ln(p) / (ln 2\()^2\); storing actual keys costs 64+ bits each vs ~9.6 bits
~7Γ— less at 1% FPR \(O(m)\) bits

Quick check

A Bloom filter says element X is NOT in the set. What can you conclude?

Continue reading