Bloom Filter
Imagine you're a bouncer at a huge party. You have a tiny checklist — not names, just checkboxes. When someone joins the party, you tick a few boxes based on their name. When someone shows up at the door, you check those same boxes. If any box is unchecked: they definitely were not invited. If all boxes are checked: they were probably invited — but you might make a mistake.
That's a Bloom filter. It's a bit array (all zeros at the start) plus k independent hash functions. It can tell you with certainty that something is not in the set, but it can only say something is probably in the set.
How it works
To insert an element: run it through all k hash functions. Each function gives an index into the bit array. Set those k bits to 1.
To query an element: run it through all k hash functions, get the k positions. If every position is 1 — probably in the set. If any position is 0 — definitely not in the set.
← → arrow keys to step · click elements to interact
Bloom Filter: Add & Query
False positives — why they happen
Different words can set the same bits. After many insertions, most bits are 1. A new word that was never inserted might still have all its k positions set to 1 — because other words happened to set those bits. This is a false positive: the filter says "probably yes" but the answer is actually "no".
The false positive rate depends on the filter size (m bits), the number of hash functions (k), and the number of elements inserted (n). More bits and well-chosen k reduce false positives.
No false negatives — ever
If a word was inserted, its k bits were set to 1 and will remain 1. Those bits can never be reset to 0 (a basic Bloom filter has no delete operation). So a query for an inserted word will always find all k bits set — the filter will never say "no" for something that is actually in the set.
Real-world uses
- Spell checkers — flag words probably not in the dictionary without storing the full dictionary
- CDN / web caches — quickly decide if a URL is cached before doing an expensive lookup
- Databases — avoid disk reads for keys that definitely don't exist (e.g. Cassandra, HBase, LevelDB)
- Browsers — Chrome's Safe Browsing uses a Bloom filter to check malicious URLs locally