Reservoir Sampling
Imagine you're hosting a raffle. Names are being read out one at a time from a very long list — so long you don't know how many names there are. You want to pick exactly k winners such that every name had an equal chance, regardless of when in the list it appeared.
The catch: you can only look at each name once, and you can only remember k names at a time. How do you ensure fairness?
That's the reservoir sampling problem, solved by a randomized algorithm that shows up everywhere in big data — sampling rows from a huge database, picking representative events from a live log stream, A/B testing users you've never seen before.
Algorithm R: The Elegant Solution
Here's the full algorithm, surprisingly simple:
- Fill your reservoir with the first k items. Done — those k items are your current sample.
- For each subsequent item i (where i > k): generate a random integer j between 1 and i (inclusive). If j ≤ k, replace reservoir[j] with item i. Otherwise, discard item i.
- At the end of the stream, the reservoir contains your sample.
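The three steps above can be written down almost verbatim. Here is a minimal Python sketch (the name `reservoir_sample` and the use of an arbitrary iterable for the stream are my choices, not part of the original description):

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # step 1: fill the reservoir with the first k items
        else:
            j = random.randint(1, i)     # step 2: random integer j in [1, i], inclusive
            if j <= k:
                reservoir[j - 1] = item  # replace slot j; otherwise item is discarded
    return reservoir                     # step 3: the reservoir is the sample
```

Note that if the stream turns out to be shorter than k, the function simply returns every item it saw — the only sensible answer in that case.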
Why does this work? At any point after seeing i items (for any i ≥ k), each of those i items is in the reservoir with probability exactly k/i. Let's check:
- Base case (i = k): all k items are in the reservoir. Probability = k/k = 1. ✓
- Inductive step: item i+1 enters with probability k/(i+1). An existing reservoir item is evicted only if item i+1 is selected and lands in that item's slot, so it survives with probability 1 − (1/k)·(k/(i+1)) = i/(i+1). Each of the prior i items is therefore in the reservoir with probability (k/i)·(i/(i+1)) = k/(i+1). ✓
By induction, every item ends the stream with exactly k/n probability — perfectly uniform, no matter what order names appeared or how long the stream was.
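The k/n claim is easy to test empirically. The simulation below (Algorithm R inlined; the parameters n = 10, k = 3 and the trial count are arbitrary choices for illustration) checks that every item's observed frequency lands near k/n = 0.3:

```python
import random
from collections import Counter

def sample_once(n, k):
    # Algorithm R on the stream 0, 1, ..., n-1
    reservoir = list(range(k))
    for i in range(k + 1, n + 1):        # i is the 1-indexed position; item value is i-1
        j = random.randint(1, i)
        if j <= k:
            reservoir[j - 1] = i - 1
    return reservoir

random.seed(0)
n, k, trials = 10, 3, 20000
counts = Counter()
for _ in range(trials):
    counts.update(sample_once(n, k))

# Every item should appear with frequency close to k/n = 0.3
for item in range(n):
    freq = counts[item] / trials
    assert abs(freq - k / n) < 0.02
```

With 20,000 trials the sampling noise on each frequency is well under 0.02, so all ten items should pass the check — early items and late items alike.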
Algorithm L: Skip Ahead Faster
Algorithm L improves on Algorithm R by computing how many items to skip before the next swap, rather than rolling a die for every single item. The skip count follows a geometric distribution, so you draw only \(O(k \log(n/k))\) random numbers instead of \(O(n)\) — dramatically faster when k is much smaller than n.
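A sketch of Algorithm L, following Li's 1994 formulation (the function name `reservoir_sample_L` and the handling of short streams are my additions; `w` is the running weight that drives the geometric skip lengths):

```python
import math
import random

def reservoir_sample_L(stream, k):
    """Algorithm L: same uniform guarantee as Algorithm R, but skips ahead."""
    it = iter(stream)
    reservoir = []
    for _ in range(k):                   # fill the reservoir with the first k items
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir             # stream shorter than k: return everything seen
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Geometric skip: number of items to pass over before the next swap
        skip = math.floor(math.log(random.random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(it)                 # discard skipped items without any randomness
            item = next(it)              # this item gets swapped into the reservoir
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
```

The payoff is that random numbers are drawn only once per swap (plus one per skip computation), and the expected number of swaps is \(O(k \log(n/k))\) — the skipped items cost nothing but iterator advances.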
Where Is This Used?
- Database engines: approximate query processing — sample rows for quick statistical estimates
- Stream analytics: maintain a representative sample of an event log as it grows
- A/B testing: uniformly assign users to experiment groups as they arrive
- Network monitoring: sample packets from a live network stream for traffic analysis
- Machine learning: online mini-batch sampling from a data stream
Any time you have more data than you can store or process fully, reservoir sampling gives you a statistically sound representative slice — not an approximation of uniform sampling, but exactly uniform sampling.