Reservoir Sampling
Imagine you're hosting a raffle. Names are being read out one at a time from a very long list — so long you don't know how many names there are. You want to pick exactly k winners such that every name had an equal chance, regardless of when in the list it appeared.
The catch: you can only look at each name once, and you can only remember k names at a time. How do you ensure fairness?
That's the reservoir sampling problem, solved by a randomized algorithm that shows up everywhere in big data — sampling rows from a huge database, picking representative events from a live log stream, A/B testing users you've never seen before.
Algorithm R: The Elegant Solution
Here's the full algorithm, surprisingly simple:
- Fill your reservoir with the first k items. Done — those k items are your current sample.
- For each subsequent item i (where i > k): generate a random integer j between 1 and i (inclusive). If j ≤ k, replace reservoir[j] with item i. Otherwise, discard item i.
- At the end of the stream, the reservoir contains your sample.
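The three steps above can be written down almost verbatim. Here is a minimal Python sketch (the name `reservoir_sample` and the use of an arbitrary iterable for the stream are my choices, not part of the original description):

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # step 1: fill the reservoir with the first k items
        else:
            j = random.randint(1, i)     # step 2: random integer j in [1, i], inclusive
            if j <= k:
                reservoir[j - 1] = item  # replace slot j; otherwise item is discarded
    return reservoir                     # step 3: the reservoir is the sample
```

Note that if the stream turns out to be shorter than k, the function simply returns every item it saw — the only sensible answer in that case.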
Why does this work? At any point after seeing i items (for any i ≥ k), each of those i items is in the reservoir with probability exactly k/i. Let's check:
- Base case (i = k): all k items are in the reservoir. Probability = k/k = 1. ✓
- Inductive step: item i+1 enters with probability k/(i+1). An existing reservoir item is evicted only if item i+1 is selected and lands in that item's slot, so it survives with probability 1 − (1/k)·(k/(i+1)) = i/(i+1). Each of the prior i items is therefore in the reservoir with probability (k/i)·(i/(i+1)) = k/(i+1). ✓
By induction, every item ends the stream with exactly k/n probability — perfectly uniform, no matter what order names appeared or how long the stream was.
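The k/n claim is easy to test empirically. The simulation below (Algorithm R inlined; the parameters n = 10, k = 3 and the trial count are arbitrary choices for illustration) checks that every item's observed frequency lands near k/n = 0.3:

```python
import random
from collections import Counter

def sample_once(n, k):
    # Algorithm R on the stream 0, 1, ..., n-1
    reservoir = list(range(k))
    for i in range(k + 1, n + 1):        # i is the 1-indexed position; item value is i-1
        j = random.randint(1, i)
        if j <= k:
            reservoir[j - 1] = i - 1
    return reservoir

random.seed(0)
n, k, trials = 10, 3, 20000
counts = Counter()
for _ in range(trials):
    counts.update(sample_once(n, k))

# Every item should appear with frequency close to k/n = 0.3
for item in range(n):
    freq = counts[item] / trials
    assert abs(freq - k / n) < 0.02
```

With 20,000 trials the sampling noise on each frequency is well under 0.02, so all ten items should pass the check — early items and late items alike.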
Algorithm L: Skip Ahead Faster
Algorithm L improves on Algorithm R by computing how many items to skip before the next swap, rather than rolling a die for every single item. The skip count follows a geometric distribution, so you draw only \(O(k \log(n/k))\) random numbers instead of \(O(n)\) — dramatically faster when k is much smaller than n.
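A sketch of Algorithm L, following Li's 1994 formulation (the function name `reservoir_sample_L` and the handling of short streams are my additions; `w` is the running weight that drives the geometric skip lengths):

```python
import math
import random

def reservoir_sample_L(stream, k):
    """Algorithm L: same uniform guarantee as Algorithm R, but skips ahead."""
    it = iter(stream)
    reservoir = []
    for _ in range(k):                   # fill the reservoir with the first k items
        try:
            reservoir.append(next(it))
        except StopIteration:
            return reservoir             # stream shorter than k: return everything seen
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Geometric skip: number of items to pass over before the next swap
        skip = math.floor(math.log(random.random()) / math.log(1.0 - w))
        try:
            for _ in range(skip):
                next(it)                 # discard skipped items without any randomness
            item = next(it)              # this item gets swapped into the reservoir
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
```

The payoff is that random numbers are drawn only once per swap (plus one per skip computation), and the expected number of swaps is \(O(k \log(n/k))\) — the skipped items cost nothing but iterator advances.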
Where Is This Used?
- Database engines: approximate query processing — sample rows for quick statistical estimates
- Stream analytics: maintain a representative sample of an event log as it grows
- A/B testing: uniformly assign users to experiment groups as they arrive
- Network monitoring: sample packets from a live network stream for traffic analysis
- Machine learning: online mini-batch sampling from a data stream
Any time you have more data than you can store or process fully, reservoir sampling gives you a statistically sound representative slice — not an approximation of uniform sampling, but exactly uniform sampling.