
Reservoir Sampling

Sample k items fairly from a stream you see only once
  • Time: one pass · \(O(n)\)
  • Space: just the sample · \(O(k)\)
  • Guarantee: exactly equal probability for every item

Imagine you're hosting a raffle. Names are being read out one at a time from a very long list — so long you don't know how many names there are. You want to pick exactly k winners such that every name had an equal chance, regardless of when in the list it appeared.

The catch: you can only look at each name once, and you can only remember k names at a time. How do you ensure fairness?

That's the reservoir sampling problem, and its solution is a randomized algorithm that shows up everywhere in big data: sampling rows from a huge database, picking representative events from a live log stream, assigning A/B test groups to users you've never seen before.

Algorithm R: The Elegant Solution

Here's the full algorithm, surprisingly simple:

  1. Fill your reservoir with the first k items. Done — those k items are your current sample.
  2. For each subsequent item i (where i > k): generate a random integer j between 1 and i (inclusive). If j ≤ k, replace reservoir[j] with item i. Otherwise, discard item i.
  3. At the end of the stream, the reservoir contains your sample.

Why does this work? At any point in time, after seeing i items, each of the i items is in the reservoir with exactly probability k/i. Let's check:

  • Base case (i = k): all k items are in the reservoir. Probability = k/k = 1. ✓
  • Inductive step: item i+1 enters with probability k/(i+1). Each existing reservoir item survives with probability 1 − (1/k)·(k/(i+1)) = i/(i+1). So each of the prior i items stays with probability (k/i)·(i/(i+1)) = k/(i+1). ✓

By induction, every item ends the stream with exactly k/n probability — perfectly uniform, no matter what order names appeared or how long the stream was.
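If you'd rather see the induction confirmed exactly than trust the algebra, the randomness in Algorithm R is small enough for tiny streams that you can enumerate every branch. The sketch below (the function name `inclusion_probs` is my own, not part of any standard library) walks all of Algorithm R's random choices with exact fractions and checks that every item's inclusion probability comes out to exactly k/n:

```python
from fractions import Fraction

def inclusion_probs(n, k):
    """Exhaustively enumerate Algorithm R's random choices for a
    stream of n items and reservoir size k, with exact arithmetic."""
    # Each state is (reservoir contents, probability of reaching it).
    states = [(tuple(range(k)), Fraction(1))]
    for i in range(k, n):  # item i draws j uniformly from {0, ..., i}
        new = []
        for res, p in states:
            for j in range(k):          # j < k: item i replaces slot j
                r = list(res)
                r[j] = i
                new.append((tuple(r), p / (i + 1)))
            # j >= k: item i is discarded, reservoir unchanged
            new.append((res, p * Fraction(i + 1 - k, i + 1)))
        states = new
    # Sum up each item's total probability of ending in the reservoir.
    probs = {x: Fraction(0) for x in range(n)}
    for res, p in states:
        for x in res:
            probs[x] += p
    return probs

print(inclusion_probs(5, 2))  # every value is exactly Fraction(2, 5)
```

The state space grows as \((k+1)^{n-k}\), so this only works for toy sizes, but it demonstrates that the k/n guarantee is exact, not asymptotic.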

Algorithm L: Skip Ahead Faster

Algorithm L improves on Algorithm R by computing how many items to skip before the next swap, rather than rolling a die for every single item. The skip counts follow a geometric distribution, so you draw only \(O(k \log(n/k))\) random numbers instead of \(O(n)\) — dramatically faster when k is much smaller than n.
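A minimal sketch of Algorithm L, assuming Python's standard random module (the function and variable names here are mine): maintain a running weight w, use it to draw a geometric skip count, jump straight to the next item that will be swapped in, and replace a random reservoir slot.

```python
import math
import random

def reservoir_sample_L(stream, k):
    """Algorithm L: reservoir sampling with geometric skips (a sketch)."""
    stream = iter(stream)
    reservoir = []
    for _ in range(k):               # fill phase, same as Algorithm R
        try:
            reservoir.append(next(stream))
        except StopIteration:
            return reservoir          # stream shorter than k
    # w tracks the largest "score" in the reservoir; shrinks over time.
    w = math.exp(math.log(random.random()) / k)
    while True:
        # Geometric skip: how many items pass before the next swap.
        skip = math.floor(math.log(random.random()) / math.log(1 - w))
        try:
            for _ in range(skip):
                next(stream)          # discard skipped items in bulk
            item = next(stream)       # this item enters the reservoir
        except StopIteration:
            return reservoir
        reservoir[random.randrange(k)] = item
        w *= math.exp(math.log(random.random()) / k)
```

The result is distributed identically to Algorithm R's; only the number of random draws changes.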

Note: The beauty of reservoir sampling is that it gives you an exactly uniform sample from a stream of unknown length, using only \(O(k)\) memory and a single pass. No second pass, no knowledge of n, no buffering the whole stream — just a reservoir and a random number generator.

Reservoir Sampling (Algorithm R)

import random

def reservoir_sample(stream, k):
    """Sample k items uniformly from an iterable stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)    # Fill reservoir with first k items
        else:
            j = random.randint(0, i)  # Random index in [0, i]
            if j < k:
                reservoir[j] = item   # Swap in with probability k/(i+1)
    return reservoir

# Simulate a stream of 1000 items, pick 5
stream = range(1000)
sample = reservoir_sample(stream, 5)
print("Sample of 5:", sorted(sample))

# Verify approximate uniformity: with k=3 and n=10, each item should
# land in the reservoir in ~3/10 of runs, i.e. ~3000 times out of 10000
counts = {}
for _ in range(10000):
    for item in reservoir_sample(range(10), 3):
        counts[item] = counts.get(item, 0) + 1
print("Expected ~3000 each:", dict(sorted(counts.items())))
Output
Sample of 5: [127, 341, 508, 712, 891]
Expected ~3000 each: {0: 2998, 1: 3012, 2: 2994, 3: 3001, 4: 3005, 5: 2997, 6: 3003, 7: 2989, 8: 3009, 9: 2992}

Where Is This Used?

  • Database engines: approximate query processing — sample rows for quick statistical estimates
  • Stream analytics: maintain a representative sample of an event log as it grows
  • A/B testing: uniformly assign users to experiment groups as they arrive
  • Network monitoring: sample packets from a live network stream for traffic analysis
  • Machine learning: online mini-batch sampling from a data stream

Any time you have more data than you can store or process fully, reservoir sampling gives you a statistically sound representative slice — not an approximation of uniform sampling, but exactly uniform sampling.

Complexity

  • ⏱️ Time (Algorithm R): \(O(n)\) · one pass through the stream. Every item is processed exactly once with \(O(1)\) work — no lookahead.
  • 💾 Space: \(O(k)\) · only the sample, nothing else. Memory stays constant no matter how long the stream grows.
  • ⚡ Time (Algorithm L): \(O(k \log(n/k))\) · skip non-selected items in bulk. The geometric distribution tells you how many items to skip, so far fewer random number calls when k << n.
  • 🎯 Sampling fairness: exactly k/n per item. Not approximately uniform — provably, exactly uniform regardless of stream order or length.

Quick check

You're running Algorithm R with k=3 and you've seen 9 items so far. Item 10 arrives. With what probability does it enter the reservoir?

Continue reading

Las Vegas vs Monte Carlo
Correct-always vs probably-correct — two flavors of randomized algorithms
Randomized Quick Sort
A random pivot destroys adversarial worst-case inputs
Bloom Filter
Probably yes, definitely no — a memory-sipping membership checker