Randomized Algorithms6 min read

Reservoir Sampling

Sample k items fairly from a stream you see only once
time:One pass Β· \(O(n)\)space:Just the sample Β· \(O(k)\)guarantee:Exactly equal probability for every item

Imagine you're hosting a raffle. Names are being read out one at a time from a very long list β€” so long you don't know how many names there are. You want to pick exactly k winners such that every name had an equal chance, regardless of when in the list it appeared.

The catch: you can only look at each name once, and you can only remember k names at a time. How do you ensure fairness?

That's the reservoir sampling problem β€” a randomized algorithm that shows up everywhere in big data β€” sampling rows from a huge database, picking representative events from a live log stream, A/B testing users you've never seen before.

Algorithm R: The Elegant Solution

Here's the full algorithm, surprisingly simple:

  1. Fill your reservoir with the first k items. Done β€” those k items are your current sample.
  2. For each subsequent item i (where i > k): generate a random integer j between 1 and i (inclusive). If j ≀ k, replace reservoir[j] with item i. Otherwise, discard item i.
  3. At the end of the stream, the reservoir contains your sample.

Why does this work? At any point in time, after seeing i items, each of the i items is in the reservoir with exactly probability k/i. Let's check:

  • Base case (i = k): all k items are in the reservoir. Probability = k/k = 1. βœ“
  • Inductive step: item i+1 enters with probability k/(i+1). Each existing reservoir item survives with probability 1 βˆ’ (1/k)Β·(k/(i+1)) = i/(i+1). So each of the prior i items stays with probability (k/i)Β·(i/(i+1)) = k/(i+1). βœ“

By induction, every item ends the stream with exactly k/n probability β€” perfectly uniform, no matter what order names appeared or how long the stream was.

Algorithm L: Skip Ahead Faster

Algorithm L improves on Algorithm R by computing how many items to skip before the next swap, rather than rolling a die for every single item. The skip count follows a geometric distribution, so you process only \(O(k Β· \log(n/k)\)) random numbers instead of \(O(n)\) β€” dramatically faster when k is much smaller than n.

Note: The beauty of reservoir sampling: it gives you an exactly uniform sample from a stream of unknown length, using only \(O(k)\) memory and a single pass. No two-pass needed. No knowledge of n needed. No buffering the whole stream. Just keep a reservoir and a random number generator.

Reservoir Sampling (Algorithm R)

import random
def reservoir_sample(stream, k):
"""Sample k items uniformly from an iterable stream."""
reservoir = []
for i, item in enumerate(stream):
if i < k:
reservoir.append(item) # Fill reservoir with first k items
else:
j = random.randint(0, i)
if j < k:
reservoir[j] = item # Swap with probability k/(i+1)
return reservoir
# Simulate a stream of 1000 items, pick 5
stream = range(1000)
sample = reservoir_sample(stream, 5)
print("Sample of 5:", sorted(sample))
# Verify approximate uniformity (each item should appear ~5/1000 = 0.5% of runs)
counts = {}
for _ in range(10000):
for item in reservoir_sample(range(10), 3):
counts[item] = counts.get(item, 0) + 1
print("Expected ~3000 each:", {k: v for k, v in sorted(counts.items())})
Output
Sample of 5: [127, 341, 508, 712, 891]
Expected ~3000 each: {0: 2998, 1: 3012, 2: 2994, 3: 3001, 4: 3005, 5: 2997, 6: 3003, 7: 2989, 8: 3009, 9: 2992}

Where Is This Used?

  • Database engines: approximate query processing β€” sample rows for quick statistical estimates
  • Stream analytics: maintain a representative sample of an event log as it grows
  • A/B testing: uniformly assign users to experiment groups as they arrive
  • Network monitoring: sample packets from a live network stream for traffic analysis
  • Machine learning: online mini-batch sampling from a data stream

Any time you have more data than you can store or process fully, reservoir sampling gives you a statistically sound representative slice β€” not an approximation of uniform sampling, but exactly uniform sampling.

Complexity

⏱️ Time (Algorithm R)
Processes every item exactly once with \(O(1)\) work per item β€” single pass, no lookahead
One pass through the stream \(O(n)\)
πŸ’Ύ Space
Memory stays constant regardless of how long the stream grows β€” k items only
Only the sample, nothing else \(O(k)\)
⚑ Time (Algorithm L)
Geometric distribution tells you how many items to skip β€” far fewer random number calls when k << n
Skip non-selected items in bulk \(O(k \log(n/k)\))
🎯 Sampling fairness
Not approximately uniform β€” provably exactly uniform regardless of stream order or length
Exactly equal for all items k/n per item

Quick check

You're running Algorithm R with k=3 and you've seen 9 items so far. Item 10 arrives. With what probability does it enter the reservoir?

Continue reading