
Design a Rate Limiter

Stop the flood before it drowns your servers

Why Rate Limit?

Imagine a water faucet. Turn it on a little — great, you get water. Turn it on full blast with 50 hoses connected — the pipes burst. A rate limiter is the valve that controls the flow.

Without rate limiting, your API is vulnerable to:

  • Denial of Service (DoS) attacks: A malicious actor floods your servers with millions of requests, making the service unavailable for everyone.
  • Misbehaving clients: A buggy script accidentally sends 10,000 requests per second instead of 10.
  • Resource starvation: One heavy user consumes all your server capacity, leaving nothing for others.
  • Cost explosion: In cloud environments, every request costs money. Unchecked traffic can blow your budget overnight.

Rate limiting ensures fair usage and protects your system from being overwhelmed. Every major API (Google, Twitter, GitHub, Stripe) uses rate limiting.

The problem: server overwhelmed by requests
Core rate limiting decision flow — every algorithm follows this pattern, differing only in how the counter is tracked

Where to Put the Rate Limiter

You have a few options:

  • Client-side: The client limits its own requests. Unreliable — a malicious client can simply skip it.
  • Server-side: Your API server checks limits before processing requests. Common and straightforward.
  • Middleware / API Gateway: A separate layer (like Nginx, Kong, or AWS API Gateway) handles rate limiting before requests even reach your servers. This is the most common production approach — it protects your servers from the traffic entirely.

Most systems use the API Gateway approach. It's centralized, language-agnostic, and doesn't require changes to your application code.
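As a sketch of the server-side option, the check typically lives in a small middleware layer that runs before the request handler. The function and response shapes below are illustrative, not from any specific framework; any limiter object exposing an allow_request() method (like the ones implemented later in this article) would plug in:

```python
# Hypothetical middleware sketch (not tied to any specific framework): reject
# over-limit requests before the handler ever runs. `limiter` is any object
# exposing an allow_request() -> bool method, e.g. a token bucket.
def rate_limit_middleware(handler, limiter):
    def wrapped(request):
        if not limiter.allow_request():
            # Over the limit: short-circuit with HTTP 429
            return {"status": 429, "body": "Too Many Requests"}
        return handler(request)  # under the limit: process normally
    return wrapped

class EveryOtherLimiter:
    """Toy limiter that allows every other request, for demonstration."""
    def __init__(self):
        self.count = 0
    def allow_request(self):
        self.count += 1
        return self.count % 2 == 1

app = rate_limit_middleware(lambda request: {"status": 200, "body": "ok"},
                            EveryOtherLimiter())
print([app({})["status"] for _ in range(4)])  # [200, 429, 200, 429]
```

The handler never sees rejected traffic, which is the whole point: the same wrapping pattern, moved out of the process and into a gateway, is the production approach described above.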

Algorithm 1: Token Bucket

The most popular rate limiting algorithm. Think of it like an arcade token machine:

  • You have a bucket that holds tokens (say, max 10 tokens).
  • Tokens are added at a fixed rate (say, 1 token per second).
  • Each request costs 1 token. If there's a token in the bucket, the request goes through.
  • If the bucket is empty, the request is rejected (HTTP 429).
  • If the bucket is full, new tokens overflow (are discarded).

Why it's great: It allows short bursts of traffic (up to the bucket size) while enforcing a long-term average rate. A user with a bucket size of 10 and refill rate of 1/sec can send 10 requests instantly (burst) then 1 per second after that.

Used by: Amazon, Stripe, and most cloud APIs.

Token bucket: tokens refill, requests consume

Token Bucket Implementation

import time

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens the bucket can hold
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()  # monotonic clock: immune to wall-clock jumps

    def _refill(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        new_tokens = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + new_tokens)
        self.last_refill = now

    def allow_request(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Sliding window counter (Algorithm 4 below) — included here for comparison
class SlidingWindowCounter:
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = window_seconds
        self.prev_count = 0
        self.curr_count = 0
        self.window_start = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.window_start
        # Slide window if needed
        if elapsed >= self.window:
            # If more than one full window has passed, the previous window is empty
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.window_start = now
            elapsed = 0
        # Weighted count: blend previous and current window
        weight = 1 - (elapsed / self.window)
        estimated = self.prev_count * weight + self.curr_count
        if estimated < self.max_requests:
            self.curr_count += 1
            return True
        return False

# Demo: Token Bucket
bucket = TokenBucket(capacity=5, refill_rate=1)  # 5 burst, 1/sec steady
results = []
for i in range(8):
    results.append("OK" if bucket.allow_request() else "DENIED")
print(f"Rapid-fire 8 requests: {results}")
# First 5 succeed (bucket was full), next 3 denied

time.sleep(2)  # Wait 2 seconds → 2 tokens refilled
results = [bucket.allow_request() for _ in range(3)]
print(f"After 2s wait, 3 requests: {['OK' if r else 'DENIED' for r in results]}")
Output
Rapid-fire 8 requests: ['OK', 'OK', 'OK', 'OK', 'OK', 'DENIED', 'DENIED', 'DENIED']
After 2s wait, 3 requests: ['OK', 'OK', 'DENIED']

Algorithm 2: Leaking Bucket

Similar to token bucket but processes requests at a fixed, constant rate. Think of a bucket with a hole in the bottom — water drips out at a steady rate, no matter how fast you pour water in.

  • Requests enter a queue (the bucket).
  • Requests are processed at a fixed rate (water leaking out).
  • If the queue is full, new requests are dropped.

Difference from Token Bucket: Token bucket allows bursts (you can use all tokens at once). Leaking bucket enforces a perfectly smooth, constant output rate. This is useful when you need predictable, steady processing — like sending emails or processing payments.
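A minimal in-process sketch of the leaking bucket (class and field names are my own; a production version would drain the queue from a background worker rather than lazily on each call):

```python
import time
from collections import deque

class LeakingBucket:
    """Queue-based limiter: requests drain at a fixed rate, excess is dropped."""
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests processed ("leaked") per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Remove (i.e. "process") as many requests as the elapsed time allows
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def allow_request(self) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(time.monotonic())  # enqueue for steady processing
            return True
        return False  # bucket full → drop the request

bucket = LeakingBucket(capacity=3, leak_rate=1)
print([bucket.allow_request() for _ in range(5)])  # [True, True, True, False, False]
```

Note the output shape: unlike the token bucket, accepted requests are not processed immediately; they wait in the queue and exit at the fixed leak rate.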

Algorithm 3: Fixed Window Counter

The simplest approach. Divide time into fixed windows (e.g., each minute). Count requests in the current window. If the count exceeds the limit, reject.

  • Window: 12:00:00 – 12:00:59 → Max 100 requests
  • Counter: 0, 1, 2, ... 100 → reject everything until 12:01:00

The problem: Boundary spikes. If a user sends 100 requests at 12:00:59 and 100 more at 12:01:00, they've sent 200 requests in 2 seconds — double the intended rate! Each individual window looks fine, but the burst across the boundary is a loophole.
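The fixed window counter takes only a few lines (a sketch under the same single-process assumptions as the earlier examples):

```python
import time

class FixedWindowCounter:
    """Count requests per fixed time window; reset the counter at each boundary."""
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window = window_seconds
        self.count = 0
        self.window_start = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # Window boundary crossed: start a fresh window with a zeroed counter
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

limiter = FixedWindowCounter(max_requests=3, window_seconds=60)
print([limiter.allow_request() for _ in range(5)])  # [True, True, True, False, False]
```

The reset-to-zero at the boundary is exactly where the spike loophole lives: the counter forgets everything the instant a new window begins.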

Sliding window: counting requests in a moving time window

Algorithm 4: Sliding Window

Fixes the boundary problem of fixed windows. Two variants:

Sliding Window Log: Keep a sorted log of all request timestamps. For each new request, remove timestamps older than the window, then count the remaining. If under the limit, allow. Accurate but memory-intensive — you store every timestamp.
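The log variant can be sketched with a deque of timestamps (illustrative, in-process only):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Exact limiter: store every request timestamp, evict those outside the window."""
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the window
        while self.log and now - self.log[0] >= self.window:
            self.log.popleft()
        if len(self.log) < self.max_requests:
            self.log.append(now)
            return True
        return False

limiter = SlidingWindowLog(max_requests=2, window_seconds=1.0)
print([limiter.allow_request() for _ in range(3)])  # [True, True, False]
```

The memory cost is visible in the deque: one entry per allowed request, which is exactly the \(O(n)\) overhead the counter variant avoids.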

Sliding Window Counter: A hybrid approach. Keep the count from the previous window and the current window. Estimate the current rate using a weighted average: estimated = prev_count × overlap_ratio + curr_count. It's an approximation but uses constant memory and handles boundaries smoothly.

The sliding window counter is the sweet spot for most systems — accurate enough, low memory, easy to implement in Redis.

Note: Quick cheat sheet for interviews: Token Bucket = most popular, allows bursts, used by AWS/Stripe. Leaking Bucket = constant output rate, good for steady processing. Fixed Window = simple but has boundary spike issues. Sliding Window = best accuracy, slightly more complex. Know all four but default to Token Bucket in design discussions.

Distributed Rate Limiting

The algorithms above work great on a single server. But what if you have 10 API servers behind a load balancer? Each server has its own counter — a user could hit all 10 servers and get 10x the rate limit!

Solutions:

Centralized store (Redis): All servers check and update the same counter in Redis (an in-memory key-value store). Use Redis's atomic operations (INCR, EXPIRE) to avoid race conditions. This is the most common approach.

Sticky sessions: Route each user to the same server (via IP hash). Then local rate limiting works. But this reduces load balancer flexibility, creates hot spots, and loses a user's counters whenever their server goes down.

Rate limit at the load balancer/gateway: Put the rate limiter before your servers. Tools like Nginx, Kong, and AWS API Gateway have built-in rate limiting with shared state.

Redis script for atomic rate limiting:

Use a Lua script in Redis to atomically check-and-increment. This avoids race conditions where two servers read the count as 99 (under limit 100) and both allow their request.
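One common shape for such a script is a fixed-window check-and-increment; Redis runs the whole EVAL as a single atomic operation, so no other client can interleave between the read and the write. The sketch below shows the Lua source (key names and limits are illustrative) and, since it needs a live Redis server to actually run, a small in-memory Python model of the same logic:

```python
# Lua executed atomically inside Redis via EVAL: increment the key, set its
# TTL on the first hit of the window, and report whether the limit is exceeded.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])  -- start the window on first hit
end
return count <= tonumber(ARGV[2]) and 1 or 0
"""
# With redis-py this could be registered via r.register_script(RATE_LIMIT_LUA)
# and invoked with keys=["rate:user:42"], args=[60, 100].

import time

class FakeRedisWindow:
    """In-memory model of the script's semantics, for illustration only."""
    def __init__(self):
        self.counts = {}  # key -> (count, window expiry time)

    def check(self, key: str, window_s: int, limit: int) -> bool:
        now = time.monotonic()
        count, expiry = self.counts.get(key, (0, now + window_s))
        if now >= expiry:                 # TTL elapsed → key expired in Redis
            count, expiry = 0, now + window_s
        count += 1                        # INCR
        self.counts[key] = (count, expiry)
        return count <= limit

store = FakeRedisWindow()
print([store.check("rate:user:42", 60, 3) for _ in range(5)])
# [True, True, True, False, False]
```

The Python model is single-threaded, so it sidesteps the race by construction; in the real distributed setup it is the atomicity of EVAL that provides the same guarantee.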

Accept or reject: HTTP 429 Too Many Requests

The HTTP 429 Response

When a request is rate limited, return 429 Too Many Requests with helpful headers:

  • X-RateLimit-Limit: 100 — Your max allowed requests per window.
  • X-RateLimit-Remaining: 0 — How many you have left.
  • X-RateLimit-Reset: 1640000060 — When the window resets (Unix timestamp).
  • Retry-After: 30 — How many seconds to wait before retrying.

Good rate limiter responses are informative, not punitive. They tell the client exactly how to behave. This reduces unnecessary retry storms and makes your API developer-friendly.
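These headers can be derived directly from limiter state. A sketch (header names as listed above; the function and its parameters are illustrative):

```python
import math
import time

def build_429_response(limit: int, reset_epoch: int) -> dict:
    """Construct a 429 response carrying the standard advisory headers."""
    # Seconds until the window resets, rounded up and never negative
    retry_after = max(0, math.ceil(reset_epoch - time.time()))
    return {
        "status": 429,
        "headers": {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": "0",           # limited → nothing left
            "X-RateLimit-Reset": str(reset_epoch),  # Unix time the window resets
            "Retry-After": str(retry_after),        # seconds to wait before retrying
        },
        "body": "Too Many Requests",
    }

resp = build_429_response(limit=100, reset_epoch=int(time.time()) + 30)
print(resp["status"], resp["headers"])
```

A well-behaved client reads Retry-After and backs off instead of hammering the endpoint, which is exactly the retry-storm reduction described above.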

Distributed rate limiting with Redis
Note: Interview tip: When designing a rate limiter, mention the different scopes: per-user, per-IP, per-API-key, or global. A well-designed system uses multiple layers — global limits to protect infrastructure, per-user limits to ensure fairness, and per-endpoint limits for expensive operations (like sending emails).

Key Metrics

  • Token Bucket: \(O(1)\) time, \(O(1)\) space — constant-time check, constant state
  • Leaking Bucket: \(O(1)\) time, \(O(1)\) space — queue check
  • Fixed Window Counter: \(O(1)\) time, \(O(1)\) space — single counter per window
  • Sliding Window Log: \(O(n)\) time, \(O(n)\) space — n = requests in window; memory-heavy
  • Sliding Window Counter: \(O(1)\) time, \(O(1)\) space — weighted average of 2 counters
  • Redis INCR + EXPIRE: ~0.1-0.5 ms per check, \(O(1)\) space — atomic distributed counter

Quick check

A token bucket has capacity 10 and refill rate of 2 tokens/sec. It's been idle for a long time. How many requests can be sent instantly?
