Design a Rate Limiter
Why Rate Limit?
Imagine a water faucet. Turn it on a little — great, you get water. Turn it on full blast with 50 hoses connected — the pipes burst. A rate limiter is the valve that controls the flow.
Without rate limiting, your API is vulnerable to:
- Denial of Service (DoS) attacks: A malicious actor floods your servers with millions of requests, making the service unavailable for everyone.
- Misbehaving clients: A buggy script accidentally sends 10,000 requests per second instead of 10.
- Resource starvation: One heavy user consumes all your server capacity, leaving nothing for others.
- Cost explosion: In cloud environments, every request costs money. Unchecked traffic can blow your budget overnight.
Rate limiting ensures fair usage and protects your system from being overwhelmed. Every major API (Google, Twitter, GitHub, Stripe) uses rate limiting.
Where to Put the Rate Limiter
You have a few options:
- Client-side: The client limits its own requests. Unreliable — a malicious client can simply skip it.
- Server-side: Your API server checks limits before processing requests. Common and straightforward.
- Middleware / API Gateway: A separate layer (like Nginx, Kong, or AWS API Gateway) handles rate limiting before requests even reach your servers. This is the most common production approach — it protects your servers from the traffic entirely.
Most systems use the API Gateway approach. It's centralized, language-agnostic, and doesn't require changes to your application code.
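As a sketch of the gateway approach, Nginx's built-in `limit_req` module can enforce a per-IP limit before traffic reaches your application (the zone name, rate, and `backend` upstream here are illustrative, not from the original):

```nginx
# Shared 10 MB zone keyed by client IP, allowing 10 requests/second.
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

server {
    location /api/ {
        # Queue bursts of up to 20 requests; reject the rest.
        limit_req zone=api_limit burst=20 nodelay;
        limit_req_status 429;   # default is 503; 429 is more accurate
        proxy_pass http://backend;
    }
}
```

Note that `limit_req` itself implements a leaky-bucket style limiter; the `burst` parameter controls how much queueing it tolerates.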
Algorithm 1: Token Bucket
The most popular rate limiting algorithm. Think of it like an arcade token machine:
- You have a bucket that holds tokens (say, max 10 tokens).
- Tokens are added at a fixed rate (say, 1 token per second).
- Each request costs 1 token. If there's a token in the bucket, the request goes through.
- If the bucket is empty, the request is rejected (HTTP 429).
- If the bucket is full, new tokens overflow (are discarded).
Why it's great: It allows short bursts of traffic (up to the bucket size) while enforcing a long-term average rate. A user with a bucket size of 10 and refill rate of 1/sec can send 10 requests instantly (burst) then 1 per second after that.
Used by: Amazon, Stripe, and most cloud APIs.
Token Bucket Implementation
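A minimal single-server sketch in Python (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec, holds at most `capacity`."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)   # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time; tokens past capacity overflow (discarded).
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller should respond with HTTP 429
```

With `TokenBucket(capacity=10, rate=1.0)`, the first 10 calls succeed immediately (the burst), after which requests are admitted at roughly one per second.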
Algorithm 2: Leaking Bucket
Similar to token bucket but processes requests at a fixed, constant rate. Think of a bucket with a hole in the bottom — water drips out at a steady rate, no matter how fast you pour water in.
- Requests enter a queue (the bucket).
- Requests are processed at a fixed rate (water leaking out).
- If the queue is full, new requests are dropped.
Difference from Token Bucket: Token bucket allows bursts (you can use all tokens at once). Leaking bucket enforces a perfectly smooth, constant output rate. This is useful when you need predictable, steady processing — like sending emails or processing payments.
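The queue-and-drain behavior can be sketched like this (a simplified in-process version; a real system would process queued requests on a worker, and all names here are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: queue of at most `capacity` requests, drained at `leak_rate`/sec."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        # Drain requests that would have been processed since the last check.
        now = time.monotonic()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def offer(self, request) -> bool:
        self._leak()
        if len(self.queue) >= self.capacity:
            return False               # bucket full: drop the request
        self.queue.append(request)
        return True
```

However fast requests arrive, they leave the queue at a steady `leak_rate`, which is exactly the smoothing property described above.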
Algorithm 3: Fixed Window Counter
The simplest approach. Divide time into fixed windows (e.g., each minute). Count requests in the current window. If the count exceeds the limit, reject.
- Window: 12:00:00 – 12:00:59 → Max 100 requests
- Counter: 0, 1, 2, ... 100 → reject everything until 12:01:00
The problem: Boundary spikes. If a user sends 100 requests at 12:00:59 and 100 more at 12:01:00, they've sent 200 requests in 2 seconds — double the intended rate! Each individual window looks fine, but the burst across the boundary is a loophole.
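The counter-and-reset logic is only a few lines (a single-server sketch with illustrative names):

```python
import time

class FixedWindowCounter:
    """Fixed window: at most `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.count = 0
        self.window_start = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now    # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

The simplicity is the appeal; the boundary spike described above is the cost.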
Algorithm 4: Sliding Window
Fixes the boundary problem of fixed windows. Two variants:
Sliding Window Log: Keep a sorted log of all request timestamps. For each new request, remove timestamps older than the window, then count the remaining. If under the limit, allow. Accurate but memory-intensive — you store every timestamp.
Sliding Window Counter: A hybrid approach. Keep the count from the previous window and the current window. Estimate the current rate using a weighted sum: estimated = prev_count × overlap_ratio + curr_count, where overlap_ratio is the fraction of the previous window that still falls inside the sliding window. It's an approximation but uses constant memory and handles boundaries smoothly.
The sliding window counter is the sweet spot for most systems — accurate enough, low memory, easy to implement in Redis.
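A single-server sketch of the counter variant (names are illustrative, and as a simplification the window is anchored at rollover time rather than at calendar boundaries):

```python
import time

class SlidingWindowCounter:
    """Sliding window counter: weighted blend of previous and current windows."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.curr_start = time.monotonic()
        self.curr_count = 0
        self.prev_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll forward; if more than one full window passed, prev is empty.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start = now
            elapsed = 0.0
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (elapsed / self.window)
        estimated = self.prev_count * overlap + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False
```

In Redis, the same idea needs only two counters per user (previous and current window), which is what makes it cheap at scale.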
Distributed Rate Limiting
The algorithms above work great on a single server. But what if you have 10 API servers behind a load balancer? Each server has its own counter — a user could hit all 10 servers and get 10x the rate limit!
Solutions:
Centralized store (Redis): All servers check and update the same counter in Redis, an in-memory key-value store. Use Redis's atomic operations (INCR, EXPIRE) to avoid race conditions. This is the most common approach.
Sticky sessions: Route each user to the same server (via IP hash). Then local rate limiting works. But this reduces load balancer flexibility, and each user's limit state lives on a single server, so it is lost if that server fails.
Rate limit at the load balancer/gateway: Put the rate limiter before your servers. Tools like Nginx, Kong, and AWS API Gateway have built-in rate limiting with shared state.
Redis script for atomic rate limiting:
Use a Lua script in Redis to atomically check-and-increment. This avoids race conditions where two servers read the count as 99 (under limit 100) and both allow their request.
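A sketch of such a script, embedded as a string the way a client library like redis-py would load it (the key layout and return convention are assumptions, not from the original); a pure-Python mirror of the same logic is included for local testing:

```python
# Lua script for Redis: KEYS[1] = counter key, ARGV[1] = limit,
# ARGV[2] = window in seconds. Redis runs the whole script atomically,
# so two servers can never both read 99 and both pass a limit of 100.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
    return 0   -- over the limit: reject
end
return 1       -- allowed
"""

# Pure-Python mirror of the script's check-and-increment logic.
def allow(counters: dict, key: str, limit: int) -> bool:
    counters[key] = counters.get(key, 0) + 1   # INCR
    return counters[key] <= limit
```

Setting EXPIRE only when the counter is first created gives each key a fixed-window lifetime; the limiter above is therefore a fixed-window counter, made safe for multiple servers by the script's atomicity.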
The HTTP 429 Response
When a request is rate limited, return 429 Too Many Requests with helpful headers:
- X-RateLimit-Limit: 100 — Your max allowed requests per window.
- X-RateLimit-Remaining: 0 — How many you have left.
- X-RateLimit-Reset: 1640000060 — When the window resets (Unix timestamp).
- Retry-After: 30 — How many seconds to wait before retrying.
Good rate limiter responses are informative, not punitive. They tell the client exactly how to behave. This reduces unnecessary retry storms and makes your API developer-friendly.