Design Search Autocomplete
Understanding the Problem
Every time you type in Google's search bar, suggestions appear almost instantly. How does that work at scale? Let's design a search autocomplete system.
First, let's clarify the requirements:
Functional Requirements:
- As the user types each character, return the top-k (e.g., 5) matching queries based on the prefix.
- Suggestions are ranked by popularity (query frequency).
- Prefix matching only β we're not doing fuzzy search or spell correction.
Non-Functional Requirements:
- Low latency: Suggestions must appear in under 100ms. Anything slower feels laggy.
- High availability: Autocomplete is a core UX feature β it must always work.
- Scalable: Must handle billions of queries per day across millions of users.
Estimation
Let's size this system:
- 10 billion queries/day β ~115K QPS, peak ~230K QPS
- Average query: 4 words Γ 5 chars/word = 20 characters
- Autocomplete triggers per query: ~20 requests (one per keystroke)
- Autocomplete QPS: ~2.3M peak (but most are cached)
- Response requirement: < 100ms end-to-end
- Top-k suggestions: Return 5 results per prefix
- Unique queries to store: ~5 billion (with frequency counts)
- Storage per query: ~50 bytes avg β ~250 GB for the query dataset
- Trie storage: ~100 GB (with optimization) β fits in memory across a cluster
API Design
Keep it simple β one endpoint:
Get Suggestions
| Endpoint | GET /api/v1/suggestions?prefix=sys&limit=5 |
| Response | {"suggestions": ["system design", "system design interview", "system call", "system architecture", "system programming"]} |
| Status | 200 OK |
| Latency | < 100ms |
Design choices:
- GET request (idempotent, cacheable by CDN/browser).
- Prefix is URL-encoded in query params.
- No authentication needed for basic autocomplete (public queries).
- Response is intentionally small β just an array of strings.
Core Data Structure: The Trie
A trie (prefix tree) is the ideal data structure for autocomplete. Each node represents a character, and paths from root to nodes form prefixes.
Key insight: At each node, we pre-compute and cache the top-k most popular queries that pass through that node. This way, a lookup for any prefix is just a tree traversal β no sorting needed at query time.
How it works:
- Traverse the trie following the prefix characters (e.g., s β y β s).
- At the final node, read the pre-cached top-5 queries.
- Return them immediately β O(p) where p = prefix length.
Without pre-caching, you'd need to explore all subtrees below the prefix node to find top-k results, which could be O(n) for n total queries. Pre-caching makes it O(p) β a huge win.
Data Collection & Aggregation
The trie needs to reflect current query popularity. Here's the pipeline:
- Log queries: Every search query is logged with a timestamp.
- Sample at scale: At 10B queries/day, we don't log every single one β sample 1 in 100 or 1 in 1000.
- Aggregate hourly/daily: A MapReduce or Spark job counts query frequencies over time windows (e.g., last 7 days with decay weighting).
- Rebuild the trie: Periodically (e.g., weekly), rebuild the entire trie from aggregated data. Deploy to trie servers via blue-green deployment.
- Real-time updates: For trending queries, use a secondary fast path β a small in-memory structure updated in near real-time.
This two-track approach (batch + real-time) ensures both accuracy and freshness.
Scaling & Reliability
Sharding: Shard the trie by the first 1-2 characters of the prefix. For example, shard 1 handles prefixes starting with a-m, shard 2 handles n-z. This distributes load evenly and allows independent scaling.
Replication: Each shard has multiple replicas for availability. If one replica goes down, traffic is routed to another. Use leader-follower replication β all replicas serve reads, only the leader handles trie updates.
CDN & Edge Caching: Cache popular prefix results at the CDN layer. Prefixes like "how to", "what is", and "best" are queried millions of times with identical results. A short TTL (e.g., 1 hour) balances freshness and cache hit rate.
Client-side optimization: The browser can cache recent suggestions locally and debounce keystrokes (wait 50-100ms after the user stops typing before sending a request). This reduces server load dramatically.