Web Services11 min read

Design Search Autocomplete

Instant suggestions as you type — at massive scale

scope:Real-World Systemdifficulty:Intermediate

Understanding the Problem

Every time you type in Google's search bar, suggestions appear almost instantly. How does that work at scale? Let's design a search autocomplete system.

First, let's clarify the requirements:

Functional Requirements:

As the user types each character, return the top-k (e.g., 5) matching queries based on the prefix.
Suggestions are ranked by popularity (query frequency).
Prefix matching only — we're not doing fuzzy search or spell correction.

Non-Functional Requirements:

Low latency: Suggestions must appear in under 100ms. Anything slower feels laggy.
High availability: Autocomplete is a core UX feature — it must always work.
Scalable: Must handle billions of queries per day across millions of users.

▸ How autocomplete works: prefix → suggestions

Estimation

Let's size this system:

10 billion queries/day → ~115K QPS, peak ~230K QPS
Average query: 4 words × 5 chars/word = 20 characters
Autocomplete triggers per query: ~20 requests (one per keystroke)
Autocomplete QPS: ~2.3M peak (but most are cached)
Response requirement: < 100ms end-to-end
Top-k suggestions: Return 5 results per prefix
Unique queries to store: ~5 billion (with frequency counts)
Storage per query: ~50 bytes avg → ~250 GB for the query dataset
Trie storage: ~100 GB (with optimization) — fits in memory across a cluster

API Design

Keep it simple — one endpoint:

Get Suggestions

Endpoint	`GET /api/v1/suggestions?prefix=sys&limit=5`
Response	`{"suggestions": ["system design", "system design interview", "system call", "system architecture", "system programming"]}`
Status	`200 OK`
Latency	< 100ms

Design choices:

GET request (idempotent, cacheable by CDN/browser).
Prefix is URL-encoded in query params.
No authentication needed for basic autocomplete (public queries).
Response is intentionally small — just an array of strings.

▸ Trie data structure for prefix matching

Click chart to zoom

Query flow: CDN handles most requests; only cache misses reach the Trie servers

Core Data Structure: The Trie

A trie (prefix tree) is the ideal data structure for autocomplete. Each node represents a character, and paths from root to nodes form prefixes.

Key insight: At each node, we pre-compute and cache the top-k most popular queries that pass through that node. This way, a lookup for any prefix is just a tree traversal — no sorting needed at query time.

How it works:

Traverse the trie following the prefix characters (e.g., s → y → s).
At the final node, read the pre-cached top-5 queries.
Return them immediately — O(p) where p = prefix length.

Without pre-caching, you'd need to explore all subtrees below the prefix node to find top-k results, which could be O(n) for n total queries. Pre-caching makes it O(p) — a huge win.

▸ Data collection and aggregation pipeline

Data Collection & Aggregation

The trie needs to reflect current query popularity. Here's the pipeline:

Log queries: Every search query is logged with a timestamp.
Sample at scale: At 10B queries/day, we don't log every single one — sample 1 in 100 or 1 in 1000.
Aggregate hourly/daily: A MapReduce or Spark job counts query frequencies over time windows (e.g., last 7 days with decay weighting).
Rebuild the trie: Periodically (e.g., weekly), rebuild the entire trie from aggregated data. Deploy to trie servers via blue-green deployment.
Real-time updates: For trending queries, use a secondary fast path — a small in-memory structure updated in near real-time.

This two-track approach (batch + real-time) ensures both accuracy and freshness.

▸ Full architecture

Note: Trie optimization tips: (1) Cache the top-k results at frequently accessed nodes — the top 2 levels of the trie handle most queries. (2) Shard the trie by prefix range (a-m on shard 1, n-z on shard 2) for horizontal scaling. (3) Use CDN/edge caching for popular prefixes — 'how', 'what', 'why' are queried millions of times.

Scaling & Reliability

Sharding: Shard the trie by the first 1-2 characters of the prefix. For example, shard 1 handles prefixes starting with a-m, shard 2 handles n-z. This distributes load evenly and allows independent scaling.

Replication: Each shard has multiple replicas for availability. If one replica goes down, traffic is routed to another. Use leader-follower replication — all replicas serve reads, only the leader handles trie updates.

CDN & Edge Caching: Cache popular prefix results at the CDN layer. Prefixes like "how to", "what is", and "best" are queried millions of times with identical results. A short TTL (e.g., 1 hour) balances freshness and cache hit rate.

Client-side optimization: The browser can cache recent suggestions locally and debounce keystrokes (wait 50-100ms after the user stops typing before sending a request). This reduces server load dramatically.