Web Services11 min read

Design Search Autocomplete

Instant suggestions as you type β€” at massive scale
scope:Real-World Systemdifficulty:Intermediate

Understanding the Problem

Every time you type in Google's search bar, suggestions appear almost instantly. How does that work at scale? Let's design a search autocomplete system.

First, let's clarify the requirements:

Functional Requirements:

  • As the user types each character, return the top-k (e.g., 5) matching queries based on the prefix.
  • Suggestions are ranked by popularity (query frequency).
  • Prefix matching only β€” we're not doing fuzzy search or spell correction.

Non-Functional Requirements:

  • Low latency: Suggestions must appear in under 100ms. Anything slower feels laggy.
  • High availability: Autocomplete is a core UX feature β€” it must always work.
  • Scalable: Must handle billions of queries per day across millions of users.
β–Έ How autocomplete works: prefix β†’ suggestions

Estimation

Let's size this system:

  • 10 billion queries/day β†’ ~115K QPS, peak ~230K QPS
  • Average query: 4 words Γ— 5 chars/word = 20 characters
  • Autocomplete triggers per query: ~20 requests (one per keystroke)
  • Autocomplete QPS: ~2.3M peak (but most are cached)
  • Response requirement: < 100ms end-to-end
  • Top-k suggestions: Return 5 results per prefix
  • Unique queries to store: ~5 billion (with frequency counts)
  • Storage per query: ~50 bytes avg β†’ ~250 GB for the query dataset
  • Trie storage: ~100 GB (with optimization) β€” fits in memory across a cluster

API Design

Keep it simple β€” one endpoint:

Get Suggestions

EndpointGET /api/v1/suggestions?prefix=sys&limit=5
Response{"suggestions": ["system design", "system design interview", "system call", "system architecture", "system programming"]}
Status200 OK
Latency< 100ms

Design choices:

  • GET request (idempotent, cacheable by CDN/browser).
  • Prefix is URL-encoded in query params.
  • No authentication needed for basic autocomplete (public queries).
  • Response is intentionally small β€” just an array of strings.
β–Έ Trie data structure for prefix matching
Click chart to zoom
Query flow: CDN handles most requests; only cache misses reach the Trie servers

Core Data Structure: The Trie

A trie (prefix tree) is the ideal data structure for autocomplete. Each node represents a character, and paths from root to nodes form prefixes.

Key insight: At each node, we pre-compute and cache the top-k most popular queries that pass through that node. This way, a lookup for any prefix is just a tree traversal β€” no sorting needed at query time.

How it works:

  1. Traverse the trie following the prefix characters (e.g., s β†’ y β†’ s).
  2. At the final node, read the pre-cached top-5 queries.
  3. Return them immediately β€” O(p) where p = prefix length.

Without pre-caching, you'd need to explore all subtrees below the prefix node to find top-k results, which could be O(n) for n total queries. Pre-caching makes it O(p) β€” a huge win.

β–Έ Data collection and aggregation pipeline

Data Collection & Aggregation

The trie needs to reflect current query popularity. Here's the pipeline:

  1. Log queries: Every search query is logged with a timestamp.
  2. Sample at scale: At 10B queries/day, we don't log every single one β€” sample 1 in 100 or 1 in 1000.
  3. Aggregate hourly/daily: A MapReduce or Spark job counts query frequencies over time windows (e.g., last 7 days with decay weighting).
  4. Rebuild the trie: Periodically (e.g., weekly), rebuild the entire trie from aggregated data. Deploy to trie servers via blue-green deployment.
  5. Real-time updates: For trending queries, use a secondary fast path β€” a small in-memory structure updated in near real-time.

This two-track approach (batch + real-time) ensures both accuracy and freshness.

β–Έ Full architecture
Note: Trie optimization tips: (1) Cache the top-k results at frequently accessed nodes β€” the top 2 levels of the trie handle most queries. (2) Shard the trie by prefix range (a-m on shard 1, n-z on shard 2) for horizontal scaling. (3) Use CDN/edge caching for popular prefixes β€” 'how', 'what', 'why' are queried millions of times.

Scaling & Reliability

Sharding: Shard the trie by the first 1-2 characters of the prefix. For example, shard 1 handles prefixes starting with a-m, shard 2 handles n-z. This distributes load evenly and allows independent scaling.

Replication: Each shard has multiple replicas for availability. If one replica goes down, traffic is routed to another. Use leader-follower replication β€” all replicas serve reads, only the leader handles trie updates.

CDN & Edge Caching: Cache popular prefix results at the CDN layer. Prefixes like "how to", "what is", and "best" are queried millions of times with identical results. A short TTL (e.g., 1 hour) balances freshness and cache hit rate.

Client-side optimization: The browser can cache recent suggestions locally and debounce keystrokes (wait 50-100ms after the user stops typing before sending a request). This reduces server load dramatically.

Key Metrics

Prefix lookup
p = prefix length, read cached top-k
< 10ms \(O(p)\)
Build trie (batch)
n queries, L = avg query length
~hours \(O(n \cdot L)\)
Trie storage (in-memory)
Sharded across cluster
~100 GB β€”
Autocomplete QPS
Handled by sharding + CDN cache
~2.3M peak β€”
CDN cache hit rate
Popular prefixes served at edge
~80-90% β€”
End-to-end latency
SLA target for suggestions
< 100ms β€”

Quick check

Why is a trie (prefix tree) preferred over a hash map for autocomplete?

Continue reading