Data & Infrastructure14 min read

Design a Search Engine

Index billions of pages and return relevant results in milliseconds
scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

How does Google index billions of web pages and return relevant results in under 200 milliseconds? Let's design a search engine from scratch.

Functional Requirements:

  • Crawl the web β€” discover and download billions of pages.
  • Index crawled content β€” build a searchable data structure from raw HTML.
  • Search queries β€” accept a text query and find matching documents.
  • Rank results β€” order results by relevance, authority, and freshness.
  • Snippets β€” generate short previews showing why each result matches.

Non-Functional Requirements:

  • Low latency: Query-to-results in < 200 ms. Users won't wait.
  • Freshness: New and updated content should appear within hours, not weeks.
  • Relevance: The top 10 results must feel "right" β€” this is the hardest part.
  • Scale: Handle 100K+ queries per second at peak.
β–Έ The idea: crawl β†’ index β†’ search β†’ rank

Estimation

Let's size this beast:

  • 50 billion pages indexed β€” the visible web is enormous.
  • 10 billion queries/day β€” that's roughly 115,000 QPS, peaking at ~200K QPS.
  • Index size: 100 PB+ β€” the inverted index alone dwarfs most databases.
  • Average query touches ~1 million candidate documents before ranking narrows it down to the top 10.

This is one of the most demanding systems in existence. The key insight: we preprocess everything at index time so that query time can be blazingly fast.

The Inverted Index

The core data structure of every search engine is the inverted index: a mapping from every word to the list of documents containing that word.

For each word, we store a posting list: a sorted list of (docId, position, frequency) tuples. This tells us which documents contain the word, where in the document it appears, and how often.

Building the index from crawled content involves:

  1. Tokenization β€” split text into words, normalize case, remove stop words.
  2. Posting list construction β€” for each token, append the (docId, position, freq) entry.
  3. TF-IDF scoring β€” precompute term frequency Γ— inverse document frequency for basic relevance. Words that appear often in a document but rarely across all documents are strong signals.

The inverted index turns a search query from "scan every document" into "look up a word, get a list" β€” transforming an \(O(N)\) problem into \(O(1)\) lookup + \(O(K)\) intersection.

β–Έ Inverted index: from documents to search
Click chart to zoom
Search query flow: scatter to index shards, gather results, then rank the merged candidates

Ranking Pipeline

Finding documents is easy. Ranking them is the hard part. Modern search engines use a two-phase ranking approach:

Phase 1 β€” Rough scoring (on index servers):

  • BM25 β€” an improved version of TF-IDF that accounts for document length and term saturation. This is the workhorse of text relevance.
  • Each index shard scores its local candidates and returns the top-K (e.g., top 1,000) to the gather layer.

Phase 2 β€” Fine ranking (on ranking servers):

  • PageRank β€” measures link authority. A page linked by many authoritative pages is itself authoritative. Precomputed offline via iterative graph algorithm.
  • ML re-ranking β€” a learned model that combines hundreds of signals: click-through rate, content freshness, page quality score, dwell time, geographic relevance, and more.
  • This phase is expensive per-document but only runs on the merged top-K candidates (~1,000-5,000 documents, not millions).

The two-phase approach is critical: you can't run an ML model on 1 million documents in 200ms, but you can run BM25 on millions and ML on thousands.

β–Έ Ranking pipeline: from candidates to results

Index Partitioning and Architecture

With 50 billion documents, no single machine can hold the entire index. We partition by document: each shard holds the complete inverted index for a subset of documents.

Scatter-gather query execution:

  1. A query arrives at the Query Service, which parses and rewrites it.
  2. The query is scattered to all index shards in parallel.
  3. Each shard runs BM25 on its local documents and returns its top-K results.
  4. Results are gathered and merged, then sent to the Ranking Service for fine scoring.

Replication: Each shard is replicated 3-5x for availability and load distribution. If one replica is slow or down, the query still succeeds from another replica.

Background pipeline: The Web Crawler continuously discovers new pages. The Indexer processes raw HTML into tokens and posting lists. The Index Builder compiles new index segments that are pushed to index servers β€” this happens continuously, not as a big batch.

β–Έ Full architecture
Note: Interview tip: The two key patterns to emphasize are (1) scatter-gather for parallel index lookup across shards, and (2) two-phase ranking β€” rough scoring on cheap index servers, fine ranking with ML on expensive ranking servers. This shows you understand how to make a 200ms latency budget work across billions of documents.

Key Metrics

Query latency (p99)
Scatter-gather + two-phase ranking
< 200 ms β€”
Index size
50B docs Γ— inverted index per term
100+ PB β€”
Crawl rate
Continuous crawling with politeness
~1B pages/day β€”
Index freshness
Incremental index updates
Minutes to hours β€”
Peak QPS
115K avg, 200K peak
~200K β€”
Docs per query (candidates)
Funnel: 1M β†’ 1K β†’ 10
~1M β†’ top 10 β€”

Quick check

Why do search engines use a two-phase ranking approach instead of running the full ML model on all candidates?

Continue reading