Data & Infrastructure12 min read

Design a Web Crawler

Crawl the entire web like Googlebot β€” politely and at scale
scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

A web crawler (like Googlebot) systematically browses the internet to build a search index. It starts with a set of seed URLs, downloads pages, extracts links, and repeats β€” discovering the web graph one page at a time.

Functional Requirements:

  • Discover new URLs by following links from known pages.
  • Download (fetch) web pages and store their content.
  • Extract links from downloaded pages to find new URLs.
  • Store crawled content for indexing, analysis, or archival.

Non-Functional Requirements:

  • Politeness / rate limiting: Don't overwhelm any single domain β€” respect robots.txt and limit requests per host.
  • Avoid duplicate URLs: The same URL can appear on thousands of pages. Crawl it once, not thousands of times.
  • Handle traps / infinite loops: Some sites generate infinite URLs (calendars, session IDs). The crawler must detect and avoid these.
  • Scale to billions of pages: The web has billions of pages. The crawler must be horizontally scalable and fault-tolerant.
β–Έ The idea: seed URLs β†’ crawl β†’ discover β†’ repeat

Estimation

Let's size this system:

  • 1 billion pages per month: ~400 pages/second sustained throughput
  • Average page size: ~100 KB (HTML + headers)
  • Storage per month: 1B Γ— 100 KB = ~100 TB/month
  • DNS lookups: ~400/second (one per page fetch, with caching reducing actual external lookups)
  • URL frontier size: Billions of discovered but uncrawled URLs at any given time
  • Bandwidth: 400 pages/s Γ— 100 KB = ~40 MB/s = ~320 Mbps sustained

This is a storage-heavy, I/O-bound system. The main bottlenecks are network I/O, DNS resolution, and managing the massive URL frontier.

Core Algorithm: BFS with URL Frontier

A web crawler is essentially a breadth-first search (BFS) over the web graph:

  1. Seed URLs β€” Start with a curated list of high-quality seed URLs (e.g., top news sites, Wikipedia, government portals).
  2. URL Frontier (queue) β€” A priority queue of URLs to crawl next. Seed URLs go in first.
  3. Fetch β€” Dequeue a URL, resolve DNS, download the page via HTTP.
  4. Parse β€” Parse the HTML content, extract text for the search index.
  5. Extract URLs β€” Find all <a href="#"> links in the page.
  6. Filter & Deduplicate β€” Check if each extracted URL has been seen before. If not, add it to the frontier.
  7. Repeat β€” Go back to step 3. Continue until the frontier is empty (it never really is!).

The frontier is the heart of the crawler. It determines what to crawl and when β€” balancing freshness, importance, and politeness.

β–Έ BFS crawl: URL frontier and fetch cycle
Click chart to zoom
The crawl loop: URLs flow from the frontier through fetch, parse, extract, filter, and back to the frontier
β–Έ Politeness and dedup: being a good citizen

Politeness and Deduplication

A crawler that hammers servers is a bad crawler. Politeness isn't just nice β€” it's essential to avoid getting blocked and to be a responsible internet citizen.

Politeness:

  • Respect robots.txt β€” Every domain's /robots.txt file specifies which paths crawlers may or may not access, and often a Crawl-delay directive. Always obey it.
  • Per-host rate limiting β€” Limit to ~1 request per second per domain. Use separate queues per domain so you can throttle each independently.
  • Priority queue β€” Crawl important pages first (high PageRank, frequently updated news sites) rather than deep-linked obscure pages.

URL Deduplication:

  • Bloom filter β€” Space-efficient probabilistic set. Can tell you "definitely not seen" or "probably seen." With 1B URLs and 1% false positive rate, needs only ~1.2 GB of memory.
  • URL fingerprint set β€” Hash each URL (MD5/SHA) and store in a set. More memory but zero false positives.

Content Deduplication:

  • SimHash β€” Detects near-duplicate pages (mirrors, syndicated content). Two pages with similar SimHash values have similar content, even if URLs differ.
β–Έ Full architecture at scale

Scaling to Billions of Pages

A single machine can't crawl the web. Here's how to scale:

Multiple Crawler Workers: Run hundreds of crawler workers in parallel. Each worker runs the fetch-parse-extract loop independently.

Distributed URL Frontier: Partition the frontier by domain using consistent hashing. All URLs for example.com go to the same frontier partition. This naturally enforces per-domain rate limiting β€” each partition handles its own domains.

Checkpointing & Crash Recovery: Periodically snapshot the frontier state and crawl progress. If a worker crashes, another picks up where it left off. Use message queues with at-least-once delivery to ensure no URL is lost.

DNS Cache: DNS resolution is slow (~10-100ms). Cache DNS results aggressively β€” domains don't change IPs often.

Recrawl Strategy: Pages change over time. Use exponential backoff: if a page hasn't changed after several recrawls, increase the interval. Fresh news pages get recrawled more frequently.

Note: Interview tip: Politeness is often the most critical talking point for a web crawler design. Interviewers want to see that you understand robots.txt, per-domain rate limiting, and why crawling too aggressively can get your crawler IP-banned. Always mention it early in your design.

Key Metrics

Pages crawled per second
1B pages/month sustained rate
~400/s β€”
Storage per month
1B pages Γ— 100 KB avg
~100 TB β€”
URL frontier size
Discovered but uncrawled URLs
Billions β€”
Bloom filter (dedup)
1B URLs, 1% false positive
~1.2 GB \(O(1)\) lookup
DNS resolution
Cached: <1 ms
~10-100 ms \(O(1)\)
Per-domain rate limit
Politeness constraint
~1 req/s β€”

Quick check

Why does a web crawler use per-domain queues instead of a single global queue?

Continue reading