Design a Web Crawler

Crawl the entire web like Googlebot — politely and at scale

Scope: Real-World System | Difficulty: Advanced

Understanding the Problem

A web crawler (like Googlebot) systematically browses the internet to build a search index. It starts with a set of seed URLs, downloads pages, extracts links, and repeats — discovering the web graph one page at a time.

Functional Requirements:

Discover new URLs by following links from known pages.
Download (fetch) web pages and store their content.
Extract links from downloaded pages to find new URLs.
Store crawled content for indexing, analysis, or archival.

Non-Functional Requirements:

Politeness / rate limiting: Don't overwhelm any single domain — respect robots.txt and limit requests per host.
Avoid duplicate URLs: The same URL can appear on thousands of pages. Crawl it once, not thousands of times.
Handle traps / infinite loops: Some sites generate infinite URLs (calendars, session IDs). The crawler must detect and avoid these.
Scale to billions of pages: The web has billions of pages. The crawler must be horizontally scalable and fault-tolerant.

Estimation

Let's size this system:

1 billion pages per month: ~400 pages/second sustained throughput
Average page size: ~100 KB (HTML + headers)
Storage per month: 1B × 100 KB = ~100 TB/month
DNS lookups: ~400/second (one per page fetch, with caching reducing actual external lookups)
URL frontier size: Billions of discovered but uncrawled URLs at any given time
Bandwidth: 400 pages/s × 100 KB = ~40 MB/s = ~320 Mbps sustained

This is a storage-heavy, I/O-bound system. The main bottlenecks are network I/O, DNS resolution, and managing the massive URL frontier.
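
The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes a 30-day month and decimal units (1 KB = 1,000 bytes), and the variable names are illustrative.

```python
# Back-of-envelope sizing. All inputs are the assumptions stated in the list above.
SECONDS_PER_MONTH = 30 * 24 * 3600      # 2,592,000 s, assuming a 30-day month
PAGES_PER_MONTH = 1_000_000_000
AVG_PAGE_BYTES = 100 * 1000             # 100 KB in decimal units

pages_per_second = PAGES_PER_MONTH / SECONDS_PER_MONTH
storage_tb_per_month = PAGES_PER_MONTH * AVG_PAGE_BYTES / 1e12

# Bandwidth uses the rounded 400 pages/s figure, as the estimate above does.
bandwidth_mb_per_s = 400 * AVG_PAGE_BYTES / 1e6
bandwidth_mbps = bandwidth_mb_per_s * 8

print(f"{pages_per_second:.0f} pages/s")            # 386 pages/s, rounded up to ~400
print(f"{storage_tb_per_month:.0f} TB/month")       # 100 TB/month
print(f"{bandwidth_mb_per_s:.0f} MB/s = {bandwidth_mbps:.0f} Mbps")  # 40 MB/s = 320 Mbps
```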

Core Algorithm: BFS with URL Frontier

A web crawler is essentially a breadth-first search (BFS) over the web graph:

1. Seed URLs: Start with a curated list of high-quality seed URLs (e.g., top news sites, Wikipedia, government portals).
2. URL Frontier (queue): The queue of URLs to crawl next. Seed URLs go in first. A plain FIFO queue gives strict BFS; production crawlers use a priority queue so that important pages are fetched earlier.
3. Fetch: Dequeue a URL, resolve DNS, download the page via HTTP.
4. Parse: Parse the HTML content, extract text for the search index.
5. Extract URLs: Find all <a href="..."> links in the page and resolve relative links against the page URL.
6. Filter & Deduplicate: Check if each extracted URL has been seen before. If not, add it to the frontier.
7. Repeat: Go back to step 3. Continue until the frontier is empty (in practice, it never is).

The frontier is the heart of the crawler. It determines what to crawl and when — balancing freshness, importance, and politeness.
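
The loop described above can be sketched in a few dozen lines. This is a minimal single-threaded illustration, not a production design: `crawl`, `LinkExtractor`, and the `fetch` callback are hypothetical names, the frontier is a plain FIFO queue (strict BFS, no priorities or politeness), and a small in-memory `FAKE_WEB` dictionary stands in for real HTTP fetches so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """BFS crawl. `fetch(url)` is a caller-supplied function returning HTML or None."""
    frontier = deque(seeds)          # FIFO queue of URLs still to crawl
    seen = set(seeds)                # every URL ever added to the frontier
    crawled = []                     # URLs in the order they were fetched
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:             # fetch failed or page does not exist
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return crawled


# A tiny in-memory "web" used instead of real network requests.
FAKE_WEB = {
    "https://a.example/": '<a href="/x">x</a> <a href="https://b.example/">b</a>',
    "https://a.example/x": '<a href="/">home</a> <a href="#top">top</a>',
    "https://b.example/": '<a href="https://a.example/x">x</a>',
}
crawl_order = crawl(["https://a.example/"], FAKE_WEB.get)
print(crawl_order)
```

Each page is fetched exactly once even though the three pages link to each other, because a URL enters the frontier only if it is not already in `seen`.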

Figure: BFS crawl (URL frontier and fetch cycle). The crawl loop: URLs flow from the frontier through fetch, parse, extract, filter, and back to the frontier.

Politeness and Deduplication

A crawler that hammers servers is a bad crawler. Politeness isn't just nice — it's essential to avoid getting blocked and to be a responsible internet citizen.

Politeness:

Respect robots.txt: Every domain's /robots.txt file specifies which paths crawlers may or may not access. Some sites also set a Crawl-delay directive; it is a non-standard extension that not every crawler honors, but a polite crawler should obey it when present.
Per-host rate limiting: Limit to ~1 request per second per domain. Use separate queues per domain so you can throttle each independently.
Priority queue: Crawl important pages first (high PageRank, frequently updated news sites) rather than deep-linked obscure pages.
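
As a rough illustration of the first two points, the sketch below parses a robots.txt file with Python's standard `urllib.robotparser` and pairs it with a small per-host throttle. The `HostThrottle` class, the `ExampleBot` user agent, and the sample robots.txt content are all assumptions made for this example; the time is passed in explicitly so the logic is easy to follow and test.

```python
import urllib.robotparser
from urllib.parse import urlsplit

# Sample robots.txt content (made up for this example; normally fetched per host).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay   # ~1 request/second/host by default
        self.next_allowed = {}               # host -> earliest next fetch time

    def wait_time(self, url, now, delay=None):
        """Return seconds to wait before fetching `url`; 0.0 means go now.

        When 0.0 is returned, the slot is reserved and the host's timer restarts.
        """
        host = urlsplit(url).hostname
        delay = self.default_delay if delay is None else delay
        ready_at = self.next_allowed.get(host, 0.0)
        if now < ready_at:
            return ready_at - now
        self.next_allowed[host] = now + delay
        return 0.0


throttle = HostThrottle()
url = "https://example.com/index.html"
if robots.can_fetch("ExampleBot", url):
    # Prefer the site's Crawl-delay when it sets one; otherwise use the default.
    site_delay = robots.crawl_delay("ExampleBot")
    print(throttle.wait_time(url, now=0.0, delay=site_delay))   # 0.0 (first request)
```

A second request to the same host before the delay has elapsed gets a positive wait time, while requests to other hosts are unaffected. That independence is the reason for keeping state per host.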

URL Deduplication:

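Bloom filter: Space-efficient probabilistic set. It answers "definitely not seen" or "probably seen." With 1B URLs and a 1% false positive rate, it needs only ~1.2 GB of memory (about 9.6 bits per URL). The cost of a false positive is that a genuinely new URL is occasionally skipped.
URL fingerprint set: Hash each URL (MD5/SHA) and store the fingerprints in a set. This uses more memory (16 bytes per URL with MD5, so ~16 GB for 1B URLs before set overhead), but false positives are limited to hash collisions, which are negligible in practice.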
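
The ~1.2 GB figure follows from the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. The sketch below computes them and shows a toy filter. The `BloomFilter` class is illustrative only: it derives its k bit positions from one SHA-256 digest using double hashing, which is one common choice and not the only one.

```python
import hashlib
import math


def bloom_parameters(n_items, fp_rate):
    """Return (bits, hash_count) for n_items at the target false positive rate."""
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


class BloomFilter:
    """Toy Bloom filter over strings, backed by a bytearray bitmap."""

    def __init__(self, n_items, fp_rate):
        self.size, self.num_hashes = bloom_parameters(n_items, fp_rate)
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = (h1 + i * h2) mod size.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force odd so it is never 0
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bits, hashes = bloom_parameters(1_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.2f} GB, {hashes} hash functions")   # 1.20 GB, 7 hash functions

seen = BloomFilter(n_items=1000, fp_rate=0.01)
seen.add("https://example.com/")
print("https://example.com/" in seen)   # True: added items are never reported missing
```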

Content Deduplication:

SimHash: Detects near-duplicate pages (mirrors, syndicated content). Two pages whose SimHash fingerprints differ in only a few bits (a small Hamming distance) have similar content, even if their URLs differ.
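
A minimal SimHash sketch, assuming whitespace-separated words as features with equal weight (real systems typically use weighted shingles or n-grams). The function names and sample documents are illustrative; BLAKE2b from `hashlib` is used here simply as a convenient 64-bit token hash.

```python
import hashlib


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of the text's words."""
    weights = [0] * bits
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes won.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = ("web crawlers download pages extract links and add newly discovered urls "
         "to a frontier so that the index stays fresh while respecting every site rate limit")
doc_b = doc_a.replace("fresh", "current")      # near-duplicate: one word changed
doc_c = ("bake the bread at two hundred degrees for forty minutes until the crust "
         "turns golden then let it cool on a wire rack before slicing")

near = hamming_distance(simhash(doc_a), simhash(doc_b))
far = hamming_distance(simhash(doc_a), simhash(doc_c))
print(near, far)   # the near-duplicate pair should differ in far fewer bits
```

A crawler would treat two pages as near-duplicates when the distance falls below a small threshold. The right threshold depends on the fingerprint width and the feature choice, so it has to be tuned on real data.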

Scaling to Billions of Pages

A single machine can't crawl the web. Here's how to scale:

Multiple Crawler Workers: Run hundreds of crawler workers in parallel. Each worker runs the fetch-parse-extract loop independently.

Distributed URL Frontier: Partition the frontier by domain using consistent hashing. All URLs for example.com go to the same frontier partition. Because a single partition sees every URL for a given domain, it can enforce that domain's rate limit locally, without coordinating with other partitions.
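
One way to picture the partitioning is a hash ring keyed by hostname. The sketch below is an assumption-laden illustration: `HashRing`, the partition names, and the choice of 100 virtual nodes per partition are all hypothetical, and a real deployment would also need replication and rebalancing logic.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


class HashRing:
    """Consistent-hash ring mapping a URL's host to a frontier partition."""

    def __init__(self, nodes, vnodes=100):
        # Each partition is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _node in self._ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def node_for(self, url):
        host = urlsplit(url).hostname or url
        # First ring point clockwise from the host's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, self._hash(host)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(["frontier-0", "frontier-1", "frontier-2"])
# Every URL on the same host lands on the same partition.
print(ring.node_for("https://example.com/a") == ring.node_for("https://example.com/b?page=2"))  # True
```

Hashing the hostname, not the full URL, is what keeps a domain's URLs together. The ring structure means that adding a partition only moves the hosts that now fall to the new partition and leaves every other assignment unchanged.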

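Checkpointing & Crash Recovery: Periodically snapshot the frontier state and crawl progress. If a worker crashes, another picks up where it left off. Use message queues with at-least-once delivery to ensure no URL is lost. At-least-once delivery means a URL can be delivered twice after a crash, so processing must tolerate repeats; the deduplication check already covers this.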

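DNS Cache: DNS resolution is slow (~10-100 ms per uncached lookup). Cache results aggressively, since most domains change IPs rarely, but expire entries (for example, by honoring the record's TTL) so that a moved domain is eventually re-resolved.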
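
A small cache with expiry captures the idea. This is a sketch under simplifying assumptions: `DnsCache` is a hypothetical class, it applies one fixed TTL to every entry where a real resolver would use each record's own TTL, and the resolver and clock are injected so the example runs without network access. In a real crawler the resolver could be `socket.gethostbyname` or an asynchronous DNS client.

```python
import time


class DnsCache:
    """Caches host -> IP lookups and re-resolves after a fixed TTL."""

    def __init__(self, resolve, ttl_seconds=300.0, clock=time.monotonic):
        self._resolve = resolve          # function: host -> IP string
        self._ttl = ttl_seconds
        self._clock = clock              # injectable for testing
        self._entries = {}               # host -> (ip, expires_at)

    def lookup(self, host):
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]             # cache hit: no external lookup
        ip = self._resolve(host)         # miss or expired: resolve again
        self._entries[host] = (ip, now + self._ttl)
        return ip


# Demonstration with a fake resolver and a manually advanced clock.
calls = []

def fake_resolve(host):
    calls.append(host)
    return "192.0.2.10"                  # address from the documentation range

fake_now = [0.0]
cache = DnsCache(fake_resolve, ttl_seconds=300.0, clock=lambda: fake_now[0])
cache.lookup("example.com")              # miss: calls the resolver
cache.lookup("example.com")              # hit: served from the cache
fake_now[0] = 301.0
cache.lookup("example.com")              # expired: resolves again
print(len(calls))                        # 2
```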

Recrawl Strategy: Pages change over time. Use exponential backoff: if a page hasn't changed after several recrawls, increase the interval. Fresh news pages get recrawled more frequently.
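
The interval adjustment can be sketched as a single function. This is a simplified variant of the strategy described above: it doubles the interval after every unchanged recrawl instead of waiting for several, and halves it when a change is seen. The function name, the factor of 2, and the one-hour and 30-day bounds are assumptions chosen for illustration.

```python
def next_recrawl_interval(current_hours, changed,
                          min_hours=1.0, max_hours=24.0 * 30, factor=2.0):
    """Return the next recrawl interval in hours for a page.

    Unchanged pages back off exponentially (up to max_hours); pages that
    changed since the last crawl are revisited sooner (down to min_hours).
    """
    if changed:
        return max(min_hours, current_hours / factor)
    return min(max_hours, current_hours * factor)


interval = 24.0
for changed in (False, False, True):
    interval = next_recrawl_interval(interval, changed)
    print(interval)    # 48.0, then 96.0, then 48.0
```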

Interview tip: Politeness is often the most critical talking point for a web crawler design. Interviewers want to see that you understand robots.txt, per-domain rate limiting, and why crawling too aggressively can get your crawler IP-banned. Always mention it early in your design.

Key Metrics

Pages crawled per second: ~400/s (1B pages/month sustained rate)
Storage per month: ~100 TB (1B pages × 100 KB avg)
URL frontier size: Billions (discovered but uncrawled URLs)
Bloom filter (dedup): ~1.2 GB for 1B URLs at 1% false positives; \(O(1)\) lookup
DNS resolution: ~10-100 ms uncached, <1 ms cached; \(O(1)\) cache lookup
Per-domain rate limit: ~1 req/s (politeness constraint)

Quick check

Why does a web crawler use per-domain queues instead of a single global queue?
