Design a Web Crawler
Understanding the Problem
A web crawler (like Googlebot) systematically browses the internet to build a search index. It starts with a set of seed URLs, downloads pages, extracts links, and repeats β discovering the web graph one page at a time.
Functional Requirements:
- Discover new URLs by following links from known pages.
- Download (fetch) web pages and store their content.
- Extract links from downloaded pages to find new URLs.
- Store crawled content for indexing, analysis, or archival.
Non-Functional Requirements:
- Politeness / rate limiting: Don't overwhelm any single domain β respect
robots.txtand limit requests per host. - Avoid duplicate URLs: The same URL can appear on thousands of pages. Crawl it once, not thousands of times.
- Handle traps / infinite loops: Some sites generate infinite URLs (calendars, session IDs). The crawler must detect and avoid these.
- Scale to billions of pages: The web has billions of pages. The crawler must be horizontally scalable and fault-tolerant.
Estimation
Let's size this system:
- 1 billion pages per month: ~400 pages/second sustained throughput
- Average page size: ~100 KB (HTML + headers)
- Storage per month: 1B Γ 100 KB = ~100 TB/month
- DNS lookups: ~400/second (one per page fetch, with caching reducing actual external lookups)
- URL frontier size: Billions of discovered but uncrawled URLs at any given time
- Bandwidth: 400 pages/s Γ 100 KB = ~40 MB/s = ~320 Mbps sustained
This is a storage-heavy, I/O-bound system. The main bottlenecks are network I/O, DNS resolution, and managing the massive URL frontier.
Core Algorithm: BFS with URL Frontier
A web crawler is essentially a breadth-first search (BFS) over the web graph:
- Seed URLs β Start with a curated list of high-quality seed URLs (e.g., top news sites, Wikipedia, government portals).
- URL Frontier (queue) β A priority queue of URLs to crawl next. Seed URLs go in first.
- Fetch β Dequeue a URL, resolve DNS, download the page via HTTP.
- Parse β Parse the HTML content, extract text for the search index.
- Extract URLs β Find all
<a href="#"> links in the page. - Filter & Deduplicate β Check if each extracted URL has been seen before. If not, add it to the frontier.
- Repeat β Go back to step 3. Continue until the frontier is empty (it never really is!).
The frontier is the heart of the crawler. It determines what to crawl and when β balancing freshness, importance, and politeness.
Politeness and Deduplication
A crawler that hammers servers is a bad crawler. Politeness isn't just nice β it's essential to avoid getting blocked and to be a responsible internet citizen.
Politeness:
- Respect
robots.txtβ Every domain's/robots.txtfile specifies which paths crawlers may or may not access, and often aCrawl-delaydirective. Always obey it. - Per-host rate limiting β Limit to ~1 request per second per domain. Use separate queues per domain so you can throttle each independently.
- Priority queue β Crawl important pages first (high PageRank, frequently updated news sites) rather than deep-linked obscure pages.
URL Deduplication:
- Bloom filter β Space-efficient probabilistic set. Can tell you "definitely not seen" or "probably seen." With 1B URLs and 1% false positive rate, needs only ~1.2 GB of memory.
- URL fingerprint set β Hash each URL (MD5/SHA) and store in a set. More memory but zero false positives.
Content Deduplication:
- SimHash β Detects near-duplicate pages (mirrors, syndicated content). Two pages with similar SimHash values have similar content, even if URLs differ.
Scaling to Billions of Pages
A single machine can't crawl the web. Here's how to scale:
Multiple Crawler Workers: Run hundreds of crawler workers in parallel. Each worker runs the fetch-parse-extract loop independently.
Distributed URL Frontier: Partition the frontier by domain using consistent hashing. All URLs for example.com go to the same frontier partition. This naturally enforces per-domain rate limiting β each partition handles its own domains.
Checkpointing & Crash Recovery: Periodically snapshot the frontier state and crawl progress. If a worker crashes, another picks up where it left off. Use message queues with at-least-once delivery to ensure no URL is lost.
DNS Cache: DNS resolution is slow (~10-100ms). Cache DNS results aggressively β domains don't change IPs often.
Recrawl Strategy: Pages change over time. Use exponential backoff: if a page hasn't changed after several recrawls, increase the interval. Fresh news pages get recrawled more frequently.