Business Systems12 min read

Design an Ad Click Aggregator

Count billions of ad clicks accurately β€” for billing and analytics
scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

Every time someone clicks an ad on Google, Facebook, or any ad network, that click needs to be counted exactly once. The count feeds into billing (advertisers pay per click) and analytics (campaign performance dashboards). Getting it wrong means either overcharging advertisers or losing revenue.

Functional Requirements:

  • Record every ad click with metadata (ad ID, campaign ID, user ID, timestamp, IP).
  • Aggregate click counts per ad, per campaign, and per time window (1 min, 5 min, 1 hr).
  • Detect and filter fraudulent clicks (bots, click farms, duplicate clicks).
  • Provide near real-time reporting dashboards for advertisers.

Non-Functional Requirements:

  • Exactly-once counting for billing β€” no double-counting, no missed clicks.
  • Near real-time: Aggregated counts available within <1 minute of the click.
  • Massive scale: Handle 10 billion clicks per day.
  • 99.99% accuracy: Financial data must be correct. Reconciliation between real-time and batch counts.
β–Έ The idea: click β†’ count β†’ bill β†’ report

Estimation

Let's size the system:

  • 10 billion clicks/day = ~115,000 clicks/second average
  • Peak traffic: ~500,000 clicks/second (during major events, holidays)
  • Click event size: ~200 bytes (ad ID, campaign ID, user ID, timestamp, IP, user agent, geo)
  • Raw storage: 10B Γ— 200B = 2 TB/day, ~730 TB/year
  • Aggregated data: Much smaller β€” 1M ads Γ— 1440 minutes Γ— 50B β‰ˆ 72 GB/day
  • Network bandwidth: 500K Γ— 200B = ~100 MB/s peak ingestion

This is a streaming problem at its core. We need to process a firehose of events, deduplicate them, aggregate counts in time windows, and serve queries β€” all in near real-time.

Why This Is Hard

Counting clicks sounds simple, but at scale it's deceptively hard:

Duplicate clicks: Users double-click, browsers retry, mobile networks replay requests. The same click can arrive 2-3 times. We must count it exactly once for billing.

Bot clicks: Competitors or click farms can inflate click counts to drain an advertiser's budget. We need real-time fraud detection.

Network retries: If the click collector doesn't ACK fast enough, the client SDK retries. Each retry looks like a new click unless we deduplicate.

Time-windowed aggregation: Advertisers want counts in 1-minute, 5-minute, and 1-hour windows. Late-arriving events (due to network delays) must be placed in the correct window, not the current one. This requires watermarks and event-time processing.

Exactly-once semantics: In distributed systems, achieving exactly-once is the hardest guarantee. We use idempotent writes + deduplication to get effectively-once.

β–Έ Stream processing: real-time click aggregation
Click chart to zoom
Click flow: from SDK to real-time dashboard in under 1 minute

Deduplication & Exactly-Once Counting

The most critical component. Here's the strategy:

Click ID: Each click event gets a unique click_id generated by the Ad SDK (UUID + timestamp hash). This is the dedup key.

Two-layer deduplication:

  1. Bloom filter (fast path): In-memory probabilistic check. If the Bloom filter says "not seen," the click is definitely new. If it says "maybe seen," we check the exact store. False positive rate ~1%, but zero false negatives.
  2. Exact dedup store (Redis/RocksDB): Stores click_id β†’ 1 with a TTL of 2 hours. Definitive answer for "maybe seen" clicks from the Bloom filter.

Watermarks for late events: Flink uses watermarks to track event-time progress. Events arriving up to 5 minutes late are placed in the correct time window. Events beyond the watermark threshold are dropped or sent to a dead-letter queue for manual review.

MapReduce for reconciliation: A daily batch job (Spark) re-aggregates all raw clicks from HDFS. The batch count is compared against the real-time count. Discrepancies trigger alerts and corrections.

β–Έ Deduplication and exactly-once counting

Fraud Detection & Reconciliation

Real-time fraud signals:

  • Abnormal click rate: If one IP generates 100 clicks on the same ad in 1 minute, that's suspicious. Rate limit per IP per ad.
  • IP clustering: Multiple clicks from the same /24 subnet β†’ possible click farm.
  • Click-to-conversion ratio: If an ad gets 10,000 clicks but 0 conversions, something is wrong.
  • ML models: Train on labeled fraud data. Features: click timing patterns, session behavior, device fingerprint, geo mismatch.

Reconciliation (Lambda Architecture):

  • Real-time path (speed layer): Flink produces approximate counts within seconds.
  • Batch path (batch layer): Spark re-processes raw events nightly from HDFS, producing exact counts.
  • Serving layer: Query service merges both. Batch counts override real-time counts once available.

This is the Lambda Architecture pattern β€” combining stream and batch processing for both speed and accuracy.

β–Έ Full architecture
Note: Interview tip: The critical requirement for an ad click aggregator is exactly-once counting semantics. Explain how you achieve it: unique click IDs from the SDK, Bloom filter + exact dedup store, idempotent writes to the aggregation store, and batch reconciliation as a safety net. This shows you understand distributed counting is fundamentally harder than it appears.

Key Metrics

Click ingestion rate
Kafka partitioned by ad_id
500K clicks/s peak β€”
Aggregation latency
Flink micro-batch windows
<1 min β€”
Dedup accuracy
Bloom filter + Redis exact check
99.99% β€”
Raw storage per day
10B clicks Γ— 200 bytes
~2 TB/day β€”
Aggregated storage/day
1M ads Γ— 1440 min windows
~72 GB/day β€”
Fraud detection
Real-time ML + rule engine
<5 sec β€”

Quick check

Why is a Bloom filter used as the first layer of deduplication instead of going directly to Redis?

Continue reading