Design an Ad Click Aggregator
Understanding the Problem
Every time someone clicks an ad on Google, Facebook, or any ad network, that click needs to be counted exactly once. The count feeds into billing (advertisers pay per click) and analytics (campaign performance dashboards). Getting it wrong means either overcharging advertisers or losing revenue.
Functional Requirements:
- Record every ad click with metadata (ad ID, campaign ID, user ID, timestamp, IP).
- Aggregate click counts per ad, per campaign, and per time window (1 min, 5 min, 1 hr).
- Detect and filter fraudulent clicks (bots, click farms, duplicate clicks).
- Provide near real-time reporting dashboards for advertisers.
Non-Functional Requirements:
- Exactly-once counting for billing β no double-counting, no missed clicks.
- Near real-time: Aggregated counts available within <1 minute of the click.
- Massive scale: Handle 10 billion clicks per day.
- 99.99% accuracy: Financial data must be correct. Reconciliation between real-time and batch counts.
Estimation
Let's size the system:
- 10 billion clicks/day = ~115,000 clicks/second average
- Peak traffic: ~500,000 clicks/second (during major events, holidays)
- Click event size: ~200 bytes (ad ID, campaign ID, user ID, timestamp, IP, user agent, geo)
- Raw storage: 10B Γ 200B = 2 TB/day, ~730 TB/year
- Aggregated data: Much smaller β 1M ads Γ 1440 minutes Γ 50B β 72 GB/day
- Network bandwidth: 500K Γ 200B = ~100 MB/s peak ingestion
This is a streaming problem at its core. We need to process a firehose of events, deduplicate them, aggregate counts in time windows, and serve queries β all in near real-time.
Why This Is Hard
Counting clicks sounds simple, but at scale it's deceptively hard:
Duplicate clicks: Users double-click, browsers retry, mobile networks replay requests. The same click can arrive 2-3 times. We must count it exactly once for billing.
Bot clicks: Competitors or click farms can inflate click counts to drain an advertiser's budget. We need real-time fraud detection.
Network retries: If the click collector doesn't ACK fast enough, the client SDK retries. Each retry looks like a new click unless we deduplicate.
Time-windowed aggregation: Advertisers want counts in 1-minute, 5-minute, and 1-hour windows. Late-arriving events (due to network delays) must be placed in the correct window, not the current one. This requires watermarks and event-time processing.
Exactly-once semantics: In distributed systems, achieving exactly-once is the hardest guarantee. We use idempotent writes + deduplication to get effectively-once.
Deduplication & Exactly-Once Counting
The most critical component. Here's the strategy:
Click ID: Each click event gets a unique click_id generated by the Ad SDK (UUID + timestamp hash). This is the dedup key.
Two-layer deduplication:
- Bloom filter (fast path): In-memory probabilistic check. If the Bloom filter says "not seen," the click is definitely new. If it says "maybe seen," we check the exact store. False positive rate ~1%, but zero false negatives.
- Exact dedup store (Redis/RocksDB): Stores
click_id β 1with a TTL of 2 hours. Definitive answer for "maybe seen" clicks from the Bloom filter.
Watermarks for late events: Flink uses watermarks to track event-time progress. Events arriving up to 5 minutes late are placed in the correct time window. Events beyond the watermark threshold are dropped or sent to a dead-letter queue for manual review.
MapReduce for reconciliation: A daily batch job (Spark) re-aggregates all raw clicks from HDFS. The batch count is compared against the real-time count. Discrepancies trigger alerts and corrections.
Fraud Detection & Reconciliation
Real-time fraud signals:
- Abnormal click rate: If one IP generates 100 clicks on the same ad in 1 minute, that's suspicious. Rate limit per IP per ad.
- IP clustering: Multiple clicks from the same /24 subnet β possible click farm.
- Click-to-conversion ratio: If an ad gets 10,000 clicks but 0 conversions, something is wrong.
- ML models: Train on labeled fraud data. Features: click timing patterns, session behavior, device fingerprint, geo mismatch.
Reconciliation (Lambda Architecture):
- Real-time path (speed layer): Flink produces approximate counts within seconds.
- Batch path (batch layer): Spark re-processes raw events nightly from HDFS, producing exact counts.
- Serving layer: Query service merges both. Batch counts override real-time counts once available.
This is the Lambda Architecture pattern β combining stream and batch processing for both speed and accuracy.