
Design a Metrics Monitoring System

Collect, store, query, and alert on millions of time-series data points, like Datadog and Prometheus
Scope: Real-World System | Difficulty: Advanced

Understanding the Problem

Modern distributed systems generate an enormous volume of metrics: CPU usage, request latency, error rates, queue depths, and more. A metrics monitoring system collects all this data, stores it as time-series, lets engineers query and visualize it, and fires alerts when things go wrong.

Think Datadog, Prometheus + Grafana, or AWS CloudWatch.

Functional Requirements:

  • Ingest metrics from thousands of services (push or pull model).
  • Store time-series data with high write throughput and efficient compression.
  • Query and aggregate metrics over arbitrary time ranges (avg, sum, percentiles).
  • Dashboards: real-time and historical visualization of metrics.
  • Alerting: threshold-based and anomaly-detection alerts with notification routing (PagerDuty, Slack, email).

Non-Functional Requirements:

  • Write throughput: 10M data points/second ingestion.
  • Query latency: < 1 second for dashboard queries spanning hours/days.
  • Availability: 99.9%. Alerting must not go down when you need it most.
  • Retention: 1 year of historical data with automatic downsampling.
▸ The idea: collect → store → query → alert

Estimation

Let's size this system:

  • 10,000 services, each emitting ~500 metrics at a 10-second interval.
  • Write rate: 10K × 500 / 10 = 500K writes/second (steady state). Peak ~2M/s.
  • Data point size: metric name + tags + timestamp + value ≈ 100 bytes on the wire. Once the series identity is stored once per series, each point is a 16-byte (timestamp, value) pair, which Gorilla compression shrinks about 12× to ~1.37 bytes.
  • Daily volume: 500K × 86,400 × 16 B ≈ 690 GB/day as uncompressed 16-byte pairs, or roughly 59 GB/day after Gorilla compression.
  • 1-year retention: 690 GB × 365 ≈ 250 TB as uncompressed 16-byte pairs. With rollups: raw (2 weeks) + 1min (3 months) + 1hr (1 year) ≈ ~1.6 PB raw equivalent (500K points/s × 100 B × 1 year), ~50 TB stored.
  • Query fanout: A single dashboard panel may query 100+ time-series across multiple partitions.

The core challenge is handling massive write throughput while keeping queries fast across enormous time ranges.
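
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the estimate: the constant and function names are illustrative, and the 16-byte per-point size is the figure used in the daily estimate.

```python
# Back-of-the-envelope sizing using the figures from the estimate above.
SERVICES = 10_000
METRICS_PER_SERVICE = 500
EMIT_INTERVAL_S = 10
BYTES_PER_POINT = 16       # per-point size used in the daily estimate
SECONDS_PER_DAY = 86_400


def writes_per_second(services: int, metrics: int, interval_s: int) -> float:
    """Steady-state data points written per second."""
    return services * metrics / interval_s


def bytes_per_day(points_per_second: float, bytes_per_point: float) -> float:
    """Bytes written per day at a constant ingest rate."""
    return points_per_second * SECONDS_PER_DAY * bytes_per_point


rate = writes_per_second(SERVICES, METRICS_PER_SERVICE, EMIT_INTERVAL_S)
daily = bytes_per_day(rate, BYTES_PER_POINT)
yearly = daily * 365

print(f"{rate:,.0f} points/s")
print(f"{daily / 1e9:,.0f} GB/day")
print(f"{yearly / 1e12:,.0f} TB/year")
```

Changing `BYTES_PER_POINT` to 1.37 or 100 shows how sensitive the storage bill is to the per-point encoding.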

Time-Series Data Model

Every data point looks like this:

{
  "metric_name": "http.request.latency_ms",
  "tags": {
    "host": "web-server-42",
    "service": "api-gateway",
    "region": "us-east-1",
    "endpoint": "/api/v1/users"
  },
  "timestamp": 1709251200,
  "value": 42.5
}

A unique time-series is defined by (metric_name + sorted tags). All data points for the same series are stored contiguously for efficient compression.
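
One way to derive such a series identity is to build a canonical string from the metric name and the tags sorted by key. The sketch below is an assumption about how this could look; the `series_key` name and the `name{k=v,...}` format are illustrative, not taken from any particular system.

```python
def series_key(metric_name: str, tags: dict[str, str]) -> str:
    """Build a canonical identity string for a time series.

    Tags are sorted by key so the same tag set always yields the same key,
    regardless of the order in which the client sent them.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric_name}{{{tag_part}}}"


a = series_key("http.request.latency_ms",
               {"service": "api-gateway", "host": "web-server-42"})
b = series_key("http.request.latency_ms",
               {"host": "web-server-42", "service": "api-gateway"})
print(a)
print(a == b)
```

Because both calls produce the same key, points from either tag ordering land in the same series and stay contiguous on disk.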

Compression (Gorilla paper):

  • Timestamps: Delta-of-delta encoding. If timestamps come every 10s, the delta is always ~10. The delta-of-delta is ~0, which needs only 1 bit!
  • Values: XOR encoding. Consecutive float values are often similar. The XOR of two similar IEEE 754 floats has many leading and trailing zero bits, so only the short run of meaningful bits in the middle needs to be stored.
  • Result: ~1.37 bytes per data point (12× compression from 16 bytes).
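
The two transforms in the list above can be illustrated without the bit-packing step. The following is a minimal sketch, not the Gorilla encoder itself: it only computes the delta-of-delta values and the XOR bit patterns that such an encoder would then pack into variable-length bit fields. All function names are illustrative.

```python
import struct


def delta_of_delta(timestamps: list[int]) -> list[int]:
    """Second-order differences of a timestamp sequence.

    A real encoder would store the first timestamp and first delta in a
    header; this returns only the values that follow them.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(value: float) -> int:
    """Reinterpret a float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_with_previous(values: list[float]) -> list[int]:
    """XOR each value's bit pattern with the previous value's bit pattern."""
    bits = [float_bits(v) for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]


def leading_zeros(x: int) -> int:
    """Number of zero bits above the highest set bit in a 64-bit word."""
    return 64 - x.bit_length()


def trailing_zeros(x: int) -> int:
    """Number of zero bits below the lowest set bit in a 64-bit word."""
    return 64 if x == 0 else (x & -x).bit_length() - 1


dods = delta_of_delta([1709251200, 1709251210, 1709251220, 1709251231, 1709251240])
xors = xor_with_previous([42.5, 42.5, 43.0])
print(dods)
for x in xors:
    print(f"{x:064b}", leading_zeros(x), trailing_zeros(x))
```

Regularly spaced timestamps give delta-of-delta values at or near zero, and repeated or similar values give XOR results that are zero or have only a few meaningful bits, which is what makes the short encodings possible.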

Data is organized into blocks (typically 2-hour chunks). Each block is immutable once flushed, making it easy to compact and move between storage tiers.
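
Assuming blocks are aligned to fixed 2-hour boundaries of the Unix epoch (an assumption; the alignment rule is not stated above), mapping a timestamp to its block is simple modular arithmetic. The names below are illustrative.

```python
BLOCK_SECONDS = 2 * 60 * 60  # 2-hour blocks


def block_start(timestamp: int, block_seconds: int = BLOCK_SECONDS) -> int:
    """Start of the block (inclusive) that contains the given Unix timestamp."""
    return timestamp - (timestamp % block_seconds)


def block_range(timestamp: int, block_seconds: int = BLOCK_SECONDS) -> tuple[int, int]:
    """Half-open [start, end) range of the block containing the timestamp."""
    start = block_start(timestamp, block_seconds)
    return start, start + block_seconds


print(block_range(1709251200 + 3600))
```

A query for a time range can use the same function to work out which blocks it needs to open.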

▸ Time-series storage: compression and partitioning
Ingestion pipeline: agents push metrics through collectors and Kafka to a sharded writer pool, which writes to a WAL and in-memory blocks before periodic compaction
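
The pipeline description does not say how series are assigned to writers in the sharded pool. One common approach, shown here purely as an assumption, is to hash the series key so that every point of a series reaches the same writer. The `shard_for` function is hypothetical.

```python
import hashlib


def shard_for(series_key: str, num_shards: int) -> int:
    """Map a series key to a shard index in [0, num_shards).

    A stable hash (not Python's per-process randomized hash()) keeps the
    assignment consistent across processes and restarts.
    """
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


key = "http.request.latency_ms{host=web-server-42,service=api-gateway}"
print(shard_for(key, 16))
```

Plain modulo hashing reshuffles most series when the shard count changes; a consistent-hashing ring is a typical refinement if the writer pool is resized often.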

Query Engine and Alerting

The query engine supports a PromQL-style language for flexible time-series queries:

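# Average request latency by service over the last hour
# (assumes http_request_latency_ms is a gauge)
avg by (service) (
  avg_over_time(http_request_latency_ms[1h])
)

# 99th percentile latency, aggregated across series per histogram bucket
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_bucket[5m])))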

Downsampling is critical for long-range queries:

  • Raw data (10s resolution): kept for 2 weeks.
  • 1-minute rollups (avg, min, max, count): kept for 3 months.
  • 5-minute rollups: kept for 6 months.
  • 1-hour rollups: kept for 1 year+.
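
A rollup of the kind listed above can be sketched as grouping raw points into fixed-width buckets and keeping avg, min, max, and count per bucket. This is a minimal in-memory illustration with hypothetical names; a real system would run it incrementally as a background job.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollup:
    bucket_start: int
    avg: float
    minimum: float
    maximum: float
    count: int


def rollup(points: list[tuple[int, float]], bucket_seconds: int = 60) -> list[Rollup]:
    """Aggregate (timestamp, value) points into fixed-width time buckets."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        Rollup(start, sum(vals) / len(vals), min(vals), max(vals), len(vals))
        for start, vals in sorted(buckets.items())
    ]


raw = [(1709251200, 10.0), (1709251210, 20.0), (1709251250, 30.0), (1709251260, 40.0)]
for r in rollup(raw):
    print(r)
```

Keeping `count` next to `avg` matters: coarser rollups (for example 1-hour from 1-minute) can then be derived as count-weighted averages instead of re-reading raw data.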

When a dashboard queries "CPU over the last 6 months," it reads 1-hour rollup data: only ~4,380 points instead of roughly 1.6 million raw points at 10-second resolution.

Alert Evaluation:

  • An alert evaluator runs queries on a schedule (every 15-60 seconds).
  • Threshold alerts: avg(cpu_usage) > 90% for 5 minutes
  • Anomaly detection: Statistical models detect deviations from historical baselines.
  • Alerts are routed through a notification pipeline: deduplication → grouping → throttling → PagerDuty/Slack/email.
  • On-call schedules and escalation policies ensure the right person gets paged.
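
The "above threshold for 5 minutes" rule can be modeled as a small state machine that remembers when the breach started. The sketch below assumes the evaluator is called once per evaluation cycle with the current query result; the `ThresholdAlert` class is illustrative and leaves out routing, deduplication, and anomaly detection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    threshold: float
    for_seconds: int
    breach_started_at: Optional[int] = None

    def evaluate(self, now: int, value: float) -> bool:
        """Return True once value has stayed above threshold for for_seconds."""
        if value <= self.threshold:
            # Condition cleared: forget any breach in progress.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = now
        return now - self.breach_started_at >= self.for_seconds


alert = ThresholdAlert(threshold=90.0, for_seconds=300)
samples = [(0, 95.0), (150, 96.0), (300, 97.0), (330, 80.0), (360, 95.0)]
states = [alert.evaluate(ts, value) for ts, value in samples]
print(states)
```

Requiring the condition to hold for a duration filters out short spikes, at the cost of delaying the page by that duration.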
▸ Query and alerting pipeline

Storage Tiers

Not all data needs the same performance level. A tiered storage strategy balances cost and speed:

  • Hot tier (recent data, < 2 weeks): Fast SSDs, in-memory caches. Serves real-time dashboards and alerting queries. This is where most queries hit.
  • Warm tier (weeks to months): HDD-based storage or cheaper SSDs. Serves historical dashboards. Data is pre-rolled-up for faster queries.
  • Cold tier (months to year+): Object storage (S3/GCS). Cheapest. Used for compliance, audit, and rare long-range investigations. Queries are slower but acceptable for these use cases.

Compaction is the process of merging many small 2-hour blocks into larger blocks. This reduces the number of files, improves query efficiency (fewer seeks), and applies stronger compression. Similar to LSM-tree compaction in databases.
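
At its core, compacting the blocks of one series is a merge of time-sorted runs. The sketch below works on plain lists of `(timestamp, value)` tuples and assumes that, on duplicate timestamps, the point from the later block wins; real compaction operates on compressed chunks and also rewrites the index, which is omitted here.

```python
import heapq

Point = tuple[int, float]


def compact(blocks: list[list[Point]]) -> list[Point]:
    """Merge several time-sorted blocks into one time-sorted block."""
    merged: list[Point] = []
    # heapq.merge is stable, so for equal timestamps points from earlier
    # blocks are yielded first and are overwritten by later blocks.
    for point in heapq.merge(*blocks, key=lambda p: p[0]):
        if merged and merged[-1][0] == point[0]:
            merged[-1] = point
        else:
            merged.append(point)
    return merged


older = [(0, 1.0), (10, 2.0)]
newer = [(10, 2.5), (20, 3.0)]
print(compact([older, newer]))
```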

Data automatically flows from hot → warm → cold based on age, managed by a background lifecycle service.
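
The age-based placement rule a lifecycle service applies might look like the sketch below. The 2-week hot boundary comes from the tier list; the 90-day warm/cold boundary is an assumption chosen for illustration, since the text only says "weeks to months".

```python
DAY = 86_400
HOT_MAX_AGE = 14 * DAY    # hot tier: data younger than 2 weeks
WARM_MAX_AGE = 90 * DAY   # assumed warm/cold boundary, not given in the text


def tier_for_age(age_seconds: int) -> str:
    """Pick the storage tier for a block based on its age."""
    if age_seconds < HOT_MAX_AGE:
        return "hot"
    if age_seconds < WARM_MAX_AGE:
        return "warm"
    return "cold"


for days in (1, 30, 200):
    print(days, tier_for_age(days * DAY))
```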

▸ Full architecture
Interview tip: Push vs. pull model. Prometheus uses a pull model: it scrapes metrics from service endpoints on a schedule. This gives Prometheus control over ingestion rate and makes it easy to detect if a service is down (scrape fails). Datadog uses a push model: agents on each host push metrics to a collector. Push is better for dynamic/ephemeral environments (containers, serverless) where services come and go. Many production systems use a hybrid: agents push to a local collector, and a central system pulls from collectors.

Key Metrics

  • Ingestion rate: 500K–10M pts/sec (sharded writers with Kafka buffering).
  • Query latency (dashboard): 100ms–1s (pre-rolled-up data, indexed by metric+tags).
  • Storage per year: ~50 TB (Gorilla compression + tiered rollups).
  • Compression ratio: ~12× (delta-of-delta + XOR, Gorilla paper).

Quick check

Why is delta-of-delta encoding particularly effective for metric timestamps?
