
Design a Metrics Monitoring System

Collect, store, query, and alert on millions of time-series data points — like Datadog and Prometheus

Scope: Real-World System | Difficulty: Advanced

Understanding the Problem

Modern distributed systems generate an enormous volume of metrics — CPU usage, request latency, error rates, queue depths, and more. A metrics monitoring system collects all this data, stores it as time-series, lets engineers query and visualize it, and fires alerts when things go wrong.

Think Datadog, Prometheus + Grafana, or AWS CloudWatch.

Functional Requirements:

Ingest metrics from thousands of services (push or pull model).
Store time-series data with high write throughput and efficient compression.
Query and aggregate metrics over arbitrary time ranges (avg, sum, percentiles).
Dashboards — real-time and historical visualization of metrics.
Alerting — threshold-based and anomaly-detection alerts with notification routing (PagerDuty, Slack, email).

Non-Functional Requirements:

Write throughput: 10M data points/second ingestion.
Query latency: < 1 second for dashboard queries spanning hours/days.
Availability: 99.9% — alerting must not go down when you need it most.
Retention: 1 year of historical data with automatic downsampling.

▸ The idea: collect → store → query → alert

Estimation

Let's size this system:

10,000 services, each emitting ~500 metrics at a 10-second interval.
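Write rate: 10K × 500 / 10 = 500K writes/second (steady state), peaking at ~2M/s. This leaves headroom under the 10M/s design target.
Data point size: metric name + tags + timestamp + value ≈ 100 bytes on the wire. Once a series is identified, each point is a 16-byte timestamp + value pair, which Gorilla compression shrinks about 12× to ~1.37 bytes.
Daily volume: 500K × 86,400 = 43.2 billion points/day. At 16 B per point that is ~690 GB/day uncompressed, or ~59 GB/day at 1.37 B per point.
1-year retention: 690 GB × 365 ≈ 250 TB uncompressed (~1.6 PB at the 100-byte wire size), or ~22 TB compressed at full resolution. With rollups: raw (2 weeks) + 1min (3 months) + 1hr (1 year), only the latest two weeks stay at full resolution, so a stored footprint on the order of ~50 TB, including rollup aggregates, indexes, and replicas, is a reasonable planning figure.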
Query fanout: A single dashboard panel may query 100+ time-series across multiple partitions.

The core challenge is handling massive write throughput while keeping queries fast across enormous time ranges.
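
The arithmetic above is easy to get wrong by a factor of ten, so it can help to script it. The following is a minimal sketch using only the figures from this section; the constant names are illustrative, and the 16-byte point is assumed to be an 8-byte timestamp plus an 8-byte float.

```python
# Back-of-the-envelope sizing with the figures from the estimation list.
SERVICES = 10_000
METRICS_PER_SERVICE = 500
REPORT_INTERVAL_S = 10
SECONDS_PER_DAY = 86_400

UNCOMPRESSED_POINT_BYTES = 16  # assumed: 8-byte timestamp + 8-byte float
GORILLA_POINT_BYTES = 1.37     # average per point quoted for Gorilla compression

active_series = SERVICES * METRICS_PER_SERVICE
writes_per_second = active_series // REPORT_INTERVAL_S
points_per_day = writes_per_second * SECONDS_PER_DAY


def gigabytes(num_bytes: float) -> float:
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


daily_uncompressed_gb = gigabytes(points_per_day * UNCOMPRESSED_POINT_BYTES)
daily_compressed_gb = gigabytes(points_per_day * GORILLA_POINT_BYTES)

print(f"writes/second:          {writes_per_second:,}")
print(f"points/day:             {points_per_day:,}")
print(f"GB/day at 16 B/point:   {daily_uncompressed_gb:,.1f}")
print(f"GB/day at 1.37 B/point: {daily_compressed_gb:,.1f}")
```

Changing any one input (for example, a 1-second reporting interval) shows immediately how the storage budget scales.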

Time-Series Data Model

Every data point looks like this:

{
  "metric_name": "http.request.latency_ms",
  "tags": {
    "host": "web-server-42",
    "service": "api-gateway",
    "region": "us-east-1",
    "endpoint": "/api/v1/users"
  },
  "timestamp": 1709251200,
  "value": 42.5
}

A unique time-series is defined by (metric_name + sorted tags). All data points for the same series are stored contiguously for efficient compression.
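
To make the "sorted tags" rule concrete, here is a minimal sketch of a canonical series key. The function name `series_key` and the `name{k=v,...}` text format are assumptions for illustration; a real system might hash the key or use a binary encoding instead.

```python
def series_key(metric_name: str, tags: dict[str, str]) -> str:
    """Build a canonical identity: metric name plus tags sorted by tag name."""
    sorted_tags = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{metric_name}{{{sorted_tags}}}"


# The same tags in a different order must map to the same series.
a = series_key("http.request.latency_ms",
               {"host": "web-server-42", "service": "api-gateway"})
b = series_key("http.request.latency_ms",
               {"service": "api-gateway", "host": "web-server-42"})
print(a)
print(a == b)
```

Sorting matters because two agents may serialize the same tag set in different orders; without a canonical form they would create two separate series.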

Compression (Gorilla paper):

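Timestamps: Delta-of-delta encoding. If timestamps arrive every 10s, each delta is 10 and the delta-of-delta is 0, which is stored as a single bit. Small jitter produces small non-zero values that still fit in a few bits.
Values: XOR encoding. Consecutive float values are often similar. The XOR of two similar IEEE 754 floats has many leading and trailing zeros, so only the short run of meaningful bits in between needs to be stored.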
Result: ~1.37 bytes per data point (12× compression from 16 bytes).
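
The two encodings can be illustrated without writing a full bit-packer. The sketch below only computes the quantities that get encoded (the delta-of-delta sequence and the leading/trailing zero counts of the XOR); it does not produce the actual compressed bitstream. The function names are hypothetical.

```python
import struct


def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


def float_bits(value: float) -> int:
    """Reinterpret a float as its 64-bit IEEE 754 bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_window(prev: float, curr: float) -> tuple[int, int, int]:
    """Return (leading_zeros, trailing_zeros, meaningful_bits) of prev XOR curr."""
    x = float_bits(prev) ^ float_bits(curr)
    if x == 0:
        return 64, 0, 0  # identical values: nothing but a marker bit is needed
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1  # index of the lowest set bit
    return leading, trailing, 64 - leading - trailing


# A 10-second series with one late and one early sample.
print(delta_of_deltas([1709251200, 1709251210, 1709251220, 1709251231, 1709251240]))
print(xor_window(42.5, 42.5))
print(xor_window(42.5, 42.75))
```

Regular timestamps yield mostly zeros in the first output, and nearby float values differ in only a handful of bits, which is where the compression comes from.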

Data is organized into blocks (typically 2-hour chunks). Each block is immutable once flushed, making it easy to compact and move between storage tiers.
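
A fixed block width makes it trivial to decide which block a point belongs to. The following sketch assumes blocks are aligned to multiples of two hours since the Unix epoch, which is a common choice but an assumption here; `block_start` is a hypothetical helper.

```python
BLOCK_SECONDS = 2 * 60 * 60  # 2-hour blocks, as described above


def block_start(timestamp: int) -> int:
    """Return the start of the epoch-aligned block containing this timestamp."""
    return timestamp - (timestamp % BLOCK_SECONDS)


ts = 1709251200  # timestamp from the data point example
print(block_start(ts), block_start(ts + 7199), block_start(ts + 7200))
```

Because every writer computes the same boundaries, a block can be sealed and flushed as soon as the clock passes its end, with no coordination between shards.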

▸ Time-series storage: compression and partitioning


Ingestion pipeline: agents push metrics through collectors and Kafka to a sharded writer pool, which writes to a WAL and in-memory blocks before periodic compaction
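
The last hop of that pipeline (sharded writers with a WAL and in-memory blocks) can be sketched in a few lines. Everything here is illustrative: `NUM_SHARDS`, `shard_for`, and `ShardWriter` are hypothetical names, CRC32 routing is just one reasonable hash choice, and the WAL is modeled as a Python list rather than an fsynced file.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 8  # hypothetical writer pool size


def shard_for(series_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a series to a writer; the same series always lands on the same shard."""
    return zlib.crc32(series_key.encode("utf-8")) % num_shards


class ShardWriter:
    """One writer in the pool: log first, then buffer in memory."""

    def __init__(self) -> None:
        self.wal: list[tuple[str, int, float]] = []  # stand-in for an append-only log
        self.head: dict[str, list[tuple[int, float]]] = defaultdict(list)

    def append(self, series_key: str, timestamp: int, value: float) -> None:
        self.wal.append((series_key, timestamp, value))   # record for crash recovery
        self.head[series_key].append((timestamp, value))  # then the in-memory block

    def flush(self) -> dict[str, list[tuple[int, float]]]:
        """Seal the in-memory data into an immutable, time-sorted block."""
        block = {key: sorted(points) for key, points in self.head.items()}
        self.head.clear()
        self.wal.clear()  # the WAL is no longer needed once the block is persisted
        return block


writers = [ShardWriter() for _ in range(NUM_SHARDS)]
key = "cpu_usage{host=web-server-42}"
writers[shard_for(key)].append(key, 1709251200, 42.5)
```

Routing by series key keeps all points of one series on one writer, which is what allows them to be stored contiguously and compressed together.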

Query Engine and Alerting

The query engine supports a PromQL-style language for flexible time-series queries:

# Average request latency by service over the last hour
avg by (service) (
  avg_over_time(http_request_latency_ms[1h])
)

# 99th percentile latency over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_bucket[5m])))

Downsampling is critical for long-range queries:

Raw data (10s resolution) — kept for 2 weeks.
1-minute rollups (avg, min, max, count) — kept for 3 months.
5-minute rollups — kept for 6 months.
1-hour rollups — kept for 1 year+.
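
As an illustration of what a rollup job computes, here is a minimal sketch that folds raw points into fixed-width buckets with the four aggregates listed above. The function name `rollup` and the dictionary layout are assumptions, not a specific product's format.

```python
from collections import defaultdict


def rollup(points: list[tuple[int, float]], bucket_seconds: int = 60) -> dict[int, dict[str, float]]:
    """Aggregate (timestamp, value) points into avg/min/max/count per bucket."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Two minutes of 10-second samples collapse into two 1-minute rollup rows.
raw = [(1709251200 + 10 * i, float(i)) for i in range(12)]
one_minute = rollup(raw)
print(one_minute)
```

Storing `count` alongside `avg` is what lets a coarser rollup be derived from a finer one: the 1-hour average should be weighted by the counts rather than computed as an average of averages.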

When a dashboard queries "CPU over the last 6 months," it reads 1-hour rollup data: only ~4,380 points per series instead of ~1.58 million raw 10-second points.
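
One way a query planner could choose a resolution is sketched below: take the finest tier whose retention covers the requested range and whose point count fits a per-series budget. The retention table mirrors the list above; `pick_resolution` and the 5,000-point budget are hypothetical.

```python
DAY = 86_400

# (resolution in seconds, retention in seconds), finest first.
TIERS = [
    (10, 14 * DAY),      # raw, 2 weeks
    (60, 90 * DAY),      # 1-minute rollups, 3 months
    (300, 180 * DAY),    # 5-minute rollups, 6 months
    (3600, 365 * DAY),   # 1-hour rollups, 1 year
]


def pick_resolution(range_seconds: int, max_points: int = 5_000) -> int:
    """Pick the finest tier that covers the range within the point budget."""
    for resolution, retention in TIERS:
        if range_seconds <= retention and range_seconds // resolution <= max_points:
            return resolution
    return TIERS[-1][0]  # fall back to the coarsest tier


six_months = int(182.5 * DAY)
resolution = pick_resolution(six_months)
print(resolution, six_months // resolution)
```

The point budget reflects that a chart only has so many pixels to draw; returning more points than that costs I/O without improving the picture.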

Alert Evaluation:

An alert evaluator runs queries on a schedule (every 15-60 seconds).
Threshold alerts: avg(cpu_usage) > 90% for 5 minutes
Anomaly detection: Statistical models detect deviations from historical baselines.
Alerts are routed through a notification pipeline: deduplication → grouping → throttling → PagerDuty/Slack/email.
On-call schedules and escalation policies ensure the right person gets paged.
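
The "for 5 minutes" part of a threshold alert is a small state machine: the condition must hold continuously before the alert fires. Below is a minimal sketch of that logic; `ThresholdRule` and its state names are hypothetical, and the evaluator is assumed to call `evaluate` once per scheduled run with the latest query result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdRule:
    threshold: float
    for_seconds: int
    breach_started: Optional[int] = None  # when the condition first became true

    def evaluate(self, now: int, value: float) -> str:
        """Return 'ok', 'pending', or 'firing' for this evaluation tick."""
        if value <= self.threshold:
            self.breach_started = None  # any recovery resets the timer
            return "ok"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.for_seconds:
            return "firing"
        return "pending"


# avg(cpu_usage) > 90 for 5 minutes, evaluated every 60 seconds.
rule = ThresholdRule(threshold=90.0, for_seconds=300)
states = [rule.evaluate(now=t, value=95.0) for t in range(0, 361, 60)]
print(states)
```

The pending state is what suppresses pages for brief spikes; only a sustained breach reaches the notification pipeline.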

▸ Query and alerting pipeline

Storage Tiers

Not all data needs the same performance level. A tiered storage strategy balances cost and speed:

Hot tier (recent data, < 2 weeks): Fast SSDs, in-memory caches. Serves real-time dashboards and alerting queries. This is where most queries hit.
Warm tier (weeks to months): HDD-based storage or cheaper SSDs. Serves historical dashboards. Data is pre-rolled-up for faster queries.
Cold tier (months to year+): Object storage (S3/GCS). Cheapest. Used for compliance, audit, and rare long-range investigations. Queries are slower but acceptable for these use cases.

Compaction merges many small 2-hour blocks into larger blocks. This reduces the number of files, improves query efficiency (fewer seeks), and applies stronger compression, much like LSM-tree compaction in databases.
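
At its core, compaction for a single series is a k-way merge of sorted runs. The sketch below assumes each block is a time-sorted list of (timestamp, value) pairs and keeps one sample per timestamp; which duplicate wins is a policy choice, and here it is simply the last one in merge order. The function name `compact` is hypothetical.

```python
import heapq


def compact(blocks: list[list[tuple[int, float]]]) -> list[tuple[int, float]]:
    """Merge time-sorted blocks into one larger time-sorted block."""
    merged: list[tuple[int, float]] = []
    for ts, value in heapq.merge(*blocks):
        if merged and merged[-1][0] == ts:
            merged[-1] = (ts, value)  # duplicate timestamp: keep a single sample
        else:
            merged.append((ts, value))
    return merged


small_blocks = [
    [(0, 1.0), (10, 2.0)],
    [(20, 3.0), (30, 4.0)],
    [(10, 2.0)],  # a retried write that landed in a separate block
]
big_block = compact(small_blocks)
print(big_block)
```

`heapq.merge` streams the inputs lazily, so the same approach works when blocks are read from disk rather than held in memory.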

Data automatically flows from hot → warm → cold based on age, managed by a background lifecycle service.
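
The lifecycle rule itself can be as simple as a lookup on block age. In this sketch the 2-week hot boundary comes from the tier list above, while the 90-day warm/cold boundary is an assumption picked for illustration; `tier_for_age` is a hypothetical helper.

```python
DAY = 86_400

HOT_MAX_AGE = 14 * DAY   # "< 2 weeks" per the hot tier description
WARM_MAX_AGE = 90 * DAY  # assumed cutoff between warm and cold


def tier_for_age(age_seconds: int) -> str:
    """Map a block's age to the storage tier it should live in."""
    if age_seconds < HOT_MAX_AGE:
        return "hot"
    if age_seconds < WARM_MAX_AGE:
        return "warm"
    return "cold"


print(tier_for_age(3 * DAY), tier_for_age(30 * DAY), tier_for_age(200 * DAY))
```

A background job would periodically compare each block's current tier with this result and move the blocks that have aged out.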

▸ Full architecture

Interview tip: Push vs. pull model. Prometheus uses a pull model: it scrapes metrics from service endpoints on a schedule. This gives Prometheus control over the ingestion rate and makes it easy to detect that a service is down (the scrape fails). Datadog uses a push model: agents on each host push metrics to a collector. Push is better for dynamic or ephemeral environments (containers, serverless) where services come and go. Many production systems use a hybrid: agents push to a local collector, and a central system pulls from collectors.
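
The "scrape fails, so the service is down" property of the pull model can be shown with a toy scraper. In this sketch each target is modeled as a plain Python callable instead of an HTTP endpoint, and `scrape_all` is a hypothetical function; the synthetic `up` series follows the convention Prometheus uses for scrape health.

```python
from typing import Callable


def scrape_all(targets: dict[str, Callable[[], dict[str, float]]]) -> dict[str, dict[str, float]]:
    """Pull metrics from every target and record whether each scrape succeeded."""
    results: dict[str, dict[str, float]] = {}
    for name, fetch in targets.items():
        try:
            metrics = dict(fetch())
            metrics["up"] = 1.0
        except Exception:
            metrics = {"up": 0.0}  # a failed scrape is itself a signal
        results[name] = metrics
    return results


def healthy() -> dict[str, float]:
    return {"cpu_usage": 42.0}


def unreachable() -> dict[str, float]:
    raise ConnectionError("connection refused")


snapshot = scrape_all({"api-gateway": healthy, "db": unreachable})
print(snapshot)
```

With push, the collector has to infer an outage from the absence of data, which is harder to distinguish from a service that was intentionally scaled down.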