Design a Metrics Monitoring System
Understanding the Problem
Modern distributed systems generate an enormous volume of metrics β CPU usage, request latency, error rates, queue depths, and more. A metrics monitoring system collects all this data, stores it as time-series, lets engineers query and visualize it, and fires alerts when things go wrong.
Think Datadog, Prometheus + Grafana, or AWS CloudWatch.
Functional Requirements:
- Ingest metrics from thousands of services (push or pull model).
- Store time-series data with high write throughput and efficient compression.
- Query and aggregate metrics over arbitrary time ranges (avg, sum, percentiles).
- Dashboards β real-time and historical visualization of metrics.
- Alerting β threshold-based and anomaly-detection alerts with notification routing (PagerDuty, Slack, email).
Non-Functional Requirements:
- Write throughput: 10M data points/second ingestion.
- Query latency: < 1 second for dashboard queries spanning hours/days.
- Availability: 99.9% β alerting must not go down when you need it most.
- Retention: 1 year of historical data with automatic downsampling.
Estimation
Let's size this system:
- 10,000 services, each emitting ~500 metrics at a 10-second interval.
- Write rate: 10K Γ 500 / 10 = 500K writes/second (steady state). Peak ~2M/s.
- Data point size: metric name + tags + timestamp + value β 100 bytes raw, ~16 bytes compressed (Gorilla compression achieves 12Γ compression).
- Daily raw data: 500K Γ 86,400 Γ 16B = ~690 GB/day compressed.
- 1-year retention: 690 GB Γ 365 = ~250 TB compressed. With rollups: raw (2 weeks) + 1min (3 months) + 1hr (1 year) β ~15 PB raw equivalent, ~50 TB stored.
- Query fanout: A single dashboard panel may query 100+ time-series across multiple partitions.
The core challenge is handling massive write throughput while keeping queries fast across enormous time ranges.
Time-Series Data Model
Every data point looks like this:
{
"metric_name": "http.request.latency_ms",
"tags": {
"host": "web-server-42",
"service": "api-gateway",
"region": "us-east-1",
"endpoint": "/api/v1/users"
},
"timestamp": 1709251200,
"value": 42.5
}A unique time-series is defined by (metric_name + sorted tags). All data points for the same series are stored contiguously for efficient compression.
Compression (Gorilla paper):
- Timestamps: Delta-of-delta encoding. If timestamps come every 10s, the delta is always ~10. The delta-of-delta is ~0, which needs only 1 bit!
- Values: XOR encoding. Consecutive float values are often similar. XOR of two similar IEEE 754 floats has many leading zeros, which compress well.
- Result: ~1.37 bytes per data point (12Γ compression from 16 bytes).
Data is organized into blocks (typically 2-hour chunks). Each block is immutable once flushed, making it easy to compact and move between storage tiers.
Query Engine and Alerting
The query engine supports a PromQL-style language for flexible time-series queries:
# Average request latency by service over the last hour
avg by (service) (
rate(http_request_latency_ms[5m])
)
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_bucket[5m]))Downsampling is critical for long-range queries:
- Raw data (10s resolution) β kept for 2 weeks.
- 1-minute rollups (avg, min, max, count) β kept for 3 months.
- 5-minute rollups β kept for 6 months.
- 1-hour rollups β kept for 1 year+.
When a dashboard queries "CPU over the last 6 months," it reads 1-hour rollup data β only ~4,380 points instead of 15.5 million raw points.
Alert Evaluation:
- An alert evaluator runs queries on a schedule (every 15-60 seconds).
- Threshold alerts:
avg(cpu_usage) > 90% for 5 minutes - Anomaly detection: Statistical models detect deviations from historical baselines.
- Alerts are routed through a notification pipeline: deduplication β grouping β throttling β PagerDuty/Slack/email.
- On-call schedules and escalation policies ensure the right person gets paged.
Storage Tiers
Not all data needs the same performance level. A tiered storage strategy balances cost and speed:
- Hot tier (recent data, < 2 weeks): Fast SSDs, in-memory caches. Serves real-time dashboards and alerting queries. This is where most queries hit.
- Warm tier (weeks to months): HDD-based storage or cheaper SSDs. Serves historical dashboards. Data is pre-rolled-up for faster queries.
- Cold tier (months to year+): Object storage (S3/GCS). Cheapest. Used for compliance, audit, and rare long-range investigations. Queries are slower but acceptable for these use cases.
Compaction is the process of merging many small 2-hour blocks into larger blocks. This reduces the number of files, improves query efficiency (fewer seeks), and applies stronger compression. Similar to LSM-tree compaction in databases.
Data automatically flows from hot β warm β cold based on age, managed by a background lifecycle service.