
Design a Video Streaming Service

Upload, transcode, and stream video to millions, like YouTube
Scope: Real-World System | Difficulty: Advanced

Understanding the Problem

We're designing a video streaming service like YouTube or Netflix. Users upload videos, the system transcodes them to multiple resolutions, and millions of viewers stream them with minimal buffering.

Functional Requirements:

  • Upload videos: Creators upload raw video files (potentially gigabytes).
  • Stream videos: Viewers watch videos with smooth playback.
  • Search & discovery: Users can search for videos by title, tags, and description.
  • Recommendations: Suggest relevant videos based on watch history and preferences.

Non-Functional Requirements:

  • Low buffering: Video should start playing within 2 seconds. Rebuffering ratio < 1%.
  • Adaptive bitrate: Video quality adjusts automatically based on the viewer's bandwidth, with no manual 720p/1080p switching needed.
  • Global delivery: Viewers worldwide should get fast, consistent playback via CDN edge servers.
  • Fault tolerance: Partial failures shouldn't stop playback. The system degrades gracefully (lower quality, not a black screen).
▸ The idea: upload → transcode → stream

Estimation

Let's size the system:

  • 500M daily active users, each watching an average of 5 videos/day
  • 5M uploads/day, with an average raw video size of ~200 MB
  • Transcoding: Each video is transcoded to 4 resolutions (360p, 480p, 720p, 1080p). Total output per video ≈ 1.5× raw size = ~300 MB of transcoded variants.
  • Daily upload storage: 5M × 200 MB = 1 PB/day raw, plus 1.5 PB/day transcoded
  • Concurrent streams: ~1M peak concurrent viewers
  • Bandwidth: 1M streams × 4 Mbps (720p avg) = 4 Tbps of egress bandwidth

This is a storage-heavy, bandwidth-heavy system. The key challenges are efficient transcoding, smart CDN caching, and adaptive streaming.
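
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-envelope math; it assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), and the variable names are illustrative.

```python
# Back-of-envelope sizing using the figures quoted in this section.
MB = 10**6
PB = 10**15

uploads_per_day = 5_000_000
raw_size_bytes = 200 * MB
transcode_ratio = 1.5          # transcoded output is ~1.5x the raw size

raw_per_day = uploads_per_day * raw_size_bytes
transcoded_per_day = raw_per_day * transcode_ratio

peak_streams = 1_000_000
avg_bitrate_bps = 4 * 10**6    # 4 Mbps, roughly a 720p stream
egress_bps = peak_streams * avg_bitrate_bps

print(f"raw storage/day:        {raw_per_day / PB:.2f} PB")
print(f"transcoded storage/day: {transcoded_per_day / PB:.2f} PB")
print(f"peak egress:            {egress_bps / 10**12:.1f} Tbps")
```

Changing any single input (for example, the average bitrate) shows quickly how sensitive the storage and egress totals are to that assumption.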

API Design

Video streaming uses specialized protocols beyond simple REST:

Upload (Chunked)

Endpoint: POST /api/v1/upload/init
Request: {"title": "My Video", "description": "...", "tags": ["tech"]}
Response: {"upload_id": "abc123", "chunk_size": 5242880}

Endpoint: PUT /api/v1/upload/:upload_id/chunk/:n
Body: Binary chunk data (5 MB per chunk)
Response: {"chunk_n": 3, "status": "received"}

Streaming (HLS/DASH)

Manifest: GET /videos/:id/manifest.m3u8 (HLS) or .mpd (DASH)
Segment: GET /videos/:id/segment/:quality/:n.ts

Why chunked upload? Large videos (1 GB+) are impractical to upload in a single request: network drops, timeouts, and memory limits all get in the way. Chunking enables resumable uploads: if the connection drops after chunk 47 of 200 has been acknowledged, you resume from chunk 48 instead of starting over.
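
As a rough illustration of the resume logic, the sketch below splits a file into chunk byte ranges and works out which chunks still need to be sent. It assumes chunks are numbered from 1 and that the client can learn which chunks the server has acknowledged; the API above does not define such a status call, so `acked` is a hypothetical input, as are the function names.

```python
import math

CHUNK_SIZE = 5_242_880  # bytes, the chunk_size returned by /upload/init above

def chunk_ranges(file_size: int, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_number, start_byte, end_byte_exclusive) for every chunk."""
    total = math.ceil(file_size / chunk_size)
    for n in range(1, total + 1):          # assumption: chunks are 1-indexed
        start = (n - 1) * chunk_size
        yield n, start, min(start + chunk_size, file_size)

def chunks_to_send(file_size: int, acked: set[int], chunk_size: int = CHUNK_SIZE):
    """Return the chunk numbers the server has not acknowledged yet."""
    return [n for n, _, _ in chunk_ranges(file_size, chunk_size) if n not in acked]

# A 1 GB file where chunks 1..47 were acknowledged before the connection dropped.
file_size = 1_000_000_000
pending = chunks_to_send(file_size, acked=set(range(1, 48)))
print(pending[0], len(pending))
```

With 5,242,880-byte chunks a 1 GB file needs 191 chunks, so the resumed upload starts at chunk 48 and sends the remaining 144.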

▸ Upload & transcoding pipeline
Upload flow: chunked upload → object store → async transcode → CDN distribution
▸ Adaptive bitrate streaming

HLS / DASH: How Video Streaming Actually Works

Modern video streaming doesn't send one giant file. Instead, it uses adaptive bitrate streaming (ABR):

  1. Segmentation: Each transcoded video is split into small segments (2–10 seconds each). With 4-second segments, a 10-minute video at 4 quality levels is ~600 segments in total (150 per level).
  2. Manifest file: An .m3u8 (HLS) or .mpd (DASH) file lists all available quality levels and their segment URLs. The client downloads this first.
  3. Adaptive switching: The player monitors download speed in real time. Slow connection? Switch to 360p segments. Fast WiFi? Jump to 1080p. The switch happens at a segment boundary, so playback continues without a restart.
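
The selection step can be sketched as a small function: pick the highest rendition whose bitrate fits under the measured throughput, with a safety margin. This is a simplified throughput-based heuristic, not how any particular player implements ABR; the bitrate ladder and the 0.8 safety factor are assumptions chosen for illustration.

```python
# Hypothetical bitrate ladder (bits per second); the numbers are illustrative.
LADDER = {"360p": 800_000, "480p": 1_400_000, "720p": 4_000_000, "1080p": 8_000_000}

def pick_quality(throughput_bps: float, safety: float = 0.8) -> str:
    """Pick the highest rendition whose bitrate fits within a safety margin."""
    budget = throughput_bps * safety
    fitting = [q for q, bps in LADDER.items() if bps <= budget]
    # Fall back to the lowest rendition rather than stalling playback.
    return max(fitting, key=LADDER.get) if fitting else min(LADDER, key=LADDER.get)

def segment_count(duration_s: int, segment_s: int, renditions: int) -> int:
    """Total segments across all renditions (ceiling division per rendition)."""
    return -(-duration_s // segment_s) * renditions

print(pick_quality(6_000_000))    # measured ~6 Mbps
print(segment_count(600, 4, 4))   # 10-minute video, 4 s segments, 4 renditions
```

The player would call something like `pick_quality` before requesting each segment, which is why a quality change only takes effect at the next segment boundary.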

HLS (HTTP Live Streaming): Developed by Apple. Uses .m3u8 playlists and .ts segments. Very widely supported across browsers and devices.

DASH (Dynamic Adaptive Streaming over HTTP): Open standard. Uses .mpd manifests and .m4s segments. More flexible but slightly less browser support.

Both work over standard HTTP/HTTPS, so no special streaming protocol is needed. Because every segment is an ordinary HTTP object, a CDN can cache video the same way it caches any other static file.
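
To make the manifest idea concrete, here is a minimal sketch that builds an HLS master playlist listing one variant per quality level. `#EXTM3U` and `#EXT-X-STREAM-INF` (with `BANDWIDTH` and `RESOLUTION` attributes) are standard HLS tags; the bandwidth values and the per-rendition playlist URL layout are assumptions, not part of the API described above.

```python
# Hypothetical rendition ladder: (name, peak bandwidth in bits/s, resolution).
RENDITIONS = [
    ("360p", 800_000, "640x360"),
    ("480p", 1_400_000, "854x480"),
    ("720p", 4_000_000, "1280x720"),
    ("1080p", 8_000_000, "1920x1080"),
]

def master_playlist(video_id: str) -> str:
    """Build an HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for name, bandwidth, resolution in RENDITIONS:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        # Assumed URL layout for the per-rendition media playlist.
        lines.append(f"/videos/{video_id}/{name}/playlist.m3u8")
    return "\n".join(lines) + "\n"

print(master_playlist("abc123"))
```

Each variant URL would point to a media playlist that lists the actual segments for that quality level; the player reads the master playlist once and then switches between variants as bandwidth changes.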

▸ Full architecture: upload + streaming paths

CDN Strategy: The Key to Global Streaming

Serving video at scale is fundamentally a CDN problem. Here's the strategy:

Popular content (top 10%): Pre-warmed on edge servers worldwide. When a video goes viral, it's already cached close to viewers. This handles ~80% of all views.

Long-tail content (bottom 90%): Fetched on-demand from the origin. First viewer experiences a cold cache miss (~500ms extra latency), then subsequent viewers in that region get it from edge cache.

Multi-tier caching:

  • L1 (Edge PoPs, 200+ locations): Closest to users. Cache hot segments only.
  • L2 (Regional hubs, 20-30 locations): Larger capacity. Cache warm + lukewarm content.
  • L3 (Origin): Object store (S3) with all content. Only hit on full cache misses.
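
The lookup order across the three tiers can be modeled with a toy class. This is a sketch only: plain dictionaries stand in for the edge and regional caches, there is no eviction or TTL handling, and the class and key names are hypothetical.

```python
class TieredCache:
    """Toy model of edge -> regional -> origin lookup with fill on miss."""

    def __init__(self, origin: dict[str, bytes]):
        self.edge: dict[str, bytes] = {}
        self.regional: dict[str, bytes] = {}
        self.origin = origin

    def get(self, key: str) -> tuple[bytes, str]:
        """Return (segment bytes, name of the tier that served it)."""
        if key in self.edge:
            return self.edge[key], "edge"
        if key in self.regional:
            self.edge[key] = self.regional[key]   # fill the edge on the way back
            return self.edge[key], "regional"
        data = self.origin[key]                   # KeyError if the segment does not exist
        self.regional[key] = data
        self.edge[key] = data
        return data, "origin"

cache = TieredCache({"v1/720p/1.ts": b"segment-bytes"})
print(cache.get("v1/720p/1.ts")[1])   # first request goes to the origin
print(cache.get("v1/720p/1.ts")[1])   # repeat request is served from the edge
```

A real CDN adds capacity limits and eviction at each tier, which is what keeps only hot segments at the edge and warmer content at the regional hubs.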

Pre-warming: When a channel with 10M subscribers uploads a new video, don't wait for cache misses. Proactively push transcoded segments to edge PoPs in regions where subscribers are concentrated.
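
One way to decide where to push a new upload is to pick the smallest set of regions that covers most of the channel's subscribers. The sketch below is an assumed heuristic for illustration, not a description of how any CDN does it; the region names, the subscriber counts, and the 80% coverage target are all hypothetical.

```python
def regions_to_prewarm(subscribers_by_region: dict[str, int], coverage: float = 0.8) -> list[str]:
    """Pick the fewest regions that together hold `coverage` of subscribers."""
    total = sum(subscribers_by_region.values())
    chosen, covered = [], 0
    # Visit regions from most to fewest subscribers.
    for region, count in sorted(subscribers_by_region.items(), key=lambda kv: kv[1], reverse=True):
        if total == 0 or covered / total >= coverage:
            break
        chosen.append(region)
        covered += count
    return chosen

subs = {"us-east": 5_000_000, "eu-west": 3_000_000, "ap-south": 1_500_000, "sa-east": 500_000}
print(regions_to_prewarm(subs))
```

Regions left out of the list would still be served through the normal on-demand path, paying a cache miss for the first viewer only.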

Segment-level caching: Because videos are split into 2-10s segments, the CDN can cache at segment granularity. The first 30 seconds of a video (segments 1-5 at 6 seconds per segment) get cached aggressively since most viewers watch at least that much.
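
A segment-aware cache policy might look like the following sketch, which gives the opening segments a longer edge TTL than the rest. The 6-second segment length, the 30-second hot window, and both TTL values are assumptions picked for illustration.

```python
def segment_ttl_seconds(segment_index: int, segment_s: int = 6, hot_window_s: int = 30) -> int:
    """Longer edge TTL for segments inside the opening window of a video."""
    starts_at = (segment_index - 1) * segment_s   # segments are numbered from 1
    # Hypothetical TTLs: 24 hours for the opening window, 1 hour otherwise.
    return 24 * 3600 if starts_at < hot_window_s else 3600

for n in (1, 5, 6):
    print(n, segment_ttl_seconds(n))
```

With these defaults, segments 1 through 5 fall inside the hot window and segment 6 onward gets the shorter TTL.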

Interview tip: Video streaming has unique challenges that differentiate it from other system designs. Key talking points: (1) chunked/resumable uploads for large files, (2) async transcoding pipeline with multiple output resolutions, (3) HLS/DASH adaptive bitrate, including how the manifest works, (4) CDN as the primary serving layer with multi-tier caching. Mentioning segment-level caching and pre-warming shows deep understanding.

Key Metrics

  • Upload + transcode latency: 2–10 min. Depends on video length and resolution targets.
  • Stream startup time: < 2s, \(O(1)\). Manifest fetch + first segment from CDN edge.
  • Daily raw upload storage: ~1 PB/day. 5M uploads × 200 MB average.
  • Peak CDN egress bandwidth: ~4 Tbps. 1M concurrent streams × 4 Mbps avg.
  • CDN cache hit rate (popular): > 95%. Top 10% content pre-warmed on edge.
  • Adaptive bitrate switch: < 1 segment, \(O(1)\). Quality changes at next segment boundary.

Quick check

Why are videos split into small segments (2–10 seconds) instead of streamed as a single file?
