
Design a Paste Service

Store and share text snippets, like Pastebin, at scale
Scope: Real-World System | Difficulty: Intermediate

Understanding the Problem

Paste services like Pastebin, GitHub Gist, and Hastebin let users store text snippets and share them via a short URL. Let's design one from scratch.

First, let's clarify the requirements:

Functional Requirements:

  • Users can create a paste by submitting text content and receive a unique URL.
  • Users can retrieve paste content by visiting the unique URL.
  • Pastes can optionally have an expiration time (e.g., 10 minutes, 1 hour, never).
  • Optional: syntax highlighting for popular programming languages.
  • Optional: custom short URLs (aliases).

Non-Functional Requirements:

  • Read-heavy: More reads than writes. Estimated ratio ~5:1.
  • Low latency: Paste retrieval should be fast (< 50ms for cached content).
  • High availability: The service should be highly available, because shared links must always work.
  • Durability: Once created, paste content must not be lost until it expires.
  • Size limits: Individual pastes capped at ~10 MB to prevent abuse.
▸ The idea: text → unique URL

Estimation

Let's size this system:

  • 5M new pastes per month (writes): ~2 pastes/second
  • Read:Write ratio 5:1: ~10 reads/second, peak ~30/s
  • Average paste size: ~10 KB (most pastes are small code snippets or config files)
  • Storage per month: 5M × 10 KB = 50 GB/month
  • Storage for 5 years: 50 GB × 60 months = 3 TB
  • Metadata per paste: ~500 bytes (key, title, language, timestamps, user ID)
  • Metadata storage (5 years): 300M pastes × 500 B = 150 GB
  • Cache: Cache the top 20% hot pastes. 20% × 3 TB = ~600 GB (or cache metadata only: ~30 GB)
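
The arithmetic in this list can be reproduced with a short script. This is only a sketch: the inputs are the figures stated above, while the 30-day month and decimal units (1 KB = 1000 bytes) are assumptions made to keep the numbers round.

```python
# Back-of-envelope sizing using the figures from the estimation list.
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumption: 30-day month
GB, TB = 10**9, 10**12               # assumption: decimal units

pastes_per_month = 5_000_000
read_write_ratio = 5
avg_paste_bytes = 10 * 1000          # ~10 KB average paste
metadata_bytes = 500                 # ~500 bytes of metadata per paste
months = 5 * 12                      # 5-year horizon
hot_percent = 20                     # cache the hottest 20%

writes_per_sec = pastes_per_month / SECONDS_PER_MONTH
reads_per_sec = writes_per_sec * read_write_ratio

total_pastes = pastes_per_month * months
content_bytes = total_pastes * avg_paste_bytes
metadata_total_bytes = total_pastes * metadata_bytes

cache_full_bytes = content_bytes * hot_percent // 100
cache_metadata_bytes = metadata_total_bytes * hot_percent // 100

print(f"writes/s: {writes_per_sec:.2f}, reads/s: {reads_per_sec:.2f}")
print(f"content: {content_bytes / TB:.1f} TB, metadata: {metadata_total_bytes / GB:.0f} GB")
print(f"cache (content): {cache_full_bytes / GB:.0f} GB, "
      f"cache (metadata only): {cache_metadata_bytes / GB:.0f} GB")
```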

The paste content is the big storage cost. Storing it in an object store (like S3) is far cheaper than a relational database. Metadata stays in a traditional database.

API Design

Simple REST API:

Create Paste

Endpoint: POST /api/v1/pastes
Request: {"content": "print('hello')", "title": "My Snippet", "language": "python", "expires_in": 3600}
Response: {"key": "abc123", "url": "https://paste.ly/abc123", "created_at": "...", "expires_at": "..."}
Status: 201 Created

Get Paste

Endpoint: GET /api/v1/pastes/:key
Response: {"key": "abc123", "content": "print('hello')", "title": "My Snippet", "language": "python", "created_at": "...", "expires_at": "...", "views": 42}
Status: 200 OK or 404 Not Found (if expired or deleted)

Content vs Metadata separation: The API returns both metadata and content together, but internally they are stored separately. Metadata (title, language, timestamps) lives in a database for fast queries. The actual paste content lives in an object store (S3) for cost-effective, durable storage.
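
To make the Create Paste contract concrete, here is a minimal sketch of how a handler might validate the request body and build the response. The function name `create_paste_response` is hypothetical, and the 400 and 413 error statuses are assumptions, since the API description above only specifies 201. The key is passed in because key generation happens elsewhere.

```python
from datetime import datetime, timedelta, timezone

MAX_CONTENT_BYTES = 10 * 1024 * 1024   # the ~10 MB cap from the requirements
BASE_URL = "https://paste.ly"          # host taken from the example response


def create_paste_response(body: dict, key: str, now: datetime) -> tuple[int, dict]:
    """Validate a POST /api/v1/pastes body and return (status, JSON payload)."""
    content = body.get("content")
    if not isinstance(content, str) or content == "":
        return 400, {"error": "content is required"}          # assumed status
    if len(content.encode("utf-8")) > MAX_CONTENT_BYTES:
        return 413, {"error": "content exceeds size limit"}   # assumed status

    expires_in = body.get("expires_in")   # seconds; missing means "never"
    if expires_in is not None and (not isinstance(expires_in, int) or expires_in <= 0):
        return 400, {"error": "expires_in must be a positive integer"}

    expires_at = now + timedelta(seconds=expires_in) if expires_in is not None else None
    return 201, {
        "key": key,
        "url": f"{BASE_URL}/{key}",
        "created_at": now.isoformat(),
        "expires_at": expires_at.isoformat() if expires_at else None,
    }


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
request = {"content": "print('hello')", "title": "My Snippet",
           "language": "python", "expires_in": 3600}
status, payload = create_paste_response(request, "abc123", now)
print(status, payload)
```

Passing `now` explicitly keeps the function deterministic and easy to test; a real handler would read the clock itself.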

▸ Write flow: creating a paste
Write path: the API grabs a pre-generated key, stores content in S3, metadata in DB, and caches it
▸ Read flow: retrieving a paste
Read path: metadata from cache/DB, content from object store. Hot pastes benefit from CDN caching too.
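
The two flows can be sketched with in-memory stand-ins. Everything here is illustrative: the `PasteService` class and its attribute names are hypothetical, and plain dictionaries take the place of S3, the metadata database, and the cache. The read path uses a cache-aside pattern: check the cache, fall back to the database, then backfill the cache.

```python
class PasteService:
    """Toy model of the write and read paths; dicts stand in for real stores."""

    def __init__(self, key_pool):
        self.key_pool = list(key_pool)   # stand-in for pre-generated keys
        self.object_store = {}           # stand-in for S3: key -> content
        self.metadata_db = {}            # stand-in for the metadata database
        self.cache = {}                  # stand-in for the metadata cache
        self.db_reads = 0                # counts cache misses that hit the DB

    def create(self, content, title=None, language=None):
        key = self.key_pool.pop()                 # 1. grab a pre-generated key
        self.object_store[key] = content          # 2. store content in the object store
        meta = {
            "key": key,
            "title": title,
            "language": language,
            "content_size": len(content.encode("utf-8")),
        }
        self.metadata_db[key] = meta              # 3. write metadata to the DB
        self.cache[key] = meta                    # 4. warm the cache
        return key

    def get(self, key):
        meta = self.cache.get(key)                # 1. try the cache first
        if meta is None:
            self.db_reads += 1
            meta = self.metadata_db.get(key)      # 2. fall back to the DB
            if meta is None:
                return None                       # caller would return 404
            self.cache[key] = meta                # 3. backfill the cache
        return {**meta, "content": self.object_store[key]}


service = PasteService(["abc123"])
key = service.create("print('hello')", title="My Snippet", language="python")
print(service.get(key))
```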

Storage Design

The key insight: separate metadata from content.

Metadata (SQL or NoSQL database):

  • paste_key (VARCHAR 8, PRIMARY KEY): Unique short key
  • title (VARCHAR 255): Optional title
  • language (VARCHAR 32): Syntax highlighting language
  • s3_path (VARCHAR 255): Path to content in object store
  • content_size (INT): Size in bytes
  • user_id (BIGINT): Creator (nullable for anonymous)
  • created_at (TIMESTAMP)
  • expires_at (TIMESTAMP, nullable): Null means never expires
  • view_count (BIGINT): Number of reads
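
As a rough illustration, the metadata columns above could be declared as follows. The sketch uses Python's built-in `sqlite3` module purely so it runs anywhere. The table name `pastes`, the `NOT NULL` constraints, the default for `view_count`, and the index on `expires_at` are assumptions that go beyond the column list.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pastes (
    paste_key    VARCHAR(8) PRIMARY KEY,      -- unique short key
    title        VARCHAR(255),                -- optional
    language     VARCHAR(32),                 -- syntax highlighting language
    s3_path      VARCHAR(255) NOT NULL,       -- path to content in object store
    content_size INTEGER NOT NULL,            -- size in bytes
    user_id      BIGINT,                      -- NULL for anonymous pastes
    created_at   TIMESTAMP NOT NULL,
    expires_at   TIMESTAMP,                   -- NULL means never expires
    view_count   BIGINT NOT NULL DEFAULT 0
);
-- Assumed index so a cleanup job can find expired rows without a full scan.
CREATE INDEX idx_pastes_expires_at ON pastes (expires_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO pastes (paste_key, s3_path, content_size, created_at) "
    "VALUES (?, ?, ?, ?)",
    ("abc123", "pastes/abc123", 14, "2024-01-01T00:00:00Z"),
)
row = conn.execute(
    "SELECT paste_key, expires_at, view_count FROM pastes"
).fetchone()
print(row)
```

Note that the row holds only a pointer (`s3_path`) to the content, never the content itself.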

Content (object store such as S3, GCS, or MinIO):

  • Stored as plain text files keyed by paste_key
  • Object stores are optimized for large blob storage and are much cheaper per GB than databases
  • Built-in durability (11 nines for S3), replication, and CDN integration

Why not just store content in the database? At 10 KB average per paste, 300M pastes would mean 3 TB of BLOB data in your database. This bloats the DB, makes backups slow, and is 10-50x more expensive than object storage. The database should only handle the small, structured metadata.

▸ Full architecture

Expiration & Cleanup

Expiration: When a paste is created with expires_in, compute expires_at and store it in metadata. On reads, check expires_at: if it is in the past, return 404. This is a lazy deletion approach.

Active cleanup: Run an async cleanup worker on a schedule (e.g., every hour). It queries for expired pastes, deletes the S3 content, and removes the metadata row. The paste key can optionally be returned to the key generation service (KGS) pool for reuse.
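
Both mechanisms are small enough to sketch together. The function names below are hypothetical, and dictionaries stand in for the metadata database and the object store. The sweep deletes content before metadata, matching the order described above.

```python
from datetime import datetime, timedelta, timezone


def is_expired(meta, now):
    """A paste with expires_at of None never expires."""
    expires_at = meta["expires_at"]
    return expires_at is not None and expires_at <= now


def read_paste(metadata_db, object_store, key, now):
    """Lazy deletion: an expired paste looks missing even before cleanup runs."""
    meta = metadata_db.get(key)
    if meta is None or is_expired(meta, now):
        return None   # the API layer would map this to 404
    return object_store[key]


def cleanup_expired(metadata_db, object_store, now):
    """Active cleanup: remove expired content, then its metadata row."""
    expired_keys = [k for k, m in metadata_db.items() if is_expired(m, now)]
    for key in expired_keys:
        object_store.pop(key, None)   # delete the stored content first
        del metadata_db[key]          # then remove the metadata row
    return expired_keys               # keys that could go back to the pool


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
metadata_db = {
    "old00001": {"expires_at": now - timedelta(hours=1)},
    "keep0001": {"expires_at": None},
}
object_store = {"old00001": "stale text", "keep0001": "fresh text"}

print(read_paste(metadata_db, object_store, "old00001", now))
print(cleanup_expired(metadata_db, object_store, now))
```

In a real deployment the list comprehension would be a database query on expires_at, run in batches.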

Rate Limiting: Prevent abuse with per-IP and per-user rate limits. Use a token bucket or sliding window. Typical limits: 10 pastes/minute for anonymous, 60/minute for authenticated users. Also enforce the 10 MB content size limit at the API gateway level.
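
A token bucket fits in a few lines. This is a minimal single-process sketch with a hypothetical `TokenBucket` class; a production limiter would keep one bucket per IP or user in a shared store. Time is passed in explicitly so the behavior is easy to follow. In practice you would use a monotonic clock.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_sec` tokens/second."""

    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1            # spend one token for this request
            return True
        return False                    # caller would reject the request


# Anonymous limit from the text: 10 pastes per minute.
bucket = TokenBucket(capacity=10, refill_per_sec=10 / 60, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(11)]
print(burst)                   # ten allowed requests, then one rejection
print(bucket.allow(now=7.0))   # one token has refilled after 7 seconds
```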

Interview tip: The key differentiator of a paste service vs a URL shortener is the storage strategy. URL shorteners store tiny data (a URL) in a database. Paste services store large blobs (up to 10 MB), which is why you need an object store. Always mention this separation of metadata and content to show you understand cost-effective architecture.

Key Metrics

  • Create paste: KGS key + S3 upload + DB metadata write. ~50 ms, \(O(1)\)
  • Read paste (cache hit): Redis metadata + S3 content fetch. ~5-10 ms, \(O(1)\)
  • Read paste (cache miss): DB metadata + S3 fetch + cache backfill. ~30-50 ms, \(O(1)\)
  • 8-char key space: 62^8 unique paste keys. ~218 trillion
  • Storage (5 years): 5M pastes/month × 10 KB avg. ~3 TB
  • Metadata storage (5 years): 300M pastes × 500 B metadata. ~150 GB
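
The key-space figure is easy to verify, and the same alphabet can be used to pre-generate keys. The sketch below is one possible approach, not the design's stated method: it assumes keys are drawn at random from the 62 alphanumeric characters, and the function names are hypothetical.

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_letters   # 0-9, a-z, A-Z: 62 characters
KEY_LENGTH = 8


def random_key():
    """Draw one 8-character key uniformly at random from the alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))


def pregenerate(count):
    """Fill a pool with `count` distinct keys; the set discards duplicates."""
    pool = set()
    while len(pool) < count:
        pool.add(random_key())
    return pool


key_space = len(ALPHABET) ** KEY_LENGTH
print(f"{key_space:,}")        # 218,340,105,584,896, about 218 trillion

pool = pregenerate(1000)
print(len(pool), next(iter(pool)))
```

With 300M pastes over five years, only a tiny fraction of the key space is ever used, so random collisions are rare and cheap to retry.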

Quick check

Why should paste content be stored in an object store (S3) rather than directly in the database?
