Data & Infrastructure13 min read

Design an Object Storage

Store unlimited objects with 11 nines of durability β€” like AWS S3
scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

Object storage systems like AWS S3 let you store virtually unlimited objects β€” files, images, backups, logs β€” with extreme durability. Unlike file systems (hierarchical) or block storage (fixed-size blocks), object storage uses a flat namespace: every object lives in a bucket and is identified by a unique key.

The defining goal: 11 nines of durability (99.999999999%). That means if you store 10 million objects, you'd statistically lose one object every 10,000 years.

Functional Requirements:

  • PUT /bucket/key β€” Upload an object (1 KB to 5 TB).
  • GET /bucket/key β€” Retrieve an object by key.
  • DELETE /bucket/key β€” Remove an object.
  • Bucket creation, listing, and management.
  • Object metadata (content-type, custom headers, timestamps).
  • Versioning β€” keep previous versions of overwritten objects.
  • Access control lists (ACLs) β€” per-bucket and per-object permissions.

Non-Functional Requirements:

  • Durability: 99.999999999% (11 nines). Data must never be lost.
  • Availability: 99.99% β€” the service should almost always be reachable.
  • Unlimited scale: Handle petabytes of data and billions of objects.
  • Strong read-after-write consistency: A GET immediately after a PUT must return the latest version.
β–Έ The idea: buckets, objects, and metadata

Estimation

Let's size this system for a large-scale deployment:

  • Total storage: 100 PB (petabytes) across all customers.
  • Write throughput: 1 million PUT requests/second at peak.
  • Read throughput: 10 million GET requests/second at peak (10:1 read-to-write ratio).
  • Object sizes: Range from 1 KB to 5 TB, with an average of ~1 MB.
  • Number of objects: 100 PB / 1 MB average = ~100 billion objects.
  • Metadata per object: ~1 KB (key, bucket, timestamps, ACLs, version info) = ~100 TB of metadata.

Key insight: the metadata service must handle 11 million QPS (PUT + GET), and the data layer must move massive amounts of bytes. These are very different scaling challenges.

API Design

Object storage APIs are beautifully simple β€” it's essentially a key-value store with HTTP semantics:

Upload Object

EndpointPUT /{bucket}/{key}
HeadersContent-Type, Content-Length, x-amz-meta-* (custom metadata)
BodyObject data (binary)
Response200 OK with ETag (MD5 hash of object)

Download Object

EndpointGET /{bucket}/{key}
HeadersRange (for partial downloads)
Response200 OK with object data

Delete Object

EndpointDELETE /{bucket}/{key}
Response204 No Content

Head Object (metadata only)

EndpointHEAD /{bucket}/{key}
Response200 OK with headers only (no body) β€” content-type, size, ETag, last-modified

For large objects (>5 GB), use multipart upload: initiate upload, upload parts in parallel, complete upload. This enables resumable uploads and parallel transfers.

β–Έ Data path: how objects are stored
Click chart to zoom
PUT flow: the placement service picks 3 data nodes, writes in parallel, waits for quorum (2/3), then records metadata

Durability via Erasure Coding

The naive approach to durability is 3x replication: store three full copies of every object on different nodes. Simple and effective, but it has a huge cost β€” 3x storage overhead. At 100 PB, that's 300 PB of raw storage.

Erasure coding (specifically Reed-Solomon coding) is far more efficient. Here's how it works:

  1. Split an object into k data chunks (e.g., k=4).
  2. Generate m parity chunks using mathematical encoding (e.g., m=2).
  3. Distribute all k+m chunks across different nodes/racks.
  4. To reconstruct the original object, you need any k out of k+m chunks.

With a 4+2 scheme:

  • You can tolerate any 2 node failures without data loss.
  • Storage overhead: (4+2)/4 = 1.5x (vs 3x for replication).
  • At 100 PB, you need 150 PB raw storage instead of 300 PB β€” saving 150 PB!

The trade-off: erasure coding requires more CPU for encoding/decoding and has higher read latency (you must fetch k chunks and reconstruct). That's why many systems use replication for hot/small objects and erasure coding for cold/large objects.

β–Έ Erasure coding vs replication

Metadata Service

The metadata service is the brain of the system. It maps bucket/key to the physical location of chunks on data nodes.

What metadata stores:

  • Bucket name + object key β†’ list of chunk locations (node ID, disk ID, offset)
  • Object size, content type, ETag, creation time
  • Version history (for versioned buckets)
  • ACL and ownership information

Data placement uses consistent hashing to map objects to data nodes. When a node joins or leaves, only ~1/N of keys need to be remapped. A placement service maintains the ring and handles rebalancing.

Garbage collection: When an object is deleted, we don't immediately remove the data chunks. Instead, we mark the metadata as deleted (tombstone) and a background GC process reclaims storage later. This avoids expensive synchronous deletes on the write path and handles edge cases like in-flight reads.

Consistency: For strong read-after-write consistency, the metadata write (after a PUT) must be committed before returning success to the client. The metadata DB can use Raft or Paxos for consensus.

β–Έ Full architecture
Note: Interview tip: Erasure coding is the key differentiator that separates a basic 'replicate everything 3x' answer from a strong one. Mention the storage efficiency (1.5x vs 3x), the trade-off with CPU and read latency, and how real systems use both replication (for hot data) and erasure coding (for cold data).

Key Metrics

PUT (small object)
Placement + parallel writes + metadata update
~10-50 ms \(O(1)\)
GET (cache hit)
Metadata lookup + single node read
~5-20 ms \(O(1)\)
GET (erasure-coded)
Fetch k chunks + reconstruct
~50-200 ms \(O(k)\)
Durability (11 nines)
Erasure coding + cross-rack placement
99.999999999% β€”
Storage efficiency (EC 4+2)
vs 3x for triple replication
1.5x overhead β€”
Total capacity
Horizontally scalable data nodes
100+ PB β€”

Quick check

Why does erasure coding (e.g., 4+2) achieve better storage efficiency than 3x replication while maintaining similar durability?

Continue reading