Data & Infrastructure13 min read

Design an Object Storage

Store unlimited objects with 11 nines of durability — like AWS S3

scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

Object storage systems like AWS S3 let you store virtually unlimited objects — files, images, backups, logs — with extreme durability. Unlike file systems (hierarchical) or block storage (fixed-size blocks), object storage uses a flat namespace: every object lives in a bucket and is identified by a unique key.

The defining goal: 11 nines of durability (99.999999999%). That means if you store 10 million objects, you'd statistically lose one object every 10,000 years.

Functional Requirements:

PUT /bucket/key — Upload an object (1 KB to 5 TB).
GET /bucket/key — Retrieve an object by key.
DELETE /bucket/key — Remove an object.
Bucket creation, listing, and management.
Object metadata (content-type, custom headers, timestamps).
Versioning — keep previous versions of overwritten objects.
Access control lists (ACLs) — per-bucket and per-object permissions.

Non-Functional Requirements:

Durability: 99.999999999% (11 nines). Data must never be lost.
Availability: 99.99% — the service should almost always be reachable.
Unlimited scale: Handle petabytes of data and billions of objects.
Strong read-after-write consistency: A GET immediately after a PUT must return the latest version.

▸ The idea: buckets, objects, and metadata

Estimation

Let's size this system for a large-scale deployment:

Total storage: 100 PB (petabytes) across all customers.
Write throughput: 1 million PUT requests/second at peak.
Read throughput: 10 million GET requests/second at peak (10:1 read-to-write ratio).
Object sizes: Range from 1 KB to 5 TB, with an average of ~1 MB.
Number of objects: 100 PB / 1 MB average = ~100 billion objects.
Metadata per object: ~1 KB (key, bucket, timestamps, ACLs, version info) = ~100 TB of metadata.

Key insight: the metadata service must handle 11 million QPS (PUT + GET), and the data layer must move massive amounts of bytes. These are very different scaling challenges.

API Design

Object storage APIs are beautifully simple — it's essentially a key-value store with HTTP semantics:

Upload Object

Endpoint	`PUT /{bucket}/{key}`
Headers	`Content-Type`, `Content-Length`, `x-amz-meta-*` (custom metadata)
Body	Object data (binary)
Response	`200 OK` with `ETag` (MD5 hash of object)

Download Object

Endpoint	`GET /{bucket}/{key}`
Headers	`Range` (for partial downloads)
Response	`200 OK` with object data

Delete Object

Endpoint	`DELETE /{bucket}/{key}`
Response	`204 No Content`

Head Object (metadata only)

Endpoint	`HEAD /{bucket}/{key}`
Response	`200 OK` with headers only (no body) — content-type, size, ETag, last-modified

For large objects (>5 GB), use multipart upload: initiate upload, upload parts in parallel, complete upload. This enables resumable uploads and parallel transfers.

▸ Data path: how objects are stored

Click chart to zoom

PUT flow: the placement service picks 3 data nodes, writes in parallel, waits for quorum (2/3), then records metadata

Durability via Erasure Coding

The naive approach to durability is 3x replication: store three full copies of every object on different nodes. Simple and effective, but it has a huge cost — 3x storage overhead. At 100 PB, that's 300 PB of raw storage.

Erasure coding (specifically Reed-Solomon coding) is far more efficient. Here's how it works:

Split an object into k data chunks (e.g., k=4).
Generate m parity chunks using mathematical encoding (e.g., m=2).
Distribute all k+m chunks across different nodes/racks.
To reconstruct the original object, you need any k out of k+m chunks.

With a 4+2 scheme:

You can tolerate any 2 node failures without data loss.
Storage overhead: (4+2)/4 = 1.5x (vs 3x for replication).
At 100 PB, you need 150 PB raw storage instead of 300 PB — saving 150 PB!

The trade-off: erasure coding requires more CPU for encoding/decoding and has higher read latency (you must fetch k chunks and reconstruct). That's why many systems use replication for hot/small objects and erasure coding for cold/large objects.

▸ Erasure coding vs replication

Metadata Service

The metadata service is the brain of the system. It maps bucket/key to the physical location of chunks on data nodes.

What metadata stores:

Bucket name + object key → list of chunk locations (node ID, disk ID, offset)
Object size, content type, ETag, creation time
Version history (for versioned buckets)
ACL and ownership information

Data placement uses consistent hashing to map objects to data nodes. When a node joins or leaves, only ~1/N of keys need to be remapped. A placement service maintains the ring and handles rebalancing.

Garbage collection: When an object is deleted, we don't immediately remove the data chunks. Instead, we mark the metadata as deleted (tombstone) and a background GC process reclaims storage later. This avoids expensive synchronous deletes on the write path and handles edge cases like in-flight reads.

Consistency: For strong read-after-write consistency, the metadata write (after a PUT) must be committed before returning success to the client. The metadata DB can use Raft or Paxos for consensus.

▸ Full architecture

Note: Interview tip: Erasure coding is the key differentiator that separates a basic 'replicate everything 3x' answer from a strong one. Mention the storage efficiency (1.5x vs 3x), the trade-off with CPU and read latency, and how real systems use both replication (for hot data) and erasure coding (for cold data).