Design an Object Storage
Understanding the Problem
Object storage systems like AWS S3 let you store virtually unlimited objects β files, images, backups, logs β with extreme durability. Unlike file systems (hierarchical) or block storage (fixed-size blocks), object storage uses a flat namespace: every object lives in a bucket and is identified by a unique key.
The defining goal: 11 nines of durability (99.999999999%). That means if you store 10 million objects, you'd statistically lose one object every 10,000 years.
Functional Requirements:
PUT /bucket/keyβ Upload an object (1 KB to 5 TB).GET /bucket/keyβ Retrieve an object by key.DELETE /bucket/keyβ Remove an object.- Bucket creation, listing, and management.
- Object metadata (content-type, custom headers, timestamps).
- Versioning β keep previous versions of overwritten objects.
- Access control lists (ACLs) β per-bucket and per-object permissions.
Non-Functional Requirements:
- Durability: 99.999999999% (11 nines). Data must never be lost.
- Availability: 99.99% β the service should almost always be reachable.
- Unlimited scale: Handle petabytes of data and billions of objects.
- Strong read-after-write consistency: A GET immediately after a PUT must return the latest version.
Estimation
Let's size this system for a large-scale deployment:
- Total storage: 100 PB (petabytes) across all customers.
- Write throughput: 1 million PUT requests/second at peak.
- Read throughput: 10 million GET requests/second at peak (10:1 read-to-write ratio).
- Object sizes: Range from 1 KB to 5 TB, with an average of ~1 MB.
- Number of objects: 100 PB / 1 MB average = ~100 billion objects.
- Metadata per object: ~1 KB (key, bucket, timestamps, ACLs, version info) = ~100 TB of metadata.
Key insight: the metadata service must handle 11 million QPS (PUT + GET), and the data layer must move massive amounts of bytes. These are very different scaling challenges.
API Design
Object storage APIs are beautifully simple β it's essentially a key-value store with HTTP semantics:
Upload Object
| Endpoint | PUT /{bucket}/{key} |
| Headers | Content-Type, Content-Length, x-amz-meta-* (custom metadata) |
| Body | Object data (binary) |
| Response | 200 OK with ETag (MD5 hash of object) |
Download Object
| Endpoint | GET /{bucket}/{key} |
| Headers | Range (for partial downloads) |
| Response | 200 OK with object data |
Delete Object
| Endpoint | DELETE /{bucket}/{key} |
| Response | 204 No Content |
Head Object (metadata only)
| Endpoint | HEAD /{bucket}/{key} |
| Response | 200 OK with headers only (no body) β content-type, size, ETag, last-modified |
For large objects (>5 GB), use multipart upload: initiate upload, upload parts in parallel, complete upload. This enables resumable uploads and parallel transfers.
Durability via Erasure Coding
The naive approach to durability is 3x replication: store three full copies of every object on different nodes. Simple and effective, but it has a huge cost β 3x storage overhead. At 100 PB, that's 300 PB of raw storage.
Erasure coding (specifically Reed-Solomon coding) is far more efficient. Here's how it works:
- Split an object into k data chunks (e.g., k=4).
- Generate m parity chunks using mathematical encoding (e.g., m=2).
- Distribute all k+m chunks across different nodes/racks.
- To reconstruct the original object, you need any k out of k+m chunks.
With a 4+2 scheme:
- You can tolerate any 2 node failures without data loss.
- Storage overhead: (4+2)/4 = 1.5x (vs 3x for replication).
- At 100 PB, you need 150 PB raw storage instead of 300 PB β saving 150 PB!
The trade-off: erasure coding requires more CPU for encoding/decoding and has higher read latency (you must fetch k chunks and reconstruct). That's why many systems use replication for hot/small objects and erasure coding for cold/large objects.
Metadata Service
The metadata service is the brain of the system. It maps bucket/key to the physical location of chunks on data nodes.
What metadata stores:
- Bucket name + object key β list of chunk locations (node ID, disk ID, offset)
- Object size, content type, ETag, creation time
- Version history (for versioned buckets)
- ACL and ownership information
Data placement uses consistent hashing to map objects to data nodes. When a node joins or leaves, only ~1/N of keys need to be remapped. A placement service maintains the ring and handles rebalancing.
Garbage collection: When an object is deleted, we don't immediately remove the data chunks. Instead, we mark the metadata as deleted (tombstone) and a background GC process reclaims storage later. This avoids expensive synchronous deletes on the write path and handles edge cases like in-flight reads.
Consistency: For strong read-after-write consistency, the metadata write (after a PUT) must be committed before returning success to the client. The metadata DB can use Raft or Paxos for consensus.