Design a Key-Value Store
Understanding the Problem
A key-value store is one of the simplest yet most powerful abstractions in distributed systems. Think Redis, DynamoDB, or etcd β a giant hash map spread across multiple machines.
The API is deceptively simple:
put(key, value)β Store a valueget(key)β Retrieve a valuedelete(key)β Remove a value- Optional: TTL (time-to-live) β auto-expire keys
Functional Requirements:
- Store key-value pairs where keys are strings and values are opaque blobs (up to ~1 MB).
- Support
get,put, anddeleteoperations. - Support TTL-based expiration of keys.
Non-Functional Requirements:
- Low latency: Both reads and writes should complete in under 10 ms.
- High availability: The store should remain available even when nodes fail.
- Tunable consistency: Let users choose between strong and eventual consistency depending on their use case.
- Scalability: Handle millions of keys across many machines with horizontal scaling.
Single Server vs. Distributed
A single-server key-value store is trivial β just use an in-memory hash map. But a single machine has limits:
- Memory: One server can hold maybe 256 GB of RAM. What if you need 100 TB?
- Availability: If that server dies, all data is lost.
- Throughput: A single machine can handle maybe 100K ops/sec. What about 10M ops/sec?
To go beyond a single machine, we need two key techniques:
- Partitioning (Sharding): Split data across multiple nodes so each node stores only a subset of keys. We use consistent hashing to decide which node owns which key.
- Replication: Copy each key to multiple nodes (typically N=3) so that if one node dies, the data survives. This introduces the challenge of keeping replicas in sync.
The combination of partitioning + replication is what makes distributed key-value stores both scalable and fault-tolerant.
Replication and Quorum Consensus
Each key is stored on N nodes (typically N=3). But how many nodes must acknowledge a read or write before we consider it successful?
This is where quorum comes in:
- W (write quorum): Number of nodes that must ACK a write before returning success.
- R (read quorum): Number of nodes that must respond to a read.
- N (replication factor): Total number of replicas.
The magic formula: W + R > N guarantees strong consistency.
Common configurations:
- Strong consistency: W=2, R=2, N=3 β always read the latest write
- High write availability: W=1, R=3, N=3 β fast writes, slower reads
- High read availability: W=3, R=1, N=3 β slow writes, fast reads
- Eventual consistency: W=1, R=1, N=3 β fastest, but may read stale data
DynamoDB uses N=3 by default. Cassandra lets you tune W and R per query. This tunability is what makes these systems so versatile.
Conflict Resolution and Failure Handling
When multiple replicas accept writes independently, conflicts are inevitable. How do we resolve them?
Vector Clocks: Each node maintains a logical clock. When a value is written, the vector clock is incremented. On read, if two versions have conflicting vector clocks (neither dominates), the client must resolve the conflict (e.g., merge shopping carts). Amazon's Dynamo uses this approach.
Last-Write-Wins (LWW): Simpler β each write gets a timestamp, and the most recent write wins. Cassandra uses this. Easy to implement but can lose data if clocks are skewed.
Failure Detection β Gossip Protocol: Nodes periodically exchange heartbeat messages with random peers. If a node hasn't been heard from in a while, it's marked as suspected-down. This is decentralized β no single point of failure in the detection itself.
Handling Failures:
- Hinted Handoff: If a replica is temporarily down, another node holds the write as a "hint" and forwards it when the node recovers.
- Read Repair: During a read, if replicas return different versions, the coordinator sends the latest version to stale replicas.
- Anti-Entropy (Merkle Trees): Nodes periodically compare Merkle trees of their data to find and fix inconsistencies in the background.