Data & Infrastructure14 min read

Design a Key-Value Store

Build a distributed hash map like Redis or DynamoDB β€” at planet scale
scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

A key-value store is one of the simplest yet most powerful abstractions in distributed systems. Think Redis, DynamoDB, or etcd β€” a giant hash map spread across multiple machines.

The API is deceptively simple:

  • put(key, value) β€” Store a value
  • get(key) β€” Retrieve a value
  • delete(key) β€” Remove a value
  • Optional: TTL (time-to-live) β€” auto-expire keys

Functional Requirements:

  • Store key-value pairs where keys are strings and values are opaque blobs (up to ~1 MB).
  • Support get, put, and delete operations.
  • Support TTL-based expiration of keys.

Non-Functional Requirements:

  • Low latency: Both reads and writes should complete in under 10 ms.
  • High availability: The store should remain available even when nodes fail.
  • Tunable consistency: Let users choose between strong and eventual consistency depending on their use case.
  • Scalability: Handle millions of keys across many machines with horizontal scaling.
β–Έ The idea: distributed hash map

Single Server vs. Distributed

A single-server key-value store is trivial β€” just use an in-memory hash map. But a single machine has limits:

  • Memory: One server can hold maybe 256 GB of RAM. What if you need 100 TB?
  • Availability: If that server dies, all data is lost.
  • Throughput: A single machine can handle maybe 100K ops/sec. What about 10M ops/sec?

To go beyond a single machine, we need two key techniques:

  • Partitioning (Sharding): Split data across multiple nodes so each node stores only a subset of keys. We use consistent hashing to decide which node owns which key.
  • Replication: Copy each key to multiple nodes (typically N=3) so that if one node dies, the data survives. This introduces the challenge of keeping replicas in sync.

The combination of partitioning + replication is what makes distributed key-value stores both scalable and fault-tolerant.

β–Έ Data partitioning with consistent hashing
Click chart to zoom
Write path: the coordinator hashes the key, writes to N=3 replicas, and returns success after W=2 acknowledge

Replication and Quorum Consensus

Each key is stored on N nodes (typically N=3). But how many nodes must acknowledge a read or write before we consider it successful?

This is where quorum comes in:

  • W (write quorum): Number of nodes that must ACK a write before returning success.
  • R (read quorum): Number of nodes that must respond to a read.
  • N (replication factor): Total number of replicas.

The magic formula: W + R > N guarantees strong consistency.

Common configurations:

  • Strong consistency: W=2, R=2, N=3 β†’ always read the latest write
  • High write availability: W=1, R=3, N=3 β†’ fast writes, slower reads
  • High read availability: W=3, R=1, N=3 β†’ slow writes, fast reads
  • Eventual consistency: W=1, R=1, N=3 β†’ fastest, but may read stale data

DynamoDB uses N=3 by default. Cassandra lets you tune W and R per query. This tunability is what makes these systems so versatile.

β–Έ Replication and quorum consensus

Conflict Resolution and Failure Handling

When multiple replicas accept writes independently, conflicts are inevitable. How do we resolve them?

Vector Clocks: Each node maintains a logical clock. When a value is written, the vector clock is incremented. On read, if two versions have conflicting vector clocks (neither dominates), the client must resolve the conflict (e.g., merge shopping carts). Amazon's Dynamo uses this approach.

Last-Write-Wins (LWW): Simpler β€” each write gets a timestamp, and the most recent write wins. Cassandra uses this. Easy to implement but can lose data if clocks are skewed.

Failure Detection β€” Gossip Protocol: Nodes periodically exchange heartbeat messages with random peers. If a node hasn't been heard from in a while, it's marked as suspected-down. This is decentralized β€” no single point of failure in the detection itself.

Handling Failures:

  • Hinted Handoff: If a replica is temporarily down, another node holds the write as a "hint" and forwards it when the node recovers.
  • Read Repair: During a read, if replicas return different versions, the coordinator sends the latest version to stale replicas.
  • Anti-Entropy (Merkle Trees): Nodes periodically compare Merkle trees of their data to find and fix inconsistencies in the background.
β–Έ Full architecture with failure handling

Simple In-Memory KV Store with Consistent Hashing

import hashlib
from bisect import bisect_right
class ConsistentHash:
"""Consistent hash ring with virtual nodes."""
def __init__(self, nodes: list[str], vnodes: int = 150):
self.ring: dict[int, str] = {}
self.sorted_keys: list[int] = []
for node in nodes:
for i in range(vnodes):
key = self._hash(f"{node}:{i}")
self.ring[key] = node
self.sorted_keys = sorted(self.ring.keys())
def _hash(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def get_nodes(self, key: str, n: int = 3) -> list[str]:
"""Get N distinct nodes responsible for a key."""
if not self.sorted_keys:
return []
h = self._hash(key)
idx = bisect_right(self.sorted_keys, h) % len(self.sorted_keys)
nodes = []
seen = set()
while len(nodes) < n:
node = self.ring[self.sorted_keys[idx]]
if node not in seen:
nodes.append(node)
seen.add(node)
idx = (idx + 1) % len(self.sorted_keys)
return nodes
class KVStore:
"""Distributed KV store with quorum reads/writes."""
def __init__(self, nodes: list[str], n=3, w=2, r=2):
self.n, self.w, self.r = n, w, r
self.ring = ConsistentHash(nodes)
# Simulate per-node storage
self.storage: dict[str, dict[str, str]] = {node: {} for node in nodes}
def put(self, key: str, value: str) -> bool:
nodes = self.ring.get_nodes(key, self.n)
acks = 0
for node in nodes:
self.storage[node][key] = value # simulate write
acks += 1
return acks >= self.w # quorum met?
def get(self, key: str) -> str | None:
nodes = self.ring.get_nodes(key, self.n)
responses = []
for node in nodes:
val = self.storage[node].get(key)
if val is not None:
responses.append(val)
if len(responses) >= self.r:
return responses[0] # In practice: pick latest version
return None
# Demo
store = KVStore(["node-1", "node-2", "node-3", "node-4"])
store.put("user:1001", '{"name": "Alice"}')
store.put("session:abc", '{"token": "xyz"}')
print(f"user:1001 β†’ {store.get('user:1001')}")
print(f"session:abc β†’ {store.get('session:abc')}")
print(f"Nodes for 'user:1001': {store.ring.get_nodes('user:1001')}")
print(f"Quorum: W={store.w}, R={store.r}, N={store.n}")
print(f"Strong consistency: W+R > N β†’ {store.w}+{store.r} > {store.n} = {store.w+store.r > store.n}")
Output
user:1001 β†’ {"name": "Alice"}
session:abc β†’ {"token": "xyz"}
Nodes for 'user:1001': ['node-2', 'node-4', 'node-1']
Quorum: W=2, R=2, N=3
Strong consistency: W+R > N β†’ 2+2 > 3 = True
Note: CAP Theorem in practice: you can't have all three of Consistency, Availability, and Partition tolerance. During a network partition, you must choose β€” CP systems (like etcd, ZooKeeper) reject writes to stay consistent; AP systems (like Cassandra, DynamoDB) accept writes and reconcile later. Most real-world KV stores choose AP with tunable consistency, letting the application decide per-request.

Key Metrics

get(key)
Hash lookup + quorum read from R nodes
~1-10 ms \(O(1)\)
put(key, value)
Hash + write to W nodes in parallel
~2-10 ms \(O(1)\)
Replication latency
Async replication to remaining N-W nodes
~10-100 ms β€”
Storage per node
Data evenly distributed via consistent hashing
Total / N β€”
Strong consistency
e.g., W=2, R=2, N=3
W+R > N β€”
Eventual consistency
Fastest but may read stale data
W=1, R=1 β€”

Quick check

In a quorum-based system with N=3, W=2, R=2, what happens if one replica node goes down during a write?

Continue reading