Data & Infrastructure14 min read

Design a Key-Value Store

Build a distributed hash map like Redis or DynamoDB — at planet scale

scope:Real-World Systemdifficulty:Advanced

Understanding the Problem

A key-value store is one of the simplest yet most powerful abstractions in distributed systems. Think Redis, DynamoDB, or etcd — a giant hash map spread across multiple machines.

The API is deceptively simple:

put(key, value) — Store a value
get(key) — Retrieve a value
delete(key) — Remove a value
Optional: TTL (time-to-live) — auto-expire keys

Functional Requirements:

Store key-value pairs where keys are strings and values are opaque blobs (up to ~1 MB).
Support get, put, and delete operations.
Support TTL-based expiration of keys.

Non-Functional Requirements:

Low latency: Both reads and writes should complete in under 10 ms.
High availability: The store should remain available even when nodes fail.
Tunable consistency: Let users choose between strong and eventual consistency depending on their use case.
Scalability: Handle millions of keys across many machines with horizontal scaling.

▸ The idea: distributed hash map

Single Server vs. Distributed

A single-server key-value store is trivial — just use an in-memory hash map. But a single machine has limits:

Memory: One server can hold maybe 256 GB of RAM. What if you need 100 TB?
Availability: If that server dies, all data is lost.
Throughput: A single machine can handle maybe 100K ops/sec. What about 10M ops/sec?

To go beyond a single machine, we need two key techniques:

Partitioning (Sharding): Split data across multiple nodes so each node stores only a subset of keys. We use consistent hashing to decide which node owns which key.
Replication: Copy each key to multiple nodes (typically N=3) so that if one node dies, the data survives. This introduces the challenge of keeping replicas in sync.

The combination of partitioning + replication is what makes distributed key-value stores both scalable and fault-tolerant.

▸ Data partitioning with consistent hashing

Click chart to zoom

Write path: the coordinator hashes the key, writes to N=3 replicas, and returns success after W=2 acknowledge

Replication and Quorum Consensus

Each key is stored on N nodes (typically N=3). But how many nodes must acknowledge a read or write before we consider it successful?

This is where quorum comes in:

W (write quorum): Number of nodes that must ACK a write before returning success.
R (read quorum): Number of nodes that must respond to a read.
N (replication factor): Total number of replicas.

The magic formula: W + R > N guarantees strong consistency.

Common configurations:

Strong consistency: W=2, R=2, N=3 → always read the latest write
High write availability: W=1, R=3, N=3 → fast writes, slower reads
High read availability: W=3, R=1, N=3 → slow writes, fast reads
Eventual consistency: W=1, R=1, N=3 → fastest, but may read stale data

DynamoDB uses N=3 by default. Cassandra lets you tune W and R per query. This tunability is what makes these systems so versatile.

▸ Replication and quorum consensus

Conflict Resolution and Failure Handling

When multiple replicas accept writes independently, conflicts are inevitable. How do we resolve them?

Vector Clocks: Each node maintains a logical clock. When a value is written, the vector clock is incremented. On read, if two versions have conflicting vector clocks (neither dominates), the client must resolve the conflict (e.g., merge shopping carts). Amazon's Dynamo uses this approach.

Last-Write-Wins (LWW): Simpler — each write gets a timestamp, and the most recent write wins. Cassandra uses this. Easy to implement but can lose data if clocks are skewed.

Failure Detection — Gossip Protocol: Nodes periodically exchange heartbeat messages with random peers. If a node hasn't been heard from in a while, it's marked as suspected-down. This is decentralized — no single point of failure in the detection itself.

Handling Failures:

Hinted Handoff: If a replica is temporarily down, another node holds the write as a "hint" and forwards it when the node recovers.
Read Repair: During a read, if replicas return different versions, the coordinator sends the latest version to stale replicas.
Anti-Entropy (Merkle Trees): Nodes periodically compare Merkle trees of their data to find and fix inconsistencies in the background.

▸ Full architecture with failure handling

Simple In-Memory KV Store with Consistent Hashing

import hashlib
from bisect import bisect_right

class ConsistentHash:
    """Consistent hash ring with virtual nodes."""
    def __init__(self, nodes: list[str], vnodes: int = 150):
        self.ring: dict[int, str] = {}
        self.sorted_keys: list[int] = []
        for node in nodes:
            for i in range(vnodes):
                key = self._hash(f"{node}:{i}")
                self.ring[key] = node
        self.sorted_keys = sorted(self.ring.keys())

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_nodes(self, key: str, n: int = 3) -> list[str]:
        """Get N distinct nodes responsible for a key."""
        if not self.sorted_keys:
            return []
        h = self._hash(key)
        idx = bisect_right(self.sorted_keys, h) % len(self.sorted_keys)
        nodes = []
        seen = set()
        while len(nodes) < n:
            node = self.ring[self.sorted_keys[idx]]
            if node not in seen:
                nodes.append(node)
                seen.add(node)
            idx = (idx + 1) % len(self.sorted_keys)
        return nodes

class KVStore:
    """Distributed KV store with quorum reads/writes."""
    def __init__(self, nodes: list[str], n=3, w=2, r=2):
        self.n, self.w, self.r = n, w, r
        self.ring = ConsistentHash(nodes)
        # Simulate per-node storage
        self.storage: dict[str, dict[str, str]] = {node: {} for node in nodes}

    def put(self, key: str, value: str) -> bool:
        nodes = self.ring.get_nodes(key, self.n)
        acks = 0
        for node in nodes:
            self.storage[node][key] = value  # simulate write
            acks += 1
        return acks >= self.w  # quorum met?

    def get(self, key: str) -> str | None:
        nodes = self.ring.get_nodes(key, self.n)
        responses = []
        for node in nodes:
            val = self.storage[node].get(key)
            if val is not None:
                responses.append(val)
        if len(responses) >= self.r:
            return responses[0]  # In practice: pick latest version
        return None

# Demo
store = KVStore(["node-1", "node-2", "node-3", "node-4"])
store.put("user:1001", '{"name": "Alice"}')
store.put("session:abc", '{"token": "xyz"}')

print(f"user:1001 → {store.get('user:1001')}")
print(f"session:abc → {store.get('session:abc')}")
print(f"Nodes for 'user:1001': {store.ring.get_nodes('user:1001')}")
print(f"Quorum: W={store.w}, R={store.r}, N={store.n}")
print(f"Strong consistency: W+R > N → {store.w}+{store.r} > {store.n} = {store.w+store.r > store.n}")

Output

user:1001 → {"name": "Alice"}
session:abc → {"token": "xyz"}
Nodes for 'user:1001': ['node-2', 'node-4', 'node-1']
Quorum: W=2, R=2, N=3
Strong consistency: W+R > N → 2+2 > 3 = True

Note: CAP Theorem in practice: you can't have all three of Consistency, Availability, and Partition tolerance. During a network partition, you must choose — CP systems (like etcd, ZooKeeper) reject writes to stay consistent; AP systems (like Cassandra, DynamoDB) accept writes and reconcile later. Most real-world KV stores choose AP with tunable consistency, letting the application decide per-request.