
Design a Payment System

Move money safely, exactly once, every time
Scope: Real-World System | Difficulty: Advanced

Understanding the Problem

Every time you tap "Pay" on an app, a complex chain of events fires behind the scenes. A payment system (like Stripe) must move money between accounts reliably, securely, and exactly once, even when networks fail, services crash, or users double-click the pay button.

Functional Requirements:

  • Process payments: Accept a payment request with amount, currency, payment method, and merchant details. Charge the customer and credit the merchant.
  • Refunds: Reverse a completed payment, in full or in part, and record the reversing ledger entries.
  • Track payment status: Every payment moves through states: created → processing → succeeded/failed. Clients can poll or receive webhooks for status updates.
  • Webhooks: Notify merchants of payment events (payment succeeded, refund issued, dispute opened) via HTTP callbacks with retry logic.
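
The status lifecycle above can be sketched as a small state machine. This is a minimal illustration, not a reference implementation: the names `PaymentStatus`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical, and it assumes succeeded and failed are terminal states, with refunds tracked as separate objects.

```python
from enum import Enum


class PaymentStatus(Enum):
    CREATED = "created"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Allowed moves between states. Assumption: succeeded and failed are terminal.
ALLOWED_TRANSITIONS = {
    PaymentStatus.CREATED: {PaymentStatus.PROCESSING},
    PaymentStatus.PROCESSING: {PaymentStatus.SUCCEEDED, PaymentStatus.FAILED},
    PaymentStatus.SUCCEEDED: set(),
    PaymentStatus.FAILED: set(),
}


def transition(current: PaymentStatus, new: PaymentStatus) -> PaymentStatus:
    """Return the new status, or raise if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new


status = transition(PaymentStatus.CREATED, PaymentStatus.PROCESSING)
status = transition(status, PaymentStatus.SUCCEEDED)
```

Rejecting illegal moves in one place keeps a late or duplicated processor callback from pushing a finished payment back into processing.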

Non-Functional Requirements:

  • Exactly-once processing: The most critical requirement. If a network timeout causes the client to retry, the system must not charge the customer twice. This is achieved through idempotency keys.
  • High availability: 99.999% uptime. For a large processor, even minutes of downtime can mean millions in lost transactions.
  • Data consistency: Money must never be created or destroyed. Every debit must have a matching credit (double-entry bookkeeping).
  • PCI-DSS compliance: Card numbers must be tokenized, encrypted at rest, and never logged in plaintext.
  • Audit trail: Every action must be recorded immutably for regulatory compliance and dispute resolution.
Figure: The idea: pay → validate → process → confirm

Back-of-the-Envelope Estimation

Let's size the system for a mid-to-large payment processor:

  • Daily transactions: 1 million payments/day
  • Average transaction value: $50 → $50M daily volume
  • Peak throughput: ~100 transactions per second (TPS) during peak hours. The daily average is only ~12 TPS (1M / 86,400 s), so this budget is roughly 8-9x the average and leaves headroom for bursts
  • Storage per transaction: ~1 KB (payment record + ledger entries + audit log) → ~1 GB/day, ~365 GB/year
  • Uptime requirement: 99.999% ("five nines"), only ~5 minutes of downtime per year
  • Webhook delivery: ~3M webhook events/day (multiple events per payment: created, processing, succeeded)

The throughput is modest compared to social media systems, but the correctness requirement is extreme. A social media post appearing twice is annoying; a payment being charged twice is a legal and financial liability.
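
The arithmetic behind these estimates is easy to check. The sketch below uses only the figures stated in this section; the variable names are illustrative. It works out to an average of about 11.6 TPS, so a 100 TPS peak target is roughly 8.6 times the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400
MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600

payments_per_day = 1_000_000
avg_value_usd = 50
bytes_per_txn = 1_000                 # ~1 KB per transaction
peak_tps_target = 100
availability = 0.99999                # "five nines"

daily_volume_usd = payments_per_day * avg_value_usd
avg_tps = payments_per_day / SECONDS_PER_DAY
peak_to_avg = peak_tps_target / avg_tps
storage_gb_per_day = payments_per_day * bytes_per_txn / 1e9
storage_gb_per_year = storage_gb_per_day * 365
downtime_min_per_year = MINUTES_PER_YEAR * (1 - availability)

print(f"daily volume:      ${daily_volume_usd:,}")
print(f"average TPS:       {avg_tps:.1f}")
print(f"peak / average:    {peak_to_avg:.1f}x")
print(f"storage per year:  {storage_gb_per_year:.0f} GB")
print(f"allowed downtime:  {downtime_min_per_year:.2f} min/year")
```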

API Design

Clean, idempotent APIs are the foundation of a reliable payment system.

Create a Payment

POST /api/v1/payments
Headers:
  Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
  Authorization: Bearer sk_live_...

Body:
{
  "amount": 4999,           // in cents; always use integers to avoid floating-point issues
  "currency": "usd",
  "payment_method": "pm_card_visa_4242",
  "merchant_id": "merch_abc123",
  "description": "Order #7892",
  "metadata": { "order_id": "7892" }
}

Response: 201 Created
{
  "id": "pay_1234567890",
  "status": "processing",
  "amount": 4999,
  "currency": "usd",
  "created_at": "2026-03-10T14:30:00Z"
}

Get Payment Status

GET /api/v1/payments/pay_1234567890

Response: 200 OK
{
  "id": "pay_1234567890",
  "status": "succeeded",
  "amount": 4999,
  "currency": "usd"
}

Refund a Payment

POST /api/v1/payments/pay_1234567890/refund
Headers:
  Idempotency-Key: 660e9500-f30c-52e5-b827-557766551111

Body:
{
  "amount": 4999,   // full refund; send a smaller amount for a partial refund
  "reason": "customer_request"
}

Response: 201 Created
{
  "id": "ref_0987654321",
  "payment_id": "pay_1234567890",
  "status": "processing",
  "amount": 4999
}

Key design choices: amounts in cents (integers avoid floating-point bugs), idempotency keys on every mutating endpoint, and status polling + webhooks for async updates.
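
To illustrate why amounts travel as integer cents, here is a minimal sketch. The helper `to_minor_units` is hypothetical, and it assumes a currency with two decimal places, such as USD.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_sum_is_exact = (0.1 + 0.2 == 0.3)   # False


def to_minor_units(amount: str, exponent: int = 2) -> int:
    """Convert a decimal string such as "49.99" to integer minor units (cents)."""
    scaled = Decimal(amount).scaleb(exponent)      # shift the decimal point
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} has more than {exponent} decimal places")
    return int(scaled)


cents = to_minor_units("49.99")   # the 4999 used in the request body above
```

Parsing from a string through `Decimal` avoids ever holding the amount as a float, and rejecting extra decimal places surfaces bad input instead of silently rounding money.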

Figure: Payment flow, from checkout to confirmation. The client sends a payment with an idempotency key, which flows through fraud detection, external processor authorization, ledger recording, and merchant notification.
Figure: Idempotency, handling retries safely.

Idempotency and Double-Entry Bookkeeping

Idempotency is the single most important concept in payment system design. Here's how it works:

  1. The client generates a UUID (the idempotency key) before making the request.
  2. When the server receives the request, it checks if that key already exists in the idempotency store (a Redis cache or database table).
  3. If the key exists, return the cached result; don't process the payment again.
  4. If the key is new, process the payment normally and store the result keyed by that UUID.

This means even if the client retries 10 times (due to network timeouts), the payment is only processed once.
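
The four steps above can be sketched with an in-memory store standing in for the Redis cache or database table. The class `IdempotencyStore` and the function `charge_card` are hypothetical. A single lock makes the check, process, and store sequence atomic here; a production store would more likely rely on an atomic insert, such as a unique-key constraint, which is an assumption beyond what this article specifies.

```python
import threading
import uuid
from typing import Callable, Dict


class IdempotencyStore:
    """In-memory stand-in for the idempotency store."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def run_once(self, key: str, handler: Callable[[], dict]) -> dict:
        # Holding the lock across check, process, and store means two
        # concurrent retries with the same key cannot both run the handler.
        with self._lock:
            if key in self._results:
                return self._results[key]      # step 3: return cached result
            result = handler()                 # step 4: process normally
            self._results[key] = result        # ...and store under the key
            return result


charges = []  # records every real charge, to show how many happened


def charge_card() -> dict:
    charges.append(4999)
    return {"id": "pay_1234567890", "status": "processing", "amount": 4999}


store = IdempotencyStore()
key = str(uuid.uuid4())                        # step 1: client-generated UUID
responses = [store.run_once(key, charge_card) for _ in range(10)]
```

After ten calls with the same key, `charges` holds a single entry and every response is the same stored result. If the handler raises, nothing is stored in this sketch, so a later retry runs it again.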

Double-Entry Bookkeeping: Every transaction creates exactly two ledger entries, a debit and a credit, that sum to zero. For a $49.99 payment:

  • Debit: Customer account −$49.99
  • Credit: Merchant account +$49.99

For a refund, the entries reverse:

  • Debit: Merchant account −$49.99
  • Credit: Customer account +$49.99

The ledger is append-only: entries are never updated or deleted, only new entries are added. This creates an immutable audit trail.
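
A minimal sketch of the payment and refund postings follows. It stores signed cents (negative for a debit, positive for a credit) to mirror the signs in the example; the `Ledger` and `LedgerEntry` names and the account ids are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)          # frozen: an entry cannot be modified once created
class LedgerEntry:
    transaction_id: str
    account_id: str
    amount: int                  # signed cents: negative = debit, positive = credit


class Ledger:
    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def post(self, transaction_id: str, debit_account: str,
             credit_account: str, amount: int) -> None:
        """Append one debit and one matching credit; never update or delete."""
        if amount <= 0:
            raise ValueError("amount must be a positive number of cents")
        self._entries.extend([
            LedgerEntry(transaction_id, debit_account, -amount),
            LedgerEntry(transaction_id, credit_account, amount),
        ])

    @property
    def entries(self) -> Tuple[LedgerEntry, ...]:
        return tuple(self._entries)      # read-only view for callers

    def total(self) -> int:
        return sum(e.amount for e in self._entries)


ledger = Ledger()
ledger.post("pay_1234567890", "acct_customer", "acct_merchant", 4999)   # payment
ledger.post("ref_0987654321", "acct_merchant", "acct_customer", 4999)   # refund
```

Because every posting adds a pair that nets to zero, `ledger.total()` stays at zero no matter how many payments and refunds are recorded. A nonzero total would mean money was created or destroyed.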

Reconciliation: A background worker periodically compares internal ledger totals against the external payment processor's records. Any discrepancies are flagged for investigation. This catches bugs, fraud, and processor errors.
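
As a rough sketch of what such a worker compares, the function below takes two mappings from payment id to amount in cents, one from the internal ledger and one from the processor's records. The function name and input shape are assumptions; real processor reports carry far more detail.

```python
from typing import Dict, List


def reconcile(internal: Dict[str, int], processor: Dict[str, int]) -> List[str]:
    """Return one human-readable discrepancy per payment that does not match."""
    discrepancies = []
    for payment_id in sorted(internal.keys() | processor.keys()):
        ours = internal.get(payment_id)
        theirs = processor.get(payment_id)
        if ours is None:
            discrepancies.append(f"{payment_id}: missing from internal ledger")
        elif theirs is None:
            discrepancies.append(f"{payment_id}: missing from processor records")
        elif ours != theirs:
            discrepancies.append(
                f"{payment_id}: amount mismatch (ledger {ours}, processor {theirs})"
            )
    return discrepancies


internal = {"pay_1": 4999, "pay_2": 1500, "pay_3": 700}
processor = {"pay_1": 4999, "pay_2": 1000, "pay_4": 250}
flagged = reconcile(internal, processor)
```

Here `pay_2` is flagged for an amount mismatch, `pay_3` as missing at the processor, and `pay_4` as missing internally. The work grows linearly with the number of transactions since the last run.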

Figure: Full architecture

Ledger Design and Failure Handling

Append-Only Ledger: The ledger is the source of truth for all money movement. It uses an event-sourcing pattern: instead of storing current balances (which can drift), you store every individual transaction as an immutable event. The current balance is derived by replaying events.

Ledger entry schema:

{
  "entry_id": "led_abc123",
  "payment_id": "pay_1234567890",
  "account_id": "acct_merchant_xyz",
  "type": "credit",
  "amount": 4999,
  "currency": "usd",
  "created_at": "2026-03-10T14:30:00Z",
  "description": "Payment for Order #7892"
}
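
Deriving a balance by replay can be sketched directly against this schema. The example assumes `amount` is a positive number of cents and `type` is either "credit" or "debit"; the function name and the customer account id are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable


def balances(entries: Iterable[dict]) -> Dict[str, int]:
    """Replay ledger entries in order and return the balance per account, in cents."""
    totals: Dict[str, int] = defaultdict(int)
    for entry in entries:
        if entry["type"] == "credit":
            sign = 1
        elif entry["type"] == "debit":
            sign = -1
        else:
            raise ValueError(f"unknown entry type: {entry['type']}")
        totals[entry["account_id"]] += sign * entry["amount"]
    return dict(totals)


entries = [
    # A 4999-cent payment, then a 2000-cent partial refund.
    {"account_id": "acct_customer_abc", "type": "debit", "amount": 4999},
    {"account_id": "acct_merchant_xyz", "type": "credit", "amount": 4999},
    {"account_id": "acct_merchant_xyz", "type": "debit", "amount": 2000},
    {"account_id": "acct_customer_abc", "type": "credit", "amount": 2000},
]
current = balances(entries)
```

The merchant ends at 2999 cents and the customer at -2999. Replaying the full history is linear in the number of entries, so systems that take this approach often cache periodic snapshots and replay only the entries after the latest one.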

Handling External Processor Failures:

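  • Timeout: If the external processor (or the card network behind it, such as Visa or Mastercard) doesn't respond, don't immediately fail. Retry with exponential backoff (1s, 2s, 4s, 8s...) up to a maximum of 5 attempts, reusing the same idempotency key so that a request which actually went through is not charged again.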
  • Processor down: If the primary processor is unavailable, route to a backup processor. Major payment systems maintain relationships with multiple processors.
  • Partial failures: If the charge succeeds at the processor but the ledger write fails, use a saga pattern: roll back the charge with a void/refund at the processor level.
  • Stuck payments: A reconciliation worker detects payments stuck in "processing" state for too long and either completes or reverses them.
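
Two of these cases, timeout retries and the saga-style rollback, can be sketched together. Everything here is hypothetical: `ProcessorTimeout`, `call_with_retry`, and `charge_then_record` are illustrative names, and the `charge`, `record`, and `void` callables stand in for real processor and ledger clients. The sketch also assumes the charge call is idempotent at the processor, so retrying it is safe.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ProcessorTimeout(Exception):
    """Hypothetical error raised when the processor does not respond in time."""


def backoff_delays(max_attempts: int = 5, base: float = 1.0) -> List[float]:
    """Waits between attempts: 1s, 2s, 4s, 8s for the default 5 attempts."""
    return [base * 2 ** i for i in range(max_attempts - 1)]


def call_with_retry(call: Callable[[], T], max_attempts: int = 5,
                    base: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base)
    for attempt in range(max_attempts):
        try:
            return call()
        except ProcessorTimeout:
            if attempt == max_attempts - 1:
                raise                      # out of attempts: surface the failure
            sleep(delays[attempt])
    raise AssertionError("unreachable")


def charge_then_record(charge: Callable[[], str],
                       record: Callable[[str], None],
                       void: Callable[[str], None]) -> str:
    """Charge at the processor, then write the ledger; void the charge if that fails."""
    charge_id = call_with_retry(charge)
    try:
        record(charge_id)
    except Exception:
        void(charge_id)                    # compensating action for the saga
        raise
    return charge_id
```

Passing `sleep` as a parameter lets tests run without real delays. If the void itself fails, this sketch has no further recovery; that gap is what the stuck-payment reconciliation worker is meant to catch.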
Interview tip: When discussing payment systems, always emphasize idempotency and exactly-once semantics. Interviewers want to hear that you understand why charging a customer twice is catastrophic, and how idempotency keys, double-entry bookkeeping, and reconciliation workers prevent it. These are the make-or-break concepts for this design.

Key Metrics

  • Payment API → Processing (end-to-end payment latency): ~200-500 ms, \(O(1)\)
  • Fraud scoring (ML model inference): ~50-100 ms, \(O(1)\)
  • External processor, Visa/MC (network round-trip to card network): ~100-300 ms, \(O(1)\)
  • Ledger write (append-only insert): ~5-20 ms, \(O(1)\)
  • Success rate target (excluding user errors): >99.5%
  • Reconciliation frequency: every 15 min, \(O(n)\), where n = transactions since last run
  • Uptime SLA (~5 min downtime/year): 99.999%

Quick check

Why do payment systems use idempotency keys instead of simply deduplicating by amount and merchant?
