Day 17 Hard Payments Idempotency Ledger Reconciliation

Payment Systems: Engineering Where Money Can't Be Lost or Duplicated

Idempotency, Double-Entry Ledger, Saga, Reconciliation

Problem Scenario & Constraints

Design a marketplace payment platform (think Airbnb / Uber): charge the guest, pay out the host. 10M transactions/day, ~115 QPS average writes, ~3000 QPS at peak. This is one of the rare systems where correctness vastly outranks availability — better to refuse service than to double-charge, double-pay, or misrecord a single cent.

Key constraints to clarify in an interview:

Idempotency: clients retry, networks time out. When the same payment is submitted multiple times, money must move exactly once.
Reconcilable: every cent must be traceable; internal books must match the PSP (Stripe/Adyen) and bank statements.
Unreliable externals: PSPs time out and return unknown — you can assume neither success nor failure.
Latency: synchronous charge P99 < 500ms; payout can be async. Amounts are integer minor units (cents), never floats.
Compliance: the ledger must be immutable and auditable (regulatory requirement).

High-Level Architecture

graph TD
    C[Client / Merchant] -->|idempotency-key| API[Payment API<br/>idempotency layer]
    API --> ORC[Payment Orchestrator<br/>Saga]
    ORC --> LED[Ledger Service<br/>immutable double-entry]
    ORC -->|outbox| GW[PSP Gateway]
    GW --> PSP[(Stripe / Adyen)]
    PSP --> BANK[(Banks / Card networks)]
    LED --> DB[(Ledger DB<br/>append-only)]
    RECON[Reconciliation Job<br/>daily 3-way] --> DB
    RECON --> PSP
    RECON --> BANK

Responsibilities: ① the idempotency layer intercepts duplicate requests via an idempotency key; ② the Orchestrator uses Saga to chain "charge → record → payout" across services; ③ the Ledger Service is the single source of truth — every money movement lands as immutable debit/credit entries; ④ the PSP Gateway wraps external processors; ⑤ the reconciliation job compares internal ledger against PSP and bank statements line by line to surface breaks.

Key Technical Points

① Idempotency — Turning "Did It Happen?" Into "Safe to Retry"

Core trade-off: spend a little storage and a state machine to guarantee that any retry never moves money twice.

[Principle] The client generates an idempotency key (UUID) per payment and sends it with the request. The server persists key → request state: a retry with the same key replays the prior result. The trick is more than dedup — split a payment into atomic phases (Stripe calls them recovery points): create local record → call PSP → record ledger, persisting progress after each phase. If the process crashes midway, a retry resumes from the last recovery point rather than restarting. This is essentially a persisted state machine.

Options compared:

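Plain DB unique constraint (unique key column): one line of code, but only blocks fully duplicated requests. It can't handle "charged the PSP but crashed locally": if the row is written only after the PSP call, a retry finds no local record and charges again; if it is written before the call, the retry is rejected as a duplicate even though the payment never finished.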
Full idempotency layer (key + recovery point + cached response): resumes half-done requests, but needs a state machine and locking against concurrent same-key calls.
Concurrent same key: row-lock serialization (simple, may block) vs optimistic insert-conflict retry (lock-free, must handle races).

# pseudo-code: idempotent charge with recovery-point resume
# (db, psp, ledger, fingerprint and build_resp stand for your own services)
def charge(key, req):
    fp = fingerprint(req)
    rec = db.upsert_idempotency(key, fp)   # insert-or-fetch under a row lock
    if rec.body_hash != fp:
        raise Conflict("same key, different request")    # Stripe rejects this
    if rec.response:           # already done -> replay
        return rec.response
    if rec.phase < CREATED:
        db.create_payment(rec)
        rec.phase = CREATED; rec.save()                  # recovery point 1
    if rec.phase < CHARGED:
        # called outside any DB transaction; the PSP carries the key too: two-layer idempotency
        rec.psp_result = psp.charge(req, idem=key)
        rec.phase = CHARGED; rec.save()                  # recovery point 2
    if rec.phase < LEDGERED:
        ledger.record(rec)
        rec.phase = LEDGERED; rec.save()                 # recovery point 3
    rec.response = build_resp(rec); rec.save()
    return rec.response
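To see the resume behaviour end to end, here is a small runnable simulation of the same flow. Everything in it is a stand-in: `FakePSP`, `PaymentService`, the in-memory `records` dict (playing the durable idempotency table), and the `crash_after` switch are hypothetical names used only to inject a crash between the PSP call and the local save.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional

STARTED, CREATED, CHARGED, LEDGERED = range(4)

class Conflict(Exception):
    pass

class Crash(Exception):
    pass

@dataclass
class Record:
    body_hash: str
    phase: int = STARTED
    psp_id: Optional[str] = None
    response: Optional[dict] = None

class FakePSP:
    """Dedups on the idempotency key, like the second layer described above."""
    def __init__(self):
        self.charges = {}   # idempotency key -> charge id
        self.calls = 0

    def charge(self, amount_cents, idem):
        self.calls += 1
        if idem not in self.charges:
            self.charges[idem] = "ch_%d" % (len(self.charges) + 1)
        return self.charges[idem]

def fingerprint(req):
    return hashlib.sha256(json.dumps(req, sort_keys=True).encode()).hexdigest()

class PaymentService:
    def __init__(self, psp):
        self.psp = psp
        self.records = {}   # stands in for a durable table; mutations count as saves
        self.ledger = []

    def charge(self, key, req, crash_after=None):
        fp = fingerprint(req)
        rec = self.records.setdefault(key, Record(body_hash=fp))
        if rec.body_hash != fp:
            raise Conflict("same key, different request")
        if rec.response is not None:
            return rec.response                      # replay
        if rec.phase < CREATED:
            rec.phase = CREATED
        if rec.phase < CHARGED:
            psp_id = self.psp.charge(req["amount_cents"], idem=key)
            if crash_after == "psp":
                raise Crash("died before saving CHARGED")
            rec.psp_id, rec.phase = psp_id, CHARGED
        if rec.phase < LEDGERED:
            self.ledger.append((key, req["amount_cents"]))
            rec.phase = LEDGERED
        rec.response = {"status": "succeeded", "psp_id": rec.psp_id}
        return rec.response
```

Calling `charge("k1", req, crash_after="psp")` and then `charge("k1", req)` reaches the PSP twice, yet the PSP holds one charge and the ledger one entry: the local layer resumes from CREATED and the PSP-side key absorbs the repeated call.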

Real cases: Stripe implements idempotency keys + recovery points in Postgres so a request can crash at any phase and resume, and rejects "same key, different body." Airbnb built a generic idempotency framework across its SOA payments path to achieve "at most once" money movement across microservices.

② Double-Entry Ledger — Immutable, Append-Only

Core trade-off: give up the simplicity of "just update a balance" for an auditable, reconcilable source of truth that never loses an entry.

[Principle] Every money movement writes at least two entries (debit + credit), and the sum per transaction is always 0 — an invariant accounting has validated for 500 years that naturally guards against missed entries. The ledger is immutable (append only, never update/delete): corrections (refunds, fixes) are made via a reversing entry, not by mutating the original. An account balance is the aggregation of all its entries.

Options compared:

Mutable balance column on a business table: fast reads, but no audit trail, lost-update risk under concurrency, and corrections aren't traceable. Fine for small apps, not for finance.
Separate immutable ledger: fully auditable, reconciliation-friendly, but "live balance" needs aggregation — full-table SUM is too slow, so you need periodic balance snapshots + incremental entries.
Balance consistency: live SUM every time (accurate, slow) vs maintained materialized balance (fast, must update within the same transaction as the entries).
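The "periodic snapshot + incremental entries" idea can be sketched as follows. The `Entry` and `Snapshot` shapes and the `seq` ordering column are assumptions made for illustration; the point is only that a read adds the entries newer than the snapshot instead of scanning the whole ledger.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    seq: int      # monotonically increasing position in the ledger
    acct: str
    delta: int    # integer minor units (cents)

@dataclass(frozen=True)
class Snapshot:
    acct: str
    upto_seq: int   # last entry already folded into `balance`
    balance: int

def take_snapshot(entries, acct, prev=None):
    """Fold entries newer than `prev` into a new snapshot for `acct`."""
    start = prev.upto_seq if prev else 0
    base = prev.balance if prev else 0
    newer = [e for e in entries if e.acct == acct and e.seq > start]
    upto = max((e.seq for e in newer), default=start)
    return Snapshot(acct, upto, base + sum(e.delta for e in newer))

def balance(entries, acct, snap=None):
    """Live balance = snapshot balance + entries appended after the snapshot."""
    start = snap.upto_seq if snap else 0
    base = snap.balance if snap else 0
    return base + sum(e.delta for e in entries if e.acct == acct and e.seq > start)
```

In a real store the `seq > start` filter would be an index range scan, so read cost depends on how many entries have arrived since the last snapshot rather than on the account's full history.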

# transfer: two entries in one DB transaction, assert conservation
def transfer(from_acct, to_acct, amount, txn_id):  # amount is integer cents
    if not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount must be a positive integer number of cents")
    e1 = Entry(txn=txn_id, acct=from_acct, delta=-amount)
    e2 = Entry(txn=txn_id, acct=to_acct,   delta=+amount)
    assert e1.delta + e2.delta == 0     # double-entry invariant, checked before writing
    with db.transaction():              # both entries commit or neither does
        db.insert(e1); db.insert(e2)
    # a refund is NOT a delete; it's a reversing txn referencing txn_id
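Real marketplace transactions often have more than two legs, for example guest, host, and a platform fee. The sketch below generalizes the invariant to any number of legs; the `Ledger` class and the account names are hypothetical, and an in-memory list stands in for the append-only table.

```python
from collections import defaultdict

class Ledger:
    def __init__(self):
        self._entries = []   # append-only: (txn_id, account, delta_cents)

    def post(self, txn_id, legs):
        """legs: list of (account, delta_cents). All legs land or none do."""
        if len(legs) < 2:
            raise ValueError("a transaction needs at least two entries")
        if any(isinstance(d, bool) or not isinstance(d, int) for _, d in legs):
            raise TypeError("deltas must be integer minor units")
        if sum(d for _, d in legs) != 0:
            raise ValueError("unbalanced transaction: legs must sum to zero")
        # Validation happens before any write, so a rejected txn leaves no trace.
        self._entries.extend((txn_id, acct, d) for acct, d in legs)

    def balances(self):
        out = defaultdict(int)
        for _, acct, delta in self._entries:
            out[acct] += delta
        return dict(out)

    def entry_count(self):
        return len(self._entries)
```

For instance, `post("t1", [("guest", -10000), ("host", 8500), ("platform_fee", 1500)])` is accepted, while dropping the fee leg raises before anything is written.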

Real cases: Square's Books is an immutable, double-entry database service ensuring "you can never transact in a way that yields illogical money movement." Uber LedgerStore is the immutable source of truth for all money events — append-only, supporting trillions of indexes. Modern Treasury treats double-entry, auditability, and immutability as its three ledger principles.

③ Cross-Service Money Movement — Saga, Not 2PC

Core trade-off: 2PC gives atomic commit but blocks when the coordinator fails and cannot enlist external systems; Saga gives availability but only eventual consistency, and you hand-write compensations.

[Principle] A full payment spans multiple services and an external PSP: charge guest → record ledger → trigger host payout. 2PC is unworkable here: the external PSP won't join your transaction coordinator, locks are held across network round-trips for too long, and the coordinator is a single point of failure. Use a Saga: a chain of local transactions, where a failed step is undone by compensating the steps already completed. The atomicity of "change local state" and "emit the next step" is guaranteed by the Outbox pattern (write the business table + outbox table in one transaction, then a relay delivers).

Options compared:

2PC: strong consistency, but a coordinator failure leaves participants holding locks indefinitely, and external PSPs simply don't support it.
Saga: highly available and scalable, but only eventually consistent, with intermediate visible states and a compensation to write per step.
Compensation vs forward recovery: reversible steps (an auth hold) can be compensated; irreversible steps (money already paid out) can't be compensated, only forward-recovered — retried to success, so downstreams must be idempotent.
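The difference between compensation and forward recovery can be sketched as a tiny saga runner. This is a simplified model under stated assumptions: the step dictionary shape, the `run_saga` name, and the status labels are invented here, and reversible steps are assumed to come before irreversible ones.

```python
def run_saga(steps, max_retries=3):
    """Run steps in order. Each step is a dict with 'name', 'action' and an
    optional 'compensate' callable. Steps without 'compensate' are treated
    as irreversible and are retried forward instead of rolled back."""
    done, log = [], []
    for step in steps:
        undo = step.get("compensate")
        if undo is not None:
            try:
                step["action"]()
            except Exception:
                log.append("fail:" + step["name"])
                for prev in reversed(done):       # undo completed steps, newest first
                    prev["compensate"]()
                    log.append("undo:" + prev["name"])
                return "compensated", log
            done.append(step)
            log.append("ok:" + step["name"])
        else:
            for _ in range(max_retries):
                try:
                    step["action"]()
                except Exception:
                    log.append("retry:" + step["name"])
                else:
                    log.append("ok:" + step["name"])
                    break
            else:
                # Retries exhausted: park for an operator, never roll back.
                return "needs_attention", log
    return "completed", log
```

Because the irreversible branch may call `action` several times, that action has to be idempotent downstream, which is exactly the requirement stated in the bullet above.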

# Outbox: local state + message land atomically; a relay delivers async
def step_charge_then_payout(payment):
    with db.transaction():
        payment.status = "CHARGED"; db.save(payment)
        db.insert(Outbox(event="StartPayout", payload=payment.id))  # same txn
    # a separate relay polls the outbox -> emits -> marks sent
    # a failed payout is NOT compensated (money received); forward-retry to success
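A runnable version of the outbox write plus a polling relay, using sqlite3 as a stand-in database. The schema, `mark_charged`, `relay_once`, and the `publish` callback are assumptions for illustration; a production relay would more likely tail the database log or poll with row locking.

```python
import json
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event TEXT NOT NULL,
            payload TEXT NOT NULL,
            sent INTEGER NOT NULL DEFAULT 0
        );
    """)
    return conn

def mark_charged(conn, payment_id):
    with conn:  # one transaction: status change and outbox row commit together
        conn.execute(
            "UPDATE payments SET status = 'CHARGED' WHERE id = ?", (payment_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event, payload) VALUES (?, ?)",
            ("StartPayout", json.dumps({"payment_id": payment_id})),
        )

def relay_once(conn, publish):
    """Deliver unsent rows in order. Returns how many were delivered."""
    rows = conn.execute(
        "SELECT id, event, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, event, payload in rows:
        publish(event, json.loads(payload))   # if this raises, the row stays unsent
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

A crash between `publish` and the `sent = 1` update delivers the event again on the next pass, so delivery is at-least-once and the payout consumer must deduplicate.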

Real cases: at Airbnb, one API call drills into multiple downstream services forming a complex distributed transaction; they reach eventual consistency via idempotency + auto-retry rather than 2PC. Uber's money path likewise keeps LedgerStore as the strongly-consistent core with async orchestration around it.

④ Reconciliation — Finding the Drift Between Books and Reality

Core trade-off: real-time reconciliation is expensive but catches issues early; T+1 batch is cheap but exposes losses late.

[Principle] The internal ledger is "the money you think you have"; PSP/bank statements are "the money you actually have," and the two inevitably drift: fees, FX, refund timing, PSP bugs, unhandled timeouts. Run 3-way reconciliation daily: internal ledger ↔ PSP statement ↔ bank, joined line by line on external_id, surfacing breaks (missing, amount mismatch, status mismatch), auto-classified (fee items auto-settled, suspected losses alerted), the rest handled manually. Reconciliation is a payment system's last line of defense — every bug in your idempotency/Saga logic eventually shows up here.

Options compared: real-time reconciliation (streaming compare, minute-level detection, costly) vs T+1 batch (once a day, cheap, but a loss can run for a day). Tolerance policy: zero-tolerance per-line settlement vs threshold alerting (small diffs recorded first, reviewed later).

# pseudo-code: one leg of 3-way reconciliation (ledger vs PSP); repeat the same match against the bank statement
def reconcile(ledger_rows, psp_rows):
    breaks = []
    by_ext = {r.external_id: r for r in psp_rows}
    for l in ledger_rows:
        p = by_ext.pop(l.external_id, None)
        if p is None:
            breaks.append(("MISSING_IN_PSP", l, None))         # I recorded, PSP didn't
        elif p.amount != l.amount:
            breaks.append(("AMOUNT_MISMATCH", l, p))           # usually a fee
        elif p.status != l.status:
            breaks.append(("STATUS_MISMATCH", l, p))
    for leftover in by_ext.values():
        breaks.append(("MISSING_IN_LEDGER", None, leftover))   # PSP has it, I missed -> loss risk
    return breaks
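The auto-classification step (fee items auto-settled, the rest escalated) might look like the sketch below. It assumes the PSP statement reports a net amount plus a separate `fee` field per line; that field, the `Row` shape, and the break labels are illustrative and will differ between processors.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    external_id: str
    amount: int     # cents; net of fees on the PSP side (assumption)
    fee: int = 0    # PSP-reported fee in cents, 0 on ledger rows

def classify(ledger_rows, psp_rows):
    """Return (auto_settled, needs_review) lists of break tuples."""
    auto, manual = [], []
    psp_by_id = {r.external_id: r for r in psp_rows}
    for l in ledger_rows:
        p = psp_by_id.pop(l.external_id, None)
        if p is None:
            manual.append(("MISSING_IN_PSP", l.external_id))
        elif p.amount == l.amount:
            continue                                   # clean match
        elif l.amount - p.amount == p.fee:
            auto.append(("FEE", l.external_id, p.fee))  # gap fully explained by the fee
        else:
            manual.append(("AMOUNT_MISMATCH", l.external_id, l.amount - p.amount))
    # Anything the PSP has that the ledger lacks is a potential loss.
    manual.extend(("MISSING_IN_LEDGER", r.external_id) for r in psp_by_id.values())
    return auto, manual
```

An auto-settled fee break would then be closed by posting a fee entry to the ledger, while everything in the second list goes to alerting or a review queue.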

Real cases: reconciliation is table stakes for every payments company. Modern Treasury makes reconcilability a core product; Stripe's and Square's immutable ledger designs exist precisely to make reconciliation traceable line by line.

Scaling & Optimization

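Sharding: shard the ledger by account_id. Keep both sides of a cross-account transfer in the same logical shard where possible so a single local transaction covers them; when they span shards, confine any 2PC to your own internal shards, never the external PSP.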
Hot/cold separation: old entries are rarely read but can't be deleted — Uber migrated historical ledger data off DynamoDB into its own LedgerStore, saving ~$6M/year.
CDC for real-time reconciliation/risk: stream ledger changes (Debezium) to the warehouse for near-real-time reconciliation and fraud detection.
Multi-currency: a separate ledger per currency; FX is recorded as two entries (out of one currency, into another).
Multi-region: money systems are usually region-pinned and consistency-first; cross-region settlement goes through async clearing.

Common Pitfalls & Interview Questions

Storing money as float: rounding error accumulates to the cent. Use integer minor units or decimal.
Wrong idempotency-key scope: the same key with a different body should error (Stripe checks the request fingerprint), not return the old result.
Putting the PSP call inside the DB transaction: you hold a DB lock across a network round-trip and exhaust the connection pool. External calls must be outside the transaction.
The "exactly-once" illusion: over a network there is no true exactly-once, only at-least-once delivery + idempotent consumption = effectively-once.
Mishandling PSP unknown/timeout: treat it as neither success nor failure — actively query PSP status or let reconciliation catch it.
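The float pitfall is easy to reproduce. The snippet below contrasts float accumulation with integer cents and shows one way to split an amount without losing a cent; `to_cents` and `split_evenly` are hypothetical helper names, and the rounding mode is an assumption that a real system would take from its accounting rules.

```python
from decimal import Decimal, ROUND_HALF_UP

def to_cents(amount_str):
    """Parse a decimal string such as '19.99' into integer cents without using float."""
    cents = (Decimal(amount_str) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return int(cents)

def split_evenly(total_cents, parts):
    """Split into `parts` integer shares that sum exactly to the total."""
    base, remainder = divmod(total_cents, parts)
    # The first `remainder` shares carry one extra cent each.
    return [base + 1 if i < remainder else base for i in range(parts)]

float_total = sum([0.1] * 10)   # binary floats cannot represent 0.1 exactly
cents_total = sum([10] * 10)    # ten payments of 10 cents

print(float_total == 1.0)       # False
print(cents_total == 100)       # True
print(split_evenly(1000, 3))    # [334, 333, 333]
```

Splitting with `divmod` keeps the shares summing to the original total, which a naive `total / 3` rounded per share does not guarantee.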

Frequent interview follow-ups: ① how do you avoid double payouts when the client retries? ② the PSP times out — how do you know whether money actually moved? ③ how do you record a refund (why can't you edit the original)? ④ why don't payments use 2PC? ⑤ how do you reconcile "to the cent"? Which breaks auto-settle and which need a human?

Going Deeper — Resources

Stripe — Designing robust and predictable APIs with idempotency: the canonical key + recovery-point implementation.
Airbnb — Avoiding Double Payments in a Distributed Payments System: a generic idempotency framework under SOA.
Square — Books: an immutable double-entry accounting database: engineering immutable double-entry.
Uber — How LedgerStore Supports Trillions of Indexes: storage design and cost optimization for the money source of truth.
Designing Data-Intensive Applications (Kleppmann), chapters 9, 11, 12: consistency, stream processing, and correctness.

Deeper Reflection

1. A PSP call returned a timeout — money may or may not have moved. You can assume neither. What's the complete fallback chain?

A timeout is the most dangerous state: the request may have stalled at "PSP charged but response lost." The chain:

Mark no terminal state locally: keep PENDING, never write success or failed.
Retry/re-query with the idempotency key: because the PSP request also carried the key, retrying is safe — if it charged, the PSP returns the original result rather than charging twice; if it didn't, this attempt succeeds.
Actively reconcile: call the PSP's status API by key to pull the truth and persist it.
Last line is reconciliation: even if the above are missed, T+1 reconciliation surfaces the "PSP has it, ledger doesn't" break and triggers a record or refund.
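The chain above can be sketched as a single function. The PSP interface here is hypothetical: `charge` returns a terminal status or raises `PSPTimeout`, and `get_status` returns a terminal status or `None` when the PSP has no record of the key. `LossyPSP` is a fake that charges successfully but loses its first response.

```python
class PSPTimeout(Exception):
    pass

def charge_with_fallback(psp, key, amount_cents, max_attempts=3):
    """Returns 'SUCCEEDED', 'FAILED', or 'PENDING' (left for reconciliation)."""
    for _ in range(max_attempts):
        try:
            # Safe to repeat: the PSP dedups on the idempotency key.
            return psp.charge(key, amount_cents)
        except PSPTimeout:
            try:
                status = psp.get_status(key)   # pull the truth instead of guessing
            except PSPTimeout:
                continue
            if status in ("SUCCEEDED", "FAILED"):
                return status
            # PSP has no record yet: loop and retry the charge with the same key.
    return "PENDING"   # never invent a terminal state; reconciliation will resolve it

class LossyPSP:
    """Fake PSP: the first charge lands remotely but its response is lost."""
    def __init__(self):
        self.charged = {}
        self.drop_next = True

    def charge(self, key, amount_cents):
        self.charged.setdefault(key, amount_cents)   # dedup on key
        if self.drop_next:
            self.drop_next = False
            raise PSPTimeout()
        return "SUCCEEDED"

    def get_status(self, key):
        return "SUCCEEDED" if key in self.charged else None
```

With `LossyPSP` the first call times out, the status lookup reports the charge, and the function returns `SUCCEEDED` with exactly one charge recorded at the PSP.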

Essence: converge "uncertain" into "certain" via idempotency + reconciliation, instead of guessing.

2. Why can't a refund just set the original payment's amount to 0 or delete it? What principle is behind this?

Because the ledger is immutable — editing/deleting the original destroys the audit trail and reconcilability. A refund writes a reversing entry referencing the original txn_id, with opposite direction. Benefits: ① complete, traceable history that regulators and reconciliation can reconstruct; ② the original payment still truly happened (fees, taxes, reports all depend on it); ③ a partial refund just writes a partial-amount reversing entry and the balance is naturally correct. This is the heart of double-entry's "append-only + reversing": errors aren't erased, they're corrected with new entries — the same logic as Git revert vs force-push.
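A partial refund as a reversing entry can be sketched like this. The entry dictionaries, the `reverses` field, and the two-leg restriction are simplifications assumed for the example; the original entries are never touched.

```python
def post_charge(entries, txn_id, guest, host, amount):
    entries.append({"txn": txn_id, "acct": guest, "delta": -amount, "reverses": None})
    entries.append({"txn": txn_id, "acct": host, "delta": +amount, "reverses": None})

def post_refund(entries, refund_txn_id, original_txn_id, amount):
    """Append entries opposite to the original legs, for `amount` cents."""
    original = [e for e in entries if e["txn"] == original_txn_id]
    if len(original) != 2:
        raise ValueError("sketch handles two-leg charges only")
    charged = max(e["delta"] for e in original)
    refunded = sum(
        e["delta"] for e in entries
        if e["reverses"] == original_txn_id and e["delta"] > 0
    )
    if amount <= 0 or refunded + amount > charged:
        raise ValueError("refund exceeds the remaining charged amount")
    for e in original:
        sign = 1 if e["delta"] < 0 else -1   # opposite direction of the original leg
        entries.append({
            "txn": refund_txn_id,
            "acct": e["acct"],
            "delta": sign * amount,
            "reverses": original_txn_id,     # link back for audit and reconciliation
        })

def balance(entries, acct):
    return sum(e["delta"] for e in entries if e["acct"] == acct)
```

After a 10000-cent charge and a 3000-cent refund, the guest's balance is -7000 and the host's is 7000, with all four entries still on the books.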

3. A celebrity merchant generates 1M transactions in a day, all on one account. What happens when the ledger is sharded by account_id, and how do you fix it?

Sharding by account_id lands all of that account's entries on one shard, creating a hot spot: write contention, and the materialized balance row becomes a lock hot spot (every transaction updates that account's balance). Fixes:

Balance bucketing: split the single balance row into N sub-balances, scatter writes randomly across buckets, and SUM the N buckets on read — like a sharded counter for high-concurrency counting.
Batched posting: buffer high-frequency small amounts and aggregate them into a few entries per window (trading real-time for throughput).
Async materialization: still append entries one by one (no lock contention), and let a downstream aggregate the balance off the main write path.
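Balance bucketing is the sharded-counter idea applied to money. In the sketch below a list stands in for N database rows keyed by (account_id, bucket_no); the class name and bucket count are arbitrary choices for illustration.

```python
import random

class BucketedBalance:
    """One hot account's balance split across N independently updated buckets."""
    def __init__(self, buckets=8, rng=None):
        self._buckets = [0] * buckets
        self._rng = rng or random.Random()

    def add(self, delta_cents):
        # Each write touches one random bucket, so concurrent writers
        # rarely contend for the same row.
        i = self._rng.randrange(len(self._buckets))
        self._buckets[i] += delta_cents
        return i

    def total(self):
        # Reads pay the cost instead: sum all N buckets.
        return sum(self._buckets)
```

One caveat of this design: an individual bucket can go negative while the total stays positive, so any overdraft check has to read `total()` rather than a single bucket.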

The key is to move "single-point balance update" off the write path — entries append lock-free, and the balance is computed eventually-consistently.

4. If the ledger's conservation (debits = credits) mathematically guarantees no entry is lost, why bother with reconciliation at all?

Because conservation only guarantees your ledger is internally self-consistent, not that it reflects external reality. Drift comes from boundaries: ① you think the PSP charged and recorded it, but the PSP actually failed (your ledger is still internally conserved, but disagrees with the PSP); ② the PSP took a fee you didn't record; ③ an unhandled timeout moved money at the PSP that you didn't record. None of these break internal conservation, yet they're real losses. Reconciliation is the only way to upgrade "internally consistent" to "consistent with reality" — it tests the system boundary, whereas conservation only tests the system interior. That's why reconciliation is the last line of defense.

5. "At-least-once delivery + idempotent consumption = effectively-once" — what's the precondition for this in payments, and when does it break?

Precondition: the consumer's idempotency key must map one-to-one to "the money movement," and the dedup record and side effect must land atomically. It breaks when:

Wrong key granularity: dedup on message_id, but the same business payment is triggered by two different messages → each passes dedup, money moves twice. Use the business payment id as the key.
Dedup and side effect not atomic: charge the PSP first, then write the dedup row, crash in between → on retry no dedup row, charge again. Must be in one transaction, or use a PSP-side idempotency key as a second layer.
Dedup record expires: TTL too short, so when the retry arrives the record is gone and it degrades to no idempotency.
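The first two failure modes can be made concrete with a consumer that keys dedup on the business payment id and writes the dedup row and the ledger entries in one transaction. sqlite3 is a stand-in, and the table names, message shape, and `handle` function are assumptions for this sketch. It only covers side effects that live in the same database; an external PSP call cannot join this transaction and still needs its own idempotency key.

```python
import sqlite3

def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE processed (payment_id TEXT PRIMARY KEY);
        CREATE TABLE entries (payment_id TEXT, acct TEXT, delta INTEGER);
    """)
    return conn

def handle(conn, message):
    """message: dict with 'message_id', 'payment_id', 'host', 'amount_cents'.
    Returns True if the payout was recorded, False if it was a duplicate."""
    pid, amount = message["payment_id"], message["amount_cents"]
    try:
        with conn:  # dedup row and entries commit together or not at all
            # Key on the business payment id, not the transport message_id.
            conn.execute("INSERT INTO processed (payment_id) VALUES (?)", (pid,))
            conn.execute("INSERT INTO entries VALUES (?, ?, ?)", (pid, "platform", -amount))
            conn.execute("INSERT INTO entries VALUES (?, ?, ?)", (pid, message["host"], amount))
        return True
    except sqlite3.IntegrityError:
        return False    # already processed: the whole transaction rolled back
```

Two messages with different `message_id` values but the same `payment_id` therefore produce one pair of entries, which is the behaviour a `message_id`-keyed dedup would fail to give.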

So "effectively-once" isn't free — it requires the key to align with business semantics + dedup atomic with the side effect + the downstream also idempotent, all at once.
