Design a marketplace payment platform (think Airbnb / Uber): charge the guest, pay out the host. 10M transactions/day, ~115 QPS average writes, ~3000 QPS at peak. This is one of the rare systems where correctness vastly outranks availability — better to refuse service than to double-charge, double-pay, or misrecord a single cent.
Key constraints to clarify in an interview:
unknown: you can assume neither success nor failure.
Responsibilities: ① the idempotency layer intercepts duplicate requests via an idempotency key; ② the Orchestrator uses Saga to chain "charge → record → payout" across services; ③ the Ledger Service is the single source of truth, where every money movement lands as immutable debit/credit entries; ④ the PSP Gateway wraps external processors; ⑤ the reconciliation job compares the internal ledger against PSP and bank statements line by line to surface breaks.
Core trade-off: spend a little storage and a state machine to guarantee that any retry never moves money twice.
[Principle] The client generates an idempotency key (UUID) per payment and sends it with the request. The server persists key → request state: a retry with the same key replays the prior result. The trick is more than dedup — split a payment into atomic phases (Stripe calls them recovery points): create local record → call PSP → record ledger, persisting progress after each phase. If the process crashes midway, a retry resumes from the last recovery point rather than restarting. This is essentially a persisted state machine.
# pseudo-code: idempotent charge with recovery-point resume
def charge(key, req):
    body_hash = fingerprint(req)
    rec = db.upsert_idempotency(key, body_hash)  # row-lock + body check
    if rec.body_hash != body_hash:
        raise Conflict("same key, different request")  # Stripe rejects this
    if rec.response:  # already done -> replay
        return rec.response
    if rec.phase < CREATED:
        db.create_payment(rec)
        rec.phase = CREATED; rec.save()  # recovery point, persisted
    if rec.phase < CHARGED:
        # PSP carries the key too: two-layer idempotency
        rec.psp_result = psp.charge(req, idem=key)
        rec.phase = CHARGED; rec.save()  # recovery point, persisted
    if rec.phase < LEDGERED:
        ledger.record(rec)
        rec.phase = LEDGERED; rec.save()  # recovery point, persisted
    rec.response = build_resp(rec); rec.save()
    return rec.response
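To make the recovery-point idea concrete, here is a minimal in-memory sketch of a crash between "PSP charged" and "phase persisted". Everything in it is hypothetical scaffolding for illustration: `FakePSP`, `PaymentService`, the `crash_after_charge` hook, and the string used as a request fingerprint are assumptions, and the dictionaries stand in for durable tables.

```python
from dataclasses import dataclass
from typing import Optional

# Phases are ordered so "phase < X" means "X has not happened yet".
NEW, CREATED, CHARGED, LEDGERED = range(4)


@dataclass
class IdemRecord:
    body_hash: str
    phase: int = NEW
    response: Optional[dict] = None


class FakePSP:
    """Hypothetical PSP stub that dedups on the idempotency key it receives."""

    def __init__(self) -> None:
        self.charges: dict[str, int] = {}
        self.calls = 0

    def charge(self, amount: int, idem: str) -> str:
        self.calls += 1
        self.charges.setdefault(idem, amount)  # a repeat with the same key is a no-op
        return f"psp-{idem}"


class PaymentService:
    def __init__(self, psp: FakePSP) -> None:
        self.psp = psp
        self.records: dict[str, IdemRecord] = {}  # stands in for a durable table
        self.ledger: list[tuple[str, int]] = []
        self.crash_after_charge = False  # hook that simulates a process crash

    def charge(self, key: str, amount: int) -> dict:
        body_hash = f"amount={amount}"  # stands in for a real request fingerprint
        rec = self.records.setdefault(key, IdemRecord(body_hash))
        if rec.body_hash != body_hash:
            raise ValueError("same key, different request")
        if rec.response is not None:
            return rec.response  # finished earlier: replay the stored result
        if rec.phase < CREATED:
            rec.phase = CREATED
        if rec.phase < CHARGED:
            self.psp.charge(amount, idem=key)
            if self.crash_after_charge:
                self.crash_after_charge = False
                raise RuntimeError("crashed before persisting CHARGED")
            rec.phase = CHARGED
        if rec.phase < LEDGERED:
            self.ledger.append((key, amount))
            rec.phase = LEDGERED
        rec.response = {"key": key, "amount": amount, "status": "succeeded"}
        return rec.response
```

If the crash hook fires, the retry resumes at the CREATED recovery point and calls the PSP a second time. Money still moves once because the PSP dedups on the same key, which is why the key has to travel with the PSP call as well.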
Core trade-off: give up the simplicity of "just update a balance" for an auditable, reconcilable source of truth that never loses an entry.
[Principle] Every money movement writes at least two entries (debit + credit), and the sum per transaction is always 0 — an invariant accounting has validated for 500 years that naturally guards against missed entries. The ledger is immutable (append only, never update/delete): corrections (refunds, fixes) are made via a reversing entry, not by mutating the original. An account balance is the aggregation of all its entries.
SUM is too slow, so you need periodic balance snapshots + incremental entries.
# transfer: two entries in one DB transaction, assert conservation
def transfer(from_acct, to_acct, amount, txn_id):  # amount is integer cents
    with db.transaction():
        e1 = Entry(txn=txn_id, acct=from_acct, delta=-amount)
        e2 = Entry(txn=txn_id, acct=to_acct, delta=+amount)
        assert e1.delta + e2.delta == 0  # double-entry invariant, checked before writing
        db.insert(e1); db.insert(e2)
# a refund is NOT a delete; it's a reversing txn referencing txn_id
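The "snapshot + incremental entries" idea can be sketched as follows. This is an illustration under assumptions: the `seq` field, the `Snapshot` type, and both function names are hypothetical, and a real system would read only the entries after the snapshot from an indexed table rather than filter a list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    seq: int    # monotonically increasing position in the append-only ledger
    acct: str
    delta: int  # integer cents; negative means money leaves this account


@dataclass(frozen=True)
class Snapshot:
    acct: str
    balance: int
    upto_seq: int  # every entry with seq <= upto_seq is already folded in


def balance(acct: str, snapshot: Optional[Snapshot], entries: list[Entry]) -> int:
    """Balance = snapshot balance + entries appended after the snapshot."""
    base = snapshot.balance if snapshot is not None else 0
    start = snapshot.upto_seq if snapshot is not None else 0
    return base + sum(e.delta for e in entries if e.acct == acct and e.seq > start)


def take_snapshot(acct: str, snapshot: Optional[Snapshot], entries: list[Entry]) -> Snapshot:
    """Fold every entry seen so far into a new snapshot for one account."""
    start = snapshot.upto_seq if snapshot is not None else 0
    upto = max([start] + [e.seq for e in entries])
    return Snapshot(acct, balance(acct, snapshot, entries), upto)


entries = [
    Entry(1, "guest", -5000),
    Entry(2, "host", +5000),
    Entry(3, "host", -1200),
    Entry(4, "bank", +1200),
]
snap = take_snapshot("host", None, entries[:2])  # taken after the first transfer
current = balance("host", snap, entries)         # only entry 3 is summed on top
```

The snapshot is a cache, not a second source of truth: it can be dropped and rebuilt from the entries at any time, so the ledger stays append-only.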
Core trade-off: 2PC gives strong consistency but blocks on external systems that cannot join the transaction; Saga gives availability but only eventual consistency, and you hand-write compensations.
[Principle] A full payment spans multiple services and an external PSP: charge guest → record ledger → trigger host payout. 2PC is unworkable here: the external PSP won't join your transaction coordinator, locks held across the network are held too long, and the coordinator is a single point of failure. Use a Saga: a chain of local transactions in which a failed step is undone by running compensating actions for the steps already completed. The atomicity of "change local state" and "emit the next step" is guaranteed by the Outbox pattern (write the business table + outbox table in one transaction, then a relay delivers).
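A minimal sketch of the compensation mechanics, assuming an in-process orchestrator: `run_saga`, the `Step` tuple shape, and the step functions are hypothetical names chosen for illustration. A production orchestrator would persist its progress and retry compensations, which must themselves be idempotent.

```python
from typing import Callable, Optional

# (name, action, compensation); compensation is None when a step has nothing to undo.
Step = tuple[str, Callable[[dict], None], Optional[Callable[[dict], None]]]


def run_saga(steps: list[Step], ctx: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse order."""
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action(ctx)
            done.append(step)
        except Exception as exc:
            ctx["log"].append(f"{name} failed: {exc}")
            for prev_name, _, compensate in reversed(done):
                if compensate is not None:
                    compensate(ctx)
                    ctx["log"].append(f"compensated {prev_name}")
            return False
    return True


def charge_guest(ctx: dict) -> None:
    ctx["charged"] = True


def refund_guest(ctx: dict) -> None:
    ctx["charged"] = False


def record_ledger(ctx: dict) -> None:
    raise RuntimeError("ledger unavailable")  # simulated failure


ctx = {"log": [], "charged": False}
ok = run_saga(
    [
        ("charge_guest", charge_guest, refund_guest),
        ("record_ledger", record_ledger, None),
    ],
    ctx,
)
```

The payout step is left out of the sketch on purpose: once the guest's money has been received, a failed payout is retried forward rather than compensated.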
# Outbox: local state + message land atomically; a relay delivers async
def step_charge_then_payout(payment):
    with db.transaction():
        payment.status = "CHARGED"; db.save(payment)
        db.insert(Outbox(event="StartPayout", payload=payment.id))  # same txn
# a separate relay polls the outbox -> emits -> marks sent
# a failed payout is NOT compensated (money received); forward-retry to success
Core trade-off: real-time reconciliation is expensive but catches issues early; T+1 batch is cheap but exposes losses late.
[Principle] The internal ledger is "the money you think you have"; PSP/bank statements are "the money you actually have," and the two inevitably drift: fees, FX, refund timing, PSP bugs, unhandled timeouts. Run 3-way reconciliation daily: internal ledger ↔ PSP statement ↔ bank, joined line by line on external_id, surfacing breaks (missing, amount mismatch, status mismatch), auto-classified (fee items auto-settled, suspected losses alerted), the rest handled manually. Reconciliation is a payment system's last line of defense — every bug in your idempotency/Saga logic eventually shows up here.
# pseudo-code: reconciliation matching (ledger <-> PSP leg; the bank leg joins on external_id the same way)
def reconcile(ledger_rows, psp_rows):
    breaks = []
    by_ext = {r.external_id: r for r in psp_rows}
    for L in ledger_rows:
        P = by_ext.pop(L.external_id, None)
        if P is None:
            breaks.append(("MISSING_IN_PSP", L, None))  # I recorded, PSP didn't
        elif P.amount != L.amount:
            breaks.append(("AMOUNT_MISMATCH", L, P))  # usually a fee
        elif P.status != L.status:
            breaks.append(("STATUS_MISMATCH", L, P))
    for leftover in by_ext.values():
        breaks.append(("MISSING_IN_LEDGER", None, leftover))  # PSP has it, I missed -> loss risk
    return breaks
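Once breaks are surfaced, the auto-classification step might look like the sketch below. The `Break` type, the flat `expected_fee` parameter, and the three outcome labels are assumptions for illustration; real fee schedules are usually percentage-based and vary by PSP and payment method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Break:
    kind: str  # "MISSING_IN_PSP", "AMOUNT_MISMATCH", or "MISSING_IN_LEDGER"
    external_id: str
    ledger_amount: Optional[int] = None  # integer cents
    psp_amount: Optional[int] = None     # integer cents


def classify(b: Break, expected_fee: int) -> str:
    """Return "AUTO_SETTLE", "ALERT", or "MANUAL" for one break."""
    if b.kind == "AMOUNT_MISMATCH":
        if b.ledger_amount is None or b.psp_amount is None:
            return "MANUAL"
        # The PSP settled less than the ledger recorded by exactly the known fee.
        if b.ledger_amount - b.psp_amount == expected_fee:
            return "AUTO_SETTLE"
        return "MANUAL"
    if b.kind == "MISSING_IN_LEDGER":
        return "ALERT"  # money moved at the PSP with no ledger record: suspected loss
    return "MANUAL"


fee_break = Break("AMOUNT_MISMATCH", "ext-1", ledger_amount=10000, psp_amount=9700)
decision = classify(fee_break, expected_fee=300)
```

Keeping the auto-settle rule narrow (an exact match against a known fee) is a deliberate choice in this sketch: anything the rule cannot explain falls through to a human instead of being silently absorbed.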
account_id; cross-account transfers land in the same logical shard or use a local transaction within a 2PC scope.
Frequent interview follow-ups: ① how do you avoid double payouts when the client retries? ② the PSP times out: how do you know whether money actually moved? ③ how do you record a refund (why can't you edit the original)? ④ why don't payments use 2PC? ⑤ how do you reconcile "to the cent"? Which breaks auto-settle and which need a human?
A timeout is the most dangerous state: the request may have stalled at "PSP charged but response lost." The chain:
PENDING, never write success or failed.
Essence: converge "uncertain" into "certain" via idempotency + reconciliation, instead of guessing.
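A minimal sketch of converging a timed-out charge, assuming the PSP offers a lookup by idempotency key. `FlakyPSP`, its `lookup` method, and `resolve_pending` are hypothetical; check what query API your actual PSP provides before relying on this shape.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "PENDING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class PSPTimeout(Exception):
    pass


class FlakyPSP:
    """Hypothetical PSP stub: the first charge succeeds remotely but its response is lost."""

    def __init__(self) -> None:
        self.by_key: dict[str, Status] = {}
        self.lose_next_response = True

    def charge(self, key: str, amount: int) -> Status:
        self.by_key.setdefault(key, Status.SUCCEEDED)  # dedup on the key
        if self.lose_next_response:
            self.lose_next_response = False
            raise PSPTimeout(key)
        return self.by_key[key]

    def lookup(self, key: str) -> Optional[Status]:
        return self.by_key.get(key)


def charge(psp: FlakyPSP, payments: dict, key: str, amount: int) -> Status:
    payments[key] = Status.PENDING  # persisted before the external call
    try:
        payments[key] = psp.charge(key, amount)
    except PSPTimeout:
        pass  # outcome unknown: leave PENDING, do not guess
    return payments[key]


def resolve_pending(psp: FlakyPSP, payments: dict, key: str, amount: int) -> Status:
    """Converge a PENDING payment by asking the PSP, then retrying with the same key."""
    if payments[key] is not Status.PENDING:
        return payments[key]
    known = psp.lookup(key)
    if known is None:
        # The PSP has no record; re-sending with the same key cannot double-charge.
        try:
            known = psp.charge(key, amount)
        except PSPTimeout:
            return Status.PENDING  # still unknown; a later run or reconciliation decides
    payments[key] = known
    return known
```

Anything still PENDING after repeated resolution attempts is left for reconciliation to settle against the PSP statement.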
Because the ledger is immutable — editing/deleting the original destroys the audit trail and reconcilability. A refund writes a reversing entry referencing the original txn_id, with opposite direction. Benefits: ① complete, traceable history that regulators and reconciliation can reconstruct; ② the original payment still truly happened (fees, taxes, reports all depend on it); ③ a partial refund just writes a partial-amount reversing entry and the balance is naturally correct. This is the heart of double-entry's "append-only + reversing": errors aren't erased, they're corrected with new entries — the same logic as Git revert vs force-push.
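A sketch of a partial refund as a reversing transaction, assuming a simple two-entry payment. The `reverses` field, `refund`, and `balance` are hypothetical names, and the list stands in for an append-only table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    txn: str
    acct: str
    delta: int                      # integer cents
    reverses: Optional[str] = None  # txn_id of the payment being reversed, if any


def refund(ledger: list[Entry], original_txn: str, refund_txn: str, amount: int) -> None:
    """Append a reversing transaction for `amount` cents of `original_txn`."""
    payer = next((e for e in ledger if e.txn == original_txn and e.delta < 0), None)
    payee = next((e for e in ledger if e.txn == original_txn and e.delta > 0), None)
    if payer is None or payee is None:
        raise ValueError("unknown original transaction")
    # Sum what earlier refunds already returned to the payer.
    already = sum(e.delta for e in ledger if e.reverses == original_txn and e.delta > 0)
    if amount <= 0 or already + amount > payee.delta:
        raise ValueError("refund exceeds the remaining refundable amount")
    # Opposite direction to the original; the original entries are never touched.
    ledger.append(Entry(refund_txn, payee.acct, -amount, reverses=original_txn))
    ledger.append(Entry(refund_txn, payer.acct, +amount, reverses=original_txn))


def balance(ledger: list[Entry], acct: str) -> int:
    return sum(e.delta for e in ledger if e.acct == acct)


ledger = [Entry("t1", "guest", -10000), Entry("t1", "host", +10000)]
refund(ledger, "t1", "r1", 3000)  # partial refund of 30.00
host_balance = balance(ledger, "host")
```

Because each refund is its own balanced transaction, the ledger still sums to zero after any number of partial refunds, and the cap on the refundable amount falls out of the same entries.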
Sharding by account_id lands all of that account's entries on one shard, creating a hot spot: write contention, and the materialized balance row becomes a lock hot spot (every transaction updates that account's balance). Fixes:
The key is to move "single-point balance update" off the write path — entries append lock-free, and the balance is computed eventually-consistently.
Because conservation only guarantees your ledger is internally self-consistent, not that it reflects external reality. Drift comes from boundaries: ① you think the PSP charged and recorded it, but the PSP actually failed (your ledger is still internally conserved, but disagrees with the PSP); ② the PSP took a fee you didn't record; ③ an unhandled timeout moved money at the PSP that you didn't record. None of these break internal conservation, yet they're real losses. Reconciliation is the only way to upgrade "internally consistent" to "consistent with reality" — it tests the system boundary, whereas conservation only tests the system interior. That's why reconciliation is the last line of defense.
Precondition: the consumer's idempotency key must map one-to-one to "the money movement," and the dedup record and side effect must land atomically. It breaks when:
So "effectively-once" isn't free — it requires the key to align with business semantics + dedup atomic with the side effect + the downstream also idempotent, all at once.