Design a cross-border e-commerce checkout flow. A single checkout must complete 5 things at once:
Scale: 5k peak TPS checkout, P99 ≤ 800ms end-to-end, 0 oversells, 0 duplicate charges. The 5 services are owned by 5 teams with no shared database. Every intermediate step may fail: service crash, network partition, Stripe 504, message loss.
Question: how do these 5 steps guarantee "all succeed or none happen (or compensate observably)"? This is the distributed transactions problem. Today: why textbook 2PC/3PC is almost never used in production, how Saga actually lands, how Outbox solves dual-write, and why "the best distributed transaction is none at all."
graph TD
Client["Client"]
Orch["Saga Orchestrator
Temporal / Cadence
persists every step"]
subgraph S1["Order Domain"]
OrdSvc["Order Service"]
OrdDB[("Postgres
orders + outbox")]
end
subgraph S2["Inventory Domain"]
InvSvc["Inventory Service"]
InvDB[("Postgres
reservations + outbox")]
end
subgraph S3["Payment Domain"]
PaySvc["Payment Service"]
Stripe["Stripe API
external, not rollbackable"]
end
subgraph S4["Points Domain"]
PtsSvc["Points Service"]
PtsDB[("DynamoDB
+ idempotency table")]
end
subgraph S5["Notification Domain"]
Notify["Notify Service"]
end
CDC["Debezium CDC
reads outbox table"]
Kafka["Kafka
order.events"]
Client --> Orch
Orch -->|1 createOrder| OrdSvc
Orch -->|2 reserveStock| InvSvc
Orch -->|3 charge| PaySvc
Orch -->|4 deductPoints| PtsSvc
Orch -.->|5 async notify| Kafka
OrdSvc --> OrdDB
InvSvc --> InvDB
PaySvc --> Stripe
PtsSvc --> PtsDB
OrdDB -.-> CDC
InvDB -.-> CDC
CDC --> Kafka
Kafka --> Notify
classDef orch fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
classDef ext fill:#3a2010,stroke:#ffb450,color:#e8eef5
class Orch orch
class Stripe ext
Core idea: a Saga (Temporal/Cadence) chains the 5 steps; each is a local transaction with a reverse compensation on failure. Event fan-out goes through Outbox + CDC (DB commit ⇒ Kafka emit is guaranteed). Weak-transaction work like notifications goes eventual. The whole system has zero 2PC.
Mechanism: Two-Phase Commit uses a coordinator to orchestrate multiple participants:
Correctness gives ACID atomicity across nodes. But:
# 2PC disaster — why Pat Helland called it a "Maginot Line"
# T = 0 coord -> all: PREPARE
# T = 10ms all -> coord: YES (each locks resources, writes prepare log)
# T = 11ms coord crashes ⚠
# T = ∞ participants hold locks waiting for coord to recover
# other transactions blocked, business halts
#
# Even when coord restarts, it must replay its transaction log
# before deciding; participants stay locked. This is the worst
# failure mode in distributed systems: "one process death stalls
# the entire system."
Origin: Garcia-Molina & Salem introduced Saga at SIGMOD 1987, originally for "long-running transactions inside a single DB" (to avoid long-held locks). Microservices adopted it: split a cross-service transaction into a chain of local transactions T1…Tn; if any Ti fails, execute compensations C(i-1)…C1 in reverse order for completed steps.
You lose ACID isolation but gain eventual consistency + business-level reversibility. Two landing patterns:
| Pattern | Coordinator | Pros | Cons | Use when |
|---|---|---|---|---|
| Orchestration | Central orchestrator (Temporal / Cadence / Step Functions) explicitly calls each step | Logic is centralized and readable; clear compensation path; easy to debug/visualize | Orchestrator is a new component to operate; could become a bottleneck | Many steps (≥4), complex compensations, audit needs |
| Choreography | No center; services subscribe to events and act | Fully decoupled; no single point | Flow is scattered across services; one change affects N services; hard to trace | Few steps (≤3), naturally event-driven within a bounded context |
sequenceDiagram
participant C as Client
participant O as Orchestrator (Temporal)
participant Ord as Order Svc
participant Inv as Inventory Svc
participant Pay as Payment (Stripe)
participant Pts as Points Svc
C->>O: placeOrder()
O->>Ord: 1 createOrder (PENDING)
Ord-->>O: orderId=42
O->>Inv: 2 reserveStock(sku, qty)
Inv-->>O: reserved
O->>Pay: 3 charge($100, idempKey=42)
Pay-->>O: paid (charge_xyz)
O->>Pts: 4 deductPoints(user, 50, idempKey=42)
Pts--xO: fail! insufficient points
Note over O: Saga reverse compensations
O->>Pay: C3 refund(charge_xyz, idempKey=42-refund)
O->>Inv: C2 releaseStock(sku, qty, idempKey=42-rel)
O->>Ord: C1 markFailed(42, reason="points")
O-->>C: failed: insufficient points, fully rolled back
# Temporal Workflow-style pseudocode (Python SDK)
@workflow.defn
class PlaceOrderSaga:
@workflow.run
async def run(self, req: OrderReq) -> Result:
compensations = [] # stack of comp actions for completed steps
try:
order = await activity.execute(create_order, req)
compensations.append(lambda: mark_failed(order.id))
await activity.execute(reserve_stock, req.sku, req.qty)
compensations.append(lambda: release_stock(req.sku, req.qty))
charge = await activity.execute(
stripe_charge,
amount=req.amount,
idempotency_key=f"order-{order.id}" # key: idempotency
)
compensations.append(
lambda: stripe_refund(charge.id, key=f"refund-{order.id}")
)
await activity.execute(deduct_points, req.user, req.points,
key=f"pts-{order.id}")
return Result.ok(order.id)
except ActivityError as e:
# Run compensations in reverse. Temporal guarantees at-least-once + retry.
for comp in reversed(compensations):
await activity.execute(comp, retry_policy=infinite_backoff)
return Result.fail(reason=str(e))
# Key properties:
# - Temporal persists workflow state to an event log; on crash it replays and resumes.
# - Every activity must be idempotent (because it may retry).
# - Compensations must be idempotent AND not allowed to fail forever
# (a failed compensation needs human intervention).
Saga primitive; 1000+ services inside Uber use it, with Cadence 1.0 announced in 2024. Order fulfillment and driver dispatch run as saga workflows.Problem: a service must "write DB" + "publish Kafka event." The two are not in one transaction, so there is always an inconsistency window:
This is the dual write problem, the most common transactional pitfall in microservices.
Outbox solution:
outbox table are written in the same local DB transaction. Local = ACID; both succeed or both fail.
graph LR
App["Order Service"]
DB[("Postgres
orders + outbox
same local txn")]
Debez["Debezium
reads WAL"]
K["Kafka
order.events"]
Cons1["Inventory Consumer"]
Cons2["Analytics Consumer"]
Cons3["Notification Consumer"]
App -- "BEGIN
INSERT orders
INSERT outbox
COMMIT" --> DB
DB -. WAL .-> Debez
Debez --> K
K --> Cons1
K --> Cons2
K --> Cons3
# Outbox in practice — Postgres + Debezium
CREATE TABLE orders (id BIGSERIAL PK, user_id ..., amount ..., status ...);
CREATE TABLE outbox (
id BIGSERIAL PK,
aggregate_id BIGINT, -- order id (used as partition key)
event_type TEXT, -- "order.created" / "order.cancelled"
payload JSONB,
created_at TIMESTAMPTZ DEFAULT now()
);
-- Business code: write orders and outbox event in the same txn
BEGIN;
INSERT INTO orders (...) VALUES (...) RETURNING id;
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (42, 'order.created', '{"user":7,"amount":100}');
COMMIT; -- atomic: both succeed or neither does
-- Debezium attaches to the Postgres replication slot,
-- tails the WAL → watches outbox INSERTs → transforms → pushes Kafka.
-- The business code doesn't need to know Kafka exists — fully decoupled.
# Why not CDC on the business table directly?
# - Business schema changes often; downstreams become tightly coupled.
# - A single row change may correspond to 0 or many domain events
# (an order INSERT may emit "order.created" + "loyalty.points.eligible").
# Outbox lets business code control event granularity.
# Why not just call Kafka producer in the same function?
# - Producer.send() is not in the DB transaction — dual write problem.
# - Even with Kafka transactions, you cannot span Postgres + Kafka.
Mechanism: in distributed systems every RPC, message, and callback is at-least-once — a network timeout can't distinguish "request lost" from "response lost," so retries are inevitable. Idempotency guarantees "executing the same operation N times has the same effect as executing it once" — it is the safety net for saga retries, outbox redelivery, and compensations.
| Strategy | Mechanism | Best for |
|---|---|---|
| Naturally idempotent ops | SET x=5, DELETE id=42 — same result if repeated | Express as set when possible |
| Idempotency key + dedup table | Client generates a UUID; server stores (key → result); retries return cached result | Payments, orders, external callbacks (Stripe is the canonical example) |
| Optimistic locking / version | UPDATE … WHERE version=X, discard conflicts | Concurrent updates on the same resource |
| Consumer-side dedup | When consuming Kafka, write event_id to dedup table; skip on duplicate | Outbox / event sourcing consumers |
# Stripe-style idempotency key, server-side (Postgres)
# Reference: https://stripe.com/blog/idempotency
CREATE TABLE idempotency (
key TEXT PRIMARY KEY,
request_hash TEXT NOT NULL, -- detect key reuse with different bodies
response_code INT,
response_body JSONB,
status TEXT, -- 'in_progress' | 'completed'
locked_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT now()
);
def handle(key, req):
h = hash(req.body)
row = db.fetch_for_update("SELECT * FROM idempotency WHERE key=%s", key)
if row and row.request_hash != h:
raise Conflict("idempotency key reused with different body")
if row and row.status == 'completed':
return row.response_body # return cached response
if row and row.status == 'in_progress':
if now() - row.locked_at > LEASE:
pass # lease expired, retry ok
else:
raise Conflict("request in flight")
# claim
db.upsert("idempotency", key=key, request_hash=h,
status='in_progress', locked_at=now())
try:
resp = do_business(req)
db.update("idempotency", key=key,
status='completed', response_code=200,
response_body=resp)
return resp
except RetryableError:
# don't update; let client retry
raise
except FatalError as e:
db.update("idempotency", key=key,
status='completed', response_code=500,
response_body={"error": str(e)})
raise
# Expiry policy: Stripe keeps keys for 24h, then prunes. Clients should
# not assume keys live forever.
Interview follow-ups:
The classic saga interview question. The core issue is that the orchestrator's own failure must be recoverable — otherwise you've just moved the single point from 2PC's coordinator to your saga orchestrator, without solving anything.
Forward flow:
Compensation: T2 fails → C1: A.credit(100, idempKey=tx-42-comp) (not a refund, a reverse credit).
Critical failure modes:
Deeper reflection: this example is often used to "prove" saga is weak. But in real banking systems, transfers don't rely on distributed transactions at all — they use double-entry bookkeeping: the ledger isn't "modify A, modify B," but "record a debit, record a credit"; the sum of both must be zero. Any imbalance is caught by end-of-day reconciliation. This is centuries-old financial engineering wisdom, long predating distributed databases. Lesson: many "distributed-transaction needs" disappear when you change the data model.
Scenario: an e-commerce saga has steps "deduct inventory → deduct balance → ship." User A triggers a saga; inventory deduction succeeds (stock -1), balance deduction not yet started. Another query endpoint is hit and reads stock as -1; the recommender concludes "low stock" and sends user B a "limited-time offer" push. Then A's saga fails balance deduction and compensates (stock +1).
Result: user B got a marketing push based on an inventory reduction that never actually happened, and may immediately order only to find stock is fine — business logic disordered. This is the isolation violation caused by saga's "intermediate state leak." Under ACID, uncommitted transactions are invisible to others, so this cannot happen.
Solutions, from cheap to expensive:
pending column on inventory. During the saga, mark pending=true; other readers know "this is a saga in-flight state" and decide accordingly (the recommender can ignore pending rows).Underlying principle: Saga trades "eventual consistency + visible intermediate state" for "cross-service feasibility + high throughput." If the business cannot tolerate visible intermediate state, that data should not cross services — it should stay in one service with local transactions. Back to "draw the right domain boundary."
Real scenario: after placing an order, the user must immediately see it in "My Orders." If outbox + Debezium has 2s lag, the user wonders "where's my order?" This is the conflict between synchronous requirements and async outbox.
Approach 1: Dual-read (recommended, cheap)
Approach 2: Dual-write + tolerate inconsistency (for "occasionally lose a bit" cases)
Approach 3: Synchronous outbox (heavy, edge cases only)
Deeper reflection: "instant consistency" is almost always over-engineering. Google's and Amazon's search both have index lag. Ask the business: "really, not even 1 second?" — 80% of the time the answer is "5 seconds is fine." For the remaining 20%, Approach 1 (dual-read) usually suffices. Treating "instant consistency" as default is a hidden tax on performance and availability.
This is the layered-decision skill required of architects. Each subdomain in one system should pick both consistency and transaction strategy independently.
| Subdomain | Consistency (Day 6) | Transaction strategy (Day 7) | Implementation |
|---|---|---|---|
| Inventory deduction | Linearizable | Single-DB local txn + row lock / CAS | Postgres SELECT FOR UPDATE or Spanner |
| Checkout (across 5 services) | Eventual + compensation | Saga (Temporal orchestration) | Per-step local txn + reverse compensation + idempotency |
| Payments (Stripe) | External, at-most-once charge | Idempotency key + retry + reconciliation | Stripe key = saga_id + end-of-day reconciliation |
| Order status fanout (search, recs, msgs) | Eventual | Outbox + Kafka + at-least-once consumers | Debezium → Kafka → multiple consumers |
| "My Orders" display | Read-your-writes | Primary DB + secondary-index dual read | Session token carries LSN + dual read |
| User analytics / telemetry | Eventual + lossy ok | Fire-and-forget + batch | Local buffer → async Kafka push |
Overall architecture:
Anti-patterns:
The architect's core skill: be able to draw this table in 5 minutes for a new business, with every row backed by numbers (QPS, SLO, cost of violation). That's the real boundary between "senior" and "architect."
This is the essence of Pat Helland's paper and the mark of microservice design maturity. When you find 90% of flows cross the same two services and saga compensations grow ever more complex, your boundaries are wrong.
Real case 1: e-commerce "order + inventory" used to be two services
Real case 2: banking transfers replace distributed transactions with double-entry bookkeeping
Real case 3: event sourcing collapses multi-service state into a single source of truth
Philosophical takeaway: complexity in distributed systems never disappears, only moves. Saga moves "coordinator complexity" to "compensation complexity"; double-entry moves "cross-service transaction" to "data-model design"; event sourcing moves "cross-service consistency" to "event schema evolution."
Mark of a mature architect: identify where complexity hides and put it where it is easiest to understand and evolve. Not eliminate complexity, but place it well. This principle applies to all design decisions — fundamentally the same thinking as "push inconsistency to the UX layer, keep consistency at the DB layer."