Day 7 Hard Distributed Systems Transactions Saga / Outbox Microservices

Distributed Transactions — Don't. And if you must, keep them tiny.2PC / 3PC · Saga (Orchestration vs Choreography) · Transactional Outbox · Idempotency · Why Microservices Avoid Distributed Transactions

Scenario & Constraints

Design a cross-border e-commerce checkout flow. A single checkout must complete 5 things at once:

  1. Order service writes the order (own Postgres)
  2. Inventory service decrements stock (own Postgres)
  3. Payment service calls Stripe to charge (external API, not rollbackable)
  4. Points service deducts loyalty points (own DynamoDB)
  5. Notification service sends email + push (external SES / FCM)

Scale: 5k peak TPS checkout, P99 ≤ 800ms end-to-end, 0 oversells, 0 duplicate charges. The 5 services are owned by 5 teams with no shared database. Every intermediate step may fail: service crash, network partition, Stripe 504, message loss.

Question: how do these 5 steps guarantee "all succeed or none happen (or compensate observably)"? This is the distributed transactions problem. Today: why textbook 2PC/3PC is almost never used in production, how Saga actually lands, how Outbox solves dual-write, and why "the best distributed transaction is none at all."

High-level Architecture

graph TD
    Client["Client"]
    Orch["Saga Orchestrator
Temporal / Cadence
persists every step
"] subgraph S1["Order Domain"] OrdSvc["Order Service"] OrdDB[("Postgres
orders + outbox")] end subgraph S2["Inventory Domain"] InvSvc["Inventory Service"] InvDB[("Postgres
reservations + outbox")] end subgraph S3["Payment Domain"] PaySvc["Payment Service"] Stripe["Stripe API
external, not rollbackable"] end subgraph S4["Points Domain"] PtsSvc["Points Service"] PtsDB[("DynamoDB
+ idempotency table")] end subgraph S5["Notification Domain"] Notify["Notify Service"] end CDC["Debezium CDC
reads outbox table"] Kafka["Kafka
order.events"] Client --> Orch Orch -->|1 createOrder| OrdSvc Orch -->|2 reserveStock| InvSvc Orch -->|3 charge| PaySvc Orch -->|4 deductPoints| PtsSvc Orch -.->|5 async notify| Kafka OrdSvc --> OrdDB InvSvc --> InvDB PaySvc --> Stripe PtsSvc --> PtsDB OrdDB -.-> CDC InvDB -.-> CDC CDC --> Kafka Kafka --> Notify classDef orch fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 classDef ext fill:#3a2010,stroke:#ffb450,color:#e8eef5 class Orch orch class Stripe ext

Core idea: a Saga (Temporal/Cadence) chains the 5 steps; each is a local transaction with a reverse compensation on failure. Event fan-out goes through Outbox + CDC (DB commit ⇒ Kafka emit is guaranteed). Weak-transaction work like notifications goes eventual. The whole system has zero 2PC.

Key Techniques

1. 2PC / 3PC: textbook answer, production liability

Mechanism: Two-Phase Commit uses a coordinator to orchestrate multiple participants:

Correctness gives ACID atomicity across nodes. But:

2PC's three fatal flaws Who still uses 2PC: legacy enterprise XA (Oracle + MQ + another Oracle, one DBA team); distributed-DB internals (Spanner uses 2PC for cross-shard, but each participant is itself a Paxos group, so no single-point block). Across microservices or external APIs → never 2PC.
# 2PC disaster — why Pat Helland called it a "Maginot Line"
# T = 0    coord -> all: PREPARE
# T = 10ms all -> coord: YES (each locks resources, writes prepare log)
# T = 11ms coord crashes ⚠
# T = ∞    participants hold locks waiting for coord to recover
#          other transactions blocked, business halts
#
# Even when coord restarts, it must replay its transaction log
# before deciding; participants stay locked. This is the worst
# failure mode in distributed systems: "one process death stalls
# the entire system."
Real cases:

2. Saga: break one big transaction into N local transactions + N-1 compensations

Origin: Garcia-Molina & Salem introduced Saga at SIGMOD 1987, originally for "long-running transactions inside a single DB" (to avoid long-held locks). Microservices adopted it: split a cross-service transaction into a chain of local transactions T1…Tn; if any Ti fails, execute compensations C(i-1)…C1 in reverse order for completed steps.

You lose ACID isolation but gain eventual consistency + business-level reversibility. Two landing patterns:

PatternCoordinatorProsConsUse when
OrchestrationCentral orchestrator (Temporal / Cadence / Step Functions) explicitly calls each stepLogic is centralized and readable; clear compensation path; easy to debug/visualizeOrchestrator is a new component to operate; could become a bottleneckMany steps (≥4), complex compensations, audit needs
ChoreographyNo center; services subscribe to events and actFully decoupled; no single pointFlow is scattered across services; one change affects N services; hard to traceFew steps (≤3), naturally event-driven within a bounded context
sequenceDiagram
    participant C as Client
    participant O as Orchestrator (Temporal)
    participant Ord as Order Svc
    participant Inv as Inventory Svc
    participant Pay as Payment (Stripe)
    participant Pts as Points Svc

    C->>O: placeOrder()
    O->>Ord: 1 createOrder (PENDING)
    Ord-->>O: orderId=42
    O->>Inv: 2 reserveStock(sku, qty)
    Inv-->>O: reserved
    O->>Pay: 3 charge($100, idempKey=42)
    Pay-->>O: paid (charge_xyz)
    O->>Pts: 4 deductPoints(user, 50, idempKey=42)
    Pts--xO: fail! insufficient points
    Note over O: Saga reverse compensations
    O->>Pay: C3 refund(charge_xyz, idempKey=42-refund)
    O->>Inv: C2 releaseStock(sku, qty, idempKey=42-rel)
    O->>Ord: C1 markFailed(42, reason="points")
    O-->>C: failed: insufficient points, fully rolled back
# Temporal Workflow-style pseudocode (Python SDK)
@workflow.defn
class PlaceOrderSaga:
    @workflow.run
    async def run(self, req: OrderReq) -> Result:
        compensations = []          # stack of comp actions for completed steps
        try:
            order = await activity.execute(create_order, req)
            compensations.append(lambda: mark_failed(order.id))

            await activity.execute(reserve_stock, req.sku, req.qty)
            compensations.append(lambda: release_stock(req.sku, req.qty))

            charge = await activity.execute(
                stripe_charge,
                amount=req.amount,
                idempotency_key=f"order-{order.id}"   # key: idempotency
            )
            compensations.append(
                lambda: stripe_refund(charge.id, key=f"refund-{order.id}")
            )

            await activity.execute(deduct_points, req.user, req.points,
                                   key=f"pts-{order.id}")
            return Result.ok(order.id)

        except ActivityError as e:
            # Run compensations in reverse. Temporal guarantees at-least-once + retry.
            for comp in reversed(compensations):
                await activity.execute(comp, retry_policy=infinite_backoff)
            return Result.fail(reason=str(e))

# Key properties:
#  - Temporal persists workflow state to an event log; on crash it replays and resumes.
#  - Every activity must be idempotent (because it may retry).
#  - Compensations must be idempotent AND not allowed to fail forever
#    (a failed compensation needs human intervention).
Compensation is not rollback. Rollback pretends nothing happened (DB undo log). Compensation is "do the reverse operation to reach an equivalent outcome." A charge can be refunded, but a sent email is sent — compensation is "send a 'sorry, your order failed' follow-up." This forces the business to accept "visible intermediate states" and is Saga's biggest semantic gap from ACID.
Real cases:

3. Outbox Pattern: kill "dual write," demote distributed transactions to local ones

Problem: a service must "write DB" + "publish Kafka event." The two are not in one transaction, so there is always an inconsistency window:

This is the dual write problem, the most common transactional pitfall in microservices.

Outbox solution:

  1. The business table and the outbox table are written in the same local DB transaction. Local = ACID; both succeed or both fail.
  2. A separate relay process (Debezium-style CDC reading the WAL/binlog) asynchronously ships new outbox rows to Kafka.
  3. The relay tracks its own offset, delivers at-least-once; consumers dedupe on outbox row ID.
graph LR
    App["Order Service"]
    DB[("Postgres
orders + outbox
same local txn")] Debez["Debezium
reads WAL"] K["Kafka
order.events"] Cons1["Inventory Consumer"] Cons2["Analytics Consumer"] Cons3["Notification Consumer"] App -- "BEGIN
INSERT orders
INSERT outbox
COMMIT" --> DB DB -. WAL .-> Debez Debez --> K K --> Cons1 K --> Cons2 K --> Cons3
# Outbox in practice — Postgres + Debezium
CREATE TABLE orders (id BIGSERIAL PK, user_id ..., amount ..., status ...);
CREATE TABLE outbox (
    id BIGSERIAL PK,
    aggregate_id BIGINT,        -- order id (used as partition key)
    event_type TEXT,            -- "order.created" / "order.cancelled"
    payload JSONB,
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Business code: write orders and outbox event in the same txn
BEGIN;
  INSERT INTO orders (...) VALUES (...) RETURNING id;
  INSERT INTO outbox (aggregate_id, event_type, payload)
       VALUES (42, 'order.created', '{"user":7,"amount":100}');
COMMIT;        -- atomic: both succeed or neither does

-- Debezium attaches to the Postgres replication slot,
-- tails the WAL → watches outbox INSERTs → transforms → pushes Kafka.
-- The business code doesn't need to know Kafka exists — fully decoupled.

# Why not CDC on the business table directly?
#  - Business schema changes often; downstreams become tightly coupled.
#  - A single row change may correspond to 0 or many domain events
#    (an order INSERT may emit "order.created" + "loyalty.points.eligible").
#    Outbox lets business code control event granularity.

# Why not just call Kafka producer in the same function?
#  - Producer.send() is not in the DB transaction — dual write problem.
#  - Even with Kafka transactions, you cannot span Postgres + Kafka.
Outbox trade-offs:
Real cases:

4. Idempotency: the last line of defense for distributed transactions

Mechanism: in distributed systems every RPC, message, and callback is at-least-once — a network timeout can't distinguish "request lost" from "response lost," so retries are inevitable. Idempotency guarantees "executing the same operation N times has the same effect as executing it once" — it is the safety net for saga retries, outbox redelivery, and compensations.

StrategyMechanismBest for
Naturally idempotent opsSET x=5, DELETE id=42 — same result if repeatedExpress as set when possible
Idempotency key + dedup tableClient generates a UUID; server stores (key → result); retries return cached resultPayments, orders, external callbacks (Stripe is the canonical example)
Optimistic locking / versionUPDATE … WHERE version=X, discard conflictsConcurrent updates on the same resource
Consumer-side dedupWhen consuming Kafka, write event_id to dedup table; skip on duplicateOutbox / event sourcing consumers
# Stripe-style idempotency key, server-side (Postgres)
# Reference: https://stripe.com/blog/idempotency

CREATE TABLE idempotency (
    key TEXT PRIMARY KEY,
    request_hash TEXT NOT NULL,     -- detect key reuse with different bodies
    response_code INT,
    response_body JSONB,
    status TEXT,                    -- 'in_progress' | 'completed'
    locked_at TIMESTAMPTZ,
    created_at TIMESTAMPTZ DEFAULT now()
);

def handle(key, req):
    h = hash(req.body)
    row = db.fetch_for_update("SELECT * FROM idempotency WHERE key=%s", key)

    if row and row.request_hash != h:
        raise Conflict("idempotency key reused with different body")

    if row and row.status == 'completed':
        return row.response_body                # return cached response

    if row and row.status == 'in_progress':
        if now() - row.locked_at > LEASE:
            pass                                # lease expired, retry ok
        else:
            raise Conflict("request in flight")

    # claim
    db.upsert("idempotency", key=key, request_hash=h,
              status='in_progress', locked_at=now())

    try:
        resp = do_business(req)
        db.update("idempotency", key=key,
                  status='completed', response_code=200,
                  response_body=resp)
        return resp
    except RetryableError:
        # don't update; let client retry
        raise
    except FatalError as e:
        db.update("idempotency", key=key,
                  status='completed', response_code=500,
                  response_body={"error": str(e)})
        raise

# Expiry policy: Stripe keeps keys for 24h, then prunes. Clients should
# not assume keys live forever.
Real cases:

Extension & Optimization

Common Pitfalls & Interview Questions

1. Treating Saga as ACID — forgetting isolation is gone. During a saga, an order may sit in a "created-but-not-paid" intermediate state visible to readers, who may make decisions based on incorrect state. Either use semantic locks to mark pending, or have the business accept "visible intermediate states" with UX mitigations.
2. Non-idempotent / failure-prone compensations. A failed compensation is worse than a failed forward step — compensation exists to save you. Compensations must be idempotent, and failures must escalate to DLQ / human action — never an infinite retry loop.
3. Treating "send Kafka + write DB" as atomic. The most common junior trap. The interviewer expects you to immediately name "dual write problem" and "Outbox."
4. Idempotency key as timestamp / auto-increment. Client retries must use the same key. If the client generates a new timestamp on each retry, dedup is broken — the key must be fixed at request generation time and reused on retries.
5. Using 2PC across external APIs (Stripe / email / WeChat Pay). External APIs don't support XA. Such scenarios require the trio: saga + compensation + idempotency.

Interview follow-ups:

  1. Walk through 2PC blocking: coordinator dies after phase-1 ACK — what do participants do? How does 3PC improve and why is it still rare?
  2. How does Saga differ from ACID in isolation? Give a bug caused by Saga's visible intermediate state.
  3. Outbox vs Kafka transactions — why is Outbox preferred in production?
  4. Design "cross-service transfer": service A debits 100, service B credits 100. Give the saga + compensation flow and failure-mode analysis.
  5. What's the expiry policy for idempotency keys? What happens if a client reuses the same key after expiry? How do you avoid double charges?
  6. "Should microservices avoid distributed transactions?" — how do you answer? Give an example of "converting a distributed transaction into a domain redesign."

Deep Resources

Deep Reflection

1. Cross-service transfer: A debits 100, B credits 100. Design saga + failure-mode analysis. If B credits succeed but the saga orchestrator itself crashes, what happens?

The classic saga interview question. The core issue is that the orchestrator's own failure must be recoverable — otherwise you've just moved the single point from 2PC's coordinator to your saga orchestrator, without solving anything.

Forward flow:

  1. T1: A.debit(100, idempKey=tx-42)
  2. T2: B.credit(100, idempKey=tx-42)
  3. completed

Compensation: T2 fails → C1: A.credit(100, idempKey=tx-42-comp) (not a refund, a reverse credit).

Critical failure modes:

  • Orch crashes after T1 completes: on recovery, replay from event log; see T1 done, resume T2. The prerequisite is that orch state is persisted at every step (Temporal's event-sourcing model).
  • Orch crashes after sending T2, not knowing whether B succeeded: on recovery, retry T2; B's idempotency key dedupes — if last time succeeded, return same result; if last time the request didn't arrive, process normally. This is exactly what idempotency keys are for.
  • T2 fails forever (B permanently unreachable): compensate with C1. But beware "compensation also fails" — what if A is down too? Industry practice: a failed compensation goes to a DLQ with alerting, and an on-call engineer intervenes; do not let sagas retry forever. Temporal has a default max-attempts policy.
  • T1 succeeds but outbox not written: impossible. If you use outbox, T1 = (UPDATE accounts; INSERT outbox) is one atomic transaction.

Deeper reflection: this example is often used to "prove" saga is weak. But in real banking systems, transfers don't rely on distributed transactions at all — they use double-entry bookkeeping: the ledger isn't "modify A, modify B," but "record a debit, record a credit"; the sum of both must be zero. Any imbalance is caught by end-of-day reconciliation. This is centuries-old financial engineering wisdom, long predating distributed databases. Lesson: many "distributed-transaction needs" disappear when you change the data model.

2. Saga drops isolation — give a real example where reading an intermediate state causes a bug. How do you solve it?

Scenario: an e-commerce saga has steps "deduct inventory → deduct balance → ship." User A triggers a saga; inventory deduction succeeds (stock -1), balance deduction not yet started. Another query endpoint is hit and reads stock as -1; the recommender concludes "low stock" and sends user B a "limited-time offer" push. Then A's saga fails balance deduction and compensates (stock +1).

Result: user B got a marketing push based on an inventory reduction that never actually happened, and may immediately order only to find stock is fine — business logic disordered. This is the isolation violation caused by saga's "intermediate state leak." Under ACID, uncommitted transactions are invisible to others, so this cannot happen.

Solutions, from cheap to expensive:

  1. Semantic Lock (Chris Richardson): add a pending column on inventory. During the saga, mark pending=true; other readers know "this is a saga in-flight state" and decide accordingly (the recommender can ignore pending rows).
  2. Commutative operations: change "deduct inventory" to "reserve +1, ship -1," making order irrelevant. Compensation is "reserve -1," leaving no ghost state.
  3. Reorder steps: push irreversible / high-side-effect steps to the end. Reversible steps (DB writes) first; non-compensable steps (sending emails, calling external APIs) last.
  4. Versioning: each saga write carries saga_id + version; readers decide visibility based on saga state (MVCC-style).

Underlying principle: Saga trades "eventual consistency + visible intermediate state" for "cross-service feasibility + high throughput." If the business cannot tolerate visible intermediate state, that data should not cross services — it should stay in one service with local transactions. Back to "draw the right domain boundary."

3. Outbox introduces "second-level lag from DB to Kafka." If a business needs "the order is searchable immediately after writing," what do you do? Give 3 approaches.

Real scenario: after placing an order, the user must immediately see it in "My Orders." If outbox + Debezium has 2s lag, the user wonders "where's my order?" This is the conflict between synchronous requirements and async outbox.

Approach 1: Dual-read (recommended, cheap)

  • "My Orders" reads the search index, plus an extra query against the primary DB for the last N minutes of orders, merged in the UI.
  • Client / API gateway aggregates both sources.
  • Cost: one more DB query, but limited to current user's recent orders (highly indexed, < 5ms).
  • UX: perfect read-your-writes; downstream search system has no "must be instant" pressure.

Approach 2: Dual-write + tolerate inconsistency (for "occasionally lose a bit" cases)

  • Business code writes DB and ES at the same time. Accept some inconsistency, fix via batch reconciliation.
  • Cost: the dual-write problem returns — but if the business tolerates "search occasionally misses one row," acceptable.
  • Never for financial scenarios.

Approach 3: Synchronous outbox (heavy, edge cases only)

  • After commit, synchronously call Kafka producer and wait for ACK before returning to the client. Outbox is the backup.
  • Cost: write latency rises from 5ms to 30-100ms (Kafka cross-broker sync); Kafka failures stall the write path.
  • Use when "writes must be readable within seconds" is mandatory (financial reconciliation, regulatory reports).

Deeper reflection: "instant consistency" is almost always over-engineering. Google's and Amazon's search both have index lag. Ask the business: "really, not even 1 second?" — 80% of the time the answer is "5 seconds is fine." For the remaining 20%, Approach 1 (dual-read) usually suffices. Treating "instant consistency" as default is a hidden tax on performance and availability.

4. Cross-chapter: Day 6 (Consistency) says "pick consistency per data type"; today says "avoid distributed transactions." Combine both for the same e-commerce system: draw a "consistency + transaction strategy" map.

This is the layered-decision skill required of architects. Each subdomain in one system should pick both consistency and transaction strategy independently.

SubdomainConsistency (Day 6)Transaction strategy (Day 7)Implementation
Inventory deductionLinearizableSingle-DB local txn + row lock / CASPostgres SELECT FOR UPDATE or Spanner
Checkout (across 5 services)Eventual + compensationSaga (Temporal orchestration)Per-step local txn + reverse compensation + idempotency
Payments (Stripe)External, at-most-once chargeIdempotency key + retry + reconciliationStripe key = saga_id + end-of-day reconciliation
Order status fanout (search, recs, msgs)EventualOutbox + Kafka + at-least-once consumersDebezium → Kafka → multiple consumers
"My Orders" displayRead-your-writesPrimary DB + secondary-index dual readSession token carries LSN + dual read
User analytics / telemetryEventual + lossy okFire-and-forget + batchLocal buffer → async Kafka push

Overall architecture:

  • "Hard correctness" data (inventory, balance, order state) → local ACID inside one service.
  • "Cross-service flow" → Saga + Outbox + Idempotency trio.
  • "Downstream fanout" → Eventual via Kafka.
  • "Display-layer consistency" → client tokens + dual read.

Anti-patterns:

  1. System-wide strong consistency ("to be safe"): Spanner everywhere; checkout P99 spikes; conversion drops.
  2. System-wide eventual ("to keep it simple"): oversells, double charges, support inferno.
  3. Distributed transactions everywhere: tight coupling, fault propagation; one service down brings everything down.

The architect's core skill: be able to draw this table in 5 minutes for a new business, with every row backed by numbers (QPS, SLO, cost of violation). That's the real boundary between "senior" and "architect."

5. Counterintuitive: "the best distributed transaction is none at all" — give a real example of "killing a distributed transaction via domain redesign."

This is the essence of Pat Helland's paper and the mark of microservice design maturity. When you find 90% of flows cross the same two services and saga compensations grow ever more complex, your boundaries are wrong.

Real case 1: e-commerce "order + inventory" used to be two services

  • Initial: order service, inventory service split. Every order requires a saga.
  • Pain: 90% of requests cross both services; saga compensation accounts for 40% of code; failure rate climbs; ops cost soars.
  • Refactor: merge into "order fulfillment service" (one DB, one team); order and inventory in the same local transaction — SELECT FOR UPDATE on the inventory row + INSERT order, done in 5ms. The saga disappears entirely.
  • Lesson: "more microservices = better" is a myth. Boundaries should follow "independent evolution + independent scaling + independent failure," not code module boundaries.

Real case 2: banking transfers replace distributed transactions with double-entry bookkeeping

  • Common-wrong design: account-A service, account-B service; transfers require distributed transactions.
  • Better design: all accounts in one "ledger service"; the ledger table is INSERT-only; every transaction writes two rows (debit / credit); sum == 0 means correct.
  • Reading balance = SUM(credits) - SUM(debits) (with snapshots for speed).
  • "Cross-account transfer" reduces to two INSERTs in one service's local transaction — distributed transaction eliminated. This is the core design of Stripe Ledger, Square, and real-world ledger systems.

Real case 3: event sourcing collapses multi-service state into a single source of truth

  • Each service has its own state → cross-service consistency issues sprout endlessly.
  • Switch to event sourcing: every service writes events to a central event store (Kafka / EventStoreDB); each service materializes its own view from the stream.
  • "Cross-service transaction" becomes "append an event" — a local op. View inconsistency is handled by consumers.
  • Examples: parts of Netflix, parts of LinkedIn, most modern fintech.

Philosophical takeaway: complexity in distributed systems never disappears, only moves. Saga moves "coordinator complexity" to "compensation complexity"; double-entry moves "cross-service transaction" to "data-model design"; event sourcing moves "cross-service consistency" to "event schema evolution."

Mark of a mature architect: identify where complexity hides and put it where it is easiest to understand and evolve. Not eliminate complexity, but place it well. This principle applies to all design decisions — fundamentally the same thinking as "push inconsistency to the UX layer, keep consistency at the DB layer."