Day 39 Hard Workflow Engine Durable Execution Saga Compensation

Workflow Engine — Long-Running Flows That Survive CrashesDurable Execution, Saga Orchestration, Compensating Transactions

Problem & Constraints

Design an order-fulfillment orchestration engine: 1M orders/day, each stepping through "charge → reserve inventory → create shipment → wait for carrier pickup (hours to days) → notify". This is a long-running, multi-service, all-or-compensate flow. The hard part isn't any single step — it's resuming precisely after a crash mid-flow: the worker gets OOM-killed, the payment callback is 2 days late, inventory times out. You must neither double-charge nor leave the order stuck forever.

Hand-rolling this ends badly: a DB status column driven by a cron poller quickly degenerates into hundreds of if state==... branches with retry and compensation logic scattered everywhere that nobody dares touch. A workflow engine pushes state persistence, retries, timeouts, and compensation down into infrastructure.

Scale: ~50 workflow starts/s at peak; ~5M workflows in flight at once (including those sleeping/waiting); a single flow can live for days.
Reliability SLO: on process crash / deploy / host loss, workflow state is never lost; it resumes from the exact point, with no duplicated side effects (no double charge, no double shipment).
Correctness: either the whole order fulfills, or completed steps are semantically compensated (refund, release inventory) — no distributed 2PC available.
Observability: for any order, see "which step it's stuck at, why, how many retries".

High-Level Architecture

graph TD
    C["Client / API
StartWorkflow(order)"]
    FE["Frontend service
gRPC ingress · auth"]
    HIST[("History store
Event Sourcing
Cassandra / MySQL")]
    MQ["Task Queue
Matching service"]
    W["Worker fleet
YOUR Workflow + Activity code"]
    EXT["External systems
payment / inventory / shipping"]

    C --> FE
    FE -->|① append event| HIST
    FE -->|② dispatch task| MQ
    MQ -->|③ long-poll| W
    W -->|④ replay history to recover| HIST
    W -->|⑤ invoke side effects| EXT
    W -->|⑥ result → new event| FE

    classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class C,W client
    class FE,MQ core
    class HIST,EXT store

The engine (Frontend + History + Matching) only persists and schedules — it runs no business code; your logic runs in your own Workers, which recover by replaying History after a crash

Key Technical Points

1. Durable Execution — Hiding the "State Machine" Inside Plain Code

Principle: the core idea is event sourcing + deterministic replay. Every advance of a workflow (an Activity scheduled/completed, a Timer fired, a signal received) is appended in order to the History. When a worker crashes and restarts, the engine replays that History into your workflow code: at any step whose result is already recorded, it returns the cached value (without re-running the side effect); only the step not yet in History actually executes. So a flow that "runs for days across crashes" reads, in code, like a plain synchronous function.

Trade-off:

Hand-rolled state machine + DB column + cron: ✅ no new dependency; ❌ reinvent retry/timeout/recovery per flow, state explosion, concurrency bugs nearly guaranteed.
Durable-execution engine (Temporal/Cadence): ✅ code-is-the-flow, auto-recovery, built-in retry/timeout; ❌ workflow code must be deterministic (no direct now()/rand()/global reads), plus a learning curve and engine ops cost.
Managed Step Functions (AWS): ✅ fully managed, no ops; ❌ state machine expressed as JSON/DSL, weak for complex logic, vendor lock-in.

# Order-fulfillment workflow — looks synchronous, actually runs
# for days across process crashes
@workflow
def fulfill_order(order):
    pay = call(charge_payment, order)        # ① Activity = the real side effect
    if not pay.ok: return "payment_failed"
    call(reserve_inventory, order)           # ② each result appended to History
    ship = call(create_shipment, order)
    workflow.sleep(days=2)                    # ③ durable timer, holds no thread
    call(send_notification, order, ship)
    return "done"
# After a crash the worker replays History: completed Activities return
# their cached result (no double charge); only the unrecorded step runs.

Real-world:

Uber Cadence: built at Uber to orchestrate long cross-service flows (driver-incentive settlement, refunds), later open-sourced; makes "fault-tolerant stateful code" the programming model.
Temporal: founded by the Cadence originals; the flagship of durable execution — its blog "Beyond State Machines" argues why it replaces hand-rolled state machines.
Netflix Conductor: Netflix's open-source orchestrator; flows defined as JSON DSL + a web UI, used to orchestrate media-processing pipelines.
AWS Step Functions: managed state machines, used heavily inside Amazon for order/supply-chain long flows.

2. Orchestration vs Choreography

Principle: a multi-step cross-service transaction can be coordinated two ways. Orchestration: a central coordinator (the workflow) actively calls each participant in turn — "you charge, you reserve, you ship" — logic is centralized, visible, easy to compensate. Choreography: no central brain; each service emits an event when done, and the next service subscribes and picks up the baton (order-paid → inventory service listens → emits inventory-reserved → shipping listens).

Trade-off:

Orchestration: ✅ the flow is obvious at a glance, compensation order is clear, easy to add timeout/retry; ❌ the coordinator becomes a coupling hub, risks turning into a "god service".
Choreography: ✅ decoupled services, naturally async, no single point; ❌ the global flow is "owned by no one" — scattered across event subscriptions, painful to debug or change, prone to cyclic dependencies and invisible loops.

Rule of thumb: ≤ 3 steps and loosely coupled → choreography; many steps needing compensation and visibility → orchestration. A workflow engine is essentially orchestration turned into infrastructure.

Real-world:

Chris Richardson / microservices.io: the authoritative side-by-side of the two Saga implementations (orchestration / choreography), with the Eventuate Tram reference.
Netflix Conductor: explicitly chose orchestration — a central engine holds the flow definition, avoiding choreography's "flow logic scattered everywhere".

3. Compensating Transactions — No Rollback, Only "Semantic Undo"

Principle: distributed systems have no cross-service ACID rollback. A Saga splits a long transaction into local transactions T1…Tn, each Ti paired with a compensating transaction Ci: if a step fails, run the compensations of completed steps in reverse order (Cn-1…C1). Compensation isn't a physical rollback to the prior byte state — it's a business-semantic cancellation: charged → refund, reserved → release, emailed → send a correction. This is exactly the model laid down by Garcia-Molina & Salem's 1987 Sagas paper.

Trade-off (backward vs forward recovery):

Backward recovery (compensating rollback): on failure, undo what's done, back to "nothing happened". Fits cancelable flows like ordering.
Forward recovery (retry to success): some steps are irreversible (money already sent to a third party, email already sent) — you can only keep retrying until success. So the design rule is to push irreversible steps as late as possible; once you cross one, only forward recovery remains.

def saga(order):
    done = []                              # compensation stack of completed steps
    try:
        charge_payment(order);   done.append(refund_payment)
        reserve_inventory(order);done.append(release_inventory)
        create_shipment(order);  done.append(cancel_shipment)
    except StepFailed:
        for compensate in reversed(done):  # reverse-order compensation (LIFO)
            compensate(order)              # ⚠️ must be idempotent + infinitely retriable
        raise

Pitfall: compensation itself can fail. It must be idempotent and infinitely retriable, and must not trigger new side effects that themselves need compensating, or the compensation chain diverges forever. Truly-impossible compensations go to a "manual intervention" queue.

Real-world:

Sagas paper (SIGMOD 1987): the original definition of Ci semantically canceling Ti — the theoretical source of the modern microservices Saga.
Payment/e-commerce systems: refund on failed order, release stock on oversell — classic backward compensation; for money already sent to a third party, switch to forward retry + reconciliation as a backstop.

4. Activity Execution Guarantees — at-least-once + idempotency + timeout/heartbeat

Principle: the workflow itself is "logically exactly-once" via replay, but the Activities that actually touch external systems can only guarantee at-least-once: the engine dispatches a task to a worker; if the worker finishes but crashes before reporting the result, the engine never gets the ACK and, after a timeout, redispatches — so the same charge may run twice. The only fix is to make Activities idempotent: use a business-unique key (order id) as the idempotency key, and have downstream dedupe on it. Long tasks must also heartbeat periodically so the engine can tell "slow" from "dead" and not falsely time out.

Trade-off: delivery semantics — at-most-once (no retry, may drop, unacceptable) vs at-least-once (retries, paired with idempotency, the industry default). So-called exactly-once doesn't mean "executed only once" — it means executed repeatedly but taking effect once (idempotency-key dedup).

// Activity runs at least once → the business side MUST be idempotent
func ChargePayment(ctx, order) error {
    key := "charge:" + order.ID          // idempotency key = business uniqueness
    if seen(key) { return nil }          // a retry arrives → short-circuit
    heartbeat(ctx)                       // periodic heartbeat on long tasks
    charge(order.Amount, idemKey=key)    // pass the idem key to the payment gateway
    return nil
}

Real-world:

Temporal / Cadence Task Queue: workers long-poll for tasks; an Activity that times out or stops heartbeating is retried — the docs require Activities to be idempotent.
Stripe: the payments API uses an Idempotency-Key header; retrying the same key charges once — the canonical backstop for upstream workflow retries.

Scaling & Optimization

History-store sharding: as workflow counts explode, consistent-hash by workflow_id across Cassandra shards; don't let a single workflow's History grow too long (tens of thousands of events slow replay) — truncate with Continue-As-New, starting a fresh execution.
Versioning problem: change the workflow code and in-flight old workflows replay through mismatched new/old code paths → non-determinism errors. Use getVersion() / patch branches to stay compatible with old histories — the hardest operational aspect of long-flow engines.
Worker scaling: put workflow tasks and activity tasks on separate queues, scaled independently; isolate CPU-heavy activities so they don't starve lightweight workflow tasks of threads.
Multi-region: replicate History cross-region for DR (Cadence/Temporal global namespace / XDC); after failover, workflows continue from the replicated history.

Common Pitfalls + Interview Questions

1. Calling time.now() / rand() / HTTP directly inside a workflow. Breaks deterministic replay: on replay the value differs, history mismatches, and it errors out. All non-deterministic operations must live inside Activities or use the engine's deterministic APIs.

2. Assuming the engine gives exactly-once so you can skip idempotency. Activities are at-least-once; side effects must dedupe themselves or a retry becomes a double charge.

3. Placing irreversible steps too early. Once an irreversible action runs (money to a third party), a later failure can't roll back the whole order — only forward retry remains; order it as late as possible.

4. Common ask: why do microservices avoid distributed 2PC? (blocking, coordinator single point, long-held locks, participant-failure amplification.) What does Saga trade? (gives up isolation I for availability — intermediate state is externally visible, business must tolerate it.)

5. Common ask: how does a workflow engine guarantee no state loss and no duplicated side effects after a crash? (event-sourced persistence + deterministic replay + Activity idempotency keys.)

Further Resources

Designing Data-Intensive Applications, Ch. 9 + Ch. 12 (Kleppmann): the essence of distributed transactions, consensus, and "derived data / end-to-end idempotency".
Sagas (Garcia-Molina & Salem, ACM SIGMOD 1987): the theoretical origin of long transactions and compensating transactions — the ancestor of the microservices Saga.
Temporal blog — "Beyond State Machines": first-hand argument for why durable execution replaces hand-rolled state machines.
Netflix TechBlog — "Netflix Conductor: A microservices orchestrator": industrial-grade design and trade-offs of an orchestration engine.
microservices.io — Saga pattern (Chris Richardson): authoritative orchestration-vs-choreography comparison and implementation guide.

Going Deeper (click to expand)

1. A workflow engine claims precise recovery after a crash — on what basis? What if an Activity "finished but the worker crashed before reporting the result"?

On event sourcing + replay: the workflow's progress is appended as an immutable event stream persisted in History. After a crash, another machine replays History from the start, returning cached results for steps already recorded and only truly executing steps absent from history — so the workflow logic is "exactly-once".

But that awkward window is real: the Activity already charged the card, the result isn't yet in History, and the worker crashes. The engine gets no completion event and, after timeout, redispatches the Activity → a second charge. The engine can't eliminate this; only Activity idempotency can: key on the order id, have the payment gateway honor that key and return the first result on the second call. That's why "the engine gives you durability, but idempotency is on you".

2. Why must workflow code be "deterministic"? What concrete failure does a single if random() > 0.5 cause?

Recovery is by replay: feeding the same History into the workflow code must produce the exact same command sequence (which Activity to schedule, which Timer to set) to align line-by-line with history.

if random()>0.5 took branch A the first time (scheduled Activity A, written to history); on replay random() rolls again, this time taking branch B and wanting to schedule Activity B. The engine sees "code wants to schedule B, but history record N is A" — command doesn't match history, so it throws a non-determinism error and the workflow wedges. Hence randomness, current time, external reads, concurrent-map iteration order… are all mines and must move into Activities (whose results are persisted and not recomputed on replay).

3. If I'm already using a workflow engine, why do I still need Saga / compensation? Are they alternatives or complements?

Complements, not alternatives. The engine solves "how to reliably persist and recover flow state" (durable execution); Saga solves "how to semantically roll back a multi-step business failure" (compensation). The engine gives you a reliable place to orchestrate a Saga — call steps in order inside the workflow, catch failures with try/except, then call compensating Activities in reverse.

You can write a Saga without an engine (choreography, via events), but then compensation's state, retries, and timeouts must be persisted by hand — back into the hand-rolled state-machine swamp. So "a Saga workflow running inside Temporal" is today's most common combo: the engine provides durable + retry, the Saga provides business-rollback semantics.

4. There are 5M in-flight, not-yet-finished workflows, and you need to change the workflow logic. What breaks? How do you ship safely?

Disaster scenario: old workflows wrote History along the old code path; after deploying new code, those in-flight workflows replay along the new path, generating commands that don't match the old history → mass non-determinism errors and collective wedging. This is the nastiest ops trap of long-flow engines (flows live for days while code ships daily).

Safe approach: don't change branches directly — version for compatibility. Use the engine's getVersion()/patch API: v=getVersion("addStep",1,2); if v==1: old logic else: new logic. Old workflows on replay get the recorded old version and take the old path; newly started ones take the new path — both coexist in one codebase. Once all old workflows drain, remove the old branch. Or simply spin up a new task queue / new workflow type and let old flows finish naturally on old workers.

5. What if a compensating transaction fails? "Charge succeeded, inventory reserved, shipment failed, need refund but the refund API is also down" — design this backstop.

Compensation isn't a one-shot best-effort; it's itself a reliable Activity: configure infinite retry + exponential backoff, since the refund API is most likely just temporarily unavailable. The engine durably keeps this compensating Activity retrying and doesn't lose it across crashes — precisely the biggest win of running a Saga on an engine versus hand-rolling.

But guard two things: ① compensation must be idempotent (refund API takes an idempotency key), or retries could refund several times; ② set a retry ceiling / deadline, after which it stops auto-retrying and drops the order into a "manual intervention" queue with an alert for human reconciliation. Never let a compensation retry infinitely and quietly burn resources, and never fake success and move on — better to stall for a human than silently lose money. In production you'd also run an independent reconciliation job: daily scan for "charged but ultimately unfulfilled and unrefunded" orders and force the refund.

← Back to index