Day 39 Hard Workflow Engine Durable Execution Saga Compensation

Workflow Engine — Long-Running Flows That Survive CrashesDurable Execution, Saga Orchestration, Compensating Transactions

Problem & Constraints

Design an order-fulfillment orchestration engine: 1M orders/day, each stepping through "charge → reserve inventory → create shipment → wait for carrier pickup (hours to days) → notify". This is a long-running, multi-service, all-or-compensate flow. The hard part isn't any single step — it's resuming precisely after a crash mid-flow: the worker gets OOM-killed, the payment callback is 2 days late, inventory times out. You must neither double-charge nor leave the order stuck forever.

Hand-rolling this ends badly: a DB status column driven by a cron poller quickly degenerates into hundreds of if state==... branches with retry and compensation logic scattered everywhere that nobody dares touch. A workflow engine pushes state persistence, retries, timeouts, and compensation down into infrastructure.

High-Level Architecture

graph TD
    C["Client / API
StartWorkflow(order)"] FE["Frontend service
gRPC ingress · auth"] HIST[("History store
Event Sourcing
Cassandra / MySQL
")] MQ["Task Queue
Matching service"] W["Worker fleet
YOUR Workflow + Activity code"] EXT["External systems
payment / inventory / shipping"] C --> FE FE -->|① append event| HIST FE -->|② dispatch task| MQ MQ -->|③ long-poll| W W -->|④ replay history to recover| HIST W -->|⑤ invoke side effects| EXT W -->|⑥ result → new event| FE classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef core fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef store fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class C,W client class FE,MQ core class HIST,EXT store

The engine (Frontend + History + Matching) only persists and schedules — it runs no business code; your logic runs in your own Workers, which recover by replaying History after a crash

Key Technical Points

1. Durable Execution — Hiding the "State Machine" Inside Plain Code

Principle: the core idea is event sourcing + deterministic replay. Every advance of a workflow (an Activity scheduled/completed, a Timer fired, a signal received) is appended in order to the History. When a worker crashes and restarts, the engine replays that History into your workflow code: at any step whose result is already recorded, it returns the cached value (without re-running the side effect); only the step not yet in History actually executes. So a flow that "runs for days across crashes" reads, in code, like a plain synchronous function.

Trade-off:
# Order-fulfillment workflow — looks synchronous, actually runs
# for days across process crashes
@workflow
def fulfill_order(order):
    pay = call(charge_payment, order)        # ① Activity = the real side effect
    if not pay.ok: return "payment_failed"
    call(reserve_inventory, order)           # ② each result appended to History
    ship = call(create_shipment, order)
    workflow.sleep(days=2)                    # ③ durable timer, holds no thread
    call(send_notification, order, ship)
    return "done"
# After a crash the worker replays History: completed Activities return
# their cached result (no double charge); only the unrecorded step runs.
Real-world:

2. Orchestration vs Choreography

Principle: a multi-step cross-service transaction can be coordinated two ways. Orchestration: a central coordinator (the workflow) actively calls each participant in turn — "you charge, you reserve, you ship" — logic is centralized, visible, easy to compensate. Choreography: no central brain; each service emits an event when done, and the next service subscribes and picks up the baton (order-paid → inventory service listens → emits inventory-reserved → shipping listens).

Trade-off:

Rule of thumb: ≤ 3 steps and loosely coupled → choreography; many steps needing compensation and visibility → orchestration. A workflow engine is essentially orchestration turned into infrastructure.

Real-world:

3. Compensating Transactions — No Rollback, Only "Semantic Undo"

Principle: distributed systems have no cross-service ACID rollback. A Saga splits a long transaction into local transactions T1…Tn, each Ti paired with a compensating transaction Ci: if a step fails, run the compensations of completed steps in reverse order (Cn-1…C1). Compensation isn't a physical rollback to the prior byte state — it's a business-semantic cancellation: charged → refund, reserved → release, emailed → send a correction. This is exactly the model laid down by Garcia-Molina & Salem's 1987 Sagas paper.

Trade-off (backward vs forward recovery):
def saga(order):
    done = []                              # compensation stack of completed steps
    try:
        charge_payment(order);   done.append(refund_payment)
        reserve_inventory(order);done.append(release_inventory)
        create_shipment(order);  done.append(cancel_shipment)
    except StepFailed:
        for compensate in reversed(done):  # reverse-order compensation (LIFO)
            compensate(order)              # ⚠️ must be idempotent + infinitely retriable
        raise
Pitfall: compensation itself can fail. It must be idempotent and infinitely retriable, and must not trigger new side effects that themselves need compensating, or the compensation chain diverges forever. Truly-impossible compensations go to a "manual intervention" queue.
Real-world:

4. Activity Execution Guarantees — at-least-once + idempotency + timeout/heartbeat

Principle: the workflow itself is "logically exactly-once" via replay, but the Activities that actually touch external systems can only guarantee at-least-once: the engine dispatches a task to a worker; if the worker finishes but crashes before reporting the result, the engine never gets the ACK and, after a timeout, redispatches — so the same charge may run twice. The only fix is to make Activities idempotent: use a business-unique key (order id) as the idempotency key, and have downstream dedupe on it. Long tasks must also heartbeat periodically so the engine can tell "slow" from "dead" and not falsely time out.

Trade-off: delivery semantics — at-most-once (no retry, may drop, unacceptable) vs at-least-once (retries, paired with idempotency, the industry default). So-called exactly-once doesn't mean "executed only once" — it means executed repeatedly but taking effect once (idempotency-key dedup).
// Activity runs at least once → the business side MUST be idempotent
func ChargePayment(ctx, order) error {
    key := "charge:" + order.ID          // idempotency key = business uniqueness
    if seen(key) { return nil }          // a retry arrives → short-circuit
    heartbeat(ctx)                       // periodic heartbeat on long tasks
    charge(order.Amount, idemKey=key)    // pass the idem key to the payment gateway
    return nil
}
Real-world:

Scaling & Optimization

Common Pitfalls + Interview Questions

1. Calling time.now() / rand() / HTTP directly inside a workflow. Breaks deterministic replay: on replay the value differs, history mismatches, and it errors out. All non-deterministic operations must live inside Activities or use the engine's deterministic APIs.
2. Assuming the engine gives exactly-once so you can skip idempotency. Activities are at-least-once; side effects must dedupe themselves or a retry becomes a double charge.
3. Placing irreversible steps too early. Once an irreversible action runs (money to a third party), a later failure can't roll back the whole order — only forward retry remains; order it as late as possible.
4. Common ask: why do microservices avoid distributed 2PC? (blocking, coordinator single point, long-held locks, participant-failure amplification.) What does Saga trade? (gives up isolation I for availability — intermediate state is externally visible, business must tolerate it.)
5. Common ask: how does a workflow engine guarantee no state loss and no duplicated side effects after a crash? (event-sourced persistence + deterministic replay + Activity idempotency keys.)

Further Resources

Going Deeper (click to expand)

1. A workflow engine claims precise recovery after a crash — on what basis? What if an Activity "finished but the worker crashed before reporting the result"?

On event sourcing + replay: the workflow's progress is appended as an immutable event stream persisted in History. After a crash, another machine replays History from the start, returning cached results for steps already recorded and only truly executing steps absent from history — so the workflow logic is "exactly-once".

But that awkward window is real: the Activity already charged the card, the result isn't yet in History, and the worker crashes. The engine gets no completion event and, after timeout, redispatches the Activity → a second charge. The engine can't eliminate this; only Activity idempotency can: key on the order id, have the payment gateway honor that key and return the first result on the second call. That's why "the engine gives you durability, but idempotency is on you".

2. Why must workflow code be "deterministic"? What concrete failure does a single if random() > 0.5 cause?

Recovery is by replay: feeding the same History into the workflow code must produce the exact same command sequence (which Activity to schedule, which Timer to set) to align line-by-line with history.

if random()>0.5 took branch A the first time (scheduled Activity A, written to history); on replay random() rolls again, this time taking branch B and wanting to schedule Activity B. The engine sees "code wants to schedule B, but history record N is A" — command doesn't match history, so it throws a non-determinism error and the workflow wedges. Hence randomness, current time, external reads, concurrent-map iteration order… are all mines and must move into Activities (whose results are persisted and not recomputed on replay).

3. If I'm already using a workflow engine, why do I still need Saga / compensation? Are they alternatives or complements?

Complements, not alternatives. The engine solves "how to reliably persist and recover flow state" (durable execution); Saga solves "how to semantically roll back a multi-step business failure" (compensation). The engine gives you a reliable place to orchestrate a Saga — call steps in order inside the workflow, catch failures with try/except, then call compensating Activities in reverse.

You can write a Saga without an engine (choreography, via events), but then compensation's state, retries, and timeouts must be persisted by hand — back into the hand-rolled state-machine swamp. So "a Saga workflow running inside Temporal" is today's most common combo: the engine provides durable + retry, the Saga provides business-rollback semantics.

4. There are 5M in-flight, not-yet-finished workflows, and you need to change the workflow logic. What breaks? How do you ship safely?

Disaster scenario: old workflows wrote History along the old code path; after deploying new code, those in-flight workflows replay along the new path, generating commands that don't match the old history → mass non-determinism errors and collective wedging. This is the nastiest ops trap of long-flow engines (flows live for days while code ships daily).

Safe approach: don't change branches directly — version for compatibility. Use the engine's getVersion()/patch API: v=getVersion("addStep",1,2); if v==1: old logic else: new logic. Old workflows on replay get the recorded old version and take the old path; newly started ones take the new path — both coexist in one codebase. Once all old workflows drain, remove the old branch. Or simply spin up a new task queue / new workflow type and let old flows finish naturally on old workers.

5. What if a compensating transaction fails? "Charge succeeded, inventory reserved, shipment failed, need refund but the refund API is also down" — design this backstop.

Compensation isn't a one-shot best-effort; it's itself a reliable Activity: configure infinite retry + exponential backoff, since the refund API is most likely just temporarily unavailable. The engine durably keeps this compensating Activity retrying and doesn't lose it across crashes — precisely the biggest win of running a Saga on an engine versus hand-rolling.

But guard two things: ① compensation must be idempotent (refund API takes an idempotency key), or retries could refund several times; ② set a retry ceiling / deadline, after which it stops auto-retrying and drops the order into a "manual intervention" queue with an alert for human reconciliation. Never let a compensation retry infinitely and quietly burn resources, and never fake success and move on — better to stall for a human than silently lose money. In production you'd also run an independent reconciliation job: daily scan for "charged but ultimately unfulfilled and unrefunded" orders and force the refund.