In the demo your agent always succeeds; in production it spends most of its time handling failure.
// WHY THIS MATTERS
Anyone can write an agent's happy path: a while loop, a tool call, the model decides when to stop. But in production, 90% of the engineering lives outside the happy path — networks flake, rate limits hit, tools time out, the model emits half a JSON, step 38 crashes and the prior 37 are wasted, a retry sends the same email three times. None of this is fixable by prompting; it's the harness's resilience design. Seasoned distributed-systems engineers know this toolkit (retry / idempotency / saga / circuit breaker / checkpoint), but agents re-scramble it, because they add a non-deterministic decision layer: the model itself will "try once more," and the model itself will give up over a single error. Today covers the four things most likely to cause an incident: the two-layer structure of retry, idempotency and compensation for side-effecting tools, checkpoint/resume for long tasks, and timeout/circuit-breaker/partial-failure degradation.
// 01
The Two Layers of Retry: don't mix transport errors and semantic errors
Claim: an agent's retry must be layered: transport failures (handled deterministically, by program retries) vs semantic failures (handled non-deterministically, by feeding the error back to the model). Mixing them is the #1 bug.
Background & Principle
There are two kinds of "failure" in an agent, handled in opposite ways.
Transport failure: HTTP 429 / 503, connection timeout, ECONNRESET. These are transient faults that do not depend on what the model asked for. The right move is a program-level retry with exponential backoff + jitter; the model should never even hear about it. Marc Brooker's classic AWS post shows that backoff without jitter leaves clients "synchronized," hammering the server at the same instants; full jitter (randomizing the whole backoff window) sharply cuts contention.
Semantic failure: the tool ran but the result is wrong — file not found, command errored, invalid argument. This is information the model needs to see and self-correct; feed the error back as a tool_result and let it try another way. Never retry it in program code — the same arguments fail the same way a hundred times.
The #1 bug is swapping the layers: treating a 429 as semantic and feeding it to the model (which won't back off and keeps hammering the dependency), or treating "file not found" as transport and retrying in code (infinite loop). The test is one sentence: is the error input-independent and merely transient? Yes → program backoff; No → feed back to the model.
          A tool call failed: which way?
                         │
           ┌─────────────┴────────────┐
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────┐
│ transport/transient  │   │ semantic/determ.     │
│ 429·503·timeout      │   │ bad arg·not found    │
└──────────┬───────────┘   └──────────┬───────────┘
           ▼                          ▼
 backoff + full jitter       error → tool_result
 fail-fast after N           feed to model, retry there
 (model never sees it)       (no program retry)
Hands-on
import time, random, anthropic

TRANSIENT = (anthropic.RateLimitError, anthropic.APITimeoutError,
             anthropic.InternalServerError)

def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except TRANSIENT:  # transport: transient fault → program backoff
            if attempt == max_retries - 1:
                raise
            # full jitter: randomize the whole window, avoid thundering herd
            time.sleep(random.uniform(0, min(60, 2 ** attempt)))
    # note: semantic errors (tool failures) are NOT caught here;
    # they go back as tool_result; the model decides what's next
Failure mode: catch-all except + retry. A permanent error (401 Unauthorized, invalid argument) gets backed off and retried until the attempts run out, failing each time. You've only delayed surfacing the fault (by up to about 15 seconds of backoff with the defaults above) and burned quota. Retry only makes sense for transient, input-independent faults; a permanent error must fail-fast.
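The three-way routing (transient, semantic, permanent) can be sketched in one wrapper. This is a minimal sketch under assumptions: the exception classes `TransientError`, `PermanentError`, `ToolError` and the helper `run_tool` are hypothetical names invented here, standing in for whatever your harness and tools actually raise. Only the `tool_result` block shape with `is_error` follows the Anthropic Messages API.

```python
import random
import time

# Hypothetical exception taxonomy for this sketch (not from any library).
class TransientError(Exception): ...   # 429 / 503 / timeout: retry in code
class PermanentError(Exception): ...   # 401 / bad config: fail fast
class ToolError(Exception): ...        # tool ran, result is wrong: show the model

def run_tool(tool_use_id, fn, max_retries=5, sleep=time.sleep):
    """Run one tool call and return a tool_result block for the model."""
    for attempt in range(max_retries):
        try:
            return {"type": "tool_result", "tool_use_id": tool_use_id,
                    "content": str(fn())}
        except TransientError:
            if attempt == max_retries - 1:
                raise                  # retries exhausted: surface to the harness
            sleep(random.uniform(0, min(60, 2 ** attempt)))  # full jitter
        except ToolError as e:
            # Semantic failure: no program retry, the model sees the error text.
            return {"type": "tool_result", "tool_use_id": tool_use_id,
                    "content": str(e), "is_error": True}
        # PermanentError (and anything unclassified) propagates immediately.
```

The `sleep` parameter exists only so tests can skip real waiting. Unclassified exceptions deliberately fall through: an error you have not classified is safer surfaced than silently retried.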
// 02
Idempotency & Compensation: agents will repeat, so side-effecting tools must be idempotent
Claim: an agent's retry/resume safety depends 100% on tool idempotency, not on how elegant your retry logic is.
Background & Principle
Retry and resume share a hidden premise: the operation being retried can be run multiple times safely. Reads are naturally idempotent (reading a file twice is fine), but writes are not: run create_order / send_email / charge_card one extra time and the customer gets a duplicate order, a duplicate email, or a double charge. Agents are more dangerous than traditional systems because "retry" has two sources: program-level backoff, and the model itself. When a tool_result is ambiguous (a timeout: did it succeed?), the model "calls it again just to be safe." That's exactly the riskiest moment, because the first call may already have succeeded.
Two engineering tools:
Idempotency key: every side-effecting tool call carries a client-generated unique key; the server dedupes on it — a second call with the same key returns the first result without re-executing. Standard practice in payment APIs like Stripe. The key must be derived from the task (deterministic), not random per call, or a retry's changed key defeats dedup.
Compensation (Saga): agents have no DB transaction to roll back with one call. If step 3 fails, the side effects of steps 1–2 (booked, charged) must be undone by explicit compensating actions (cancel, refund). Pair every side-effecting step with a compensating action, run them in reverse on failure.
Hands-on
# 1) idempotency key: makes retry safe. derive from task, not random
def book_flight(args):
    key = f"book:{args['trip_id']}:{args['flight_no']}"  # deterministic key
    return api.post("/bookings", json=args,
                    headers={"Idempotency-Key": key})  # server dedupes

# 2) compensation: register an undo per side-effect; reverse on failure
saga = []
try:
    r1 = book_flight(...); saga.append(lambda: cancel_flight(r1.id))
    r2 = charge_card(...); saga.append(lambda: refund(r2.id))
    r3 = book_hotel(...)  # ← this step fails
except Exception:
    for undo in reversed(saga):  # reverse: refund first, then cancel
        undo()
    raise
Failure mode: relying on "the model won't call it twice." When a tool_result is a network timeout (you genuinely don't know if the server ran it), the model often calls again — and the idempotency key is the only reliable defense. You cannot prompt the model "don't repeat"; that builds safety on the non-deterministic decision layer.
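When the server offers no Idempotency-Key support, the harness can keep its own ledger of executed keys. A minimal sketch, assuming a local SQLite table; the class name `ExecutedKeys`, the table layout, and the `run_once` method are all hypothetical, and results are assumed to be JSON-serializable.

```python
import json
import sqlite3

class ExecutedKeys:
    """Client-side dedupe ledger: one row per idempotency key."""
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS executed "
                        "(key TEXT PRIMARY KEY, result TEXT)")

    def run_once(self, key, fn):
        row = self.db.execute("SELECT result FROM executed WHERE key = ?",
                              (key,)).fetchone()
        if row is not None:
            return json.loads(row[0])  # replay the stored result, no re-execution
        result = fn()                  # the side effect happens here
        with self.db:                  # commit the key together with its result
            self.db.execute("INSERT INTO executed VALUES (?, ?)",
                            (key, json.dumps(result)))
        return result
```

Note the limit of this approach: a crash between `fn()` and the INSERT still lets the side effect run twice. A local ledger narrows the window; only server-side dedup closes it.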
// 03
Checkpoint & Resume: agent state must live outside the process; an in-memory loop is a toy
Claim: long tasks must resume from a checkpoint, not restart from scratch — which requires externalized, persistent agent state.
Background & Principle
An agent runs 40 steps over half an hour and crashes at step 38 (OOM, deploy restart, a spot machine reclaimed). If state lives only in the in-memory messages array, the prior 37 steps are wasted — and their tokens and side effects are already spent. A production agent must persist state outside the process: checkpoint after each step, and on crash resume from the last checkpoint. This is the core of durable execution. Two ways to land it:
Library level: LangGraph's checkpointer snapshots state to SQLite/Postgres after each node and resumes by thread_id — "loads the last snapshot and resumes exactly where it left off."
Platform level: a durable execution engine like Temporal turns the whole agent workflow into something crash-recoverable — on crash it replays history and skips completed activities, "survive a crash and resume exactly where it left off."
The key is checkpoint granularity: too fine (every token) is costly, too coarse (only at the end) is useless. In practice, checkpoint per model call or per side-effecting step. And it must pair with idempotency (§2) — on resume, that last step may have run only halfway.
Hands-on
import json, os

# client, MODEL, SCHEMAS, MAX_STEPS and run_tools come from the rest of the harness
def agent(task, ckpt_path):
    if os.path.exists(ckpt_path):  # resume: restore from checkpoint
        with open(ckpt_path) as f:
            s = json.load(f)
        msgs, step = s["msgs"], s["step"]
    else:
        msgs, step = [{"role": "user", "content": task}], 0
    while step < MAX_STEPS:
        r = client.messages.create(model=MODEL, messages=msgs,
                                   tools=SCHEMAS, max_tokens=4096)
        # store plain dicts: the SDK's content-block objects are not JSON-serializable
        msgs.append({"role": "assistant",
                     "content": [b.model_dump(exclude_none=True) for b in r.content]})
        if r.stop_reason != "tool_use":  # end_turn, max_tokens, ...: no tools to run
            break
        results = run_tools(r.content)  # run tools (must be idempotent!)
        msgs.append({"role": "user", "content": results}); step += 1
        with open(ckpt_path, "w") as f:  # persist per step
            json.dump({"msgs": msgs, "step": step}, f)
Failure mode: you checkpoint state but the tool isn't idempotent — the crash lands between "tool executed" and "checkpoint written," so on resume the tool reruns and the side effect happens twice. Checkpoint and idempotency are a pair; neither works alone. Another trap: stuffing non-serializable objects (file handles, sockets) into state, which blow up on restore.
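The crash window between "tool executed" and "checkpoint written" can be narrowed by recording the step's idempotency key before the tool runs. A minimal sketch, assuming a JSON-file checkpoint; `save_atomic`, `run_step`, and the state layout (`pending`, `results`, `step`) are hypothetical names, and the tool is assumed to accept the key and dedupe on it.

```python
import json
import os
import tempfile
import uuid

def save_atomic(path, state):
    """Write to a temp file, then rename, so a crash never leaves half a checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)              # atomic rename (a POSIX guarantee)

def run_step(path, state, tool):
    """Record the step's idempotency key BEFORE running it, clear it AFTER."""
    if state.get("pending") is None:
        state["pending"] = uuid.uuid4().hex  # fresh key for a fresh step
        save_atomic(path, state)
    result = tool(state["pending"])          # on resume the same key is reused
    state["results"].append(result)
    state["pending"] = None
    state["step"] += 1
    save_atomic(path, state)
    return result
```

If the process dies inside `tool(...)`, the restored state still carries `pending`, so the rerun sends the same key and the server (or a local ledger) can dedupe it. A random key is fine here precisely because it is persisted before use.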
// 04
Timeout, Circuit Breaker & Partial Failure: degrade, don't fail the whole task
Claim: failing the whole task on any tool failure is an anti-pattern; a mature agent has an explicit degradation policy per failure class.
Background & Principle
Three related mechanisms:
Timeout: every tool call needs an upper bound, or one hung HTTP request stalls the whole agent forever. Timeouts should be layered: per-tool < per-step < a top-level task deadline.
Circuit breaker: when a downstream (an API, an MCP server) fails N times in a row, it "trips" — subsequent calls fail fast instead of waiting on timeouts, then "half-open" lets one probe through after a cooldown. This stops the agent from repeatedly hammering a dead dependency, saving tokens and avoiding cascading slowdowns. The classic three states: closed → open → half-open.
Partial failure: in a parallel fan-out (Day 13's orchestrator-worker), 5 workers run and 1 dies. The anti-pattern is failing the whole batch; the mature move is to collect the 4 successes and hand "1 failed" to the orchestrator as information — degrade and continue with 4, or treat the subtask as critical and stop.
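Partial failure in a fan-out can be sketched with asyncio: collect what succeeded, annotate what failed, and let the caller decide. A minimal sketch; the function name `fan_out`, the `workers` mapping (name to zero-argument coroutine function), and the 5-second default are assumptions for illustration.

```python
import asyncio

async def fan_out(workers, per_worker_timeout=5.0):
    """Run workers in parallel; return (successes, failures) instead of raising."""
    async def guarded(name, coro_fn):
        return name, await asyncio.wait_for(coro_fn(), per_worker_timeout)

    names = list(workers)
    outcomes = await asyncio.gather(
        *(guarded(n, workers[n]) for n in names),
        return_exceptions=True,        # one failure must not abort the batch
    )
    successes, failures = {}, {}
    for name, out in zip(names, outcomes):
        if isinstance(out, BaseException):
            failures[name] = repr(out)  # handed to the orchestrator as information
        else:
            successes[name] = out[1]
    return successes, failures
```

The orchestrator then applies its own policy: continue with `successes` if the failed subtasks are optional, or stop if any key in `failures` is critical.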
Circuit Breaker three-state machine
┌───────────┐   ≥N fails    ┌───────────┐
│  CLOSED   │ ────────────▶ │   OPEN    │
│ pass all  │               │ fail fast │
└───────────┘               └─────┬─────┘
      ▲                           │ cooldown (reset_after)
      │ probe ok                  ▼
      │                 ┌──────────────────────┐
      └──────────────── │ HALF-OPEN: one probe │ ── probe fail → OPEN
                        └──────────────────────┘
Hands-on
import time

class CircuitOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, fail_max=5, reset_after=30):
        self.fail_max, self.reset_after = fail_max, reset_after
        self.fails, self.opened_at = 0, None

    def call(self, fn):
        if self.opened_at is not None:  # OPEN
            if time.time() - self.opened_at < self.reset_after:
                raise CircuitOpen("dependency tripped, fail fast")
            self.opened_at = None  # → HALF-OPEN: let one probe through
        try:
            out = fn(); self.fails = 0  # success → CLOSED
            return out
        except Exception:
            self.fails += 1
            if self.fails >= self.fail_max:
                self.opened_at = time.time()  # → OPEN: trip
            raise
Failure mode: (1) only per-tool timeouts, no task-level deadline — the agent never times out on any single step in a loop yet never finishes, until it burns the whole budget. You need a top-level deadline. (2) Hard-coding partial failure as "any worker fails = whole task fails," letting one flaky edge worker drag down an entire high-value task. The degradation policy should be decided by the orchestrator per subtask importance, not one-size-fits-all.
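A top-level deadline can be sketched as a budget object that every per-tool timeout is clamped against. A minimal sketch; the class name `Deadline` and its methods are hypothetical, and the injectable `clock` is there only to make the behavior testable.

```python
import time

class Deadline:
    """Top-level task budget; per-tool timeouts are clamped to what is left."""
    def __init__(self, seconds, clock=time.monotonic):
        self.clock = clock
        self.expires_at = clock() + seconds

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_tool_timeout):
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("task deadline exceeded")
        return min(per_tool_timeout, left)  # never wait past the task deadline
```

Calling `deadline.timeout_for(30)` at the top of each loop iteration gives both properties at once: no single call outlives the task, and a loop of individually fast steps still stops when the budget is gone.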
// PUTTING IT TOGETHER · Harden a toy agent into a recoverable one
Take any 100-line demo agent and harden it down this checklist — each item maps to a failure mode above:
Layered retry: wrap messages.create and network-type tools in call_with_backoff (full jitter); do not wrap semantic errors — feed them back to the model verbatim.
Idempotency keys on side-effecting tools: list every "write" tool, give each an Idempotency-Key derived from the task; if you can't change the server, keep your own "executed keys" table to dedupe.
Checkpoint per step: persist {msgs, step} each round (or use a LangGraph checkpointer); try to resume on startup. Test: kill -9 mid-run; restart should continue from the breakpoint, not from scratch.
Three-layer timeout + one breaker: a per-tool timeout, a per-step timeout, and a top-level task deadline, plus a CircuitBreaker around the external dependency that flakes most.
Partial-failure degradation: if you fan out in parallel, turn gather into "collect successes + annotate failures," letting the upper layer decide continue vs stop.
Add one line of telemetry: count retries, breaker trips, and checkpoint resumes — without these metrics you simply can't see resilience problems when they happen.
When you're done the code goes from 100 to 300+ lines, and not one of the extra 200 lines is a "feature" — it's all resilience. That's the real gap between a demo agent and a production agent.
// DEEP THINKING
Hiding transport retries entirely from the model — could that give it a wrong picture of how fragile the environment is? When should you tell it?
Mostly hide it — the model shouldn't be responsible for backing off a 429, and exposing it pollutes reasoning. But two exceptions are worth surfacing: one, when backoff still ends in permanent failure (the dependency is truly down), so the model knows "this path is closed" and can switch tools or plans; two, when backoff causes significant latency and the model is making a time-sensitive decision. Principle: hide the transport fault's process, but the conclusion (success / permanent failure) must reach the model as a tool_result, or it keeps planning on a false premise.
Idempotency keys should be "derived from the task." But an agent's task is natural language with different wording each time — how do you derive a stable key?
Don't key on the natural-language task; key on the structured operation semantics. The key should come from the tool call's normalized parameters, not the upper instruction: charge:{user_id}:{order_id}:{amount}, built from deterministic fields, is identical no matter how the model phrases it or how many retries. If the operation lacks a natural unique identifier ("send an encouraging message"), have the harness generate a key when the agent enters that step and write it into the checkpoint — reuse it on resume. Essence: key stability comes from deterministic derivation in the harness, not from model output.
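Deriving the key from normalized parameters might look like this. A minimal sketch; the function name `idempotency_key` and the idea of passing an explicit list of identity-bearing `fields` are assumptions, and the 32-character truncation of the digest is an arbitrary choice.

```python
import hashlib
import json

def idempotency_key(tool_name, args, fields):
    """Derive a stable key from selected, normalized tool arguments."""
    picked = {f: args[f] for f in sorted(fields)}  # only identity-bearing fields
    canonical = json.dumps(picked, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
    return f"{tool_name}:{digest}"
```

Free-text fields the model may rephrase between attempts (a note, a message body) are left out of `fields` on purpose, so rewording does not change the key.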
Checkpoint + idempotency let an agent resume after a crash. But if the crash was caused by a model-driven infinite loop, resuming just faithfully re-runs the loop. How do you break it?
This is durable execution's blind spot: it guarantees "faithful recovery," but faithfully recovering a bad state is worthless. Add loop detection on top of checkpoints: fingerprint the last K steps' (tool, normalized args); break or escalate to a human on repeats. Deeper, resume shouldn't blindly replay — inject a meta note at the recovery point: "you crashed/looped here, try a different strategy." Distinguishing two crash classes matters — infrastructure crashes (OOM/restart) should resume verbatim; logic crashes (loops/repeated failures) should resume with reflection.
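The fingerprinting idea can be sketched with a sliding window. A minimal sketch; the class name `LoopDetector`, the window of 6 steps, and the threshold of 2 prior repeats are assumptions to tune, and tool arguments are assumed to be JSON-serializable.

```python
import hashlib
import json
from collections import deque

class LoopDetector:
    """Flags a call whose (tool, args) fingerprint already appeared
    max_repeats times within the last `window` steps."""
    def __init__(self, window=6, max_repeats=2):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, tool, args):
        raw = json.dumps([tool, args], sort_keys=True)
        fp = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        repeats = self.recent.count(fp)  # prior occurrences inside the window
        self.recent.append(fp)
        return repeats >= self.max_repeats  # True: break out or escalate
```

Persisting the window contents in the checkpoint keeps detection working across a resume; otherwise a crash resets the counter and the loop gets a fresh start.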
A breaker protects the downstream, but has a side effect on the agent: once tripped, the tool "vanishes." How does the model react, and what does that imply for tool registry design?
The model gets confused — a tool it used last turn now throws CircuitOpen. If you just return a bare exception, it may retry repeatedly (hitting the breaker) or abandon the task. Better: translate the breaker state into model-usable semantics: "X is temporarily unavailable, expected back in ~30s; use Y or do other steps first." This points to a design principle: a tool's availability is dynamic state, and the harness should surface it as part of context proactively, not only as a failure on call. The breaker essentially promotes "infrastructure health" into a model-visible decision input.
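Translating breaker state into something the model can plan around might look like this. A minimal sketch: this `CircuitOpen` carries a tool name and a `retry_at` timestamp, which is a hypothetical richer variant of the one-message exception used in the breaker earlier, and `breaker_to_tool_result` is an invented helper name.

```python
import time

class CircuitOpen(Exception):
    """Raised by a breaker while it is open (hypothetical shape for this sketch)."""
    def __init__(self, tool, retry_at):
        super().__init__(f"{tool} circuit open")
        self.tool, self.retry_at = tool, retry_at

def breaker_to_tool_result(tool_use_id, exc, alternatives=(), now=time.time):
    """Turn a tripped breaker into a message the model can act on."""
    wait = max(0, round(exc.retry_at - now()))
    text = f"{exc.tool} is temporarily unavailable, expected back in ~{wait}s."
    if alternatives:
        text += " Use " + " or ".join(alternatives) + ", or do other steps first."
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": text, "is_error": True}
```

The same text could also be injected into context before the model tries the tool at all, which is the "surface availability proactively" half of the principle.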
This whole issue is problems distributed systems solved twenty years ago. Do agents really bring "new" resilience challenges, or just repackage old solutions?
Ninety percent is old solutions, but ten percent is genuinely new, and it's the hardest ten. The old part: retry/idempotency/saga/breaker/checkpoint, lifted wholesale. The new part comes from the non-deterministic decision layer: traditional control flow is coded and statically analyzable, even formally verifiable; an agent's control flow is generated by the model on the fly, so you can't know what it calls next. "Is retry safe" degrades from a provable property into a probability problem: the model may repeat an operation at the worst moment. That turns idempotency from "best practice" into a "lifeline." Another new problem: a traditional saga's steps are known, an agent's steps are emergent, so you must design compensation for side effects "not yet done but the model might do."