DAY 39 / PHASE 4 · PRODUCTION

Agent Error Recovery & Resilience

Retry · Idempotency · Checkpoint · Circuit Breaker

2026-06-18 · BigCat

In the demo your agent always succeeds; in production it spends most of its time handling failure.

// WHY THIS MATTERS

Anyone can write an agent's happy path: a while loop, a tool call, the model decides when to stop. But in production, 90% of the engineering lives outside the happy path: networks flake, rate limits hit, tools time out, the model emits half a JSON object, step 38 crashes and the prior 37 are wasted, a retry sends the same email three times. None of this is fixable by prompting; it's the job of the harness's resilience design. Seasoned distributed-systems engineers know this toolkit (retry / idempotency / saga / circuit breaker / checkpoint), but agents reshuffle it, because they add a non-deterministic decision layer: the model itself will "try once more," and the model itself will give up over a single error. Today covers the four things most likely to cause an incident: the two-layer structure of retry, idempotency and compensation for side-effecting tools, checkpoint/resume for long tasks, and timeout/circuit-breaker/partial-failure degradation.

// 01

The Two Layers of Retry: don't mix transport errors and semantic errors

Claim: an agent's retry must be layered — transport (deterministic, program retries) vs semantic (non-deterministic, fed back to the model). Mixing them is the #1 bug.

Background & Principle

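There are two kinds of "failure" in an agent, handled in opposite ways: transport failures (429, 503, timeout), which are transient and which the program retries with backoff, and semantic failures (bad argument, file not found), which are deterministic for a given input and go back to the model as a tool_result.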

The #1 bug is swapping the layers: treating a 429 as semantic and feeding it to the model (which won't back off and keeps hammering the dependency), or treating "file not found" as transport and retrying in code (infinite loop). The test is one sentence: is the error input-independent and merely transient? Yes → program backoff; No → feed back to the model.

         A tool call failed: which way?
                        │
           ┌────────────┴───────────┐
           ▼                        ▼
┌─────────────────────┐  ┌─────────────────────┐
│ transport/transient │  │ semantic/determ.    │
│ 429·503·timeout     │  │ bad arg·not found   │
└──────────┬──────────┘  └──────────┬──────────┘
           ▼                        ▼
backoff + full jitter    error → tool_result
fail-fast after N        feed to model, retry there
(model never sees it)    (no program retry)

Hands-on

import time, random, anthropic

TRANSIENT = (anthropic.RateLimitError, anthropic.APITimeoutError,
             anthropic.InternalServerError)

def call_with_backoff(fn, max_retries=5):
    for attempt in range(max_retries):
        try:
            return fn()
        except TRANSIENT as e:        # transport: deterministic transient → program backoff
            if attempt == max_retries - 1: raise
            # full jitter: randomize the whole window, avoid thundering herd
            time.sleep(random.uniform(0, min(60, 2 ** attempt)))
        # note: semantic errors (tool failures) are NOT caught here —
        # they go back as tool_result; the model decides what's next
Failure mode: catch-all except + retry. A permanent error (401 Unauthorized, invalid argument) gets backed off and retried 5 times, failing each time. With the schedule above the sleeps add up to at most 1 + 2 + 4 + 8 = 15 seconds, so all you've done is delay surfacing the fault and burn quota. Retry only makes sense for transient, input-independent faults; a permanent error must fail-fast.
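One way to keep the two layers apart is to classify each exception exactly once, at the tool boundary. The sketch below is a minimal illustration under stated assumptions, not the harness described here: `TransientError`, `SemanticError`, and `run_tool` are hypothetical names, and the returned dict only mirrors the general shape of a `tool_result` block. Anything that is neither transient nor semantic (a 401, a bug) is deliberately left uncaught so it fails fast.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: 429 / 503 / timeout. Input-independent, worth retrying."""


class SemanticError(Exception):
    """Hypothetical: bad argument / not found. Same input will fail again."""


def _result(tool_use_id, content, is_error=False):
    # Assumed tool_result shape; adapt to the API you actually call.
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "content": content, "is_error": is_error}


def run_tool(fn, tool_use_id, max_retries=5, sleep=time.sleep):
    """Run one tool call and return a tool_result dict for the model."""
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    for attempt in range(max_retries):
        try:
            return _result(tool_use_id, str(fn()))
        except TransientError as e:
            if attempt == max_retries - 1:
                # Backoff exhausted: the model sees only the conclusion.
                return _result(
                    tool_use_id,
                    f"tool unavailable after {max_retries} attempts: {e}",
                    is_error=True)
            sleep(random.uniform(0, min(60, 2 ** attempt)))  # full jitter
        except SemanticError as e:
            # No program retry: the model has to change its input.
            return _result(tool_use_id, str(e), is_error=True)
```

The `sleep` parameter exists only so the backoff can be stubbed out in tests.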
More · Marc Brooker / AWS Exponential Backoff And Jitter, aws.amazon.com/blogs/architecture · AWS Builders' Library Timeouts, retries, and backoff with jitter, aws.amazon.com/builders-library
// 02

Idempotency & Compensation: agents will repeat, so side-effecting tools must be idempotent

Claim: an agent's retry/resume safety depends 100% on tool idempotency, not on how elegant your retry logic is.

Background & Principle

Retry and resume share a hidden premise: the operation being retried can be run multiple times safely. Reads are naturally idempotent (reading a file twice is fine), but writes are not: create_order / send_email / charge_card run once more = the customer charged twice. Agents are more dangerous than traditional systems because "retry" has two sources: program-level backoff, and the model — when a tool_result is ambiguous (a timeout, did it succeed?) the model "calls it again just to be safe." That's exactly the riskiest moment, because the first call may already have succeeded.

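Two engineering tools: idempotency keys, which make a repeated call safe, and compensation (the saga pattern), which undoes side effects that have already happened.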

Hands-on

# 1) idempotency key: makes retry safe. derive from task, not random
def book_flight(args):
    key = f"book:{args['trip_id']}:{args['flight_no']}"   # deterministic key
    return api.post("/bookings", json=args,
                    headers={"Idempotency-Key": key})       # server dedupes

# 2) compensation: register an undo per side-effect; reverse on failure
saga = []
try:
    r1 = book_flight(...);  saga.append(lambda: cancel_flight(r1.id))
    r2 = charge_card(...);  saga.append(lambda: refund(r2.id))
    r3 = book_hotel(...)                  # ← this step fails
except Exception:
    for undo in reversed(saga):     # reverse: refund first, then cancel
        undo()
    raise
Failure mode: relying on "the model won't call it twice." When a tool_result is a network timeout (you genuinely don't know if the server ran it), the model often calls again — and the idempotency key is the only reliable defense. You cannot prompt the model "don't repeat"; that builds safety on the non-deterministic decision layer.
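When the server cannot dedupe on an `Idempotency-Key`, a client-side table of executed keys is a partial substitute. The following is a minimal sketch, assuming results can be stored as strings; `ExecutedKeys` and `run_once` are hypothetical names. It does not close the window where the process dies after the side effect but before the row is committed, so server-side dedupe remains the stronger guarantee.

```python
import sqlite3


class ExecutedKeys:
    """Hypothetical client-side dedupe table keyed by idempotency key."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS executed "
            "(key TEXT PRIMARY KEY, result TEXT)")
        self.db.commit()

    def run_once(self, key, fn):
        row = self.db.execute(
            "SELECT result FROM executed WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return row[0]              # already ran: replay the stored result
        result = fn()                  # the side effect happens here
        self.db.execute(
            "INSERT INTO executed (key, result) VALUES (?, ?)", (key, result))
        self.db.commit()
        return result
```

Pointing `path` at a file instead of `:memory:` lets the table survive a restart, which is what makes it useful together with checkpoint/resume.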
More · Stripe Idempotent requests, docs.stripe.com/api/idempotent_requests · Stripe Designing robust APIs with idempotency, stripe.com/blog/idempotency
// 03

Checkpoint & Resume: agent state must live outside the process — an in-memory loop is a toy

Claim: long tasks must resume from a checkpoint, not restart from scratch — which requires externalized, persistent agent state.

Background & Principle

An agent runs 40 steps over half an hour and crashes at step 38 (OOM, deploy restart, a spot machine reclaimed). If state lives only in the in-memory messages array, the prior 37 steps are wasted — and their tokens and side effects are already spent. A production agent must persist state outside the process: checkpoint after each step, and on crash resume from the last checkpoint. This is the core of durable execution. Two ways to land it:

The key is checkpoint granularity: too fine (every token) is costly, too coarse (only at the end) is useless. In practice, checkpoint per model call or per side-effecting step. And it must pair with idempotency (§2) — on resume, that last step may have run only halfway.

Hands-on

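import json, os

def to_jsonable(blocks):
    # SDK content blocks are pydantic models; json.dump cannot serialize them as-is
    return [b.model_dump(exclude_none=True) for b in blocks]

def save_checkpoint(state, ckpt_path):
    tmp = ckpt_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, ckpt_path)                       # atomic swap: no half-written checkpoint

def agent(task, ckpt_path):
    # client, MODEL, SCHEMAS, MAX_STEPS, run_tools are assumed to be defined elsewhere
    if os.path.exists(ckpt_path):                    # resume: restore from checkpoint
        with open(ckpt_path) as f:
            s = json.load(f)
        msgs, step = s["msgs"], s["step"]
    else:
        msgs, step = [{"role":"user","content":task}], 0
    while step < MAX_STEPS:
        r = client.messages.create(model=MODEL, messages=msgs,
                                   tools=SCHEMAS, max_tokens=4096)
        msgs.append({"role":"assistant","content":to_jsonable(r.content)})
        if r.stop_reason != "tool_use": break        # end_turn, max_tokens, ...: no tools to run
        results = run_tools(r.content)               # run tools (must be idempotent!); returns JSON-safe dicts
        msgs.append({"role":"user","content":results}); step += 1
        save_checkpoint({"msgs":msgs,"step":step}, ckpt_path)   # persist per step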
Failure mode: you checkpoint state but the tool isn't idempotent — the crash lands between "tool executed" and "checkpoint written," so on resume the tool reruns and the side effect happens twice. Checkpoint and idempotency are a pair; neither works alone. Another trap: stuffing non-serializable objects (file handles, sockets) into state, which blow up on restore.
More · LangChain Persistence / Checkpointers, docs.langchain.com/.../persistence · Temporal Durable Execution meets AI, temporal.io/blog
// 04

Timeout, Circuit Breaker & Partial Failure: distinguish "degradable" from "must stop"

Claim: failing the whole task on any tool failure is an anti-pattern; a mature agent has an explicit degradation policy per failure class.

Background & Principle

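Three related mechanisms: timeouts (per tool and per task), a circuit breaker that fails fast once a dependency keeps failing, and partial-failure degradation, where the orchestrator decides which subtask failures are tolerable.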

Circuit Breaker three-state machine

┌──────────┐   ≥N fails    ┌───────────┐
│  CLOSED  │ ────────────▶ │   OPEN    │
│ pass all │               │ fail fast │
└──────────┘               └─────┬─────┘
     ▲                           │ cooldown reset_after
     │ probe ok                  ▼
     │         ┌──────────────────────┐
     └──────── │ HALF-OPEN: one probe │ probe fail → OPEN
               └──────────────────────┘

Hands-on

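import time

class CircuitOpen(Exception):
    """Raised instead of calling a dependency whose breaker is open."""

class CircuitBreaker: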
    def __init__(self, fail_max=5, reset_after=30):
        self.fail_max, self.reset_after = fail_max, reset_after
        self.fails, self.opened_at = 0, None
    def call(self, fn):
        if self.opened_at:                          # OPEN
            if time.time() - self.opened_at < self.reset_after:
                raise CircuitOpen("dependency tripped, fail fast")
            self.opened_at = None                   # → HALF-OPEN: let one probe through
        try:
            out = fn(); self.fails = 0              # success → CLOSED
            return out
        except Exception:
            self.fails += 1
            if self.fails >= self.fail_max:
                self.opened_at = time.time()       # → OPEN: trip
            raise
Failure mode: (1) only per-tool timeouts, no task-level deadline — the agent never times out on any single step in a loop yet never finishes, until it burns the whole budget. You need a top-level deadline. (2) Hard-coding partial failure as "any worker fails = whole task fails," letting one flaky edge worker drag down an entire high-value task. The degradation policy should be decided by the orchestrator per subtask importance, not one-size-fits-all.
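A top-level deadline can be sketched as a budget that every per-step timeout is clipped against. This is an illustrative sketch only: `Deadline`, `DeadlineExceeded`, and `step_timeout` are hypothetical names, and the injectable `clock` is there purely to make the behavior testable.

```python
import time


class DeadlineExceeded(Exception):
    """Hypothetical: raised when the whole task is out of time."""


class Deadline:
    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + seconds

    def remaining(self):
        return self._expires - self._clock()

    def step_timeout(self, per_step):
        """Timeout for the next step: the per-tool cap, clipped to the task budget."""
        left = self.remaining()
        if left <= 0:
            raise DeadlineExceeded("task deadline exceeded")
        return min(per_step, left)
```

Calling `deadline.step_timeout(30)` before each model or tool call means no single step can outlive the task, and a loop of individually fast steps still stops once the budget is gone.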
More · Martin Fowler CircuitBreaker, martinfowler.com/bliki/CircuitBreaker · HumanLayer 12-Factor Agents (own your control flow / errors), github.com/humanlayer/12-factor-agents

// PUTTING IT TOGETHER · Harden a toy agent into a recoverable one

Take any 100-line demo agent and harden it down this checklist — each item maps to a failure mode above:

  1. Layered retry: wrap messages.create and network-type tools in call_with_backoff (full jitter); do not wrap semantic errors — feed them back to the model verbatim.
  2. Idempotency keys on side-effecting tools: list every "write" tool, give each an Idempotency-Key derived from the task; if you can't change the server, keep your own "executed keys" table to dedupe.
  3. Checkpoint per step: persist {msgs, step} each round (or use a LangGraph checkpointer); try to resume on startup. Test: kill -9 mid-run; restart should continue from the last checkpoint, not from scratch.
  4. Two timeouts + one breaker: a per-tool timeout, a top-level task deadline, and a CircuitBreaker around the external dependency that flakes most.
  5. Partial-failure degradation: if you fan out in parallel, turn gather into "collect successes + annotate failures," letting the upper layer decide continue vs stop.
  6. Add one line of telemetry: count retries, breaker trips, and checkpoint resumes — without these metrics you simply can't see resilience problems when they happen.
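
Item 5 can be sketched with `asyncio.gather(..., return_exceptions=True)`, which returns exceptions as values instead of cancelling the whole fan-out. The helper names `gather_partial` and `should_continue`, and the idea of a `required` set of subtask names, are assumptions for illustration; a real orchestrator's policy may be richer.

```python
import asyncio


async def gather_partial(tasks):
    """tasks: dict of name -> coroutine. Returns (successes, failures)."""
    names = list(tasks)
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)
    ok, failed = {}, {}
    for name, res in zip(names, results):
        if isinstance(res, BaseException):
            failed[name] = repr(res)   # annotate the failure, don't raise
        else:
            ok[name] = res
    return ok, failed


def should_continue(failed, required):
    """Orchestrator policy: stop only if a required subtask failed."""
    return not (set(failed) & set(required))
```

The failure annotations can be passed upward (or to the model) alongside the successes, so the decision to continue or stop is made with full information rather than by the first exception.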

When you're done the code goes from 100 to 300+ lines, and not one of the extra 200 lines is a "feature" — it's all resilience. That's the real gap between a demo agent and a production agent.

// DEEP THINKING

Hiding transport retries entirely from the model — could that give it a wrong picture of how fragile the environment is? When should you tell it?
Mostly hide it — the model shouldn't be responsible for backing off a 429, and exposing it pollutes reasoning. But two exceptions are worth surfacing: one, when backoff still ends in permanent failure (the dependency is truly down), so the model knows "this path is closed" and can switch tools or plans; two, when backoff causes significant latency and the model is making a time-sensitive decision. Principle: hide the transport fault's process, but the conclusion (success / permanent failure) must reach the model as a tool_result, or it keeps planning on a false premise.
Idempotency keys should be "derived from the task." But an agent's task is natural language with different wording each time — how do you derive a stable key?
Don't key on the natural-language task; key on the structured operation semantics. The key should come from the tool call's normalized parameters, not the upper instruction: charge:{user_id}:{order_id}:{amount}, built from deterministic fields, is identical no matter how the model phrases it or how many retries. If the operation lacks a natural unique identifier ("send an encouraging message"), have the harness generate a key when the agent enters that step and write it into the checkpoint — reuse it on resume. Essence: key stability comes from deterministic derivation in the harness, not from model output.
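A minimal sketch of that derivation, assuming the harness knows which fields of each tool call are deterministic. `idempotency_key` and its `fields` parameter are hypothetical; the canonical JSON plus SHA-256 step is just one reasonable way to get a fixed-length key.

```python
import hashlib
import json


def idempotency_key(tool_name, args, fields):
    """Build a stable key from the deterministic fields of a tool call."""
    picked = {f: args[f] for f in fields}   # KeyError if a field is missing
    canonical = json.dumps(picked, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
    return f"{tool_name}:{digest}"
```

Free-text arguments the model may rephrase are simply left out of `fields`. Values still need normalizing upstream: `10` and `10.0` serialize differently and would produce different keys.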
Checkpoint + idempotency let an agent resume after a crash. But if the crash was caused by a model-driven infinite loop, resuming just faithfully re-runs the loop. How do you break it?
This is durable execution's blind spot: it guarantees "faithful recovery," but faithfully recovering a bad state is worthless. Add loop detection on top of checkpoints: fingerprint the last K steps' (tool, normalized args); break or escalate to a human on repeats. Deeper, resume shouldn't blindly replay — inject a meta note at the recovery point: "you crashed/looped here, try a different strategy." Distinguishing two crash classes matters — infrastructure crashes (OOM/restart) should resume verbatim; logic crashes (loops/repeated failures) should resume with reflection.
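The fingerprinting idea can be sketched in a few lines. `LoopDetector`, `window`, and `max_repeats` are hypothetical names and defaults, and "normalized args" is approximated here by key-sorted JSON.

```python
import json
from collections import deque


class LoopDetector:
    def __init__(self, window=6, max_repeats=3):
        self.recent = deque(maxlen=window)   # fingerprints of the last K steps
        self.max_repeats = max_repeats

    def observe(self, tool, args):
        """Record one step; return True if it repeated too often in the window."""
        fingerprint = (tool, json.dumps(args, sort_keys=True))
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.max_repeats
```

When `observe` returns True the harness can break, escalate to a human, or inject the "try a different strategy" note before the next model call.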
A breaker protects the downstream, but has a side effect on the agent: once tripped, the tool "vanishes." How does the model react, and what does that imply for tool registry design?
The model gets confused — a tool it used last turn now throws CircuitOpen. If you just return a bare exception, it may retry repeatedly (hitting the breaker) or abandon the task. Better: translate the breaker state into model-usable semantics: "X is temporarily unavailable, expected back in ~30s; use Y or do other steps first." This points to a design principle: a tool's availability is dynamic state, and the harness should surface it as part of context proactively, not only as a failure on call. The breaker essentially promotes "infrastructure health" into a model-visible decision input.
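A sketch of that translation step, under the assumption that the breaker's exception carries the tool name and the remaining cooldown. The `CircuitOpen` variant below is hypothetical and differs from a message-only exception; `breaker_to_tool_result` and `alternatives` are likewise illustrative names.

```python
class CircuitOpen(Exception):
    """Hypothetical variant that carries the tool name and remaining cooldown."""

    def __init__(self, tool, retry_after):
        super().__init__(f"{tool} circuit open")
        self.tool = tool
        self.retry_after = retry_after


def breaker_to_tool_result(exc, tool_use_id, alternatives=()):
    """Turn a tripped breaker into a message the model can plan around."""
    msg = (f"{exc.tool} is temporarily unavailable, "
           f"expected back in ~{round(exc.retry_after)}s.")
    if alternatives:
        msg += " Use " + " or ".join(alternatives) + ", or do other steps first."
    else:
        msg += " Do other steps first and come back to this one later."
    return {"type": "tool_result", "tool_use_id": tool_use_id,
            "is_error": True, "content": msg}
```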
This whole issue is problems distributed systems solved twenty years ago. Do agents really bring "new" resilience challenges, or just repackage old solutions?
Ninety percent is old solutions, but ten percent is genuinely new, and it's the hardest ten. The old part: retry/idempotency/saga/breaker/checkpoint, lifted wholesale. The new part comes from the non-deterministic decision layer: traditional control flow is coded and statically analyzable, even formally verifiable; an agent's control flow is generated by the model on the fly, so you can't know what it calls next. "Is retry safe" degrades from a provable property into a probability problem: the model may repeat an operation at the worst moment. That turns idempotency from "best practice" into a "lifeline." Another new problem: a traditional saga's steps are known, an agent's steps are emergent, so you must design compensation for side effects "not yet done but the model might do."

// FURTHER READING