Over the past year, "agent" has been abused to the point of losing meaning: every script with an LLM call brands itself an agent, and every demo wires up three roles to debate each other. But the moment you put these things in production, you meet a brutal curve: the more "agentic" the system, the worse the reliability, the higher the latency, and the faster token costs explode. Anthropic's Building Effective Agents gives the advice people now quote everywhere: it recommends "finding the simplest solution possible, and only increasing complexity when needed." This issue doesn't argue framework choice (LangGraph vs CrewAI vs AutoGen is a pointless debate). It covers four things that actually determine reliability: when not to use an agent, the real difference between ReAct / Plan-and-Execute / Reflexion, why multi-agent is almost always an anti-pattern, and the five typical failure modes that kill agent projects in production, with diagnostic tooling.
// 01
Workflow vs Agent: Ask Whether You Need an Agent First
Claim: 90% of self-described "agent" projects are actually workflows; running a workflow through an agent loop is slower, more expensive, and less reliable.
Background & Principles
Anthropic's Building Effective Agents gives an underrated dichotomy:
Workflow: LLM and tools orchestrated along a predefined code path. You write the code that decides what's next; the LLM only judges or generates at nodes.
Agent: the LLM dynamically decides which tool to use and what path to take, looping until it considers the task done.
The distinction isn't "uses tools or not" — it's who owns the control flow. A 5-step RAG prompt chain is a workflow, not an agent. Something that decides for itself whether to search again or switch tools — that's an agent.
Why does this matter? Because agent reliability roughly equals per-step success raised to the Nth power. An agent that succeeds 95% of the time per step finishes a 10-step task about 60% of the time, and a 20-step task about 36% of the time. Workflows freeze the unnecessary "self-decisions" into code, so those steps drop out of the product and only the genuinely model-dependent steps keep compounding. Same task: write it as a workflow if you can. This isn't aesthetic; it's arithmetic.
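The compounding arithmetic is easy to check. A minimal sketch, assuming every step succeeds independently with the same probability (the function name `chain_success` is illustrative, not from any library):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step ** steps

for n in (1, 5, 10, 20):
    print(f"{n:>2} steps at 95% per step -> {chain_success(0.95, n):.0%} end to end")
```

Independence is a simplification, since real steps are correlated, but the direction holds: every decision you fix in code becomes a factor of 1.0 in the product.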
When should you go agent? Anthropic's single criterion: the step count and path cannot be enumerated in advance. "Fix this bug" — you don't know how many files you'll read or edit. "Research this open-source project" — you don't know how many links you'll fetch. The other 90% of scenarios (extract + rewrite + validate + persist; classify + route; translate + polish + back-translate to check) are workflows.
Workflow (path is hardcoded; LLM judges at nodes)
input ──▶ [classify] ──┬──▶ [extract] ──▶ [validate] ──▶ output
                       └──▶ [summarize] ──▶ [validate] ──▶ output
✓ fixed steps ✓ unit-testable ✓ bounded p99 ✓ predictable cost
Agent (dynamic path; LLM picks next step)
input ──▶ ┌─ LLM ──┐ ──tool_use──▶ [tool] ──tool_result──┐
          │        │ ◀───────────────────────────────────┘
          └─ stop?─┘ ──end_turn──▶ output
✓ unenumerable path ✗ unfixed steps ✗ hard to unit test ✗ unbounded p99
Hands-on Example
"Generate structured minutes + todos + key decisions from a meeting recording" — workflow, not agent:
# --- This is a workflow (5 lines of business code + 3 LLM calls) ---
# `whisper` and `llm` stand in for your transcription and model-call helpers
def meeting_minutes(audio_path):
    transcript = whisper.transcribe(audio_path)
    summary = llm(SUMMARY_PROMPT, transcript)
    todos = llm(TODO_PROMPT, transcript)
    decisions = llm(DECISION_PROMPT, transcript)
    return {"summary": summary, "todos": todos, "decisions": decisions}
# fixed steps = 3; no agent loop / tool selection / state machine needed
# p99 ≤ 3 × upper bound of one LLM call when run serially; cacheable, batchable, parallelizable
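Because the three calls depend only on the transcript, they can run concurrently. A sketch of that, assuming a hypothetical async `llm(prompt, text)` helper; it is stubbed here so the example runs on its own, and the prompt strings are placeholders:

```python
import asyncio

async def llm(prompt: str, text: str) -> str:
    """Stand-in for a real async LLM call (hypothetical helper)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"{prompt}: {len(text)} chars processed"

async def meeting_minutes(transcript: str) -> dict:
    # The three calls are independent, so wall-clock time is roughly
    # one call rather than three.
    summary, todos, decisions = await asyncio.gather(
        llm("SUMMARY", transcript),
        llm("TODO", transcript),
        llm("DECISION", transcript),
    )
    return {"summary": summary, "todos": todos, "decisions": decisions}

result = asyncio.run(meeting_minutes("Alice: ship Friday. Bob: agreed."))
print(result["summary"])
```

The control flow is still fixed in code; concurrency changes latency, not who decides what happens next.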
"Fix this issue: user reports the dashboard intermittently white-screens on Safari" — agent:
# --- This is an agent (path is unenumerable) ---
# Possible steps:
#   read issue → grep dashboard entry → read entry → check git log →
#   find Safari-related polyfill → run tests → edit → re-run → write PR
# But "possible" means the actual route depends on what each step reveals
agent(task=issue_body, tools=[read,grep,bash,edit,write,fetch])
Decision heuristic: if you can draw all possible paths on a whiteboard, it's a workflow; if you can't, it's an agent.
Failure modes: (1) turning a fixed pipeline into an agent and asking the model to "decide" extract vs. summarize at each step — success drops from 99% to 80% because you introduced unnecessary selection risk. (2) Using an agent for routing — "greeting / technical / account issue" is fine as a classifier workflow; wrapping it as an agent with tools is over-engineering. (3) Hybrids — the path is fixed but you run it through an agent "for the AI feel"; latency / cost / observability all suffer.
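Failure mode (2) is worth seeing in code. A sketch of routing as a classifier workflow: one model decision, then a plain dictionary lookup. The `classify` function stands in for a single LLM classification call and uses keyword rules only so the example runs; the labels and handlers are illustrative assumptions:

```python
from typing import Callable

def classify(message: str) -> str:
    """Stand-in for one LLM classification call (hypothetical)."""
    text = message.lower()
    if any(w in text for w in ("password", "billing", "login")):
        return "account"
    if any(w in text for w in ("error", "crash", "bug")):
        return "technical"
    return "greeting"

HANDLERS: dict[str, Callable[[str], str]] = {
    "greeting": lambda m: "Hi! How can I help?",
    "technical": lambda m: f"Opening a technical ticket for: {m}",
    "account": lambda m: f"Routing to account support: {m}",
}

def route(message: str) -> str:
    label = classify(message)                            # the only model decision
    handler = HANDLERS.get(label, HANDLERS["greeting"])  # code owns the control flow
    return handler(message)

print(route("The app shows an error on startup"))
```

Every path through `route` can be drawn on a whiteboard and unit-tested, which is exactly what the agent version gives up.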
// 02
ReAct vs Plan-and-Execute vs Reflexion: The Real Division of Labor
Claim: default to ReAct. Escalate to Plan only when steps ≥ 8 and per-step cost is high. Only reach for Reflexion when the task is verifiable.
Background & Principles
LangChain / LlamaIndex docs make these three sound like "framework choices". They aren't. They're three different return curves for three different complexity regimes. Understanding when each works (and doesn't) matters 100× more than memorizing the names.
ReAct (Yao et al. 2022, arXiv:2210.03629) — alternating Thought / Action / Observation, each step looking at the result before deciding the next. Simple, low latency, fault tolerant (mistakes can be corrected next step). The agent loops in Claude Code and Cursor are basically ReAct variants. This is the default.
Plan-and-Execute (Wang et al. 2023, arXiv:2305.04091) — one LLM call produces a complete plan (List[step]), then execute step by step. Pros: steps can run in parallel, the plan stage is human-reviewable, token utilization is high. Cons: the plan is based on incomplete information; surprises during execution force either pushing through or a full replan. Right when: many steps (≥ 8), each step is expensive (money / time), and you have enough up-front info to plan well.
Reflexion (Shinn et al. 2023, arXiv:2303.11366) — after a run, have the LLM reflect on what went wrong, write the reflection into memory, redo with the reflection. The paper shows real gains on HumanEval / AlfWorld, but only because results are verifiable (code can be run against tests; games have a score). On open-ended tasks ("write an article", "customer chat") without ground truth, reflection becomes self-congratulatory noise — the "next time I should…" the model produces is often wrong.
ReAct (default)
┌─ Thought ─▶ Action ─▶ Observation ─┐
└─────────────────◀──────────────────┘
look at result before deciding ✓ fault-tolerant ✓ low latency ✗ hard to parallelize
Plan-and-Execute (many steps + high cost)
┌─ Plan ─▶ [s1,s2,s3,...,s10]
│           │  │  │       │
│           ▼  ▼  ▼   …   ▼  (parallelizable)
└────────── Execute ────── output
✓ plan is reviewable ✗ no dynamic adjustment
Reflexion (verifiable tasks)
┌─ Trajectory ─▶ Verifier ─▶ pass? ──end
│                              │ fail
│                              ▼
└────── Reflect ──▶ Memory ────┘
✓ lifts HumanEval/AlfWorld ✗ mostly useless without a verifier
Hands-on Example
# --- ReAct skeleton (the hot path of your own harness) ---
while True:
    r = llm(system=SYS, tools=TOOLS, messages=msgs)
    msgs.append({"role": "assistant", "content": r.content})
    if r.stop_reason != "tool_use":   # end_turn, max_tokens, ...: nothing left to run
        break
    results = [run_tool(b) for b in r.content if b.type == "tool_use"]
    msgs.append({"role": "user", "content": results})

# --- Plan-and-Execute skeleton (runs inside an async function) ---
plan = llm(PLAN_PROMPT, task)                 # List[Step] with inter-step deps
for wave in topo_sort(plan):                  # waves by dependency
    await asyncio.gather(*[execute(s) for s in wave])
if not satisfied(plan.goal):
    plan = llm(REPLAN_PROMPT, task, history)  # full replan, not local patch

# --- Reflexion skeleton (only when a verifier exists) ---
reflections = []
for attempt in range(MAX_ATTEMPTS):
    traj = react_agent(task, memory=reflections)
    result = verifier(traj)                   # e.g. run tests / validate schema
    if result.passed:
        return traj
    reflections.append(llm(REFLECT_PROMPT, traj, result.errors))
Selection tree:
Path unenumerable → use an agent; otherwise workflow (see §1).
Average steps ≤ 7, per-step cost low → ReAct.
Many steps (8+), high per-step cost (sending email / paid API / production change), enough info to plan → Plan-and-Execute.
No oracle but want reflection? Stop. Build an oracle first, or add an LLM-as-judge (see Day 6).
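The selection tree can be written down as a small function. This is a sketch: the parameter names are illustrative, and the default threshold of 8 steps is taken from the claim at the top of this section rather than from any measured cutoff:

```python
def choose_pattern(path_enumerable: bool, avg_steps: int, step_cost_high: bool,
                   can_plan_upfront: bool, has_verifier: bool,
                   many_steps_threshold: int = 8) -> tuple[str, bool]:
    """Return (pattern, reflexion_allowed) following the selection tree."""
    if path_enumerable:
        return ("workflow", False)       # no agent loop needed at all
    many_steps = avg_steps >= many_steps_threshold
    if many_steps and step_cost_high and can_plan_upfront:
        pattern = "plan-and-execute"
    else:
        pattern = "react"                # the default
    # Reflexion is only on the table when something can check the result.
    return (pattern, has_verifier)

print(choose_pattern(False, 5, False, False, True))
print(choose_pattern(False, 12, True, True, False))
```

Note that missing any one of the three Plan-and-Execute conditions falls back to ReAct, which matches the "default to ReAct" claim.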
Failure modes: (1) defaulting to Plan-and-Execute because it "feels fancier" — the plan is built on wrong info, then needs full replan on the fly; slower and more expensive than ReAct. (2) Slapping Reflexion onto every task — without a verifier, reflection is self-talk; the numbers in the paper come from experiments with oracles. (3) Combining all three — "agent plans, executes with ReAct, then reflects" sounds great and is a debugging nightmare. Stabilize one pattern first.
// 03
Multi-agent Is Almost Always an Anti-pattern — Except When It Isn't
Claim: default to a single agent + good prompt + good tools. Multi-agent only wins when "context isolation gain > orchestration cost".
Background & Principles
In 2023–2024, AutoGen / CrewAI / MetaGPT hyped multi-agent as "the future". The engineering reality in 2025: most multi-agent systems underperform the same model with a well-written system prompt and a good toolset. The reasons are physical:
Token cost explodes: N agents chatting means every agent reads every other agent's last message each turn, so token usage grows roughly as N². Anthropic's own numbers in How we built our multi-agent research system: agents use about 4× the tokens of a chat interaction, and multi-agent systems about 15×. Which is why they only used it for "research", a parallelizable, high-value, large-context scenario.
Error compounding: every agent carries its own error rate; chaining N agents compounds those errors. The "Manager / Worker / Critic" triangle looks reasonable and often underperforms a single agent.
Orchestration complexity: who speaks first? when do we stop? how do we resolve conflicts? Every piece of meta-logic is a bug source.
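The token argument in the first point can be made concrete with a toy model. This is an illustration under stated assumptions, not a measurement: every message has the same size, a peer-chat round means each agent reads the latest message of every other agent, and only message exchange is counted (not each worker's own tool context):

```python
def peer_chat_tokens(n_agents: int, rounds: int, msg_tokens: int) -> int:
    """Every round, each agent reads the latest message from every other agent."""
    return rounds * n_agents * (n_agents - 1) * msg_tokens

def orchestrator_tokens(n_workers: int, msg_tokens: int) -> int:
    """Lead sends one task per worker and reads one summary back; workers never talk."""
    return 2 * n_workers * msg_tokens

for n in (2, 4, 8):
    print(n, peer_chat_tokens(n, rounds=5, msg_tokens=500),
          orchestrator_tokens(n, msg_tokens=500))
```

Doubling the number of peers roughly quadruples the exchange cost, while the orchestrator-worker exchange grows linearly.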
When does multi-agent actually pay? Anthropic gives concrete criteria:
Naturally parallel tasks — decomposable into independent sub-tasks with near-zero coupling. Example: "research 5 different topics then synthesize".
Each sub-agent consumes a large context — splitting means no single agent's window gets swamped; each agent has its own small window.
Orchestrator-worker, not peer debate — one clear lead splits, dispatches, aggregates; workers don't talk to each other. Peer debate (multiple agents arguing) almost never wins on real tasks.
That's why Claude Code's Task tool is orchestrator-worker: the main agent invokes Task to spin up a sub-agent for independent research, the sub-agent returns one summary. Sub-agents do not chat with each other.
✓ Orchestrator-Worker (recommended)    ✗ Peer Debate (usually anti-pattern)
┌─ Lead Agent ─┐                          ┌──▶ Agent A ◀──┐
│     plan     │                          │       │       │
┌────┼────┬────┐    │                     │       ▼       │
▼    ▼    ▼    ▼    │                     ▼    Agent B    ▼
WkA  WkB  WkC  WkD  │                  Agent C ──▶ ◀── Agent D
│    │    │    │    │                     │   ↑       ↓   │
└────┴────┴────┴────┘                     └───────┴───────┘
parallel execution → aggregate         everyone talks to everyone
isolated context, no peer chat         token N², error compounding, who stops?
Hands-on Example
"Research 5 competitors and produce a comparison report" — this is where multi-agent genuinely wins:
# --- Orchestrator-Worker: lead splits, workers research in parallel, lead synthesizes ---
async def research_orchestrator(query):
    plan = lead_agent(f"Break this research into 3-6 independent subtopics: {query}")
    # plan = ["company A pricing", "company B product", ...]
    subreports = await asyncio.gather(*[
        worker_agent(topic, tools=[web_search, fetch], max_iters=15)
        for topic in plan
    ])  # each worker has its own context window; no peer chat
    return lead_agent(f"Synthesize into a report: {subreports}")

# Win: each worker independently processes ~50k tokens of web content
# single-agent serial would need ~250k context: slow AND lost-in-the-middle
# with 5 workers, lead only sees 5 summaries ~5k tokens → high-quality synthesis
"Write a function" — multi-agent should not appear here:
# BAD: "coder + reviewer + tester" three-agent debate
# - 3× token, 3× latency, errors compound
# - reviewer can't see context the coder didn't verbalize
# - running tests is a tool, not an LLM step
# GOOD: 1 agent + test tool + good system prompt
agent(task, tools=[read, edit, run_tests],
      system="Write code, run tests, fix until green.")
# 1× token, verifier (tests) replaces reviewer/tester agents
Heuristic for whether to go multi-agent: "can workers avoid talking to each other?" If yes → orchestrator-worker is on the table. If no (workers must debate / collaborate / depend on each other) → fall back to single agent or split into a workflow.
Failure modes: (1) "Manager / Engineer / Critic / PM" four roles — agents reviewing each other, tokens explode, nobody actually stops. (2) Using peer debate "for accuracy" — research shows weak gains on objective tasks (math/code) and outright degradation on subjective tasks (writing). (3) Forgetting that a verifier is 10× stronger than another agent — if you can run tests, don't pay another agent to "review code"; if you can validate a schema, don't pay another agent to "check output".
// 04
Agent Failure Mode Handbook: 5 Ways It Dies + How to Diagnose
Claim: every agent that's ever shipped to production has died one of these 5 deaths. Name them first, then defend against them.
Background & Principles
Once you've run a few real agent projects you notice failures rhyme — they show up repeatedly as one of these 5 deaths. Turning them into an internal incident taxonomy doubles your iteration speed.
Loop Lock: the agent stalls in read → think → read same file → think; or repeatedly calls the same failing tool expecting a different result. Root cause: no new understanding from the latest result, but the stop condition didn't fire.
Over-planning: the agent burns several turns on plan, writes todos, reasons about "what I should do next" — half the token budget gone before it does anything. Root cause: the system prompt rewarded verbosity, or the task is abstract enough to trigger the model's "show your work" habit.
Tool Thrashing: ping-ponging across similar tools, or after a tool error switching tools instead of fixing arguments. Root cause: overlapping tool descriptions (W04 §2), or unreadable error messages (W04 combo).
Context Bloat: context grows to 80%+ of the window, the model starts forgetting the original task (lost-in-the-middle, W02), misusing early data, drifting further with each turn. Root cause: every tool_result is dumped back verbatim with no compaction.
Silent Stop / premature end_turn: agent says "Done!" with the task not actually finished. Root cause: ambiguous completion criteria; the model treats "I did something" as "goal reached".
Almost none of these surface in dev — engineers run 3 happy paths and ship; production catches all five in the first week. Diagnosis isn't reading logs — it's trace: record (tool_name, args_hash, result_hash, token_delta, elapsed) per turn, then scan for anomalous patterns.
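One way to capture that per-turn record is a small wrapper around every tool call. A sketch under assumptions: the `TraceRecord` and `traced_call` names are illustrative, `read_file_stub` is a made-up tool, and character count is used as a crude stand-in for a real tokenizer:

```python
import hashlib
import json
import time
from dataclasses import dataclass

@dataclass
class TraceRecord:
    tool_name: str
    args_hash: str
    result_hash: str
    token_delta: int
    elapsed: float

def _h(obj) -> str:
    """Short, stable hash of any JSON-serializable value."""
    data = json.dumps(obj, sort_keys=True, default=str)
    return hashlib.sha256(data.encode()).hexdigest()[:8]

def traced_call(tool, name: str, args: dict, count_tokens=len):
    """Run one tool call and return (result, TraceRecord)."""
    start = time.perf_counter()
    result = tool(**args)
    elapsed = time.perf_counter() - start
    # count_tokens defaults to len(), i.e. characters: swap in a real tokenizer.
    rec = TraceRecord(name, _h(args), _h(result), count_tokens(str(result)), elapsed)
    return result, rec

def read_file_stub(path: str) -> str:  # hypothetical tool
    return f"contents of {path}"

_, a = traced_call(read_file_stub, "read", {"path": "app.py"})
_, b = traced_call(read_file_stub, "read", {"path": "app.py"})
print(a.args_hash == b.args_hash, a.result_hash == b.result_hash)
```

Two records with identical argument and result hashes are the raw signal for loop lock: the agent asked the same question and learned nothing new.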
Hands-on Example
Add a 30-line failure detector to your harness:
from hashlib import md5

MAX_CTX = 200_000  # assumed context window in tokens; set this for your model

def hash_call(name, args):
    return md5(f"{name}|{sorted(args.items())}".encode()).hexdigest()[:8]

class AgentMonitor:
    def __init__(self):
        self.history = []     # list of (turn, name, args_hash, result_hash)
        self.token_path = []  # context size in tokens after each turn

    def record(self, turn, name, args, result, tokens):
        ah, rh = hash_call(name, args), md5(str(result).encode()).hexdigest()[:8]
        self.history.append((turn, name, ah, rh))
        self.token_path.append(tokens)

    def check(self):
        h = self.history
        # 1. Loop lock: same (name, args_hash) 3 times in a row
        #    (result_hash is kept for the trace but not checked here)
        if len(h) >= 3 and len(set((x[1], x[2]) for x in h[-3:])) == 1:
            return "LOOP_LOCK"
        # 2. Tool thrashing: 4+ distinct tools in the last 5 turns
        if len(h) >= 5 and len(set(x[1] for x in h[-5:])) >= 4:
            return "TOOL_THRASHING"
        # 3. Context bloat: latest token count past 80% of the window
        if self.token_path and self.token_path[-1] > 0.8 * MAX_CTX:
            return "CONTEXT_BLOAT"
        return None

# Call from the main harness loop
if (state := monitor.check()):
    if state == "LOOP_LOCK":
        inject_msg("You're repeating. Try a different approach or ask the user.")
    elif state == "TOOL_THRASHING":
        inject_msg("Stop trying tools. State your hypothesis first.")
    elif state == "CONTEXT_BLOAT":
        compact_history(msgs)
Corresponding defenses:
Loop Lock: harness enforces max_iters; on detection inject "You appear to be stuck, try a different approach or ask the user"; log to trace.
Over-planning: system prompt with "Be terse. Don't restate the plan each turn"; cap plan length in plan mode.
Tool Thrashing: merge semantically-near tools (W04); error messages tell the model what to do next; harness halts after 3 consecutive identical errors.
Context Bloat: summarization every K turns (W02); replace old tool_results with "[summarized: N bytes]". Claude Code does exactly this.
Silent Stop: completion is decided by a verifier, not by the LLM — run tests where you can, validate schemas where you can; only allow end_turn after a verifier passes.
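The Context Bloat defense above is the most mechanical of the five. A minimal sketch, assuming messages are dicts with a `role` and a string `content`, and that tool results use the role `"tool"` (adapt both to your harness). It replaces old results with a size stub instead of summarizing them:

```python
def compact_tool_results(msgs: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the most recent `keep_last` tool results with a size stub."""
    tool_idx = [i for i, m in enumerate(msgs) if m["role"] == "tool"]
    stale = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    out = []
    for i, m in enumerate(msgs):
        if i in stale:
            size = len(m["content"].encode())
            out.append({**m, "content": f"[summarized: {size} bytes]"})
        else:
            out.append(m)  # untouched messages are passed through as-is
    return out

msgs = [
    {"role": "user", "content": "fix the bug"},
    {"role": "tool", "content": "x" * 1000},
    {"role": "tool", "content": "y" * 10},
    {"role": "tool", "content": "z" * 10},
]
compacted = compact_tool_results(msgs)
print(compacted[1]["content"])
```

The function returns a new list and leaves the original untouched, so the full trace stays available for replay while the model sees the compacted view. A production compactor would likely keep a short summary rather than only the byte count.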
Failure modes (meta): (1) shipping after happy-path testing — dev passing ≠ production-ready; agents must be iterated against trace sampling and failure replay. (2) "Switch the model when something breaks" — many "ReAct stuck" failures get blamed on "Claude isn't smart enough"; the real issue is a harness with no max_iters / verifier / compaction. Models don't fix systems problems. (3) "More complex framework will fix it" — swapping LangChain for LangGraph doesn't fix your loop lock; a 30-line monitor does.
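Two of the harness pieces named above, max_iters and a verifier-gated stop, fit in one small loop. A sketch under assumptions: `step` stands in for one full agent attempt, `verify` for your test runner or schema check, and the fake agent below simply replays two canned answers so the example runs:

```python
from typing import Callable, Optional

def run_until_verified(step: Callable[[str], str],
                       verify: Callable[[str], tuple[bool, str]],
                       task: str, max_iters: int = 5) -> tuple[Optional[str], int]:
    """Accept a result only when the verifier passes; give up after max_iters."""
    feedback = ""
    for i in range(1, max_iters + 1):
        draft = step(task + feedback)
        ok, errors = verify(draft)
        if ok:
            return draft, i
        feedback = f"\nPrevious attempt failed verification: {errors}"
    return None, max_iters

attempts = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])

def fake_agent(prompt: str) -> str:  # hypothetical stand-in for an agent run
    return next(attempts)

def verify_add(source: str) -> tuple[bool, str]:
    ns: dict = {}
    exec(source, ns)  # fine for a demo; sandbox untrusted code in real use
    return (ns["add"](2, 3) == 5, "add(2, 3) should be 5")

code, iters = run_until_verified(fake_agent, verify_add, "Write add(a, b)")
print(iters, code)
```

The model never gets to declare itself done: the loop ends on a passing check or on the iteration cap, whichever comes first.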
// Putting it together · 7-step decision checklist before starting a new agent project
Condense the four sections into a "before you start" checklist. Next time someone says "I want to build an AI agent", run them through this:
Can it be a workflow? If the path is enumerable, don't build an agent. Solve it with code + a few LLM calls (§1).
Step count and cost? Avg ≤ 7 steps, low cost → ReAct; ≥ 8 steps and high cost → Plan-and-Execute (§2).
Is there a verifier? Yes (tests / schema / score) → consider wrapping Reflexion. No → don't touch Reflexion; build a verifier first.
Do you actually need multi-agent? Default single agent + good tools. Only escalate when "naturally parallel + large sub-task context + orchestrator-worker topology" (§3).
Tool registry ≤ 10? See W04 §2 — past 15 tools you're already in the degradation zone.
Harness defends against all 5 deaths? max_iters, loop detector, context compactor, tool-error ceiling, verifier-based stop — one for each (§4).
Do you have an eval? Not a happy-path demo — a 20+ scenario regression suite (Day 6's topic). An agent without an eval is a folk-magic project.
Being able to answer "we don't need an agent / single agent is enough / no multi-agent" honestly through these 7 steps is senior engineering. The Anthropic / Cognition / Cursor consensus boils down to one line — complexity is a tax; pay it only when it's worth it.
// Deep Thinking
ReAct is the default, but its core is the Reason-Act-Observe loop. How is that fundamentally different from Chain-of-Thought?
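CoT is single-shot reasoning (generate the full chain, then the answer). ReAct interleaves (think → act → see result → think again). Core difference: CoT assumes the answer can be derived purely; ReAct assumes external feedback is required. ReAct must pair with tools (without tools it collapses back to CoT). In Yao et al. 2022 the gap depends on the task: on knowledge-grounded QA (HotpotQA, FEVER) ReAct alone is roughly on par with CoT and the best results come from combining the two, while on interactive decision-making tasks (ALFWorld, WebShop) ReAct beats imitation and reinforcement learning baselines by 34 and 10 absolute percentage points of success rate.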
Plan-and-Execute looks more systematic than ReAct — why isn't it the default?
Three reasons: (1) plan-stage cost — one inference produces a 10-step plan, but if step 3 reveals the plan is wrong, you re-plan everything, burning tokens; (2) static plans don't adapt to dynamic environments (e.g. web content changing); (3) measured: Plan-and-Execute loses 5–10pp to ReAct on short tasks (≤5 steps). It only wins on long tasks (10+) with a stable environment (e.g. code generation).
Multi-agent debate has lots of papers but almost no production use. Beyond orchestration complexity, what's the deeper problem?
(1) Inference cost scales with N (number of agents); (2) agents reinforce each other's errors (echo chamber), especially with the same base model; (3) headline metrics rarely move (Anthropic experiments: debate lifts reasoning tasks 2–5%, at 4× the cost). The exception is when context-isolation gain is large: long-doc analysis (each agent reads a chunk, then aggregate) — there multi-agent clearly beats single.
Reflexion makes agents self-improve — but it needs a verifier. Which tasks have a good verifier and which don't?
Verifier exists: code (run tests), math (recompute), SQL (execute), search (result count ≥ threshold). No verifier: writing (subjective quality), conversation (no ground truth), design (many right answers). Without a verifier, Reflexion loops on its own hallucinated "scores" and degrades. If you can write a verifier, use Reflexion. If not, skip.
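A verifier does not have to be elaborate. A sketch of a schema check for structured output, using only the standard library; the `verify_json` name and the schema format (key to expected Python type) are assumptions for illustration:

```python
import json

def verify_json(output: str, required: dict[str, type]) -> tuple[bool, list[str]]:
    """Check that output is a JSON object with the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as e:
        return False, [f"invalid JSON: {e.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value must be an object"]
    errors = []
    for key, typ in required.items():
        if key not in data:
            errors.append(f"missing key: {key}")
        elif not isinstance(data[key], typ):
            errors.append(f"{key} should be {typ.__name__}")
    return (not errors), errors

SCHEMA = {"summary": str, "todos": list}
print(verify_json('{"summary": "ok", "todos": []}', SCHEMA))
print(verify_json('{"summary": 42}', SCHEMA))
```

The error list is what makes this usable for reflection: it gives the next attempt something concrete to fix instead of a self-generated guess.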
Which failure mode is most common in agents? How do you detect it early?
Most common: "tool-selection loop lock" — the model keeps invoking the same tool because it doesn't know why it's failing. Detection: (1) keep a last_N_tools deque (N=3); if it's [A,A,A] force-break; (2) after each tool result, check a progress signal (output changing? closer to goal?) — no progress → escalate. Second most common: context overflow — history too long, the original instruction gets squeezed out.
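Both detection ideas fit in a few lines. A sketch, assuming "progress" can be approximated by whether the tool output changed since the previous turn; the class name and the `patience` parameter are illustrative:

```python
from collections import deque
from typing import Optional

class StuckDetector:
    def __init__(self, n: int = 3, patience: int = 2):
        self.last_tools = deque(maxlen=n)  # last N tool names
        self.last_output: Optional[str] = None
        self.stalled = 0                   # consecutive turns with unchanged output
        self.patience = patience

    def observe(self, tool: str, output: str) -> Optional[str]:
        self.last_tools.append(tool)
        self.stalled = self.stalled + 1 if output == self.last_output else 0
        self.last_output = output
        # (1) [A, A, A]: the same tool N times in a row
        if len(self.last_tools) == self.last_tools.maxlen and len(set(self.last_tools)) == 1:
            return "force_break"
        # (2) no progress signal: output has not changed for `patience` turns
        if self.stalled >= self.patience:
            return "escalate"
        return None

d = StuckDetector()
print([d.observe("grep", out) for out in ("a", "b", "c")])
```

Comparing raw outputs is a blunt progress signal; a harness with a verifier could use "number of failing tests" or a similar score instead.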