DAY 13 / PHASE 2 · SYSTEMS

Multi-agent Systems

When to use · Orchestrator-Worker · Debate · Protocols & the coordination tax

2026-05-27 · BigCat

In June 2025 Anthropic and Cognition published nearly simultaneously, taking opposite positions — Anthropic "here's how we built multi-agent", Cognition "don't build multi-agent". This issue lifts the engineering hood on that debate: when does multi-agent actually pay, and when is it just 15× token waste.

Prerequisite concepts → ai-ml-daily Day 7: Multi-Agent Systems (AutoGen, CrewAI)

// WHY THIS MATTERS

One week in June 2025 was the most informative of the year. Anthropic published How we built our multi-agent research system, dissecting how its Research feature uses Opus 4 as orchestrator and Sonnet 4 as subagents, outperforming single-agent Opus 4 by 90.2% on Anthropic's internal research eval. The same week, Walden Yan of Cognition (the Devin team) published Don't Build Multi-Agents, arguing that multi-agent systems are fragile and advocating "single agent + long context + full trace". Two frontier teams, public collision.

The point isn't "who's right". The two teams are solving different task structures: Anthropic's Research is parallel retrieval (each subagent works independently, then the lead aggregates), while Cognition's Devin is sequential coding (each step depends on the prior step's context). The engineering value of multi-agent depends entirely on whether the task decomposes into independent subtasks.

This issue assumes you already know what an agent is (covered in ai-ml-daily Day 7) and goes straight to the four engineering layers that decide multi-agent ROI: ① when NOT to go multi-agent → ② Orchestrator-Worker's real constraints → ③ Debate / multi-perspective, and what actually delivers value → ④ the coordination tax and protocols (A2A vs MCP).

Multi-agent decision line (2026 edition)

Problem: single agent + tools + long context is not enough
                │
                ▼
┌─────────────────────────────────────────────────────────┐
│ ① Can subtasks run independently in parallel            │
│    (no shared intermediate state)?                      │
│    Yes → orchestrator-worker candidate                  │
│    No  → stay on single agent + long context            │
└─────────────────────────────────────────────────────────┘
                │ Yes
                ▼
┌─────────────────────────────────────────────────────────┐
│ ② Does the single-agent context truly overflow          │
│    (>200K still not enough)?                            │
│    Yes → multi-agent can split context                  │
│    No  → longer ctx / RAG / compaction is simpler       │
└─────────────────────────────────────────────────────────┘
                │ Yes
                ▼
┌─────────────────────────────────────────────────────────┐
│ ③ Is task value > 15× chat-token cost?                  │
│    Anthropic data: multi-agent ≈ 15× chat tokens        │
│    High-value research → worth it                       │
│    Generic tasks → not worth it                         │
└─────────────────────────────────────────────────────────┘
                │ Yes
                ▼
┌─────────────────────────────────────────────────────────┐
│ ④ Are failures independently attributable               │
│    (each subagent verifiable on its own)?               │
│    Yes → ship multi-agent                               │
│    No  → failures cascade; stay on single + reflection  │
└─────────────────────────────────────────────────────────┘

Most business tasks stop at ① or ②; that's Cognition's stance.
Research / parallel retrieval / cross-tool inquiry → multi-agent flies.

// 01

Anti-pattern: 80% of "multi-agent" is single agent with extra ceremony

Claim: Cognition's Walden Yan is right — multi-agent is a fragile system by default. The problem isn't agent count, it's context-sharing collapse: subagents can't see each other's mid-flight decisions, conflicting actions overwrite each other, the orchestrator's summarization loses information. For sequentially-coupled tasks, single agent + full trace + long context is almost always more reliable, cheaper, and easier to debug.

Background & mechanics

Cognition's Don't Build Multi-Agents (June 2025) nails the anti-pattern. Core claim: multi-agent failure isn't that agents aren't smart enough — it's that they can't see each other. One subagent decides "refactor this file first, then add the feature"; a concurrent subagent decides "just add the feature without touching structure". Both are internally consistent. At merge time they collide, and the orchestrator can neither judge which is better nor roll back. This implicit decision conflict is the hardest multi-agent failure mode to treat because there's no error, no stack trace — just a result that "looks weird but runs".

Cognition's two engineering principles: (1) Share full context, not summaries — every subagent must see the entire trace (the orchestrator's goal, what other agents have done). Summarization is lossy compression and almost always loses something critical. (2) Actions carry implicit decisions — every tool call implies a judgment ("I chose this path, not the other"). When concurrent agents each make implicit decisions without a negotiation channel, coding/planning/anything-where-each-step-depends-on-the-last is strictly better with a single agent.
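Principle (1) can be sketched as a prompt builder that renders the whole trace, including the rationale behind each action, instead of an orchestrator-written summary. This is a minimal illustration under assumptions: `TraceEvent`, `build_subagent_context`, and the sample trace are hypothetical names and data, not something taken from Cognition's post.

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    agent: str      # who acted
    action: str     # what they did (tool call, edit, decision)
    rationale: str  # why: the implicit decision made explicit

def build_subagent_context(goal: str, trace: list[TraceEvent], task: str) -> str:
    """Render the full trace into the prompt instead of a lossy summary."""
    lines = [f"GOAL: {goal}", "TRACE SO FAR:"]
    for i, ev in enumerate(trace, 1):
        lines.append(f"{i}. [{ev.agent}] {ev.action} (because: {ev.rationale})")
    lines.append(f"YOUR TASK: {task}")
    return "\n".join(lines)

# Hypothetical trace: one earlier agent already chose to refactor first
trace = [
    TraceEvent("sub-1", "refactor utils.py before adding the feature",
               "current structure makes the feature hard to test"),
]
context = build_subagent_context("add CSV export", trace,
                                 "implement the export endpoint")
```

The next agent now sees that a refactor is already under way and why, which is exactly the information a summary such as "sub-1 is working on utils.py" would drop.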

This doesn't contradict Anthropic's Research system — their task structure is the opposite: parallel retrieval across 100 web pages, then orchestrator aggregation. Subagents are independent by construction, so the implicit decision conflict problem doesn't apply. Which means the first question when deciding multi-agent isn't "is this task complex?" — it's "do subtasks share implicit state?"

Another often-overlooked anti-pattern: splitting one agent's prompt across multiple agents to "reduce complexity" — the typical example being a "PM agent + Engineer agent + Reviewer agent" sequential pipeline. This split almost never improves quality — it just relocates prompt-engineering complexity from one system prompt to three inter-agent message formats. Papers like MetaGPT (encoding SOPs as agent roles) show what's possible, but require tuning each role's prompt to production quality — which is itself 3× the work of single-agent, and inevitably introduces inter-agent message drift. Unless your product SOP is actually stable enough to encode as roles (rare), single agent is always the baseline.

Task structure → agent count decision

Task type                     Parallel  Shared state  Recommended arch
──────────────────────────────────────────────────────────────────────────
Research / literature review  High      Low           Orchestrator + N workers
Web scraping & aggregation    High      Low           Orchestrator + N workers
Cross-tool inquiry            High      Low           Orchestrator + N workers
──────────────────────────────────────────────────────────────────────────
Coding agent (Devin-like)     Low       High          Single agent + long ctx
Long-form writing             Low       High          Single agent + long ctx
Debugging / refactoring       Low       High          Single agent + long ctx
──────────────────────────────────────────────────────────────────────────
Factual QA / math             Mid       Mid           Debate (2-5 agents)
Critical decision review      Mid       Mid           Debate / cross-check
──────────────────────────────────────────────────────────────────────────
"Role pipeline" splits        Low       High          ❌ Usually anti-pattern

→ High shared state + low parallelism = single agent always wins

Hands-on example

The ROI decision tree as 6 self-check questions — if ≥3 fail, multi-agent is over-engineering:

# multi_agent_check.py — sanity check before building
def should_use_multi_agent(task) -> str:
    checks = {
        "parallelizable":
            task.can_split_into_independent_subtasks,
        "low_implicit_state":
            not task.requires_step_n_depends_on_step_n_minus_1,
        "context_truly_overflows":
            task.estimated_tokens > 200_000,  # long-ctx really not enough
        "value_per_query_high":
            task.dollar_value_per_run >= 1.0,  # 15× tokens only pay off when a run is worth ~$1+
        "subagent_output_verifiable":
            task.each_subagent_result_can_be_validated_independently,
        "single_agent_tried_and_failed":
            task.has_single_agent_baseline_with_eval_score,
    }
    failed = [k for k,v in checks.items() if not v]
    if len(failed) >= 3:
        return f"❌ Stay on single agent. Missing: {failed}"
    if "parallelizable" in failed or "low_implicit_state" in failed:
        return "❌ Task structure fights multi-agent. Use single agent + long ctx."
    return "✅ Good fit. Start with orchestrator-worker; budget 15× tokens."

Failure modes: (1) "Split one agent into PM + Engineer + Reviewer" — each pipeline stage only sees upstream summaries; downstream agents don't know why upstream decided what it did, decision context is lost, quality usually drops. (2) Concurrent subagents share tools but not trace — two subagents both write the same file, call search with different queries, overwrite each other. (3) Orchestrator passes summaries instead of raw trace — summaries lose precisely the details a subagent needs to make decisions. (4) Using multi-agent debate to "improve quality" without ground-truth eval — debate easily lets all agents become confidently wrong together, harder to detect than a single agent's error. (5) Same model, same prompt across all agents — no diversity, so debate/voting degenerates into amplifying correlated errors.

Further reading · Cognition · Don't Build Multi-Agents (2025), cognition.ai/blog/dont-build-multi-agents · Anthropic · Building effective agents, anthropic.com/engineering/building-effective-agents

// 02

Orchestrator-Worker: Anthropic's bet and its real cost

Claim: Orchestrator-Worker is the only production-scale validated multi-agent pattern of 2026. Anthropic Research disclosed that Opus 4 orchestrator + Sonnet 4 workers beats single-agent Opus 4 by 90.2% — but at 15× chat-token cost. Understanding the four engineering constraints that make this trade-off work (spawn criteria, subagent prompt isolation, aggregation strategy, concurrency cap) decides whether it flies or burns cash.

Background & mechanics

Anthropic's June 2025 post How we built our multi-agent research system spells out the pattern. Architecture: user query → Lead Agent (Opus 4) plans, decomposes, and spawns N parallel Subagents (Sonnet 4), each researching a different direction → Lead aggregates all subagent findings → produces the final answer. Key numbers: on Anthropic's internal research eval, this setup outperformed single-agent Opus 4 by 90.2% while consuming about 15× the tokens of a chat interaction (a typical single agent uses about 4×).

Four engineering constraints that make it actually work:

(1) Spawn criteria: the Lead can't spawn freely — it must be taught what subqueries deserve a subagent. Anthropic's lead prompt provides judgment criteria for breadth, complexity, parallel payoff. Counter-example: telling the lead "spawn whenever you can" → it spawns a subagent for every search term, token explodes with no quality gain.
(2) Subagent prompt isolation: each subagent receives a prompt containing the task goal but NOT the work of other subagents — exactly opposite to Cognition's stance. Anthropic's design premise: subagents are already independent (different sub-directions), so cross-visibility would just inject confirmation bias. This design only works on truly parallel tasks; on sequential coding/refactoring it will collapse.
(3) Aggregation strategy: the Lead doesn't simply concat subagent results — it does a second pass of reasoning: cross-validate, dedupe, surface contradictions, fill gaps. Anthropic stresses that using a stronger model (Opus) for aggregation is critical; a weak aggregator just averages 5 noisy signals into 1 noisy signal.
(4) Concurrency cap: more subagents isn't better — beyond 5-10 the return curve flattens hard (extra agents just repeat angles already covered), while aggregation complexity rises with N. Anthropic doesn't publish exact numbers but implies 3-7 as the typical range.
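Constraint (4) is the easiest of the four to enforce in code. The sketch below is one possible way to do it, assuming a truncation cap on accepted sub-tasks plus an `asyncio.Semaphore` on in-flight work; `fake_subagent`, `MAX_SUBAGENTS`, and `MAX_CONCURRENT` are illustrative names and values, not taken from Anthropic's post.

```python
import asyncio

MAX_SUBAGENTS = 7      # hard cap on how many sub-tasks are accepted (assumed value)
MAX_CONCURRENT = 3     # how many subagents run at the same time (assumed value)

async def fake_subagent(task: str) -> str:
    # Stand-in for a real LLM + tool loop
    await asyncio.sleep(0.01)
    return f"findings for {task}"

async def run_capped(tasks: list[str]) -> list[str]:
    tasks = tasks[:MAX_SUBAGENTS]                 # drop anything beyond the cap
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(task: str) -> str:
        async with sem:                           # at most MAX_CONCURRENT in flight
            return await fake_subagent(task)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(t) for t in tasks))

# A lead that "spawns 20 for insurance" still only gets 7 subagents
results = asyncio.run(run_capped([f"q{i}" for i in range(20)]))
```

Truncating the plan is the bluntest possible policy; a real lead would more likely be asked to re-plan within the cap.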

How to read the 15× token cost? A chat turn is 1 LLM call; an ordinary agent is ≈ 4× because of its ReAct loop; multi-agent research is ≈ 15× because the lead itself loops (~4×), each of the N concurrent subagents (~3-5) runs its own loop (up to ~4× each), and aggregation adds a final pass. As a rule of thumb, a query should be worth more than about $1 for the ROI to work. Running low-value generic tasks on multi-agent just burns money; tasks where a single result is worth real money (research, due diligence, cross-tool inquiry, deep SWE-bench-style tasks) are the match.
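The decomposition above can be written out as back-of-envelope arithmetic. This is a rough sketch, not Anthropic's accounting: the aggregation multiplier of 1.0 and the per-chat cost used in the break-even check are assumptions chosen only to make the numbers concrete.

```python
def multi_agent_token_multiple(n_subagents: int,
                               lead_loop: float = 4.0,
                               sub_loop: float = 4.0,
                               aggregation: float = 1.0) -> float:
    """Tokens relative to one chat turn: lead loop + N subagent loops + aggregation."""
    return lead_loop + n_subagents * sub_loop + aggregation

def break_even_value(chat_cost_usd: float, multiple: float) -> float:
    """Minimum value a run must deliver to cover its token spend."""
    return chat_cost_usd * multiple

# Print the estimate for the 3-5 subagent range mentioned in the text
for n in (3, 4, 5):
    print(n, multi_agent_token_multiple(n))
```

With these round inputs the estimate comes out at 17-25× for 3-5 subagents, the same order of magnitude as the reported ~15× rather than an exact match, so treat it as a sanity check on the budget and not a prediction. The break-even helper makes the "$1 per query" rule of thumb explicit: at a hypothetical $0.05 per chat turn, a 15× run has to be worth about $0.75 just to cover tokens.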

Orchestrator-Worker standard topology

                 User Query
                     │
                     ▼
             ┌───────────────┐
             │  Lead Agent   │  ← Opus / GPT-5 (strong model)
             │ - planning    │
             │ - task split  │
             │ - aggregation │
             └───────────────┘
              /   │   │   │   \
             /    │   │   │    \
            ▼     ▼   ▼   ▼     ▼
  ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
  │Sub₁│ │Sub₂│ │Sub₃│ │Sub₄│ │Sub₅│  ← Sonnet/Haiku (3-7 concurrent)
  │tool│ │tool│ │tool│ │tool│ │tool│
  └────┘ └────┘ └────┘ └────┘ └────┘
    │      │      │      │      │
    └──────┴──────┼──────┴──────┘
                  │
                  ▼
      Lead aggregates + reasons
                  │
                  ▼
            Final Answer

Token math: Lead (~4×) + N·Sub (~4× ea) + agg ≈ 15× chat
Fits: value/query > $1, subtasks independent & parallel

Hands-on example

Minimal working orchestrator-worker (Anthropic SDK, pseudo-code — grasp the pattern, not the details):

import anthropic, asyncio, json

client = anthropic.AsyncAnthropic()

# ============ Subagent: independently researches one sub-direction ============
async def subagent(task: dict) -> dict:
    # NOTE: prompt only contains the sub-task, not other subagents' work
    msg = await client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2000,
        system="You are a research subagent. Focus ONLY on the assigned "
               "sub-question. Use search tools. Return: findings + sources.",
        tools=[search_tool, fetch_tool],
        messages=[{"role":"user",
                   "content": f"Sub-question: {task['question']}\n"
                              f"Scope: {task['scope']}"}],
    )
    # tool loop omitted — actually loop until stop_reason='end_turn'
    return {"task": task, "findings": msg.content[0].text}

# ============ Lead Agent: plan + aggregate ============
async def lead_agent(user_query: str) -> str:
    # Step 1: planning — strong model decides whether to split, and how
    plan = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=1500,
        system="You are a research lead. Decide if this query benefits from "
               "parallel subagents. If YES, output JSON of the form "
               "{\"tasks\": [{\"question\": ..., \"scope\": ...}]} with 3-7 "
               "INDEPENDENT sub-questions. If NO (task is sequential or simple), "
               "output {\"single\": true} and you'll handle it alone.",
        messages=[{"role":"user","content":user_query}],
    )
    # Assumes the reply is bare JSON matching the schema requested above
    spec = json.loads(plan.content[0].text)

    # Step 2a: doesn't fit splitting → fall back to single agent (most queries)
    if spec.get("single"):
        return await single_agent_run(user_query)

    # Step 2b: spawn subagents in parallel (hard cap N <= 7 against runaway spawning)
    results = await asyncio.gather(*[subagent(t) for t in spec["tasks"][:7]])

    # Step 3: aggregation — strong model again; cross-check, do not concat
    aggregation = await client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4000,
        system="You are the research lead synthesizing N subagent reports. "
               "Cross-validate, identify contradictions, fill gaps. "
               "Output: comprehensive answer + confidence notes.",
        messages=[{"role":"user",
                   "content":f"Original query: {user_query}\n\n"
                             f"Subagent findings:\n{json.dumps(results, indent=2)}"}],
    )
    return aggregation.content[0].text

# ============ Cost guard: hard ceiling against runaway ============
# In prod: per-run budget + N_subagent <= 7 + timeout — mandatory

Failure modes: (1) Weak model as Lead — Haiku-as-Lead is like a junior PM: poor planning, can't cross-check contradictions during aggregation. The Lead must be the strongest model in your stack. (2) No cap on subagent count — Leads tend to "spawn a few more for insurance"; without a cap they spawn 20, token explodes, no quality gain. Hardcode N ≤ 7. (3) Aggregation is just concat — concatenating 5 subagent outputs isn't aggregation, it's information piling. Must use the strong model for a real second-pass reasoning step. (4) No budget guard — runaway multi-agent can burn $10+ per query; hard token + tool-call + wallclock budgets are mandatory. (5) Multi-agent on sequential tasks — "let 5 subagents each write parts of the code" will fail at merge. Sequential tasks: single agent, always. (6) Not monitoring single-agent baseline — once on multi-agent, teams forget to A/B against single agent. The 90.2% gain might not exist for your task structure.

Further reading · Anthropic · How we built our multi-agent research system (2025), anthropic.com/engineering/multi-agent-research-system · Wu et al. · AutoGen: Multi-Agent Conversation Framework, arxiv.org/abs/2308.08155

// 03

Debate / Cross-check: when multi-perspective actually improves quality

Claim: Multi-agent debate (Du et al. 2023) is an overrated paradigm — it really does lift accuracy on math and factual QA (5-15pp), but on open writing, subjective judgment, and value evaluation it often degenerates into homogenized consensus. What actually works isn't "more agents" — it's where the diversity comes from: different models, different system prompts, different roles. Same model + same prompt × N runs = noise averaging, marginal gain near zero.

Background & mechanics

Du et al. 2023 (Improving Factuality and Reasoning in Language Models through Multiagent Debate, arXiv 2305.14325) is the paradigm's origin. Mechanism: N LLMs answer independently → each sees the others' answers + reasoning → multi-round iteration produces consensus. The paper showed significant gains over single-answer and self-consistency on MMLU, math, factual QA. The result remains robust in 2026 — when the task has a clear correct answer.
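The mechanism is small enough to sketch end to end. The version below is an illustration under assumptions rather than the paper's implementation: agents are plain callables standing in for LLM calls, consensus is a simple majority vote over the last round, and `stubborn` / `follower` are hypothetical deterministic stubs.

```python
from collections import Counter
from typing import Callable, Sequence

# An agent maps (question, peers' previous answers) to a new answer.
Agent = Callable[[str, Sequence[str]], str]

def debate(question: str, agents: Sequence[Agent], rounds: int = 2) -> str:
    # Round 0: every agent answers without seeing anyone else
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Consensus by majority vote over the final round
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-ins for LLM calls (hypothetical)
def stubborn(answer: str) -> Agent:
    # Never changes its answer, whatever the peers say
    return lambda question, peers: answer

def follower(initial: str) -> Agent:
    # Adopts the most common peer answer once it has seen peers
    return lambda question, peers: (
        Counter(peers).most_common(1)[0][0] if peers else initial
    )

final = debate("What is 17 * 3?",
               [stubborn("51"), stubborn("51"), follower("41")])
```

Here the follower starts with a wrong answer and is pulled to the majority. Swap the two stubborn agents to a wrong value and the follower converges on that instead, which is why the source of diversity matters more than the number of agents.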

But debate fails in 3 settings:

(1) Open subjective tasks — "write product copy", "evaluate this business plan" — no ground truth, agents converge on a "mutually acceptable mediocre answer", much worse than letting one agent generate freely and humans pick.
(2) Same model + same prompt × N — "same-source diversity" is very weak; debate degenerates into self-consistency, and positive bias amplifies in sync (all agents tend toward yes-man, debate makes it worse). For debate to work, you must construct real diversity: Claude vs GPT vs Gemini, or same model but role-prompts strongly opposed ("you are the defender / you are the critic").
(3) Long-form generation — debate fits "short answer with right/wrong" tasks. Asking agents to debate every paragraph of a 5000-word article almost always degenerates into mutual flattery or stalls.

A higher-ROI "multi-perspective" engineering pattern is asymmetric cross-check: use a different model / different prompt / dedicated verifier agent to audit the main agent's output. Example: main agent uses Claude Opus to write code → verifier agent uses GPT-5 to brainstorm test ideas + find edge cases. This "primary + auditor" structure is more robust in engineering than N-agent debate: clear ownership, controllable cost (2× not N×), independently attributable failures. Constitutional AI's self-review, self-refine, Reflexion are all variants of the same "primary + auditor" structure unfolded along different time dimensions.

One counter-intuitive finding: asking a model to "debate its own prior answer" (self-debate / self-reflection) often works worse than having another agent critique it. The cause is backward rationalization: once a model has produced answer A, it tends to defend A and look for supporting evidence rather than genuinely question it, so self-debate often turns into self-confirmation. In reliability terms, cross-check > self-reflection, and that is the real engineering sweet spot for multi-agent.

Debate / Cross-check pattern selection

Task type                  Recommended structure            Typical gain
──────────────────────────────────────────────────────────────────────────
Math / logic puzzles       N-agent debate (3-5)             +10-15pp acc
Factual QA / knowledge     Debate or RAG + cite             +5-10pp acc
Code correctness           Primary + verifier               +20% test pass
Critical decision review   Primary + cross-model audit      30%+ bugs found
──────────────────────────────────────────────────────────────────────────
Creative writing           Single agent + human pick        debate → mediocre
Marketing copy / biz call  Single agent + prompt variants   debate converges
Long-form reports          Single agent + sectioned self-refine
──────────────────────────────────────────────────────────────────────────
❌ Same model + same prompt N times → just self-consistency, zero diversity
❌ Self-debate (one agent)           → backward rationalization bias

Key variable: source of diversity > number of agents

Hands-on example

Asymmetric cross-check pattern (primary + cross-vendor verifier) — 2× cost for meaningful quality gain:

import anthropic, json, openai

claude = anthropic.Anthropic()
gpt = openai.OpenAI()

def primary_solve(question: str) -> str:
    # Primary: Claude Opus answers
    msg = claude.messages.create(
        model="claude-opus-4-7", max_tokens=1500,
        messages=[{"role":"user","content":question}],
    )
    return msg.content[0].text

def cross_check(question: str, primary_answer: str) -> dict:
    # Verifier: cross-vendor + adversarial prompt
    # Key 1: different model (diversity)   Key 2: oppositional role
    prompt = f"""You are a SKEPTICAL reviewer. Your job is to find what is
WRONG with the candidate answer below. Do NOT defend it.

QUESTION: {question}

CANDIDATE ANSWER:
{primary_answer}

Output JSON:
{{"verdict": "correct"|"incorrect"|"uncertain",
  "issues": [],
  "missing": []}}"""
    rsp = gpt.chat.completions.create(
        model="gpt-5",  # temperature left at its default; reasoning-class models may reject custom values
        messages=[{"role":"user","content":prompt}],
        response_format={"type":"json_object"},
    )
    return json.loads(rsp.choices[0].message.content)

def arbitrate(question, primary, review) -> str:
    # Only re-run when verifier finds issues — cost-controlled
    if review["verdict"] == "correct" and not review["issues"]:
        return primary
    # Let primary revise after seeing the critique (selective, not overwrite)
    revision = claude.messages.create(
        model="claude-opus-4-7", max_tokens=2000,
        messages=[
            {"role":"user","content":question},
            {"role":"assistant","content":primary},
            {"role":"user","content":
              f"A skeptical reviewer raised these issues:\n"
              f"{json.dumps(review, indent=2)}\n\n"
              f"Revise ONLY if the issues are valid. If you disagree, explain why."}],
    )
    return revision.content[0].text

# Usage (the question is a placeholder; substitute a real query)
question = "Is 2**31 - 1 prime?"
answer = primary_solve(question)
review = cross_check(question, answer)
final  = arbitrate(question, answer, review)
# Cost: ~2× single agent. Quality: +10-20pp on factual / code tasks

Failure modes: (1) Same-model debate — Claude vs Claude almost certainly converges to the same bias. Cross at least vendor or system-prompt. (2) Verifier with same prompt style — must explicitly use oppositional roles ("find errors", "skeptical"); neutral verifiers say "looks fine" too easily. (3) Cross-check on every query — doubles cost but most queries don't need it. Route by risk first (numbers, code, critical decisions). (4) Overwriting primary answer with cross-check result — verifiers also err; let primary selectively revise after seeing critique. (5) Using debate for open subjective tasks — "let 5 agents debate this slogan" inevitably converges to mediocre. Use single agent + prompt variants + human judging.
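Failure mode (3) implies a router in front of the verifier. A deliberately crude sketch follows, assuming keyword and regex heuristics are good enough for a first pass; the pattern lists and the name `needs_cross_check` are hypothetical, and a production router might use a small classifier model instead.

```python
import re

# Heuristic risk signals (assumed, not exhaustive): numbers, code, critical domains
RISK_PATTERNS = {
    "numbers": re.compile(r"\d"),
    "code": re.compile(r"`{3}|\bdef\s|\bclass\s|\bSELECT\b", re.IGNORECASE),
    "critical": re.compile(r"\b(contract|medical|legal|production|payment)\b",
                           re.IGNORECASE),
}

def risk_reasons(question: str, answer: str) -> list[str]:
    """Names of the risk signals found in the question or the answer."""
    text = f"{question}\n{answer}"
    return [name for name, pattern in RISK_PATTERNS.items() if pattern.search(text)]

def needs_cross_check(question: str, answer: str) -> bool:
    """Send to the verifier only if at least one risk signal fires."""
    return bool(risk_reasons(question, answer))

risky = needs_cross_check("What was revenue in 2024?", "About 3.2 billion")
casual = needs_cross_check("Suggest a team name", "The Otters")
```

Logging `risk_reasons` next to each routing decision makes it possible to check later whether the router is skipping queries the verifier would have caught.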

Further reading · Du et al. · Improving Factuality and Reasoning through Multiagent Debate (ICML 2024), arxiv.org/abs/2305.14325 · Shinn et al. · Reflexion: Language Agents with Verbal Reinforcement Learning, arxiv.org/abs/2303.11366

// 04

The coordination tax & protocols: A2A, MCP, and the cost of N²

Claim: The hidden cost of multi-agent systems isn't in tokens — it's coordination. As agent count N grows, communication complexity is O(N²), decision latency stacks linearly, failure attribution gets exponentially harder. Google's A2A protocol (April 2025) tries to standardize this, but choosing A2A vs MCP vs a homegrown message bus is fundamentally matching communication topology to task structure.

Background & mechanics

The multi-agent "coordination tax" is the distributed-systems problem of old reborn for LLMs. Three key overheads:

(1) Communication complexity — arbitrary inter-agent comms is O(N²) edges; each edge has message encoding, token cost, context sync overhead. Past 5-7 agents performance noticeably drops (each agent's context bloats with other agents' state).
(2) Decision latency stacking — sequential agent pipelines stack latency (5 agents × 8s = 40s); parallel agents are bounded by the slowest subagent + aggregation latency (~1.5× single-agent). User-perceived latency is usually the biggest engineering pressure point pushing against multi-agent.
(3) Failure attribution — single-agent failures are easy to locate (read the trace, find the wrong step). Multi-agent failure means answering "which agent at which step did what that led to the final error" — subagent traces + lead aggregation trace + cross-agent state, debug complexity explodes. Anthropic's engineering post admits this is the hardest engineering problem in multi-agent — they solve it via per-agent full traces + cross-trace correlation IDs.
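The first two overheads reduce to simple arithmetic. The sketch below uses the standard pairwise-channel count n(n-1)/2 for a full mesh and compares it with a star topology around one orchestrator; the 4-second aggregation time in the usage lines is an assumed value.

```python
def mesh_edges(n: int) -> int:
    """Pairwise channels when every agent can talk to every other agent."""
    return n * (n - 1) // 2

def star_edges(n: int) -> int:
    """Channels when workers only talk to one orchestrator (n agents in total)."""
    return max(n - 1, 0)

def sequential_latency(step_seconds: list[float]) -> float:
    """A pipeline pays for every stage in turn."""
    return sum(step_seconds)

def parallel_latency(sub_seconds: list[float], aggregation_seconds: float) -> float:
    """Parallel workers are bounded by the slowest one, plus aggregation."""
    return max(sub_seconds) + aggregation_seconds

pipeline = sequential_latency([8, 8, 8, 8, 8])       # 5 agents x 8s
fan_out = parallel_latency([8, 6, 7], 4)             # slowest subagent + aggregation
```

Going from 3 to 7 agents raises a full mesh from 3 to 21 channels while a star only grows from 2 to 6, which is one reason orchestrator-worker keeps workers from talking to each other.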

In April 2025, Google launched the A2A (Agent-to-Agent) protocol to standardize how agents from different frameworks and vendors discover each other, delegate tasks, and coordinate. Stack: HTTP + SSE + JSON-RPC 2.0. Each agent publishes an Agent Card (JSON describing its capabilities), a client agent finds suitable targets via these Cards, and a Task is the work-delegation abstraction. In June 2025, Google donated A2A to the Linux Foundation; 150+ organizations now back it.
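Card-based discovery can be illustrated without any network code. The cards below are simplified and hypothetical: the field names only approximate the idea of an Agent Card, the URLs are placeholders, and the real A2A schema defines more fields than shown here, so check the specification before relying on any of them.

```python
# Simplified, hypothetical cards; the real A2A schema has more fields.
CARDS = [
    {"name": "web-researcher", "url": "https://agents.example.com/research",
     "skills": [{"id": "search", "tags": ["research", "web"]}]},
    {"name": "sql-analyst", "url": "https://agents.example.com/sql",
     "skills": [{"id": "query", "tags": ["sql", "analytics"]}]},
]

def find_agents(cards: list[dict], tag: str) -> list[str]:
    """Return names of agents advertising at least one skill with the given tag."""
    return [
        card["name"]
        for card in cards
        if any(tag in skill["tags"] for skill in card["skills"])
    ]

sql_capable = find_agents(CARDS, "sql")
```

A client agent would then delegate a Task to the matching agent's endpoint; that step involves the protocol's JSON-RPC messages and is omitted here.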

A2A vs MCP — division of labor is often muddled:

MCP (Anthropic 2024): standardizes LLM agent ↔ tools/data sources. One agent calls GitHub, Slack, local files through an MCP server — a vertical connection (agent acquires capability).
A2A (Google 2025): standardizes agent ↔ agent. One agent delegates a task to another agent through A2A — a horizontal connection (agents collaborate).
Complementary, not exclusive: real systems typically use MCP for tools, A2A for other agents. Google's own framing: "A2A and MCP are complementary".

For independent developers: don't reach for protocols first. A 2-3-agent system with bare Python function calls + asyncio.gather takes a few hours. Adding A2A/MCP for general-purpose use adds serialization, network, and version-compat complexity. Protocols' value is in cross-org / cross-vendor / long-lived agent ecosystems — for one person's product with 3 internal agents, direct function calls are 10× simpler, 10× faster, 10× easier to debug than JSON-RPC over HTTP. When the system genuinely needs to expose internal agents externally or integrate third-party agents, migrate to A2A then.

Communication topology by agent count + boundary

Agents  Boundary            Recommended        Why
──────────────────────────────────────────────────────────────────────
1       Single process      Direct func call   Zero overhead
2-3     In-process          asyncio + dict     Don't over-engineer
3-7     Single product      Internal msg bus   Standardized trace
5-20    Cross-service/team  MCP (tools) + own  MCP is de-facto standard
N       Cross-vendor/org    A2A                Interop required
──────────────────────────────────────────────────────────────────────
Counter-example: 2-agent PoC straight to A2A → 50% more code, 5× harder to debug, zero interop gain

Hands-on example

The most pragmatic "protocol-less" multi-agent template for indie devs (asyncio + shared trace):

import asyncio, uuid, time, json
from dataclasses import dataclass, field

# —— Shared trace: the debug lifeline of multi-agent ——
@dataclass
class SharedTrace:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list = field(default_factory=list)

    def log(self, agent: str, kind: str, payload: dict):
        self.events.append({
            "t": time.time(), "agent": agent,
            "kind": kind, "payload": payload,
        })

# —— Subagent: just an async function, no RPC ——
async def subagent(name: str, task: dict, trace: SharedTrace,
                   budget_tokens: int = 8000) -> dict:
    trace.log(name, "start", {"task": task})
    used = 0
    try:
        # LLM call + tool loop (omitted)
        result = await run_llm_loop(task, budget_tokens=budget_tokens)
        used = result["tokens_used"]
        trace.log(name, "done", {"tokens": used, "preview": result["text"][:200]})
        return result
    except Exception as e:
        trace.log(name, "error", {"err": str(e), "tokens": used})
        return {"failed": True, "reason": str(e)}

# —— Orchestrator: concurrent + budget + timeout, three layers of safety ——
async def orchestrate(user_query: str) -> dict:
    trace = SharedTrace()
    trace.log("lead", "query", {"query": user_query})

    plan = await lead_plan(user_query, trace)
    if len(plan["tasks"]) > 7:
        plan["tasks"] = plan["tasks"][:7]   # hard cap

    # 3 protections: per-agent budget / global timeout / failure isolation
    try:
        results = await asyncio.wait_for(
            asyncio.gather(*[
                subagent(f"sub-{i}", t, trace, budget_tokens=8000)
                for i, t in enumerate(plan["tasks"])
            ], return_exceptions=True),
            timeout=120,                 # 2 min global budget
        )
    except asyncio.TimeoutError:
        trace.log("lead", "timeout", {})
        results = [{"failed": True, "reason": "global timeout"}]

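    # Aggregate over partial success (failure isolation).
    # With return_exceptions=True, raised exceptions come back as objects in results,
    # so keep only dict results that are not marked failed.
    good = [r for r in results if isinstance(r, dict) and not r.get("failed")]
    final = await lead_aggregate(user_query, good, trace)

    # Persist trace: lifeline for debug + eval (assumes the runs/ directory exists)
    with open(f"runs/{trace.run_id}.json", "w", encoding="utf-8") as f:
        json.dump({"run_id": trace.run_id, "events": trace.events, "final": final},
                  f, ensure_ascii=False)
    return {"answer": final, "trace_id": trace.run_id}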

Failure modes: (1) No budget guard: LLM agent loops occasionally hang in a tool-call death loop, burning $50 on one query overnight. Hard caps on tokens + tool calls + wallclock are mandatory. (2) No timeout: asyncio.gather without a timeout means one stuck subagent stalls the whole run. (3) Fail-fast on a single subagent's failure: multi-agent should use return_exceptions=True plus partial-success aggregation so failures stay isolated. (4) No persisted trace: without full trace persistence, post-mortem multi-agent debugging is hopeless. Trace is the lifeline of multi-agent engineering. (5) Reaching for A2A with only 3 agents: protocol overhead vastly outweighs the benefit; migrate only when cross-vendor needs arise. (6) Using MCP for agent-to-agent comms: MCP is agent ↔ tools, not agent ↔ agent; misusing the protocol introduces the wrong abstraction.
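Failure mode (1) names three caps, while the template above only passes a token budget. One way to cover all three is a small guard object that every loop iteration charges against. This is a sketch with hypothetical names (`Budget`, `BudgetExceeded`) and arbitrary limits, not part of any SDK.

```python
import time

class BudgetExceeded(RuntimeError):
    """Raised as soon as any hard cap is crossed."""

class Budget:
    """Hard caps on tokens, tool calls, and wall-clock time for one run."""

    def __init__(self, max_tokens: int, max_tool_calls: int, max_seconds: float):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.deadline = time.monotonic() + max_seconds
        self.tokens = 0
        self.tool_calls = 0

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record usage, then raise if any of the three caps is exceeded."""
        self.tokens += tokens
        self.tool_calls += tool_calls
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token cap hit: {self.tokens} > {self.max_tokens}")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(
                f"tool-call cap hit: {self.tool_calls} > {self.max_tool_calls}")
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("wall-clock cap hit")

budget = Budget(max_tokens=8000, max_tool_calls=3, max_seconds=120)
budget.charge(tokens=1500, tool_calls=1)   # one loop iteration, within all caps
```

Calling `budget.charge(...)` after every LLM response and tool call turns a runaway loop into an exception that the subagent's existing error handler can log to the trace.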

Further reading · Google · Agent2Agent (A2A) Protocol (2025), a2a-protocol.org · Anthropic · Model Context Protocol (MCP), modelcontextprotocol.io · Hong et al. · MetaGPT: Meta Programming for Multi-Agent Framework, arxiv.org/abs/2308.00352

// End-to-end · Two-week path to a multi-agent system

Say you want to upgrade an internal task to multi-agent (e.g. "give me a weekly AI industry research brief"). Two weeks from single agent to steady-state multi-agent:

Day 1 · Single-agent baseline + eval: Write the simplest single-agent version (Claude Opus + search tool). Run 10 real queries, human-rate each output. That score is the baseline for every decision after. No baseline → don't ship multi-agent.
Day 2 · Structural diagnosis: Run #01's 6-question check against your task. ≥3 failures → stop; it's not a multi-agent problem. Research-style tasks typically pass.
Day 3-4 · Minimal orchestrator-worker: Use #04's asyncio template. Lead = Opus, subagents = Sonnet/Haiku, hard cap N ≤ 5. Re-run the same 10 queries and compare to baseline.
Day 5 · Trace persistence: Write every run's full trace as JSON, build the simplest viewer (which agent at which step spent how many tokens producing what). Looks like overkill until the first multi-agent debug session.
Day 6 · Aggregation quality: Verify the Lead's aggregation prompt actually does cross-validation, not concat. Common bug: Lead dumps all subagent outputs verbatim to user, with no dedupe or contradiction detection.
Day 7 · Decision point: Does the multi-agent eval score beat single-agent by 10pp+? No → task structure doesn't fit, revert to single agent. Yes → continue.
Day 8-9 · Asymmetric cross-check: Add a verifier (different model + adversarial prompt) at the critical output. Key: don't cross-check every query — use a router to judge risk first. Usually adds another 5-10pp.
Day 10-11 · Three-layer budgets: token / tool-call / wallclock all get hard caps. Runaway multi-agent burning $50 per query is real; this step saves you.
Day 12 · Failure-mode audit: Deliberately construct 5 edge cases (all subagents fail / timeout / tool unavailable / contradictory conclusions / empty results) and verify the system degrades gracefully.
Day 13-14 · Long-tail observation: Run 50 real queries. Watch token distribution, p50/p95 latency, failure rate. Now decide on A2A/MCP — for most indie devs, the answer is "not needed, asyncio is enough".
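The Day 5 trace step can start very small. The sketch below assumes a JSON Lines file with one event per agent step; the `TraceEvent` fields and the helper names are hypothetical, chosen to answer "which agent, at which step, spent how many tokens, producing what".

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class TraceEvent:
    # Hypothetical schema: one record per agent step.
    run_id: str
    agent: str
    step: int
    tokens: int
    output: str
    ts: float = field(default_factory=time.time)


def append_event(path: Path, event: TraceEvent) -> None:
    """Append one event as a single JSON line, so a crash mid-run loses at most one record."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event), ensure_ascii=False) + "\n")


def load_events(path: Path) -> list[dict]:
    """Read the whole trace back for post-mortem inspection."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def tokens_by_agent(events: list[dict]) -> dict[str, int]:
    """The simplest possible 'viewer': total tokens spent per agent."""
    totals: dict[str, int] = {}
    for e in events:
        totals[e["agent"]] = totals.get(e["agent"], 0) + e["tokens"]
    return totals
```

Appending rather than rewriting the file keeps partial traces usable when a run is killed by a budget guard, which is when the trace is needed most.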

What these two weeks teach: multi-agent isn't "I'll add more agents to look advanced", it's "I established a baseline, diagnosed the task structure, validated that the gain justifies the token cost (Anthropic's reference point: 15× tokens for a 90.2% gain), put trace and budget guards in place, and only then went multi-agent". That's the real 2026 posture for multi-agent engineering. Most business tasks reach the Day-7 decision point and find that single agent + long context is the correct answer.
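The Day 1 baseline and the Day 7 decision point reduce to one comparison. A minimal sketch, assuming human ratings on a 0-100 scale for the same queries in both runs; `day7_decision` and its parameter names are hypothetical, and the 10pp default is the threshold from the path above.

```python
from statistics import mean


def day7_decision(baseline_scores: list[float], multi_scores: list[float],
                  min_gain_pp: float = 10.0) -> dict:
    """Compare multi-agent against the single-agent baseline on the same queries."""
    if not baseline_scores or len(baseline_scores) != len(multi_scores):
        raise ValueError("need the same non-empty set of queries for both runs")
    gain_pp = mean(multi_scores) - mean(baseline_scores)
    return {"gain_pp": gain_pp, "go_multi_agent": gain_pp >= min_gain_pp}
```

With only 10 queries the mean is noisy, so a gain close to the threshold is better treated as "not proven" than as a pass.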

// Further reading

Anthropic · How we built our multi-agent research system (2025) — Orchestrator-worker case study; source of 15× token / 90.2% gain numbers
Cognition · Don't Build Multi-Agents (Walden Yan, 2025) — The counter-argument; context-sharing thesis
Anthropic · Building Effective Agents — Workflow vs agent decision framework
Du et al. · Multiagent Debate (ICML 2024) — Origin of the debate paradigm; factuality/reasoning gains
Shinn et al. · Reflexion (NeurIPS 2023) — Self-reflection paradigm; cousin of asymmetric cross-check
Wu et al. · AutoGen (COLM 2024) — Microsoft's multi-agent conversation framework
Hong et al. · MetaGPT (ICLR 2024) — Encoding SOPs as agent roles
Google · Agent2Agent (A2A) Protocol — Agent ↔ agent interop standard
Anthropic · Model Context Protocol (MCP) — Agent ↔ tools standard; complementary to A2A

// Deeper questions

Anthropic and Cognition published opposing multi-agent positions in the same week (June 2025). The positions are not truly contradictory, yet each side's "best practice" is the other's "anti-pattern". What does this split reveal about LLM engineering?

It reveals: architecture is a function of task, not of model. Anthropic Research's core task is "parallel retrieval + aggregation" — subtasks are naturally independent; multi-agent is a fitting parallel-compute abstraction. Cognition Devin's core task is "sequential coding + debugging" — every step heavily depends on the prior step's implicit state; multi-agent inevitably collapses on implicit decision conflicts. Both are right, in different task spaces. The deeper engineering philosophy: LLM-era "best practices" must be discussed bound to task structure, or the discussion is empty. Most debates about "multi-agent vs not", "long context vs RAG", "fine-tune vs prompt" stall because the task-structure variable is omitted. Senior engineering judgment is largely about "reading task structure" — once you can tell whether a task is parallel-decomposable, sequentially dependent, or mixed, the architectural choices follow almost automatically.

Anthropic publicly disclosed 15× chat-token consumption — what does this mean for AI economics? Will "> $1 per query to make ROI" lock multi-agent into enterprise forever?

Short-term yes, medium-term no. Short-term: $1+ per query rules out consumer scale (a consumer query is typically worth < $0.01) and most lightweight small-to-medium business scenarios. What survives 15× tokens are tasks with high per-query value: research, due diligence, enterprise search, complex SWE-bench-style coding, parallel outbound work (lead research / candidate background checks). Medium-term (2-3 years), three forces shift the equation: (1) Inference cost keeps dropping exponentially: Haiku 4.5 is 20×+ cheaper than Claude 1-era pricing, so cheap-subagent strategies become viable. (2) Prefix caching becomes standard: the shared system prompt and tool definitions across agents hit cache aggressively, so actual token cost is closer to 6-8× than 15×. (3) Specialized small models rise: fine-tuned subagent / verifier / aggregator models cut per-subagent cost by ~10×. Together, 2-3 years out multi-agent reaches mid-value tasks (~$0.10/query); consumer remains hard.
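The "closer to 6-8× than 15×" estimate can be sanity-checked with back-of-envelope arithmetic. The sketch below is a simplification under stated assumptions: `effective_multiplier` is a hypothetical helper, it prices every token at one rate (in practice output tokens are not cacheable and cost more), and both the cached fraction and the cache-read discount are inputs you would have to measure or look up for your provider.

```python
def effective_multiplier(base_multiplier: float, cached_fraction: float,
                         cache_read_discount: float) -> float:
    """Token-cost multiplier after prefix caching.

    base_multiplier:     multi-agent tokens relative to a chat turn (15 in the text)
    cached_fraction:     share of those tokens served from cache (assumption)
    cache_read_discount: price of a cached token relative to an uncached one,
                         e.g. 0.1 means cached reads cost 10% (assumption)
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if not 0.0 <= cache_read_discount <= 1.0:
        raise ValueError("cache_read_discount must be between 0 and 1")
    uncached = 1.0 - cached_fraction
    return base_multiplier * (uncached + cached_fraction * cache_read_discount)
```

Under these assumptions, a 60% cache-hit share at a 10% cached price brings 15× down to roughly 6.9×, and a 50% share gives roughly 8.3×, so the 6-8× range implies that more than half of all tokens are cache hits.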

Multi-agent debate gains 10-15pp on ground-truth tasks but degenerates into homogeneous consensus on open subjective tasks — is this an LLM trait or a debate-paradigm limit? Do human debates have this problem?

Mostly an LLM trait that amplifies an inherent limit of debate, with a partial human analogue. Debate's theoretical assumption is that agents have real diversity: independent information sources, priors, and reasoning paths. Same model + same prompt × N satisfies none of these: priors come from the same training data, reasoning from the same attention patterns, and failure modes are homogenized. So absent ground-truth anchoring, N agents naturally converge on the "most common mediocre answer in training data", a direct expression of RLHF plus large-model averaging tendencies. Humans have a similar problem: in-group homogenization (echo chambers) produces mediocre consensus. But human debate usually draws its variance from real diversity of experience and interests; LLM agents have neither. This means multi-agent debate has to inject diversity explicitly to work: cross-vendor models, strongly opposed role prompts, different knowledge bases / tool stacks. "More agents naturally produce diversity" is a wrong intuition for the LLM era.

A2A (Google 2025) already has 150+ org support, but Anthropic and Cognition both used internal message buses (not A2A) in their public multi-agent implementations. Is that normal for an early-stage protocol, or a structural problem with A2A?

More likely early-stage normality, with some A2A choices still needing time to prove out. Anthropic's Research is single-product internal multi-agent — all subagents are Anthropic's own Claude, cross-vendor interop yields zero benefit; A2A would just add HTTP/JSON-RPC overhead for no upside. Same for Cognition. A2A's real target scenarios are cross-org / cross-vendor agent ecosystems — e.g. your agent delegating tasks to Slack's agent or querying Salesforce's agent — not yet at scale. Structurally A2A has two open questions: (1) Prompt injection across trust boundaries — external agents injecting malicious instructions via task descriptions has a much larger attack surface than a single MCP tool. (2) Billing / SLA model unclear — one agent delegating to another: who pays for tokens, how are timeouts handled — the protocol doesn't say. These need solving for A2A to truly take off. MCP went the same path: published 2024, internal teams validated for 1 year, ecosystem at scale in 2025-2026. A2A likely hits that phase in 2026-2027.

If by 2027 single-agent context windows are universally 10M+ tokens and reasoning is further improved, will multi-agent's engineering value disappear? Or will new "only multi-agent can solve" scenarios emerge?

It will shift but not disappear; new scenarios will emerge. Short-term: long context + stronger reasoning will erode multi-agent's middle ground. Today's "single-agent context overflows → go multi-agent" cases (research briefs, long-doc processing) will be doable tomorrow with single-agent + 10M context + stronger reasoning, so ROI in those scenarios drops. But multi-agent gets more valuable in truly parallel + truly heterogeneous scenarios: (1) Cross-model collaboration: Claude on text + Gemini on multimodal + a dedicated code agent for verification, with complementary capabilities that context size cannot replicate. (2) Cross trust-boundary collaboration: your agent collaborating with someone else's agent (A2A scenarios) can never merge into one context. (3) Real-time high-parallelism tasks: e.g. monitoring 100 data streams for sub-second decisions physically requires parallel agents. (4) Specialized agent ecosystems: deeply fine-tuned domain agents (medical diagnosis / legal audit / code security) that a single general agent will never match. So by 2027, multi-agent shifts from "remedy for context limits" to "true collaboration / interop architecture": higher bar, narrower scenarios, but more irreplaceable.