AI/ML Deep Dive: Multi-Agent Systems

Day 7 · 2026-05-25
For engineers experienced in coding but new to AI/ML
Engineering counterpart → super-individual D13: Multi-agent Systems (Orchestrator-worker, debate, when it's an anti-pattern)

AutoGen · Microsoft AutoGen

Framework · Conversational
One-line analogy

Think of multi-agent systems as an internal Slack channel — each Agent is a channel member and all collaboration happens via messages. You don't write state machines or explicit control flow; you define who can join, who initiates, who's allowed to reply, and let the conversation evolve toward a result. Backend analogy: trade orchestrated RPC for an event-driven message bus.

What it solves + how it works

A single Agent has limited context and limited skills. For a task like "read code + write tests + run tests + fix bug," you can stuff it all into one Agent, but the prompt swells and attention dilutes. AutoGen (Microsoft, open-sourced 2023, with the v0.4 rewrite announced in late 2024) splits the task across specialized Agents that collaborate through conversation. The classic (v0.2) API has only two core abstractions; v0.4's AgentChat layer keeps the same ideas under new names (AssistantAgent, plus team classes such as RoundRobinGroupChat and SelectorGroupChat):

  • ConversableAgent: a node that can receive messages, send messages, and call tools; under the hood it's just LLM + tools + memory;
  • GroupChat + GroupChatManager: the "moderator" that picks who speaks next. Common strategies: round-robin, let the LLM choose, or rule-based by role.

The whole system is essentially the actor model — each Agent is an actor, messages are the only coupling. v0.4 takes this further with a real async message bus and distributed runtime.
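The moderator idea can be sketched without any framework. The snippet below is a minimal illustration, not AutoGen's implementation: ToyAgent, ToyGroupChat, and the canned reply functions are hypothetical stand-ins for LLM-backed agents, and only the round-robin strategy is shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker name, text)


@dataclass
class ToyAgent:
    name: str
    # Stands in for "LLM + tools + memory": maps the shared history to a reply.
    reply: Callable[[List[Message]], str]


@dataclass
class ToyGroupChat:
    agents: List[ToyAgent]
    max_turns: int = 6  # safety net so the chat always ends
    history: List[Message] = field(default_factory=list)

    def pick_speaker(self, turn: int) -> ToyAgent:
        # Round-robin; an LLM-chosen or rule-based picker would replace this method.
        return self.agents[turn % len(self.agents)]

    def run(self, task: str, stop_word: str = "APPROVED") -> List[Message]:
        self.history.append(("user", task))
        for turn in range(self.max_turns):
            speaker = self.pick_speaker(turn)
            text = speaker.reply(self.history)
            self.history.append((speaker.name, text))  # messages are the only coupling
            if stop_word in text:
                break
        return self.history


coder = ToyAgent("coder", lambda h: "def search(xs, t): ...")
reviewer = ToyAgent(
    "reviewer", lambda h: "APPROVED" if len(h) >= 4 else "Handle the empty list."
)

chat = ToyGroupChat([coder, reviewer])
transcript = chat.run("Write binary search that handles empty arrays.")
for speaker, text in transcript:
    print(f"[{speaker}] {text}")
```

Every agent sees the same history and nothing else, which is why swapping the speaker-selection rule changes the behavior of the whole system without touching any agent.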

User Proxy → Manager (picks speaker)
                 ↓ routes to
Coder Agent | Tester Agent | Reviewer Agent
↑ all messages enter the same GroupChat history ↑
Code example
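import asyncio

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.base import TaskResult
from autogen_agentchat.conditions import MaxMessageTermination, TextMentionTermination
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient

model = OpenAIChatCompletionClient(model="gpt-4o")  # reads OPENAI_API_KEY from the environment

# Each Agent does one thing; the system prompt defines its "persona"
coder = AssistantAgent("coder", model_client=model,
    system_message="You are a Python engineer. Output runnable code only, no prose.")

reviewer = AssistantAgent("reviewer", model_client=model,
    system_message="You are a code reviewer. Point out bugs/improvements; reply 'APPROVED' when satisfied.")

# Group chat: take turns until a message mentions APPROVED.
# v0.4 expects a TerminationCondition object here, not a lambda;
# MaxMessageTermination caps the run if approval never comes.
team = RoundRobinGroupChat([coder, reviewer],
    termination_condition=TextMentionTermination("APPROVED") | MaxMessageTermination(10))

async def main():
    async for msg in team.run_stream(task="Write binary search that handles empty arrays."):
        if isinstance(msg, TaskResult):   # the stream ends with a summary object
            print(f"Stopped: {msg.stop_reason}")
        else:
            print(f"[{msg.source}] {msg.content}")

asyncio.run(main())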
Pitfall + practical scenario
"More agents = smarter system" is wrong. In practice, groups of 4-5+ agents easily fall into mutual flattery loops or silent deadlock (everyone assumes someone else will respond), with token costs exploding and zero progress. Production lesson: start with 2 agents (executor + critic), add roles only as needed, and always cap the run (max_turns on the team, or a MaxMessageTermination condition) as a safety net.
📌 Your scenario: your "investment research desk" — three Agents play bull analyst, bear analyst, and risk manager, each arguing about the same stock; you (User Proxy) make the final call. Instead of a single "objective but bland" report, you watch a structured debate.
Takeaway + question
💡 AutoGen reduces multi-agent to two things: a message bus and roles. Don't write workflows — write conversation rules.
🤔 In the workflows you currently orchestrate by hand, which steps are essentially conversations rather than flowcharts?

CrewAI · Role-based Agent Orchestration

Framework · Workflow
One-line analogy

If AutoGen is a "Slack channel," CrewAI is a "project team with a PM" — each Agent has a clear Role, Goal, and Backstory; Tasks are explicitly assigned to specific Agents; execution can be Sequential or Hierarchical (with a Manager Agent). Backend analogy: shift from "message-driven" back to an explicit workflow engine — more like Airflow + Slack mixed together.

What it solves + how it works

AutoGen's conversational style is flexible but unpredictable: the same task may take wildly different conversational paths across runs, which is hostile to production. CrewAI (open-sourced in late 2023, now one of the most popular multi-agent frameworks) takes the opposite philosophy: make the flow explicit, keep the roles stable. You define "3 Agents + 5 Tasks as a directed graph" upfront, then run it. Two execution modes:

  • Sequential: Task1 → Task2 → Task3; previous Task's output feeds the next (pipeline-style);
  • Hierarchical: a Manager Agent (auto-created from the manager_llm you supply, or a custom manager_agent; typically a stronger model) breaks down, dispatches, and reviews; worker Agents do the actual work, like a corporate hierarchy.

"Backstory" isn't decoration: it shapes the model's tone and decisions. Writing "you are a senior financial analyst with 15 years of experience, known for being rigorous and conservative" typically produces more focused output than "you are an analyst," though the size of the effect varies by model and task, so measure it on your own evals.

Code example
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool   # assumption: any CrewAI-compatible search tool works here

search_tool = SerperDevTool()   # needs SERPER_API_KEY in the environment

researcher = Agent(
    role="Market Researcher",
    goal="Find 3 latest trends for {topic} with sources",
    backstory="Industry analyst with 10 years of primary-source research experience",
    tools=[search_tool])

writer = Agent(
    role="Content Editor",
    goal="Rewrite the research into a tight 800-word brief",
    backstory="Ex-Economist editor; prizes fact density and readability")

# Tasks are explicitly bound to Agents; `context` declares dependencies
research = Task(description="Research {topic} trends for 2026", agent=researcher,
                expected_output="3 trends + citation links")
write    = Task(description="Rewrite as a brief", agent=writer,
                context=[research],   # waits for research to finish
                expected_output="800-word markdown")

crew = Crew(agents=[researcher, writer], tasks=[research, write],
            process=Process.sequential)
# {topic} placeholders in goals and descriptions are filled from `inputs`
result = crew.kickoff(inputs={"topic": "AI Agent"})
print(result)
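The `context` dependencies above form a directed graph, and a sequential run is one valid ordering of that graph. The framework-free sketch below illustrates the idea with the standard library; the task names and lambda "workers" are hypothetical stand-ins for agent calls, and this is not how CrewAI schedules tasks internally.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List

# Each task lists the tasks whose output it needs (its "context").
deps: Dict[str, List[str]] = {
    "research": [],
    "outline": ["research"],
    "draft": ["outline", "research"],
}

# Stand-ins for agent calls: each worker receives upstream outputs keyed by task name.
workers: Dict[str, Callable[[Dict[str, str]], str]] = {
    "research": lambda ctx: "3 trends",
    "outline": lambda ctx: f"outline of [{ctx['research']}]",
    "draft": lambda ctx: f"brief from [{ctx['outline']}]",
}


def run_sequential(deps, workers) -> Dict[str, str]:
    """Run every task after all of its dependencies, passing their outputs along."""
    outputs: Dict[str, str] = {}
    for name in TopologicalSorter(deps).static_order():
        context = {d: outputs[d] for d in deps[name]}
        outputs[name] = workers[name](context)
    return outputs


outputs = run_sequential(deps, workers)
print(outputs["draft"])
```

Because the ordering is computed from declared dependencies rather than negotiated in conversation, the same inputs always traverse the same path, which is the predictability argument in a few lines.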
Pitfall + practical scenario
"Longer Role/Backstory = better" is wrong. As a rule of thumb, backstories over ~200 words compete with tool descriptions and task context for the model's attention and can hurt performance. Best practice: one-sentence role, a goal with a verifiable output (not "do good analysis" but "output 3 trends with citations"), and a backstory that highlights 1-2 character traits.

📌 Your scenario: a Sunday-night "Family Weekly Crew" — 4 Agents: schedule aggregator (next week's calendar), nutrition planner (meal plan), kids' homework tracker, and family CFO (bills/budget). Run them Sequentially and you get a single family brief — far fewer errors than one generalist Agent.
Takeaway + question
💡 CrewAI picks predictability via "roles + flow"; AutoGen picks flexibility via "conversation." Different trade-offs, not better/worse.
🤔 For your workflow: which matters more — stable structure or exploratory flexibility? Your answer picks the framework.

Role Specialization · Why Narrow Agents Beat Generalists

Design · Cognition
One-line analogy

Splitting one Agent into several roles is the backend world's "monolith → microservices" move — not because "division of labor" is virtuous, but because small context + focused prompt + restricted toolset together cut a single Agent's cognitive load, making each inference sharper. The costs mirror microservices too: communication, debugging, and overall consistency all get harder.

What it solves + how it works

Hang 20 tools + 5 paragraphs of system prompt + full task context on one Agent and you get intent drift: it starts writing code, switches to explaining architecture, then begins interrogating the user. Narrowing role scope is the usual remedy. Anthropic's "Building Effective Agents" (2024) makes a related point about routing: specialized prompts give separation of concerns, whereas tuning one prompt for one kind of input can hurt performance on the others. Three mechanisms are behind the gain:

  • Attention focus — short prompt + few tools means every token "knows what it's doing," similar to RAG narrowing retrieval (Day 4);
  • Failure isolation — one Agent's failure doesn't pollute the entire chain's conversation history;
  • Independent optimization — give critical roles a stronger model (Opus), auxiliary roles a cheaper one (Haiku) — better cost/quality on both axes.

Simple heuristic for "should I split this?": when an Agent's failures cluster around "did the wrong kind of thing" (wrote code when it should advise, summarized when it should expand) rather than "didn't do it well enough," it's time to split.
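The split heuristic can be turned into a quick check over labeled failures. This is a sketch under stated assumptions: the labels "wrong_kind" and "low_quality" and the 0.5 threshold are illustrative choices, not values from any framework.

```python
from collections import Counter
from typing import Iterable


def should_split(failure_labels: Iterable[str], threshold: float = 0.5) -> bool:
    """Suggest splitting when most failures are 'did the wrong kind of thing'.

    Each label is either "wrong_kind" (e.g. wrote code when it should advise)
    or "low_quality" (right kind of output, not good enough).
    """
    counts = Counter(failure_labels)
    total = sum(counts.values())
    if total == 0:
        return False  # no failures observed, nothing to fix by splitting
    return counts["wrong_kind"] / total > threshold


recent = ["wrong_kind", "wrong_kind", "low_quality", "wrong_kind"]
print(should_split(recent))
```

Labeling failures by hand for a few dozen runs is usually enough to see which side dominates; low-quality failures point at the prompt or model instead.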

Code example
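from autogen_agentchat.agents import AssistantAgent
from autogen_ext.models.anthropic import AnthropicChatCompletionClient

# Assumptions: the model IDs are examples, and every tool below is a plain
# Python function defined elsewhere (format_code avoids shadowing the builtin format).
opus = AnthropicChatCompletionClient(model="claude-3-opus-20240229")
haiku = AnthropicChatCompletionClient(model="claude-3-haiku-20240307")

# Anti-pattern: a generalist Agent that writes, reviews, and runs
generalist = AssistantAgent("engineer", model_client=opus,
    system_message="You are an engineer. You write code, review code, run tests, fix bugs, write docs...",
    tools=[write_code, run_tests, lint, format_code, git, search_docs])

# Better: three specialized roles, each with short prompt and tight tools
coder = AssistantAgent("coder", model_client=opus,
    system_message="Write the implementation only. No tests, no docs.",
    tools=[write_code, search_docs])

tester = AssistantAgent("tester", model_client=haiku,  # cheaper model
    system_message="Write pytest cases and run them. Report pass/fail.",
    tools=[run_tests])

reviewer = AssistantAgent("reviewer", model_client=opus,
    system_message="Review code. List the 1-3 most serious issues. Don't rewrite.",
    tools=[lint])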
Pitfall + practical scenario
"The finer the split, the better" is wrong. Each additional Agent adds another LLM call and another lossy context handoff; over-splitting causes information loss, latency bloat, and debugging hell. Rule-of-thumb thresholds: once an Agent has fewer than 5-7 tools and a system prompt under 100-200 words, it's already focused enough, so stop splitting. Microservices have a "right-sized" granularity; Agents do too.
📌 Your scenario: a "cross-disciplinary learning assistant" for reading papers — split into a translation Agent (Chinese↔English), an analogy Agent (relate to familiar fields), and a devil's-advocate Agent (find weak points). Each Agent has a tiny prompt and tightly scoped output — information density is much higher than a single "help me read this paper" generalist.
Takeaway + question
💡 The point of role specialization isn't "simulating human teams" — it's offloading the LLM so each inference only carries necessary context.
🤔 Take that 1000+ word prompt you've fed a single LLM in the past — can you split it into three 300-word roles? After splitting, what are you most worried about losing?

Collaboration Protocols · Coordinating Multiple Agents

Distributed · Patterns
One-line analogy

How multiple Agents "exchange info and reach agreement" is the LLM-era version of classic distributed systems coordination — Paxos, leader election, gossip, blackboard pattern all have analogs here. Today's multi-agent frameworks are essentially combinations of these few patterns.

What it solves + how it works

With N Agents, you have two questions to answer: (1) Who speaks/acts when? (control flow) (2) How is shared state managed? (data flow). Four mainstream protocols:

① Hierarchical (Manager-Worker)
Manager → decompose → A | B | C → aggregate
Analogy: CTO + engineers; matches CrewAI Hierarchical

② Sequential / Pipeline
A → output → B → output → C
Analogy: CI/CD pipeline; CrewAI Sequential default

③ Debate / Group Chat
A ↔ B ↔ C (shared conversation history, take turns)
Analogy: meetings; AutoGen GroupChat, Multi-Agent Debate (Du et al. 2023)

④ Blackboard / Shared Memory
shared state (DB/KV) ⇄ A | B | C (independent read/write)
Analogy: actors + Redis; LangGraph State and Letta's shared memory follow this pattern

Which to use depends on the task: splittable + mergeable → Hierarchical; clear order dependency → Sequential; multi-perspective debate → Debate; any subset needs any intermediate result → Blackboard. Production systems typically mix — outer Sequential, with Debate inside one step.
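The selection rule above reads naturally as a small decision function. The sketch below is illustrative only: the TaskTraits field names are made up for this example, and the order in which the traits are checked (and the single-Agent default) is an assumption, since real tasks often match more than one trait and end up mixing protocols.

```python
from dataclasses import dataclass


@dataclass
class TaskTraits:
    splittable_and_mergeable: bool = False
    ordered_dependency: bool = False
    needs_multiple_perspectives: bool = False
    shared_intermediate_results: bool = False  # any subset needs any intermediate result


def pick_protocol(t: TaskTraits) -> str:
    """Map task traits to one of the four protocols, in the order the text lists them."""
    if t.splittable_and_mergeable:
        return "hierarchical"
    if t.ordered_dependency:
        return "sequential"
    if t.needs_multiple_perspectives:
        return "debate"
    if t.shared_intermediate_results:
        return "blackboard"
    return "single agent"  # no trait calls for multi-agent coordination


print(pick_protocol(TaskTraits(ordered_dependency=True)))
```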

2024-2025 also brought new protocol layers: Google's A2A (Agent-to-Agent) standardizes cross-vendor Agent interop (like Day 6's MCP did for tools); Anthropic's sub-agent support in Claude Code lets a main Agent delegate work to sub-Agents that run in their own context. Both are still engineering iterations on these four patterns.

Code example
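# LangGraph: graph + shared state to implement hybrid protocols (production favorite)
import operator
from typing import Annotated, TypedDict

from langgraph.graph import END, StateGraph

class State(TypedDict):
    question: str
    research: Annotated[list, operator.add]   # blackboard region; node outputs are appended
    answer: str

# Placeholder helpers (hypothetical): swap in real search / LLM calls.
def search_web(question): return f"findings for: {question}"
def critique(notes):      return f"critique of {len(notes)} note(s)"
def synthesize(notes):    return " | ".join(notes)
def good_enough(s):       return len(s["research"]) >= 4   # i.e. two research/critic rounds

def researcher(s): return {"research": [search_web(s["question"])]}
def critic(s):     return {"research": [critique(s["research"])]}
def writer(s):     return {"answer": synthesize(s["research"])}

g = StateGraph(State)
g.add_node("research", researcher)
g.add_node("critic", critic)
g.add_node("writer", writer)
g.add_edge("research", "critic")             # sequential
g.add_conditional_edges("critic",            # conditional routing: loop back or finish
    lambda s: "writer" if good_enough(s) else "research")
g.add_edge("writer", END)
g.set_entry_point("research")
app = g.compile()
# If good_enough never returns True, LangGraph's recursion limit stops the loop.
result = app.invoke({"question": "AI Agent trends in 2026?", "research": []})
print(result["answer"])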
Pitfall + practical scenario
"Use Debate so N agents can vote out hallucinations" is only half-true. Du et al. (2023) report that Debate improves accuracy on reasoning and factuality benchmarks by roughly 5-15% depending on the task, but it helps far less on knowledge-heavy questions where every Agent lacks the same fact: all Agents share the same base model, so errors are correlated and voting can't fix systematic bias. To actually reduce hallucinations, adding RAG beats adding Agent votes.
📌 Your scenario: a "family decision assistant" in LangGraph — the blackboard holds the family calendar, budget, and recent mood logs for the kids; three Agents read/write independently (schedule planning / financial impact / family-time score) and produce a multi-dimensional recommendation on "should we add a tutoring session this weekend." Shared state is more suitable than multi-turn conversation when family decisions need factual consistency.
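The correlated-errors point in the pitfall above can be made concrete with a little probability. The sketch below compares two idealized extremes, fully independent voters and fully correlated voters; real Agents on a shared base model sit somewhere in between, and the 0.7 per-Agent accuracy is an arbitrary illustrative number.

```python
from math import comb


def majority_correct_independent(p: float, n: int) -> float:
    """P(majority of n voters is right) if each is right with prob p, independently.

    Assumes n is odd so there are no ties.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def majority_correct_fully_correlated(p: float, n: int) -> float:
    """If all voters share the same errors, they are right or wrong together."""
    return p


p, n = 0.7, 5
print(f"independent: {majority_correct_independent(p, n):.3f}")
print(f"correlated:  {majority_correct_fully_correlated(p, n):.3f}")
```

With independent errors, five voters at 70% accuracy reach about 84% as a majority; with fully shared errors they stay at 70%, so the extra four Agents buy nothing.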
Takeaway + question
💡 The heart of multi-agent protocol design isn't "AI collaboration" — it's distributed systems coordination. Your distributed background is the core leverage.
🤔 If you view N Agents as N microservices, how do their failure modes match (and differ from) classical distributed systems? (Hint: think about what idempotency and causal consistency mean when outputs come from LLMs.)

Deep Questions · Reflection

1. Anthropic's "Building Effective Agents" explicitly says "use workflows before multi-agent" — why? When is multi-agent truly irreplaceable?
Core argument: multi-agent introduces extra failure surface (each added Agent adds another LLM failure probability) + communication cost (token waste in message passing) + debugging difficulty (non-determinism compounds). Most "multi-agent" tasks can be solved with explicit workflow + single Agent more reliably and cheaply. Truly irreplaceable scenarios: (a) strong task parallelism — 10 independent subproblems, single Agent takes 10× time serially, 10 Agents in parallel take 1×; (b) need for adversarial perspectives — red/blue debate, generate-vs-judge; one Agent playing both roles will "self-comfort"; (c) hard role/permission isolation — customer-service Agent must not see financial Agent's data, security mandates process separation; (d) long task wants different models — cheap model for bulk filtering, strong model for final decision. Otherwise prefer workflows. Same engineering wisdom as "monolith first, microservices later."
2. Where does multi-agent's non-determinism come from? How do you make outputs reproducible and testable in production?
Three layers of randomness: (a) LLM sampling at temperature > 0; (b) Agent message order is unfixed under async; (c) tool calls (external APIs) are themselves unstable. Mitigations: (1) temperature=0 plus a fixed seed where the API offers one: OpenAI exposes a best-effort seed parameter, Anthropic's Messages API does not, and neither guarantees bit-identical output, so this reduces LLM-side randomness rather than eliminating it; (2) explicit flow (CrewAI Sequential / LangGraph) over free-form conversation (AutoGen GroupChat), which moves "who speaks" from an LLM decision to a code decision; (3) mock external tools: replace search/DB calls with fixtures during eval to strip out external noise; (4) evaluate via distributions, not single runs: run the same task N times and record the pass rate, not "did this one run pass," which is closer to ML-model evaluation thinking; (5) structured output: force each Agent to emit JSON that validates against a schema so intermediate state is diffable. The whole approach reduces multi-agent from "magic" to "testable system."
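Mitigations (3) and (4) combine into a small evaluation harness. The following is a sketch, not a real eval framework: mock_search, flaky_agent, and its 80% success rate are hypothetical stand-ins, with a seeded random generator playing the role of LLM sampling noise.

```python
import random
from typing import Callable


def mock_search(query: str) -> str:
    # Fixture replacing a live search API during evaluation.
    return "fixture: binary search runs in O(log n)"


def flaky_agent(task: str, rng: random.Random) -> str:
    # Hypothetical stand-in for an LLM-backed agent that fails on about 20% of runs.
    evidence = mock_search(task)
    return evidence if rng.random() < 0.8 else "I am not sure."


def pass_rate(
    run: Callable[[random.Random], str],
    check: Callable[[str], bool],
    n: int,
    seed: int = 0,
) -> float:
    """Run the same task n times and report the fraction of runs that pass."""
    rng = random.Random(seed)  # fixed seed makes the eval itself reproducible
    return sum(check(run(rng)) for _ in range(n)) / n


rate = pass_rate(
    lambda rng: flaky_agent("binary search complexity", rng),
    check=lambda out: "O(log n)" in out,
    n=200,
)
print(f"pass rate over 200 runs: {rate:.2f}")
```

A regression test then asserts on the rate (for example, "at least 0.75") instead of on any single transcript.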
3. Pricing multi-agent: how do you estimate the per-task cost of a 5-agent crew? Which hidden costs are most often underestimated?
Baseline formula: total tokens = Σ over Agents of (system_prompt + cumulative conversation history + tool descriptions) × number of times that Agent is invoked. Most underestimated hidden costs: (a) conversation history bloat: AutoGen GroupChat by default has all Agents share the full history, so 5 Agents × 10 rounds = 50 LLM calls, each carrying everything said so far, and total input tokens grow quadratically with the number of turns; (b) tool description duplication: one tool used by 3 Agents has its schema in each Agent's prompt; (c) retry on failure: when an Agent's output is rejected for format errors, that call's tokens are spent twice; (d) unnecessary "polite check-ins": Agents acknowledging each other ("got it, starting now") can burn 5-10% of tokens. Cost-cutting moves: prompt caching (mentioned Day 6), periodic summarization of history into memory, kill politeness prompts ("deliver result directly, no acknowledgement"), prefer Hierarchical over Debate (the former needs O(N) messages per round, the latter O(N²) when every Agent reads every other Agent). Taking a typical 5-agent task from 50K tokens to 8K tokens is common, which cuts single-task cost by about 84%.
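The quadratic growth in (a) is easy to see with a back-of-the-envelope calculator. This sketch counts input tokens only and uses made-up sizes (500 fixed tokens per call for system prompt plus tool schemas, 150 tokens per reply); the history_cap parameter is a crude stand-in for periodic summarization.

```python
from typing import Optional


def shared_history_input_tokens(
    calls: int,
    fixed_per_call: int,
    reply_tokens: int,
    history_cap: Optional[int] = None,
) -> int:
    """Total input tokens when every call carries the shared history so far.

    fixed_per_call: system prompt + tool descriptions sent on every call.
    reply_tokens:   tokens each call appends to the shared history.
    history_cap:    if set, history carried per call is capped (summarization stand-in).
    """
    total = 0
    history = 0
    for _ in range(calls):
        carried = history if history_cap is None else min(history, history_cap)
        total += fixed_per_call + carried
        history += reply_tokens
    return total


full = shared_history_input_tokens(calls=50, fixed_per_call=500, reply_tokens=150)
capped = shared_history_input_tokens(
    calls=50, fixed_per_call=500, reply_tokens=150, history_cap=1000
)
print(full, capped)
```

Doubling the number of calls roughly quadruples the uncapped history cost, while the capped version grows linearly once the cap is reached.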
4. Compare "multi-Agent collaboration" with the distributed consensus algorithms (Paxos / Raft) you know: what's the same, what's different? What does that teach about multi-agent design?
Same: both solve "how do multiple unreliable nodes agree on a value." LLMs, like distributed nodes, can "fail" (wrong output), "partition" (inconsistent context), and behave "Byzantine" (malicious / prompt-injected). Different: (a) Paxos/Raft assume non-Byzantine (crash) faults that hit only a minority of nodes independently, but multiple Agents on the same base model produce wrong-but-confident output in highly correlated ways (same biases, same blind spots), so voting consensus is weak; (b) Paxos optimizes for correctness + latency; multi-agent optimizes for correctness + creativity: the former eliminates disagreement, the latter exploits it; (c) distributed consensus has formal proofs (safety/liveness), multi-agent has essentially none. Lessons: (1) multi-agent voting should use different base models (Opus + GPT-4 + Gemini) to decorrelate; (2) inject external ground truth (RAG, tools) to break "group hallucination"; (3) critical nodes need veto power (human-in-the-loop = veto); (4) accept that multi-agent "liveness" can't be guaranteed, so always have timeouts and fallbacks. Carrying distributed systems' pessimistic assumptions into multi-agent design avoids most of these pitfalls.
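Lesson (4) maps directly onto a timeout wrapper. A minimal sketch using only the standard library follows; stuck_agent and quick_agent are hypothetical coroutines standing in for real Agent calls, and the fallback here is just a fixed string.

```python
import asyncio
from typing import Awaitable, Callable


async def call_with_fallback(
    agent_call: Callable[[], Awaitable[str]], timeout_s: float, fallback: str
) -> str:
    """Await an agent call, but give up and return `fallback` after timeout_s seconds."""
    try:
        return await asyncio.wait_for(agent_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def stuck_agent() -> str:
    await asyncio.sleep(10)  # simulates an Agent that never answers in time
    return "late answer"


async def quick_agent() -> str:
    return "on-time answer"


print(asyncio.run(call_with_fallback(stuck_agent, 0.05, "fallback: escalate to human")))
print(asyncio.run(call_with_fallback(quick_agent, 0.05, "fallback: escalate to human")))
```

In a real system the fallback would typically be a cheaper single-Agent path or a human escalation rather than a constant.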
5. Multi-agent seems to simulate "organizations" — but human organizations are inefficient and politically draining. If we treat multi-agent as a "sandbox for new organizational designs," what can it teach human management?
Interesting reverse direction. Several findings that recur in multi-agent experiments hold for human organizations too: (a) clearer role boundaries → more efficient collaboration: a large share of multi-agent failures come down to "unclear responsibility," and humans are the same; (b) explicit process beats flexible negotiation: AutoGen GroupChat deadlocks easily, CrewAI explicit Tasks are stable, which mirrors "meeting culture vs async-doc culture"; (c) specialized + short context > generalist + long context: generalists coordinate, specialists go deep, same principle; (d) critics are cheaper than executors: using a cheap model for the reviewer often works fine, which mirrors "good code review doesn't need the most senior person, just someone with clear rules"; (e) hierarchy (Manager) beats pure democracy (Debate) for getting results, but the Manager must have a higher-quality "brain" (stronger model / broader vision). Conversely, multi-agent reminds us that organizational design isn't "add more people, get smarter"; it's "in what context does the benefit of structured division of labor outweigh coordination cost." That loops back to the sober Anthropic line: if a single Agent / single person can solve it, don't pile on people. Complexity always has a price. This kind of cross-disciplinary mapping is exactly your strongest mode of thinking.