DAY 04 / PHASE 1 · ENGINEERING

Tool Use & Function Calling

Schema as Prompt · Tool Granularity · Selection Collapse · Parallel Calls

2026-05-23 · BigCat

A tool's description determines your agent's ceiling more than model choice.

Foundation concepts → ai-ml-daily Day 6: Tool Use (Function Calling, MCP protocol)

// WHY THIS MATTERS

Day 3 framed the harness as the agent's OS. Tools run on top of that OS, and what those tools look like is decided by the tool schema. A fact almost everyone underrates: the few lines of English you write in description influence final success rate more than swapping the model. Anthropic's internal SWE-bench ablations repeatedly confirm this: with the same Sonnet model, rewriting the descriptions of just 6 tools shifts pass@1 by 10+ points. The tool schema is part of the prompt: it goes into the same KV cache as the system message, participates in attention, and conditions every downstream token. This issue covers four things: writing schemas as prompts, why 20 tools usually underperform 6, the real atomic-vs-composite trade-off, and why parallel tool calls are the most wasted capability in agent design. It closes with Anthropic's "7 tool-design rules" checklist.

// 01

Tool Schema Is a Prompt: Description Matters 10× More Than Name

Claim: whether the model picks your tool doesn't depend on name — it depends on the wording in description and every field in input_schema.

Background & Principles

How does a tool definition actually enter the model? Anthropic's Tool Use Overview spells it out: every registered tool is serialized into a structured text block appended to the system prompt. From the model's perspective, tools=[...] is not a "function registry" — it's a document it must read and understand. From that document, the model must decide: when to call, which one to call, what arguments, and when not to call.

That explains several frequently observed effects:

Expanding description from one line to five lifts accuracy materially — you're handing the model not just a name, but use cases.
Each parameter's own description (not its name) clarifying units, formats, and edges helps more than renaming start_date → start_date_iso8601. The model reads descriptions; it doesn't really "read" snake_case names.
Adding "when to pick this value" notes on enum entries beats a bare ["fast","accurate","creative"] by 2–3× selection accuracy.

Mechanism: once the tools block is in KV cache, every generated token attends back to that text. The more specific and scenario-grounded the description, the more conditioning the model has at the two key decisions — "should I call?" and "which one?". This isn't magic; it's a physical property of attention.
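The enum point above can be sketched as follows. This is a minimal illustration, not an official pattern: the `mode` parameter, the note texts, and the `annotate_enum` helper are all hypothetical names introduced here. The idea is simply to move the "when to pick this value" guidance into the parameter description, where the model reads it.

```python
# Hypothetical "mode" parameter for an illustrative tool; all names are assumptions.
MODE_PARAM_BARE = {"type": "string", "enum": ["fast", "accurate", "creative"]}

# One "when to pick this value" note per enum entry (example wording only).
MODE_NOTES = {
    "fast": "prioritize latency; pick for quick lookups and interactive chat",
    "accurate": "prioritize correctness; pick for figures, legal or medical text",
    "creative": "prioritize variety; pick for brainstorming and drafting",
}


def annotate_enum(param: dict, notes: dict) -> dict:
    """Return a copy of `param` whose description says when to pick each enum value."""
    missing = [value for value in param["enum"] if value not in notes]
    if missing:
        raise ValueError(f"no selection note for enum values: {missing}")
    lines = [f"- {value}: {notes[value]}" for value in param["enum"]]
    return {**param, "description": "Execution mode. Choose one:\n" + "\n".join(lines)}


MODE_PARAM_ANNOTATED = annotate_enum(MODE_PARAM_BARE, MODE_NOTES)
```

Keeping the notes in a separate mapping makes it easy to fail loudly when someone adds an enum value without explaining when to choose it.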

Hands-on Example

Same weather tool, two ways of writing it — and a wildly different hit rate:

# —— BAD: schema treated as a function signature ——
{
  "name": "get_weather",
  "description": "Get weather",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {"type":"string"},
      "unit":     {"type":"string", "enum":["c","f"]}
    }
  }
}

# --- GOOD: schema treated as prompt ---
# (Python dict literal: adjacent string literals inside parentheses concatenate)
{
  "name": "get_weather",
  "description": (
    "Look up the CURRENT weather (now ± 1h) for a single city. "
    "Use ONLY for present-time questions like 'is it raining in Tokyo now'. "
    "DO NOT use for: forecasts >24h ahead (use get_forecast), historical "
    "weather (use get_weather_history), or air quality (use get_aqi)."
  ),
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": (
          "City name in English, optionally followed by "
          "country, e.g. 'Tokyo', 'Paris, FR'. Do not pass GPS coords."
        )
      },
      "unit": {
        "type": "string", "enum": ["celsius","fahrenheit"],
        "description": (
          "Temperature unit. Default to celsius unless the user "
          "explicitly mentions °F or asks in a US locale context."
        )
      }
    },
    "required": ["location"]
  }
}

Three upgrades that matter: (1) description states use cases + counter-examples (DO NOT use for…) — the cheapest fix for cross-tool mis-selection; (2) every parameter has its own description with format, constraints, and default inference rules; (3) enum values are full names (celsius not c) — readability = model comprehension. Treat these three as a checklist and run every tool through it.

Failure modes: (1) writing only "description": "do X" — the model doesn't know when not to call and over-calls on edge cases. (2) Encoding semantics in the tool name (get_user_v2_by_email_only) — the model reads description, not your snake_case tokens. (3) Omitting all parameter descriptions — the model has to guess format from type and falls flat on dates / paths / IDs.
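The three-point checklist can be turned into a small lint pass. The sketch below is one possible way to do it, under stated assumptions: `lint_tool` is a hypothetical helper, the phrase list used to detect counter-examples is a crude heuristic, and "fewer than 3 characters" is an arbitrary stand-in for "abbreviated enum value".

```python
# Phrases that suggest the description states when NOT to use the tool (heuristic).
COUNTER_EXAMPLE_PHRASES = ("do not use", "don't use", "not for")


def lint_tool(tool: dict) -> list[str]:
    """Return human-readable warnings for one tool definition (empty list = passes)."""
    warnings = []
    desc = tool.get("description", "").lower()
    # (1) description should state counter-examples, not only use cases
    if not any(phrase in desc for phrase in COUNTER_EXAMPLE_PHRASES):
        warnings.append("description has no counter-example ('do not use for ...')")
    props = tool.get("input_schema", {}).get("properties", {})
    for name, spec in props.items():
        # (2) every parameter carries its own description
        if not spec.get("description"):
            warnings.append(f"parameter '{name}' has no description")
        # (3) enum values should be full words, not abbreviations
        short = [v for v in spec.get("enum", []) if isinstance(v, str) and len(v) < 3]
        if short:
            warnings.append(f"parameter '{name}' has abbreviated enum values: {short}")
    return warnings


BAD_TOOL = {
    "name": "get_weather",
    "description": "Get weather",
    "input_schema": {"type": "object", "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["c", "f"]},
    }},
}

GOOD_TOOL = {
    "name": "get_weather",
    "description": "Look up the CURRENT weather for a single city. "
                   "DO NOT use for forecasts (use get_forecast).",
    "input_schema": {"type": "object", "properties": {
        "location": {"type": "string", "description": "City name in English."},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"],
                 "description": "Temperature unit. Default to celsius."},
    }, "required": ["location"]},
}

bad_warnings = lint_tool(BAD_TOOL)    # four warnings
good_warnings = lint_tool(GOOD_TOOL)  # no warnings
```

A check like this catches only the mechanical omissions; whether the counter-examples are the right ones still needs a human or an eval.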

Going deeper · Anthropic Tool Use Overview, docs.claude.com/.../tool-use · Anthropic How to implement tool use, docs.claude.com/.../implement-tool-use · OpenAI Function Calling Guide, platform.openai.com/.../function-calling

// 02

More Tools Make Models Dumber: The Real Selection-Collapse Curve

Claim: past ~15 tools, tool-selection accuracy drops sharply. Subtract before you add.

Background & Principles

A commonly ignored engineering fact: accuracy does not rise monotonically with tool count; the relationship is an inverted U. Zero tools: nothing to do. 3–7 orthogonal tools: peak agent performance. 20+ tools: the model gets confused, mis-selects, skips the right call, and passes wrong arguments. The Berkeley Function Calling Leaderboard (BFCL) multi-turn scenarios show this curve, and Anthropic's Building Effective Agents makes the related point that tool definitions and specifications deserve just as much prompt engineering attention as your overall prompts.

Why the collapse? Three real reasons:

Token budget: each tool definition is ~100–300 tokens, so 20 tools = 2–6K tokens in the system prompt, which squeezes context and pollutes attention.
Semantic confusion: more tools = inevitable semantic near-duplicates (search_docs / find_in_docs / lookup_documentation); the model oscillates.
Decision path explosion: N tools means N counter-examples ("don't call") for attention to track. Capacity gets spread thin and every decision is shakier.
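The token-budget point is easy to check with back-of-the-envelope arithmetic. The sketch below is only a rough estimator under two assumptions: the per-tool range of 100–300 tokens quoted above, and a "roughly 4 characters per token" heuristic for English text. A provider-side token counter, where available, gives exact figures.

```python
import json


def budget_range(n_tools: int, low: int = 100, high: int = 300) -> tuple[int, int]:
    """Token range for n_tools definitions at `low`..`high` tokens each."""
    return n_tools * low, n_tools * high


def estimate_tool_tokens(tools: list[dict], chars_per_token: float = 4.0) -> int:
    """Rough estimate: serialized character count divided by chars_per_token."""
    serialized = json.dumps(tools, ensure_ascii=False)
    return round(len(serialized) / chars_per_token)


six = budget_range(6)      # (600, 1800)
twenty = budget_range(20)  # (2000, 6000)
```

Running `estimate_tool_tokens` over the actual registered schemas before and after a cleanup gives a quick, if approximate, measure of how much prompt budget the tools block consumes.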

That's also why Claude Code's core tool registry has only ~10 atomic tools (Read / Edit / Write / Bash / Grep / Glob / Task / WebFetch / WebSearch / TodoWrite) — anything else attaches on demand via MCP, never resident.

[Figure: illustrative curve of tool-selection accuracy (y-axis, 0.2 to 1.0) against number of tools (x-axis, 3 to 100). Accuracy climbs to a plateau near 1.0, then declines steadily as tools are added. Based on BFCL multi-turn and various agent evals. Sweet spot: 5–12. Past 20, it drops fast.]

Hands-on Example

If your MCP / agent has already stacked 30+ tools, apply the 3-step "tool diet":

# Step 1: rank by call frequency, look at the long tail
sqlite3 agent.db "SELECT tool_name, COUNT(*) c FROM tool_calls 
  GROUP BY tool_name ORDER BY c DESC;"
#  Long-tail tools (< 1% of calls) can almost always be deleted or merged

# Step 2: find semantic twins and merge
#  search_files / find_files / list_matching → merge into search_files(pattern, mode)

# Step 3: split by scenario, not by API
#  BAD:  get_user_by_id / get_user_by_email / get_user_by_username
#  GOOD: get_user(query: {id?|email?|username?})  ← one tool, self-describing input
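Step 1 can also be scripted so the long tail is flagged automatically. The sketch below assumes the same `tool_calls(tool_name)` table as the query above and uses an in-memory database with made-up call counts; the `long_tail_tools` helper and its 1% default threshold are illustrative.

```python
import sqlite3


def long_tail_tools(conn: sqlite3.Connection, threshold: float = 0.01) -> list[str]:
    """Return tool names whose share of all logged calls is below `threshold`."""
    rows = conn.execute(
        "SELECT tool_name, COUNT(*) AS c FROM tool_calls "
        "GROUP BY tool_name ORDER BY c DESC"
    ).fetchall()
    total = sum(count for _, count in rows)
    if total == 0:
        return []
    return [name for name, count in rows if count / total < threshold]


# Demo on an in-memory database with fabricated example counts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tool_calls (tool_name TEXT)")
calls = [("read_file",)] * 600 + [("search_files",)] * 397 + [("list_matching",)] * 3
conn.executemany("INSERT INTO tool_calls VALUES (?)", calls)

candidates = long_tail_tools(conn)  # tools under 1% of calls
```

The output is a list of candidates for deletion or merging, not a verdict: a rarely called tool may still be the only way to do something important.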

One counter-intuitive trick: dynamic tool surfaces. Cursor and Claude Code expose different tool sets in different modes — plan mode physically hides Write/Edit, so the model never has to distribute selection pressure across read vs. write. You can do the same: switch the registered tool set in your harness by task type — research tasks get web/fetch, coding tasks get read/edit/bash.
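A dynamic tool surface can be as simple as a lookup from task type to the subset of tools that gets registered for that request. The sketch below is a minimal version of that idea; the tool names, the three task types, and the `tools_for` helper are assumptions chosen for illustration, not the registries Cursor or Claude Code actually use.

```python
# Full registry: tool name -> schema (schemas reduced to stubs for brevity).
ALL_TOOLS = {
    name: {"name": name}
    for name in ["read", "grep", "glob", "edit", "write", "bash", "web_search", "web_fetch"]
}

# Which tools are visible in which mode. Plan mode omits every write-capable tool.
SURFACES = {
    "plan": ["read", "grep", "glob"],
    "research": ["read", "web_search", "web_fetch"],
    "coding": ["read", "grep", "glob", "edit", "write", "bash"],
}


def tools_for(task_type: str) -> list[dict]:
    """Return the tool definitions to register for this task type."""
    try:
        names = SURFACES[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
    return [ALL_TOOLS[name] for name in names]


plan_names = [tool["name"] for tool in tools_for("plan")]
```

The harness would pass `tools_for(task_type)` as the `tools` argument of each request, so hidden tools never appear in the prompt at all.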

Failure modes: (1) treating MCP like npm install: attaching every plausibly useful MCP server until half the system prompt is tool definitions and the model is dazed before reading your real question. (2) Assuming the only failure is a wrong pick, and that a tool that wasn't called wasn't needed. The actual collapse is under-calling: tools that should have been called are skipped because the model can't tell them apart, and this silent failure is harder to spot than an error.

Going deeper · Berkeley Function Calling Leaderboard, gorilla.cs.berkeley.edu/leaderboard · Anthropic Writing tools for Claude, docs.claude.com/.../best-practices · Less is More for Long Context Tool Use (2024), arXiv:2411.15399

// 03

Atomic vs Composite: Tool Granularity Determines Reliability

Claim: packaging multiple atomic operations into one macro tool buys determinism — at the cost of agent ceiling.

Background & Principles

Handing an agent edit_file(path, old, new) is a different worldview from handing it refactor_function(path, fn_name, new_impl). The first is an atomic tool — the model composes; the second is a composite / macro tool — a single call performs multi-step business logic. This is the most important and most overlooked trade-off in tool design:

Atomic tools: highly composable, reusable, cover unknown tasks; but the model has to plan multiple steps, leans hard on reasoning, and failure modes diversify.
Composite tools: one call finishes the business flow — high reliability, easy to eval; but the codebase grows, coverage narrows, every new task wants a new tool.

Claude Code chose the atomic-leaning path: Read / Edit / Write / Bash — Unix-philosophy small tools, with composition done by the model. This lets it handle any coding task but demands strong planning capability — which is why it underperforms on weaker models. Traditional RPA goes composite-heavy: a dedicated tool per process (process_invoice / onboard_employee) — reliable but brittle, every new process requires new code.

In real engineering the two should coexist in layers: atomic tools at the bottom for "exploration / debugging / one-off tasks", composite tools on top for "high-frequency / reliability-sensitive / well-understood tasks".

Hands-on Example

Real scenario: an agent cutting a release on a git repo. Atomic vs. composite:

# --- Atomic path: 4 low-level tools, model composes ---
tools = [run_bash, read_file, write_file, git_command]   # tool schemas, assumed defined elsewhere
#  agent must plan: bump version → update changelog → commit → tag → push
#  Pro: if the plan needs to change (run tests before bump), agent adapts. Works on new repos.
#  Con: 1–2 out of 10 runs: wrong order / missing tag / committed to wrong branch

# --- Composite path: one macro tool ---
# (adjacent string literals inside parentheses concatenate into one description)
tools = [{
  "name": "release",
  "description": (
    "Run the full release flow: bump version, regenerate "
    "CHANGELOG, commit with 'chore: release vX.Y.Z', tag, push branch+tag. "
    "Aborts on any failing step. Use when user asks to 'cut a release' or "
    "'publish a new version'. Does NOT publish to npm; call npm_publish after."
  ),
  "input_schema": {"type":"object",
    "properties":{"bump":{"type":"string","enum":["patch","minor","major"],
      "description":"Semver component to increment."}},
    "required":["bump"]}
}]
#  Pro: 100 runs, same flow. Easy to eval.
#  Con: doesn't transfer to a different repo. Agent can't adapt to exceptions.
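What "aborts on any failing step" might look like inside the composite handler is sketched below. This is an assumption about the handler's shape, not the implementation of any real release tool: `run_release` and the step callables are hypothetical, and the demo steps only append to a list instead of touching git.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_release(steps: list[Step]) -> dict:
    """Run steps in order; stop at the first failure and report where it happened."""
    completed: list[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return {"ok": False, "failed_step": name, "completed": completed,
                    "error": f"{type(exc).__name__}: {exc}"}
        completed.append(name)
    return {"ok": True, "completed": completed}


# Demo with stand-in steps (no real git commands are run).
log: list[str] = []


def failing_tag() -> None:
    raise RuntimeError("tag v1.2.3 already exists")


result = run_release([
    ("bump_version", lambda: log.append("bump")),
    ("update_changelog", lambda: log.append("changelog")),
    ("commit", lambda: log.append("commit")),
    ("tag", failing_tag),
    ("push", lambda: log.append("push")),
])
```

Returning the failed step and the steps already completed gives the agent enough context to report the partial state instead of blindly retrying the whole flow.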

Decision tree in practice:

Task runs ≥ 5 times/day with stable steps → write a composite tool, freeze domain knowledge into code.
Task is occasional and varies each time → atomic tools, let the model compose.
Task is on a critical path with no error tolerance (production deploy, payments) → composite + strong schema validation + double-confirm.
Task is exploratory / debug → atomic, maximize flexibility.
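The decision tree above can be written down as a small function, which is handy for design reviews. The sketch is a direct transcription of the four rules with one added assumption: when rules conflict, the critical-path rule is checked first and the exploratory rule second. That ordering is a choice made here, not something the rules themselves specify.

```python
def choose_granularity(runs_per_day: float, steps_stable: bool,
                       critical_path: bool, exploratory: bool) -> str:
    """Map task traits to a tool-granularity recommendation."""
    if critical_path:
        # No error tolerance: freeze the flow and guard it.
        return "composite + schema validation + double-confirm"
    if exploratory:
        # Debugging and exploration need flexible composition.
        return "atomic"
    if runs_per_day >= 5 and steps_stable:
        # Frequent and stable: freeze domain knowledge into code.
        return "composite"
    # Occasional or varying tasks: let the model compose.
    return "atomic"
```

For example, a nightly report that runs 20 times a day with fixed steps maps to "composite", while a one-off migration investigation maps to "atomic".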

Failure modes: (1) "build the macro first" — wrapping 5 steps before the task boundary is clear; every requirement change forces a code change. Run atomic tools 20 times first, find the stable pattern, then freeze. (2) "Composite but 12 parameters" — fill-error rate scales exponentially with required params. Over 5 required params: split it. (3) Atomic granularity too fine — open_file / read_lines / close_file in 1980s C-API style, three calls per operation.

Going deeper · Anthropic Building Effective Agents — tool granularity section, anthropic.com/.../building-effective-agents · MCP server design patterns, modelcontextprotocol.io/.../architecture · ToolLLM (2023) on tool abstraction levels, arXiv:2307.16789

// 04

Parallel Tool Calls: The Most Wasted Free Lunch

Claim: 90% of agents don't actually use parallel tool calls — the latency and cost dividend on the table is huge.

Background & Principles

Since Claude 3.5 Sonnet, Claude can return multiple tool_use blocks in a single assistant turn; OpenAI's GPT-4o / o3 do the same. The harness just needs to recognize multiple tool_use blocks, execute them concurrently, and return all the tool_result blocks together in the next user message. The dividend is massive:

Read 5 files and summarize: serial = 5 × (LLM RTT + tool latency); parallel = 1 × (LLM RTT + max(tool latency)).
Running three independent API lookups (weather + flights + hotels) in parallel cuts wall-clock time by about 2/3 when their latencies are similar.
On expensive models (Opus 4.7) it pays double — each saved turn skips a multi-thousand-token prefill.
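The serial and parallel formulas in the first bullet can be checked with a few lines of arithmetic. The latencies below are made-up round numbers, and like the formulas themselves the sketch ignores the final turn in which the model writes its answer.

```python
def serial_seconds(llm_rtt: float, tool_latencies: list[float]) -> float:
    """One model round trip per tool call: sum of (LLM RTT + tool latency)."""
    return sum(llm_rtt + latency for latency in tool_latencies)


def parallel_seconds(llm_rtt: float, tool_latencies: list[float]) -> float:
    """One model round trip for all calls: LLM RTT + the slowest tool."""
    return llm_rtt + max(tool_latencies)


# Five file reads, assuming 1.5 s per model round trip and 0.25 s per read.
reads = [0.25] * 5
serial_reads = serial_seconds(1.5, reads)      # 5 * 1.75 = 8.75
parallel_reads = parallel_seconds(1.5, reads)  # 1.5 + 0.25 = 1.75

# Three similar lookups, assuming 1.0 s per round trip and 0.5 s per lookup.
lookups = [0.5, 0.5, 0.5]
saving = 1 - parallel_seconds(1.0, lookups) / serial_seconds(1.0, lookups)  # about 2/3
```

The saving grows with the number of independent calls, because the model round trip is paid once instead of once per call.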

But four engineering preconditions actually need to hold; miss one and it silently degrades to serial:

System prompt explicitly encourages it: the model defaults conservative — without "prefer parallel", it'll call sequentially. Claude Code's system prompt literally says "make all of the independent calls in the same response".
Tools must be independent: the model can only parallelize independent work. Parallelizing read + edit is wrong — edit depends on read.
Harness actually concurrent: many harnesses hardcode a for loop over tool_uses. The model emitted them in parallel; the runtime serialized them. Switch to asyncio.gather / a thread pool.
Permission gate is non-blocking: if every tool needs human approval, "parallel" becomes serial dialog. Auto-allow obviously-safe tools.
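The third precondition is the easiest to get wrong when handlers are ordinary blocking functions. The sketch below contrasts a plain loop with `asyncio.gather` over `asyncio.to_thread` (Python 3.9+); `slow_lookup` is a hypothetical blocking handler that just sleeps, and the timings it produces are illustrative.

```python
import asyncio
import time


def slow_lookup(name: str) -> str:
    """Hypothetical blocking tool handler (stands in for an HTTP call or disk read)."""
    time.sleep(0.2)
    return f"{name}: ok"


async def run_serial(names: list[str]) -> list[str]:
    # A plain loop: each call blocks until the previous one has finished.
    return [slow_lookup(name) for name in names]


async def run_concurrent(names: list[str]) -> list[str]:
    # to_thread moves each blocking call onto a worker thread; gather awaits them
    # together and returns results in the same order as the inputs.
    return await asyncio.gather(*(asyncio.to_thread(slow_lookup, n) for n in names))


def timed(coro_fn, names: list[str]) -> tuple[list[str], float]:
    start = time.perf_counter()
    out = asyncio.run(coro_fn(names))
    return list(out), time.perf_counter() - start


names = ["weather", "flights", "hotels"]
serial_out, serial_elapsed = timed(run_serial, names)
concurrent_out, concurrent_elapsed = timed(run_concurrent, names)
```

Both versions return the same results in the same order; only the wall-clock time differs. Handlers that are already coroutines can be passed to `asyncio.gather` directly without the thread hop.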

Hands-on Example

Upgrade the §3 atomic harness to concurrent execution:

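import asyncio, anthropic
client = anthropic.AsyncAnthropic()
# Assumed to be defined elsewhere in the harness:
#   MODEL   - model ID string
#   SCHEMAS - list of tool definitions passed as tools=
#   TOOLS   - dict: tool name -> {"handler_async": async callable taking the input dict}

async def dispatch(block):                              # a single tool, async
    try:
        handler = TOOLS[block.name]["handler_async"]    # unknown name -> KeyError, reported below
        out, failed = await handler(block.input), False
    except Exception as e:
        out, failed = f"ERROR: {type(e).__name__}: {e}", True
    return {"type":"tool_result", "tool_use_id":block.id,
            "content":str(out), "is_error":failed}      # is_error flags the failure to the model

async def agent(task, max_iters=20):
    msgs = [{"role":"user","content":task}]
    sys  = ("You are a careful agent. "
            "IMPORTANT: when multiple tool calls are independent, "     # ← key line
            "emit them in the SAME response so they run in parallel.")
    for _ in range(max_iters):
        r = await client.messages.create(model=MODEL, system=sys,
                tools=SCHEMAS, messages=msgs, max_tokens=4096)
        msgs.append({"role":"assistant", "content":r.content})
        if r.stop_reason != "tool_use": return r        # end_turn, max_tokens, ...: nothing to run
        uses = [b for b in r.content if b.type == "tool_use"]
        results = await asyncio.gather(*[dispatch(b) for b in uses])  # ← concurrent
        msgs.append({"role":"user", "content": results})  # all results in ONE user message
    raise RuntimeError(f"no final answer after {max_iters} iterations")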

Measured: ask the agent "read README.md / package.json / .github/workflows/ci.yml and tell me what kind of project this is" — serial ~9s, parallel ~3.5s, with effectively no accuracy delta. The gap widens with more files.

Failure modes: (1) parallelizing dependent tools — "look up user id, then place order" run in parallel calls place_order with an empty id. Either tell the model in the prompt "must obtain id before order" or split into turns. (2) Concurrent write tools causing race conditions — parallel edit_file(same_path) overwrite each other. Add file-level locks for write tools. (3) Mistaking "multiple tool calls" for parallel — the model has to emit them in the same response; calling tools across multiple turns is sequential.
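File-level locking for write tools, as suggested in failure mode (2), might look like the sketch below. It is a minimal single-process illustration: `PathLocks`, `append_line`, and the in-memory `FILES` dict standing in for the filesystem are all assumptions, and a multi-process harness would need OS-level file locks instead.

```python
import asyncio
from collections import defaultdict


class PathLocks:
    """One asyncio.Lock per file path, created lazily on first use."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def for_path(self, path: str) -> asyncio.Lock:
        return self._locks[path]


FILES: dict[str, str] = {}  # in-memory stand-in for the filesystem


async def append_line(locks: PathLocks, path: str, line: str) -> None:
    """Read-modify-write on one path, serialized by that path's lock."""
    async with locks.for_path(path):
        current = FILES.get(path, "")
        await asyncio.sleep(0)  # yield point where another writer could interleave
        FILES[path] = current + line + "\n"


async def demo() -> str:
    locks = PathLocks()
    FILES["notes.txt"] = ""
    await asyncio.gather(
        *(append_line(locks, "notes.txt", f"edit {i}") for i in range(5))
    )
    return FILES["notes.txt"]


final_text = asyncio.run(demo())
```

Writes to different paths still run concurrently because each path has its own lock; without the lock, writers that read the same starting content can overwrite each other's updates.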

Going deeper · Anthropic Parallel tool use, docs.claude.com/.../parallel-tool-use · OpenAI Parallel function calling, platform.openai.com/.../parallel-function-calling · LangChain Parallel tool calls reference, python.langchain.com/.../tool-calling-parallel

// Putting it together · Anthropic engineering's 7 tool-design rules

Compress everything above into a wall-pinnable checklist. Run it before designing or reviewing a tool set:

Small and orthogonal. Cut to ≤ 10 first. Merge semantic overlaps. Delete anything called < 1% of the time.
Description = use case, not feature. One line for "what it does" + one line for "when to use" + one line for "when not to use". Counter-examples should be specific.
Every parameter gets a description. Units, format, default-inference rule, typical values. 10× more important than tweaking the parameter name.
Enum values are annotated. "fast" is worse than "fast: prioritize latency, accept ±10% accuracy".
Errors must be readable. Handler errors should include "why it failed + what to do next" — the agent reads them to self-correct. "ENOENT" is useless; "File '/x.json' not found. Use list_dir to see available files." is gold.
Idempotent first. Design tools so retry doesn't break things — create_or_update beats create. Agents retry constantly.
Parallel-friendly. Don't artificially couple independent tools in the schema (don't merge read+write into one tool); make handlers async internally; explicitly encourage parallel in the system prompt.

These 7 rules come from Anthropic's Building Effective Agents and Tool Use Best Practices guidance. Turn them into a PR checklist template that every new tool fills in. Your agent project will shed 80% of its silent failures.
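As an illustration of the "errors must be readable" rule, here is a minimal sketch of a file-reading handler that turns low-level exceptions into messages the agent can act on. The handler name `read_file_tool` is hypothetical, and `list_dir` is assumed to be another tool in the same registry, as in the example message above.

```python
import os


def read_file_tool(path: str) -> str:
    """Return file contents, or an error message that says what to do next."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except FileNotFoundError:
        parent = os.path.dirname(path) or "."
        return (f"ERROR: File '{path}' not found. "
                f"Use list_dir on '{parent}' to see available files.")
    except IsADirectoryError:
        return (f"ERROR: '{path}' is a directory, not a file. "
                f"Use list_dir to inspect it, or pass a file path.")


message = read_file_tool("/definitely/missing/x.json")
```

The message names the failed input, the reason, and a concrete next tool call, which is what lets the agent self-correct on the following turn.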

// Deep Thinking

Description matters 10× more than name — but how long should a description be? One sentence vs. one paragraph, how big is the gap?

Measured: 50–150 characters significantly outperform < 20 or > 300. Too short and the model doesn't know when to use it; too long and it pollutes context and eats prompt budget. Best shape: one sentence on function + one on when to use + one on when NOT to use. Example: "Search docs by semantic similarity. Use when user asks find/similar. Do NOT use for exact strings; use grep_tool." Anthropic's official tool-use guide points the same way on structure: it recommends detailed descriptions of at least 3–4 sentences per tool, more if the tool is complex.

Past 15 tools, accuracy drops — is it diluted attention or descriptions interfering with each other?

Mostly the latter. Tool selection is zero-shot classification, and similar tools confuse the model ("search_docs" vs. "search_web" vs. "search_code"). Anthropic's internal numbers: adding tool count degrades faster than adding description length. Fixes: (1) tool grouping (pick group, then tool); (2) differentiated descriptions ("unlike X tool, this one…"); (3) past ~20 tools, switch to on-demand MCP attach.
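Fix (1), tool grouping, can be sketched as a two-stage surface: the model first sees a single selector tool listing the groups, then only the tools of the group it picked. Everything in the sketch is illustrative: the group names, the `select_tool_group` tool, and both helper functions are assumptions rather than a documented pattern from any vendor.

```python
GROUPS = {
    "search": {
        "summary": "find information in docs, on the web, or in code",
        "tools": ["search_docs", "search_web", "search_code"],
    },
    "files": {
        "summary": "read or modify files in the workspace",
        "tools": ["read_file", "edit_file"],
    },
}

# Full registry: tool name -> schema (stubs for brevity).
REGISTRY = {name: {"name": name} for g in GROUPS.values() for name in g["tools"]}


def group_selector(groups: dict) -> dict:
    """Stage 1: one tool whose enum lists the groups, each with a usage note."""
    notes = "\n".join(f"- {name}: {g['summary']}" for name, g in groups.items())
    return {
        "name": "select_tool_group",
        "description": "Pick the tool group needed for the next step.\n" + notes,
        "input_schema": {
            "type": "object",
            "properties": {"group": {"type": "string", "enum": list(groups),
                                     "description": "Name of the group to load."}},
            "required": ["group"],
        },
    }


def tools_in_group(groups: dict, group: str, registry: dict) -> list[dict]:
    """Stage 2: expose only the chosen group's tool definitions."""
    return [registry[name] for name in groups[group]["tools"]]


stage_one = [group_selector(GROUPS)]
stage_two = tools_in_group(GROUPS, "search", REGISTRY)
```

The trade-off is one extra model turn per task in exchange for a much smaller set of near-duplicate descriptions competing at each decision.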

Atomic tools make agents flexible but error-prone; composite tools make them reliable but limited. Which should a production agent use?

Depends on task type. Customer support / data lookup wants composite ("create_ticket" beats "auth + lookup_user + create + notify" as a sequence). Development / exploration wants atomic (Claude Code doesn't bundle read+grep+edit because debugging needs flexible composition). Heuristic: is the task's step path enumerable? Yes → composite. No → atomic.

Parallel tool calls are a free lunch. When do they actually slow the agent down?

Three scenarios: (1) tools with dependencies (A's output feeds B), where parallel makes B see null; (2) shared-resource contention (concurrent writes to the same file); (3) the model misjudges and fires speculative calls whose results it never uses, wasting all of them. In Anthropic's API, parallel tool use is on by default (it can be switched off with disable_parallel_tool_use), but models still tend toward conservative serial calling, so production needs "when independent, call in parallel" in the system prompt to actually use it.

Field names in a tool schema are shorter than descriptions, and "name doesn't matter" is conventional wisdom. When does the name actually matter?

The name doesn't matter at tool-selection time (descriptions drive that), but it matters at argument-generation time — field names are the model's prior for value style. "query" vs. "search_string": the former biases toward natural language; the latter toward keyword combos. Production combo: short unambiguous names + thorough descriptions.

// Further Reading

Anthropic · How to implement tool use (incl. best practices) — the most authoritative official tool-design guide
Anthropic · Building Effective Agents — tool granularity and agent design
Berkeley Function Calling Leaderboard — public function-calling benchmark
OpenAI · Function Calling Guide — OpenAI's take on schema design
MCP · Concepts & Architecture — tool / resource / prompt three-layer abstraction