A tool's description determines your agent's ceiling more than model choice.
Day 3 framed the harness as the agent's OS. Tools run on top of that OS, and what those tools look like is decided by the tool schema. A fact almost everyone underrates: the few lines of English you write in description influence final success rate more than swapping the model. Anthropic's internal SWE-bench ablations repeatedly confirm this: with the same Sonnet model, rewriting the descriptions of just 6 tools shifts pass@1 by 10+ points. The tool schema is part of the prompt: it goes into the same KV cache as the system message, participates in attention, and conditions every downstream token. This issue covers four things: writing schemas as prompts, why 20 tools usually underperform 6, the real atomic-vs-composite trade-off, and why parallel tool calls are the most wasted capability in agent design. It closes with Anthropic's "7 tool-design rules" checklist.
Whether the model picks the right tool depends little on its name; it depends on the wording in description and every field in input_schema.
How does a tool definition actually enter the model? Anthropic's Tool Use Overview spells it out: every registered tool is serialized into a structured text block appended to the system prompt. From the model's perspective, tools=[...] is not a "function registry"; it is a document the model must read and understand. From that document, the model must decide: when to call, which one to call, what arguments, and when not to call.
That explains several frequently observed effects:
- Expanding a tool's description from one line to five lifts accuracy materially: you're handing the model not just a name, but use cases.
- A parameter description (not its name) clarifying units, formats, and edges helps more than renaming start_date → start_date_iso8601. The model reads descriptions; it doesn't really "read" snake_case names.
- Enum values that explain themselves beat a bare ["fast","accurate","creative"] by 2–3× in selection accuracy.
Mechanism: once the tools block is in the KV cache, every generated token attends back to that text. The more specific and scenario-grounded the description, the more conditioning the model has at the two key decisions: "should I call?" and "which one?". This isn't magic; it is simply how attention works.
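To make the "tools are a document" idea concrete, here is a minimal sketch that renders a tool list into one text block. The render_tools function and its output layout are hypothetical and only illustrate the principle; the actual serialization Anthropic uses is not described in this article.

```python
import json

def render_tools(tools: list[dict]) -> str:
    """Render tool definitions as one text block (hypothetical layout, illustration only)."""
    parts = []
    for tool in tools:
        parts.append(
            f"Tool: {tool['name']}\n"
            f"Description: {tool.get('description', '')}\n"
            f"Input schema: {json.dumps(tool.get('input_schema', {}), sort_keys=True)}"
        )
    return "\n\n".join(parts)

# Same tool name, two descriptions: everything beyond the name comes from the text you write.
terse = [{"name": "get_weather", "description": "Get weather"}]
detailed = [{
    "name": "get_weather",
    "description": "Look up the CURRENT weather for a single city. "
                   "DO NOT use for forecasts or historical weather.",
}]

print(render_tools(terse))
print()
print(render_tools(detailed))
```

Printing both versions side by side shows how little the model has to go on when the description is two words long.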
Same weather tool, two ways of writing it — and a wildly different hit rate:
# --- BAD: schema treated as a function signature ---
{
    "name": "get_weather",
    "description": "Get weather",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["c", "f"]}
        }
    }
}
# --- GOOD: schema treated as prompt ---
# Written as a Python dict literal: adjacent string literals are concatenated.
{
    "name": "get_weather",
    "description": (
        "Look up the CURRENT weather (now ± 1h) for a single city. "
        "Use ONLY for present-time questions like 'is it raining in Tokyo now'. "
        "DO NOT use for: forecasts >24h ahead (use get_forecast), historical "
        "weather (use get_weather_history), or air quality (use get_aqi)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": (
                    "City name in English, optionally followed by "
                    "country, e.g. 'Tokyo', 'Paris, FR'. Do not pass GPS coords."
                ),
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": (
                    "Temperature unit. Default to celsius unless the user "
                    "explicitly mentions °F or asks in a US locale context."
                ),
            },
        },
        "required": ["location"],
    },
}
Three upgrades that matter: (1) description states use cases + counter-examples (DO NOT use for…) — the cheapest fix for cross-tool mis-selection; (2) every parameter has its own description with format, constraints, and default inference rules; (3) enum values are full names (celsius not c) — readability = model comprehension. Treat these three as a checklist and run every tool through it.
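The three-point checklist can be run mechanically. Below is a minimal sketch of such a linter; lint_tool is a hypothetical helper, and its thresholds (15 words for a description, 2 characters for an "abbreviated" enum value) are arbitrary assumptions chosen for illustration, not values from Anthropic's guidance.

```python
def lint_tool(tool: dict) -> list[str]:
    """Return checklist violations for one tool definition (heuristic, assumed thresholds)."""
    problems = []
    desc = tool.get("description", "")
    # (1) use cases + counter-examples
    if len(desc.split()) < 15:
        problems.append("description too short to state use cases")
    if "do not use" not in desc.lower():
        problems.append("description has no counter-examples (DO NOT use for ...)")
    # (2) per-parameter descriptions, (3) full-name enum values
    props = tool.get("input_schema", {}).get("properties", {})
    for name, spec in props.items():
        if not spec.get("description"):
            problems.append(f"parameter '{name}' has no description")
        short = [v for v in spec.get("enum", []) if isinstance(v, str) and len(v) <= 2]
        if short:
            problems.append(f"parameter '{name}' has abbreviated enum values: {short}")
    return problems

bad = {
    "name": "get_weather",
    "description": "Get weather",
    "input_schema": {"type": "object", "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["c", "f"]},
    }},
}

for problem in lint_tool(bad):
    print(problem)
```

A string match on "do not use" is a crude proxy for counter-examples; in review, a human should still read each description.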
Pitfalls: (1) Writing only "description": "do X", so the model doesn't know when not to call and over-calls on edge cases. (2) Encoding semantics in the tool name (get_user_v2_by_email_only), when the model reads description, not your snake_case tokens. (3) Omitting all parameter descriptions, so the model has to guess format from type and falls flat on dates / paths / IDs.
A commonly ignored engineering fact: accuracy is not monotonic in tool count; the relationship is an inverted U. Zero tools: nothing to do. 3–7 orthogonal tools: peak agent performance. 20+ tools: the model gets confused, mis-selects, skips the right call, and uses wrong arguments. The Berkeley Function Calling Leaderboard (BFCL) multi-turn scenarios show this curve, and Anthropic's Building Effective Agents makes the underlying point explicitly: tool definitions deserve as much prompt-engineering attention as your main prompt.
Why the collapse? Three real reasons:
- Near-duplicate tools compete for the same intent (search_docs / find_in_docs / lookup_documentation); the model oscillates between them.
That's also why Claude Code's core tool registry has only ~10 atomic tools (Read / Edit / Write / Bash / Grep / Glob / Task / WebFetch / WebSearch / TodoWrite); anything else attaches on demand via MCP, never resident.
If your MCP / agent has already stacked 30+ tools, apply the 3-step "tool diet":
# Step 1: rank by call frequency, look at the long tail
sqlite3 agent.db "SELECT tool_name, COUNT(*) c FROM tool_calls
GROUP BY tool_name ORDER BY c DESC;"
# Long-tail tools (< 1% of calls) can almost always be deleted or merged
# Step 2: find semantic twins and merge
# search_files / find_files / list_matching → merge into search_files(pattern, mode)
# Step 3: split by scenario, not by API
# BAD: get_user_by_id / get_user_by_email / get_user_by_username
# GOOD: get_user(query: {id?|email?|username?}) ← one tool, self-describing input
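Step 3's merged get_user tool can be sketched as follows. The field descriptions and the pick_lookup helper are hypothetical; the sketch assumes the "exactly one key" rule is enforced in the handler rather than by the schema itself.

```python
GET_USER_SCHEMA = {
    "name": "get_user",
    "description": (
        "Fetch one user record. Pass exactly ONE of id, email, or username "
        "inside query. DO NOT use for searching or listing users."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "object",
                "description": "Exactly one lookup key.",
                "properties": {
                    "id": {"type": "string", "description": "Internal user ID."},
                    "email": {"type": "string", "description": "Full email address."},
                    "username": {"type": "string", "description": "Login name, without '@'."},
                },
            }
        },
        "required": ["query"],
    },
}

LOOKUP_FIELDS = ("id", "email", "username")

def pick_lookup(query: dict) -> tuple[str, str]:
    """Return the single (field, value) pair in query, or raise with an actionable message."""
    given = [(field, query[field]) for field in LOOKUP_FIELDS if query.get(field)]
    if len(given) != 1:
        raise ValueError(
            f"Pass exactly one of {', '.join(LOOKUP_FIELDS)}; got {len(given)}."
        )
    return given[0]

print(pick_lookup({"email": "ada@example.com"}))
```

The handler would call pick_lookup first and return the ValueError text as the tool result, so a malformed call tells the model how to retry.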
One counter-intuitive trick: dynamic tool surfaces. Cursor and Claude Code expose different tool sets in different modes — plan mode physically hides Write/Edit, so the model never has to distribute selection pressure across read vs. write. You can do the same: switch the registered tool set in your harness by task type — research tasks get web/fetch, coding tasks get read/edit/bash.
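A dynamic tool surface can be as small as a lookup table in the harness. The sketch below is one possible shape; the mode names, the tool names, and the tools_for helper are all assumptions for illustration, not how Cursor or Claude Code implement it.

```python
# Hypothetical registry: in a real harness each value would be a full tool schema.
ALL_TOOLS = {
    name: {"name": name}
    for name in ("read_file", "grep", "edit_file", "write_file",
                 "run_bash", "web_fetch", "web_search")
}

# Which tools are registered in each mode. Plan mode has no write tools at all.
TOOLSETS = {
    "plan": ("read_file", "grep"),
    "research": ("web_search", "web_fetch", "read_file"),
    "coding": ("read_file", "grep", "edit_file", "write_file", "run_bash"),
}

def tools_for(mode: str) -> list[dict]:
    """Return the tool schemas to register for this mode."""
    if mode not in TOOLSETS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(TOOLSETS)}")
    return [ALL_TOOLS[name] for name in TOOLSETS[mode]]

print([tool["name"] for tool in tools_for("plan")])
```

The harness would pass tools_for(mode) as the tools argument of each request, so a hidden tool never appears in the prompt.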
Pitfalls: (1) Treating MCP servers like npm install: attaching every plausibly useful MCP server until half the system prompt is tool definitions and the model is dazed before reading your real question. (2) Assuming the only failure mode is picking the wrong tool: the actual collapse is under-calling tools that should have been called, because the model can't distinguish them; this silent failure is harder to spot than an error.
Handing an agent edit_file(path, old, new) is a different worldview from handing it refactor_function(path, fn_name, new_impl). The first is an atomic tool — the model composes; the second is a composite / macro tool — a single call performs multi-step business logic. This is the most important and most overlooked trade-off in tool design:
Claude Code chose the atomic-leaning path: Read / Edit / Write / Bash — Unix-philosophy small tools, with composition done by the model. This lets it handle any coding task but demands strong planning capability — which is why it underperforms on weaker models. Traditional RPA goes composite-heavy: a dedicated tool per process (process_invoice / onboard_employee) — reliable but brittle, every new process requires new code.
In real engineering the two should coexist in layers: atomic tools at the bottom for "exploration / debugging / one-off tasks", composite tools on top for "high-frequency / reliability-sensitive / well-understood tasks".
Real scenario: an agent cutting a release on a git repo. Atomic vs. composite:
# --- Atomic path: 4 low-level tools, model composes ---
tools = ["run_bash", "read_file", "write_file", "git_command"]  # tool names only; schemas omitted
# agent must plan: bump version → update changelog → commit → tag → push
# Pro: if the plan needs to change (run tests before bump), agent adapts. Works on new repos.
# Con: 1–2 out of 10 runs: wrong order / missing tag / committed to wrong branch

# --- Composite path: one macro tool ---
tools = [{
    "name": "release",
    "description": (
        "Run the full release flow: bump version, regenerate "
        "CHANGELOG, commit with 'chore: release vX.Y.Z', tag, push branch+tag. "
        "Aborts on any failing step. Use when user asks to 'cut a release' or "
        "'publish a new version'. Does NOT publish to npm; call npm_publish after."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"bump": {"type": "string", "enum": ["patch", "minor", "major"]}},
        "required": ["bump"],
    },
}]
# Pro: 100 runs, same flow. Easy to eval.
# Con: doesn't transfer to a different repo. Agent can't adapt to exceptions.
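On the handler side, "aborts on any failing step" could look like the sketch below. run_composite and the stand-in step functions are hypothetical; a real release handler would shell out to git and the changelog tooling instead of appending to a list.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]

def run_composite(steps: list[Step]) -> str:
    """Run steps in order; stop at the first failure and report which steps completed."""
    done = []
    for name, action in steps:
        try:
            action()
        except Exception as e:
            return (f"ABORTED at '{name}': {type(e).__name__}: {e}. "
                    f"Completed: {', '.join(done) or 'none'}.")
        done.append(name)
    return f"OK: {', '.join(done)}"

def failing_tag() -> None:
    # Stand-in for a git tag call that fails.
    raise RuntimeError("tag v1.2.3 already exists")

log: list[str] = []
steps = [
    ("bump", lambda: log.append("bump")),
    ("changelog", lambda: log.append("changelog")),
    ("tag", failing_tag),
    ("push", lambda: log.append("push")),
]

print(run_composite(steps))
# ABORTED at 'tag': RuntimeError: tag v1.2.3 already exists. Completed: bump, changelog.
```

Returning the list of completed steps matters: the agent cannot adapt inside the macro, so the result has to tell it exactly where the repo was left.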
Decision tree in practice:
Pitfall: over-atomizing, e.g. splitting one file read into open_file / read_lines / close_file in 1980s C-API style, three calls per operation.
Since Sonnet 3.5, Claude can return multiple tool_use blocks in a single assistant turn; OpenAI's GPT-4o / o3 do the same. The harness just needs to recognize multiple tool_uses, execute them concurrently, and concatenate the tool_results into the next turn. The dividend is massive:
But four engineering preconditions actually need to hold; miss one and it silently degrades to serial:
- The harness must actually execute the calls concurrently, e.g. with asyncio.gather / a thread pool.
Upgrade the §3 atomic harness to concurrent execution:
import asyncio
import anthropic

# MODEL, TOOLS (name -> {"handler_async": ...}) and SCHEMAS are assumed to come from the §3 harness.
client = anthropic.AsyncAnthropic()

async def dispatch(block):
    """Run a single tool call and wrap the outcome as a tool_result block."""
    handler = TOOLS[block.name]["handler_async"]
    try:
        out = await handler(block.input)
    except Exception as e:  # report the failure to the model instead of crashing the loop
        out = f"ERROR: {type(e).__name__}: {e}"
    return {"type": "tool_result", "tool_use_id": block.id, "content": str(out)}

async def agent(task, max_iters=20):
    msgs = [{"role": "user", "content": task}]
    system_prompt = (
        "You are a careful agent. "
        "IMPORTANT: when multiple tool calls are independent, "  # ← key line
        "emit them in the SAME response so they run in parallel."
    )
    for _ in range(max_iters):
        r = await client.messages.create(
            model=MODEL, system=system_prompt, tools=SCHEMAS,
            messages=msgs, max_tokens=4096,
        )
        msgs.append({"role": "assistant", "content": r.content})
        uses = [b for b in r.content if b.type == "tool_use"]
        if r.stop_reason != "tool_use" or not uses:
            return r  # end_turn, max_tokens, etc.: nothing left to execute
        results = await asyncio.gather(*(dispatch(b) for b in uses))  # ← concurrent
        msgs.append({"role": "user", "content": results})
    raise RuntimeError(f"agent did not finish within {max_iters} iterations")
Measured: ask the agent "read README.md / package.json / .github/workflows/ci.yml and tell me what kind of project this is" — serial ~9s, parallel ~3.5s, with effectively no accuracy delta. The gap widens with more files.
(2) Concurrent writes: two parallel edit_file(same_path) calls overwrite each other. Add file-level locks for write tools. (3) Mistaking "multiple tool calls" for parallel: the model has to emit them in the same response; calling tools across multiple turns is sequential.
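File-level locking for write tools can be sketched with one asyncio.Lock per path. The LockedFiles class below is a hypothetical in-memory stand-in for a real edit_file handler; asyncio.sleep(0) simulates the point where real file I/O would yield to other tasks.

```python
import asyncio
from collections import defaultdict

class LockedFiles:
    """In-memory files with one lock per path (illustrative stand-in for disk I/O)."""

    def __init__(self, files: dict):
        self.files = dict(files)
        self._locks = defaultdict(asyncio.Lock)  # created lazily, one per path

    async def edit(self, path: str, old: str, new: str) -> str:
        async with self._locks[path]:  # serialize edits to the same path only
            content = self.files[path]
            await asyncio.sleep(0)  # yield, as real I/O would
            if old not in content:
                return f"ERROR: '{old}' not found in {path}"
            self.files[path] = content.replace(old, new, 1)
            return "ok"

async def demo() -> str:
    store = LockedFiles({"app.py": "a = 1\nb = 2\n"})
    # Two edits to the same file issued in one parallel batch.
    await asyncio.gather(
        store.edit("app.py", "a = 1", "a = 10"),
        store.edit("app.py", "b = 2", "b = 20"),
    )
    return store.files["app.py"]

print(asyncio.run(demo()))
```

Without the lock, both edits would read the original content and the second write would silently discard the first. Edits to different paths still run concurrently because each path has its own lock.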
Compress everything above into a wall-pinnable checklist. Run it before designing or reviewing a tool set:
- A bare enum value like "fast" is worse than "fast: prioritize latency, accept ±10% accuracy".
- As an error message, "ENOENT" is useless; "File '/x.json' not found. Use list_dir to see available files." is gold.
- An idempotent create_or_update beats create. Agents retry constantly.
These 7 rules come from Anthropic's Building Effective Agents and Tool Use Best Practices guidance. Make it a PR checklist template: every new tool fills one in. Your agent project will shed 80% of its silent failures.
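The error-message rule is the easiest one to apply in code. A minimal sketch, assuming a hypothetical read_file_tool handler and the list_dir tool named in the example above:

```python
import os

def read_file_tool(path: str) -> str:
    """Read a text file; on a missing path, return an error that names a next step."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        parent = os.path.dirname(path) or "."
        # Tell the model what went wrong AND which tool call would help it recover.
        return (f"ERROR: File '{path}' not found. "
                f"Use list_dir on '{parent}' to see available files.")

print(read_file_tool("/no/such/dir/x.json"))
```

The handler returns the message as the tool result rather than raising, so the model sees the recovery hint in its next turn.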