DAY 04 / PHASE 1 · ENGINEERING

Tool Use & Function Calling

Schema 即 Prompt · 工具粒度 · Selection 退化 · Parallel Calls

2026-05-23 · BigCat

Tool 的 description 比模型选择更决定你的 agent 上限。

前置概念 → ai-ml-daily Day 6: Tool Use 概念（Function Calling, MCP 协议）

// WHY THIS MATTERS

Day 3 我们说 harness 是 agent 的 OS。OS 之上跑的是工具，而工具的样子由 tool schema 决定。一条几乎所有人都低估的事实：你写在 description 里的那几行英文，对最终成功率的影响，比换模型还大。Anthropic 在 SWE-bench 的内部消融里反复证实——同一个 Sonnet，仅重写 6 个工具的 description，pass@1 能差 10+ 分。tool schema 就是 prompt 的一部分，它和 system message 一样进 KV cache、参与 attention、影响下游每一个 token 的分布。这一期讲四件事：怎么把 schema 当 prompt 写、为什么 20 个工具往往不如 6 个、原子 vs 组合工具的真实 trade-off、以及 parallel tool calls 这个被严重浪费的能力。最后给出 Anthropic 工程团队的「7 条 tool design 经验」清单。

// 01

Tool Schema 即 Prompt：description 比 name 重要 10 倍

论断：模型选不选你的工具，看的不是 name，是 description 和 input_schema 里每个字段的措辞。

背景与原理

tool definition 最终是怎么进入模型的？Anthropic 在 Tool Use Overview 文档里讲过：所有注册的 tool 会被序列化成一段结构化文本，拼到 system prompt 末尾。也就是说，从模型视角看，tools=[...] 不是「函数注册表」，而是一段它必须读懂的文档。它要靠这段文档决定：什么时候调？调哪个？参数怎么填？什么时候不该调？

这就解释了几个被反复观察到的现象：

把 description 从 1 行扩到 5 行，准确率明显提升——因为你给的不只是名字，是使用场景。
参数的 description（注意不是参数的名字）写清楚单位、格式、边界，比改名字 start_date → start_date_iso8601 有用得多。模型读 description，不太「读」名字。
给 enum 字段加每个值的「什么时候选这个」说明，比单纯列出 ["fast","accurate","creative"] 命中率高 2-3 倍。

底层机制：tools block 进 KV cache 后，每个 token 生成都会 attend 这段文本。description 越具体、越场景化，模型在「要不要调」「调哪个」这两个决策点上得到的 conditioning 越强。这不是玄学，是 attention 的物理事实。

实战示例

同一个查天气工具，两种写法的命中率差距：

# —— BAD：把 schema 当函数签名 ——
{
  "name": "get_weather",
  "description": "Get weather",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {"type":"string"},
      "unit":     {"type":"string", "enum":["c","f"]}
    }
  }
}

# —— GOOD：把 schema 当 prompt ——
{
  "name": "get_weather",
  "description": "Look up the CURRENT weather (now ± 1h) for a single city.
    Use ONLY for present-time questions like 'is it raining in Tokyo now'.
    DO NOT use for: forecasts >24h ahead (use get_forecast), historical
    weather (use get_weather_history), or air quality (use get_aqi).",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City name in English, optionally followed by
          country, e.g. 'Tokyo', 'Paris, FR'. Do not pass GPS coords."
      },
      "unit": {
        "type": "string", "enum": ["celsius","fahrenheit"],
        "description": "Temperature unit. Default to celsius unless the user
          explicitly mentions °F or asks in a US locale context."
      }
    },
    "required": ["location"]
  }
}

三处关键升级：（1）description 写明使用场景 + 反例（DO NOT use for…），这是降低跨工具误选的最便宜手段；（2）每个参数自己的 description 写清格式、约束、默认推断规则；（3）enum 值用全称（celsius 而非 c），可读性 = 模型理解度。把这三条当 checklist，每个工具都过一遍。

失败模式：（1）只写 "description": "do X"——模型不知道什么时候不该调，于是会在边界场景过度调用。（2）依赖工具名表达语义（get_user_v2_by_email_only）——模型看 description，不看 snake_case 名字的 token 拆分。（3）参数 description 全省略——模型只能从 type 推格式，遇到日期 / 路径 / ID 这种就靠猜。

进阶资源 · Anthropic Tool Use Overview, docs.claude.com/.../tool-use · Anthropic How to implement tool use, docs.claude.com/.../implement-tool-use · OpenAI Function Calling Guide, platform.openai.com/.../function-calling

// 02

Tool 越多越笨：Selection 退化的真实曲线

论断：超过 ~15 个 tool 后，模型的工具选择准确率会显著下降；先做减法，再考虑加法。

背景与原理

这是一个常被忽视的工程事实：tool 数量和准确率不是单调关系，而是一条倒 U 曲线。给 0 个工具，没事干；给 3-7 个正交工具，agent 表现最好；给 20+ 个工具，模型开始混淆、误选、漏选、用错参数。Berkeley Function Calling Leaderboard（BFCL）的多轮场景里能看到这条曲线，Anthropic Building Effective Agents 里也明说：「Tool definitions deserve as much prompt engineering attention as your main prompt.」

为什么会退化？三个真实原因：

Token budget：每个 tool definition 大概 100-300 token。20 个工具就 4-6K token 在 system prompt，挤压 context 还污染 attention。
语义混淆：tool 数量越多，语义近似的工具必然出现（search_docs / find_in_docs / lookup_documentation），模型在它们之间摇摆。
Decision path 爆炸：N 个工具有 N 种「不调用」的反例要 attention 处理。模型容量分散，每条决策的可靠性都下降。

这也是为什么 Claude Code 的核心 tool registry 只有 ~10 个原子工具（Read / Edit / Write / Bash / Grep / Glob / Task / WebFetch / WebSearch / TodoWrite）——其余能力通过 MCP 按需挂载，而不是常驻。

Accuracy │ 1.0 │ ╭───────╮ │ ╱ ╲ 0.8 │ ╱ ╲___ │ ╱ ╲___ 0.6 │ ╱ ╲___ │ ╱ ╲___ 0.4 │╱ ╲____ │ ╲___ 0.2 │ └──────────────────────────────────────────▶ # tools 3 7 12 20 35 60 100 示意曲线（基于 BFCL multi-turn 与多个 agent eval 的常见形态）甜区在 5–12 之间；超过 20 后掉得很快。

实战示例

当你的 MCP / agent 已经堆了 30+ 工具，应用 3 步「工具减肥法」：

# Step 1：按调用频率排序，看长尾
sqlite3 agent.db "SELECT tool_name, COUNT(*) c FROM tool_calls 
  GROUP BY tool_name ORDER BY c DESC;"
#  长尾里的 tool（< 1% 调用）几乎都可以删掉或合并

# Step 2：找「语义双胞胎」并合并
#  search_files / find_files / list_matching → 合并为 search_files(pattern, mode)

# Step 3：按场景而非按 API 切分
#  BAD：get_user_by_id / get_user_by_email / get_user_by_username
#  GOOD：get_user(query: {id?|email?|username?})  ← 一个工具，input 自描述

另一个反直觉技巧：「dynamic tool surface」。Cursor / Claude Code 在不同 mode 下暴露不同工具集——plan mode 物理隐藏 Write/Edit，让模型不需要在「读」「写」两类工具间分散选择压力。你也可以在自己的 harness 里按 task type 切换 tool registry：研究任务给 web / fetch，写代码任务给 read / edit / bash。

失败模式：（1）把 MCP 当 npm install——把所有看起来有用的 MCP server 都挂上，结果 system prompt 半数被 tool definition 占据，model 没看你的真实问题就先迷糊了。（2）以为模型「不会选错就不调用」——实际表现是少调用了该调用的工具，因为它判断不清；这种 silent failure 比报错更难发现。

进阶资源 · Berkeley Function Calling Leaderboard, gorilla.cs.berkeley.edu/leaderboard · Anthropic Writing tools for Claude, docs.claude.com/.../best-practices · 论文 Less is More for Long Context Tool Use（2024）, arXiv:2411.15399

// 03

原子 vs 组合：工具粒度决定 reliability

论断：把多步原子操作打包成一个 macro tool，能换来确定性，但代价是 agent 能力上限。

背景与原理

给 agent 一个 edit_file(path, old, new) 和给它一个 refactor_function(path, fn_name, new_impl)，是两种世界观。前者是原子工具（atomic）——模型自己组合；后者是组合工具（composite / macro）——一个调用完成多步业务逻辑。这是 tool design 里最重要、最被忽略的 trade-off：

原子工具：组合性强、可复用、覆盖未知任务；但要求模型自己规划多步，对推理能力依赖高，错误模式更多样。
组合工具：单次调用就完成业务流，可靠性高、易于 eval；但代码量大、覆盖窄、新任务需要新工具。

Claude Code 选了偏原子的路线：Read / Edit / Write / Bash 这种 Unix 哲学的小工具，组合性靠模型。这让它能处理任意 coding 任务，但要求模型有非常强的规划能力——也是为什么它在弱模型上效果一般。相反，传统 RPA 工具走偏组合路线：每个流程一个专用 tool（process_invoice / onboard_employee），可靠但脆，遇到新流程就要写代码。

真实工程中，两种工具应该分层共存：底层原子工具暴露给「探索 / 调试 / 一次性任务」，上层组合工具暴露给「高频 / 可靠性敏感 / 已模式化的任务」。

实战示例

一个真实场景：让 agent 给 git repo 做 release。原子路线 vs 组合路线：

# —— 原子路线：4 个底层工具，模型自己组合 ——
tools = [run_bash, read_file, write_file, git_command]
#  agent 必须自己规划：bump version → update changelog → commit → tag → push
#  优点：万一规划要变（先跑测试再 bump）也能自己改；新 repo 直接用
#  缺点：跑 10 次有 1-2 次顺序错 / 漏 tag / 提交到错分支

# —— 组合路线：1 个 macro tool ——
tools = [{
  "name": "release",
  "description": "Run the full release flow: bump version, regenerate
    CHANGELOG, commit with 'chore: release vX.Y.Z', tag, push branch+tag.
    Aborts on any failing step. Use when user asks to 'cut a release' or
    'publish a new version'. Does NOT publish to npm—call npm_publish after.",
  "input_schema": {"type":"object",
    "properties":{"bump":{"type":"string","enum":["patch","minor","major"]}},
    "required":["bump"]}
}]
#  优点：跑 100 次都按同一个 flow，eval 简单
#  缺点：换个 repo 流程不一样就废了；agent 不能针对异常情况微调

实战决策树：

任务每天跑 ≥ 5 次，且步骤稳定 → 写组合工具，把领域知识固化进代码。
任务偶尔做、每次都不太一样 → 用原子工具，让模型组合。
任务关键路径 + 不能出错（生产部署、付款）→ 组合工具 + 强 schema 校验 + 二次确认。
任务探索性 / 调试 → 原子工具，最大化灵活性。

失败模式：（1）「先写 macro」——还没搞清楚任务边界就把 5 步打包，每次需求变就改代码。先用原子工具跑 20 次找出稳定 pattern，再固化成 macro。（2）「macro 但参数太多」——一个 tool 收 12 个参数，模型填错率指数上升。超过 5 个 required 参数就该拆。（3）原子工具粒度太细——open_file / read_lines / close_file 这种 1980 年代 C API，每次操作要 3 个 tool call。

进阶资源 · Anthropic Building Effective Agents — tool granularity 段落, anthropic.com/.../building-effective-agents · MCP server design patterns, modelcontextprotocol.io/.../architecture · 论文 ToolLLM（2023）讨论 tool 抽象层级, arXiv:2307.16789

// 04

Parallel Tool Calls：被严重浪费的免费午餐

论断：90% 的 agent 没真正用上 parallel tool call —— 拿到的延迟和成本红利是巨大的。

背景与原理

从 Sonnet 3.5 开始，Claude 在一次 assistant turn 里可以返回多个 tool_use block；OpenAI 的 GPT-4o / o3 也支持类似能力。harness 只需要识别多个 tool_use 后并发执行、把多个 tool_result 拼回下一轮就行。这件事的红利是巨大的：

读 5 个文件总结，串行要 5 × (LLM RTT + tool latency)，并行只要 1 × (LLM RTT + max(tool latency))。
跨 3 个 API 查信息（天气 + 航班 + 酒店），并行直接砍 2/3 的 wall-clock。
对昂贵模型（Opus 4.7）尤其值——每省一轮就是一次几千 token 的 prefill 不重复。

但能让它「真的」并行起来，有四个工程前提，缺一个就退化回串行：

System prompt 显式鼓励：模型默认偏保守，不写「prefer parallel」它常常一个一个调。Claude Code 的 system prompt 里就有「make all of the independent calls in the same response」。
Tool 之间无序无依赖：模型只能并行独立任务。让它并行 read + edit 是错的——edit 依赖 read 的结果。
Harness 真的并发：很多人 harness 写死了 for loop 顺序执行 tool_use，名义上模型并发了，实际还是串行。改成 asyncio.gather / 线程池。
Permission gate 不阻塞：如果每个 tool 都要人工 approve，并发就变成串行 dialog。给「明显安全」的工具自动放行。

实战示例

把 §3 的原子 harness 升级成并发执行：

import asyncio, anthropic
client = anthropic.AsyncAnthropic()

async def dispatch(block):                              # 单个 tool 异步执行
    handler = TOOLS[block.name]["handler_async"]
    try:
        out = await handler(block.input)
    except Exception as e:
        out = f"ERROR: {type(e).__name__}: {e}"
    return {"type":"tool_result", "tool_use_id":block.id, "content":str(out)}

async def agent(task, max_iters=20):
    msgs = [{"role":"user","content":task}]
    sys  = ("You are a careful agent. "
            "IMPORTANT: when multiple tool calls are independent, "     # ← 关键
            "emit them in the SAME response so they run in parallel.")
    for _ in range(max_iters):
        r = await client.messages.create(model=MODEL, system=sys,
                tools=SCHEMAS, messages=msgs, max_tokens=4096)
        msgs.append({"role":"assistant", "content":r.content})
        if r.stop_reason == "end_turn": return r
        uses = [b for b in r.content if b.type == "tool_use"]
        results = await asyncio.gather(*[dispatch(b) for b in uses])  # ← 并发
        msgs.append({"role":"user", "content": results})

实测：让 agent「读 README.md / package.json / .github/workflows/ci.yml 然后告诉我这是个什么项目」——串行版 ~9s，并行版 ~3.5s，准确率几乎无差。读 5+ 文件时差距更大。

失败模式：（1）让模型并行调依赖 tool——比如「先查用户 id 再下单」并行下去会用空 id。要么靠 prompt 明确说「下单前必须先拿到 id」，要么直接拆 turn。（2）并发 write tool 引发 race condition——并行 edit_file(same_path) 互相覆盖。给写类工具加 file-level 锁。（3）误以为 multi-tool 就是 parallel——模型在同一个 response 里返回多个 tool_use 才是 parallel；分多 turn 调多 tool 是 sequential。

进阶资源 · Anthropic Parallel tool use, docs.claude.com/.../parallel-tool-use · OpenAI Parallel function calling, platform.openai.com/.../parallel-function-calling · LangChain Parallel tool calls 实现参考, python.langchain.com/.../tool-calling-parallel

// 综合实战 · Anthropic 工程团队的 7 条 tool design 经验

把前面四点浓缩成一份可贴墙的 checklist。下次设计或 review 一组 tool 时逐条过：

少而正交。先砍到 ≤ 10 个。语义重叠的合并；调用频率 < 1% 的删掉。
description 写场景，不写功能。一行讲「干什么」+ 一行讲「什么时候用」+ 一行讲「什么时候不用」。反例越具体越好。
每个参数都有 description。写清单位、格式、默认推断规则、典型值示例。这比改参数名重要 10 倍。
enum 值带说明。"fast" 不如 "fast: prioritize latency, accept ±10% accuracy"。
错误必须可读。handler 抛出的 error 要包含「为什么错 + 下一步怎么办」，因为 agent 会读它来 self-correct。"ENOENT" 没用，"File '/x.json' not found. Use list_dir to see available files." 才有用。
幂等优先。tool 设计成可重试不出乱子的——比如 create_or_update 而非 create。agent 重试是常态。
并行友好。无依赖的 tool 不要在 schema 上人为耦合（不要把 read+write 合成一个工具）；handler 内部用 async；显式在 system prompt 鼓励并行。

这 7 条来自 Anthropic Building Effective Agents 与 Tool Use Best Practices 的工程指南。把它做成一个 PR checklist 模板，每次新增工具时填一遍——agent 项目能少 80% 的 silent failure。

// ENGLISH GLOSSARY

Tool Schema: 定义工具名称、用途、参数结构的 JSON。模型靠它决定何时调、如何调。
Function Calling: OpenAI 系对 tool use 的叫法；语义同 Anthropic 的 tool use。
Input Schema: tool 参数的 JSON Schema 描述。包含 type / properties / required / description。
Tool Selection: 模型在多个候选工具中选哪个调用的决策过程。工具越多越难。
Atomic Tool: 原子工具，做单一小操作（如 read_file），需要模型组合。
Composite / Macro Tool: 组合工具，一次调用完成多步业务流（如 release）。
Parallel Tool Use: 同一次 assistant turn 返回多个独立 tool_use，harness 并发执行。
Tool Result: tool 执行结果，以 user role 的 content block 回传给模型。
MCP (Model Context Protocol): Anthropic 提出的 tool/resource 暴露协议，让 tool 集合可热插拔。
BFCL: Berkeley Function Calling Leaderboard，function calling 公开评测榜单。
Tool Granularity: 工具粒度，原子 vs 组合的设计取舍维度。

// 深入思考

Tool description 比 name 重要 10 倍——但 description 写多长合适？一句话 vs 一段话差别多大？

实测：50-150 字符显著优于 < 20 或 > 300 字符。短了模型不知何时用；长了污染 context 占 prompt 配额。最佳格式：一句话功能 + 一句话适用场景 + 一句话不适用。例：'Search docs by semantic similarity. Use when user asks find/similar. Do NOT use for exact string — use grep_tool.' 这个三句结构 Anthropic 官方 tool use guide 也推荐。

15 个 tool 后准确率下降，是模型 attention 被稀释，还是 tool description 之间互相干扰？

主要是后者。Tool selection 是 zero-shot classification，相似 tool 会让模型 confuse（'search_docs' vs 'search_web' vs 'search_code'）。Anthropic 内部数据：加 tool 数量比加 description 长度退化更显著。解决：1) tool grouping（先选 group 再选 tool）；2) 描述写差异化（'unlike X tool, this one...'）；3) > 20 tool 用 MCP 按需挂载。

原子 tool 让 agent 灵活但易错；组合 tool 让 agent 可靠但失能。一个 production agent 该用哪种？

看任务类型。客户支持 / 数据查询用组合 tool（'create_ticket' 比 'auth + lookup_user + create + notify' 一组靠谱）。开发 / 探索类用原子 tool（Claude Code 不把 read+grep+edit 打包，因为 debug 需要灵活组合）。判断标准：任务的 step path 是否可枚举？可 → 组合；不可 → 原子。

Parallel tool call 是免费午餐，什么时候反而拖慢 agent？

三种场景：1) Tool 之间有依赖（A 的输出是 B 的输入），并行让 B 拿空值；2) 共享资源竞争（同时写同一文件）；3) 模型误判任务并行调用不相关 tool，结果都白跑。Anthropic 的 parallel tool call 是 opt-in，因为模型默认保守串行——production 要在 system prompt 加 'when independent, call in parallel' 才会用上。

Tool schema 的字段名比 description 简短，常听 'name 不重要'。那字段名什么时候真的重要？

Name 在 tool selection 阶段不重要（看 description），但在 argument generation 阶段重要：字段名是模型生成参数时的 prior。比如 'query' vs 'search_string'，前者引导自然语言，后者引导关键词组合。Production 实践：字段名简短无歧义 + description 详细，是最优组合。

// 延伸阅读

Anthropic · How to implement tool use（含 best practices 段） — 官方最权威的 tool design 指南
Anthropic · Building Effective Agents — 工具粒度与 agent 设计
Berkeley Function Calling Leaderboard — 公开 function calling 评测
OpenAI · Function Calling Guide — OpenAI 视角的 schema 设计
MCP · Concepts & Architecture — tool / resource / prompt 三层抽象