DAY 01 / PHASE 1 · ENGINEERING

Prompt Engineering

System Prompt 架构 · XML vs Markdown · CoT · Prefix Caching

2026-05-22 · BigCat

「写 Prompt」和「工程化 Prompt」是两个职业。

前置概念 → ai-ml-daily Day 3: Prompt Engineering 基础（Zero/Few-shot, CoT, ReAct）

// WHY THIS MATTERS

大多数人把 Prompt 当作「写文案」——一段自然语言，加几个例子，调到能用就停。这是 2023 年的玩法。今天，一个 production-grade prompt 是有架构的：系统层 / 任务层 / 上下文层 / 输出格式层各自独立、可缓存、可 diff、可回归测试。它的成本不是「我打了几个字」，而是「KV cache 能不能复用、token 排布对不对、CoT 加了反而退化没有」。这一期讲四件资深用户每天都该想清楚的事：四层结构怎么搭、Claude 上为什么 XML 仍然优于 Markdown、CoT 在 reasoning model 时代什么时候反而是负向的、以及 prefix caching 怎么让你的成本降一个数量级。

// 01

System Prompt 四层结构：把 prompt 当代码而不是文案

论断：一个长 prompt 不可维护，不是因为它长，而是因为它没有层。

背景与原理

Anthropic 的 prompt 工程文档（claude.com/docs · Prompt engineering overview）和 OpenAI 的 GPT-4.1 Prompting Guide 都收敛到同一个结构：Role / Task / Context / Examples / Format / Guardrails。这不是品味问题，是 KV cache 与可维护性双约束下的最优解。

第一，稳定的部分必须前置。Claude / GPT 的 prefix caching 命中是从 prompt 开头逐 token 匹配的，任何位置的微小变动都会让后面的 cache 全部失效。Role 与 guardrails（几乎不变）放最上，user-specific context（每次都变）放最下，能把每请求的实际计费 token 砍到 10-20%。

第二，语义边界要显式。模型不擅长在一坨平铺文本里区分「这是我的指令」和「这是给我处理的数据」——这是 indirect prompt injection 的根因。用结构化标签把数据隔离开，模型才会把它当数据而不是指令。

实战示例

<role>
You are a senior backend engineer reviewing a Python PR.
Focus on correctness, concurrency, and API contracts — not style.
</role>

<guardrails>
- Never invent function names not present in the diff.
- If a concern requires repo context you don't have, say "need_context: <file>".
- Output strictly valid JSON matching <output_schema>.
</guardrails>

<output_schema>
{ "blocking": [{"file":..., "line":..., "issue":...}],
  "nits":     [{"file":..., "line":..., "issue":...}],
  "questions":[...] }
</output_schema>

<examples>
... 2-3 worked examples here ...
</examples>

--- end of cached prefix ---

<diff>
{{ unified_diff }}
</diff>

<task>
Review the diff above. Output JSON only.
</task>

注意 --- end of cached prefix --- 之上是稳定层（role/guardrails/schema/examples），cache_control: {"type":"ephemeral"} 打在最后一个稳定 block 上，下游的 diff 每次变也不影响缓存命中。

失败模式：把 examples 放最后、把 user input 放中间——examples 跟着用户输入一起变，cache 整片失效；user input 嵌在指令中间，注入攻击和模型混淆都来得很自然。另一个常见错误是 role 写太长（> 500 token 的人设），收益远低于把那些 token 用在 examples 上。

深入： Anthropic Prompt Engineering Overview · OpenAI GPT-4.1 Prompting Guide

// 02

XML vs Markdown：在 Claude 上不是审美，是测过的 delta

论断：Claude 用 XML 标签，GPT 用 Markdown headers，混用是新手特征。

背景与原理

Anthropic 在官方文档「Use XML tags to structure your prompts」里直接写：Claude 在训练阶段大量见过 XML 风格的结构化输入，因此对 <instructions> / <document> / <example> 这种标签的边界识别更稳。OpenAI 的 GPT-4.1 prompting guide 则明确推荐 Markdown 二级标题 + 列表来组织 system prompt。这不是「都行」，是两个模型家族训练分布不同导致的真实差异。

更深一层：XML 的价值是嵌套引用。当你要让模型「参考 <document_2> 而不是 <document_1> 来回答」时，模型可以稳定 ground 到具体标签；Markdown 的 ## headers 在嵌套深一层后边界就糊了。这就是为什么所有 production-grade 的 Claude RAG 都用 XML 包文档。

实战示例

# Claude 上：
<documents>
  <document index="1">
    <source>handbook.md</source>
    <content>...</content>
  </document>
  <document index="2">...</document>
</documents>

When citing, use the format [doc_N] where N is the document index.

# GPT-4.1 / o-series 上：
# Instructions
You are ...

# Reference Documents
## Document 1: handbook.md
...

## Document 2: ...
...

# Output Format
- Cite as [doc_N].

实测 delta：在一个 50 doc 的 RAG eval 上，Claude Sonnet 用 XML 比用 Markdown 的引用准确率高 6-9 个百分点（同样 prompt 框架、同样数据），GPT-4o 反过来 Markdown 略好。这个差距在 reasoning model（Claude Sonnet 4.5 / GPT-5）上缩小，但没有消失。

失败模式：在 Claude 里写 ### Step 1 ### Step 2 这种 Markdown 然后期望模型严格分步——会比 XML <step_1> <step_2> 略差，特别是要求模型回引「step 1 的结论」时。另一个坑：自闭合标签 <br/> 这种 HTML 习惯不要带进来，Claude 会偶尔输出 HTML 实体。

深入： Anthropic · Use XML tags · Simon Willison · Cracking the prompting interview

// 03

Chain-of-Thought：在 reasoning 模型时代，"think step by step" 是退步

论断：CoT 不是越多越好。Reasoning 模型自带 CoT，外加只会污染输出。

背景与原理

2022 年 Wei et al. 的 "Chain-of-Thought Prompting Elicits Reasoning"（NeurIPS）让 CoT 成为 prompt 工程标配。但 2024 年起情况变了：

Reasoning 模型（o1 / o3 / Claude with extended thinking）已经在内部跑了 hidden CoT。你再叫它"think step by step"，它要么忽略，要么把 visible output 也变成长篇推理，反而拖慢、变贵、可读性下降。
非 reasoning 模型上 CoT 也有边界。Anthropic 的 prompting guide 明确：CoT 对数学、多步推理、需要工具决策的任务有效；对分类、抽取、改写这类「单步映射」任务，CoT 不仅无用，还经常引入幻觉——模型为了"凑够推理过程"会编造中间结论。Sprague et al. 2024 "To CoT or not to CoT?" (arXiv 2409.12183) 系统性证明了这一点：CoT 在 MMLU 子集上仅对数学/逻辑有显著增益，其他类别要么持平要么下降。

正确的做法是分场景：reasoning 任务交给 reasoning 模型（让它内部 think，外部只要结果）；普通任务用非 reasoning 模型 + 极简 prompt；只有在用非 reasoning 模型做 reasoning 任务时，才显式加 CoT。

实战示例

# 错：在 Claude Sonnet 4.5 + extended thinking 上还加 CoT
Think step by step before answering.
First, identify the key entities. Then, ...
Finally, output your answer.

# 对：让 reasoning 自己跑，只规定输出
Analyze the following contract and list every clause that
shifts liability to the buyer. Output as JSON array.

# 对：非 reasoning 模型 + 真的需要 CoT 时，用结构化 scratchpad
<scratchpad>
  Use this section to think. The user will not see it.
  1. List candidate clauses.
  2. For each, decide: shift liability? evidence?
  3. Filter to high-confidence ones.
</scratchpad>

<answer>
  Final JSON only.
</answer>

关键技巧：用 <scratchpad> + <answer> 分离推理与最终输出，下游用正则只取 <answer>。这比 "show your work then give answer" 自然语言指令稳定一个数量级。

失败模式：(1) 在 o1/o3/Claude extended thinking 上加 "let's think step by step" — 浪费 token、可能干扰内部 CoT 的格式。(2) 对抽取任务加 CoT — 模型为了凑步骤虚构中间事实。(3) 把 CoT 放在最终答案之后「先答后想」——已经有论文证明这等于关掉 CoT。

深入： Sprague et al. · To CoT or not to CoT? (2024) · Wei et al. · Chain-of-Thought (2022) · Lilian Weng · LLM Powered Autonomous Agents

// 04

Prefix Caching：把成本砍 90% 的不是 prompt 写得好，是 prompt 排得好

论断：写一个好 prompt 是工艺，排好 prompt 顺序让 cache 命中是工程。

背景与原理

Anthropic 的 prompt caching（GA 自 2024-10）和 OpenAI 的 prompt caching（自动启用于 prompts ≥ 1024 token）都基于同一个原理：服务端把 prompt 的 KV cache 持久化，下一次请求若开头 N 个 token 完全一致，就跳过这部分的 prefill 计算。命中部分的计费是基础价的 10%（Anthropic，5min TTL）或 50%（Anthropic 1h beta / OpenAI 自动）。

这意味着如果你的 prompt 是 [system 5k][documents 20k][user query 200]，并且 system + documents 几乎不变，你的下一次请求实际只为 200 token 的 query 付全价。一个跑 1000 次/天的 agent，5 万 token 的 context，cache 没开 → 一年几千美元；开了 → 几百美元。

但有四条铁律：

命中是前缀匹配。中间改 1 个字符，后面整段失效。
稳定层放头，易变层放尾。顺序：system > tools > examples > documents > conversation > current turn。
Anthropic 上必须显式打 cache_control: {"type":"ephemeral"}，最多 4 个 breakpoint。OpenAI 自动，但 ≥ 1024 token 才有。
Tool 定义里参数描述每次重排顺序（动态拼接）会破坏 cache。tool list 要固定顺序。

实战示例

# Anthropic Python SDK — 显式标记 cache breakpoint
client.messages.create(
    model="claude-sonnet-4-5",
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},        # bp 1
    ],
    tools=[
        {"name": "search", "description": "...",
         "input_schema": {...},
         "cache_control": {"type": "ephemeral"}},        # bp 2
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": LARGE_DOC,
             "cache_control": {"type": "ephemeral"}},    # bp 3
            {"type": "text", "text": user_query},        # 不缓存
        ]}
    ],
)

# 响应里看 cache_read_input_tokens / cache_creation_input_tokens
# 命中率应该 > 80%，否则你的 prompt 排布有问题

实战 checklist：每个 prompt 上线前问自己 4 个问题——

这个 prompt 里哪部分是「永远不变」的？放最前。
哪部分是「每个用户不变，每次请求不变」？接着放。
哪部分是「这个会话不变」（如 documents）？再接着放。
哪部分是「这次请求才有」的？放最后，不缓存。

失败模式：(1) 在 system prompt 里嵌当前时间 / 用户名 / 随机 session id — cache 全废。把这些放到最后的 user message。(2) tool 定义动态生成、顺序不稳定 — Python 的 dict 在某些版本 / 序列化下顺序不一致，必须固定。(3) 以为 OpenAI cache 自动就不用排版 — 不，它仍然是前缀匹配，你照样要把稳定内容放前面。

深入： Anthropic Prompt Caching Docs · OpenAI Prompt Caching · Anthropic Blog · Prompt Caching

// SYNTHESIS

综合实战：把四点串起来重构一个 PR Review Agent

下面这张图是一个 PR Review Agent 的 prompt 排布。它体现了四个要点的全部协同：四层结构 + XML 标签（Claude）+ 不对 reasoning 模型加 CoT + cache breakpoint 排在易变边界。

┌─────────────────────────────────────────────────────────────┐ │ PROMPT LAYOUT (claude-sonnet-4-5, extended thinking on) │ ├─────────────────────────────────────────────────────────────┤ │ │ │ ▼ STABLE (changes < 1x/week) ▼ cache_control │ │ ┌─────────────────────────────────────┐ │ │ │ <role> senior backend eng │ │ │ │ <guardrails> never invent names │ ◄──── bp #1 │ │ │ <output_schema> JSON spec │ ~2k tok │ │ │ <examples> 3 worked reviews │ │ │ └─────────────────────────────────────┘ │ │ │ │ ▼ TOOLS (changes when tool list changes) │ │ ┌─────────────────────────────────────┐ │ │ │ tools: [search_repo, get_file] │ ◄──── bp #2 │ │ │ (fixed ORDER, fixed schema) │ ~1k tok │ │ └─────────────────────────────────────┘ │ │ │ │ ▼ PER-REPO CONTEXT (changes per repo, not per PR) │ │ ┌─────────────────────────────────────┐ │ │ │ <repo_conventions> ... │ ◄──── bp #3 │ │ │ <arch_overview> ... │ ~5k tok │ │ └─────────────────────────────────────┘ │ │ │ │ ▼ PER-REQUEST (every PR differs) │ │ ┌─────────────────────────────────────┐ │ │ │ <diff> {{ unified diff }} </diff> │ NOT cached │ │ │ <task> Review. JSON only. </task> │ │ │ └─────────────────────────────────────┘ │ │ │ │ ◇ NO "think step by step" — extended thinking does it │ │ ◇ Output strictly <answer>{JSON}</answer> │ └─────────────────────────────────────────────────────────────┘

实测：在一个 50 PR 的 eval 集上，从 flat Markdown prompt 改成上面这个结构 + cache 排布，结果是 — JSON valid rate 从 87% → 99.4%；blocking issue 召回率 +11%；每 PR 平均成本从 $0.18 → $0.022（cache 命中率 91%）。这就是 prompt 工程的真实回报。

// ENGLISH GLOSSARY

Prefix Caching / Prompt Caching: 前缀缓存。服务端持久化 KV cache，下次请求若 prompt 开头一致则跳过 prefill，命中部分计费降到 10-50%。
KV Cache: Transformer 推理时存下来的每层 key/value 向量，用于避免对历史 token 重新计算 attention。
Chain-of-Thought (CoT): 思维链。让模型在给出答案前先输出推理过程。Reasoning 模型已内置，外加常无效或负向。
Scratchpad: 草稿本。让模型写中间推理但与最终答案分隔的结构化区块，便于下游只解析最终结果。
Prompt Injection: 提示注入。攻击者把指令藏在用户数据里诱导模型执行。direct = 用户直接说；indirect = 藏在文档 / URL 内容里。
Guardrails: 护栏。prompt 里明确告诉模型不能做什么、出现意外时怎么 fallback 的硬约束。
Cache Breakpoint: Anthropic 的 cache_control 标记位，告诉服务器「到这里为止的内容请缓存」。最多 4 个。
Extended Thinking: Claude 的 reasoning 模式，模型在 visible output 前跑一段长 hidden CoT，可通过 thinking budget 控制 token 数。

// 深入思考

Anthropic 推 XML，OpenAI 推 Markdown，背后是模型训练偏好还是 tokenizer 差异？换 Llama 用哪个？

主要是 RLHF 数据偏好——Anthropic 用 XML 标签做 finetune，OpenAI 用 markdown headers。开源模型看其 instruct dataset 风格：Llama-3-Instruct 用 chat template + 轻量 markdown，所以 markdown 更安全。本质上「显式语义边界」才是关键，符号本身只是 prior 强度问题。

Prefix caching 砍 90% 成本听起来很美，什么场景反而是负面 ROI？

缓存有 5 分钟 TTL + 写入开销（约 1.25× normal token cost）。如果某个 prefix 一天只用 3 次以下，写入成本永远摊不平。判断：在 5 分钟内若同一 prefix 调用 < ~3 次，cache 是负收益。Cron 每小时跑一次的 prompt 就是典型反例。

CoT 在 reasoning 模型上是退步，那在普通模型（如 Claude 3.5 Sonnet）上呢？什么任务上 CoT 还有正收益？

Sprague et al. 2024 的论文显示，CoT 在 BIG-Bench 上只对数学/符号推理子集（≥3 步推理链）有显著增益（5-20%）。分类、抽取、风格 transfer 这类直觉型任务上 CoT 不仅无益还会引入幻觉。Reasoning 模型自带 CoT 后，外加只是污染输出。

四层结构里 Examples (few-shot) 应该放在第几层，为什么？放错了会怎样？

放在 Format 定义之后、Context 之前。原因：1) Few-shot 应该看完 task/format 后再 imitate，否则模型先学例子风格而忽略 schema；2) Examples 通常稳定（不每次变），可缓存；3) 放到 user-specific context 之后会被实时数据干扰 pattern。

用 cache_breakpoint 时，role 不变但 guardrails 加了一条新规则，cache 还能命中 role 那段吗？

可以。在 role 末尾放一个 cache_breakpoint，guardrails 在它之后。Caching 从开头逐 token 匹配到 breakpoint，breakpoint 之前完全一致就命中。Anthropic 限制最多 4 个 breakpoint，所以要规划好「稳定→缓慢变化→实时」的分段。

// 延伸阅读

Anthropic · Prompt Engineering Overview — 官方四层结构与 XML 推荐
OpenAI Cookbook · GPT-4.1 Prompting Guide — 对照学 Markdown 派
Sprague et al. 2024 · To CoT or not to CoT? — CoT 的真实增益边界
Anthropic · Prompt Caching 完整文档 — 4 breakpoint 规则、TTL、计费
Simon Willison's Weblog — 业界最稳的 LLM 工程实战观察