"Writing a prompt" and "engineering a prompt" are two different jobs.
Most people still treat prompts like copywriting — a paragraph of natural language, a couple of examples, ship it when it kind of works. That's the 2023 playbook. Today, a production-grade prompt has an architecture: system layer / task layer / context layer / output format layer, each independent, cacheable, diffable, regression-testable. The cost isn't "how many words did I type", it's "will the KV cache get reused, is the token layout right, did adding CoT actually make things worse". This issue covers four things every senior prompt engineer should think through daily: how to lay out the four-layer structure, why XML still beats Markdown on Claude, when CoT is actively harmful on reasoning models, and how prefix caching cuts your bill by an order of magnitude.
Anthropic's prompt engineering docs (claude.com/docs · Prompt engineering overview) and OpenAI's GPT-4.1 Prompting Guide converge on the same structure: Role / Task / Context / Examples / Format / Guardrails. This isn't taste — it's the optimum under two hard constraints: KV cache hit rate and long-term maintainability.
First, stable content must come first. Prefix caching on Claude / GPT matches token-by-token from the start of the prompt; any change anywhere invalidates the cache from that point onward. Put Role and guardrails (almost never change) at the top, user-specific context (changes every call) at the bottom — your billable tokens drop to 10–20% of nominal.
Second, semantic boundaries must be explicit. Models struggle to tell "this is my instruction" from "this is data I should operate on" inside a flat blob of text — this is the root cause of indirect prompt injection. Structured tags isolate the data so the model treats it as data, not as instructions.
<role>
You are a senior backend engineer reviewing a Python PR.
Focus on correctness, concurrency, and API contracts — not style.
</role>
<guardrails>
- Never invent function names not present in the diff.
- If a concern requires repo context you don't have, say "need_context: <file>".
- Output strictly valid JSON matching <output_schema>.
</guardrails>
<output_schema>
{ "blocking": [{"file":..., "line":..., "issue":...}],
"nits": [{"file":..., "line":..., "issue":...}],
"questions":[...] }
</output_schema>
<examples>
... 2-3 worked examples here ...
</examples>
--- end of cached prefix ---
<diff>
{{ unified_diff }}
</diff>
<task>
Review the diff above. Output JSON only.
</task>
Note: everything above --- end of cached prefix --- is the stable layer (role / guardrails / schema / examples). Put cache_control: {"type":"ephemeral"} on the last stable block — the diff below changes every call without busting the cache.
Anthropic's docs ("Use XML tags to structure your prompts") spell it out: Claude has seen lots of XML-style structured inputs during training, so it's much more reliable at recognizing the boundaries of <instructions> / <document> / <example>. OpenAI's GPT-4.1 prompting guide explicitly recommends Markdown H2 + lists for system prompts. This isn't "use whatever" — it's a real delta driven by different training distributions across model families.
The deeper point: XML's value is nestable references. When you want the model to "answer using <document_2> rather than <document_1>", it can ground reliably to a specific tag; Markdown ## headers blur as soon as you nest one level. That's why every production-grade Claude RAG wraps documents in XML.
# On Claude:
<documents>
<document index="1">
<source>handbook.md</source>
<content>...</content>
</document>
<document index="2">...</document>
</documents>
When citing, use the format [doc_N] where N is the document index.
# On GPT-4.1 / o-series:
# Instructions
You are ...
# Reference Documents
## Document 1: handbook.md
...
## Document 2: ...
...
# Output Format
- Cite as [doc_N].
Measured delta: on a 50-doc RAG eval, Claude Sonnet with XML beat the same prompt in Markdown by 6–9 points on citation accuracy (same framework, same data); GPT-4o was slightly better with Markdown the other way. The gap shrinks on reasoning models (Claude Sonnet 4.5 / GPT-5) but doesn't disappear.
### Step 1 / ### Step 2 Markdown on Claude and expecting strict step-by-step behavior — it'll be slightly worse than XML <step_1> / <step_2>, especially when you later ask the model to "reference the conclusion of step 1". Another trap: self-closing tags like <br/> bleeding in from HTML habits — Claude will occasionally emit HTML entities in the output.
Wei et al. 2022 ("Chain-of-Thought Prompting Elicits Reasoning", NeurIPS) made CoT a default in the prompt-engineering toolbox. Things changed in 2024:
The right approach is scenario-driven: hand reasoning tasks to reasoning models (let them think internally, you only consume the result); use non-reasoning models with minimal prompts for routine tasks; only add explicit CoT when you're doing a reasoning task on a non-reasoning model.
# Wrong: adding CoT on Claude Sonnet 4.5 + extended thinking
Think step by step before answering.
First, identify the key entities. Then, ...
Finally, output your answer.
# Right: let reasoning run itself; only specify output
Analyze the following contract and list every clause that
shifts liability to the buyer. Output as JSON array.
# Right: non-reasoning model + you really need CoT — use a structured scratchpad
<scratchpad>
Use this section to think. The user will not see it.
1. List candidate clauses.
2. For each, decide: shift liability? evidence?
3. Filter to high-confidence ones.
</scratchpad>
<answer>
Final JSON only.
</answer>
Key trick: split reasoning and final output with <scratchpad> + <answer>, then regex out only <answer> downstream. This is an order of magnitude more reliable than a natural-language "show your work then give the answer" instruction.
Anthropic prompt caching (GA since 2024-10) and OpenAI prompt caching (auto-enabled for prompts ≥ 1024 tokens) work the same way: the server persists the KV cache, and if the next request's prompt prefix matches token-by-token, prefill is skipped for the hit region. The billing for hit tokens is 10% of base (Anthropic, 5min TTL) or 50% (Anthropic 1h beta / OpenAI auto).
So if your prompt is [system 5k][documents 20k][user query 200] and system + documents barely change, the next request only pays full price on the 200-token query. An agent running 1000×/day with 50k of context: no cache → thousands of dollars a year; with cache → hundreds.
Four hard rules:
cache_control: {"type":"ephemeral"}, up to 4 breakpoints. OpenAI is automatic but only kicks in at ≥ 1024 tokens.# Anthropic Python SDK — explicit cache breakpoints
client.messages.create(
model="claude-sonnet-4-5",
system=[
{"type": "text", "text": LONG_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"}}, # bp 1
],
tools=[
{"name": "search", "description": "...",
"input_schema": {...},
"cache_control": {"type": "ephemeral"}}, # bp 2
],
messages=[
{"role": "user", "content": [
{"type": "text", "text": LARGE_DOC,
"cache_control": {"type": "ephemeral"}}, # bp 3
{"type": "text", "text": user_query}, # not cached
]}
],
)
# Inspect cache_read_input_tokens / cache_creation_input_tokens in the response.
# Hit rate should be > 80%; otherwise your layout is wrong.
Pre-deploy checklist — four questions per prompt:
Below is the prompt layout of a PR review agent. It exercises all four ideas together: four-layer structure + XML tags (Claude) + no CoT on a reasoning model + cache breakpoints placed at the volatility boundaries.
Measured: on a 50-PR eval set, switching from a flat Markdown prompt to the layout above + cache plan moved JSON valid rate from 87% → 99.4%, blocking-issue recall +11%, and per-PR average cost from $0.18 → $0.022 (91% cache hit rate). That's the real ROI of prompt engineering.