DAY 01 / PHASE 1 · ENGINEERING

Prompt Engineering

System Prompt Architecture · XML vs Markdown · CoT · Prefix Caching

2026-05-22 · BigCat

"Writing a prompt" and "engineering a prompt" are two different jobs.

Foundation concepts → ai-ml-daily Day 3: Prompt Engineering Basics (Zero/Few-shot, CoT, ReAct)

// WHY THIS MATTERS

Most people still treat prompts like copywriting — a paragraph of natural language, a couple of examples, ship it when it kind of works. That's the 2023 playbook. Today, a production-grade prompt has an architecture: system layer / task layer / context layer / output format layer, each independent, cacheable, diffable, regression-testable. The cost isn't "how many words did I type", it's "will the KV cache get reused, is the token layout right, did adding CoT actually make things worse". This issue covers four things every senior prompt engineer should think through daily: how to lay out the four-layer structure, why XML still beats Markdown on Claude, when CoT is actively harmful on reasoning models, and how prefix caching cuts your bill by an order of magnitude.

// 01

Four-Layer System Prompt Structure: Treat Prompts as Code, Not Copy

Claim: a long prompt isn't unmaintainable because it's long — it's unmaintainable because it has no layers.

Background & Principles

Anthropic's prompt engineering docs (claude.com/docs · Prompt engineering overview) and OpenAI's GPT-4.1 Prompting Guide converge on the same structure: Role / Task / Context / Examples / Format / Guardrails. This isn't taste — it's the optimum under two hard constraints: KV cache hit rate and long-term maintainability.

First, stable content must come first. Prefix caching on Claude / GPT matches token-by-token from the start of the prompt; a change at any position invalidates the cache from that point onward. Put Role and guardrails (which almost never change) at the top and user-specific context (which changes every call) at the bottom. When the stable prefix dominates the prompt and the cache stays warm, the input bill can drop to roughly 10–20% of nominal, because cached reads are billed at a fraction of the base input price.

Second, semantic boundaries must be explicit. Models struggle to tell "this is my instruction" from "this is data I should operate on" inside a flat blob of text — this is the root cause of indirect prompt injection. Structured tags isolate the data so the model treats it as data, not as instructions.

Hands-on Example

<role>
You are a senior backend engineer reviewing a Python PR.
Focus on correctness, concurrency, and API contracts — not style.
</role>

<guardrails>
- Never invent function names not present in the diff.
- If a concern requires repo context you don't have, say "need_context: <file>".
- Output strictly valid JSON matching <output_schema>.
</guardrails>

<output_schema>
{ "blocking": [{"file":..., "line":..., "issue":...}],
  "nits":     [{"file":..., "line":..., "issue":...}],
  "questions":[...] }
</output_schema>

<examples>
... 2-3 worked examples here ...
</examples>

--- end of cached prefix ---

<diff>
{{ unified_diff }}
</diff>

<task>
Review the diff above. Output JSON only.
</task>

Note: everything above --- end of cached prefix --- is the stable layer (role / guardrails / schema / examples). Put cache_control: {"type":"ephemeral"} on the last stable block — the diff below changes every call without busting the cache.
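
A minimal sketch of that split in Python, under stated assumptions: StablePrefix, wrap_untrusted and build_request_suffix are hypothetical helper names (none of this is an SDK API), and the escaping rule is one illustrative convention, not a complete injection defense. What it shows: the stable layers render to the same string on every call, and request data only ever appears inside its own tag in the suffix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StablePrefix:
    """The layers that should be byte-identical across calls."""
    role: str
    guardrails: str
    output_schema: str
    examples: str

    def render(self) -> str:
        # Fixed tag order: reordering the layers would change the prefix
        # and miss the cache.
        return "\n\n".join([
            f"<role>\n{self.role}\n</role>",
            f"<guardrails>\n{self.guardrails}\n</guardrails>",
            f"<output_schema>\n{self.output_schema}\n</output_schema>",
            f"<examples>\n{self.examples}\n</examples>",
        ])


def wrap_untrusted(tag: str, data: str) -> str:
    """Wrap request data in a tag, neutralizing any embedded closing tag."""
    closing = f"</{tag}>"
    # Hypothetical convention: break the closing tag so the data cannot
    # end its own block early.
    safe = data.replace(closing, f"<\\/{tag}>")
    return f"<{tag}>\n{safe}\n{closing}"


def build_request_suffix(unified_diff: str) -> str:
    """The volatile layer: changes on every call and is never cached."""
    task = "<task>\nReview the diff above. Output JSON only.\n</task>"
    return wrap_untrusted("diff", unified_diff) + "\n\n" + task
```

The rendered prefix is what goes in the block carrying cache_control; the suffix is appended after it on each request.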

Failure modes: putting examples at the end and user input in the middle — examples now change with every user input and the cache is wrecked; user input embedded in the middle of instructions invites injection and confuses the model on what to follow. Another classic: writing a 500+ token role / persona — almost zero ROI vs spending those tokens on examples.

Going deeper: Anthropic Prompt Engineering Overview · OpenAI GPT-4.1 Prompting Guide

// 02

XML vs Markdown: On Claude This Is Not Aesthetics — It's a Measured Delta

Claim: use XML tags on Claude, Markdown headers on GPT. Mixing them is a beginner tell.

Background & Principles

Anthropic's docs ("Use XML tags to structure your prompts") spell it out: Claude has seen lots of XML-style structured inputs during training, so it's much more reliable at recognizing the boundaries of <instructions> / <document> / <example>. OpenAI's GPT-4.1 prompting guide recommends Markdown titles and lists as the starting point for system prompts, while noting that XML also performs well, particularly for delimiting documents. This isn't "use whatever": it's a real delta driven by different training distributions across model families.

The deeper point: XML's value is nestable references. When you want the model to "answer using <document_2> rather than <document_1>", it can ground reliably to a specific tag; Markdown ## headers blur as soon as you nest one level. That's why every production-grade Claude RAG wraps documents in XML.

Hands-on Example

# On Claude:
<documents>
  <document index="1">
    <source>handbook.md</source>
    <content>...</content>
  </document>
  <document index="2">...</document>
</documents>

When citing, use the format [doc_N] where N is the document index.

# On GPT-4.1 / o-series:
# Instructions
You are ...

# Reference Documents
## Document 1: handbook.md
...

## Document 2: ...
...

# Output Format
- Cite as [doc_N].

Measured delta: on a 50-doc RAG eval, Claude Sonnet with XML beat the same prompt in Markdown by 6–9 points on citation accuracy (same framework, same data); GPT-4o went the other way and was slightly better with Markdown. The gap shrinks on reasoning models (Claude Sonnet 4.5 / GPT-5) but doesn't disappear.

Failure modes: writing ### Step 1 / ### Step 2 Markdown on Claude and expecting strict step-by-step behavior — it'll be slightly worse than XML <step_1> / <step_2>, especially when you later ask the model to "reference the conclusion of step 1". Another trap: self-closing tags like <br/> bleeding in from HTML habits — Claude will occasionally emit HTML entities in the output.
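
A hedged sketch of the Claude-side pattern in Python: render_documents builds the indexed <documents> block and invalid_citations checks that every [doc_N] in a response points at a document that was actually supplied. Both function names are hypothetical, and entity-escaping the content is an assumption here; some pipelines prefer to strip or reject content that contains a closing tag instead.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def render_documents(docs: list[tuple[str, str]]) -> str:
    """Render (source, content) pairs as an indexed <documents> block."""
    parts = ["<documents>"]
    for index, (source, content) in enumerate(docs, start=1):
        parts.append(f"  <document index={quoteattr(str(index))}>")
        parts.append(f"    <source>{escape(source)}</source>")
        # escape() turns <, > and & into entities, so document text
        # cannot close a tag by accident.
        parts.append(f"    <content>{escape(content)}</content>")
        parts.append("  </document>")
    parts.append("</documents>")
    return "\n".join(parts)


def invalid_citations(answer: str, doc_count: int) -> list[int]:
    """Return cited indices that do not match any supplied document."""
    cited = {int(n) for n in re.findall(r"\[doc_(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= doc_count)
```

Running invalid_citations on every response is a cheap way to turn "citation accuracy" into a number you can regression-test.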

Going deeper: Anthropic · Use XML tags · Simon Willison · Cracking the prompting interview

// 03

Chain-of-Thought: In the Reasoning-Model Era, "Think Step by Step" Is a Regression

Claim: more CoT is not better. Reasoning models already do CoT internally; layering more on top just pollutes the output.

Background & Principles

Wei et al. 2022 ("Chain-of-Thought Prompting Elicits Reasoning", NeurIPS) made CoT a default in the prompt-engineering toolbox. Things changed in 2024:

Reasoning models (o1 / o3 / Claude with extended thinking) run a hidden CoT internally. Telling them to "think step by step" either gets ignored or, worse, turns the visible output itself into long reasoning — slower, more expensive, less readable.
Even on non-reasoning models, CoT has limits. Anthropic's prompting guide recommends CoT for tasks a human would need to think through (complex math, multi-step analysis, decisions with many factors) and advises using it judiciously elsewhere. On classification, extraction, and rewrite tasks (single-step mappings) it's not just unhelpful: it can actively introduce hallucinations, because the model fabricates intermediate conclusions to "fill in" the reasoning chain. Sprague et al. 2024 ("To CoT or not to CoT?", arXiv 2409.12183) showed this systematically: CoT gives meaningful gains mainly on math and symbolic-reasoning tasks; on MMLU, answering directly is almost as accurate as CoT unless the question or the response involves an equals sign, that is, symbolic manipulation.

The right approach is scenario-driven: hand reasoning tasks to reasoning models (let them think internally, you only consume the result); use non-reasoning models with minimal prompts for routine tasks; only add explicit CoT when you're doing a reasoning task on a non-reasoning model.
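
That decision table is small enough to write down as code. A sketch only: TaskKind, cot_strategy and the three strategy labels are hypothetical names, and classifying a real task as "reasoning" or "routine" is the part this does not solve.

```python
from enum import Enum


class TaskKind(Enum):
    REASONING = "reasoning"   # math, multi-step logic, tool-use decisions
    ROUTINE = "routine"       # classification, extraction, rewriting


def cot_strategy(task: TaskKind, model_reasons_internally: bool) -> str:
    """Map (task, model) to one of three hypothetical prompting strategies."""
    if task is TaskKind.ROUTINE:
        # Single-step mappings: keep the prompt minimal, no reasoning scaffold.
        return "minimal_prompt"
    if model_reasons_internally:
        # The model already runs hidden CoT: describe the output, nothing else.
        return "output_spec_only"
    # Reasoning task on a non-reasoning model: the one case for explicit CoT.
    return "explicit_scratchpad"
```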

Hands-on Example

# Wrong: adding CoT on Claude Sonnet 4.5 + extended thinking
Think step by step before answering.
First, identify the key entities. Then, ...
Finally, output your answer.

# Right: let reasoning run itself; only specify output
Analyze the following contract and list every clause that
shifts liability to the buyer. Output as JSON array.

# Right: non-reasoning model + you really need CoT — use a structured scratchpad
<scratchpad>
  Use this section to think. The user will not see it.
  1. List candidate clauses.
  2. For each, decide: shift liability? evidence?
  3. Filter to high-confidence ones.
</scratchpad>

<answer>
  Final JSON only.
</answer>

Key trick: split reasoning and final output with <scratchpad> + <answer>, then extract only the <answer> block downstream (a regex is enough). Because the boundary is an explicit tag rather than a sentence the parser has to guess at, this is far more reliable than a natural-language "show your work then give the answer" instruction.
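
The downstream half of that trick, sketched in Python. extract_answer is a hypothetical helper; it assumes the model wrapped valid JSON in <answer> tags and fails loudly when it did not, so a malformed response never gets parsed as if it were fine.

```python
import json
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(completion: str):
    """Return the parsed JSON inside the last <answer> block."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        raise ValueError("no <answer> block found")
    # Use the last block: the scratchpad may quote an earlier draft answer.
    return json.loads(matches[-1].strip())
```

json.loads raises on invalid JSON, which is the behavior you want in a pipeline that tracks JSON valid rate.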

Failure modes: (1) Adding "let's think step by step" on o1 / o3 / Claude extended thinking: wastes tokens and can interfere with the model's own internal reasoning. (2) Adding CoT on an extraction task: the model invents intermediate facts to fill the steps. (3) Putting the CoT after the answer ("answer first, then reason"): generation is autoregressive, so the answer tokens are already fixed before any reasoning is produced. The reasoning becomes a post-hoc rationalization and the accuracy benefit of CoT is lost.

Going deeper: Sprague et al. · To CoT or not to CoT? (2024) · Wei et al. · Chain-of-Thought (2022) · Lilian Weng · LLM Powered Autonomous Agents

// 04

Prefix Caching: 90% Cost Reduction Comes from Laying Out the Prompt, Not Writing It

Claim: writing a good prompt is craft. Laying it out so the cache hits is engineering.

Background & Principles

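Anthropic prompt caching (public beta 2024-08, GA 2024-12) and OpenAI prompt caching (launched 2024-10, auto-enabled for prompts ≥ 1024 tokens) work the same way: the server persists the KV cache, and if the next request's prompt prefix matches token-by-token, prefill is skipped for the hit region. On Anthropic, cache reads are billed at 10% of the base input price for both the default 5-minute TTL and the optional 1-hour TTL; what differs is the write surcharge (1.25× base for 5 minutes, 2× for 1 hour). On OpenAI, cached tokens were discounted 50% at launch, and the discount varies by model.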

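So if your prompt is [system 5k][documents 20k][user query 200] and system + documents barely change, the next request pays full price only on the 200-token query and the cached-read rate on the other 25k. Scale that up: an agent running 1000×/day with 50k tokens of context sends 50M input tokens a day. At $3 per million input tokens (Sonnet-class list price) that is about $150/day uncached, or roughly $55k/year; if nearly all of the 50k is a warm cached prefix billed at 10%, it drops to roughly a tenth of that, about $5.5k/year.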
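
The arithmetic is easy to check in a few lines of Python. This is a rough sketch, not a billing calculator: the $3 per million input tokens price and the 10% cached-read multiplier are assumptions to swap for your model's actual rates, annual_input_cost is a hypothetical helper, and cache write surcharges and output tokens are ignored. The absolute numbers move a lot with the assumed price; the ratio is the point.

```python
def annual_input_cost(
    calls_per_day: int,
    prefix_tokens: int,
    tail_tokens: int,
    usd_per_mtok: float,
    prefix_multiplier: float = 1.0,
) -> float:
    """Yearly input spend in USD.

    prefix_multiplier is 1.0 with no cache and the cached-read rate
    (for example 0.10) when the prefix is served from a warm cache.
    """
    billed_tokens = prefix_tokens * prefix_multiplier + tail_tokens
    per_call = billed_tokens * usd_per_mtok / 1_000_000
    return per_call * calls_per_day * 365


# Assumed workload: 1000 calls/day, 49,800 stable tokens + 200-token query.
uncached = annual_input_cost(1000, 49_800, 200, usd_per_mtok=3.0)
cached = annual_input_cost(1000, 49_800, 200, usd_per_mtok=3.0,
                           prefix_multiplier=0.10)
print(f"uncached: ${uncached:,.0f}/yr   cached: ${cached:,.0f}/yr")
```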

Four hard rules:

Hits are prefix-based. One character change in the middle invalidates everything after it.
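Stable content at the head, volatile content at the tail. Conceptual order: tool definitions and system instructions > examples > documents > conversation > current turn. On Anthropic the cached prefix is assembled as tools, then system, then messages, so a change to the tool list also invalidates the cached system prompt.
On Anthropic you must explicitly set cache_control: {"type":"ephemeral"}, up to 4 breakpoints, and the prefix must reach a per-model minimum length to be cached at all (1024 tokens on Sonnet-class models). OpenAI is automatic but only kicks in at ≥ 1024 tokens.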
Tool definitions whose parameter descriptions are reordered each call (dynamic assembly) wreck the cache. Keep tool list in a fixed order.
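
One way to enforce the fixed-order rule is to canonicalize tool definitions before every request and fingerprint the stable prefix. A sketch under assumptions: canonicalize, canonical_tools and prefix_fingerprint are hypothetical helpers, tools are plain dicts with a "name" key, and sorting keys alphabetically is an arbitrary but deterministic choice.

```python
import hashlib
import json


def canonicalize(value):
    """Recursively rebuild dicts with sorted keys so serialization is stable."""
    if isinstance(value, dict):
        return {key: canonicalize(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # List order is kept as-is: it can be meaningful (e.g. "required").
        return [canonicalize(item) for item in value]
    return value


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Sort tools by name so list order never depends on discovery order."""
    return sorted((canonicalize(t) for t in tools), key=lambda t: t["name"])


def prefix_fingerprint(tools: list[dict], system_prompt: str) -> str:
    """Hash the canonical stable prefix; log it per request to spot drift."""
    payload = json.dumps(
        {"tools": canonical_tools(tools), "system": system_prompt},
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Send the output of canonical_tools to the API, not the raw list, and alert when the fingerprint changes between deploys that were not supposed to touch the prompt.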

Hands-on Example

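# Anthropic Python SDK: explicit cache breakpoints.
# LONG_SYSTEM_PROMPT, SEARCH_INPUT_SCHEMA, LARGE_DOC and user_query are defined elsewhere.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,  # required by the Messages API
    # The cached prefix is assembled as tools -> system -> messages,
    # so the arguments are listed in that order.
    tools=[
        {"name": "search", "description": "...",
         "input_schema": SEARCH_INPUT_SCHEMA,
         "cache_control": {"type": "ephemeral"}},        # bp 1 (on the last tool)
    ],
    system=[
        {"type": "text", "text": LONG_SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},        # bp 2
    ],
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": LARGE_DOC,
             "cache_control": {"type": "ephemeral"}},    # bp 3
            {"type": "text", "text": user_query},        # not cached
        ]}
    ],
)

# Inspect the cache counters in the response.
# Hit rate should be > 80%; otherwise your layout is wrong.
print(response.usage.cache_read_input_tokens,
      response.usage.cache_creation_input_tokens)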
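
To turn those two counters into the hit rate mentioned above, something like the following works. A sketch with assumptions: cache_hit_rate is a hypothetical helper, the usage objects have been converted to plain dicts, and "hit rate" is defined here as cached-read tokens over all input tokens (on Anthropic, input_tokens counts only the uncached remainder, so the three fields add up to the full prompt).

```python
from typing import Iterable


def cache_hit_rate(usages: Iterable[dict]) -> float:
    """Fraction of input tokens served from cache across a batch of calls."""
    read = written = uncached = 0
    for usage in usages:
        read += usage.get("cache_read_input_tokens") or 0
        written += usage.get("cache_creation_input_tokens") or 0
        uncached += usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0
```

Compute it over a window of real traffic rather than a single call: the first request after a deploy is always a cold write.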

Pre-deploy checklist — four questions per prompt:

What part of this prompt is "never changes"? Put it first.
What part is "changes per user, but not per request"? Next.
What part is "doesn't change in this session" (e.g. documents)? Next.
What part is "only exists for this request"? Last, uncached.
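
The checklist maps directly onto a sort plus a breakpoint at each tier boundary. A minimal sketch, assuming hypothetical names throughout (Segment, layout, the four tier constants) and text-only content blocks; with three cacheable tiers it never exceeds Anthropic's four-breakpoint cap.

```python
from dataclasses import dataclass

# Lower number = more stable. PER_REQUEST content is never cached.
NEVER, PER_USER, PER_SESSION, PER_REQUEST = 0, 1, 2, 3


@dataclass
class Segment:
    name: str
    text: str
    tier: int


def layout(segments: list[Segment]) -> list[dict]:
    """Order segments by volatility; mark the last block of each cacheable tier."""
    ordered = sorted(segments, key=lambda s: s.tier)  # sorted() is stable
    blocks = []
    for i, seg in enumerate(ordered):
        block = {"type": "text", "text": seg.text}
        last_of_tier = i + 1 == len(ordered) or ordered[i + 1].tier != seg.tier
        if seg.tier != PER_REQUEST and last_of_tier:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)
    return blocks
```

Because the sort is stable, segments within a tier keep the order you declared them in, which matters just as much for the prefix match as the tier order does.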

Failure modes: (1) Embedding current time / username / random session id inside the system prompt: the prefix changes on every call and the cache never hits. Move them to the last user message. (2) Dynamically generated tool definitions whose ordering isn't stable: Python dicts have preserved insertion order since 3.7, but tool lists built from sets, plugin discovery, or concurrent registration can come out in a different order on each run; sort and pin them. (3) Assuming OpenAI's auto-cache means layout doesn't matter: it's still prefix matching, so stable content still has to come first.
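
When the hit rate drops and you don't know why, diff two consecutive rendered prompts and look at where they first diverge. A small sketch; first_divergence and explain_miss are hypothetical helpers, and they compare characters rather than tokens, which is an approximation but usually enough to point at the offending timestamp or id.

```python
def first_divergence(previous: str, current: str) -> int:
    """Index of the first differing character (or the shorter length)."""
    limit = min(len(previous), len(current))
    for i in range(limit):
        if previous[i] != current[i]:
            return i
    return limit


def explain_miss(previous: str, current: str, context: int = 30) -> str:
    """Human-readable summary of where two prompts stop sharing a prefix."""
    i = first_divergence(previous, current)
    return (f"shared prefix: {i} chars; "
            f"then {previous[i:i + context]!r} vs {current[i:i + context]!r}")
```

If the divergence sits a few hundred characters into a multi-thousand-token system prompt, something volatile has leaked into the stable layer.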

Going deeper: Anthropic Prompt Caching Docs · OpenAI Prompt Caching · Anthropic Blog · Prompt Caching

// SYNTHESIS

Putting It Together: Rebuild a PR Review Agent

Below is the prompt layout of a PR review agent. It exercises all four ideas together: four-layer structure + XML tags (Claude) + no CoT on a reasoning model + cache breakpoints placed at the volatility boundaries.

┌─────────────────────────────────────────────────────────────┐
│ PROMPT LAYOUT (claude-sonnet-4-5, extended thinking on)     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ▼ TOOLS (changes with the tool list)     ▼ cache_control    │
│  ┌─────────────────────────────────────┐                    │
│  │ tools: [search_repo, get_file]      │ ◄──── bp #1        │
│  │ (fixed ORDER, fixed schema)         │       ~1k tok      │
│  └─────────────────────────────────────┘                    │
│                                                             │
│ ▼ STABLE (changes < 1x/week)                                │
│  ┌─────────────────────────────────────┐                    │
│  │ <role> senior backend eng           │                    │
│  │ <guardrails> never invent names     │ ◄──── bp #2        │
│  │ <output_schema> JSON spec           │       ~2k tok      │
│  │ <examples> 3 worked reviews         │                    │
│  └─────────────────────────────────────┘                    │
│                                                             │
│ ▼ PER-REPO CONTEXT (changes per repo, not per PR)           │
│  ┌─────────────────────────────────────┐                    │
│  │ <repo_conventions> ...              │ ◄──── bp #3        │
│  │ <arch_overview> ...                 │       ~5k tok      │
│  └─────────────────────────────────────┘                    │
│                                                             │
│ ▼ PER-REQUEST (every PR differs)                            │
│  ┌─────────────────────────────────────┐                    │
│  │ <diff> {{ unified diff }} </diff>   │ NOT cached         │
│  │ <task> Review. JSON only. </task>   │                    │
│  └─────────────────────────────────────┘                    │
│                                                             │
│ ◇ NO "think step by step" (extended thinking does it)       │
│ ◇ Output strictly <answer>{JSON}</answer>                   │
└─────────────────────────────────────────────────────────────┘

Tools come first because Anthropic assembles the cached prefix as tools, then system, then messages: a tool-list change invalidates everything after it.

Measured: on a 50-PR eval set, switching from a flat Markdown prompt to the layout above + cache plan moved JSON valid rate from 87% → 99.4%, blocking-issue recall +11%, and per-PR average cost from $0.18 → $0.022 (91% cache hit rate). That's the real ROI of prompt engineering.

// Deep Thinking

Anthropic pushes XML, OpenAI pushes Markdown — is that training preference or tokenizer difference? Which should you use on Llama?

Most likely a post-training data preference rather than a tokenizer effect: Anthropic says Claude saw XML-tagged prompts during training, and OpenAI's guides are built around Markdown headers. For open models, look at the instruct dataset: Llama-3-Instruct uses chat template + light Markdown, so Markdown is the safer default. Fundamentally, what matters is "explicit semantic boundaries"; the choice of delimiters is just a prior-strength question.

Prefix caching cutting cost by 90% sounds great — where is it net-negative ROI?

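On Anthropic the default cache has a 5-minute TTL (refreshed on every hit) plus a write surcharge of 1.25× the base input price; reads cost 0.1×. A single hit inside the TTL already pays the write back: one write plus one read costs 1.25 + 0.10 = 1.35× versus 2.00× for two uncached calls. The cache loses money only when a write expires before anything reads it, because then every call pays 1.25× instead of 1×. A cron that fires once per hour against a 5-minute TTL is the textbook anti-pattern. With the 1-hour TTL the write costs 2×, so you need at least two hits per write to come out ahead.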
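
The break-even point falls out of a two-line cost model. A sketch, assuming Anthropic's published multipliers (1.25× write and 0.10× read for the 5-minute TTL, 2× write for the 1-hour TTL), that every call lands inside a single TTL window, and that the uncached tail of the prompt is negligible; relative_cost and break_even_calls are hypothetical helpers.

```python
def relative_cost(calls: int, write_multiplier: float = 1.25,
                  read_multiplier: float = 0.10) -> tuple[float, float]:
    """Return (uncached, cached) prefix cost for `calls` requests.

    Costs are in units of one full-price read of the prefix.
    """
    if calls < 1:
        raise ValueError("calls must be >= 1")
    uncached = float(calls)
    # First call writes the cache; every later call reads it.
    cached = write_multiplier + read_multiplier * (calls - 1)
    return uncached, cached


def break_even_calls(write_multiplier: float = 1.25,
                     read_multiplier: float = 0.10) -> int:
    """Smallest number of calls per TTL window at which caching is cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("caching never pays off if reads cost full price")
    calls = 1
    while True:
        uncached, cached = relative_cost(calls, write_multiplier, read_multiplier)
        if cached < uncached:
            return calls
        calls += 1
```

With the default multipliers break_even_calls() is 2 (one write, one hit); with write_multiplier=2.0 it is 3.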

CoT is a regression on reasoning models — what about non-reasoning models (e.g. Claude 3.5 Sonnet)? Where does CoT still win?

Sprague et al. 2024 show CoT only delivers meaningful gains (5–20%) on math and symbolic-reasoning tasks with multi-step chains. On intuitive tasks such as classification, extraction, and style transfer, CoT is at best flat and often introduces hallucinations. On reasoning models that ship with built-in CoT, layering more on top just pollutes the output.

In the four-layer structure, where should examples (few-shot) go and why? What happens if you put them wrong?

After format definition and before context. Reasons: (1) few-shot should imitate after task + format are clear, otherwise the model picks up example style and ignores the schema; (2) examples are usually stable across calls and should be cached; (3) putting them after user-specific context lets live data drift the pattern.

If role doesn't change but you add one new guardrail rule, can the cache still hit the role section?

Yes. Put a cache_control breakpoint at the end of role, with guardrails in a separate block after it. Caching matches token-by-token up to each breakpoint; if everything before the role breakpoint is identical, that segment still hits, and only the guardrails block onward is rewritten. Anthropic caps you at 4 breakpoints, so plan the "stable → slow-changing → live" segmentation carefully.
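
As a sketch, the system parameter for that setup is two text blocks, each with its own breakpoint. system_blocks is a hypothetical helper and the tag layout is an assumption; note that the role block only gets cached if the prefix up to it reaches the model's minimum cacheable length, and that tool definitions sit ahead of system in the cached prefix.

```python
def system_blocks(role: str, guardrails: list[str]) -> list[dict]:
    """Build a two-block system prompt: role first, guardrails second."""
    rules = "\n".join(f"- {rule}" for rule in guardrails)
    return [
        {"type": "text",
         "text": f"<role>\n{role}\n</role>",
         "cache_control": {"type": "ephemeral"}},   # survives guardrail edits
        {"type": "text",
         "text": f"<guardrails>\n{rules}\n</guardrails>",
         "cache_control": {"type": "ephemeral"}},   # rewritten when rules change
    ]
```

Adding a rule changes only the second block, so the first breakpoint still matches and the role segment is read from cache.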

// Further Reading

Anthropic · Prompt Engineering Overview — the official four-layer structure and XML recommendation
OpenAI Cookbook · GPT-4.1 Prompting Guide — read alongside for the Markdown camp
Sprague et al. 2024 · To CoT or not to CoT? — the actual gain boundary of CoT
Anthropic · Prompt Caching full docs — 4 breakpoint rules, TTL, billing
Simon Willison's Weblog — the most reliable LLM engineering observation feed in the industry