DAY 16 / PHASE 2 · APPLICATIONS

Cost Engineering

Token Economics · Prompt Caching · Model Routing · Batch & Monitoring

2026-05-31 · BigCat

"This model is expensive" is the wrong attribution. Cost is a product — every factor can be cut by an order of magnitude.

// WHY THIS MATTERS

The reality of 2026 LLM bills: between two heavy users on the same Sonnet, monthly spend ranges from $80 to $8000, and 90% of the gap is engineering precision, not usage volume. Anthropic's official docs price the bill along five separate dimensions (uncached input, cache write, cache read, batch, output), and ignoring any one of them can forfeit a 2–10× lever. Simon Willison's late-2024 takeaway: "LLM prices are collapsing." But the people who actually captured that windfall weren't the ones who switched to cheaper models; they were the ones who stacked caching + routing + batch, which can land below 5% of list price. This week covers four moves: dissect the bill at token granularity, engineer prompt caching, decide when to route to cheaper models, and ship batch + monitoring as a minimum-viable cost discipline.

// 01

Token Economics: Your Bill Is a Product, Not a Model Name

Claim: attributing cost to the model name ("Opus is expensive") is the #1 mistake of 99% of teams. The real attribution unit is "the five-way token distribution per call."

Background & Principles

Anthropic's actual billing formula (verifiable from the prompt caching docs):

cost = ( uncached_input × 1.0x        ← default input
       + cache_write_5m × 1.25x       ← write to 5-min cache
       + cache_write_1h × 2.0x        ← write to 1-hour cache
       + cache_read     × 0.1x        ← cache hit (-90%)
       + output         × out_rate    ← output (typically 5× input)
       ) × (0.5 if batch else 1.0)    ← Batch API -50%

Three counter-intuitive consequences: (1) the model name only sets the base rate — multiplied by cache hit rate and batch share, an "expensive" model often beats a "cheap" one. (2) Output is the real cost center — most models charge 5× for output vs. input; making the model "talk less" (structured output, explicit max_tokens, no rambling chain-of-thought as final answer) has higher ROI than swapping models. (3) cache_write costs 25% more than default input, so "just turn on caching and see" is negative ROI — you have to estimate hit rate first.
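To make consequence (1) concrete, here is a minimal sketch of the formula as a function. The `Usage` container and `call_cost` are hypothetical names, not SDK types, and the rates are illustrative placeholders ($3/$15 per million tokens for the larger model, $1/$5 for the smaller one); substitute your own price table.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Token counts for one call (hypothetical container, not an SDK type)."""
    uncached_input: int = 0
    cache_write_5m: int = 0
    cache_write_1h: int = 0
    cache_read: int = 0
    output: int = 0

def call_cost(u: Usage, in_rate: float, out_rate: float, batch: bool = False) -> float:
    """Cost in dollars; in_rate and out_rate are $ per million tokens."""
    input_side = (u.uncached_input * 1.0
                  + u.cache_write_5m * 1.25
                  + u.cache_write_1h * 2.0
                  + u.cache_read * 0.1) * in_rate
    output_side = u.output * out_rate
    discount = 0.5 if batch else 1.0   # the batch discount applies to the whole call
    return (input_side + output_side) * discount / 1e6

# Same request shape for both: a 20k-token prompt and a 300-token answer.
flagship_cached = call_cost(Usage(uncached_input=500, cache_read=19_500, output=300),
                            in_rate=3.0, out_rate=15.0)
small_uncached = call_cost(Usage(uncached_input=20_000, output=300),
                           in_rate=1.0, out_rate=5.0)
print(f"larger model, warm cache: ${flagship_cached:.4f}")
print(f"smaller model, no cache:  ${small_uncached:.4f}")
```

Under these assumed numbers the larger model with a warm cache comes out cheaper per call than the smaller model without one. That is the sense in which the model name only sets the base rate.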

Simon Willison's llm-prices.com is the comparison table worth bookmarking. His late-2024 punchline: captioning 68,000 photos with Gemini Flash 8B cost $1.68 in total — meaning the expensive thing was never "calling an LLM," it's "calling the wrong-tier LLM on content that didn't need calling."

Hands-On

Embed bill attribution into every call with a small wrapper:

import anthropic
client = anthropic.Anthropic()

def call_with_cost(messages, system, model="claude-sonnet-4-6"):
    r = client.messages.create(model=model, max_tokens=1024,
        system=[{"type":"text", "text":system,
                 "cache_control":{"type":"ephemeral"}}],
        messages=messages)
    u = r.usage
    # cache fields can be None when caching is not in play
    cw = u.cache_creation_input_tokens or 0
    cr = u.cache_read_input_tokens or 0
    # rates from llm-prices.com or your own table; Sonnet 4.6 example
    rate = {"in":3.0, "out":15.0, "cw5":3.75, "cr":0.30}  # $/M tok
    cost = (u.input_tokens  * rate["in"]
          + cw              * rate["cw5"]   # assumes 5-minute cache writes
          + cr              * rate["cr"]
          + u.output_tokens * rate["out"]) / 1e6
    # hit rate = cached reads / all input tokens
    hit = cr / max(1, u.input_tokens + cw + cr)
    print(f"${cost:.4f} | hit={hit:.0%}")
    return r, cost

Run it a week, then look at p50 / p95 cost distribution. 95% of teams discover: a handful of outlier requests (runaway output / cache miss) drive half the total bill. Those are the optimization targets.
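One way to read that distribution is sketched below, assuming you have logged the per-request `cost` values from the wrapper into a plain list. `cost_profile` and the synthetic `week` data are hypothetical; the percentile math uses the standard library's `statistics.quantiles`.

```python
import statistics

def cost_profile(costs: list[float], top_share: float = 0.05) -> dict:
    """Summarize per-request costs: p50, p95, and the bill share of the costliest requests."""
    ordered = sorted(costs)
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")  # 99 cut points
    top_n = max(1, int(len(ordered) * top_share))
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "top_share_of_bill": sum(ordered[-top_n:]) / sum(ordered),
    }

# Synthetic week: 95 ordinary requests plus 5 runaway ones.
week = [0.01] * 95 + [0.25] * 5
profile = cost_profile(week)
print(profile)
```

In this made-up sample the costliest 5% of requests carry more than half of the spend, which is the pattern the paragraph above describes.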

Failure modes: (1) fixating on "switch to a cheaper model" while ignoring output cap. Halving output usually saves more than swapping models. (2) using dashboard "total spend" as the decision input — you must attribute by per-feature cost and per-request p95 cost, otherwise you never know which request class is burning money.

Going deeper · Anthropic Prompt Caching docs, platform.claude.com/.../prompt-caching · Simon Willison LLM Pricing Calculator, llm-prices.com

// 02

Prompt Caching: Engineering the 90% Discount

Claim: the hard part of caching isn't the API — it's keeping the prefix stable. 99% of cache misses come from engineers unintentionally breaking prefix invariance.

Background & Principles

Anthropic's prompt caching contract: mark a breakpoint with cache_control: {type:"ephemeral"}, and everything from the start of the request up to that breakpoint (tools → system → messages, in order) forms one cache key. The next request that matches the prefix byte-for-byte pays 0.1× input price; change a single byte and the whole segment is rewritten at 1.25× write cost.

Three iron rules follow: (1) stable content first, volatile content last — tool schemas, system prompt, long docs, few-shot all up front; the user's current request at the very end. (2) Dynamic interpolation is a cache killer — any token that varies over time (current timestamp, random IDs, dynamic counters) inside the prefix drops hit rate to zero. (3) Max 4 breakpoints per request — typically placed after tools, after system, after history, leaving the live user turn uncached. Minimum cacheable length depends on the model: Sonnet typically 1024 tokens, Opus 4096 (per official docs); below that, the cache path isn't entered, so small requests can't be cached.

┌─ A high-hit-rate prompt layout ────────────────────┐
│ [tools schema      ] ← cache_control ① almost never changes
│ [system prompt     ] ← cache_control ② occasional revisions
│ [long doc / corpus ] ← cache_control ③ keyed by doc id
│ [conversation hist ] ← cache_control ④ rolling append
│ [user current turn ] ← never cached, always new
└────────────────────────────────────────────────────┘
hit rate: typically 70-95%    real bill: 15-30% of list
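A cheap guard for rule (2) is to fingerprint the supposedly stable prefix and log or assert on it. The sketch below is an illustration only: `prefix_fingerprint`, `build_request`, `TOOLS` and `SYSTEM` are hypothetical names, and the hash is a local proxy for "did my prefix change", not the provider's actual cache key.

```python
import hashlib
import json

def prefix_fingerprint(tools: list[dict], system: str) -> str:
    """Hash the cacheable prefix so accidental changes show up in logs or CI."""
    canonical = json.dumps({"tools": tools, "system": system}, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

TOOLS = [{"name": "search", "description": "Search the docs",
          "input_schema": {"type": "object",
                           "properties": {"q": {"type": "string"}}}}]
SYSTEM = "You are a support assistant."

def build_request(user_text: str, today: str) -> dict:
    """Stable content first; the date travels in the user turn, not the system prompt."""
    return {
        "tools": TOOLS,
        "system": [{"type": "text", "text": SYSTEM,
                    "cache_control": {"type": "ephemeral"}}],
        "messages": [{"role": "user",
                      "content": f"(today is {today})\n{user_text}"}],
    }

stable = prefix_fingerprint(TOOLS, SYSTEM)
broken = prefix_fingerprint(TOOLS, SYSTEM + " Today is 2026-05-31.")
print(stable == prefix_fingerprint(TOOLS, SYSTEM), stable == broken)
```

Emitting the fingerprint alongside each request makes a silent prompt edit visible as a changed value well before it shows up on the bill.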

Hands-On

The "rolling-append history" pattern is the most useful one to know — in a multi-turn conversation, put cache_control on the assistant message one turn before the latest user turn, so all prior history is cached:

messages = [
  {"role":"user",      "content":[{"type":"text","text":turn1_user}]},
  {"role":"assistant", "content":[{"type":"text","text":turn1_asst}]},
  {"role":"user",      "content":[{"type":"text","text":turn2_user}]},
  {"role":"assistant", "content":[{"type":"text","text":turn2_asst,
                                  "cache_control":{"type":"ephemeral"}}]},
  {"role":"user",      "content":[{"type":"text","text":turn3_user}]},   # new
]

When turn 3 arrives, the part of the history that an earlier request already wrote to the cache is billed as cache_read (0.1×), the turn-2 exchange behind the moved breakpoint is written to the cache once (1.25×), and only turn3_user is plain input. Next turn, move the breakpoint to the end of turn 3. This is the standard rolling-cache idiom, and it can cut a multi-turn bill by ~80%. The 5-minute TTL refreshes on every hit, so high-frequency sessions don't need the 1-hour cache (which costs 2× to write).
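Moving the breakpoint by hand is easy to get wrong, so a small helper can do it on every turn. This is a sketch under the assumption that every message's content is a list of block dicts, as in the example above; `with_rolling_breakpoint` and `text` are hypothetical helper names.

```python
import copy

def with_rolling_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy with exactly one cache breakpoint, on the last assistant message."""
    out = copy.deepcopy(messages)
    for msg in out:                      # clear breakpoints left over from earlier turns
        for block in msg["content"]:
            block.pop("cache_control", None)
    for msg in reversed(out):            # mark the most recent assistant turn
        if msg["role"] == "assistant":
            msg["content"][-1]["cache_control"] = {"type": "ephemeral"}
            break
    return out

def text(role: str, body: str) -> dict:
    """Build a single-block text message."""
    return {"role": role, "content": [{"type": "text", "text": body}]}

history = [text("user", "q1"), text("assistant", "a1"),
           text("user", "q2"), text("assistant", "a2")]
request_messages = with_rolling_breakpoint(history + [text("user", "q3")])
```

The helper copies its input, so the stored history stays free of cache markers and can be passed through again on the next turn.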

Failure modes: (1) injecting the current date into the system prompt: the cache invalidates at midnight every day. Put dates in the user message. (2) a tool definition calls a helper that returns a timestamp: the serialized schema changes on every call and all caches miss. (3) enabling cache on tiny requests: below the minimum threshold the cache path isn't entered, so you pay nothing extra but also save nothing. Monitor the real hit rate as cache_read_input_tokens / (input_tokens + cache_creation_input_tokens + cache_read_input_tokens); the API's input_tokens field counts only the uncached part. Below 40% means your prefix is unstable: restructure the prompt, don't disable caching.

Going deeper · Anthropic Prompt Caching with Claude, anthropic.com/news/prompt-caching · Anthropic Cookbook caching examples, github.com/anthropics/anthropic-cookbook

// 03

Model Routing & Cascading: Let the Cheap Model Try First

Claim: a single-model strategy is waste — 70% of requests are fine on Haiku, but you have to engineer the routing.

Background & Principles

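Real traffic difficulty has a long tail: most requests (classification, extraction, rewriting, simple code) are fine for a small model; a minority (reasoning, long context, complex code) actually needs the flagship. Two engineering approaches follow from that: routing, where a rule or lightweight classifier picks the tier before the call, and cascading, where the cheap model answers first and the request escalates only when a check rejects the answer.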

When is routing negative ROI? Two red lines: (1) routing cost > savings — running an extra LLM classification before each Haiku call can already eat the margin; (2) routing-error cost > savings — routing a medical diagnosis to a mini model saves $0.01 but a single mistake costs $10,000. Rule: routing fits high-volume, low-error-cost, high-difficulty-variance workloads (customer support, extraction, triage, batch generation).
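The two red lines can be folded into one back-of-the-envelope expression. The function name and every number below are illustrative assumptions, not measurements; the point is the shape of the calculation.

```python
def routing_net_saving(cost_flagship: float, cost_cheap: float,
                       cheap_share: float, router_cost: float,
                       misroute_rate: float, misroute_penalty: float) -> float:
    """Expected dollars saved per request by routing; negative means routing loses money."""
    gross = cheap_share * (cost_flagship - cost_cheap)
    return gross - router_cost - misroute_rate * misroute_penalty

# Support triage: mistakes are cheap, so routing pays.
support = routing_net_saving(0.012, 0.002, cheap_share=0.7,
                             router_cost=0.0001, misroute_rate=0.02,
                             misroute_penalty=0.05)
# High-stakes workload: one rare mistake outweighs all the savings.
high_stakes = routing_net_saving(0.012, 0.002, cheap_share=0.7,
                                 router_cost=0.0001, misroute_rate=0.0001,
                                 misroute_penalty=10_000)
print(f"support: {support:+.4f}  high-stakes: {high_stakes:+.4f}")
```

With these inputs the support workload nets a small positive saving per request, while the high-stakes one is deeply negative even at a one-in-ten-thousand misroute rate.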

Hands-On

Don't start by training a router — start by using an open-source proxy like LiteLLM to unify providers behind one OpenAI-compatible endpoint, then add fallback rules. Working in 10 minutes:

# litellm config.yaml: default to the cheap tier, escalate on error
model_list:
  - model_name: tier-cheap
    litellm_params: {model: claude-haiku-4-5-20251001, api_key: os.environ/ANTHROPIC_API_KEY}
  - model_name: tier-mid
    litellm_params: {model: claude-sonnet-4-6,         api_key: os.environ/ANTHROPIC_API_KEY}

router_settings:
  fallbacks:
    - tier-cheap: [tier-mid]   # 429 / 5xx / context-window errors escalate
  context_window_fallbacks:
    - tier-cheap: [tier-mid]   # overflow on Haiku falls to Sonnet

# client side (an OpenAI-compatible client pointed at the LiteLLM proxy): call tier-cheap by default; force tier-mid only for complex requests
client.chat.completions.create(model="tier-cheap", messages=[...])

This config does three things: error fallback, context-window fallback, and a unified OpenAI interface (no if-else branching in client code). Combined with LiteLLM's cost tracking, you see per-tier share and unit cost directly — far more pragmatic than training a router from scratch.

Failure modes: (1) using an LLM as a router in the hot path — every request adds one LLM call, and classification latency + cost eats the savings. Prefer rule-based routing or embedding thresholds. (2) trusting the cheap model's self-eval during cascading (sycophancy — see Day 9) — don't ask "are you sure?", use an external judge or a task-built-in verifier.
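Failure mode (2) suggests a cascade whose escalation decision comes from a check outside the model. Below is a minimal sketch with the model calls injected as plain callables; `cascade`, `is_valid_extraction`, and the two `fake_*` stand-ins are hypothetical, and the verifier here is a task-built-in one (the answer must parse as JSON with a required field).

```python
import json
from typing import Callable

def cascade(prompt: str,
            cheap: Callable[[str], str],
            strong: Callable[[str], str],
            verify: Callable[[str], bool]) -> tuple[str, str]:
    """Try the cheap model; escalate only when an external check rejects its answer."""
    answer = cheap(prompt)
    if verify(answer):
        return answer, "cheap"
    return strong(prompt), "strong"

def is_valid_extraction(answer: str) -> bool:
    """Task-built-in verifier: the answer must be JSON with a non-empty 'invoice_id'."""
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and bool(data.get("invoice_id"))

# Stand-ins for real model calls (hypothetical).
def fake_cheap(prompt: str) -> str:
    return '{"invoice_id": "A-17"}' if "invoice" in prompt else "not sure"

def fake_strong(prompt: str) -> str:
    return '{"invoice_id": "B-42"}'

print(cascade("extract from invoice A-17", fake_cheap, fake_strong, is_valid_extraction))
print(cascade("extract from scanned memo", fake_cheap, fake_strong, is_valid_extraction))
```

The returned tier label is worth logging: the share of requests that escalate is the number that tells you whether the cascade is paying for itself.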

Going deeper · LMSys RouteLLM, lmsys.org/blog/routellm · arXiv 2406.18665 · LiteLLM Router docs, docs.litellm.ai/docs/routing

// 04

Batch API + Cost Monitoring & Alerts

Claim: anything that tolerates hour-level latency should go to batch; any cost you can't monitor will get out of control.

Background & Principles

Anthropic's Message Batches API and OpenAI's Batch API share almost identical contracts: async submission, a 24-hour completion window, -50% on every token, stackable with prompt caching (theoretical floor for cached input tokens ≈ 5% of list: 0.5 × 0.1). The engineering point is not "half-price"; it's that separating async-tolerable work from must-be-sync work makes the former structurally cheaper.

Three questions to decide if it batches: (1) Is a user waiting? No → batchable. (2) Does the next step block the current trace? No → batchable. (3) Can failure retries tolerate 24h? Yes → batchable. Classic fits: nightly eval, historical backfill, bulk document summarization, embedding computation, report generation. Classic misfits: live user conversation, agent loop internal tool calls, anything needing immediate feedback in the UI.

Monitoring is the layer most often skipped — until the monthly bill triples. The minimum-viable monitor is three signals: per-request cost, cache hit rate, model tier distribution. A sudden change in any one corresponds to a class of bug (prompt change broke cache / upstream misrouted to flagship / runaway output).

Hands-On

# A) Batch submit — one-line swap for -50%
from anthropic.types.messages.batch_create_params import Request
batch = client.messages.batches.create(requests=[
    Request(custom_id=f"doc-{i}", params={"model":"claude-sonnet-4-6",
        "max_tokens":1024, "messages":[{"role":"user","content":doc}]})
    for i, doc in enumerate(docs)
])
# poll batch.id until processing_status == "ended", then fetch results

# B) Emit the three signals — wire into any metrics backend
def emit(usage, model, feature, cost):
    read  = usage.cache_read_input_tokens or 0       # cache fields can be None
    write = usage.cache_creation_input_tokens or 0
    hit = read / max(1, read + write + usage.input_tokens)
    metrics.histogram("llm.cost_usd",        cost, tags=[feature, model])
    metrics.histogram("llm.cache_hit",       hit,  tags=[feature, model])
    metrics.histogram("llm.output_tokens", usage.output_tokens, tags=[feature])

# C) Three alert rules (PromQL / Datadog, same shape)
# 1) p95(llm.cost_usd) by feature  > 3× historical          ← sudden burn
# 2) avg(llm.cache_hit) by feature < 0.4 for 1h             ← prefix broken
# 3) sum(cost) by model{tier=premium} / sum(cost) > 0.6     ← routing failed

This monitor takes less than a day to ship and prevents the next "where did the money go?" incident. Make batch-share and cache-hit-rate the headline metrics on the dashboard, and you'll notice that "prompt optimization" and "cost optimization" are very often the same activity.
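The "poll until ended, then fetch results" step from part A can be sketched as follows. It assumes the Anthropic Python SDK's `client.messages.batches.retrieve` and `client.messages.batches.results` calls as I understand them; `collect_batch`, `poll_seconds`, and the injectable `sleep` are hypothetical names added for illustration and testability.

```python
import time

def collect_batch(client, batch_id: str, poll_seconds: float = 60.0,
                  sleep=time.sleep) -> dict:
    """Wait for a Message Batch to end, then map custom_id -> message (None on failure)."""
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        sleep(poll_seconds)
    results = {}
    for entry in client.messages.batches.results(batch_id):
        ok = entry.result.type == "succeeded"
        # errored, canceled, and expired entries carry no message
        results[entry.custom_id] = entry.result.message if ok else None
    return results
```

Entries that map to `None` are the ones to resubmit; keeping the `custom_id` stable (for example `doc-{i}` as above) makes that retry a simple filter over the original inputs.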

Failure modes: (1) shoving an agent loop's intermediate tool calls into batch — agents are sync, each step depends on the previous, batch breaks the loop. Batch only fits independent, parallelizable, non-blocking requests. (2) attributing cost at the model dimension — the right granularity is feature dimension (search / summarize / coding / classification); the same model has wildly different cost structure across features.

Going deeper · Anthropic Introducing the Message Batches API, anthropic.com/news/message-batches-api · OpenAI Batch API docs, platform.openai.com/docs/guides/batch

// Combined Drill · One Week to a 70%-Off Bill

Stitch the four points into a real working playbook — pull the bill Monday, see results Friday:

  1. Day 1 · Attribution: slice last month's bill on three dimensions: feature × model × cache-hit. Most teams find 1–2 features account for 60%+ of total spend — hit those two first, leave the rest alone.
  2. Day 2 · Output cap: check max_tokens on the top-cost features. It's usually left at 4096 while typical output is a few hundred tokens, so nothing stops the runaway outliers. Tighten it to a sane ceiling and ask for concise or structured output; this typically cuts 20–40%.
  3. Day 3 · Cache restructure: rewrite the prompt structure for top-cost features — stable content first, volatile last, breakpoints inserted. Target hit rate ≥ 70% — another 50%+ off input cost.
  4. Day 4 · Tier downshift: run 50 representative samples on Haiku vs. Sonnet. Features where quality holds → downshift; ones that don't → wire LiteLLM fallback.
  5. Day 5 · Batch carving: identify jobs that tolerate 24h (eval / backfill / offline analysis) and switch to batch endpoint — automatic -50%.
  6. Day 5 · Monitoring: wire the three alerts above into a metrics backend — this is the only way to prevent the bill from creeping back next month.
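The Day 1 attribution step can be sketched as follows, assuming last month's usage has been exported as rows with `feature`, `model`, and `cost_usd` fields. Those field names, the `top_features` helper, and the sample rows are all hypothetical.

```python
from collections import defaultdict

def top_features(rows: list[dict], target_share: float = 0.6) -> list[tuple[str, float]]:
    """Smallest set of features, largest first, whose spend reaches target_share of the bill."""
    by_feature: dict[str, float] = defaultdict(float)
    for row in rows:
        by_feature[row["feature"]] += row["cost_usd"]
    total = sum(by_feature.values())
    picked, running = [], 0.0
    for feature, cost in sorted(by_feature.items(), key=lambda kv: kv[1], reverse=True):
        picked.append((feature, cost / total))
        running += cost
        if running / total >= target_share:
            break
    return picked

rows = [
    {"feature": "summarize", "model": "sonnet", "cost_usd": 410.0},
    {"feature": "summarize", "model": "haiku",  "cost_usd": 40.0},
    {"feature": "search",    "model": "sonnet", "cost_usd": 180.0},
    {"feature": "classify",  "model": "haiku",  "cost_usd": 70.0},
]
print(top_features(rows))
```

In this sample a single feature crosses the 60% line on its own, so it would be the only target for Days 2 through 4.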

Stacked conservatively, that's 60–80% off without quality loss. The most valuable byproduct: you'll develop a "read the prompt, estimate the bill" engineering muscle — one of the basic skills of an AI super-individual.

// Deep Thinking

Cache write costs 1.25× base. In what scenarios is opening cache actually negative ROI? How do you compute breakeven?
Breakeven point: cache write costs 0.25× extra; each hit saves 0.9×. So you need at least 0.25/0.9 ≈ 0.28 future hits — i.e. at least one more access to the same prefix within 5 minutes. Conclusion: one-shot requests (search, single QA) are negative ROI; multi-turn conversations, agent loops, and batch over the same document are positive ROI. Decision rule: at write time, can you predict "this prefix will be requested again within 5 minutes"? If not, don't enable cache.
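The same arithmetic extends to the 1-hour cache. A small sketch, using the 1.25×, 2.0×, and 0.1× multipliers quoted earlier; `breakeven_hits` is a hypothetical helper name.

```python
import math

def breakeven_hits(write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cache reads needed before a cache write pays for itself."""
    extra_write = write_multiplier - 1.0      # premium paid once, at write time
    saved_per_hit = 1.0 - read_multiplier     # saved on every later read
    return extra_write / saved_per_hit

for label, mult in [("5-minute", 1.25), ("1-hour", 2.0)]:
    hits = breakeven_hits(mult)
    print(f"{label}: {hits:.2f} hits, so {math.ceil(hits)} whole read(s) to come out ahead")
```

The 1-hour cache needs about 1.11 hits, which rounds up to two reads within the hour before it beats not caching at all.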
Output is typically 5× input price. Why do providers price it this way? Does it mean output compute is 5× too?
Not entirely. Technically, output is more expensive because of autoregressive decoding — each output token runs a full forward pass (using the entire KV cache), while input is a single batched prefill. But the 5× markup includes commercial factors: output volume is small so per-unit price tolerates a higher markup, output drives perceived quality, and output is the main differentiator. Engineering implication: instead of switching to a cheaper model, first reduce output length (structured output, tight max_tokens, don't have the model repeat its input).
RouteLLM reported 85% cost reduction. Why is that number rare in production deployments?
Three reasons: (1) the paper baseline is "100% GPT-4" — real teams already use mid-tier models, leaving much less headroom. (2) the paper evaluates quality on MT Bench and the like; real-world edge-case distribution is broader, and router accuracy drops. (3) the routing decision itself has cost (embedding / classifier / latency) which papers often exclude. In practice, 30–50% is a solid result; expecting 85% will disappoint. But stacked with cache + batch, a 70% total reduction is a reasonable goal.
Batch API looks like a no-brainer. Why do many teams with batch-eligible workloads still not batch?
Three hidden frictions: (1) mental model cost — sync API is request-response; batch is submit-poll-collect, requiring a state machine and retry logic rewrite. (2) UX inertia — users are used to "instant reply", and there's product habit to wrap even background work as "process now". (3) orchestration complexity — batch fits "many homogeneous requests", but real workflows are often heterogeneous; forcing batch can actually slow the critical path. Pragmatic move: identify 1–2 obviously batchable scenarios (eval, backfill), wire them with a minimal poll loop. Don't start with Temporal-class workflow engines.
Cost monitoring minimum is cost, cache_hit, tier distribution. If you can only have one alert, which one?
Pick per-feature cache_hit_rate sudden drop. Reasoning: absolute cost fluctuates naturally with business growth, single outliers are individual not systemic issues, and tier-distribution changes are usually intentional. But a cache-hit drop is almost always a bug — an engineer quietly edited the system prompt, added a timestamp, or tweaked a tool schema. None of those trigger any alert, but they show up as 5–10× bill spikes by month-end. Cache hit is a leading indicator; cost is a lagging one.
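If that single alert had to be written by hand, it might look like the sketch below. The 0.4 floor comes from the alert rules earlier in this section; the 30% relative-drop threshold, the function name, and the sample series are assumptions to tune against your own traffic.

```python
from statistics import mean

def cache_hit_dropped(baseline: list[float], recent: list[float],
                      floor: float = 0.4, max_relative_drop: float = 0.3) -> bool:
    """True when the recent hit rate is under the floor or far below the baseline."""
    base, now = mean(baseline), mean(recent)
    return now < floor or now < base * (1 - max_relative_drop)

healthy = [0.82, 0.85, 0.80, 0.84]
after_prompt_edit = [0.12, 0.08, 0.10]
print(cache_hit_dropped(healthy, healthy[-2:]),
      cache_hit_dropped(healthy, after_prompt_edit))
```

Run it per feature rather than globally, so that one feature's broken prefix is not averaged away by the others.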

// Further Reading