"This model is expensive" is the wrong attribution. Cost is a product — every factor can be cut by an order of magnitude.
The reality of 2026 LLM bills: between two heavy users on the same Sonnet, monthly spend ranges from $80 to $8000 — and 90% of the gap is engineering precision, not usage volume. Anthropic's official docs break the bill into five multiplicative factors — uncached input, cache write, cache read, batch, output — and missing any single one of them forfeits a 5–10× lever. Simon Willison's late-2024 takeaway: "LLM prices are collapsing." But the people who actually captured that windfall weren't the ones who switched to cheaper models — they were the ones who stacked caching + routing + batch, which can land below 5% of list price. This week covers four moves: dissect the bill at token granularity, engineer prompt caching, decide when to route to cheaper models, and ship batch + monitoring as a minimum-viable cost discipline.
Anthropic's actual billing formula (verifiable from the prompt caching docs):
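The formula itself is not reproduced at this point in the text. A reconstruction consistent with the multipliers quoted throughout this section (1.25× for a 5-minute cache write, 0.1× for a cache read, 0.5× for batch) might look like the following. The symbols are notation assumed for illustration: the N terms are token counts, r_in and r_out are the model's base per-token rates, and b is the batch multiplier.

```latex
\[
\text{cost} \;=\; b \cdot \Bigl(
    N_{\text{in}}\, r_{\text{in}}
  + N_{\text{cw}} \cdot 1.25\, r_{\text{in}}
  + N_{\text{cr}} \cdot 0.1\, r_{\text{in}}
  + N_{\text{out}}\, r_{\text{out}}
\Bigr),
\qquad
b = \begin{cases} 0.5 & \text{batch request} \\ 1 & \text{otherwise} \end{cases}
\]
```

Here N_in is uncached input, N_cw is tokens written to the cache, N_cr is tokens read from the cache, and N_out is output tokens.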
Three counter-intuitive consequences: (1) the model name only sets the base rate — multiplied by cache hit rate and batch share, an "expensive" model often beats a "cheap" one. (2) Output is the real cost center — most models charge 5× for output vs. input; making the model "talk less" (structured output, explicit max_tokens, no rambling chain-of-thought as final answer) has higher ROI than swapping models. (3) cache_write costs 25% more than default input, so "just turn on caching and see" is negative ROI — you have to estimate hit rate first.
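Consequence (1) can be checked with a few lines of arithmetic. The sketch below computes a blended input price from cache hit rate and batch share; it deliberately ignores the one-time write surcharge, and the $1/M "cheap model" rate is a hypothetical figure chosen only for comparison.

```python
def effective_input_rate(base_rate, hit_rate, batch=False,
                         read_mult=0.10, batch_mult=0.50):
    """Blended $/M input price for a given cache hit rate.

    Simplification: the 1.25x cache-write surcharge is ignored, so this
    approximates steady state once the prefix is already cached.
    """
    blended = hit_rate * read_mult + (1.0 - hit_rate)
    return base_rate * blended * (batch_mult if batch else 1.0)

# Sonnet-class base rate ($3/M) with a 90% hit rate vs. a hypothetical
# $1/M model called with no caching at all.
flagship = effective_input_rate(3.0, hit_rate=0.9)
cheap = effective_input_rate(1.0, hit_rate=0.0)
print(f"flagship={flagship:.2f} $/M, cheap={cheap:.2f} $/M")
```

With a fully cached prefix submitted through batch, the same function gives 0.5 × 0.1 = 5% of the base rate.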
Simon Willison's llm-prices.com is the comparison table worth bookmarking. His late-2024 punchline: captioning 68,000 photos with Gemini Flash 8B cost $1.68 in total — meaning the expensive thing was never "calling an LLM," it's "calling the wrong-tier LLM on content that didn't need calling."
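The $1.68 figure can be roughly reproduced. The per-photo token counts and per-million rates below are assumptions (approximate values for Gemini 1.5 Flash 8B at the time, not taken from this text), so treat this as a sanity check on the order of magnitude rather than an exact reconstruction.

```python
PHOTOS = 68_000
INPUT_TOKENS_PER_PHOTO = 260    # assumed: image plus short prompt
OUTPUT_TOKENS_PER_PHOTO = 100   # assumed: one short caption
INPUT_RATE = 0.0375             # assumed $/M input tokens
OUTPUT_RATE = 0.15              # assumed $/M output tokens

def captioning_cost(photos, tok_in, tok_out, rate_in, rate_out):
    """Total cost in dollars for captioning `photos` images."""
    input_cost = photos * tok_in * rate_in / 1e6
    output_cost = photos * tok_out * rate_out / 1e6
    return input_cost + output_cost

total = captioning_cost(PHOTOS, INPUT_TOKENS_PER_PHOTO, OUTPUT_TOKENS_PER_PHOTO,
                        INPUT_RATE, OUTPUT_RATE)
print(f"${total:.2f}")
```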
Embed bill attribution into every call. A short wrapper is enough:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# $/M tokens; rates from llm-prices.com or your own table (Sonnet 4.6 example)
RATE = {"in": 3.0, "out": 15.0, "cw5": 3.75, "cr": 0.30}

def call_with_cost(messages, system, model="claude-sonnet-4-6"):
    r = client.messages.create(
        model=model, max_tokens=1024,
        # cache breakpoint at the end of the system prompt
        system=[{"type": "text", "text": system,
                 "cache_control": {"type": "ephemeral"}}],
        messages=messages)
    u = r.usage
    cw = u.cache_creation_input_tokens or 0  # cache fields can be None
    cr = u.cache_read_input_tokens or 0
    cost = (u.input_tokens * RATE["in"] + cw * RATE["cw5"]
            + cr * RATE["cr"] + u.output_tokens * RATE["out"]) / 1e6
    # hit rate = cached reads over all input tokens (uncached + written + read)
    hit = cr / max(1, u.input_tokens + cw + cr)
    print(f"${cost:.4f} | hit={hit:.0%}")
    return r, cost
Run it for a week, then look at the p50 / p95 cost distribution. Most teams discover that a handful of outlier requests (runaway output, cache misses) drive about half the total bill. Those are the optimization targets.
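One way to read that distribution, as a minimal sketch: it assumes the per-request costs have been collected into a plain list of floats (the function name and return shape are illustrative, not part of any SDK).

```python
import statistics

def cost_summary(costs):
    """Return p50, p95, and the share of total spend above p95."""
    ordered = sorted(costs)
    # 99 cut points; index 49 is the median, index 94 the 95th percentile
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    total = sum(ordered)
    tail = sum(c for c in ordered if c > p95)
    return {"p50": p50, "p95": p95,
            "tail_share": tail / total if total else 0.0}

# Toy data: 95 cheap requests and 5 runaway ones.
summary = cost_summary([0.01] * 95 + [1.0] * 5)
print(summary)
```

A high `tail_share` means the requests above p95 are where the money goes, so inspect those first.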
Anthropic's prompt caching contract: mark a breakpoint with cache_control: {type:"ephemeral"}, and everything from the start of the request up to that breakpoint (tools → system → messages, in order) forms one cache key. The next request that matches the prefix byte-for-byte pays 0.1× input price; change a single byte and the whole segment is rewritten at 1.25× write cost.
Three iron rules follow: (1) stable content first, volatile content last — tool schemas, system prompt, long docs, few-shot all up front; the user's current request at the very end. (2) Dynamic interpolation is a cache killer — any token that varies over time (current timestamp, random IDs, dynamic counters) inside the prefix drops hit rate to zero. (3) Max 4 breakpoints per request — typically placed after tools, after system, after history, leaving the live user turn uncached. Minimum cacheable length depends on the model: Sonnet typically 1024 tokens, Opus 4096 (per official docs); below that, the cache path isn't entered, so small requests can't be cached.
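Rules (1) and (2) can be illustrated without calling the API. In this sketch the system prompt text and the helper names are hypothetical; the point is that the timestamp lives in the final user turn, so the cached prefix stays byte-identical across requests.

```python
import hashlib
import json

# Hypothetical stable prefix: no timestamps, IDs, or counters inside it.
STATIC_SYSTEM = "You are a support assistant. Answer using the policy manual."

def build_request(user_text, now_iso):
    """Stable content first (cached), volatile content in the last turn."""
    system = [{"type": "text", "text": STATIC_SYSTEM,
               "cache_control": {"type": "ephemeral"}}]
    messages = [{"role": "user",
                 "content": f"[current time: {now_iso}]\n{user_text}"}]
    return {"system": system, "messages": messages}

def prefix_fingerprint(request):
    """Hash of the cached prefix; it should not change between requests."""
    blob = json.dumps(request["system"], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

a = build_request("Where is my order?", "2026-01-01T00:00:00Z")
b = build_request("Where is my order?", "2026-01-01T00:05:00Z")
print(prefix_fingerprint(a) == prefix_fingerprint(b))
```

Logging such a fingerprint per request is a cheap way to catch a prompt change that silently breaks the cache.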
The "rolling-append history" pattern is the most useful one to know — in a multi-turn conversation, put cache_control on the assistant message one turn before the latest user turn, so all prior history is cached:
messages = [
    {"role": "user", "content": [{"type": "text", "text": turn1_user}]},
    {"role": "assistant", "content": [{"type": "text", "text": turn1_asst}]},
    {"role": "user", "content": [{"type": "text", "text": turn2_user}]},
    {"role": "assistant", "content": [{"type": "text", "text": turn2_asst,
                                       "cache_control": {"type": "ephemeral"}}]},
    {"role": "user", "content": [{"type": "text", "text": turn3_user}]},  # new
]
When turn 3 arrives, the prefix cached on the previous turn is read at 0.1×, the messages added since that breakpoint are written to the cache at 1.25×, and only turn3_user is plain uncached input. Next turn, move the breakpoint to the end of turn 3. This is the standard rolling-cache idiom, and it can cut a multi-turn bill by roughly 80%. The 5-minute TTL refreshes on every hit, so high-frequency sessions don't need the 1h cache (whose writes cost 2× base input).
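Moving the breakpoint each turn is easy to automate. The helper below is a sketch, assuming every message's content is a list of blocks as in the example above; the function name is illustrative.

```python
import copy

def with_rolling_breakpoint(history, new_user_text):
    """Return a new message list with one breakpoint on the last prior message."""
    msgs = copy.deepcopy(history)  # never mutate the caller's history
    for message in msgs:
        for block in message["content"]:
            block.pop("cache_control", None)  # drop the previous breakpoint
    if msgs:
        msgs[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    msgs.append({"role": "user",
                 "content": [{"type": "text", "text": new_user_text}]})
    return msgs

history = [
    {"role": "user", "content": [{"type": "text", "text": "hi"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "hello",
                                       "cache_control": {"type": "ephemeral"}}]},
]
next_request = with_rolling_breakpoint(history, "what's my balance?")
```

Using a single moving breakpoint keeps the request well under the limit of four breakpoints, leaving room for separate ones after tools and system.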
Track the cache hit rate as cache_read_input_tokens / total_input. Below 40% means your prefix is unstable: restructure the prompt, don't disable caching.
Real traffic difficulty has a long tail: most requests (classification, extraction, rewriting, simple code) are fine for a small model; a minority (reasoning, long context, complex code) actually needs the flagship. Two engineering approaches:
When is routing negative ROI? Two red lines: (1) routing cost > savings — running an extra LLM classification before each Haiku call can already eat the margin; (2) routing-error cost > savings — routing a medical diagnosis to a mini model saves $0.01 but a single mistake costs $10,000. Rule: routing fits high-volume, low-error-cost, high-difficulty-variance workloads (customer support, extraction, triage, batch generation).
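Both red lines reduce to one expected-value check. The sketch below uses hypothetical per-request dollar figures; `extra_error_rate` is the additional error probability introduced by using the cheaper model, which in practice has to be estimated from an eval set.

```python
def routing_net_savings(flagship_cost, cheap_cost, router_cost,
                        extra_error_rate, error_cost):
    """Expected dollars saved per request by routing to the cheap model."""
    saved = flagship_cost - cheap_cost
    expected_error_loss = extra_error_rate * error_cost
    return saved - router_cost - expected_error_loss

# High-stakes case: $0.01 saved, but a mistake costs $10,000.
diagnosis = routing_net_savings(0.011, 0.001, 0.0, 0.001, 10_000)
# Support triage: same savings, a small router cost, and cheap mistakes.
triage = routing_net_savings(0.011, 0.001, 0.0005, 0.01, 0.05)
print(f"diagnosis={diagnosis:.4f}, triage={triage:.4f}")
```

A negative result means routing loses money on average, however cheap the small model is.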
Don't start by training a router — start by using an open-source proxy like LiteLLM to unify providers behind one OpenAI-compatible endpoint, then add fallback rules. Working in 10 minutes:
# litellm config.yaml: send traffic to the cheap model by default, escalate on error
model_list:
  - model_name: tier-cheap
    litellm_params: {model: claude-haiku-4-5-20251001, api_key: os.environ/ANTHROPIC_API_KEY}
  - model_name: tier-mid
    litellm_params: {model: claude-sonnet-4-6, api_key: os.environ/ANTHROPIC_API_KEY}
router_settings:
  fallbacks:
    - tier-cheap: [tier-mid]  # 429 / 5xx / context-window errors escalate
  context_window_fallbacks:
    - tier-cheap: [tier-mid]  # overflow on Haiku falls to Sonnet

# client side (Python, OpenAI SDK pointed at the proxy):
# just call tier-cheap; force tier-mid only for complex requests
client.chat.completions.create(model="tier-cheap", messages=[...])
This config does three things: error fallback, context-window fallback, and a unified OpenAI interface (no if-else branching in client code). Combined with LiteLLM's cost tracking, you see per-tier share and unit cost directly — far more pragmatic than training a router from scratch.
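The "force tier-mid only for complex requests" decision can start as a plain rule rather than a model. The heuristic below is a sketch: the keyword list and the length threshold are hypothetical and should be tuned against your own traffic. The tier names match the config above.

```python
# Hypothetical markers of harder requests; replace with your own.
HARD_HINTS = ("prove", "refactor", "step by step", "architecture")

def pick_tier(prompt, long_prompt_chars=8000):
    """Return the model alias to request from the proxy."""
    text = prompt.lower()
    if len(prompt) > long_prompt_chars or any(hint in text for hint in HARD_HINTS):
        return "tier-mid"
    return "tier-cheap"

print(pick_tier("Classify this ticket: printer offline"))
print(pick_tier("Refactor this module to remove the global state"))
```

Because the rule is free to evaluate, it avoids the first red line above: the routing step itself costs nothing per request.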
Anthropic's Message Batches API and OpenAI's Batch API share almost identical contracts: async submission, 24-hour SLA, -50% on every token, stackable with prompt caching (theoretical floor ≈ 5% of list — 0.5 × 0.1). The engineering point is not "half-price" — it's that separating async-tolerable work from must-be-sync work makes the former structurally cheaper.
Three questions to decide if it batches: (1) Is a user waiting? No → batchable. (2) Does the next step block the current trace? No → batchable. (3) Can failure retries tolerate 24h? Yes → batchable. Classic fits: nightly eval, historical backfill, bulk document summarization, embedding computation, report generation. Classic misfits: live user conversation, agent loop internal tool calls, anything needing immediate feedback in the UI.
Monitoring is the layer most often skipped — until the monthly bill triples. The minimum-viable monitor is three signals: per-request cost, cache hit rate, model tier distribution. A sudden change in any one corresponds to a class of bug (prompt change broke cache / upstream misrouted to flagship / runaway output).
# A) Batch submit: one-line swap for -50%
from anthropic.types.messages.batch_create_params import Request

batch = client.messages.batches.create(requests=[
    Request(custom_id=f"doc-{i}", params={"model": "claude-sonnet-4-6",
            "max_tokens": 1024, "messages": [{"role": "user", "content": doc}]})
    for i, doc in enumerate(docs)
])
# poll batch.id until processing_status == "ended", then fetch results

# B) Emit the three signals; wire into any metrics backend
def emit(usage, model, feature, cost):
    hit = usage.cache_read_input_tokens / max(1, usage.cache_read_input_tokens
                                              + usage.cache_creation_input_tokens
                                              + usage.input_tokens)
    metrics.histogram("llm.cost_usd", cost, tags=[feature, model])
    metrics.histogram("llm.cache_hit", hit, tags=[feature, model])
    metrics.histogram("llm.output_tokens", usage.output_tokens, tags=[feature])

# C) Three alert rules (PromQL / Datadog, same shape)
# 1) p95(llm.cost_usd) by feature > 3× historical          ← sudden burn
# 2) avg(llm.cache_hit) by feature < 0.4 for 1h            ← prefix broken
# 3) sum(cost) by model{tier=premium} / sum(cost) > 0.6    ← routing failed
This monitor takes less than a day to ship and prevents the next "where did the money go?" incident. Make batch-share and cache-hit-rate the headline metrics on the dashboard, and you'll notice that "prompt optimization" and "cost optimization" are very often the same activity.
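Part A of the code stops at a "poll until ended" comment. One possible shape for that polling step is sketched below; it takes the client as a parameter, and the function name, poll interval, and return shape are assumptions rather than anything prescribed by the SDK.

```python
import time

def wait_for_batch(client, batch_id, poll_seconds=60, sleep=time.sleep):
    """Block until the batch ends, then split results into done and failed."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        sleep(poll_seconds)

    done, failed = {}, []
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            done[entry.custom_id] = entry.result.message
        else:  # errored, canceled, or expired
            failed.append(entry.custom_id)
    return done, failed
```

Failed IDs can simply be resubmitted in the next batch, which is why the "can retries tolerate 24h?" question matters.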
Stitch the four points into a real working playbook — pull the bill Monday, see results Friday:
Cap max_tokens on the top-cost features. It's usually defaulted to 4096 while actual output is a few hundred. Tighten to a sane ceiling; this instantly cuts 20–40%.
Stacked conservatively, that's 60–80% off without quality loss. The most valuable byproduct: you'll develop a "read the prompt, estimate the bill" engineering muscle, one of the basic skills of an AI super-individual.
Break-even for caching: a cache write costs an extra 0.25× of base input, and each later hit saves 0.9×, so a write pays for itself after 0.25/0.9 ≈ 0.28 future hits, i.e. at least one more access to the same prefix within 5 minutes. Conclusion: one-shot requests (search, single QA) are negative ROI; multi-turn conversations, agent loops, and batch over the same document are positive ROI. Decision rule: at write time, can you predict "this prefix will be requested again within 5 minutes"? If not, don't enable cache.
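The break-even arithmetic can be checked in a few lines. This sketch works in units of "prefix tokens × base input price" and uses the 1.25× write and 0.1× read multipliers quoted earlier; the function names are illustrative.

```python
def break_even_hits(write_mult=1.25, read_mult=0.10):
    """Expected future hits needed for one cache write to pay for itself."""
    extra_write_cost = write_mult - 1.0   # surcharge over plain input
    saving_per_hit = 1.0 - read_mult      # discount on each cached read
    return extra_write_cost / saving_per_hit

def caching_net_savings(expected_hits, write_mult=1.25, read_mult=0.10):
    """Net savings relative to never caching; negative means caching loses."""
    return expected_hits * (1.0 - read_mult) - (write_mult - 1.0)

print(f"break-even hits: {break_even_hits():.2f}")
print(f"one-shot request: {caching_net_savings(0):+.2f}")
print(f"one reuse within TTL: {caching_net_savings(1):+.2f}")
```

Passing `write_mult=2.0` models the 1h cache, which raises the break-even to roughly 1.1 hits.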