Users never perceive total latency — they perceive the moment the first token appears.
Latency is the metric people most often optimize in the wrong place. Most stare at a single "end-to-end total" number, trim output length, swap in a smaller model — only to find the bottleneck was in prefill all along. Real latency engineering has two completely different layers, each with its own levers. One is real time (what determines TTFT and TPOT separately, how prompt caching kills prefill). The other is perceived time (streaming saves zero tokens yet halves the felt wait; p99 — not the mean — is the true experience). This issue covers four things: decomposing latency into optimizable parts, using streaming to optimize perception, using prompt caching as the single largest lever on TTFT, and why fan-out amplifies tail latency into the norm — with copy-paste code to measure and mitigate.
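Generation latency has two segments with different mechanisms and different fixes: time to first token (TTFT), dominated by prefill over the input, and time per output token (TPOT), the per-token decode cost that accumulates over the N output tokens.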
The key corollary: long input hurts TTFT; long output hurts total time (via N × TPOT) — they're different diseases. "My agent is slow" tells you nothing about which to treat unless you decompose. Interactive apps (chat, coding) are sensitive to both — first byte fast, then smooth.
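To make the two diseases concrete, here is a minimal sketch of the two-segment model, total = TTFT + N × TPOT. The function name and the rates (5,000 prefill tokens per second, 20 ms per output token) are hypothetical numbers picked for illustration, and real prefill time is not perfectly linear in input length.

```python
def total_latency(input_tokens, output_tokens, prefill_tok_per_s, tpot_s):
    """Estimate (ttft, total) in seconds under a simple two-segment model."""
    ttft = input_tokens / prefill_tok_per_s   # prefill grows with input length
    total = ttft + output_tokens * tpot_s     # decode adds N x TPOT on top
    return ttft, total

# Hypothetical rates, for illustration only.
long_input = total_latency(50_000, 100, 5_000, 0.020)    # huge prompt, short answer
long_output = total_latency(500, 2_000, 5_000, 0.020)    # short prompt, long answer

print(f"long input : TTFT={long_input[0]:.1f}s total={long_input[1]:.1f}s")
print(f"long output: TTFT={long_output[0]:.1f}s total={long_output[1]:.1f}s")
```

In the first case almost all the time is TTFT, so trimming output would barely help; in the second, TTFT is negligible and output length is the whole story.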
Before any optimization, use streaming to measure both segments — this is step one of latency work:
import time
import anthropic

client = anthropic.Anthropic()
# `prompt` is your real prompt string, defined elsewhere.
t0 = time.perf_counter()
ttft = None
n = 0
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
) as s:
    for text in s.text_stream:
        if ttft is None:
            ttft = time.perf_counter() - t0  # first-token time
        n += len(text)  # counts characters, not tokens
total = time.perf_counter() - t0
tpot = (total - ttft) / max(n, 1)  # decode time per character
print(f"TTFT={ttft*1000:.0f}ms TPOT≈{tpot*1000:.1f}ms/char total={total*1000:.0f}ms")
Run this on your real prompt. If TTFT dominates → the disease is in the input (prompt too long / not cached), see §3. If N×TPOT dominates total → the disease is in the output (make the model say less, or use a faster decode path).
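The decision rule above can be written down as a tiny helper. This is a sketch, not part of any SDK: the function name and the 50% threshold are assumptions you should tune to your own product.

```python
def dominant_segment(ttft_s, total_s, threshold=0.5):
    """Return 'input' if TTFT takes at least `threshold` of total time, else 'output'."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return "input" if ttft_s / total_s >= threshold else "output"

# Example readings in seconds (made up for illustration).
print(dominant_segment(3.0, 4.0))   # prefill-bound: look at prompt length and caching
print(dominant_segment(0.4, 8.0))   # decode-bound: look at output length
```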
A common mistake is to skip the measurement and reflexively lower max_tokens, which does nothing when the bottleneck is prefill.
Streaming changes no real cost: total time, total tokens, and total dollars are all identical. What it changes is when the user starts seeing content. Without streaming, the user stares at a spinner for the whole TTFT + N×TPOT; with streaming, text starts scrolling after just TTFT, so perceived wait ≈ TTFT. An 8-second response is "8 seconds of blank" without streaming and "text at 0.5s, read as it goes" with it: the same 8 seconds, but the felt wait shrinks by more than an order of magnitude.
Technically, the Claude API uses SSE (Server-Sent Events): set stream:true and the server pushes a one-way series of events — content_block_delta carries incremental text, a final message_stop closes it. The official SDK's messages.stream() wraps the raw events into a text_stream iterator so you don't parse them by hand.
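For intuition about what the SDK hides, here is a minimal sketch of parsing the SSE wire format by hand. The `parse_sse` helper is hypothetical, and the sample lines are hand-written to mimic the event shapes named above rather than captured from a real response.

```python
import json

def parse_sse(lines):
    """Yield (event_name, data_dict) pairs from an iterable of SSE text lines."""
    event, data = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                       # a blank line terminates one event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = None, []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())

# Hand-written sample, not a captured API response.
sample = [
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hel"}}',
    "",
    "event: content_block_delta",
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "lo"}}',
    "",
    "event: message_stop",
    'data: {"type": "message_stop"}',
    "",
]

text = "".join(
    d["delta"]["text"] for name, d in parse_sse(sample) if name == "content_block_delta"
)
print(text)
```

In practice the SDK's text_stream does this accumulation for you; the sketch only shows why a proxy that withholds the blank-line-delimited events defeats streaming.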
The real engineering trap is structured output: if the downstream needs complete JSON to act, streaming gains evaporate — you can't use partial JSON until the final }. Two fixes:
# Trap: streaming JSON, but the UI waits for a complete object → no gain
# Fix A: split "prose for humans" and "structure for machines" into two calls
#        stream the prose (user sees it now), don't stream the structure (parse in back)
# Fix B: stream + incremental parse: feed a partial JSON parser as it arrives
# (summary_tool, msgs and render_partial are defined elsewhere)
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=[summary_tool],
    messages=msgs,
) as s:
    for ev in s:
        if ev.type == "input_json":      # incremental JSON of tool input
            render_partial(ev.snapshot)  # use the accumulated snapshot ("typing...")
Principle: stream human-facing output, keep machine-facing structure off the critical path of streaming. When you need both, split them into separate calls so the streamed one alone carries "perceived speed".
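If you handle raw deltas yourself instead of relying on the SDK's accumulated snapshot, the idea behind incremental parsing looks roughly like this. It is a naive sketch under stated assumptions: `parse_partial_json` is a hypothetical helper that closes whatever strings, objects, and arrays are still open and returns None when the prefix cannot be completed yet.

```python
import json

def parse_partial_json(fragment):
    """Best-effort parse of a truncated JSON prefix; None if not yet parseable."""
    stack, in_string, escaped = [], False, False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    # Close an open string first, then the open containers, innermost first.
    candidate = fragment + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. cut right after a key or a colon: wait for more input

print(parse_partial_json('{"title": "Lat'))
print(parse_partial_json('{"title": "Latency", "tags": ["a", "b'))
print(parse_partial_json('{"title":'))
```

The partial result is only good for display. Anything that acts on the arguments should still wait for the complete object.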
Tool arguments arrive incrementally as input_json_delta, but the tool must wait for complete arguments to execute; assuming you can act mid-stream yields half-parsed args and errors. Separately, a network buffer or reverse proxy (nginx) with buffering on accumulates the SSE into one blob, so the "stream" actually arrives all at once.
§1 established that TTFT is dominated by prefill: running the whole input through the model to compute the KV cache. But many requests share a repeated prefix: the same system prompt, the same tool schemas, the same few-shot examples, the unchanging history in a multi-turn chat. Prompt caching lets the server compute that prefix's KV once and store it, so the next hit skips prefill for the cached prefix, and TTFT drops from "scales with full length" to "only computes the uncached tail". It's the highest-leverage move on TTFT, and the payoff is biggest for long system prompts, large few-shot sets, and long RAG context.
Three engineering constraints you must remember: (1) the cache is prefix-match, compared byte-by-byte from the start; change one byte of the prefix and the cached entry no longer matches. (2) A hit reads at ~10% of base input price, but a write costs ~1.25× (building the cache the first time is pricier than a plain request). (3) Default TTL is 5 minutes, refreshed on every hit; add "ttl":"1h" to buy one hour at a higher write price (~2× base input). Two limits also apply: a prefix must reach a minimum length before it is cacheable (~1024 tokens for Sonnet, ~4096 for Opus 4.5+), and a request can carry at most 4 cache breakpoints.
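The pricing constraint is easier to reason about as arithmetic. The sketch below uses the multipliers quoted above (1.25× write, 0.10× read) and assumes, optimistically, that every request after the first hits the cache within the TTL; the function name is hypothetical.

```python
def relative_prefix_cost(requests, cached=True, write_mult=1.25, read_mult=0.10):
    """Cost of the shared prefix over `requests` calls, in units of one uncached read."""
    if not cached:
        return float(requests)
    # The first request writes the cache; later ones read it (assumes every later call hits).
    return write_mult + (requests - 1) * read_mult

for n in (1, 2, 10):
    with_cache = relative_prefix_cost(n)
    without = relative_prefix_cost(n, cached=False)
    print(f"{n:>2} requests: cached={with_cache:.2f} uncached={without:.2f}")
```

Under these assumptions caching loses on a single request and wins from the second one on, which is why an unstable prefix (all writes, no reads) is worse than not caching at all.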
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": LONG_STABLE_INSTRUCTIONS},  # stable
        {
            "type": "text",
            "text": FEWSHOT_EXAMPLES,
            "cache_control": {"type": "ephemeral"},  # ← cache up to here
        },
    ],
    messages=[{"role": "user", "content": user_query}],  # changes; put last
)
print(resp.usage)  # cache_creation_input_tokens / cache_read_input_tokens
Once live, you must watch the two fields in usage: high cache_read = healthy hits; persistently high cache_creation = your prefix is unstable and the cache is effectively dead. This is the only hard evidence of whether caching actually works — don't go by feel.
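A minimal way to turn those two fields into a single health number is sketched below. The helper name is hypothetical, and it assumes you have collected each response's usage as a plain dict (for example by logging the fields yourself) and that input_tokens counts only the uncached part of the prompt.

```python
def cache_hit_ratio(usages):
    """Share of prompt tokens served from cache across a list of usage dicts."""
    def total(key):
        return sum((u.get(key) or 0) for u in usages)  # treat missing/None as 0

    read = total("cache_read_input_tokens")
    written = total("cache_creation_input_tokens")
    fresh = total("input_tokens")
    denom = read + written + fresh
    return read / denom if denom else 0.0

# Made-up usage records: one cache write followed by two hits.
usages = [
    {"input_tokens": 50, "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"input_tokens": 40, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
    {"input_tokens": 60, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
]
ratio = cache_hit_ratio(usages)
print(f"cache hit ratio: {ratio:.1%}")
```

A ratio that stays near zero while cache_creation keeps growing is the signature of an unstable prefix.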
If the prefix is shorter than the minimum cacheable length, cache_control is silently ignored: you think you cached, you didn't.
Two often-ignored things. First, independent sub-tasks should run in parallel: for N mutually independent calls, serial time is Σ, parallel time (asyncio.gather) is max. Turning "ask one by one" into "ask all at once" is often the most direct speedup.
Second, the deeper trap: tail latency is amplified under fan-out. This is the classic phenomenon Google's Dean & Barroso flagged in The Tail at Scale (CACM 2013): when one user request fans out into K parallel sub-calls, overall latency is set by the slowest one (max), so a single call's "occasionally slow" p99 becomes the norm for the aggregate. The math is sobering: with an independent 1% chance that any single call is slow, K=20 gives 1 − 0.99²⁰ ≈ 18% chance at least one is slow; K=100 gives ≈ 63%. So at large enough K, one service's p99 becomes the p50 of the aggregate request upstream. Optimizing the mean is pointless; you have to treat the tail.
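The fan-out arithmetic can be checked in a few lines. This sketch assumes sub-calls are slow independently of each other, which real systems only approximate; the function names are illustrative.

```python
import math

def p_any_slow(k, p_slow=0.01):
    """Probability that at least one of k independent calls is slow."""
    return 1 - (1 - p_slow) ** k

def fanout_for_median(p_slow=0.01):
    """Smallest K at which at least half of aggregate requests include a slow call."""
    return math.ceil(math.log(0.5) / math.log(1 - p_slow))

for k in (1, 20, 100):
    print(f"K={k:>3}: P(at least one slow) = {p_any_slow(k):.1%}")
print("K where a slow call becomes the median case:", fanout_for_median())
```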
import asyncio
import anthropic

client = anthropic.AsyncAnthropic()
sem = asyncio.Semaphore(8)  # bounded concurrency, don't blow the rate limit

async def ask(q):
    async with sem:
        r = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=[{"role": "user", "content": q}],
        )
        return r.content[0].text

# parallel: overall ≈ max(latency), not Σ(latency)
async def ask_all(questions):
    return await asyncio.gather(*(ask(q) for q in questions))

# treat the tail: hedged request. Past p95 (in seconds) with no reply, fire a duplicate, take first
async def hedged(q, p95):
    a = asyncio.create_task(ask(q))
    done, _ = await asyncio.wait({a}, timeout=p95)
    if done:
        return a.result()
    b = asyncio.create_task(ask(q))  # fire the second
    done, pend = await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
    for t in pend:
        t.cancel()
    return done.pop().result()

# `questions` is your list of independent prompts, defined elsewhere:
# results = asyncio.run(ask_all(questions))
Hedging trades ~5% extra call volume for a large drop in p99 — the classic prescription from The Tail at Scale. But it must pair with bounded concurrency: unbounded parallelism only triggers 429s, retries, and avalanche — the more you "optimize" the slower it gets.
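The trade can be sanity-checked offline before touching production. The simulation below is a sketch on a made-up latency distribution (about 1 s normally, 2% of calls stalling near 10 s); it models a hedged call as finishing when either the original or a duplicate fired at p95 completes.

```python
import random
import statistics

def percentile(values, pct):
    """Percentile via statistics.quantiles (pct in 1..99)."""
    return statistics.quantiles(values, n=100)[pct - 1]

rng = random.Random(0)  # seeded so the run is repeatable

def sample_latency():
    # Hypothetical distribution: usually ~1 s, 2% of calls stall for ~10 s.
    return rng.uniform(8.0, 12.0) if rng.random() < 0.02 else rng.uniform(0.8, 1.2)

first = [sample_latency() for _ in range(10_000)]
second = [sample_latency() for _ in range(10_000)]
p95 = percentile(first, 95)

# Hedged latency: the original call, or the duplicate fired at p95, whichever ends first.
hedged = [min(a, p95 + b) if a > p95 else a for a, b in zip(first, second)]
extra_calls = sum(a > p95 for a in first) / len(first)

print(f"p99 unhedged: {percentile(first, 99):.2f}s")
print(f"p99 hedged:   {percentile(hedged, 99):.2f}s")
print(f"extra call volume: {extra_calls:.1%}")
```

The extra volume is about 5% by construction, since only calls slower than p95 trigger a duplicate. The size of the p99 drop depends entirely on your real distribution, so replay your own measurements rather than trusting these numbers.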
Calling gather on hundreds of concurrent requests hits the rate limit and triggers 429 retries, so overall it gets slower. Forcing parallelism on dependent steps fails too: the next step needs the previous result, so parallelizing only hands it a half-baked input.
String the four points into a 30-minute checkup — the order is the optimization priority:
Mark the stable prefix with cache_control, and watch usage to confirm cache_read is climbing. Best bang-for-buck.
Run independent sub-tasks with asyncio.gather (behind a Semaphore), then run 100 times and record p50/p95/p99. You'll almost certainly find p99 far worse than you imagined.
Do this and your attribution of "slow" upgrades from "the model's bad" to "TTFT or TPOT? real or perceived? p50 or p99?" That's the latency-engineering mindset.
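For the "run 100 times and record p50/p95/p99" step, a small summary helper is enough. This is a sketch using only the standard library; the function name is hypothetical and the sample data is invented to show how a healthy-looking mean can hide the tail.

```python
import statistics

def latency_summary(samples_s):
    """Return mean and p50/p95/p99 (seconds) for a list of measured latencies."""
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_s),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Invented run: 97 fast calls and 3 stalls.
samples = [1.0] * 97 + [9.0] * 3
summary = latency_summary(samples)
print(summary)
```

Here the mean and even p95 look fine while p99 is nine times worse, which is exactly the case fan-out turns into the common one.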