DAY 15 / PHASE 2 · APPLICATIONS & SYSTEMS

Latency Engineering

TTFT vs TPOT · Streaming · Prompt Caching · Tail Latency

2026-05-29 · BigCat

Users never perceive total latency — they perceive the moment the first token appears.

// WHY THIS MATTERS

Latency is the metric people most often optimize in the wrong place. Most stare at a single "end-to-end total" number, trim output length, swap in a smaller model — only to find the bottleneck was in prefill all along. Real latency engineering has two completely different layers, each with its own levers. One is real time (what determines TTFT and TPOT separately, how prompt caching kills prefill). The other is perceived time (streaming saves zero tokens yet halves the felt wait; p99 — not the mean — is the true experience). This issue covers four things: decomposing latency into optimizable parts, using streaming to optimize perception, using prompt caching as the single largest lever on TTFT, and why fan-out amplifies tail latency into the norm — with copy-paste code to measure and mitigate.

// 01

Decompose Latency: TTFT vs TPOT — Measure Before You Optimize

Claim: end-to-end latency = TTFT + N × TPOT; optimize without decomposing first and nine times out of ten you cut the wrong thing.

Background & Principle

Generation latency has two segments with different mechanisms and different fixes:

TTFT (Time To First Token): the time from sending the request to receiving the first token. TTFT = queue wait + prefill (running the entire input prompt through the model once to build the KV cache) + network. Prefill is compute-bound and grows roughly linearly with input token count, so a longer prompt means a slower TTFT.
TPOT (Time Per Output Token, a.k.a. ITL, inter-token latency): once decoding is under way, the gap between consecutive tokens. Decode is memory-bound (every token requires reading the full model weights from VRAM), so per-token time is near-constant and varies only weakly with input length.

The key corollary: long input hurts TTFT; long output hurts total time (via N × TPOT) — they're different diseases. "My agent is slow" tells you nothing about which to treat unless you decompose. Interactive apps (chat, coding) are sensitive to both — first byte fast, then smooth.

request ─────────────────────────────────────────▶ time
│◀─ TTFT ─▶│◀──────── N × TPOT ──────────▶│
│ (prefill) │ (decode, token by token)      │

no-stream: [════════ stare at spinner ════════]→ see all at once
stream:    [wait TTFT]→ first token → text scrolls in → done
           ▲ perceived latency ≈ TTFT, NOT total time
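The decomposition can be turned into a back-of-envelope model. The sketch below is illustrative only: `prefill_tps` (prefill throughput in tokens per second), `tpot_s`, and `overhead_s` are assumed numbers, not measurements from any real deployment. It counts N − 1 decode gaps because the first token arrives at TTFT; for large N that matches TTFT + N × TPOT.

```python
def latency_model(n_in, n_out, prefill_tps=5000.0, tpot_s=0.02, overhead_s=0.15):
    """Rough model: TTFT = overhead + prefill; total = TTFT + (n_out - 1) * TPOT."""
    ttft = overhead_s + n_in / prefill_tps          # queue/network + prefill
    total = ttft + max(n_out - 1, 0) * tpot_s       # first token arrives at TTFT
    return ttft, total

# Same 2050 tokens overall, split two different ways.
long_in = latency_model(n_in=2000, n_out=50)
long_out = latency_model(n_in=50, n_out=2000)
print(f"long input : TTFT={long_in[0]:.2f}s total={long_in[1]:.2f}s")
print(f"long output: TTFT={long_out[0]:.2f}s total={long_out[1]:.2f}s")
```

Under these assumed rates the long-input request has the slower first token (0.55 s vs 0.16 s) while the long-output request has by far the slower total (about 40 s vs 1.5 s), which is the "different diseases" point in numbers.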

In Practice

Before any optimization, use streaming to measure both segments — this is step one of latency work:

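import time
import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment
prompt = "..."                          # replace with your real production prompt

t0 = time.perf_counter(); ttft = None
with client.messages.stream(model="claude-sonnet-4-6", max_tokens=1024,
        messages=[{"role":"user","content": prompt}]) as s:
    for text in s.text_stream:
        if ttft is None: ttft = time.perf_counter() - t0   # first-token time
    n = s.get_final_message().usage.output_tokens          # real token count, not characters
total = time.perf_counter() - t0
ttft = total if ttft is None else ttft                     # no text emitted (e.g. tool-only reply)
tpot = (total - ttft) / max(n - 1, 1)                      # gaps between tokens, so N - 1
print(f"TTFT={ttft*1000:.0f}ms  TPOT≈{tpot*1000:.1f}ms/token  total={total*1000:.0f}ms")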

Run this on your real prompt. If TTFT dominates → the disease is in the input (prompt too long / not cached), see §3. If N×TPOT dominates total → the disease is in the output (make the model say less, or use a faster decode path).
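The triage rule can be written down as a tiny helper. This is a sketch with a hypothetical function name (`diagnose`) and made-up sample measurements; it only compares the two segments and reports which one dominates.

```python
def diagnose(ttft_s, tpot_s, n_out):
    """Return which segment dominates total latency and its share of the total."""
    decode_s = max(n_out - 1, 0) * tpot_s
    total = ttft_s + decode_s
    if total <= 0:
        raise ValueError("need a positive total latency")
    if ttft_s >= decode_s:
        return "input-bound (TTFT)", ttft_s / total
    return "output-bound (N x TPOT)", decode_s / total

# Hypothetical measurements from two different endpoints.
print(diagnose(ttft_s=2.4, tpot_s=0.02, n_out=40))    # long prompt, short answer
print(diagnose(ttft_s=0.4, tpot_s=0.02, n_out=800))   # short prompt, long answer
```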

Failure mode: estimating latency as "total token count × a flat per-token figure". Latency is not a linear function of total token count: the same 2000 tokens placed all in the input (hurts TTFT) or all in the output (hurts total time) feel worlds apart. People also try to "speed up" by cutting max_tokens, which does nothing when the bottleneck is prefill.

Going deeper · vLLM Metrics (standard definitions of TTFT / TPOT / ITL), docs.vllm.ai/.../metrics · Day 14 Inference Optimization — why decode is memory-bound.

// 02

Streaming: You're Optimizing Perceived Latency, Not Real Latency

Claim: streaming saves not a single token, yet compresses the user's wait from "total time" down to "TTFT" — the cheapest experience win there is.

Background & Principle

Streaming changes zero real cost: total time, total tokens, and total dollars are all identical. What it changes is when the user starts seeing content. Without streaming, the user stares at a spinner for the whole TTFT + N×TPOT; with streaming, they start watching text scroll after just TTFT, so perceived wait ≈ TTFT. An 8-second response is "8 seconds of blank" without streaming and "text at 0.5s, read as it goes" with it: the same 8 seconds, but a 16× shorter wait before anything appears.

Technically, the Claude API uses SSE (Server-Sent Events): set stream:true and the server pushes a one-way series of events — content_block_delta carries incremental text, a final message_stop closes it. The official SDK's messages.stream() wraps the raw events into a text_stream iterator so you don't parse them by hand.
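To make the wire format concrete, here is a minimal sketch of what the SDK does for you: splitting an SSE body into events and concatenating the text deltas. The `SAMPLE` string is a hand-written, abbreviated stream (a real one also carries events such as message_start and content_block_start), and `iter_sse_events` is a hypothetical helper, not part of the SDK.

```python
import json

def iter_sse_events(raw: str):
    """Yield (event_name, data_dict) pairs from a raw SSE text stream."""
    for block in raw.strip().split("\n\n"):          # events are separated by a blank line
        name, data = None, []
        for line in block.splitlines():
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        if name and data:
            yield name, json.loads("\n".join(data))

SAMPLE = (
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hel"}}\n\n'
    'event: content_block_delta\n'
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"lo"}}\n\n'
    'event: message_stop\n'
    'data: {"type":"message_stop"}\n\n'
)

text = ""
for name, payload in iter_sse_events(SAMPLE):
    if name == "content_block_delta" and payload["delta"]["type"] == "text_delta":
        text += payload["delta"]["text"]             # incremental text
    elif name == "message_stop":
        break                                        # stream is complete
print(text)
```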

In Practice

The real engineering trap is structured output: if the downstream needs complete JSON to act, streaming gains evaporate — you can't use partial JSON until the final }. Two fixes:

# Trap: streaming JSON, but the UI waits for a complete object → no gain

# Fix A: split "prose for humans" and "structure for machines" into two calls
#   stream the prose (user sees it now), don't stream the structure (parse in back)

# Fix B: stream + incremental parse — feed a partial JSON parser as it arrives
with client.messages.stream(model="claude-sonnet-4-6", max_tokens=2048,
        tools=[summary_tool],
        messages=msgs) as s:
    for ev in s:
        if ev.type == "input_json":           # incremental JSON of tool input
            render_partial(ev.snapshot)        # use the accumulated snapshot ("typing...")

Principle: stream human-facing output, keep machine-facing structure off the critical path of streaming. When you need both, split them into separate calls so the streamed one alone carries "perceived speed".

Failure mode: (1) streaming + forcing complete JSON before render — streaming is wasted, the user still waits to the end. (2) Tool-call arguments also stream as input_json_delta, but the tool must wait for complete arguments to execute; assuming you can act mid-stream yields half-parsed args and errors. (3) A network buffer or reverse proxy (nginx) with buffering on accumulates the SSE into one blob — the "stream" actually arrives all at once.
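Failure mode (2) can be guarded against mechanically. The sketch below uses a hypothetical `ToolArgBuffer` class and made-up JSON fragments: it accumulates streamed argument pieces and refuses to hand out parsed arguments until the caller marks the block complete (for example, on the content block's stop event).

```python
import json

class ToolArgBuffer:
    """Accumulate streamed JSON fragments; parse only once the block is complete."""
    def __init__(self):
        self.parts = []
        self.closed = False

    def feed(self, partial_json: str):
        self.parts.append(partial_json)

    def close(self):
        self.closed = True

    def args(self):
        if not self.closed:
            raise RuntimeError("tool arguments still streaming; do not execute yet")
        return json.loads("".join(self.parts))

buf = ToolArgBuffer()
for fragment in ['{"city": "Par', 'is", "days"', ': 3}']:   # hypothetical delta payloads
    buf.feed(fragment)
try:
    buf.args()                                       # acting mid-stream is rejected
    acted_early = True
except RuntimeError:
    acted_early = False
buf.close()                                          # e.g. when the block's stop event arrives
print(acted_early, buf.args())
```

Rendering a partial snapshot for display is still fine; the guard only applies to executing the tool.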

Going deeper · Anthropic Streaming Messages docs (SSE event types, SDK streaming), docs.claude.com/.../streaming

// 03

Prompt Caching: The Single Largest Lever on TTFT

Claim: TTFT is mostly prefill; a cache hit = skip recomputing the prefix = TTFT falls off a cliff, and you save money too.

Background & Principle

§1 established that TTFT is dominated by prefill: running the whole input through to compute the KV cache. But many requests share a repeated prefix: the same system prompt, the same tool schemas, the same few-shot batch, the unchanging history in a multi-turn chat. Prompt caching lets the server compute that prefix's KV once and store it, so the next hit skips prefill for the cached prefix, dropping TTFT from "scales with full length" to "only computes the uncached tail". It's the highest-leverage move on TTFT, and the payoff is biggest for long system prompts, large few-shot sets, and long RAG context.

Three engineering constraints you must remember: (1) the cache is prefix-match — matched byte-by-byte from the start; change one byte of the prefix and the whole thing invalidates. (2) A hit reads at ~10% of base input price, but a write costs ~1.25× (building the cache the first time is pricier). (3) Default TTL is 5 minutes, refreshed on every hit; add "ttl":"1h" to buy one hour (more expensive). There's also a minimum: ~1024 tokens for Sonnet, ~4096 for Opus 4.5+, before a prefix is cacheable; up to 4 breakpoints.
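The pricing in constraint (2) implies a quick break-even calculation. The sketch below takes the ~1.25× write and ~0.10× read multipliers from the text as defaults and assumes every request after the first lands within the TTL; `relative_cost` is a hypothetical helper and covers only the shared prefix, not the uncached tail.

```python
def relative_cost(n_requests, cached, write_mult=1.25, read_mult=0.10):
    """Cost of the shared prefix over n requests, in units of one uncached prefix read."""
    if n_requests < 1:
        return 0.0
    if not cached:
        return float(n_requests)
    # first request writes the cache, the rest read it (all within the TTL)
    return write_mult + read_mult * (n_requests - 1)

for n in (1, 2, 10):
    print(n, relative_cost(n, cached=False), round(relative_cost(n, cached=True), 2))
```

With these multipliers a single request is more expensive with caching (1.25 vs 1.0), but the second hit already pays it back (1.35 vs 2.0), and ten requests cost about 2.15 instead of 10.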

prefix-match: stable in front, volatile at the back
┌────────────────────────────────────────────────────┐
│ [system prompt]   ←stable ┐                        │
│ [tool schemas]    ←stable ├ cache_control hit →    │ skip prefill
│ [few-shot]        ←stable ┘ (breakpoint)           │ TTFT plummets
│ ────────────────────────────────────────────────── │
│ [chat history]    ←semi-stable  cache_control      │ incremental cache
│ ────────────────────────────────────────────────── │
│ [this-turn user]  ←changes every time (no cache)   │
└────────────────────────────────────────────────────┘
volatile content in front = cache miss every time = lever → zero

In Practice

resp = client.messages.create(
    model="claude-sonnet-4-6", max_tokens=1024,
    system=[
        {"type":"text", "text": LONG_STABLE_INSTRUCTIONS},   # stable
        {"type":"text", "text": FEWSHOT_EXAMPLES,
         "cache_control": {"type":"ephemeral"}},        # ← cache up to here
    ],
    messages=[{"role":"user", "content": user_query}],       # changes; put last
)
print(resp.usage)  # cache_creation_input_tokens / cache_read_input_tokens

Once live, you must watch the two fields in usage: high cache_read = healthy hits; persistently high cache_creation = your prefix is unstable and the cache is effectively dead. This is the only hard evidence of whether caching actually works — don't go by feel.
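One way to turn those usage fields into a dashboard number is a hit ratio over a window of requests. This is a sketch: the records are hypothetical plain dicts standing in for the attributes of `resp.usage`, and `cache_hit_ratio` is an illustrative helper, not an SDK function.

```python
def cache_hit_ratio(usages):
    """Share of input tokens served from cache across a batch of usage records."""
    read = sum(u.get("cache_read_input_tokens", 0) or 0 for u in usages)
    created = sum(u.get("cache_creation_input_tokens", 0) or 0 for u in usages)
    fresh = sum(u.get("input_tokens", 0) or 0 for u in usages)
    total = read + created + fresh
    return read / total if total else 0.0

# Hypothetical usage records: one cache write followed by three hits.
log = [
    {"input_tokens": 40, "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"input_tokens": 55, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
    {"input_tokens": 35, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
    {"input_tokens": 70, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
]
print(f"hit ratio = {cache_hit_ratio(log):.2f}")
```

A ratio that stays near zero while cache_creation keeps accumulating is the "unstable prefix" signature described above.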

Failure mode: (1) splicing a timestamp, random ID, or username into the start of the system prompt — the prefix changes every time, cache always misses. (2) Sparse traffic: requests more than 5 minutes apart mean the cache has expired by the next call, and you paid the 1.25× write for nothing. (3) The prefix is below the minimum token threshold, so cache_control is silently ignored — you think you cached, you didn't.
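Failure mode (1) is easy to see with a character-level proxy. The sketch below is only an analogy: the real cache matches content up to each cache_control breakpoint rather than raw characters, so in practice the volatile part should also sit in a block after the breakpoint. `STABLE` is a stand-in system prompt and the timestamps are made up.

```python
import os

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading prefix of two strings."""
    return len(os.path.commonprefix([a, b]))

STABLE = "You are a support agent. Follow the policy below...\n" * 20   # stand-in system prompt

# Bad: volatile timestamp at the very front.
bad_1 = "now=2026-05-29T10:00:01\n" + STABLE
bad_2 = "now=2026-05-29T10:00:07\n" + STABLE
# Good: same content, volatile part moved to the tail.
good_1 = STABLE + "now=2026-05-29T10:00:01\n"
good_2 = STABLE + "now=2026-05-29T10:00:07\n"

print("bad :", shared_prefix_len(bad_1, bad_2), "of", len(bad_1))
print("good:", shared_prefix_len(good_1, good_2), "of", len(good_1))
```

The two "bad" prompts diverge after 22 characters, so nothing of the long stable body is reusable; the "good" pair shares the entire stable body.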

Going deeper · Anthropic Prompt Caching docs (cache_control / TTL / minimums / breakpoints), docs.claude.com/.../prompt-caching

// 04

Parallelism & Tail Latency: p99 Is the Real Experience

Claim: average latency is the marketing number for your boss; p99 is the experience. And fan-out amplifies the tail into the norm.

Background & Principle

Two often-ignored things. First, independent sub-tasks should run in parallel: for N mutually independent calls, serial time is Σ, parallel time (asyncio.gather) is max. Turning "ask one by one" into "ask all at once" is often the most direct speedup.
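The Σ versus max difference is easy to check without touching a real API. In this sketch `fake_call` uses `asyncio.sleep` as a stand-in for a network-bound LLM call, and the delays are arbitrary.

```python
import asyncio, time

async def fake_call(delay_s):
    await asyncio.sleep(delay_s)      # stands in for a network-bound LLM call
    return delay_s

async def compare(delays):
    t0 = time.perf_counter()
    for d in delays:                  # serial: total ≈ sum(delays)
        await fake_call(d)
    serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    await asyncio.gather(*(fake_call(d) for d in delays))   # parallel: total ≈ max(delays)
    parallel = time.perf_counter() - t0
    return serial, parallel

serial, parallel = asyncio.run(compare([0.05, 0.10, 0.15]))
print(f"serial={serial:.2f}s parallel={parallel:.2f}s")
```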

Second, the deeper trap: tail latency is amplified under fan-out. This is the classic phenomenon Google's Dean & Barroso flagged in The Tail at Scale (CACM 2013): when one user request fans out into K parallel sub-calls, overall latency is bounded by the slowest one (max), so a single call's "occasionally slow" p99 becomes the norm for the aggregate. The math is sobering: if each call independently has a 1% chance of being slow, K=20 gives 1 − 0.99²⁰ ≈ 18% chance at least one is slow; K=100 gives ≈ 63%. So at high fan-out, one service's p99 becomes roughly the p50 of its upstream aggregate request. Optimizing the mean does not help here; you have to treat the tail.
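The fan-out arithmetic is short enough to verify directly. The sketch assumes sub-calls are independent with the same slow probability, which real systems only approximate.

```python
import math

def p_any_slow(k, p_slow=0.01):
    """Probability that at least one of k independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** k

for k in (1, 20, 100):
    print(f"K={k:>3}: {p_any_slow(k):.1%}")

# Smallest fan-out at which a slow sub-call becomes more likely than not.
k_half = math.ceil(math.log(0.5) / math.log(1 - 0.01))
print("K for 50%:", k_half)
```

Under that assumption the crossover sits at K = 69: from there on, more than half of aggregate requests include at least one p99-slow sub-call.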

In Practice

import asyncio, anthropic
client = anthropic.AsyncAnthropic()
sem = asyncio.Semaphore(8)   # bounded concurrency, don't blow the rate limit

async def ask(q):
    async with sem:
        r = await client.messages.create(model="claude-sonnet-4-6",
            max_tokens=512, messages=[{"role":"user","content": q}])
    return r.content[0].text

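# parallel: overall ≈ max(latency), not Σ(latency)
# (await is only valid inside a coroutine; call with asyncio.run(ask_all(questions)))
async def ask_all(questions):
    return await asyncio.gather(*(ask(q) for q in questions))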

# treat the tail: hedged request — past p95 with no reply, fire a duplicate, take first
async def hedged(q, p95):
    a = asyncio.create_task(ask(q))
    done, _ = await asyncio.wait({a}, timeout=p95)
    if done: return a.result()
    b = asyncio.create_task(ask(q))                       # fire the second
    done, pend = await asyncio.wait({a, b}, return_when=asyncio.FIRST_COMPLETED)
    for t in pend: t.cancel()
    return done.pop().result()

Hedging trades ~5% extra call volume for a large drop in p99 — the classic prescription from The Tail at Scale. But it must pair with bounded concurrency: unbounded parallelism only triggers 429s, retries, and avalanche — the more you "optimize" the slower it gets.
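The trade can be sanity-checked with a small simulation. Everything here is assumed for illustration: a toy latency distribution with 3% stragglers, a fixed random seed, and a duplicate that is fired exactly at the measured p95 and drawn from the same distribution. It ignores the extra load the duplicates put on the backend.

```python
import math, random

def percentile(xs, q):
    """Nearest-rank percentile, q in (0, 100]."""
    s = sorted(xs)
    return s[max(1, math.ceil(len(s) * q / 100)) - 1]

def sample_latency(rng):
    # Assumed toy distribution: 97% fast (0.3-0.7 s), 3% stragglers (6-10 s).
    return rng.uniform(6.0, 10.0) if rng.random() < 0.03 else rng.uniform(0.3, 0.7)

rng = random.Random(0)
base = [sample_latency(rng) for _ in range(20000)]
p95 = percentile(base, 95)

hedged, extra = [], 0
for first in base:
    if first <= p95:
        hedged.append(first)                         # answered before the hedge timer
    else:
        extra += 1                                   # duplicate fired at t = p95
        hedged.append(min(first, p95 + sample_latency(rng)))

print(f"p99 base={percentile(base, 99):.2f}s hedged={percentile(hedged, 99):.2f}s "
      f"extra calls={extra / len(base):.1%}")
```

In this toy setup the duplicate rate is capped at 5% by construction, and the p99 falls from the straggler range to roughly the hedge delay plus one fast call.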

Failure mode: (1) using average latency as an SLO — a pretty 800ms mean while p99 is 12s means 1% of users wait until they give up, every day. (2) Blindly gather-ing hundreds of concurrent calls, hitting the rate limit, triggering 429 retries — overall it gets slower. (3) Forcing parallelism on dependent steps — the next step needs the previous result, so parallelizing only hands you a half-baked input.

Going deeper · Dean & Barroso The Tail at Scale, CACM 2013, cacm.acm.org/research/the-tail-at-scale · Anthropic AsyncAnthropic / rate-limit docs (concurrency and 429 handling).

// SYNTHESIS · Run a Latency Checkup on Your LLM App

String the four points into a 30-minute checkup — the order is the optimization priority:

Measure first: use the §1 code to get TTFT / TPOT / total on your real prompt. Without these three numbers, everything after is guesswork.
Big TTFT → cache first: check whether the input prefix is stable, move system + tools + few-shot to the very front with cache_control, and watch usage to confirm cache_read is climbing. Best bang-for-buck.
Perception layer → stream everything: stream all human-facing output; split out anything structured into a separate call so JSON never blocks perceived speed.
Many calls → parallelize + measure p99: gather independent sub-tasks (with a Semaphore), then run 100 times and record p50/p95/p99 — you'll almost certainly find p99 far worse than you imagined.
Tail hurts → add hedging: on latency-sensitive critical paths, add hedged requests to flatten p99 with ~5% extra volume.
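For the "run 100 times and record p50/p95/p99" step, a minimal harness could look like the sketch below. `measure` and `summarize` are hypothetical helper names; `call` is whatever zero-argument function issues one request in your app. The demo feeds synthetic samples so it runs without network access.

```python
import statistics, time

def summarize(samples_s):
    """Return (p50, p95, p99) from a list of latency samples."""
    cuts = statistics.quantiles(samples_s, n=100)    # 99 cut points: cuts[i] is p(i+1)
    return cuts[49], cuts[94], cuts[98]

def measure(call, runs=100):
    """Time `call()` repeatedly and summarize the distribution."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        call()
        samples.append(time.perf_counter() - t0)
    return summarize(samples)

p50, p95, p99 = summarize([float(i) for i in range(1, 101)])   # synthetic 1..100 "ms"
print(p50, p95, p99)
```

With only 100 runs the p99 estimate rests on one or two samples, so treat it as a rough signal and collect more runs before setting an SLO on it.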

Do this and your attribution of "slow" upgrades from "the model's bad" to "TTFT or TPOT? real or perceived? p50 or p99?" — that's the latency-engineering mindset.

// DEEP THINKING

Streaming saves not one millisecond of real time, yet it's sold as "speedup". Where else does this "perception optimization" apply in LLM products, and where's the boundary?

Same idea: stream a "thinking… / searching…" placeholder (TTFT feels shorter), break a long task into visible intermediate artifacts (plan → draft → final), skeleton screens. All of it trades blank waiting for a first visible signal. The boundary: when the user needs correctness of the result, not a sense of progress (one-shot API returns, batch jobs, machine consumers), perception optimization is pointless or harmful — streaming JSON actually slows the downstream. It only works on interactive paths where a human is waiting at the screen.

Prompt caching says "stable content in front". Does this conflict with Day 2 Context Engineering's "important content at the ends (lost-in-the-middle)"?

No conflict — they constrain different dimensions. Caching constrains byte-stability across requests (don't change the prefix); lost-in-the-middle constrains the attention position of info within a single request. Putting stable system/tools/few-shot at the very front satisfies both: it's a cache-friendly prefix AND occupies the high-attention opening. What you must avoid is burying "the most critical instruction for the current task" in the middle — that's neither in the attention high ground nor stable, breaking the cache. Both principles point to the same layout: stable + important skeleton up front, volatile query at the tail.

Why is prefill compute-bound while decode is memory-bound? How does this difference dictate that "long input" and "long output" need completely different fixes?

Prefill processes all input tokens in parallel in one pass, saturating the compute → compute-bound, bottleneck is FLOPs, so caching (skip the compute) is the lever. Decode generates 1 token per step yet must sweep the entire model weights + KV cache out of VRAM each time, leaving compute idle waiting on memory → memory-bound, bottleneck is bandwidth, so batching, quantization, speculative decoding (Day 14) are the levers. Conclusion: treat TTFT (long input) by "computing less" (caching); treat TPOT (long output) by "moving less" (batch/quant/speculation) — the levers don't swap.
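The memory-bound claim gives a simple lower bound on TPOT: the time to read the weights once per token. The sketch below uses assumed, illustrative figures (a 70B-parameter model, about 3.35 TB/s of memory bandwidth, batch size 1) and ignores KV-cache reads, multi-GPU sharding, and kernel overheads.

```python
def decode_floor_ms(params_billion, bytes_per_param, mem_bw_tb_s):
    """Lower bound on per-token decode time at batch size 1: weight bytes / bandwidth."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (mem_bw_tb_s * 1e12) * 1e3

# Assumed figures for illustration only.
fp16 = decode_floor_ms(70, 2, 3.35)      # 16-bit weights
int4 = decode_floor_ms(70, 0.5, 3.35)    # 4-bit quantized weights
print(f"fp16 floor ≈ {fp16:.1f} ms/token, int4 floor ≈ {int4:.1f} ms/token")
```

Quartering the bytes per weight quarters the floor, which is why quantization acts on TPOT, while nothing in this bound depends on prompt length.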

Hedged requests trade ~5% extra calls for a p99 drop. In what scenarios does this trade-off flip into a net loss?

Three ways it backfires: (1) the system is already near capacity — the extra 5% pushes load up and manufactures more stragglers, a positive-feedback avalanche; (2) the call itself is expensive (long output / big model) — the 5% extra token cost may exceed the business value of the p99 improvement; (3) non-idempotent side-effecting calls (DB writes, sending messages, charging) — the duplicate executes twice, so you must guarantee idempotency or dedupe first. Hedging's premise is "cheap, idempotent, system has headroom"; otherwise Dean & Barroso themselves suggest tied requests or not hedging at all.

If you could set just one latency SLO for your LLM product, would you pick TTFT or end-to-end p99? Why is picking only one dangerous?

Interactive products mostly pick "p99 TTFT" — it encodes both "the start of perceived wait" and "tail experience". But one is dangerous: optimize only TTFT and the model can be fast on the first token then sluggish (bad TPOT), total time still terrible; optimize only end-to-end p99 and you sacrifice first-byte responsiveness for total time. The mature move is a percentile × segment matrix: set at least p99 TTFT (perception) and p99 total (completion), then weight by product type — chat leans TTFT, batch summarization leans total.

// FURTHER READING

Anthropic · Prompt Caching — authoritative on cache_control / TTL / minimums / breakpoints
Anthropic · Streaming Messages — SSE event types and SDK streaming interface
vLLM · Metrics — industry-standard definitions of TTFT / TPOT / ITL
Dean & Barroso · The Tail at Scale (CACM 2013) — the foundational paper on tail latency and hedged requests