Build an inference platform for a 10M-DAU chat product: a 70B dense model on 8×H100 nodes (tensor parallel). Users average 15 turns/day, with 3k input + 400 output tokens per turn, peaking around 5k requests/sec. This is not "call an API" — an H100 costs $2–4/hour and GPUs are the dominant cost, so raising utilization from 30% to 80% literally halves the bill.
The SLO splits in two, because LLMs return token by token, streaming: TTFT (time-to-first-token) p99 < 700ms decides "does it feel laggy"; TPOT (time-per-output-token) p99 < 40ms (≈25 tok/s, faster than reading) decides "does it stream smoothly". These two metrics rest on two completely different compute profiles, and the whole architecture is a set of trade-offs around them.
graph LR
C["Client
SSE streaming"]
GW["API Gateway
auth · rate-limit · metering"]
R["Router
prefix-aware LB"]
PC[("Prompt Cache
prefix KV reuse")]
subgraph Engine["Inference Engine (vLLM/TRT-LLM)"]
PF["Prefill pool
compute-bound"]
DC["Decode pool
memory-bound"]
KV[("KV Cache
PagedAttention")]
end
C -->|1 prompt| GW --> R
R -.prefix hit?.-> PC
R -->|2 route| PF
PF -->|KV transfer| DC
PF <--> KV
DC <--> KV
DC -.->|3 token stream| GW -.SSE.-> C
classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef cache fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef gpu fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class C client
class GW,R edge
class PC,KV cache
class PF,DC gpu
Requests check the prefix cache first, then enter the engine; inside, prefill and decode differ, and tokens stream back one by one over SSE
The gateway handles auth, rate limiting (per-token, not per-request), and usage metering. The router is LLM-specific: it does "prefix-aware" routing — sending requests with the same system prompt to the same GPU to reuse cached KV. Inside the engine, inference splits into prefill (process the whole prompt at once) and decode (autoregressively generate token by token), both contending for the same KV Cache memory.
Core trade-off: static batching makes fast requests wait for slow ones and leaves the GPU idle; continuous batching swaps requests in and out every token step, saturating utilization.
Principle: LLM inference has two phases. Prefill runs the whole prompt in parallel and computes KV for every token — it is compute-bound (can saturate the GPU). Decode generates 1 token at a time but must stream the entire model weights from memory into compute — it is memory-bandwidth-bound (compute mostly idle). A single request's decode might use only a few percent of the GPU — the only redemption is to batch many requests' decode together, loading weights once and amortizing across the whole batch.
But classic "static batching" has a fatal flaw: in one batch, some requests finish after 10 tokens, others need 2000, and finished slots just wait for the whole batch to complete. Continuous batching (iteration-level scheduling) drops scheduling granularity from "whole request" to "single token iteration": after each token, rebatch — finished requests leave immediately, queued ones join immediately.
# Continuous batching scheduler loop (pseudo)
running = [] # requests currently decoding
while True:
# 1) admit new requests for prefill while there's room
while can_admit(running) and queue:
req = queue.pop()
req.kv = prefill(req.prompt) # compute full prompt KV
running.append(req)
# 2) batch the "next token" of all running requests together
logits = decode_step(running) # weights loaded once, amortized
for req in running:
tok = sample(logits[req])
stream_out(req, tok) # push to client over SSE now
if tok == EOS or req.len >= req.max:
release_kv(req); running.remove(req) # free seat, reclaim KV
Core trade-off: how much memory KV Cache takes directly decides how many requests you can run concurrently and how big the batch can be — it, not compute, is the real ceiling of most serving systems.
Principle: each autoregressive step must look back at the K/V of all prior tokens; to avoid recompute you cache them. KV memory per token ≈ 2 × layers × kv_heads × head_dim × bytes. For a 70B GQA model that's about 0.3 MB/token, so a single 8k-context request eats ~2.5 GB. The 70B weights alone are 140GB in fp16 (multi-GPU); whatever memory remains is divided up by KV Cache — so "how many concurrent requests" is really a memory division problem. Worse, naive implementations reserve "max length" of contiguous memory per request, wasting 60–80% to fragmentation and over-reservation.
PagedAttention borrows the OS virtual-memory idea: slice KV Cache into fixed-size blocks (pages), logically contiguous but physically scattered, mapped via a page table. Allocate on demand, near-zero fragmentation, and requests with identical prefixes can share the same physical pages (copy-on-write). This multiplies usable batch size and throughput soars.
Core trade-off: most requests share a long prefix (system prompt, few-shot, RAG docs) — caching that prefix's KV saves both expensive prefill compute and TTFT, at the cost of routing that must be "sticky".
Principle: an agent app's system prompt might be 2k tokens; re-prefilling it per request is pure waste. Prefix caching keeps the prefix's computed KV in memory (or tiered to CPU/SSD); later requests whose prefix matches token-for-token reuse it and skip that prefill. But the cache lives in a specific GPU's memory — which demands prefix-aware routing: send same-prefix requests to the node holding the cache, or the cache is useless. This is the fundamental difference from traditional stateless LBs — it is stateful and affinity-based.
# Prefix-aware routing (pseudo): same prefix to same node
def route(req):
prefix_hash = hash(req.system_prompt + req.shared_context)
node = consistent_hash_ring.get(prefix_hash) # prefix affinity
if node.kv_load > HIGH_WATERMARK: # but guard hotspots
node = least_loaded(replicas_of(prefix_hash))
return node
# Key point: a cache hit saves prefill, but you must put the breakpoint at the
# END of the STABLE prefix — change any token in it (e.g. a timestamp) and the
# prefix hash no longer matches, invalidating everything.
Core trade-off: a stronger model is higher quality but slower and costlier; these three form an impossible triangle, and engineering pushes the "cost-latency" frontier outward without sacrificing quality.
Principle: you can't get "strongest model + lowest latency + lowest cost" at once. But several tricks push the Pareto frontier out:
# Speculative decoding (pseudo): draft guesses, big model verifies
def speculative_step(ctx):
draft = small_model.generate(ctx, k=4) # cheaply guess 4 tokens
logits = big_model.verify(ctx, draft) # one forward verifies 4 in parallel
accepted = longest_matching_prefix(draft, logits) # accept matching prefix
return accepted + [sample(logits[len(accepted)])] # advance at least 1
# Better draft = higher accept rate = more speedup; a weak draft wastes compute
# -> choosing the draft model is the crux.
Common interview questions: (1) Why is single-request GPU utilization so low during decode, and how does continuous batching save it? (2) Estimate how many requests a single node can serve concurrently for 70B at 8k context. (3) How does prefix-aware routing trade off "cache hit rate" against "load balance"? (4) Why is speculative decoding lossless, and when does it actually get slower? (5) What bottleneck governs TTFT vs TPOT, and how do you optimize each?
Memory division: weights 70B×2B ≈ 140GB (sharded across 8 cards via TP, but still 140GB of the total). Leave ~100GB for activations and framework overhead, so KV Cache has about 640−140−100 ≈ 400GB. At 70B GQA ~0.3MB/token, an average context (3k input + growing output ≈ 4k) is ~1.2GB/request → ~300 concurrent requests.
The bottleneck is memory, not compute: decode is memory-bandwidth-bound and compute mostly idles; what caps "how many concurrent" is KV Cache memory. So saving memory (MQA, KV quantization, PagedAttention) raises concurrency more than adding compute. This is why Character.AI, after cutting KV 20×, says "memory is no longer the bottleneck" — the bottleneck was moved. Note this is an order-of-magnitude estimate, affected by max-length reservation and preemption policy.
Typically prefill insertion interrupting decode. A new request must prefill (compute-bound, saturating compute for tens to hundreds of ms), and that step preempts the in-flight decode batch, spiking the existing requests' TPOT — users see the stream stutter.
Fixes: (1) Chunked prefill — slice long-prompt prefill into small chunks interleaved with decode steps, inserting only a little prefill per step; (2) prefill/decode disaggregation (DistServe) — put them on different GPU pools so they don't interfere; (3) scheduling priority — give decode higher priority to protect TPOT. The lesson: throughput optimization (continuous batching) and tail latency (TPOT jitter) are two goals, and optimizing one hurts the other.
On the surface: half as much prefix prefill saved. But the amplification is harsher. Cost: missed requests must re-prefill, and prefill is compute-bound, saturating compute — the extra prefill steals compute that could have served decode, dropping effective cluster throughput, meaning you must add machines: cost rises non-linearly.
TTFT: on a hit, TTFT is mostly network + a little compute (sub-100ms); on a miss you must fully prefill a 2k+ token prefix, and TTFT can multiply. Plus prefill queuing squeezes everything, worsening p99 further.
Common cause: someone added dynamic content (timestamp / session ID) to the system prompt, breaking the prefix hash — one line of code can crash the hit rate from 80% to 10%. So "prefix stability" is an engineering contract to monitor and guard.
Per-request it's a speedup, but cluster-wide it can lower throughput — the counterintuitive point. Each step runs an extra draft model plus parallel verification of k tokens by the big model, consuming extra compute.
So production often toggles speculative decoding by load: on at low peak (save latency), off at high peak (protect throughput).
KV Cache: not an optional accelerator but a necessary intermediate state of the computation — lose it and you must recompute; there's no "read stale value" consistency problem, only a "doesn't fit, must evict/preempt" capacity problem. Invalidation = reclaimed when the request ends. It's more like CPU registers/working memory than a data cache.
Prefix cache: closer to a traditional cache — reusable, hit-or-miss, a hit saves compute. But its "key" is the token-by-token prefix sequence, and as long as the prefix is identical the result must be identical (deterministic prefill), so there is no staleness — simpler than Redis, with no "write DB first or delete cache first" dilemma. Its hard parts are routing affinity (cache is bound to a specific GPU) and capacity eviction.
Redis data cache: the source data mutates, so the core problems are invalidation and consistency (Day 2's thundering herd, double-delete, eventual-consistency window).
The essential difference: the two LLM caches are immune to consistency problems because "same input → deterministic output", with the hard problems shifted to "memory capacity" and "stateful routing"; the traditional cache's hard problem is "the source data changes". Same name "cache", entirely different constraint dimensions.