Day 32 Hard LLM Serving Continuous Batching KV Cache Cost/Latency/Quality

LLM Serving — Keep the GPU Fed, Keep Latency DownContinuous Batching, PagedAttention, Prompt Caching, Cost/Latency/Quality

Problem & Constraints

Build an inference platform for a 10M-DAU chat product: a 70B dense model on 8×H100 nodes (tensor parallel). Users average 15 turns/day, with 3k input + 400 output tokens per turn, peaking around 5k requests/sec. This is not "call an API" — an H100 costs $2–4/hour and GPUs are the dominant cost, so raising utilization from 30% to 80% literally halves the bill.

The SLO splits in two, because LLMs return token by token, streaming: TTFT (time-to-first-token) p99 < 700ms decides "does it feel laggy"; TPOT (time-per-output-token) p99 < 40ms (≈25 tok/s, faster than reading) decides "does it stream smoothly". These two metrics rest on two completely different compute profiles, and the whole architecture is a set of trade-offs around them.

High-Level Architecture

graph LR
    C["Client
SSE streaming"]
    GW["API Gateway
auth · rate-limit · metering"]
    R["Router
prefix-aware LB"]
    PC[("Prompt Cache
prefix KV reuse")]
    subgraph Engine["Inference Engine (vLLM/TRT-LLM)"]
      PF["Prefill pool
compute-bound"]
      DC["Decode pool
memory-bound"]
      KV[("KV Cache
PagedAttention")]
    end

    C -->|1 prompt| GW --> R
    R -.prefix hit?.-> PC
    R -->|2 route| PF
    PF -->|KV transfer| DC
    PF <--> KV
    DC <--> KV
    DC -.->|3 token stream| GW -.SSE.-> C

    classDef client fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef edge fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef cache fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef gpu fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class C client
    class GW,R edge
    class PC,KV cache
    class PF,DC gpu

Requests check the prefix cache first, then enter the engine; inside, prefill and decode differ, and tokens stream back one by one over SSE

The gateway handles auth, rate limiting (per-token, not per-request), and usage metering. The router is LLM-specific: it does "prefix-aware" routing — sending requests with the same system prompt to the same GPU to reuse cached KV. Inside the engine, inference splits into prefill (process the whole prompt at once) and decode (autoregressively generate token by token), both contending for the same KV Cache memory.

Key Technical Points

1. Two-Phase Inference + Continuous Batching: The Throughput Lever

Core trade-off: static batching makes fast requests wait for slow ones and leaves the GPU idle; continuous batching swaps requests in and out every token step, saturating utilization.

Principle: LLM inference has two phases. Prefill runs the whole prompt in parallel and computes KV for every token — it is compute-bound (can saturate the GPU). Decode generates 1 token at a time but must stream the entire model weights from memory into compute — it is memory-bandwidth-bound (compute mostly idle). A single request's decode might use only a few percent of the GPU — the only redemption is to batch many requests' decode together, loading weights once and amortizing across the whole batch.

But classic "static batching" has a fatal flaw: in one batch, some requests finish after 10 tokens, others need 2000, and finished slots just wait for the whole batch to complete. Continuous batching (iteration-level scheduling) drops scheduling granularity from "whole request" to "single token iteration": after each token, rebatch — finished requests leave immediately, queued ones join immediately.

Trade-off:

Static batching: ✅ simple; ❌ dragged by the longest request, GPU utilization often < 30%, and you wait to fill a batch before starting (bad TTFT).
Continuous batching: ✅ utilization up to 70–80%, new requests don't wait; ❌ complex scheduler, prefill insertion interrupts in-flight decode (TPOT jitter, mitigated by chunked prefill).
Large vs small batch: bigger batch = higher throughput but more KV Cache memory and slower per-request TPOT — throughput and per-request latency are inherently opposed.

# Continuous batching scheduler loop (pseudo)
running = []                       # requests currently decoding
while True:
    # 1) admit new requests for prefill while there's room
    while can_admit(running) and queue:
        req = queue.pop()
        req.kv = prefill(req.prompt)   # compute full prompt KV
        running.append(req)
    # 2) batch the "next token" of all running requests together
    logits = decode_step(running)      # weights loaded once, amortized
    for req in running:
        tok = sample(logits[req])
        stream_out(req, tok)           # push to client over SSE now
        if tok == EOS or req.len >= req.max:
            release_kv(req); running.remove(req)   # free seat, reclaim KV

Real-world cases:

ORCA (OSDI 2022, SNU/FriendliAI): introduced iteration-level scheduling + selective batching, an order-of-magnitude throughput gain over FasterTransformer at equal latency — the foundational continuous-batching paper.
vLLM: combined continuous batching with PagedAttention into the de-facto open-source standard adopted across countless platforms.
NVIDIA TensorRT-LLM: productized the same idea as "in-flight batching" for H100 deployments.

2. KV Cache Is the First-Order Bottleneck: PagedAttention & Compression

Core trade-off: how much memory KV Cache takes directly decides how many requests you can run concurrently and how big the batch can be — it, not compute, is the real ceiling of most serving systems.

Principle: each autoregressive step must look back at the K/V of all prior tokens; to avoid recompute you cache them. KV memory per token ≈ 2 × layers × kv_heads × head_dim × bytes. For a 70B GQA model that's about 0.3 MB/token, so a single 8k-context request eats ~2.5 GB. The 70B weights alone are 140GB in fp16 (multi-GPU); whatever memory remains is divided up by KV Cache — so "how many concurrent requests" is really a memory division problem. Worse, naive implementations reserve "max length" of contiguous memory per request, wasting 60–80% to fragmentation and over-reservation.

PagedAttention borrows the OS virtual-memory idea: slice KV Cache into fixed-size blocks (pages), logically contiguous but physically scattered, mapped via a page table. Allocate on demand, near-zero fragmentation, and requests with identical prefixes can share the same physical pages (copy-on-write). This multiplies usable batch size and throughput soars.

Ways to compress KV (each with a cost):

GQA / MQA: multiple query heads share one set of KV heads, shrinking KV linearly (MQA is most aggressive). ✅ saves memory and bandwidth; ❌ slight expressivity loss, must be adopted at training time.
KV quantization (int8/fp8): store KV at low precision. ✅ halves memory; ❌ may lose precision in long contexts, needs calibration.
PagedAttention paging: ✅ kills fragmentation, enables prefix sharing; ❌ kernels must handle non-contiguous memory, complex to implement.

Real-world cases:

Character.AI: uses MQA in all attention layers (shrinking KV another 8× vs mainstream GQA) and natively trains weights/activations/KV in int8, cutting KV Cache by 20×+, so "memory is no longer the bottleneck for large batches".
vLLM PagedAttention (SOSP 2023 best paper): up to 24× throughput over HuggingFace Transformers by pushing memory waste from 60–80% down to single digits.
Mainstream open models (Llama, Mistral, etc.) widely adopt GQA precisely for serving-time KV economy.

3. Prompt Caching + Prefix-Aware Routing: Skip the Repeated Prefill

Core trade-off: most requests share a long prefix (system prompt, few-shot, RAG docs) — caching that prefix's KV saves both expensive prefill compute and TTFT, at the cost of routing that must be "sticky".

Principle: an agent app's system prompt might be 2k tokens; re-prefilling it per request is pure waste. Prefix caching keeps the prefix's computed KV in memory (or tiered to CPU/SSD); later requests whose prefix matches token-for-token reuse it and skip that prefill. But the cache lives in a specific GPU's memory — which demands prefix-aware routing: send same-prefix requests to the node holding the cache, or the cache is useless. This is the fundamental difference from traditional stateless LBs — it is stateful and affinity-based.

# Prefix-aware routing (pseudo): same prefix to same node
def route(req):
    prefix_hash = hash(req.system_prompt + req.shared_context)
    node = consistent_hash_ring.get(prefix_hash)   # prefix affinity
    if node.kv_load > HIGH_WATERMARK:              # but guard hotspots
        node = least_loaded(replicas_of(prefix_hash))
    return node
# Key point: a cache hit saves prefill, but you must put the breakpoint at the
# END of the STABLE prefix — change any token in it (e.g. a timestamp) and the
# prefix hash no longer matches, invalidating everything.

Trade-off: pure round-robin is simple but prefix cache hit rate ≈ 0; pure prefix affinity has high hit rate but creates hotspots (a popular system prompt overloads one node). Production is a "prefix affinity + load-threshold fallback" hybrid, dynamically balancing hit rate vs evenness.

Real-world cases:

Anthropic Claude API: explicit prompt caching with up to 4 cache breakpoints, 5-min / 1-hour TTL; cached portions billed at a steep discount, with a minimum cacheable token threshold — putting the breakpoint at "the end of the stable prefix" is the key trick.
vLLM Automatic Prefix Caching: reuses identical-prefix physical pages via PagedAttention's page hashing, out of the box.
Streaming (SSE): every major chat API uses Server-Sent Events to push tokens one by one, making TTFT the user-perceived latency — once the first token appears, the user starts reading, "hiding" the generation time of the next few hundred tokens inside reading.

4. The Cost / Latency / Quality Triangle: The Architect's Master Trade-off

Core trade-off: a stronger model is higher quality but slower and costlier; these three form an impossible triangle, and engineering pushes the "cost-latency" frontier outward without sacrificing quality.

Principle: you can't get "strongest model + lowest latency + lowest cost" at once. But several tricks push the Pareto frontier out:

Model routing / cascade: simple queries go to a small model, escalating to the big one only when uncertain. Most traffic is caught by the cheap model, with "hard-case escalation" backstopping quality.
Speculative decoding: a small "draft model" guesses k tokens at once, the big model verifies in parallel and accepts the matching prefix. Output distribution is identical to the big model (lossless), but it collapses several serial decode steps into one — 2–3× speedup.
Prefill/decode disaggregation: prefill and decode have opposite compute profiles; co-located they interfere. Splitting them onto separate GPU pools, each with its own parallelism and batch size, improves both TTFT and TPOT.
Quantization (fp8/int8/int4): lower weight precision for memory and speed; quality loss must be evaluated per task.

# Speculative decoding (pseudo): draft guesses, big model verifies
def speculative_step(ctx):
    draft = small_model.generate(ctx, k=4)        # cheaply guess 4 tokens
    logits = big_model.verify(ctx, draft)         # one forward verifies 4 in parallel
    accepted = longest_matching_prefix(draft, logits)  # accept matching prefix
    return accepted + [sample(logits[len(accepted)])]  # advance at least 1
# Better draft = higher accept rate = more speedup; a weak draft wastes compute
# -> choosing the draft model is the crux.

Real-world cases:

Speculative decoding (Leviathan et al., ICML 2023): 2–3× speedup on T5-XXL with output bit-identical to the original model; now a standard capability in vLLM / TensorRT-LLM.
DistServe (OSDI 2024): prefill/decode disaggregation serves 7.4× more requests or meets 12.6× tighter SLOs versus co-located systems, within latency constraints.
Major API platforms commonly offer tiered model families (e.g. Haiku/Sonnet/Opus) so callers pick their own point on the triangle per use case.

Scaling & Optimization

Multi-model / multi-LoRA co-location: one base + many LoRA adapters serving multi-tenant on the same GPU saves cards; scheduling must batch across adapter switches.
Tiered KV Cache: hot prefixes in HBM, warm ones tiered to CPU memory / local NVMe, cross-node sharing via KV transfer — extending "cache hit" from single-node to cluster.
Autoscaling: GPU cold start is slow (loading 70B weights takes tens of seconds to minutes), so scaling must be predictive not reactive, paired with request queuing and priorities.
Chunked prefill: slice long-prompt prefill and interleave with decode to flatten TPOT jitter.
Multi-region: weights resident per region, geo-routing cuts TTFT; prefix caches are hard to share across regions — a trade-off.

Pitfalls + Interview Questions

1. Watching throughput, ignoring the latency distribution. A big batch looks great on throughput, but TPOT slows and p99 TTFT explodes. You must explain that "throughput vs per-request latency" are two faces of the same resource, and cap batch size by SLO.

2. Assuming LLM load balancing = stateless round-robin. KV Cache and prefix cache make nodes stateful and affinity-bound. Round-robin drives prefix cache hit rate to zero. This is the most common follow-up.

3. Confusing compute-bound vs memory-bound. Prefill lacks compute, decode lacks bandwidth. The fixes are entirely different (decode → bigger batch, prefill → chunking/parallelism); getting it wrong exposes a shallow grasp.

4. Writing an unstable prompt prefix. Stuffing a current timestamp or random ID into the system prompt makes the prefix hash never match, so the cache is built for nothing. Put the cache at the token-stable end of the prefix.

5. Sizing capacity by averages. Input/output lengths are long-tailed; a few ultra-long contexts can exhaust KV memory, triggering preemption or OOM. Design by distribution with admission control.

Common interview questions: (1) Why is single-request GPU utilization so low during decode, and how does continuous batching save it? (2) Estimate how many requests a single node can serve concurrently for 70B at 8k context. (3) How does prefix-aware routing trade off "cache hit rate" against "load balance"? (4) Why is speculative decoding lossless, and when does it actually get slower? (5) What bottleneck governs TTFT vs TPOT, and how do you optimize each?

Further Resources

vLLM: Efficient Memory Management with PagedAttention (Kwon et al., SOSP 2023) + the vLLM blog — the foundational KV Cache paging work.
Orca (Yu et al., OSDI 2022, USENIX) — the original iteration-level scheduling / continuous batching paper.
DistServe (Zhong et al., OSDI 2024, USENIX) — prefill/decode disaggregation.
Fast Inference from Transformers via Speculative Decoding (Leviathan et al., ICML 2023).
Anthropic Prompt Caching docs and Character.AI "Optimizing AI Inference" engineering blog — production prompt caching and KV compression.

Going Deeper (click to expand)

1. Given 8×H100 (640GB), a 70B fp16 model, estimate theoretical concurrency. Is the bottleneck compute or memory?

Memory division: weights 70B×2B ≈ 140GB (sharded across 8 cards via TP, but still 140GB of the total). Leave ~100GB for activations and framework overhead, so KV Cache has about 640−140−100 ≈ 400GB. At 70B GQA ~0.3MB/token, an average context (3k input + growing output ≈ 4k) is ~1.2GB/request → ~300 concurrent requests.

The bottleneck is memory, not compute: decode is memory-bandwidth-bound and compute mostly idles; what caps "how many concurrent" is KV Cache memory. So saving memory (MQA, KV quantization, PagedAttention) raises concurrency more than adding compute. This is why Character.AI, after cutting KV 20×, says "memory is no longer the bottleneck" — the bottleneck was moved. Note this is an order-of-magnitude estimate, affected by max-length reservation and preemption policy.

2. Continuous batching saturated the GPU, so why do users still complain of "a stutter mid-typing"?

Typically prefill insertion interrupting decode. A new request must prefill (compute-bound, saturating compute for tens to hundreds of ms), and that step preempts the in-flight decode batch, spiking the existing requests' TPOT — users see the stream stutter.

Fixes: (1) Chunked prefill — slice long-prompt prefill into small chunks interleaved with decode steps, inserting only a little prefill per step; (2) prefill/decode disaggregation (DistServe) — put them on different GPU pools so they don't interfere; (3) scheduling priority — give decode higher priority to protect TPOT. The lesson: throughput optimization (continuous batching) and tail latency (TPOT jitter) are two goals, and optimizing one hurts the other.

3. Prompt-cache hit rate drops from 80% to 40%. What happens to cost and TTFT, and why is it non-linear?

On the surface: half as much prefix prefill saved. But the amplification is harsher. Cost: missed requests must re-prefill, and prefill is compute-bound, saturating compute — the extra prefill steals compute that could have served decode, dropping effective cluster throughput, meaning you must add machines: cost rises non-linearly.

TTFT: on a hit, TTFT is mostly network + a little compute (sub-100ms); on a miss you must fully prefill a 2k+ token prefix, and TTFT can multiply. Plus prefill queuing squeezes everything, worsening p99 further.

Common cause: someone added dynamic content (timestamp / session ID) to the system prompt, breaking the prefix hash — one line of code can crash the hit rate from 80% to 10%. So "prefix stability" is an engineering contract to monitor and guard.

4. Speculative decoding claims lossless 2–3× speedup. When does it instead slow down or even lower total throughput?

Per-request it's a speedup, but cluster-wide it can lower throughput — the counterintuitive point. Each step runs an extra draft model plus parallel verification of k tokens by the big model, consuming extra compute.

Inaccurate draft (low accept rate): when the draft and big model distributions differ a lot, most guesses are rejected, wasting verification compute, netting slower.
Large-batch high-load regimes: at low load the GPU compute idles, so spending it on speculation to cut latency is a great deal; but at high load with an already-large batch, decode is near compute-saturated, the extra speculation has nowhere to fit and actually lowers total throughput — it trades idle compute for latency, and with no idle compute it's a net loss.

So production often toggles speculative decoding by load: on at low peak (save latency), off at high peak (protect throughput).

5. Apply Day 2's "caching" lens to LLM serving: how do KV Cache, prefix cache, and a traditional Redis cache differ on "invalidation" and "consistency"?

KV Cache: not an optional accelerator but a necessary intermediate state of the computation — lose it and you must recompute; there's no "read stale value" consistency problem, only a "doesn't fit, must evict/preempt" capacity problem. Invalidation = reclaimed when the request ends. It's more like CPU registers/working memory than a data cache.

Prefix cache: closer to a traditional cache — reusable, hit-or-miss, a hit saves compute. But its "key" is the token-by-token prefix sequence, and as long as the prefix is identical the result must be identical (deterministic prefill), so there is no staleness — simpler than Redis, with no "write DB first or delete cache first" dilemma. Its hard parts are routing affinity (cache is bound to a specific GPU) and capacity eviction.

Redis data cache: the source data mutates, so the core problems are invalidation and consistency (Day 2's thundering herd, double-delete, eventual-consistency window).

The essential difference: the two LLM caches are immune to consistency problems because "same input → deterministic output", with the hard problems shifted to "memory capacity" and "stateful routing"; the traditional cache's hard problem is "the source data changes". Same name "cache", entirely different constraint dimensions.

← Back to index