The whole secret of inference optimization is feeding a GPU that's starving for memory bandwidth.
// WHY THIS MATTERS
The moment you go from "calling an API" to "hosting your own open model," cost, latency, and throughput become your engineering problem. Most people's first instinct is "buy a beefier GPU"—then they swap in an H100, throughput rises 20%, and money burns. The reason is a counterintuitive fact: token-by-token decoding is not compute-bound, it's memory-bound—the GPU's compute units sit mostly idle; the bottleneck is the bandwidth to move model weights from HBM into the compute units. Grasp this and every downstream optimization (continuous batching, KV cache paging, speculative decoding, quantization) lines up: they all answer one question—how to compute more tokens per single weight load. This issue skips "what is inference" and covers the decisions a senior engineer actually makes: how to locate the bottleneck, the marginal payoff of the four levers, their failure modes, and how to choose between vLLM / TGI / TensorRT-LLM.
// 01
Decode is memory-bound: continuous batching is the first lever
Claim: single-request inference wastes 90%+ of the GPU; the first step to throughput isn't a new card, it's continuous batching.
Background & principle
Inference has two phases with completely different bottlenecks. Prefill (processing the prompt) computes all tokens in parallel—compute-bound. Decode (generating one token at a time) computes just 1 token per step yet must re-stream the entire model weights from HBM—arithmetic intensity is tiny, the Tensor Cores sit mostly idle, and the real bottleneck is memory bandwidth. That's why at batch=1 even the priciest card stays underutilized.
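A back-of-envelope roofline makes the memory-bound claim concrete. The sketch below assumes an 8B-parameter model in fp16 and approximate H100 SXM datasheet figures (about 3.35 TB/s of HBM bandwidth and about 989 TFLOP/s dense BF16). Those numbers and the function names are illustrative assumptions, and the model ignores KV-cache traffic and kernel overheads.

```python
def decode_step_seconds(n_params, bytes_per_param, batch, mem_bw, peak_flops):
    """Lower bound on one decode step: the slower of weight streaming and math."""
    t_memory = n_params * bytes_per_param / mem_bw   # stream all weights once per step
    t_compute = 2 * n_params * batch / peak_flops    # ~2 FLOPs per parameter per token
    return max(t_memory, t_compute)


def decode_tokens_per_second(n_params, bytes_per_param, batch, mem_bw, peak_flops):
    """Upper bound on decode throughput at a given batch size."""
    step = decode_step_seconds(n_params, bytes_per_param, batch, mem_bw, peak_flops)
    return batch / step


# Assumed, approximate figures (not measurements)
N_PARAMS = 8e9        # 8B-parameter model
FP16_BYTES = 2        # bytes per weight
HBM_BW = 3.35e12      # bytes/s
PEAK_FLOPS = 989e12   # FLOP/s

for batch in (1, 8, 64, 512):
    tps = decode_tokens_per_second(N_PARAMS, FP16_BYTES, batch, HBM_BW, PEAK_FLOPS)
    print(f"batch={batch:4d}  upper bound ~{tps:,.0f} tokens/s")
```

Under these assumptions the bound grows linearly with batch size until a few hundred concurrent sequences, where compute time overtakes weight streaming. That crossover is the headroom batching exploits.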
The fix is to pack the same decode step from many requests into one batch: weights are streamed once, a token is computed for N requests simultaneously, and throughput scales near-linearly until the batch is large enough to become compute-bound. But traditional static batching has a fatal flaw: it can't release the batch until the longest sequence in it finishes, so short requests are dragged down by long ones and GPU utilization collapses as generation proceeds. Orca (OSDI'22) introduced iteration-level scheduling (i.e. continuous batching): after every decode step, it evicts finished requests and admits queued ones, so freed slots are refilled immediately. Anyscale reported up to 23× throughput versus a naive HF implementation, a figure that combines continuous batching with vLLM's memory optimizations. This is the single highest-payoff change.
Static batching (wait for longest)        Continuous batching (per-step scheduling)
step→ 1 2 3 4 5 6                         step→ 1 2 3 4 5 6
R1 [██████]................               R1 [██████]R5[████....
R2 [████]..................               R2 [████]R6[██████....
R3 [██████████████]........               R3 [██████████████]....
R4 [██]....................               R4 [██]R7[████]R8[██..
↑ R1/R2/R4 done early                     ↑ a slot frees → fill it now
GPU idles waiting on R3                   GPU stays saturated
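The two schedules can be compared with a toy simulation. This is a minimal sketch, not how Orca or vLLM are implemented: each request is reduced to a number of decode steps, every step costs the same, and the function names are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Static batching: each group of `slots` requests runs until its longest member ends."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """Continuous batching: after every step, drop finished requests and admit queued ones."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        # One token generated per running request; finished requests leave the batch
        running = [remaining - 1 for remaining in running if remaining - 1 > 0]
    return steps


lengths = [3, 2, 7, 1, 2, 3, 2, 1]   # decode steps needed by eight hypothetical requests
slots = 4
work = sum(lengths)
for name, fn in (("static", static_batching_steps), ("continuous", continuous_batching_steps)):
    steps = fn(lengths, slots)
    print(f"{name:10s} steps={steps:2d} slot utilization={work / (steps * slots):.0%}")
```

The gap widens as output lengths become more uneven, which is the usual situation for real traffic.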
Hands-on
# vLLM does continuous batching by default — serve in one line
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-num-seqs 256 # concurrent request cap = how large the batch can grow
# Diagnose whether you're memory-bound / under-batched:
# while load-testing, watch nvidia-smi. If GPU-Util is high but SM
# compute is low, memory bandwidth is maxed, and throughput rises
# near-linearly with batch → you're memory-bound and batching pays off
nvidia-smi dmon -s u # compare sm% vs mem%
Remember the division of labor: throughput comes from bigger batches, latency from smaller batches—a zero-sum pair. Decide which one you're optimizing first.
Failure modes: (1) Forcing large batches in a low-QPS setting—there simply aren't enough concurrent requests to fill the batch, so the gain is ~zero. (2) Chasing throughput while ignoring TTFT (time-to-first-token): bigger batches queue new requests longer, so interactive UX actually degrades. Throughput and latency must be separate targets.
Going deeper · Orca (origin of continuous batching), usenix.org · OSDI'22 ·
Anyscale How continuous batching enables 23x throughput, anyscale.com/blog
// 02
KV cache is the real memory hog: PagedAttention + prefix caching
Claim: what limits batch size is often not the model weights but the KV cache; manage it like virtual memory (paging + prefix reuse) and you can double throughput.
Background & principle
Continuous batching wants larger batches, but the bigger the batch and the longer the sequence, the more memory the KV cache eats—at long context it can exceed the weights themselves. A size estimate (standard formula, not a cited figure):
KV_bytes ≈ 2 × n_layers × n_kv_heads × head_dim
× seq_len × batch × dtype_bytes
# the "2" = K and V; one 8K-context request can reach ~1 GiB on an 8B model at fp16
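As a worked instance of the formula, the sketch below plugs in the published Llama-3.1-8B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128). The function name is illustrative, and real engines add some per-block bookkeeping on top of this figure.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes):
    """KV cache size: the leading 2 accounts for storing both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes


# Assumed model shape: Llama-3.1-8B
LLAMA_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token_fp16 = kv_cache_bytes(**LLAMA_8B, seq_len=1, batch=1, dtype_bytes=2)
per_request_fp16 = kv_cache_bytes(**LLAMA_8B, seq_len=8192, batch=1, dtype_bytes=2)
per_request_fp8 = kv_cache_bytes(**LLAMA_8B, seq_len=8192, batch=1, dtype_bytes=1)

print(f"per token, fp16:       {per_token_fp16 / 2**10:.0f} KiB")
print(f"8K request, fp16:      {per_request_fp16 / 2**30:.2f} GiB")
print(f"8K request, fp8:       {per_request_fp8 / 2**30:.2f} GiB")
print(f"64 x 8K requests fp16: {64 * per_request_fp16 / 2**30:.0f} GiB")
```

At 64 concurrent 8K requests the cache alone is several times larger than the 16 GB of fp16 weights, which is why the cache, not the model, tends to cap the batch.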
The traditional approach pre-reserves contiguous memory for each request at "max length," but actual lengths vary, so 60-80% of memory is wasted on fragmentation and over-reservation—squeezing your achievable batch. PagedAttention (vLLM's core, SOSP'23) borrows the OS's virtual memory: it splits the KV cache into fixed-size blocks (pages), stored non-contiguously and allocated on demand, cutting waste below 4%—effectively doubling the batch and yielding 2-4× throughput.
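The difference between the two allocation schemes is easy to see in numbers. The sketch below is a simplified accounting model, not vLLM's allocator: the request lengths are made up, the block size of 16 tokens is an assumption, and only internal waste is counted.

```python
import math


def reserved_waste(actual_lens, max_len):
    """Fraction of KV memory wasted when every request reserves `max_len` tokens up front."""
    reserved = max_len * len(actual_lens)
    return 1 - sum(actual_lens) / reserved


def paged_waste(actual_lens, block_size=16):
    """Fraction wasted when memory is handed out in fixed-size blocks on demand.

    Only the unused tail of each request's last block is lost.
    """
    allocated = sum(math.ceil(n / block_size) * block_size for n in actual_lens)
    return 1 - sum(actual_lens) / allocated


lens = [100, 350, 40, 900]   # hypothetical sequence lengths, in tokens
print(f"contiguous reservation at 2048 tokens: {reserved_waste(lens, 2048):.0%} wasted")
print(f"paged, 16-token blocks:                {paged_waste(lens):.1%} wasted")
```

Per-request waste under paging is bounded by one block, so it shrinks as sequences get longer, while up-front reservation wastes more the further actual lengths fall below the cap.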
The second trick is prefix caching: if several requests share an identical prefix (system prompt, few-shot examples, a long document), that prefix's KV is computed once and reused across requests. Hosted prompt-caching features such as Anthropic's rest on the same principle: put stable content first and volatile content last, and a cache hit skips the redundant prefill.
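One common way to implement prefix reuse is to key each KV block by a hash of its own tokens chained with the hash of the block before it. The sketch below illustrates that idea only. It is not vLLM's code, and the tiny block size, the token IDs, and the function names are assumptions chosen for readability.

```python
import hashlib


def block_keys(token_ids, block_size=4):
    """Hash each full block together with its parent's hash, so a key encodes the whole prefix."""
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size   # partial tail blocks are not cached
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


def cached_prefix_tokens(cached, new, block_size=4):
    """Number of leading tokens of `new` whose KV could be reused from `cached`."""
    hits = 0
    for old_key, new_key in zip(block_keys(cached, block_size), block_keys(new, block_size)):
        if old_key != new_key:
            break
        hits += 1
    return hits * block_size


system = list(range(8))                  # stand-in for a shared system prompt
req_a = system + [100, 101, 102, 103]
req_b = system + [200, 201, 202, 203]
print(cached_prefix_tokens(req_a, req_b))                    # shared prefix is reusable
print(cached_prefix_tokens([999] + req_a, [998] + req_b))    # one differing leading token
```

Because every key depends on all earlier blocks, a single changed token at the front invalidates every block after it, while a change at the end costs nothing.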
Hands-on
# vLLM: enable prefix caching + quantize KV cache (save half the memory)
# MODEL holds your model ID
vllm serve "$MODEL" \
  --enable-prefix-caching \
  --kv-cache-dtype fp8
# --enable-prefix-caching: KV of identical prefixes is reused
# --kv-cache-dtype fp8: fp8 KV halves KV memory vs fp16 and lifts throughput
# Prompt layout: stable prefix first, volatile last (required for a hit)
[system + tool defs + few-shot] ← byte-identical across reqs → cache hit
[this turn's user input] ← differs each time → prefill only this bit
Failure modes: (1) Prefix caching only hits when the prefix is byte-for-byte identical—stick a timestamp / random ID / shuffled concatenation order into the system prompt and the hit rate drops to zero. (2) fp8 KV cache loses accuracy on long-context, high-precision tasks (math, long reasoning chains); test with your own eval before shipping, don't just read the throughput number.
// 03
Speculative decoding: a small model runs ahead, doubling throughput with identical output
Claim: since decode is memory-bound, verifying a few extra tokens per weight load is nearly free—draft-then-verify is a lossless 2-3× speedup.
Background & principle
The key insight follows from §1: during decode the GPU's compute sits idle, so why not compute the probabilities of several tokens at once. Speculative decoding (Leviathan et al., ICML'23) lets a cheap draft model quickly guess k candidate tokens, then has the big model verify all k in a single parallel forward—accepting the longest correct prefix and resampling at the first disagreement. When the draft guesses well, one step advances several tokens; when it's wrong, you simply fall back to normal speed.
The most counterintuitive part: via rejection sampling, the final output distribution is provably identical to the original model—this is lossless acceleration, not an approximation. The speedup depends on the acceptance rate (how often the draft is right) and the draft/target cost ratio. Medusa (arXiv 2401.10774) goes further: instead of a separate draft model, it adds a few lightweight decoding heads + tree attention so the model speculates on itself, removing the burden of maintaining two models—measured at 2.3-3.6×.
┌─ draft model guesses k ─┐
prompt ──▶ "the cat sat on" ──▶ draft: [" the"," mat"," and"," ran"]
│ (cheap, fast)
▼
big model verifies all 4 at once ──▶ accept " the"," mat" ✓✓
3rd disagrees ✗ → resample " ."
result: one big-model forward advanced 3 tokens (not 1)
distribution identical to original (rejection sampling = lossless)
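The accept-or-resample rule in the diagram can be written in a few lines. This is a minimal sketch over toy three-token distributions, assuming the draft token was sampled from q. The function names are illustrative, and it omits the extra "bonus" token the target model can emit when every draft token is accepted.

```python
import random


def verify_token(token, p, q, rng):
    """Accept a drafted token with probability min(1, p/q), else resample from the residual."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # Residual distribution max(0, p - q); rng.choices normalizes the weights itself
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False


def verify_draft(draft_tokens, p_seq, q_seq, rng):
    """Keep the longest accepted prefix; stop at the first rejection and emit its replacement."""
    out = []
    for token, p, q in zip(draft_tokens, p_seq, q_seq):
        emitted, accepted = verify_token(token, p, q, rng)
        out.append(emitted)
        if not accepted:
            break
    return out


# Empirical check that emitted tokens follow the target p even though drafts come from q
rng = random.Random(0)
p = [0.6, 0.3, 0.1]   # target distribution (toy)
q = [0.2, 0.5, 0.3]   # draft distribution (toy, deliberately poor)
counts = [0, 0, 0]
trials = 50_000
for _ in range(trials):
    drafted = rng.choices(range(3), weights=q)[0]
    emitted, _ = verify_token(drafted, p, q, rng)
    counts[emitted] += 1
print([round(c / trials, 3) for c in counts])
```

The printed frequencies should sit close to p rather than q. A poor draft lowers the acceptance rate, not the output quality.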
Hands-on
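# vLLM: use a small model as draft (must share the target's tokenizer; typically 10× or more smaller)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5 # guess 5; too many backfires
# No draft model handy? Use n-gram speculation (zero extra model): swap the draft flag for
  --speculative-model "[ngram]" --ngram-prompt-lookup-max 4
# Flag names vary by vLLM version; newer releases take these settings as JSON via --speculative-config (check vllm serve --help)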
Failure modes: (1) Gains vanish at high batch—with a large batch the GPU is already compute-bound (§1's idle compute is now full), so there's no spare capacity to verify speculative tokens, and the draft overhead makes it slower. Speculative decoding is a lever for low-concurrency / heavily interactive workloads. (2) If draft and target distributions diverge too much → low acceptance → net loss. Code and structured output have high acceptance (formulaic); free-form creative writing is low.
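The trade-off between acceptance rate, draft length, and draft cost can be estimated with the closed form from Leviathan et al., which assumes each drafted token is accepted independently with probability alpha. The sketch below is a planning aid under that assumption, with illustrative function names. It says nothing about the high-batch regime, where the target forward itself gets slower.

```python
def expected_tokens_per_target_forward(alpha, k):
    """Expected tokens emitted per target-model forward with k drafted tokens.

    Equals (1 - alpha^(k+1)) / (1 - alpha) under i.i.d. acceptance with rate alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha, k, cost_ratio):
    """Wall-clock speedup estimate; cost_ratio = draft step time / target step time."""
    return expected_tokens_per_target_forward(alpha, k) / (k * cost_ratio + 1)


for alpha in (0.3, 0.6, 0.8):
    row = ", ".join(f"k={k}: {estimated_speedup(alpha, k, 0.1):.2f}x" for k in (2, 5, 10))
    print(f"alpha={alpha}: {row}")
```

With a mediocre acceptance rate, a longer draft lowers the estimate because draft cost grows linearly in k while accepted tokens saturate. That is the arithmetic behind keeping the number of speculative tokens small.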
Going deeper · Leviathan et al. Fast Inference via Speculative Decoding, arXiv:2211.17192 ·
Medusa (multiple decoding heads), arXiv:2401.10774
// 04
Choosing an engine & quantization: the vLLM / TGI / TensorRT-LLM trade-offs
Claim: there's no "fastest engine," only the one that matches your workload + ops constraints; quantization is the second-biggest lever, with hidden traps.
Background & principle
All three major open engines implement the core optimizations of §1-§3; they differ in the triangle of peak performance vs deployment flexibility vs onboarding cost:
vLLM: native PagedAttention, fastest iteration, Python-friendly, lowest onboarding. The default starting point for most self-hosting.
TGI (HuggingFace): production-ready, seamless with the HF ecosystem/Endpoints, multi-backend (incl. TRT-LLM). Pick it for stable ops.
TensorRT-LLM (NVIDIA): peak performance on NVIDIA cards, but requires pre-compiling an engine per GPU architecture—heaviest ops, least flexible. Worth it only when you chase max throughput-per-card and your model/hardware are stable.
The second-biggest lever is quantization: compress weights from fp16 to int4/fp8. Because decode is memory-bound (§1), smaller weights = faster streaming = directly faster decode, while freeing memory to enlarge the batch. Weight-only int4 (AWQ / GPTQ) cuts weight memory to roughly a quarter of fp16 and adds speed; fp8 (weights + activations) halves it, needs hardware FP8 support (Ada / Hopper or newer), and loses less precision.
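How much batch headroom quantization frees can be estimated from a simple memory budget. The sketch below assumes a 24 GB card, an 8B model, about 1 GiB of fp16 KV cache per 8K-token request, and a 10% reserve for activations and runtime overhead. All of these are illustrative figures, and real int4 checkpoints carry some extra bytes for scales and zero points.

```python
def max_concurrent_requests(gpu_bytes, n_params, weight_bytes_per_param,
                            kv_bytes_per_request, reserve_fraction=0.1):
    """Requests that fit after weights are loaded, keeping a fraction of memory in reserve."""
    usable = gpu_bytes * (1 - reserve_fraction)
    free_for_kv = usable - n_params * weight_bytes_per_param
    if free_for_kv <= 0:
        return 0   # the weights alone do not fit
    return int(free_for_kv // kv_bytes_per_request)


GPU_BYTES = 24e9          # assumed 24 GB card
N_PARAMS = 8e9            # assumed 8B-parameter model
KV_PER_REQUEST = 2**30    # assumed ~1 GiB per 8K-token request at fp16 KV

for label, bytes_per_param in (("fp16", 2.0), ("fp8", 1.0), ("int4", 0.5)):
    fits = max_concurrent_requests(GPU_BYTES, N_PARAMS, bytes_per_param, KV_PER_REQUEST)
    print(f"{label:5s} weights -> up to {fits} concurrent 8K requests")
```

The batch gain comes on top of the faster weight streaming, so the two effects compound for memory-bound decode.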
Hands-on
# Quick decision sheet
fast validation / self-host / rapid iteration → vLLM
HF ecosystem / stable production ops → TGI
squeeze NVIDIA cards / stable model → TensorRT-LLM
# Quantization picks
memory-tight, must fit a big model → AWQ int4 (weight-only)
Ada / Hopper+, want precision and speed → fp8 (weights + KV cache)
vllm serve "$MODEL" --quantization awq # one flag; MODEL must point at an AWQ-quantized checkpoint
Failure modes: (1) A TensorRT-LLM compiled artifact locks to a GPU arch + config—change cards or the batch/seq cap and you recompile, spiking CI/CD complexity; don't adopt it early for a 10% throughput bump. (2) int4 quantization degrades reasoning / math tasks far more than chit-chat; you must run your own eval—comparing throughput without comparing quality is the most common self-deception.
// Capstone · Stand up an inference service that measures itself
Chain the four points into a weekend project: host an 8B model with vLLM, stack the optimizations one at a time, and measure marginal gains—the only way to turn "I know this" into "I've done this."
Define metrics, measure them separately: TTFT (time-to-first-token), TPOT (time-per-output-token), throughput (total tokens/s). Load-test with vLLM's benchmark_serving.py—never measure throughput with a single request.
Baseline: default continuous batching, sweep --max-num-seqs from 16 to 256, plot the throughput-latency curve, and find your "knee."
Add §2: turn on --enable-prefix-caching, construct a request stream sharing a system prompt, and see how much TTFT drops on a hit.
Add §3: in a low-concurrency setting attach a 1B draft model, measure acceptance rate and TPOT change—then crank concurrency up and watch the speculative gain vanish (confirming §3's failure mode).
Add §4: switch to an AWQ int4 build, compare memory footprint, how much bigger a batch you can fit, and the throughput lift—then run 10 questions with known answers to check whether quality dropped.
Once you've done this loop, any "our inference is N× faster" pitch triggers a reflex: measuring TTFT or throughput? at what concurrency? did quality drop?—instead of being swept along by a single isolated throughput number.
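The three metrics in the first step can be computed from per-token arrival timestamps. This is a minimal sketch assuming you have logged, for each request, the send time and the time each streamed token arrived. The function names and the sample timestamps are made up for illustration.

```python
def request_metrics(sent_at, token_times):
    """TTFT and TPOT for one request, from the send time and per-token arrival times (seconds)."""
    n = len(token_times)
    ttft = token_times[0] - sent_at
    # TPOT excludes the first token, whose latency is dominated by queueing and prefill
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "tokens": n}


def aggregate_throughput(requests):
    """Total output tokens per second across all requests in the load-test window."""
    start = min(sent_at for sent_at, _ in requests)
    end = max(times[-1] for _, times in requests)
    total_tokens = sum(len(times) for _, times in requests)
    return total_tokens / (end - start)


# Two hypothetical overlapping requests
requests = [
    (0.0, [0.50, 0.60, 0.70, 0.80]),
    (0.1, [0.90, 1.00, 1.10, 1.20, 1.30]),
]
for sent_at, times in requests:
    print(request_metrics(sent_at, times))
print(f"throughput: {aggregate_throughput(requests):.2f} tokens/s")
```

Keeping TTFT and TPOT per request while computing throughput over the whole window is what lets you see the knee: throughput keeps rising with concurrency after per-request latency has already started to degrade.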
// GLOSSARY
Prefill
The phase that processes the input prompt; all tokens computed in parallel, compute-bound.
Decode
The phase that generates output tokens one at a time; re-streams all weights per step, memory-bound.
Memory-bound
Bottlenecked by memory bandwidth, not compute—the essence of LLM decoding and the thesis behind every downstream optimization.
Continuous Batching
Iteration-level scheduling: add/evict requests every decode step so the GPU never idles.
KV Cache
Cached Key/Value of past tokens to avoid recompute; the main memory consumer at long context.
PagedAttention
Manages the KV cache like OS virtual-memory pages, eliminating fragmentation; vLLM's core.
Prefix Caching
Reuse the KV of an identical prefix across requests; the server-side principle of prompt caching.
Speculative Decoding
A draft model guesses, the target model verifies in parallel; lossless 2-3× speedup.
TTFT / TPOT
Time-To-First-Token / Time-Per-Output-Token, the two core interactive-latency metrics.
Weight-only Quantization
Quantize only the weights (e.g. AWQ int4), saving memory and speeding up memory-bound decode.
// DEEP THINKING
Continuous batching raises throughput and so does speculative decoding—why can't you just stack them?
Because both fight over the same resource: idle compute. Continuous batching fills that idle compute with a large batch, pushing the system toward compute-bound; speculative decoding needs spare compute to verify speculative tokens in parallel to pay off. Once the batch is large, there's nowhere to compute the verification tokens, so the draft overhead is pure loss. Rule of thumb: high-concurrency services rely on batching, while low-concurrency / single-user interaction (e.g. a local coding agent) relies on speculative decoding. Production systems often switch dynamically by current load rather than running both wide open.
Why does prompt caching save cost and latency but only for the prefix—not for identical fragments in the middle or end?
Because attention is causal: each token's KV representation depends on all tokens before it. Identical prefix → that KV's compute context is identical → directly reusable. But an identical fragment in the middle is preceded by different tokens, so its KV is "colored" by a different context and can't be reused. That's why engineering practice puts stable content (system / tool defs / few-shot) strictly first and volatile content last—this is a physical constraint on prefix-cache hit rate, not a convention.
Speculative decoding claims to be "lossless"—the output distribution is identical to the original model. How can that not trade quality for speed?
The key is the math of rejection sampling: draft proposal distribution q, target true distribution p. For each speculative token x, accept with probability min(1, p(x)/q(x)); on rejection, resample from the corrected distribution (p−q)⁺, i.e. max(0, p−q) renormalized to sum to 1. It can be proven the final sample follows p exactly. So the speedup comes from compressing several serial steps into one parallel forward, not from approximation. The better the draft (q closer to p), the more is accepted and the faster it runs. Even with a wildly wrong draft, output quality is unchanged; it just degrades to roughly normal speed, minus the draft overhead. Speed is the wager, quality is guaranteed.
If "decode is memory-bound," why can Mixture-of-Experts (MoE) models scale up without adding activated compute? Does that contradict memory-bound?
No contradiction—it reinforces it. MoE activates only a small subset of experts per token, so activated FLOPs don't grow linearly with total params—but all expert weights must still reside in memory, and the portion that is streamed still goes through HBM. So MoE pushes the bottleneck further toward "memory capacity + bandwidth + irregular access from expert routing." That's why the core challenge of MoE inference is memory and expert parallelism / scheduling, not compute. The memory-bound lens explains exactly where MoE's engineering weight sits.
If you could give a team only one inference-optimization tip, and their QPS is very low (internal tool, single-digit concurrency), what would you say?
"Don't touch batching—do speculative decoding + quantization first." At low QPS continuous batching has no requests to pack, so throughput optimization is nearly useless; what hurts users is per-request latency. Here the GPU's compute is largely idle—ideal soil for speculative decoding to cut TPOT directly. Layer on weight-only int4 quantization to both speed up the memory-bound decode and let smaller cards run it. Save the expensive multi-card batching strategy for truly high-concurrency production. Using the wrong lever wastes more money than not optimizing at all.