AI/ML Deep Dive: Inference Optimization

Day 9 · 2026-05-26
For: experienced engineers outside the AI/ML field

KV Cache

memory · foundational
One-line analogy

KV cache is memoization for autoregressive generation: structurally identical to "don't re-compute solved subproblems" in dynamic programming. Backend analogues: keyed state in a stream processor (Flink's keyed state), materialized views in a database, the session state on an HTTP keep-alive connection. Without it, every new token re-runs the model over all prior tokens, so the total attention work for an n-token sequence grows from O(n²) to O(n³).

What it solves + how it works

LLM inference is autoregressive: generating token #100 requires attention against tokens 1–99. The attention formula is softmax(Q·Kᵀ/√d)·V, where Q (Query) is the current token's "asking vector", K (Key) is each historical token's "label vector", and V (Value) is each historical token's "content vector". Intuition: Q is a query, K is an index, V is the indexed payload.
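As a minimal sketch of the formula above, the pure-Python snippet below computes attention for one query against three historical tokens. The vectors are made-up toy values and the helper names (`softmax`, `attend`) are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Single-query attention: softmax(q.K^T / sqrt(d)).V"""
    d = len(q)
    # One dot product per historical token: how well does q match each key?
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

# Toy data (hypothetical): 3 historical tokens, vector size d = 2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
q = [1.0, 1.0]

weights, out = attend(q, keys, values)
print(weights)  # the third key matches q best, so it gets the largest weight
print(out)
```

Note that `keys` and `values` are exactly what a decode step needs from every earlier token, which is why they are worth storing.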

Key observation: when generating token #100, the K and V of tokens 1–99 do not change — each token's K/V depends only on what came before it. So K/V is computed once and reused forever; only the new token's Q/K/V is new work. Storing all historical K/V is the KV cache.
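The saving described above can be counted directly. The sketch below is a toy, not a real transformer: `project_kv` is a hypothetical stand-in for one layer's K/V projections, and it assumes a token's K/V never changes once computed (which is what causal masking guarantees in a real model).

```python
calls = {"n": 0}  # counts how many K/V projections were performed

def project_kv(token_id):
    """Hypothetical stand-in for one layer's learned K and V projections."""
    calls["n"] += 1
    return (token_id * 2.0, token_id * 3.0)  # toy (key, value) pair

def decode_without_cache(tokens):
    # Every step rebuilds K/V for the whole prefix: 1 + 2 + ... + n projections.
    kv = []
    for step in range(1, len(tokens) + 1):
        kv = [project_kv(t) for t in tokens[:step]]
    return kv

def decode_with_cache(tokens):
    # Every step projects only the newest token and appends it: n projections.
    cache = []
    for t in tokens:
        cache.append(project_kv(t))
    return cache

tokens = list(range(100))

calls["n"] = 0
kv_uncached = decode_without_cache(tokens)
calls_uncached = calls["n"]

calls["n"] = 0
kv_cached = decode_with_cache(tokens)
calls_cached = calls["n"]

print(calls_uncached, calls_cached)  # 5050 100
```

Both paths end with the same K/V list; the cached path simply never repeats a projection.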

Two-phase inference (Prefill vs Decode)

Prefill processes the entire prompt at once (e.g., 1000 tokens), computing K/V in parallel and writing the cache
→ compute-bound: GPU matrix multiplies saturate

Decode emits 1 new token per step, reusing 1000 historical K/V entries from cache
→ memory-bandwidth-bound: moving KV is much slower than computing it

Memory cost: per token ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × 2 bytes (FP16)
Llama-3 70B (80 layers, 8 KV heads under grouped-query attention, head_dim 128): ~320 KB per token, 10K context = 3.2 GB, batch=32 → 100 GB+
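The arithmetic behind these figures can be reproduced in a few lines. This sketch assumes Llama-3 70B's published configuration (80 layers, head_dim 128, and 8 key/value heads under grouped-query attention, which is the head count that reproduces the ~320 KB figure); the function name is illustrative.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Leading 2 = one K vector and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_request = per_token * 10_000   # 10K-token context
per_batch = per_request * 32       # 32 concurrent requests

print(per_token)           # 327680 bytes, i.e. 320 KiB
print(per_request / 1e9)   # about 3.3 GB
print(per_batch / 1e9)     # about 105 GB

# For contrast: caching all 64 attention heads (no grouped-query attention)
# would cost 8x as much per token.
per_token_all_heads = kv_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
```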

This is why LLM inference is bottlenecked by memory bandwidth, not compute — a counter-intuitive but critical fact. The other three optimizations on this page (speculative decoding, continuous batching, quantization) all exist to "squeeze more out of bandwidth".

Code example
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
m = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", torch_dtype=torch.float16).cuda()

input_ids = tok("The capital of France is", return_tensors="pt").input_ids.cuda()
past_key_values = None  # ← KV cache, starts empty

for _ in range(20):
    with torch.no_grad():
        out = m(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    past_key_values = out.past_key_values  # ← accumulate cache
    input_ids = next_id  # ← next round feeds only the "new token"; the rest is in cache
    if next_id.item() == tok.eos_token_id: break
# To feel the difference, drop the cache and re-feed the full sequence every step: same tokens, and the slowdown grows with sequence length
Common pitfall + practical scenario
"KV cache is enough — we're fast now" — wrong. In production the cache itself becomes the bottleneck: long contexts don't fit in VRAM, and common prefixes across requests get duplicated. That's why 2023 brought PagedAttention (vLLM) — borrowing OS virtual-memory paging to manage the KV cache — and Prefix Caching to share common prefix KV across requests. The entire body of "memory allocator" knowledge moved into LLM inference engines.
📌 BigCat scenario: when running local LLMs (LM Studio / Ollama), longer context = slower generation isn't your imagination — it's the KV cache being shuttled across memory bandwidth on every step. This is why "compress history then ask next question" beats "blindly grow context" — you're directly reducing KV traffic.
Takeaway + reflection
💡 LLM inference is a memory bandwidth problem, not a compute problem. Internalize this and every other optimization becomes obvious.
🤔 What backend systems have you worked on that looked CPU-tight but were actually I/O- or memory-bandwidth-tight? What do the optimization strategies share?

Speculative Decoding

latency · parallel
One-line analogy

The same trick as CPU branch prediction + speculative execution: let a cheap "small model" run ahead and propose several tokens, then have the "big model" verify them in parallel in a single pass. If the guess holds, you've banked free tokens; if it fails, you roll back. Backend analogues: optimistic concurrency (write first, conflict-check later) and predicate pushdown (use a cheap filter to drop most rows before the expensive operator).

What it solves + how it works

During decode, each new token requires streaming the entire 70B model's weights from VRAM to the compute units: 140 GB in FP16, which costs ~50 ms per token at H100-class bandwidth (~3 TB/s; in practice the weights are sharded across several cards, since one 80 GB H100 cannot hold them). The bottleneck is bandwidth, not compute, meaning GPU FLOPs sit largely idle. Speculative decoding's insight: since the bandwidth cost of one forward pass is fixed, verifying 5 candidate tokens at once is nearly free, parallelism for almost no extra cost.
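The ~50 ms figure is just bytes moved divided by memory bandwidth. A minimal sketch, assuming ~3 TB/s of HBM bandwidth (an H100-class number) and ignoring KV-cache traffic and multi-GPU sharding; the function name is illustrative.

```python
def decode_floor_ms(n_params, bytes_per_param, bandwidth_bytes_per_s):
    # Lower bound on per-token decode time: every weight is read once per step.
    return n_params * bytes_per_param / bandwidth_bytes_per_s * 1e3

fp16_ms = decode_floor_ms(70e9, 2.0, 3e12)   # 16-bit weights
int4_ms = decode_floor_ms(70e9, 0.5, 3e12)   # 4-bit weights

print(round(fp16_ms, 1))  # 46.7
print(round(int4_ms, 1))  # 11.7
```

The floor does not depend on how many tokens are scored in that pass, which is the slack speculative decoding spends.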

Three steps:

  • ① Draft: a 1B-class small model sequentially generates K tokens (typically K=4-8). It's 10-20x faster, so K serial steps are cheap.
  • ② Verify: feed all K tokens to the big model in one forward pass, which computes — for every position — what the big model itself would have predicted.
  • ③ Accept / Reject: walk left-to-right; accept consecutive matches; at the first mismatch, replace with the big model's prediction and discard the rest of the draft.
One speculative-decoding round

Draft proposes 5 candidates: The · cat · sat · on · mat
Big model verifies in one forward pass: The ✓ · cat ✓ · sat ✓ · a (replaces "on") · mat (discarded)
Result: 1 big-model call yields 4 tokens (vs 4 calls in vanilla decode)

Key invariant: output distribution is identical to the original big model (provable) — lossless speedup, not approximation
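The accept / reject walk for the greedy case can be sketched as follows. This is a simplified illustration, not any library's implementation: `target_preds[i]` is assumed to be the big model's own greedy choice at position i given the prompt plus `draft[:i]`, all obtained from one forward pass, and the last entry is the extra token the big model yields when every draft token is accepted. Sampling-based decoding uses a probabilistic acceptance rule instead of the equality check.

```python
def accept_draft(draft, target_preds):
    """Greedy speculative verification.

    draft:        K tokens proposed by the small model
    target_preds: K + 1 tokens, the big model's prediction at each position
    """
    accepted = []
    for i, token in enumerate(draft):
        if token == target_preds[i]:
            accepted.append(token)            # guess confirmed, keep it
        else:
            accepted.append(target_preds[i])  # first mismatch: take the big model's token
            return accepted                   # everything after it is discarded
    accepted.append(target_preds[len(draft)])  # all K accepted: one bonus token
    return accepted

draft = ["The", "cat", "sat", "on", "mat"]
target_preds = ["The", "cat", "sat", "a", "mat", "."]

result = accept_draft(draft, target_preds)
print(result)  # ['The', 'cat', 'sat', 'a']
```

The big model's later predictions (`"mat"`, `"."`) were conditioned on the rejected `"on"`, so they cannot be reused and are thrown away.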

Math: with acceptance rate α and K draft tokens, expected yield per round is (1 − α^(K+1)) / (1 − α) tokens. At α=0.7, K=5, that's ~2.9 tokens per big-model call, a theoretical 2.9x speedup. In practice 1.5-3x, with the remainder going to draft-model overhead. Medusa / EAGLE replace the standalone draft with lightweight extra heads on the big model itself, eliminating draft overhead and pushing speedup to 3-4x.
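The yield formula is easy to tabulate. The sketch below assumes, as the formula does, that each draft token is accepted independently with the same probability α; the function name is illustrative.

```python
def expected_tokens(alpha, k):
    """Expected tokens per big-model call: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

yield_k5 = expected_tokens(0.7, 5)
print(round(yield_k5, 2))  # 2.94

# The yield is capped at 1 / (1 - alpha) no matter how large K gets,
# while the draft cost keeps growing with K.
for k in (1, 2, 5, 10, 20):
    print(k, round(expected_tokens(0.7, k), 2))
```

At α = 0.7 the ceiling is 1/0.3 ≈ 3.3 tokens per call, so raising K beyond a handful buys very little.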

Code example
# HuggingFace transformers' built-in assisted generation (the assistant_model argument) is speculative decoding
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

big   = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-70B", torch_dtype=torch.float16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B",  torch_dtype=torch.float16, device_map="auto")
tok   = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B")

prompt = tok("Write a haiku about caching:", return_tensors="pt").to("cuda")

# assistant_model = the draft — big and draft must share family (same tokenizer + vocab)
out = big.generate(
    **prompt,
    assistant_model=draft,       # ← the key flag
    max_new_tokens=100,
    do_sample=False,
)
print(tok.decode(out[0]))
# Greedy output matches the no-assistant run (up to floating-point noise); typical speedup 1.5-3x.
# vLLM / TensorRT-LLM / SGLang have production-grade impls (PagedAttn + speculative fused)
Common pitfall + practical scenario
"Larger K must mean more speed" is wrong. A draft token only counts if every token before it was accepted, so the expected gain flattens while draft cost keeps growing with K; past a point you waste work. Hidden gotcha: speculative gains shrink as batch size grows, because batching already spends the idle compute that parallel verification relies on (the weights are read once per step and shared by the whole batch). Production rule: low batch + long output (interactive chat) benefits most; high batch + short output (bulk classification) gets almost nothing, so turn it off.
📌 BigCat scenario: when running a 70B model locally, pair it with a same-family 1B draft model if your runtime supports one (llama.cpp and LM Studio expose a draft-model option; check your Ollama version before assuming it does). Expect up to ~2x on "tokens per second"; "time-to-first-token" is set by prefill and does not improve. It's one of the lowest-effort speedups in your personal AI super-individual workflow.
Takeaway + reflection
💡 Speculative decoding is lossless acceleration: the output distribution is identical to the original model's (the same tokens under greedy decoding, up to floating-point noise); you simply traded idle compute for lower latency.
🤔 "Cheap approximation runs ahead, expensive verifier checks" — where else in your work or decision-making does this pattern fit?

Continuous Batching

throughput · scheduling
One-line analogy

The paradigm jump from a blocking thread pool to an async event loop: same kernel idea as Node.js / Netty / asyncio. Static batching has a weakest-link problem: the whole batch waits for the slowest request to finish before the next batch can start. Continuous batching is the OS scheduler: whoever finishes first yields its slot, and a new request fills in immediately. Orca (OSDI 2022) introduced the idea as iteration-level scheduling and vLLM (2023) made it mainstream, with reported throughput gains of 5-23x over static batching.

What it solves + how it works

LLM requests have huge variance in output length: one needs 10 tokens, another needs 2000. The old approach was request-level batching: gather 8 requests, run them together, but because the batch advances in lockstep, everyone waits for the 2000-token one. GPU spins idle, throughput tanks.

Continuous batching (a.k.a. iteration-level scheduling) drops scheduling granularity from "whole request" down to "every decode step":

  • After every token the scheduler re-evaluates the batch;
  • Finished requests exit immediately, freeing their KV-cache memory;
  • Queued new requests fill the empty slots and begin prefill;
  • Prefill and decode phases can coexist in the same batch (chunked prefill further breaks long prefills into pieces that interleave with decodes).
Static batching (old)
T1: R1    R2    R3    R4      parallel decode
T2: R1    R2    R3    R4      R2 done but still occupying slot
T3: R1    idle  R3    R4      GPU wastes cycles ↑
T4: R1    idle  idle  R4      more waste

Continuous batching (vLLM)
T1: R1    R2    R3    R4      normal
T2: R1    R5    R3    R4      R2 done → R5 fills in
T3: R1    R5    R6    R4      R3 done → R6 fills in
GPU saturated every step; 5-23x throughput
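The two timelines above can be turned into a small simulation. This is a deliberately simplified sketch: one "step" emits one token for every occupied slot, prefill cost is ignored, and step time is assumed not to depend on batch occupancy. The workload (one 100-token request mixed with three 10-token requests, repeated four times) is made up for illustration.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    # Each batch runs until its longest request finishes.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    # After every step, finished requests leave and queued ones take their slots.
    queue = deque(lengths)
    running = []  # remaining tokens for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 10, 10, 10] * 4   # 16 requests, 520 output tokens in total
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)

print(static_steps, continuous_steps)  # 400 160
```

With four slots the ideal is 520 / 4 = 130 steps; continuous batching lands near it, while static batching pays for the longest request in every batch.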

The implementation challenge is KV cache memory management: the naive approach pre-allocates max-length KV memory per request, wasting 60-80%. vLLM's PagedAttention slices the KV cache into fixed-size blocks (16 tokens per block by default) allocated on demand and tracked in a per-request block table, directly borrowing OS virtual memory paging, which made continuous batching actually usable. It's the most important LLM-serving systems advance of 2023; today vLLM / SGLang / TensorRT-LLM / TGI all use the same pattern.
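The paging idea can be sketched as a free list plus a per-request block table. This is a toy model of the bookkeeping only, not vLLM's code: the class name, the 16-token block size, and the 2048-token reservation used for comparison are assumptions for illustration.

```python
class BlockAllocator:
    """Toy paged KV-cache bookkeeping: blocks are handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req):
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:         # no block yet, or the last one is full
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def release(self, req):
        # A finished request returns its blocks to the pool immediately.
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=64)
for _ in range(100):
    alloc.append_token("a")
for _ in range(30):
    alloc.append_token("b")

blocks_a = len(alloc.tables["a"])   # 7 blocks cover 100 tokens
blocks_b = len(alloc.tables["b"])   # 2 blocks cover 30 tokens

# Token slots reserved for 130 real tokens under each policy.
paged_slots = (blocks_a + blocks_b) * alloc.block_size   # 144
naive_slots = 2 * 2048                                   # max-length reservation

alloc.release("a")
free_after_release = len(alloc.free)
print(paged_slots, naive_slots, free_after_release)  # 144 4096 62
```

Waste under paging is bounded by one partially filled block per request, which is what lets the scheduler admit new requests as soon as old ones exit.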

Code example
# vLLM has continuous batching on by default — just submit requests
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
          gpu_memory_utilization=0.9,
          max_num_seqs=256)  # ← max concurrent requests; scheduler does in-flight scheduling

prompts = [
    "Explain caching in one paragraph.",   # short output
    "Write a 1000-word essay on consciousness.",  # long output
    "What is 2+2?",                          # very short
    # ... dozens or hundreds of mixed-length requests at once
]
params = SamplingParams(max_tokens=1024, temperature=0.7)

outputs = llm.generate(prompts, params)
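# Inside the engine, short requests finish and free their slots early while long ones keep decoding.
# The offline llm.generate() call itself returns once every prompt is done; use the vLLM server to stream per request.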
# Same hardware vs raw HuggingFace transformers: typically 10x+ throughput.
# This is why every production LLM serving stack uses vLLM/SGLang/TensorRT-LLM, not raw transformers.
Common pitfall + practical scenario
"Continuous batching makes a single request faster" is wrong. It does not reduce single-request latency and may slightly increase it due to shared-GPU contention. It optimizes throughput (system tokens-per-second): your cost when serving many users, not your felt speed as a single caller. A large part of why OpenAI / Anthropic can price APIs at roughly $0.5-3 per 1M tokens is plausibly this scheduling layer, which drives GPU utilization from about 30% to 80%+.
📌 BigCat scenario: when building a personal AI tool for one user, skip vLLM — HuggingFace transformers + KV cache is enough. But the moment you build a shared AI assistant for family or friends (even 5-10 people), switching to vLLM lets the same hardware serve 10x more users with a cliff drop in unit cost. The dividing line: are there concurrent requests?
Takeaway + reflection
💡 Continuous batching is a scheduling problem, not a model problem — same weights, different serving framework, 10x throughput gap.
🤔 In systems you've built, when "scheduling granularity went from coarse to fine" (threads → coroutines, batch → streaming), what order-of-magnitude wins followed? What's the invariant pattern?

Quantization

compression · precision
One-line analogy

Quantization is JPEG compression for neural networks: deliberately accept a controlled, lossy drop in precision in exchange for large (roughly 2-4x) savings in memory and bandwidth. Backend analogues: fixed-point instead of float (classic embedded systems trick), columnar compression (Parquet's dictionary encoding squeezing strings into ints), protobuf varint (small numbers in fewer bytes). All bet on the same thing: "data's distribution structure lets us recover it with fewer bits."

What it solves + how it works

Llama-3 70B in FP16 takes 140 GB of VRAM, so a single H100 (80 GB) can't hold it, and H100s cost ~$30K each. Quantized to INT4, it shrinks to 35 GB, which a consumer RTX 4090 (24 GB) can run with CPU offload (slowly, since the offloaded layers are served from system RAM): well over a 10x hardware cost gap. This is why quantization is the entry ticket to local / edge LLM deployment.

Core formula: x_int = round((x_float - zero) / scale), dequantized via x_float ≈ x_int * scale + zero. Intuition: linearly map floats to a small integer range; scale sets "step size", zero sets "where zero lands". In practice scale and zero are fit to each tensor's (or each group's) actual min/max rather than FP16's full [-65504, 65504]; squeezing that range into INT8's [-128, 127] loses precision, but model weights follow a bell curve concentrated near zero, so the loss is far smaller than you'd expect.
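A minimal round-trip through the formula above, in pure Python. The weights are made-up toy values, and fitting `scale` and `zero` to the tensor's own min/max is one common choice assumed here for illustration; the function names are hypothetical.

```python
def quant_params(xs, qmin=-128, qmax=127):
    # Fit the integer range to the data: min(xs) -> qmin, max(xs) -> qmax.
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (qmax - qmin)   # float width of one integer step
    zero = lo - qmin * scale            # float offset used by the formula
    return scale, zero

def quantize(xs, scale, zero, qmin=-128, qmax=127):
    # x_int = round((x_float - zero) / scale), clamped to the integer range.
    return [max(qmin, min(qmax, round((x - zero) / scale))) for x in xs]

def dequantize(qs, scale, zero):
    # x_float ~= x_int * scale + zero
    return [q * scale + zero for q in qs]

weights = [-0.42, -0.05, 0.0, 0.03, 0.11, 0.37]   # toy values near zero
scale, zero = quant_params(weights)
q = quantize(weights, scale, zero)
restored = dequantize(q, scale, zero)

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error, scale / 2)  # round-trip error stays within half a step
```

The error bound of half a step is why a narrow value range (small `scale`) quantizes well, and why a single outlier that widens the range hurts every other weight in the tensor.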

Mainstream options (precision vs memory)

Format             Bits    Memory   Accuracy loss   Notes
FP16 / BF16        16      100%     zero            baseline
FP8                8       50%      near-zero       H100/B200 hardware support
INT8 (W8A8)        8       50%      <1%             SmoothQuant / LLM.int8()
INT4 (GPTQ/AWQ)    4       25%      1-3%            standard for local deployment
INT2 / GGUF Q2     2-3     12%      significant     extreme compression
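The memory column follows directly from bits per weight. The sketch below also shows why real 4-bit files are slightly larger than the ideal 25%: group-wise schemes store a scale for each group of weights. The group size of 128 and 16-bit scales are common settings assumed here for illustration, and the function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight, group_size=None, scale_bits=16):
    # Group-wise quantization stores one scale per group of weights, adding
    # scale_bits / group_size extra bits per weight on top of the payload.
    effective_bits = bits_per_weight
    if group_size is not None:
        effective_bits += scale_bits / group_size
    return n_params * effective_bits / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.1f} GB")

int4_grouped = weight_memory_gb(70e9, 4, group_size=128)
print(f"INT4 with per-group scales: {int4_grouped:.1f} GB")  # 36.1 GB
```

These numbers cover weights only; the KV cache and activations come on top.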

Four mainstream approaches, engineering trade-offs:
GPTQ: post-training, layer-by-layer error minimization; accurate but slow (hours)
AWQ: post-training; finds the ~1% of "important weights" (ranked by activation magnitude) and protects them by rescaling before quantization rather than leaving them un-quantized; balanced speed and accuracy
GGUF (llama.cpp): strictly a file format that carries multiple Q2-Q8 quantization levels; CPU/Mac friendly, the local hobbyist default
QAT: simulates quantization noise during training, most accurate but requires retraining (expensive)

Weight-only vs weight+activation (W·A) are two distinct paths: weight-only saves VRAM and accelerates memory transfer (which is exactly the inference bottleneck), is simple to implement; activation quantization unlocks INT8 tensor cores for real compute speedup but is harder to keep accurate. Community consensus: local inference uses weight-only INT4 (AWQ/GPTQ); cloud high-throughput uses W8A8 / FP8.

Code example
# Load a 4-bit quantized model directly with bitsandbytes — one config block
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

qcfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",        # normal-float 4 (QLoRA paper), optimal for normal distributions
    bnb_4bit_use_double_quant=True,    # quantize the quantization constants too — small extra win
)
m = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=qcfg,
    device_map="auto",
)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

ids = tok("Define entropy in one sentence.", return_tensors="pt").to("cuda")
print(tok.decode(m.generate(**ids, max_new_tokens=100)[0]))
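# 8B model drops from ~16GB to ~5GB and fits a 12GB CUDA card such as an RTX 3060; quality loss usually <2%
# (this bitsandbytes path assumes an NVIDIA GPU; on an M2 Mac use a GGUF build via llama.cpp instead)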
Common pitfall + practical scenario
"INT4 only loses 1-3%, so it's essentially free": it depends. On multiple-choice and summarization, you really won't notice; but on math reasoning, code generation, long-chain CoT, and rare languages, quantization errors compound, and a quantized 8B can drift far from the original and feel "noticeably dumber". Evaluation discipline: compare quantized vs full on your real task, don't trust public benchmarks alone. Karpathy keeps reminding people: a 1% benchmark loss can become a 30% task-failure rate in long-chain agent workflows. The arithmetic is plain compounding: if each step independently goes wrong 1% of the time, a 35-step chain fails about 30% of the time (1 − 0.99^35 ≈ 0.30).
📌 BigCat scenario: run a 4-bit-quantized 70B model on a MacBook M3 Max for offline, private conversations (paper reading, journaling, family-decision reviews) with privacy guarantees far above any cloud API. Memory check: the weights alone are ~35-40 GB, so 48GB of unified memory is tight once the KV cache and the OS are counted; 64GB+ is more comfortable. This is a critical piece of the "AI super-individual" stack: strongest cloud model for open tasks, local quantized model for privacy-sensitive tasks, routed by data sensitivity.
Takeaway + reflection
💡 Quantization is lossy compression — you must compare quantized-vs-full on your real task; benchmark numbers alone aren't enough.
🤔 "Express the same meaning with fewer bits" — where else does this show up in your work? Are information theory's "effective entropy" and engineering's "representation precision" the same thing?

Further Reading

Deep Questions

1. Why is "LLM inference is a memory-bandwidth problem, not a compute problem"? How does this reframe GPU selection / cost modeling / optimization priorities?
Core numbers: H100 delivers ~989 TFLOPS (FP16) but only ~3 TB/s memory bandwidth. To decode one token from Llama-70B, 70B params × 2 bytes = 140 GB must move from VRAM to compute units: theoretical floor 140/3000 ≈ 47 ms, regardless of compute. That is "memory-bound" made concrete. Reframings: (a) buy HBM bandwidth, not TFLOPS: H100 and H200 have identical compute, but H200's 141GB / 4.8 TB/s makes LLM inference 40-60% faster; (b) batching is a near-free lunch: one pass over the weights serves many requests, so throughput grows almost linearly until compute saturates (this is the physical basis of continuous batching); (c) quantization's real value is shrinking transfers, not shrinking compute: INT4 cuts the move from 140GB to 35GB, up to 4x on bandwidth-bound decode; (d) speculative decoding exploits idle compute: since bandwidth is the limit, verifying extra tokens is almost free. BigCat, your distributed-systems habit of diagnosing "I/O bound vs CPU bound" transfers directly, with one caveat: nvidia-smi's GPU-Util only reports the fraction of time some kernel was running, so it can read 90%+ while decode is still bandwidth-bound. Compare your measured ms per token against the floor above (bytes moved / memory bandwidth); if you are close to it, bandwidth is the limit, and adding batch / speculative / quantization beats buying bigger cards.
2. KV cache (Day 9) and prompt caching (Day 8) both sound like "caching" — but they solve completely different problems. Articulate the difference, then how would you compose them?
Easy to conflate. KV cache is an optimization within a single request: generating token N reuses K/V already computed for tokens 1..N-1, and the cache is discarded when the request ends. It lives in the inference engine, is transparent to the application, and without it inference is O(n³) and unusable. Prompt caching is an optimization across requests: KV for the common prefix (system prompt + few-shot + tool schemas) is persisted to VRAM or SSD; the next request hits and skips prefill for that prefix. Depending on the API, the application either marks cache breakpoints explicitly (Anthropic) or gets caching automatically for repeated prefixes (OpenAI). Composition: (a) put stable prefix content first so prompt caching can hit; (b) inside the request, the engine automatically uses KV cache; (c) on self-hosted vLLM / SGLang there's a third layer, automatic prefix caching (hash-matched KV blocks in vLLM, a radix tree in SGLang), which shares KV across requests with the same prefix and is finer-grained than explicit prompt caching. Three layers compose: cross-request prefix cache + intra-request KV cache + explicit prompt caching = the complete "cache stack" of LLM serving. The hierarchy maps almost one-to-one to a database's buffer pool / query-plan cache / materialized view.
3. Speculative decoding uses "cheap proposer + expensive verifier" — can this pattern extend into agent workflows?
Yes, and it already is. Pattern: use the big model as verifier, small model / tools / heuristics as proposers. Examples: (a) parallel reasoning paths — let an 8B model brainstorm 5 approaches; have a 70B model only expand the 1-2 most promising; 5-10x cheaper than pure 70B thinking; (b) tool-call pre-screening — small model decides which tool and rough args, big model intervenes only when it's uncertain; (c) code-agent lint-then-think — run type checker / linter for cheap signals, eliminate obvious errors, save the big model for genuinely hard cases; (d) RAG reranking — cheap BM25/embedding recall to 1000 → mid-tier model reranks to 20 → big model reads top-5. Common pattern: use cheap procedures to prune the obviously-wrong, so expensive resources only handle high-value uncertainty. This is the universal answer when "judgement is expensive and candidates explode" — query optimizers, CDN tiered hits, even human "pipeline review + senior judgment" follow the same template. BigCat, for your "AI super-individual" workflow this is critical: not every question needs Opus — pre-pass with Haiku, escalate to Opus only when Haiku flags "uncertain" or "high stakes". Same output quality, 5-10x cheaper.
4. Quantization is lossy compression but LLM quantization losses are much smaller than expected. What deep structural reasons? Any parallels to neuroscience or information theory?
A beautiful phenomenon with cross-disciplinary echoes. Surface causes: (a) network weights are bell-curve distributed, concentrated near zero — high-precision tails carry almost no information; (b) networks have massive redundancy; combinations of weights compensate for single-weight errors (like redundancy in error-correcting codes); (c) inputs are similarly centered, so quantization noise is "averaged out" by downstream nonlinearities. Deeper parallels: (i) neuroscience — biological neuron spikes are essentially 1-bit discrete signals; the brain does high-level cognition at extremely low precision — proving intelligence doesn't require high-precision continuous representation; (ii) information theory — Shannon's rate-distortion theorem already says: for structured data, effective information is far smaller than raw bit count; quantization is just approaching the true "effective entropy"; (iii) sparse coding theory (Olshausen-Field 1996) — sensory cortex uses sparse low-precision codes for natural images, more efficient than dense representations; (iv) Buddhist/cognitive "good enough" — perception and decision never pursue precision but sufficient discriminative power; that's exactly what quantization does. The flip-side lesson for individuals: "high precision" may not be intelligence's essence — "right granularity for the task" is. BigCat, when you do cross-disciplinary analogies you'll notice this pattern recurs everywhere — brains, compression, organizations, personal decisions — all solving the same problem: maintain effective discrimination under constrained resources.
5. Inference optimization makes "small model + optimization" cross over "big model + defaults" on cost curves — what does this mean for an individual's AI strategy?
Key fact: today a 4-bit Llama-3.1-70B on an M3 Max ≈ last year's GPT-4 capability, zero marginal cost, fully local, offline-capable. A 14B-32B small model with speculative + vLLM costs 5-20x less per token than calling the API. This means capability curves and cost curves are decoupling: it used to be "stronger capability = more money", now it's "stronger capability = pick the right model + apply the right optimization". Individual strategy implications: (a) route by sensitivity — privacy-related work (family, health, finances, private thoughts) goes to local quantized models; open tasks go to the strongest cloud model; (b) route by latency need — interactive typing uses a speculative-optimized small model; batch tasks use cloud high-throughput; (c) route by value — long-tail low-stakes queries (search, rewriting, translation) hit local / small models; only high-stakes decisions (major judgments, long reasoning chains) burn Opus; (d) think "AI toolbox" not "AI single-tool" — pick models like you pick languages, fit-for-purpose. BigCat, the essence of your "AI super-individual" pursuit is designing your own LLM inference stack — a local + multi-cloud, sensitivity-and-value-tiered routing system. The architectural thinking is already deeply familiar from distributed systems; moving it into personal AI workflow is a natural extension. Within a year, "personal AI orchestration architect" will be a competitive but invisible skill.