One forward pass of the model produces not a word, but an entire probability distribution over tens of thousands of vocabulary tokens. Picking the next word out of that distribution is "decoding." Same model, same prompt — swap the decoding method and the output can go from "dull and repetitive" to "wildly creative," even from "factually accurate" to "confidently wrong." Today we make the math of this final step precise: deterministic search, the distribution-reshaping of random sampling, and two elegant contrast-based mechanisms.
Greedy is a greedy algorithm — each step just takes the highest-probability token, like making change by always grabbing the largest coin: simple, but prone to getting stuck in a local optimum. Beam search is the query-optimizer's approach: it won't bet on a single execution plan, so it maintains a bounded priority queue of k candidates, expanding all k paths each step, pruning by cumulative score, keeping only the top-k. It spends k× compute to buy a little global foresight.
Generating a whole sentence is, at heart, finding the token sequence with the highest joint probability. In theory you'd enumerate all sequences — a 50K vocabulary and a 20-word sentence is 50K²⁰ possibilities, utterly infeasible. Greedy is the crudest approximation: take the argmax word by word. The catch is that local optimum ≠ global optimum — the highest-probability word now may corner you into a dead end later.
Beam search softens this: keep the k best prefixes, expand each against the full vocabulary, then take the global top-k by sum of log probabilities.
Summing logs instead of multiplying probabilities avoids underflow to 0 when many small decimals multiply (the same trick you use in backends for long-chain probabilities). Beam width k=1 degenerates to greedy.
The core counter-intuition (revealed by Holtzman et al. 2019): for open-ended generation (writing, dialogue), chasing the "highest-probability sequence" produces dull, eerily repetitive text. Real human language is full of "lower-probability but more informative" choices; maximizing probability makes the model converge on safe, "the the the"-style filler. This is why chat models almost never use pure beam search — it suits translation and summarization, where the answer is nearly unique, not creation.
from transformers import AutoModelForCausalLM, AutoTokenizer tok = AutoTokenizer.from_pretrained("gpt2") model = AutoModelForCausalLM.from_pretrained("gpt2") ids = tok("The future of AI is", return_tensors="pt").input_ids # Greedy: argmax each step, fully deterministic greedy = model.generate(ids, max_new_tokens=30, do_sample=False) # Beam: keep 5 paths, pick the whole sentence with highest cumulative logP beam = model.generate(ids, max_new_tokens=30, num_beams=5, early_stopping=True) print(tok.decode(beam[0], skip_special_tokens=True))
If greedy is "always read the hottest cached row," sampling is "draw a row at weighted random by heat." Temperature is the master valve on that randomness — like "sharpening / blurring" the softmax distribution; while top-k / top-p / min-p are three truncation strategies that chop off the long tail of junk candidates first, so an absurd low-probability token doesn't get drawn.
Temperature (T) divides each logit by T inside the softmax:
zᵢ is the raw logit of token i. Intuition: T < 1 (say 0.3) amplifies the gaps between logits → sharper distribution → high-probability words dominate more (approaching greedy); T > 1 (say 1.5) flattens the gaps → more uniform → rare words get a chance; T → 0 is argmax, T → ∞ is uniform random. Key point: temperature doesn't change what the model "knows," it only reshapes the sharpness of the distribution it already computed.
Temperature alone isn't enough: at high T the absurd tail words also get their probability lifted. The three truncations chop the tail first, then re-normalize and sample among the survivors:
# HuggingFace: the three truncations stack; sampling runs among survivors out = model.generate( ids, max_new_tokens=40, do_sample=True, # turn on random sampling (else all params below ignored) temperature=0.8, # sharpness: <1 conservative, >1 divergent top_k=50, # sample only within the top 50 by probability top_p=0.9, # and within the nucleus of cumulative prob ≥0.9 (dynamic count) min_p=0.05, # and keep only tokens with prob ≥ 0.05×top prob ) print(tok.decode(out[0], skip_special_tokens=True)) # You rarely max out all three; the key is grasping "dynamic vs fixed"
Like taking a diff / controlled experiment: use an "amateur (weak, small) model" as the control group, and subtract it from the "expert (strong, large) model's" log probabilities. The cheap bad habits both models share — repetition, boilerplate syntax, mindless high-frequency words — cancel out in the subtraction, leaving only the knowledge and judgment unique to the strong model. It's not two models voting by sum, it's using the weak model to do subtraction.
Following concept 1's dilemma: greedy is too dull, pure sampling can drift. Contrastive decoding (Li et al. 2022) takes a different angle — what do bad texts have in common? Answer: the errors both large and small models make (repetition, generic filler). A GPT-2 small and a GPT-2 XL both assign "the" high probability; only the large model can produce a truly apt rare word. So the scoring function becomes their difference:
λ controls the strength of the subtraction. Intuition: a token both expert and amateur love (like "the") has its advantage wiped out after subtraction; a token the expert loves but the amateur doesn't (a word the expert chose from knowledge) stands out. But pure subtraction has a pitfall: a reasonable word the amateur strongly dislikes gets over-rewarded, producing gibberish. So add a plausibility constraint: run the contrast only among candidates where the expert's probability is high enough (≥ α × top probability, same idea as min-p) — first mark out the "expert-approved safe zone," then rank within it using the amateur.
The effect: the coherence of greedy plus the richness of sampling, with less repetitive degeneration. Follow-up work (Contrastive Decoding Improves Reasoning, 2023) even found it boosts reasoning accuracy — because "the amateur model's wrong intuitions" are exactly what you want subtracted away.
import torch, torch.nn.functional as F expert = AutoModelForCausalLM.from_pretrained("gpt2-large") amateur = AutoModelForCausalLM.from_pretrained("gpt2") # small = amateur lp_e = F.log_softmax(expert(ids).logits[:, -1], dim=-1) lp_a = F.log_softmax(amateur(ids).logits[:, -1], dim=-1) # plausibility: keep candidates with expert prob ≥ α×top, else -inf alpha, lam = 0.1, 1.0 mask = lp_e < (lp_e.max() + torch.log(torch.tensor(alpha))) score = lp_e - lam * lp_a # core: expert − λ×amateur score[mask] = -float("inf") # drop words even the expert rejects next_id = score.argmax(-1) # take the contrast-best within the safe zone
This is the CPU's speculative execution + branch prediction, or the database's optimistic concurrency control (OCC), applied to decoding: let a cheap small model optimistically guess a string of tokens, then have the expensive large model verify that guess in one parallel pass. Correct guesses are accepted (saving the time of generating them one by one), wrong ones roll back from there. The crucial part — the final output is mathematically identical to what the large model would have generated token by token. Lossless acceleration.
The pain point of autoregressive generation: it's serial, token by token. Generating 100 words means 100 large-model forward passes, each dragging the entire giant model through memory — GPU compute sits idle, the bottleneck is memory bandwidth, not computation. Leviathan et al.'s (2022) insight: many tokens are "easy" ("the," "of," "." and such) — a small model gets them right too, so why invoke the large one?
The flow: the draft model q autoregressively generates γ tokens; the large model p forwards those γ tokens in one parallel pass (computing the distributions at all γ positions at once — this is key: verification is parallel, only generation is serial). Then, token by token, rejection sampling decides acceptance:
Intuition: if the small model's confidence in x doesn't exceed the large model's (q ≤ p), accept unconditionally; if the small model is over-confident (q > p), accept with probability p/q. Once a token is rejected, discard all drafts after it and resample one from the corrected distribution norm(max(0, p − q)) — it can be proven that this accept/resample rule makes the final distribution strictly equal to the large model's p. So it sacrifices no quality, only compressing the cost of "a string of easy tokens" from "γ large-model passes" to "1 large-model pass + γ small-model passes."
The closer the draft model is to the large one, the higher the acceptance rate and the greater the speedup; the paper reports roughly 2–3× on T5-XXL with byte-for-byte identical output. DeepMind's concurrent speculative sampling (Chen et al. 2023) gives an equivalent independent derivation.
# HuggingFace built-in: pass a small assistant_model to enable speculative decoding big = AutoModelForCausalLM.from_pretrained("gpt2-xl") # target (slow/accurate) draft = AutoModelForCausalLM.from_pretrained("gpt2") # draft (fast) out = big.generate( ids, max_new_tokens=60, assistant_model=draft, # ← small guesses, large verifies; distribution unchanged do_sample=True, temperature=0.7, ) print(tok.decode(out[0], skip_special_tokens=True)) # result matches the distribution without assistant_model — just faster