AI/ML Explained: The Math of Decoding & Sampling

Day 45 · 2026-07-01

For: engineers with coding experience, not from an AI background

One forward pass of the model produces not a word, but an entire probability distribution over tens of thousands of vocabulary tokens. Picking the next word out of that distribution is "decoding." Same model, same prompt — swap the decoding method and the output can go from "dull and repetitive" to "wildly creative," even from "factually accurate" to "confidently wrong." Today we make the math of this final step precise: deterministic search, the distribution-reshaping of random sampling, and two elegant contrast-based mechanisms.

Greedy & Beam SearchGreedy & Beam Search

Deterministic DecodingSearch

One-line analogy

Greedy is a greedy algorithm — each step just takes the highest-probability token, like making change by always grabbing the largest coin: simple, but prone to getting stuck in a local optimum. Beam search is the query-optimizer's approach: it won't bet on a single execution plan, so it maintains a bounded priority queue of k candidates, expanding all k paths each step, pruning by cumulative score, keeping only the top-k. It spends k× compute to buy a little global foresight.

The problem it solves + how it works

Generating a whole sentence is, at heart, finding the token sequence with the highest joint probability. In theory you'd enumerate all sequences — a 50K vocabulary and a 20-word sentence is 50K²⁰ possibilities, utterly infeasible. Greedy is the crudest approximation: take the argmax word by word. The catch is that local optimum ≠ global optimum — the highest-probability word now may corner you into a dead end later.

Beam search softens this: keep the k best prefixes, expand each against the full vocabulary, then take the global top-k by sum of log probabilities.

score(sequence) = Σᵢ log P(tokenᵢ | context)

Summing logs instead of multiplying probabilities avoids underflow to 0 when many small decimals multiply (the same trick you use in backends for long-chain probabilities). Beam width k=1 degenerates to greedy.

Beam search, width k=2 (keep only 2 best paths per step)

start → The
├→ cat logP=-0.9 ✓keep
├→ dog logP=-1.2 ✓keep
└→ sky logP=-3.1 ✗prune
↑ each layer ranks 2×vocab candidates, carries forward only the 2 highest cumulative logP

The core counter-intuition (revealed by Holtzman et al. 2019): for open-ended generation (writing, dialogue), chasing the "highest-probability sequence" produces dull, eerily repetitive text. Real human language is full of "lower-probability but more informative" choices; maximizing probability makes the model converge on safe, "the the the"-style filler. This is why chat models almost never use pure beam search — it suits translation and summarization, where the answer is nearly unique, not creation.

Code example

from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok("The future of AI is", return_tensors="pt").input_ids

# Greedy: argmax each step, fully deterministic
greedy = model.generate(ids, max_new_tokens=30, do_sample=False)

# Beam: keep 5 paths, pick the whole sentence with highest cumulative logP
beam = model.generate(ids, max_new_tokens=30, num_beams=5,
                      early_stopping=True)
print(tok.decode(beam[0], skip_special_tokens=True))

Common misconception + practical scenario

❌ Misconception: "the highest-probability sequence = the best text." This is exactly the intuition Holtzman's paper overturned — maximizing likelihood slides into repetitive degeneration. "Good text" doesn't live at the peak of the distribution, but in a slightly lower, wider region. That's the whole reason the next three concepts exist.

📌 Super-individual scenario: when you need reproducible, auditable output (extracting structured fields from a filing, running the same prompt across versions), use deterministic decoding (temperature=0, which is essentially greedy); when you want it to write or diverge, switch to sampling. First decide whether the task is "search" or "creation."

Takeaway + question

💡 Decoding is a search problem, but "the optimum" is a trap for open-ended generation — high probability ≠ high quality.
🤔 Which of your AI tasks truly need "a single correct answer," and which actually need "diverse but reasonable"? That distinction decides search vs. sampling.

Temperature & Truncation SamplingTemperature · Top-k · Top-p · Min-p

Random SamplingDistribution Reshaping

One-line analogy

If greedy is "always read the hottest cached row," sampling is "draw a row at weighted random by heat." Temperature is the master valve on that randomness — like "sharpening / blurring" the softmax distribution; while top-k / top-p / min-p are three truncation strategies that chop off the long tail of junk candidates first, so an absurd low-probability token doesn't get drawn.

The problem it solves + how it works

Temperature (T) divides each logit by T inside the softmax:

Pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)

zᵢ is the raw logit of token i. Intuition: T < 1 (say 0.3) amplifies the gaps between logits → sharper distribution → high-probability words dominate more (approaching greedy); T > 1 (say 1.5) flattens the gaps → more uniform → rare words get a chance; T → 0 is argmax, T → ∞ is uniform random. Key point: temperature doesn't change what the model "knows," it only reshapes the sharpness of the distribution it already computed.

Same logits, distribution under different temperatures

T=0.5 (sharp/conservative)

T=1.0 (raw)

T=1.6 (flat/divergent)

↑ higher temperature = flatter distribution = rarer words more likely to be drawn

Temperature alone isn't enough: at high T the absurd tail words also get their probability lifted. The three truncations chop the tail first, then re-normalize and sample among the survivors:

Top-k — keep only the k highest-probability tokens. Downside: k is fixed — it still forces in k candidates when the distribution is sharp, and prunes reasonable options when it's flat;
Top-p / Nucleus sampling (Holtzman 2019) — keep the smallest set whose cumulative probability ≥ p. It dynamically adjusts candidate count: small set when the model is confident, large when it's uncertain. Like "keep the hot rows covering 90% of cumulative heat";
Min-p (Nguyen 2024) — threshold = top probability × min_p, chop everything below it. When the model is confident (top prob 0.9) it truncates hard, keeping just the best; when uncertain (top prob 0.2) it relaxes, keeping diversity. More stable than top-p at high temperature.

Code example

# HuggingFace: the three truncations stack; sampling runs among survivors
out = model.generate(
    ids, max_new_tokens=40,
    do_sample=True,       # turn on random sampling (else all params below ignored)
    temperature=0.8,      # sharpness: <1 conservative, >1 divergent
    top_k=50,             # sample only within the top 50 by probability
    top_p=0.9,            # and within the nucleus of cumulative prob ≥0.9 (dynamic count)
    min_p=0.05,           # and keep only tokens with prob ≥ 0.05×top prob
)
print(tok.decode(out[0], skip_special_tokens=True))
# You rarely max out all three; the key is grasping "dynamic vs fixed"

Common misconception + practical scenario

❌ Misconception: "temperature is the model's 'creativity knob' — turn it up to get smarter." Wrong. Temperature injects no new information; it merely redistributes the probability mass the model already computed. High temperature isn't "more creative," it's "more willing to bet on low-probability tokens" — bet right, we call it creativity; bet wrong, we call it hallucination. Creativity comes from the model's knowledge, not from this division.

📌 Super-individual scenario: understand "what physical quantity you're actually tuning" — fact-checking / code generation with low temperature (concentrate the distribution on what it's most sure of), cross-disciplinary brainstorming with high temperature + top-p (widen the nucleus, let the model jump to adjacent concepts). You're not tuning "smartness," you're choosing "how far it's willing to stray from the peak."

Takeaway + question

💡 Sampling = temperature reshapes the distribution + truncation chops the tail. The core divide is "fixed candidate count (top-k)" vs. "scaling with the model's confidence (top-p / min-p)."
🤔 If the model is extremely confident about the next word (one word at 0.99), does high temperature still matter? (Hint: think about why min-p lets almost nothing else through here.)

Contrastive DecodingContrastive Decoding

Contrast MechanismQuality

One-line analogy

Like taking a diff / controlled experiment: use an "amateur (weak, small) model" as the control group, and subtract it from the "expert (strong, large) model's" log probabilities. The cheap bad habits both models share — repetition, boilerplate syntax, mindless high-frequency words — cancel out in the subtraction, leaving only the knowledge and judgment unique to the strong model. It's not two models voting by sum, it's using the weak model to do subtraction.

The problem it solves + how it works

Following concept 1's dilemma: greedy is too dull, pure sampling can drift. Contrastive decoding (Li et al. 2022) takes a different angle — what do bad texts have in common? Answer: the errors both large and small models make (repetition, generic filler). A GPT-2 small and a GPT-2 XL both assign "the" high probability; only the large model can produce a truly apt rare word. So the scoring function becomes their difference:

score = log P_expert(token) − λ · log P_amateur(token)

λ controls the strength of the subtraction. Intuition: a token both expert and amateur love (like "the") has its advantage wiped out after subtraction; a token the expert loves but the amateur doesn't (a word the expert chose from knowledge) stands out. But pure subtraction has a pitfall: a reasonable word the amateur strongly dislikes gets over-rewarded, producing gibberish. So add a plausibility constraint: run the contrast only among candidates where the expert's probability is high enough (≥ α × top probability, same idea as min-p) — first mark out the "expert-approved safe zone," then rank within it using the amateur.

Contrastive decoding: "expert − amateur" surfaces the real knowledge signal

candidate "the": expert 0.30 · amateur 0.28 → diff ≈0 (shared cliché, suppressed)
candidate "photosynthesis": expert 0.20 · amateur 0.02 → diff large (expert-only, rewarded)
↑ first the plausibility constraint circles expert-approved candidates, then rank by diff

The effect: the coherence of greedy plus the richness of sampling, with less repetitive degeneration. Follow-up work (Contrastive Decoding Improves Reasoning, 2023) even found it boosts reasoning accuracy — because "the amateur model's wrong intuitions" are exactly what you want subtracted away.

Code example

import torch, torch.nn.functional as F
expert  = AutoModelForCausalLM.from_pretrained("gpt2-large")
amateur = AutoModelForCausalLM.from_pretrained("gpt2")  # small = amateur

lp_e = F.log_softmax(expert(ids).logits[:, -1], dim=-1)
lp_a = F.log_softmax(amateur(ids).logits[:, -1], dim=-1)

# plausibility: keep candidates with expert prob ≥ α×top, else -inf
alpha, lam = 0.1, 1.0
mask = lp_e < (lp_e.max() + torch.log(torch.tensor(alpha)))
score = lp_e - lam * lp_a          # core: expert − λ×amateur
score[mask] = -float("inf")      # drop words even the expert rejects
next_id = score.argmax(-1)          # take the contrast-best within the safe zone

Common misconception + practical scenario

❌ Misconception: "contrastive decoding is just an ensemble — let two models vote (sum/average)." Quite the opposite — it's subtraction. An ensemble is "three heads are better than one"; contrast is "use the amateur as a cautionary example": the amateur's value isn't that it's right, it's that it represents "the mistakes you can make without any real knowledge." Subtract that, and what remains is genuine skill.

📌 Super-individual scenario: the "subtract with a weak baseline" idea transfers to human-AI collaboration — when evaluating an AI proposal, first ask "what answer could a clueless person give just by following the playbook," mentally "subtract" that from the model's output, and what's left is the real information gain it contributed.

Takeaway + question

💡 Contrastive decoding: bad habits are common to both large and small models; subtracting with a weak model filters out the commonality and keeps the strong model's unique knowledge.
🤔 "Shared errors" is a deep concept — can you think of other fields where "a weak baseline cancels a systematic bias"? (Hint: control groups in causal inference, hedging in finance.)

Speculative DecodingSpeculative Decoding

AccelerationLossless

One-line analogy

This is the CPU's speculative execution + branch prediction, or the database's optimistic concurrency control (OCC), applied to decoding: let a cheap small model optimistically guess a string of tokens, then have the expensive large model verify that guess in one parallel pass. Correct guesses are accepted (saving the time of generating them one by one), wrong ones roll back from there. The crucial part — the final output is mathematically identical to what the large model would have generated token by token. Lossless acceleration.

The problem it solves + how it works

The pain point of autoregressive generation: it's serial, token by token. Generating 100 words means 100 large-model forward passes, each dragging the entire giant model through memory — GPU compute sits idle, the bottleneck is memory bandwidth, not computation. Leviathan et al.'s (2022) insight: many tokens are "easy" ("the," "of," "." and such) — a small model gets them right too, so why invoke the large one?

The flow: the draft model q autoregressively generates γ tokens; the large model p forwards those γ tokens in one parallel pass (computing the distributions at all γ positions at once — this is key: verification is parallel, only generation is serial). Then, token by token, rejection sampling decides acceptance:

P(accept token x) = min(1, P_large(x) / q_small(x))

Intuition: if the small model's confidence in x doesn't exceed the large model's (q ≤ p), accept unconditionally; if the small model is over-confident (q > p), accept with probability p/q. Once a token is rejected, discard all drafts after it and resample one from the corrected distribution norm(max(0, p − q)) — it can be proven that this accept/resample rule makes the final distribution strictly equal to the large model's p. So it sacrifices no quality, only compressing the cost of "a string of easy tokens" from "γ large-model passes" to "1 large-model pass + γ small-model passes."

One round of speculative decoding (γ=4 draft)

small drafts → isanewera (serial, but cheap)
large verifies → one forward computes all 4 positions' distributions
is✓ a✓ new✓ era✗ → reject here, resample correct word
↑ accept 3 + resample 1 = 4 tokens per round, at the cost of 1 large-model forward

The closer the draft model is to the large one, the higher the acceptance rate and the greater the speedup; the paper reports roughly 2–3× on T5-XXL with byte-for-byte identical output. DeepMind's concurrent speculative sampling (Chen et al. 2023) gives an equivalent independent derivation.

Code example

# HuggingFace built-in: pass a small assistant_model to enable speculative decoding
big   = AutoModelForCausalLM.from_pretrained("gpt2-xl")   # target (slow/accurate)
draft = AutoModelForCausalLM.from_pretrained("gpt2")      # draft (fast)

out = big.generate(
    ids, max_new_tokens=60,
    assistant_model=draft,   # ← small guesses, large verifies; distribution unchanged
    do_sample=True, temperature=0.7,
)
print(tok.decode(out[0], skip_special_tokens=True))
# result matches the distribution without assistant_model — just faster

Common misconception + practical scenario

❌ Misconception: "speculative decoding is approximate acceleration — it trades a bit of quality for speed." Wrong — it's mathematically lossless. The rejection-sampling accept/resample rule is carefully designed to guarantee the final token distribution is strictly identical to the large model generating alone. A bad draft model only lowers the acceptance rate (less speedup); it never pollutes the output. That's its elegance: speed is gambled, correctness is proven.

📌 Super-individual scenario: understand why your everyday API got faster without getting dumber. More broadly, the mental model — "guess optimistically + verify cheaply + roll back only if wrong": when writing code, let the AI generate boldly and quickly verify the critical parts yourself; when deciding, draft first then verify assumptions. Speculative execution is a general philosophy of efficiency.

Takeaway + question

💡 Speculative decoding = speculative execution for decoding: the small model guesses, the large model verifies in parallel, rejection sampling keeps the output distribution byte-for-byte unchanged. A (nearly) free lunch.
🤔 Speculative decoding trades speed via "cheap guessing + provable verification." Which parts of your workflow could "execute optimistically, then verify cheaply" instead of "confirming cautiously step by step"?

Deep QuestionsDeep Questions

1. Of the four methods, which change the model's "output distribution" and which don't? Why does this distinction matter?

Splitting them into two classes is the key to today's whole picture. Change the distribution: greedy (collapses to a point), temperature (sharpen/flatten), top-k/p/min-p (truncate the tail), contrastive decoding (expert minus amateur, a full reshaping of the score) — all actively "rewriting what the model says." Don't change the distribution: only speculative decoding — it's pure engineering acceleration, output strictly identical to the large model generating token by token. The significance is attribution: if the model hallucinates, the first four are results of "you chose to let it stray further from the peak" (tunable), whereas speculative decoding is never at fault (it just faster-reproduces what the large model would have said anyway). When debugging a "model suddenly talks nonsense" issue, first check whether it's the sampling parameters (temperature too high, top-p too wide) rather than the model itself — the most common, most overlooked layer.

2. Top-p, min-p, and the contrastive plausibility constraint differ in math form, but share one "spirit" — what is it?

All three answer the same question: "how many candidate words should we keep to be reasonable?" And every answer is dynamic, adapting to the model's confidence, not a fixed threshold. Top-p uses "cumulative probability ≥ p" — the more confident the model, the fewer candidates needed to reach p. Min-p uses "≥ top probability × ratio" — anchored directly on the top-1's confidence. The contrastive plausibility constraint "≥ α × top probability" is nearly the same formula as min-p. The shared spirit: narrow when the model is confident (trust it), widen when it's uncertain (give it room). This is more intuitive than top-k's fixed k — because "the number of reasonable candidates" was never meant to be a constant; it depends on how confident the model is at this step. It also echoes concept 1's insight: good text lives in a region that "scales with context," not a fixed-size neighborhood.

3. Why is speculative decoding's "small model guesses + large model verifies" lossless? What if you replaced it with a "weighted average of small and large outputs"?

The losslessness hinges on rejection sampling, not on "averaging." The acceptance probability min(1, p/q) plus the "resample from norm(max(0,p−q)) when rejected" rule can be shown algebraically to make any token's final output probability exactly equal the large model's p. Intuitively, the small model merely provides a proposal distribution; rejection sampling "corrects" it back to the target p — the small model guessing well affects only the acceptance rate (efficiency), not the final distribution (correctness). If you instead used a "weighted average" (say 0.5p+0.5q), the output distribution becomes a brand-new one that is neither p nor q — that's the contrastive-decoding class of "actively rewriting output," which sacrifices/changes quality and is no longer lossless. Same "two models collaborating," but speculative decoding is "accelerate with the small model, guarantee correctness with math," while contrastive decoding is "use the small model's errors as a cautionary example to rewrite the output" — utterly different goals; don't conflate them.

4. Concept 1 says "the highest-probability sequence is actually bad text," a shock to the naive "higher probability is better." What does it reveal about language models?

It reveals a deep mismatch: the model is trained to "maximize likelihood," but the value of human language does not lie at the peak of likelihood. Information-theoretically, a perfectly predictable (high-probability) word carries almost no information (self-information = −log p, larger p means less information). Holtzman's observation is precisely that human text's token probabilities fluctuate, whereas beam search pulls generation toward "a smooth curve of sustained high probability" — hence dull repetition. This also explains why sampling is necessary for open-ended generation, not just "adding a bit of noise to make do." More philosophically: it echoes complexity science's "interesting systems live at the edge of order and chaos" — good generation also lives in the narrow band between greedy (over-ordered) and high-temperature random (over-chaotic).

5. If we liken "decoding strategy" to human "modes of thought," what cognitive state does each of the four map to? What does this suggest for how you use AI?

An interesting mapping. Greedy ≈ snap intuition, always saying the smoothest first reaction — efficient but prone to mental ruts (local optima). Beam search ≈ deliberating, weighing several options before deciding, good for problems with a clear optimum, but for open creation it "overthinks into mediocrity." High-temperature sampling ≈ brainstorming, free association, allows leaps but may go off-topic. Contrastive decoding ≈ critical thinking — "how would a mediocre me answer? Subtract that part," actively fighting one's own clichés. Speculative decoding ≈ draft fast then refine, rather than chiseling word by word. The takeaway: when collaborating with AI, you too are implicitly choosing a "cognitive temperature" for each task. The true "super-individual" isn't one who always uses the same mode, but one who can consciously switch decoding strategies for the task at hand — for AI and for one's own brain alike.

Greedy & Beam SearchGreedy & Beam Search

Temperature & Truncation SamplingTemperature · Top-k · Top-p · Min-p

Contrastive DecodingContrastive Decoding

Speculative DecodingSpeculative Decoding

Further ReadingFurther Reading

Deep QuestionsDeep Questions