Pre-training
LLM / Training
One-line intuition
Imagine handing a new hire the entire internet and telling them to read it through — doing exactly one thing: look at the prefix, guess the next word. Right or wrong isn't judged morally; the gradient just tunes their intuition for the statistical patterns of text.
What problem it solves
If we trained a fresh model for every task (translation, QA, code generation), each one would have to relearn what English is, what a sentence is, what facts look like — economically and data-wise, impossible. Pre-training packages "language + common sense" into a shared foundation once, and downstream tasks only need lightweight specialization on top. This is the entire premise of foundation models.
How it works (intuition)
The core task is brutally simple: given the first N tokens, predict token N+1. Compare the prediction with the real token, take the gradient, repeat trillions of times.
# Pseudocode: one pre-training step
text = sample_from_internet()                  # a stew of web pages, books, code, papers
tokens = tokenize(text)                        # slice into a token sequence
for i in range(len(tokens) - 1):
    pred = model(tokens[:i+1])                 # probability distribution over the next token
    loss = cross_entropy(pred, tokens[i+1])    # how wrong was the guess
    loss.backward(); optimizer.step()          # nudge parameters a little
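To make the loss in the pseudocode above concrete, here is a minimal dependency-free sketch of next-token cross-entropy. The toy vocabulary, the hand-written logits, and the helper names (`softmax`, `next_token_loss`) are assumptions for illustration only. Real training would get the logits from a neural network and score every position in one batched forward pass.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, tokens):
    """Average cross-entropy of predicting tokens[i+1] from position i."""
    losses = []
    for i in range(len(tokens) - 1):
        probs = softmax(logits_per_position[i])
        losses.append(-math.log(probs[tokens[i + 1]]))
    return sum(losses) / len(losses)

# Hypothetical toy setup: 3 token ids in the vocabulary, one 3-token sequence.
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3      # a model that knows nothing
confident_logits = [[0.0, 0.0, 5.0],        # after token 0 it expects token 2
                    [0.0, 5.0, 0.0],        # after token 2 it expects token 1
                    [0.0, 0.0, 0.0]]        # last position has no target

loss_uniform = next_token_loss(uniform_logits, tokens)
loss_confident = next_token_loss(confident_logits, tokens)
print(loss_uniform, loss_confident)
```

The uninformed model pays log(3) per token (a uniform guess over three options), while the model that puts most of its mass on the right token pays far less. Gradient descent is what moves a model from the first situation toward the second.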
Scale drives capability: as data grows from GB to TB and parameters from hundreds of millions to hundreds of billions, abilities like reasoning and code generation start to appear, which is why they are often called "emergent."
Code example
# Feel the pre-training objective via Hugging Face: get the next-token distribution at any position
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits        # [1, seq_len, vocab_size]
next_id = logits[0, -1].argmax()           # pick the most likely next token
print(tok.decode(next_id))                 # ideally " Paris"; small GPT-2 may pick a different token
Common misconception
A pre-trained model "knows" facts only because it has seen similar phrasings many times — fundamentally statistical co-occurrence, not knowing. That's why it can confidently say something that sounds right but is wrong (hallucination). A raw pre-trained model also won't chat with you — it's just a continuation engine. Say "hello" and it might continue with "Hello, I am Professor XX, today we'll cover…"
Key resources
- Karpathy: "Let's build GPT" on YouTube (2 hours of hand-coding a mini GPT — the best intro)
- Hugging Face: "How does a language model work" chapter
Where you see it
Canonical: GPT-2/3 and the Llama base models are "pre-training only" snapshots.
Closer to home: when you see llama-3-8b (no instruct suffix), that's the raw pre-trained model — for completion-style tasks like extending an investment memo or generating training data, it's often more useful than the chat-tuned version.
English Summary
Pre-training teaches a model the structure of language by predicting the next token across massive text corpora. The result is a "foundation model": a generalist that knows grammar and facts but doesn't yet follow instructions. All later fine-tuning steps build on top of these learned statistical patterns.
Think it through
The pre-training objective is just "guess the next word." How can it possibly give rise to coding or math? What hidden assumptions does this rest on?
The key assumption is that language encodes the structure of the world — code corpora carry syntax, semantics, and reasoning chains together, so guessing the next word is really guessing "the next plausible world state." But this assumption has limits: reasoning patterns absent from the training set (genuinely novel mathematical proofs) can't emerge from nowhere, so emergence is better understood as "already-present capability getting unlocked at scale" rather than true creation. A deeper point: human babies also learn by predicting the next sound, but generalize from tiny amounts of data — LLM sample efficiency is orders of magnitude worse, suggesting what they capture is "language distribution," not "language ability."
"A new grad reading the entire library" — where does this analogy break? What wrong calls might it lead you to make?
Breaks in three places: (1) humans understand while reading; models only adjust weights, so "having read" doesn't equal "knowing" — without reinforcement, the model can't actively recall facts; (2) human reading order doesn't matter much, but gradient updates are order-sensitive (catastrophic forgetting); (3) humans skim junk; models don't, so data cleaning matters more than raw volume. Misleading consequences: you'll overestimate the model's grasp of long-tail knowledge and underestimate the vacuum past its training cutoff.
Same compute budget, two options: 7B params × 100B tokens vs 70B params × 10B tokens. Which wins? Why does Chinchilla feel counterintuitive?
At this budget the 7B model wins: both options cost about the same compute, but 70B parameters on 10B tokens is badly under-trained. Chinchilla (2022) showed that the large models of the time were "over-parameterized, under-data-fed": at fixed compute, smaller models trained on more data outperformed. That feels counterintuitive because the prevailing instinct was that bigger is always better. By 2024 the wind had shifted further: inference cost dominates deployment economics, so the industry favors "overtrained small models" (Llama-3 8B trained on 15T tokens), because small models are cheap to serve. So the answer depends on what you're optimizing: training efficiency favors the Chinchilla balance point, deployment efficiency favors overtrained small models, and pure capability ceilings still belong to the big ones. The trade-off mirrors "CPU cache size vs main memory bandwidth" in hardware design.
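A quick back-of-the-envelope check of the two options, assuming the common approximation that training compute is about 6 × parameters × tokens, and the often-quoted Chinchilla rule of thumb of roughly 20 tokens per parameter. Both are approximations brought in for illustration, not figures from this card.

```python
def train_flops(params, tokens):
    """Rough training cost using the C ≈ 6 * N * D approximation."""
    return 6 * params * tokens

# The two options from the question: (parameters, training tokens)
options = {
    "7B x 100B tokens": (7e9, 100e9),
    "70B x 10B tokens": (70e9, 10e9),
}

tokens_per_param = {}
for name, (n, d) in options.items():
    tokens_per_param[name] = d / n
    print(name, f"FLOPs={train_flops(n, d):.2e}", f"tokens/param={d / n:.2f}")
```

Both rows come out at the same compute, but the 7B run sees about 14 tokens per parameter while the 70B run sees about 0.14, which is why the larger model ends up far from the balance point under these assumptions.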
What if you intentionally injected 5% buggy code into pre-training data — would the model get dumber? Why is the relationship between "data quality" and "data volume" nonlinear?
It wouldn't dumb down noticeably: 5% noise gets "outvoted" across billions of gradient steps, and the model learns the statistical majority. But there's a threshold: as a rough rule of thumb, once noise reaches something like 20-30%, capability degrades sharply. Three sources of nonlinearity: (1) frequently repeated errors do get memorized (frequency wins); (2) high-quality data leverages out rare capabilities (a handful of reasoning examples unlocks CoT); (3) diversity vs repetition trade-off: the same 1TB of diverse data has a far higher ceiling than 1TB of duplicates. Mirrors the marginal-return curve of "test coverage" in software engineering.
Why don't OpenAI or Anthropic open-source their base models? What's the qualitative difference in safety risk between open-base and open-instruct?
Instruct models have refusal guardrails trained in via RLHF; base models are naked — type "how to make…" and a base model continues naturally with a complete answer, because it only learned language distribution. Once a base model is open, jailbreaking is trivial — you just continue text and bypass every alignment layer. Second risk: a base model can be cheaply SFT'd toward any specialty (including malicious ones), while instruct models carry "inertial resistance" from their existing alignment. Meta open-sourced the Llama base accepting real safety risk in exchange for ecosystem leadership; OpenAI's business logic doesn't allow that trade.
Supervised Fine-tuning (SFT)
LLM / Training
One-line intuition
Like handing the new grad a binder of "model answers" and having them copy each one out a thousand times — turning a model that "can continue text" into a model that "follows instructions."
What problem it solves
A pre-trained model only continues text. It doesn't know that "please translate this passage" should produce a translation and not commentary. SFT uses curated instruction-response pairs to nudge the model's output onto a track of "obedient, useful, well-formatted." This is the step that turns a base model into an instruct model — the entry point for every ChatGPT- or Claude-style product.
How it works (intuition)
The training objective is identical to pre-training (predict the next token). Only the data changes: from "internet stew" to "high-quality instruction-response pairs."
# A sample looks like this:
{
  "instruction": "Explain caching in one sentence",
  "response": "Caching stores the result of an expensive computation so it can be reused next time."
}
# Concatenated at training time:
prompt = "<|user|>Explain caching in one sentence<|assistant|>"
target = prompt + response
# Key detail: loss is computed only on the response portion, not the prompt
loss = cross_entropy(model(target), target, mask=is_response_token)
Usually 10K to 1M carefully chosen samples are enough to transform the model's behavior. Quality >> quantity.
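The "loss only on the response" detail can be sketched without any framework. The per-token log-probabilities below are made-up numbers and `masked_nll` is a hypothetical helper. In PyTorch the same effect is commonly obtained by setting prompt positions in the labels to -100, the default `ignore_index` of `CrossEntropyLoss`.

```python
import math

def masked_nll(token_logprobs, is_response_token):
    """Average negative log-likelihood over response tokens only."""
    kept = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    return sum(kept) / len(kept)

# Hypothetical log-probs the model assigned to each true token.
# The first three tokens are the prompt, the last two are the response.
token_logprobs = [math.log(0.1), math.log(0.2), math.log(0.1),
                  math.log(0.5), math.log(0.25)]
is_response_token = [False, False, False, True, True]

loss_masked = masked_nll(token_logprobs, is_response_token)
loss_unmasked = masked_nll(token_logprobs, [True] * 5)
print(loss_masked, loss_unmasked)
```

With the mask, the model is never penalized for failing to predict the user's words. Only the quality of its own answer contributes to the gradient.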
Code example
# The simplest possible SFT with trl, under 20 lines
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:1000]")
trainer = SFTTrainer(
    model=model, tokenizer=tok, train_dataset=ds,   # newer trl releases take processing_class=tok instead of tokenizer=tok
    args=SFTConfig(output_dir="sft-llama", num_train_epochs=1,
                   per_device_train_batch_size=2, learning_rate=2e-5),
)
trainer.train()   # after this, the model "listens"
Common misconception
SFT isn't about "teaching the model new knowledge" — it teaches "what format and tone to use." If the facts in your training set are wrong, the model will state those wrong facts with even more confidence. For knowledge, use RAG. Use SFT to shape style, format, domain vocabulary.
Key resources
- Hugging Face TRL docs: huggingface.co/docs/trl (high-level APIs for SFT/DPO/PPO)
- Sebastian Raschka blog: "Practical Tips for Fine-tuning LLMs"
Where you see it
Canonical: InstructGPT, Alpaca, Llama-3-Instruct.
Closer to home: SFT a "parenting helper" — feed 500 samples of "parent question → answer in the style you actually endorse," and the model will start replying with the tone and depth you want instead of generic platitudes.
English Summary
Supervised Fine-tuning (SFT) adapts a pre-trained model on curated instruction-response pairs so it follows directions instead of merely continuing text. The loss is computed only on the response portion. SFT shapes style, format, and tone — not new knowledge.
Think it through
"SFT teaches style, not knowledge" — but the model clearly memorizes facts from examples. How do those statements reconcile? Where exactly is the boundary?
Both are true at once. SFT polishes the way facts get expressed into whatever style you want, and it does write specific facts from the training set into the weights (parametric memory). The difference is generalization: the "style" learned by SFT transfers to inputs the model has never seen, but the "knowledge" only stays reliable near topics it has seen. Heuristic: facts that change (financials, news, internal docs) belong in RAG; stable domain terminology and format conventions belong in SFT. Treat SFT as "formatting layer" and RAG as "external database" — keep their responsibilities separate.
100 polished samples vs 100,000 mediocre ones — what goes wrong in each case? Why does the LIMA paper dare to claim 1,000 is enough?
100 samples: visible overfitting; the model regurgitates training-set phrasing and freezes on instruction types it never saw. 100K mediocre samples: the model learns the average answer — bland, stylistically chaotic, prone to inherit dataset tics ("As an AI assistant…"). LIMA's insight: the capability is already learned in pre-training; SFT only needs to "activate" it. So 1,000 diverse, high-quality samples are enough to wake the capability up, without diluting it with bulk mediocrity. The prerequisite is a strong base model. Analogous to software: "configuration < code" — a little precise config often beats a lot of boilerplate.
SFT and pre-training use the same loss function. Why are the outcomes so different? What does "data >> algorithm" mean for engineers?
Same loss, drastically different distributions: pre-training data is "the world's language," SFT data is "curated excellent answers." Loss is just a measuring stick; what determines capability is the distribution you're pulling the model toward. The takeaway for engineers: in the LLM era, architecture and optimizers are converging, and the moat is increasingly "proprietary high-quality data." Like in software: the algorithm is the CPU instruction set, the data is the program. Same CPU, wildly different outcomes. That's why many teams now invest as heavily in data annotation as in algorithm engineering.
Using GPT-4 to synthesize SFT data is now common — what's the long-term risk of "models teaching models," and what software anti-pattern does it parallel?
The risk is called "model collapse": synthetic data inherits the teacher's biases, hallucinations, and stylistic rigidity, while losing the long tail of real data. After several generations of inheritance, the distribution narrows and rare-case handling deteriorates. Same shape as the SE anti-pattern of "using yesterday's output as today's regression test" — you're measuring "consistent with the old version," not "correct." Mitigations: always anchor with a slice of real human data; mix synthesis from multiple strong, different teacher models for diversity; keep adversarial test sets to detect capability drift. See Shumailov et al., Nature 2024.
SFT typically uses LoRA on a small slice of parameters, but pre-training trains everything. How does this map onto "full vs incremental update" trade-offs in system design?
Pre-training learns language from scratch — every layer needs significant adjustment. SFT tunes expression on top of existing capability — small perturbations suffice, and LoRA's low-rank assumption ("the adjustment direction is low-dimensional") fits exactly. System-design analogy: pre-training is schema migration (structural change), SFT is config hot-reload (small parameter tweaks). Using schema migration tooling to change a config is overkill; using a config patch to change the schema is underpowered. Deeper view: this is Occam's Razor in parameter space — if a change can be expressed with few dimensions, don't touch the high-dimensional ones.
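To put a number on "small slice of parameters," here is a minimal sketch of the parameter count behind LoRA's low-rank update, where the weight change is written as B·A with B of shape d_out × r and A of shape r × d_in. The 4096 × 4096 layer size and rank 8 are illustrative assumptions, not values from this card.

```python
def lora_param_counts(d_in, d_out, rank):
    """Trainable parameters: full weight update vs low-rank factors A and B."""
    full = d_in * d_out               # updating W directly
    lora = rank * (d_in + d_out)      # B is d_out x rank, A is rank x d_in
    return full, lora

# Hypothetical single projection layer
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full, lora, f"{lora / full:.2%}")   # 16777216 65536 0.39%
```

Under these assumptions the adapter trains well under 1% of the layer's parameters, which is the "config hot-reload" end of the analogy above.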
RLHF (Reinforcement Learning from Human Feedback)
LLM / Alignment
One-line intuition
Like an App Store rating system: let humans label "which of these two answers is better," train a "judge model" on those preferences, then let the main model grind for the judge's approval — gradually learning to produce outputs humans prefer.
What problem it solves
SFT teaches the model to imitate written examples, but many qualities resist a canonical answer: what counts as "polite but not verbose"? "Helpful but not overreaching"? Writing examples by hand is slow and inconsistent, while comparing two answers to pick the better one is easy. RLHF turns a "hard to demonstrate" problem into an "easy to choose" problem — the key step that made ChatGPT feel "actually thoughtful" to mainstream users.
How it works (intuition)
Three sequential training stages, each building on the previous one:
# Step 1: SFT model already exists (previous card)
# Step 2: train a Reward Model (RM)
for (prompt, response_A, response_B, human_pref) in pref_data:
    score_A = reward_model(prompt, response_A)
    score_B = reward_model(prompt, response_B)
    # human_pref says which answer won; push that answer's score higher
    score_chosen, score_rejected = (score_A, score_B) if human_pref == "A" else (score_B, score_A)
    loss = -log(sigmoid(score_chosen - score_rejected))
# Step 3: use PPO to let the main model (initialized from SFT) maximize the RM's score
for prompt in prompts:
    response = sft_model.generate(prompt)
    reward = reward_model(prompt, response)    # judge scores it
    # PPO: move toward higher reward but stay close to SFT (KL penalty)
    sft_model.ppo_step(reward, kl_penalty=0.1)
The KL penalty matters — without it, the model discovers "reward hacking" and emits weird phrasings that the RM loves but no human does.
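The two formulas doing the work in steps 2 and 3 can be sketched in plain Python. The helper names, the scores, and the log-probabilities are assumptions chosen for illustration. The KL term here is the simple per-sample estimate (policy log-prob minus reference log-prob), which is one common way to implement the penalty and not necessarily what a given library does.

```python
import math

def pairwise_rm_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(chosen - rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def kl_shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """RM score minus a penalty for drifting away from the SFT reference."""
    return rm_score - kl_coef * (logp_policy - logp_ref)

loss_correct = pairwise_rm_loss(2.0, 0.0)    # RM already ranks the pair correctly
loss_backwards = pairwise_rm_loss(0.0, 2.0)  # RM ranks the pair the wrong way round
# Hypothetical response that the policy finds much likelier than the reference does
shaped = kl_shaped_reward(1.0, logp_policy=-5.0, logp_ref=-9.0)
print(loss_correct, loss_backwards, shaped)
```

A correctly ranked pair gives a small loss and a mis-ranked pair a large one. In the last line, part of the RM score is cancelled because the response has drifted away from what the SFT model would say.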
Code example
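# Step-3 skeleton in the style of trl's legacy PPOTrainer (newer trl releases changed this interface)
# Assumes sft_model (wrapped with a value head), tok, and dataloader are already defined;
# each batch is assumed to carry raw "prompt" strings plus tokenized "input_ids"
from trl import PPOTrainer, PPOConfig
from transformers import pipeline
import torch
reward = pipeline("text-classification", model="my-reward-model")   # placeholder RM name
ppo = PPOTrainer(config=PPOConfig(batch_size=8),
                 model=sft_model, tokenizer=tok)
for batch in dataloader:
    queries = list(batch["input_ids"])                 # list of 1-D token tensors
    responses = ppo.generate(queries, return_prompt=False, max_new_tokens=64)
    texts = tok.batch_decode(responses)
    # RM scores each (higher = more aligned with human preference)
    scores = [torch.tensor(reward(p + r)[0]["score"]) for p, r in zip(batch["prompt"], texts)]
    ppo.step(queries, responses, scores)               # step expects token tensors, not strings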
Common misconception
Many assume RLHF makes models "smarter." More accurately, it makes them more likeable: politer, more structured, more careful about boundaries. RLHF'd models often dip on objective benchmarks (the "alignment tax"), because they've been optimized to please raters rather than purely to get the answer right.
Key resources
- Hugging Face: "Illustrating RLHF" (the clearest visual intro)
- Chip Huyen: "RLHF: Reinforcement Learning from Human Feedback"
Where you see it
Canonical: ChatGPT, Claude, and Gemini all use RLHF or variants.
Closer to home: add thumbs-up / thumbs-down buttons to your team's AI assistant, accumulate enough preference data, train an RM, and PPO-optimize the main model — this is the standard recipe for crystallizing internal team preferences into the model itself.
English Summary
RLHF aligns a model with human preferences in three steps: SFT, training a reward model on pairwise human comparisons, then using PPO to make the policy maximize the reward while staying close to the SFT model via a KL penalty. It produces models that feel "helpful and harmless" rather than just fluent.
Think it through
Why can't the KL penalty be 0 in RLHF? What happens if it's huge? How does this hyperparameter mirror your trade-offs when tuning cache TTL?
With the KL coefficient at 0, the model goes feral chasing RM score, emitting things the RM loves but humans find repellent (reward hacking). With a very large coefficient, the model barely moves, equivalent to no training. The coefficient controls the explore-vs-conserve balance: the conservative region is controllable but yields little; the aggressive region yields more but goes off the rails. Same shape as cache TTL: TTL=0 always asks upstream (correct but slow), TTL=∞ is fast but stale data takes over. Both need tuning against upstream trust and downstream tolerance. Anthropic papers call it the "capability-vs-alignment Pareto frontier."
What does "reward hacking" actually look like in RLHF? Give the nastiest example. How does it mirror Goodhart's Law in software?
Classic case: early RMs preferred long answers, so the policy learned to pad replies with verbose disclaimers; RMs preferred "politeness," so the policy used polite hedging to dodge hard questions. The nastiest version: the policy discovers specific token patterns the RM rewards ("As a responsible assistant…") and farms points without humans actually liking the output. That's Goodhart: when a metric becomes a target, it stops being a good metric. Same shape as customer-support KPIs based on "call duration" leading to artificially slow speakers, or code-review KPIs based on "comment count" leading to noise reviews. RLHF's countermeasures: continuously refresh the RM, hold the KL constraint, and run adversarial red teams.
With only 1,000 preference samples, do you choose RLHF or SFT? Why is the "data efficiency" gap between alignment methods so wide?
1,000 samples? SFT, without hesitation. The RLHF three-stage pipeline simply doesn't work on small data: the RM can't learn a reliable signal, and PPO is even more unstable atop a noisy RM. SFT can ride 1,000 polished samples straight to a good result. Rules of thumb: under 5K → SFT, 5K-50K → DPO, 50K+ → consider RLHF. The data-efficiency gap comes from information density — each SFT sample tells the model "the complete answer," while each preference sample only says "A > B" (much sparser), so RLHF needs orders of magnitude more samples for comparable lift. Mirrors "declarative vs imperative API" information density.
The RM is itself a model — doesn't it suffer "stacked bias"? How do biases from annotators → RM → policy get amplified?
Yes, severely. Three-layer amplification: (1) annotators' cultural and educational background biases preferences toward a particular "elegance" — most annotators being native English speakers nudges the RM toward English phrasings; (2) once the RM has biases from limited samples, it extrapolates them universally to unseen domains; (3) PPO repeatedly optimizes the policy against the RM signal, overfitting to RM preferences — basically using the RM as a bias amplifier. Anthropic's Constitutional AI tries to sidestep this chain by replacing annotators with a written "constitution," but that just relocates the bias from "annotator" to "constitution author." It's a chain problem with no clean solution; mitigation comes from diverse annotator pools, ensembles of RMs, and continuous auditing.
ChatGPT frequently over-refuses reasonable requests. Is that an RLHF bug or feature? If you tuned it, what would you have to give up?
It's a feature, not a bug — during RLHF, "refuse rather than err" scores stably on the RM (refusal = low risk), so the policy gravitates toward refusing more. Local optimum on the strategy surface, disaster for UX. Reducing over-refusal forces three costs: (1) jailbreak success rates rise (looser boundaries); (2) occasional controversial outputs and PR risk; (3) need a finer-grained RM that distinguishes "really dangerous" from "sounds dangerous," doubling annotation cost. Anthropic visibly loosened the line on Claude 3 because they bet that "UX friction loss > occasional misjudgment loss."
DPO (Direct Preference Optimization)
LLM / Alignment
One-line intuition
It's RLHF collapsed into a single SQL query: skip the "train a judge + run PPO" two-step and tune the model directly from preference pairs. No reward model, no reinforcement learning, training as stable as SFT.
What problem it solves
RLHF has three pain points: (1) you train two models (RM + policy), doubling machines and engineering; (2) PPO is finicky, unstable, prone to collapse; (3) the RM is an approximation layer that introduces bias. DPO (Stanford, 2023) proves that under certain assumptions, "maximizing reward" is equivalent to a very simple classification loss, letting you bypass the RM and RL entirely and train directly on preference pairs. Result: training cost halved, stability dramatically improved.
How it works (intuition)
Core idea: for every preference pair, raise the chosen response's probability relative to a reference model, and lower the rejected one's.
# Training data: (prompt, chosen, rejected) triples
for (prompt, chosen, rejected) in data:
    # new model's log-prob for chosen minus reference (SFT) model's log-prob for chosen
    log_ratio_chosen = log P_new(chosen|prompt) - log P_ref(chosen|prompt)
    log_ratio_rejected = log P_new(rejected|prompt) - log P_ref(rejected|prompt)
    # Make chosen's lift > rejected's lift
    loss = -log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)))
In plain English: the new model, compared to the SFT reference, should prefer chosen more and rejected less. That's it: no RM, no PPO, no sampling loop, and far fewer reward-hacking headaches.
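The DPO loss above is small enough to run by hand. This is a minimal sketch for a single preference pair. The function name `dpo_loss` and all sequence log-probabilities are hypothetical, and a real implementation would sum token log-probs from the policy and reference models over a batch.

```python
import math

def dpo_loss(logp_new_chosen, logp_ref_chosen,
             logp_new_rejected, logp_ref_rejected, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    log_ratio_chosen = logp_new_chosen - logp_ref_chosen
    log_ratio_rejected = logp_new_rejected - logp_ref_rejected
    margin = beta * (log_ratio_chosen - log_ratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At the start the new model equals the reference, so both log-ratios are 0
loss_start = dpo_loss(-10.0, -10.0, -12.0, -12.0)
# Later (made-up numbers): chosen became likelier, rejected less likely
loss_later = dpo_loss(-8.0, -10.0, -15.0, -12.0)
print(loss_start, loss_later)
```

The loss starts at log(2) because the margin is zero, then falls as the model lifts the chosen answer relative to the reference and pushes the rejected one down. A larger `beta` makes the same log-ratio gap count for more, so the loss saturates after a smaller move away from the reference.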
Code example
# trl's DPOTrainer: the interface is as simple as SFTTrainer
from trl import DPOTrainer, DPOConfig
from datasets import load_dataset
# Preference data: each row has prompt / chosen / rejected
ds = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = DPOTrainer(
    model=sft_model,      # the SFT model from the previous step
    ref_model=None,       # None lets trl build its own frozen reference copy; the reference must not be the same object as the model being trained
    train_dataset=ds, tokenizer=tok,   # newer trl releases take processing_class=tok
    args=DPOConfig(beta=0.1, num_train_epochs=1,
                   learning_rate=5e-7, output_dir="dpo-llama"),
)
trainer.train()   # done: quality is often close to RLHF at far less effort
Common misconception
DPO isn't "better RLHF," it's "cheaper RLHF." When preference data is abundant, high-quality, and you need continuous online learning, online RL approaches like PPO still have a higher ceiling (you can keep sampling fresh answers for the RM to score). DPO is offline — it learns only what's in the preference set and can't explore new behavior. Llama 3 and Tulu use DPO-family methods.
Where you see it
Canonical: Llama-3-Instruct, Zephyr, the Tulu series all use DPO.
Closer to home: aligning a research-analyst assistant — collect 2,000 samples of "two answers to the same question + which one is more professional," DPO-tune, finish on one GPU in a few hours, and the quality often matches a full RLHF pipeline.
English Summary
DPO replaces the reward-model + PPO pipeline with a single supervised-style loss on preference pairs. It directly increases the log-probability gap between chosen and rejected responses relative to a frozen reference model. Cheaper, more stable, now the default open-source alignment recipe.
Think it through
Mathematically DPO is equivalent to RLHF (under some assumptions) but engineering-wise far simpler. Where else in software does "equivalent but simpler" show up?
Canonical analogues: (1) functional map/reduce is equivalent to a loop but more expressive; (2) declarative SQL queries equal hand-written JOIN loops but optimize themselves; (3) one GraphQL request replaces many REST calls. Same pattern: find the algebraic structure of the problem, then bypass intermediate steps with a closed-form. DPO's insight is the analytic relation between optimal policy and reward in RLHF, allowing the reward variable to be eliminated. In compiler optimization this is constant folding; in statistics it's marginalizing out a variable — a generic mental template worth keeping.
DPO is "offline." Why is that simultaneously its greatest strength and weakness? Compare batch vs stream processing.
Strength: training is as stable as SFT — data prepared once, no PPO sampling/scoring loop, engineering complexity halved. Weakness: improvement is capped at what the fixed dataset covers; the model can't actively explore "are there even better answers outside the preference set?" Once policy drift moves the model into regions the preferences don't cover, DPO gets no feedback. Same shape as batch processing: stable and debuggable but high-latency, blind to real-time shifts; streaming adapts continuously but is complex and crash-prone. Industry typically does "DPO first, switch to online RL once you have the data and money" — exactly mirroring the "batch then stream" data architecture pattern.
If all preference data comes from a single annotator, what traits will the DPO'd model develop? How does this differ qualitatively from traditional ML's "data diversity" problem?
The model will strongly impersonate that annotator's taste: sentence length, vocabulary, value judgments, plus randomness outside their expertise. Traditional ML's diversity failures cause "low accuracy on some subgroup" — locally visible. DPO's diversity failure causes "the model's whole personality is shaped by one person" — global and hard to detect, since every response carries that annotator's shadow. Worse, LLM output is "fluid" with no ground truth to cross-check, so bias detection must rely on adversarial testing, blind cross-population studies, and personality profiling. That's why Anthropic and OpenAI use thousands of annotators with varied backgrounds.
DPO's beta controls "how far the new model can drift from the reference." What happens at very large or small values? Why is the "middle" hardest to tune?
Beta very large: the model barely moves (strong constraint), conservative failure. Beta very small: the model can drift dramatically, but preference-data noise gets amplified — overfitting the pair's surface details instead of the intent behind them, aggressive failure. The middle is hardest because the optimum depends on three variables: preference data quality, reference model strength, and target-task alignment difficulty — none with closed-form formulas. So industry does grid search on beta (typically 0.01 to 0.5) while holding other hyperparameters fixed. Same shape as tuning K8s HPA thresholds or DB connection pool sizes — finding inflection points in systems with high feedback latency.
After DPO came IPO, KTO, ORPO and more — where might future alignment methods break ground? (Hint: all current methods assume "pairwise preferences" — what if there are none?)
Three plausible directions: (1) move beyond pairs — KTO already handles single-point "good/bad" labels; the future may use even sparser "pass/fail" signals or pure environment rewards; (2) multi-objective alignment — current methods squash "helpful, harmless, honest" into one scalar; future work needs Pareto optimization over multiple objectives; (3) self-alignment — let the model critique itself (Self-Rewarding LLM), breaking the annotator bottleneck. A more radical bet: rather than patching alignment in post-training, weight pre-training data by values from day one (Anthropic's "pretraining with values"). Common trend: shift "alignment" from a post-training fine-tune to a paradigm threaded through the whole training pipeline.