Today is about the math behind alignment: when a goal can't be written as code ("helpful," "harmless," "honest" have no unit tests), how do we turn human preferences into a loss a model can optimize? The four cards form one logical chain: ① the overall RLHF framework → ② how its core component, the reward model, learns scores from preferences → ③ replacing the expensive human with an AI (Constitutional AI) → ④ collapsing the whole framework into a single classification loss with algebra (DPO).
RLHF is "a CI pipeline for when you can't write the test." In backend work you can assert on "amount calculation," but you can't write an assert for "is this answer polite enough." RLHF's move: first distill thousands of human "this is better than that" judgments into a model that scores (the reward model), then have the main model keep optimizing to please that scorer, like passing CI — turning "needs that can't be formalized" into "a differentiable objective."
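A pretrained model only "continues the most likely next token"; it has no motive to be helpful or harmless. This is objective mismatch: the training objective (language modeling) ≠ what we actually want (a helpful assistant). InstructGPT (Ouyang et al. 2022) established the classic three stages: (1) supervised fine-tuning (SFT) on human-written demonstrations; (2) training a reward model on human comparisons between model outputs; (3) reinforcement learning (PPO) against that reward model, held close to the SFT model by a KL penalty.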
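The objective to maximize in step 3 (the core formula):

$$\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]$$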
Term by term: x is the input prompt, y is the model's answer; r(x,y) is the reward model's score for that answer; πθ is the policy being trained, πref is the frozen SFT model. The first term, "maximize reward," is obvious — produce answers the reward model likes. The key is the second term, the KL penalty: KL divergence measures how far the new policy has drifted from the reference, and β weights it.
Why is this "leash" mandatory? Because the reward model is only a finite approximation with blind spots. Without the KL constraint, the optimizer will game the reward model (reward hacking) — find outputs that look like gibberish to humans but inexplicably score high, and the model rapidly "collapses" to these degenerate solutions. The KL penalty is like a rate limiter / circuit breaker in distributed systems: exploration allowed, but not too far from the known-good π_ref. Large β = short leash (conservative, close to SFT); small β = long leash (bold exploration, but prone to reward hacking).
import torch

# RLHF step 3: actual signal = reward − β·KL (per-token form of the formula)
def rlhf_signal(policy_logp, ref_logp, reward, beta=0.1):
    # policy_logp / ref_logp: log-probs of generated tokens under policy / ref
    # reward: scalar score from the reward model for the whole answer
    kl = policy_logp - ref_logp        # single-sample KL estimate, one value per token
    signal = -beta * kl                # ← leash: every token pays its own drift penalty
    signal[-1] = signal[-1] + reward   # whole-answer reward credited once, at the final token (a common convention)
    return signal

p = torch.tensor([-0.2, -1.1, -0.5])  # policy more confident than ref on tokens 1 and 3, less on token 2
r = torch.tensor(2.0)
print(rlhf_signal(p, torch.tensor([-0.3, -0.9, -0.6]), r))
# larger β → KL suppresses reward → output stays close to SFT, safer but tamer
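The leash can also be seen at the level of whole distributions. The following is a minimal sketch, assuming a made-up prompt with only three possible answers; the distributions `ref`, `cautious`, `greedy` and the `rewards` list are hypothetical numbers chosen for illustration, not data from any real model. It computes the exact KL and the penalized objective for two candidate policies under a small and a large β.

```python
import math


def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions over the same answers."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(policy, ref, rewards, beta):
    """Expected reward minus the beta-weighted drift from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, ref)


# Hypothetical toy setup: three candidate answers to one prompt.
ref = [0.6, 0.3, 0.1]          # frozen SFT policy
rewards = [1.0, 1.2, 3.0]      # answer 3 plays the role of a reward-model blind spot
cautious = [0.55, 0.35, 0.10]  # small step away from ref
greedy = [0.02, 0.03, 0.95]    # piles probability onto the blind spot

for beta in (0.1, 5.0):
    j_cautious = objective(cautious, ref, rewards, beta)
    j_greedy = objective(greedy, ref, rewards, beta)
    winner = "greedy" if j_greedy > j_cautious else "cautious"
    print(f"beta={beta}: cautious={j_cautious:.3f} greedy={j_greedy:.3f} -> {winner}")
```

With these numbers the greedy policy scores higher at β = 0.1, while at β = 5.0 its KL cost outweighs the extra reward and the cautious policy wins, which is the short-leash versus long-leash trade-off in miniature.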
A reward model computes an "Elo chess rating" for answers. No chess player can state an absolute skill number, but from many "A beat B" games the Elo system infers a scalar rating for each. A reward model works the same way: humans can't reliably say "this answer is worth 7.3," but they reliably judge "A is better than B." So we collect only pairwise comparisons and fit a scalar score consistent with them, with the same kind of math underneath (Bradley-Terry, i.e. logistic regression on score differences).
The pain: RLHF step 3 needs a function that scores any answer automatically and on the fly, but human labeling is slow, expensive, and subjective. The fix has two steps: (1) collect preference data — generate two answers to the same prompt, a human picks the better one, giving a triple (x, yw, yl), w=winner (chosen), l=loser (rejected); (2) train a scalar scorer r from these comparisons.
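One way to picture the preference data is as a small record type. This is only a sketch: the names `PreferencePair` and `make_pair` are hypothetical, introduced here for illustration, and real datasets usually carry extra fields such as annotator IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # x
    chosen: str    # y_w, the answer the human preferred
    rejected: str  # y_l, the answer the human passed over


def make_pair(prompt, answer_a, answer_b, human_prefers_a):
    """Turn one human comparison into an (x, y_w, y_l) record."""
    if human_prefers_a:
        return PreferencePair(prompt, chosen=answer_a, rejected=answer_b)
    return PreferencePair(prompt, chosen=answer_b, rejected=answer_a)


pair = make_pair(
    "Explain what a mutex is.",
    "It's a lock.",
    "A mutex is a lock that lets only one thread enter a critical section at a time.",
    human_prefers_a=False,
)
print(pair.chosen)
```

Note that the record stores no score at all, only which answer won; the scalar reward is inferred later from many such records.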
The Bradley-Terry model translates "preference" into probability: the probability that "yw beats yl" is the score difference passed through a sigmoid:

$$P(y_w \succ y_l \mid x) \;=\; \sigma\big( r(x, y_w) - r(x, y_l) \big)$$
σ is the sigmoid (squashes any real number into 0–1 as a probability). Intuition: the larger the score gap, the closer to 1 the chance a human picks the high-scoring one; when scores tie, the probability is exactly 0.5 (a coin flip). Written as maximum likelihood, this is the reward model's training loss:

$$\mathcal{L}_{\mathrm{RM}}(r) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[ \log \sigma\big( r(x, y_w) - r(x, y_l) \big) \Big]$$
This loss only sees the difference of scores, not absolute values, so the absolute level (offset) of the reward is meaningless: add 100 to every score and the loss is unchanged, exactly like Elo caring only about relative strength. The idea of learning a reward from human preferences traces back to Christiano et al. 2017.
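Both properties, the coin flip at a tie and the indifference to a constant shift, are easy to check numerically. The following is a minimal stdlib-only sketch; the score values are arbitrary illustration numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_prob(r_w, r_l):
    """Bradley-Terry probability that the first answer beats the second."""
    return sigmoid(r_w - r_l)


def bt_loss(pairs):
    """Mean negative log-likelihood over (chosen_score, rejected_score) pairs."""
    return -sum(math.log(bt_prob(r_w, r_l)) for r_w, r_l in pairs) / len(pairs)


scores = [(2.1, 1.0), (0.5, 0.8)]                       # arbitrary example scores
shifted = [(r_w + 100, r_l + 100) for r_w, r_l in scores]

print(bt_prob(3.0, 3.0))                   # tied scores
print(bt_loss(scores), bt_loss(shifted))   # same loss up to float rounding
```

Multiplying every score by a constant is a different matter: that changes the gaps and therefore the loss, so only the offset is free.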
import torch
import torch.nn.functional as F

# Reward model's Bradley-Terry loss: learns only the relative gap, the offset is moot
def bt_loss(r_chosen, r_rejected):
    # r_*: scalar scores for the chosen / rejected answers
    # −log σ(r_w − r_l): bigger, correctly-signed gap → smaller loss
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_w = torch.tensor([2.1, 0.5])  # pair 2: model scores the chosen one LOWER
r_l = torch.tensor([1.0, 0.8])
print(bt_loss(r_w, r_l))  # pair 2 is wrong-signed → larger loss, drives correction
Constitutional AI is "replacing manual code review with running a linter." In RLHF every preference needs a human label, like every PR needing a senior engineer to review by hand — slow, expensive, inconsistent. CAI's move: write a plaintext set of rules (a "constitution"), have the model review and rewrite its own outputs against those rules, then use the model's own preference judgments in place of human labels. One readable, auditable, version-controllable principles doc replaces thousands of human annotations.
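The pain: RLHF's human preference labeling is the cost and consistency bottleneck of alignment. Harmlessness labeling is the worst case: annotators have to read harmful content over and over, which is expensive and takes a toll on them. Anthropic's Constitutional AI (Bai et al. 2022) replaces the human with AI in two phases: (1) a supervised phase, where the model critiques and revises its own answers against the constitution and is then fine-tuned on the revisions; (2) an RL phase, where the model compares pairs of answers against the constitution, those AI preferences train a preference model, and the policy is optimized against it.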
The core mechanism is RLAIF (Reinforcement Learning from AI Feedback): the pipeline is the same as RLHF (still reward model + KL leash + RL); the only difference is that preference labels are generated by the model per the constitution, not by humans. In Bai et al. 2022 this covers the harmlessness labels; the helpfulness labels still came from humans. The "constitution" is a set of natural-language principles (e.g., "choose the more harmless, less preachy response"). This works thanks to one key assumption: judging which answer is better is easier than generating a good answer from scratch, so the model can be a competent referee even when it makes mistakes as a player.
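The labeling step of RLAIF can be sketched without any API call. In the sketch below, `label_pair` and `toy_judge` are hypothetical names; `toy_judge` is a crude keyword check standing in for what would really be a model prompted with a constitutional principle, so it illustrates only the data flow, not the quality of the judgment.

```python
def label_pair(prompt, answer_a, answer_b, judge):
    """Turn one AI verdict into an (x, y_w, y_l) preference triple.

    `judge` is any callable taking (prompt, answer_a, answer_b)
    and returning "A" or "B".
    """
    verdict = judge(prompt, answer_a, answer_b)
    if verdict == "A":
        return (prompt, answer_a, answer_b)
    if verdict == "B":
        return (prompt, answer_b, answer_a)
    raise ValueError(f"unexpected verdict: {verdict!r}")


def toy_judge(prompt, answer_a, answer_b):
    # Toy stand-in for a constitution-guided model call: prefers the answer
    # that declines, detected with a naive substring match.
    declines_a = "can't help" in answer_a.lower()
    declines_b = "can't help" in answer_b.lower()
    return "B" if declines_b and not declines_a else "A"


triple = label_pair(
    "<sensitive question>",
    "<answer with actionable harmful details>",
    "Sorry, I can't help with that because it could cause harm.",
    toy_judge,
)
print(triple)
```

The output has the same (x, yw, yl) shape as human-labeled data, which is why the reward-model training downstream does not need to change.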
from anthropic import Anthropic

client = Anthropic()  # needs ANTHROPIC_API_KEY

principle = "Answers must not give actionable harmful details; decline politely with a reason."
draft = "<model's first-pass answer to a sensitive question>"

# CAI phase 1: have the model self-critique + revise per the "constitution"
# (no human labeling of harmful samples needed)
revision = client.messages.create(
    model="claude-opus-4-8",  # model ID as written in these notes; substitute one your account can access
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": (
            f"Principle: {principle}\nDraft answer: {draft}\n"
            "First point out where it violates the principle, then give a compliant rewrite."
        ),
    }],
).content[0].text
print(revision)  # SFT on these self-revisions → then RLAIF on AI preferences
DPO is "using algebra to delete a whole microservice." RLHF is a three-piece architecture: a reward-model service + a sampling loop + PPO training — many components, mutually coupled, as fragile to tune as a distributed system. DPO found a mathematical identity proving that the "reward model" can actually be expressed as the policy's own log-probability ratio against the reference — so the reward-model component is deleted entirely, leaving just an ordinary classification loss. Like discovering an RPC call can be algebraically reduced to a local function call.
The pain: RLHF is hard in practice. You maintain 4 models (policy, reference, reward, value), and PPO is hyperparameter-sensitive, crash-prone, and slow to sample. DPO's (Rafailov et al. 2023) insight starts from that KL-penalized objective in card 1: it has a closed-form optimum. The optimal policy looks like this:

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)$$
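Z is the normalizing partition function (ensures probabilities sum to 1, depends only on x). Inverting this solves for the reward:

$$r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)$$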
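Here's the magic: substitute this r back into card 2's Bradley-Terry loss. Because the loss contains only the difference of rewards r(yw)−r(yl), and β·log Z is identical for the same x, it cancels on subtraction. The nastiest, hardest-to-compute normalizing term vanishes. What remains is a loss acting directly on the policy:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]$$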
Compare card 2's reward-model loss: structurally the same −log σ(difference), only the "reward difference" is replaced by the "difference of the policy's log-prob ratio against the reference." That's the meaning of the paper's title — the language model is secretly a reward model. The KL leash didn't disappear; it's quietly encoded into the loss via β and π_ref.
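The cancellation of Z can be checked numerically on a toy problem. This is a minimal sketch, assuming a single prompt with three possible answers; `ref`, `rewards`, and `beta` are made-up values, and the helper names `optimal_policy` and `implicit_rewards` are hypothetical. It builds the closed-form optimal policy, recovers β·log(π*/π_ref), and compares it with the true reward.

```python
import math


def optimal_policy(ref, rewards, beta):
    """pi*(y) = ref(y) * exp(r(y) / beta) / Z over a finite answer set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def implicit_rewards(policy, ref, beta):
    """beta * log(pi(y) / ref(y)): the reward without its beta * log Z term."""
    return [beta * math.log(p / q) for p, q in zip(policy, ref)]


beta = 0.5
ref = [0.5, 0.3, 0.2]      # hypothetical reference policy
rewards = [1.0, 2.0, 0.0]  # hypothetical true rewards

pi_star, z = optimal_policy(ref, rewards, beta)
implicit = implicit_rewards(pi_star, ref, beta)

# Each true reward differs from its implicit reward by the same constant...
offsets = [r - i for r, i in zip(rewards, implicit)]
print(offsets, beta * math.log(z))

# ...so reward differences, the only thing Bradley-Terry sees, are preserved.
print(rewards[1] - rewards[0], implicit[1] - implicit[0])
```

Every entry of `offsets` equals β·log Z up to float rounding, so any pairwise difference of implicit rewards matches the difference of true rewards, which is exactly why Z never has to be computed.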
import torch
import torch.nn.functional as F

# DPO loss: reward = β·log(π/π_ref); partition function Z cancels in subtraction
def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # *_w / *_l: seq log-probs of chosen/rejected under policy(pi)/reference(ref)
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()  # same σ as card 2!

# policy raised chosen, lowered rejected vs ref → logits>0 → loss below log 2 ≈ 0.693
print(dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
               torch.tensor([-1.2]), torch.tensor([-1.5])))
# in practice just use HuggingFace TRL's DPOTrainer, no need to hand-write