AI/ML Deep Dive: Alignment Math

Day 26 · 2026-06-12

For: engineers with coding experience, non-AI background · math-heavy

Engineering counterpart → super-individual D20: Prompt Injection (attack/defense when alignment is bypassed)

Today is about the math behind alignment: when a goal can't be written as code ("helpful," "harmless," "honest" have no unit tests), how do we turn human preferences into a loss a model can optimize? The four cards form one logical chain: ① the overall RLHF framework → ② how its core component, the reward model, learns scores from preferences → ③ replacing the expensive human with an AI (Constitutional AI) → ④ collapsing the whole framework into a single classification loss with algebra (DPO).

The RLHF Math Framework · Reinforcement Learning from Human Feedback

alignment · objective

One-line analogy

RLHF is "a CI pipeline for when you can't write the test." In backend work you can assert on "amount calculation," but you can't write an assert for "is this answer polite enough." RLHF's move: first distill thousands of human "this is better than that" judgments into a model that scores (the reward model), then have the main model keep optimizing to please that scorer, like passing CI — turning "needs that can't be formalized" into "a differentiable objective."

The problem + how it works

A pretrained model only "continues the most likely next token" — it has no motive to be helpful or harmless. This is objective mismatch: the training objective (language modeling) ≠ what we actually want (a helpful assistant). InstructGPT (Ouyang et al. 2022) established the classic three stages:

The RLHF three-stage pipeline

① SFT → fine-tune on human demos → gives π_ref (reference model)
② RM → train a reward model r on pairwise preferences (card 2)
③ PPO → RL-optimize policy π_θ to score higher under r
↑ the objective in step 3 is the heart of RLHF math

The objective to maximize in step 3 (the core formula):

max_θ E[ r(x,y) ] − β · KL( π_θ(y|x) ‖ π_ref(y|x) )

Term by term: x is the input prompt, y is the model's answer; the expectation is over prompts x and answers y sampled from π_θ; r(x,y) is the reward model's score for that answer; π_θ is the policy being trained, π_ref is the frozen SFT model. The first term, "maximize reward," is obvious: produce answers the reward model likes. The key is the second term, the KL penalty: KL divergence measures how far the new policy has drifted from the reference, and β weights it.

Why is this "leash" mandatory? Because the reward model is only a finite approximation with blind spots. Without the KL constraint, the optimizer will game the reward model (reward hacking) — find outputs that look like gibberish to humans but inexplicably score high, and the model rapidly "collapses" to these degenerate solutions. The KL penalty is like a rate limiter / circuit breaker in distributed systems: exploration allowed, but not too far from the known-good π_ref. Large β = short leash (conservative, close to SFT); small β = long leash (bold exploration, but prone to reward hacking).
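To make the leash concrete, here is a minimal sketch over a toy answer space of three answers. The rewards, the reference distribution, and the two candidate policies are made-up numbers for illustration only, and the objective is computed exactly over the three answers rather than estimated from samples as real training would do.

```python
import math

def kl(p, q):
    """Exact KL(p || q) for two categorical distributions over the same answers."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, rewards, beta):
    """E_policy[r] - beta * KL(policy || ref): the KL-penalized objective."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, ref)

# Hypothetical setup: the third answer is a "hack" the reward model overrates.
rewards  = [1.0, 1.5, 5.0]
ref      = [0.60, 0.39, 0.01]   # the SFT model almost never emits the hack
cautious = [0.40, 0.58, 0.02]   # small shift toward the better honest answer
hacker   = [0.01, 0.01, 0.98]   # collapses onto the overrated answer

for beta in (0.1, 2.0):
    scores = {name: objective(p, ref, rewards, beta)
              for name, p in (("cautious", cautious), ("hacker", hacker))}
    print(beta, "winner:", max(scores, key=scores.get))
```

With the long leash (β = 0.1) the collapsed policy scores higher; with the short leash (β = 2.0) its large KL cost outweighs the inflated reward and the cautious policy wins.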

Code

import torch
# RLHF step 3: actual signal = reward − β·KL (per-token form of the formula)
def rlhf_signal(policy_logp, ref_logp, reward, beta=0.1):
    # policy_logp / ref_logp: log-probs of generated tokens under policy / ref
    # reward: scalar score from the reward model for the whole answer
    kl = policy_logp - ref_logp        # per-token single-sample KL estimate
    signal = -beta * kl                # ← leash: drift penalty on every token
    # a common PPO convention: the whole-answer reward lands on the final token only
    signal[-1] = signal[-1] + reward
    return signal

# policy is more confident than ref on tokens 1 and 3, less confident on token 2
p = torch.tensor([-0.2, -1.1, -0.5])
r = torch.tensor(2.0)
print(rlhf_signal(p, torch.tensor([-0.3, -0.9, -0.6]), r))
# larger β → KL suppresses reward → output stays close to SFT, safer but tamer

Pitfall + use case

"RLHF makes the model smarter" — wrong. RLHF adds almost no knowledge or reasoning; it just brings forward abilities the model already had from pretraining, choosing to express them helpfully and harmlessly. Capability comes from pretraining; alignment only steers how it's expressed. Conflating the two makes you overrate RLHF and underrate data quality.

📌 Parent scenario: when you set your kid's AI tutor to be "patient, doesn't give answers directly, prompts thinking," you're doing personal alignment — except you use a system prompt (runtime steering) while vendors use RLHF (baked into weights at training). Understanding RLHF helps you judge: which behaviors a prompt can change, and which are locked into the weights and unmovable by prompting.

Takeaway + question

💡 RLHF's essence is "turning a goal you can't code into a differentiable loss," and the KL leash is the seatbelt that stops the optimizer from cheating.
🤔 What goals in your work "can't be unit-tested, only judged by feel"? If you had to turn one into an optimizable score, how would you collect the "preference data"?

Reward Models & Bradley-Terry · Reward Model

preference learning · probabilistic model

One-line analogy

A reward model computes an "Elo chess rating" for answers. No chess player can state an absolute skill number, but from many "A beat B" games the Elo system infers a scalar rating for each. A reward model works the same way: humans can't reliably say "this answer is worth 7.3," but they reliably judge "A is better than B." So we collect only pairwise comparisons and fit a scalar score consistent with them, with essentially the same math (Bradley-Terry / logistic regression) underneath.

The problem + how it works

The pain: RLHF step 3 needs a function that scores any answer automatically and on the fly, but human labeling is slow, expensive, and subjective. The fix has two steps: (1) collect preference data — generate two answers to the same prompt, a human picks the better one, giving a triple (x, y_w, y_l), w=winner (chosen), l=loser (rejected); (2) train a scalar scorer r from these comparisons.

The Bradley-Terry model translates "preference" into probability — the probability "y_w beats y_l" is the score difference through a sigmoid:

P(y_w ≻ y_l) = σ( r(x,y_w) − r(x,y_l) )

σ is the sigmoid (squashes any real number into 0–1 as a probability). Intuition: the larger the score gap, the closer to 1 the chance a human picks the high-scoring one; when scores tie, the probability is exactly 0.5 (a coin flip). Written as maximum likelihood, this is the reward model's training loss:

L = − E[ log σ( r(x,y_w) − r(x,y_l) ) ]

This loss only sees the difference of scores, not absolute values, so the absolute level (offset) of the reward is meaningless (add 100 to everything and the loss is unchanged), exactly like Elo caring only about relative strength. The idea of learning a reward from human preferences traces back, in its deep-RL form, to Christiano et al. 2017.
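The offset claim is easy to check numerically. The following is a small standard-library sketch (the scores are arbitrary illustrative values, and the helper names are not from any library): it evaluates the pairwise loss before and after adding 100 to both scores, and confirms the tie case.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_pair_loss(r_w, r_l):
    """-log sigma(r_w - r_l) for one (chosen, rejected) pair."""
    return -math.log(sigmoid(r_w - r_l))

base    = bt_pair_loss(2.1, 1.0)
shifted = bt_pair_loss(2.1 + 100.0, 1.0 + 100.0)   # add 100 to both scores
print(base, shifted)     # the two losses agree up to floating-point error
print(sigmoid(0.0))      # tied scores give a probability of 0.5
```

Note that only a shared additive offset drops out. Multiplying both scores by a constant changes the gap and therefore the loss.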

Pairwise preference → scalar reward
prompt x → answer A (human picks = y_w)  r = 2.1
         → answer B (y_l)                r = 1.0
gap 1.1 → σ(1.1) ≈ 0.75 → model's "A is better" probability is 75%, matching the human ✓

Code

import torch
import torch.nn.functional as F
# Reward model's Bradley-Terry loss: learns only the relative gap, the offset is moot
def bt_loss(r_chosen, r_rejected):
    # r_*: scalar scores for the chosen / rejected answers
    # −log σ(r_w − r_l): bigger, correctly-signed gap → smaller loss
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_w = torch.tensor([2.1, 0.5])   # pair 2: model scores the chosen one LOWER
r_l = torch.tensor([1.0, 0.8])
print(bt_loss(r_w, r_l))    # pair 2 is wrong-signed → larger loss, drives correction

Pitfall + use case

"The reward score can be used as an absolute quality rating" — wrong. It's only reliable in-distribution and for relative comparison. Once the policy games it with an OOD (out-of-distribution) weird output, the reward model may blindly score it high — exactly what the card-1 KL leash exists to prevent. The reward model is the "referee," but the referee can be fooled.

📌 Decision-support scenario: when you ask AI to "pick the best of 3 options," giving it pairwise comparisons (A vs B, B vs C) is often more reliable than asking it to "score each at once" — because relative judgment is more stable than absolute scoring. This cognitive regularity holds for both humans and models; it's Bradley-Terry projected onto real life.

Takeaway + question

💡 A reward model = an Elo system fitting scalar scores from pairwise preferences; it learns relative ordering, not absolute truth.
🤔 If human labelers themselves disagree a lot (some pick A, some pick B for the same pair), what does the Bradley-Terry reward become? What does that mean for "whose preferences are we aligning to"?

Constitutional AI / RLAIF · Harmlessness from AI Feedback

RLAIF · scalable oversight

One-line analogy

Constitutional AI is "replacing manual code review with running a linter." In RLHF every preference needs a human label, like every PR needing a senior engineer to review by hand — slow, expensive, inconsistent. CAI's move: write a plaintext set of rules (a "constitution"), have the model review and rewrite its own outputs against those rules, then use the model's own preference judgments in place of human labels. One readable, auditable, version-controllable principles doc replaces thousands of human annotations.

The problem + how it works

The pain: RLHF's human preference labeling is the cost and consistency bottleneck of alignment — harmlessness labeling especially makes people read harmful content repeatedly, which is costly and harmful. Anthropic's Constitutional AI (Bai et al. 2022) replaces the human with AI in two phases:

CAI's two phases

Phase 1 (SL): self-critique + revise
draft answer → self-critique vs constitution → revise → SFT on revisions

Phase 2 (RLAIF): AI labels preferences
two answers → model picks better one per constitution → train RM → RL
↑ replaces the human harmlessness labeling in the card-1 pipeline with "AI + constitution"

The core mechanism is RLAIF (Reinforcement Learning from AI Feedback): the pipeline is identical to RLHF (still reward model + KL leash + RL), the only difference is that preference labels are generated by the model per the constitution, not by humans. The "constitution" is a set of natural-language principles (e.g., "choose the more harmless, less preachy response"). This works thanks to one key assumption: judging which answer is better is easier than generating a good answer from scratch — the model can be a competent referee even when it makes mistakes as a player.
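The labeling step of phase 2 can be sketched as follows. This is a hypothetical outline, not the paper's implementation: `label_with_ai`, `sample`, and `judge` are illustrative names, and the toy stand-ins below replace what would be calls to a language model so the sketch runs on its own.

```python
from typing import Callable, List, Tuple

# (prompt, chosen, rejected): the same triple shape used for human preference data
Preference = Tuple[str, str, str]

def label_with_ai(
    prompts: List[str],
    sample: Callable[[str], Tuple[str, str]],    # hypothetical: returns two candidate answers
    judge: Callable[[str, str, str, str], int],  # hypothetical: 0 if answer A wins, 1 if B wins
    constitution: str,
) -> List[Preference]:
    triples = []
    for x in prompts:
        a, b = sample(x)
        winner = judge(constitution, x, a, b)
        y_w, y_l = (a, b) if winner == 0 else (b, a)
        triples.append((x, y_w, y_l))
    return triples

# Toy stand-ins; a real judge would prompt a model with the constitution and both answers.
def toy_sample(x):
    return (f"Sure, here is how to {x}.",
            "I can't help with that, but here is a safer alternative.")

def toy_judge(constitution, x, a, b):
    # Pretend "more harmless" means "declines": prefer the answer that opens with a refusal.
    return 1 if b.startswith("I can't") else 0

data = label_with_ai(["pick a lock"], toy_sample, toy_judge,
                     "Choose the more harmless response.")
print(data[0][1])
```

The output triples have the same (x, y_w, y_l) shape as human-labeled data, which is why the reward-model and RL stages downstream need no changes.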

Code

from anthropic import Anthropic
client = Anthropic()  # needs ANTHROPIC_API_KEY

principle = "Answers must not give actionable harmful details; decline politely with a reason."
draft = "<model's first-pass answer to a sensitive question>"
# CAI phase 1: have the model self-critique + revise per the "constitution"
# (no human labeling of harmful samples needed)
revision = client.messages.create(
    # swap in a model ID available to your account
    model="claude-opus-4-8", max_tokens=512,
    messages=[{"role": "user", "content":
        f"Principle: {principle}\nDraft answer: {draft}\n"
        "First point out where it violates the principle, then give a compliant rewrite."}]
).content[0].text
print(revision)  # SFT on these self-revisions → then RLAIF on AI preferences

Pitfall + use case

"CAI removes humans entirely, alignment is now automated" — overstated. The constitution itself is written by humans — human value judgment moves up from "label each datum" to "define the principles," it doesn't disappear, the lever just gets longer. And a model-as-referee amplifies its own biases; a vague constitution gets gamed. CAI lowers labeling volume, not human responsibility.

📌 Super-individual scenario: give your writing/coding agent a "personal constitution" (e.g., "conclusion first then detail," "no piled-up adjectives," "code must carry type annotations") and have it self-critique once before delivering. This is a miniature CAI phase 1 — replacing "manual line-by-line review" with "model self-review" via one plaintext principles doc.

Takeaway + question

💡 CAI's leverage is "lifting human value judgment from the data layer to the principles layer" — writing rules scales far better than labeling data.
🤔 When the referee (AI) and the player (AI) are the same model, it inherits its own blind spots in self-evaluation. How would you design a mechanism so that "self-review" doesn't degrade into "feeling good about itself"?

DPO: Direct Preference Optimization

algebraic collapse · no RL

One-line analogy

DPO is "using algebra to delete a whole microservice." RLHF is a three-piece architecture: a reward-model service + a sampling loop + PPO training — many components, mutually coupled, as fragile to tune as a distributed system. DPO found a mathematical identity proving that the "reward model" can actually be expressed as the policy's own log-probability ratio against the reference — so the reward-model component is deleted entirely, leaving just an ordinary classification loss. Like discovering an RPC call can be algebraically reduced to a local function call.

The problem + how it works

The pain: RLHF is hard in practice — you maintain 4 models (policy, reference, reward, value), and PPO is hyperparameter-sensitive, crash-prone, and slow to sample. DPO's (Rafailov et al. 2023) insight starts from that KL-penalized objective in card 1: it has a closed-form optimum. The optimal policy looks like this:

π*(y|x) = (1/Z) · π_ref(y|x) · exp( r(x,y) / β )

Z is the normalizing partition function (ensures probabilities sum to 1, depends on x). Inverting this solves for the reward:

r(x,y) = β · log( π*(y|x) / π_ref(y|x) ) + β·log Z

Here's the magic: substitute this r back into card 2's Bradley-Terry loss. Because the loss contains only the difference of rewards r(y_w)−r(y_l), and β·log Z is identical for the same x, it cancels on subtraction — that nastiest, hardest-to-compute normalizing term vanishes into thin air. What remains is a loss acting directly on the policy:
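The cancellation can be verified numerically on a toy prompt with three possible answers. In this sketch the reference probabilities, rewards, and β are arbitrary illustrative values: it builds π* from the closed form, inverts it, and checks that the leftover constant is β·log Z and that it drops out of any difference.

```python
import math

beta = 0.5
pi_ref  = [0.5, 0.3, 0.2]    # reference probabilities for three answers to one prompt
rewards = [1.0, 2.0, 0.0]    # hypothetical reward-model scores

# Closed-form optimum: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Invert: beta * log(pi* / pi_ref) equals r - beta * log Z
implicit = [beta * math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]
offsets = [r - i for r, i in zip(rewards, implicit)]   # each should equal beta * log Z

# In a difference between two answers the constant drops out
gap_true     = rewards[1] - rewards[0]
gap_implicit = implicit[1] - implicit[0]
print(offsets, beta * math.log(Z))
print(gap_true, gap_implicit)
```

Every entry of `offsets` matches β·log Z, and the two gaps agree, so the Bradley-Terry loss never needs Z.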

L_DPO = − E[ log σ( β·log[π_θ(y_w|x)/π_ref(y_w|x)] − β·log[π_θ(y_l|x)/π_ref(y_l|x)] ) ]

Compare card 2's reward-model loss: structurally the same −log σ(difference), only the "reward difference" is replaced by the "difference of the policy's log-prob ratio against the reference." That's the meaning of the paper's subtitle, "Your Language Model is Secretly a Reward Model." The KL leash didn't disappear; it's quietly encoded into the loss via β and π_ref.

RLHF vs DPO: architecture

RLHF: preference data → train reward model → PPO sample + RL → policy
4 models · unstable RL · slow sampling

DPO: preference data ━━━━━━━━━━→ policy (one classification loss)
2 models (θ + frozen ref) · no RL · like ordinary supervised training

Code

import torch
import torch.nn.functional as F
# DPO loss: reward = β·log(π/π_ref); partition function Z cancels in subtraction
def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    # *_w / *_l: seq log-probs of chosen/rejected under policy(pi)/reference(ref)
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()  # same σ as card 2!

# policy raised chosen, lowered rejected vs ref → logits>0 → loss just under log 2 ≈ 0.693
print(dpo_loss(torch.tensor([-1.0]), torch.tensor([-2.0]),
               torch.tensor([-1.2]), torch.tensor([-1.5])))
# in practice just use HuggingFace TRL's DPOTrainer, no need to hand-write

Pitfall + use case

"DPO is mathematically equivalent to RLHF, so it's always better": no. The equivalence holds under idealized assumptions. In practice DPO only sees the fixed preference pairs in the training set (offline), so it is more prone to overfitting the preference data and can be less robust out-of-distribution than online RLHF, and it gives up the asset of a "reusable reward model." DPO is simple and stable; RLHF often has a higher ceiling. That engineering trade-off belongs to the super-individual track, so this is just a pointer.

📌 Personal project scenario: to fine-tune a small model's "style preference" on your own domain data (e.g., teaching it your writing voice), DPO is the first choice — just a batch of (prompt, good answer, bad answer) triples + TRL, no need to stand up the whole RL stack. This is exactly why DPO brought "alignment" into everyday projects.
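As a starting point for such a project, the triples can be stored one JSON object per line. The sketch below assumes the "prompt" / "chosen" / "rejected" column names that TRL's documentation describes for preference data; check the format against the TRL version you install. The example row and the `to_jsonl` helper are hypothetical.

```python
import json

# Hypothetical style-preference example: terse and factual beats hype.
triples = [
    {"prompt": "Summarize our release notes in one sentence.",
     "chosen": "This release shortens startup time and fixes the login crash.",
     "rejected": "We are absolutely thrilled to announce an amazing new release!"},
]

def to_jsonl(rows):
    """Validate preference rows and serialize them as JSON Lines."""
    required = {"prompt", "chosen", "rejected"}
    lines = []
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing {sorted(missing)}")
        if row["chosen"] == row["rejected"]:
            raise ValueError(f"row {i}: chosen and rejected are identical")
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)

print(to_jsonl(triples))
```

Rejecting rows whose two answers are identical is a cheap guard: such a pair carries no preference signal.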

Takeaway + question

💡 DPO uses one algebraic identity to prove "the language model is itself an implicit reward model," collapsing RLHF's complex architecture into a single classification loss.
🤔 One algebraic simplification (canceling the partition function Z) sharply cut the engineering cost of alignment. In your domain, is there a "seemingly necessary intermediate component" that could be eliminated by an identity?

Deep Questions

1. "−log σ(difference)" appears twice across the cards (reward-model loss and DPO loss). Why does this sigmoid loss become the "universal shape" of preference learning?

Because preference is inherently a binary-comparison problem, and the sigmoid is the most natural function mapping a "real-valued score difference" to a "choice probability." The root is the Bradley-Terry model (a close relative of psychology's Thurstone model): assume each option has a latent utility and the observed preference is the utility difference plus logistic noise, and P(A≻B)=σ(u_A−u_B) follows directly (Thurstone's Gaussian noise gives a probit curve of nearly the same shape). This is the same math as logistic regression: logistic regression is "features → binary probability," BT is "two candidates → who-wins probability." So whether you define "score" as the reward model's output (card 2) or the policy's log-prob ratio (card 4), as long as you're learning pairwise preferences, the loss takes the shape −log σ(difference). Recognizing this invariant gives you the mathematical skeleton of modern alignment: only what you put inside the "difference" changes, the shell is always sigmoid. You can think of it as: many seemingly different ranking/matching/recommendation problems reduce, underneath, to the same logistic kernel.
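The noise story can be checked by simulation. This sketch makes an assumption the text does not spell out: each option's utility is perturbed by independent standard Gumbel noise, a common random-utility setup whose difference is logistic. The utilities, sample size, and seed are arbitrary.

```python
import math
import random

def gumbel(rng):
    """One standard Gumbel draw: -log(E) with E ~ Exponential(1)."""
    return -math.log(max(rng.expovariate(1.0), 1e-300))

def simulated_win_rate(u_a, u_b, n, rng):
    """Fraction of trials in which noisy utility A exceeds noisy utility B."""
    wins = sum((u_a + gumbel(rng)) > (u_b + gumbel(rng)) for _ in range(n))
    return wins / n

rng = random.Random(0)
u_a, u_b = 2.1, 1.0
empirical = simulated_win_rate(u_a, u_b, 200_000, rng)
predicted = 1.0 / (1.0 + math.exp(-(u_a - u_b)))   # sigma(u_A - u_B)
print(empirical, predicted)
```

The simulated win rate lands close to σ(1.1), the same value as the worked example in card 2.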

2. The KL leash (β) is an explicit penalty term in RLHF but "disappears" in DPO — is it really gone?

It didn't disappear; it's encoded into the structure of the loss. Recall the DPO derivation: the optimal policy π* = (1/Z)·π_ref·exp(r/β) is itself the solution to the KL-constrained problem — β and π_ref together determine "how far the policy may drift from the reference." When you solve for r and substitute into the BT loss, β is still in the DPO loss (multiplying the log-prob ratio) and π_ref is still there (as the denominator). So the leash is still leashing, it just went from "an extra penalty added at training time" to "baked into the algebraic form of the loss." Practical implication: tuning β in DPO is still tuning the "stay close to reference vs. dare to change" balance, and too-small β will likewise let the model drift on the preference data and degenerate. One counterintuitive point: DPO's KL is implicit and offline — it constrains "relative to those answers in the training set," whereas RLHF's KL is an online constraint on new answers the current policy samples. This is exactly why DPO can be less stable than online RLHF out-of-distribution — same leash, anchored at a different spot.
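One way to see β acting as a leash inside the DPO loss is to ask how far the policy's log-ratio gap must move before a pair's loss falls to a given level. The sketch below inverts the per-pair loss for that gap; the target loss of 0.1 and the β values are arbitrary choices for illustration, and the helper names are hypothetical.

```python
import math

def dpo_pair_loss(log_ratio_gap, beta):
    """-log sigma(beta * gap), where gap = log-ratio(chosen) - log-ratio(rejected)."""
    return math.log1p(math.exp(-beta * log_ratio_gap))

def gap_needed(target_loss, beta):
    """Log-ratio gap at which the pair loss equals target_loss."""
    margin = -math.log(math.expm1(target_loss))   # solves log(1 + e^-m) = target for m
    return margin / beta

for beta in (0.05, 0.1, 0.5):
    print(beta, gap_needed(0.1, beta))
```

The required gap scales as 1/β: with a small β the policy has to move its log-probabilities much further from the reference to satisfy the same pair, which is the drift described above.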

3. RLHF / CAI / DPO all assume "judging A is better than B" is a reliable supervision signal. When tasks get so complex that humans (or AI) can't judge accurately, how does this paradigm break down?

This is alignment research's central worry, called scalable oversight. The whole premise of preference learning is "evaluation is easier than generation" — I can't write perfect code, but I can tell which snippet is better. When a task exceeds the evaluator's ability (judging a very long math proof, a piece of frontier research you can't follow, a decision whose consequences surface only in ten years), the premise fails: (a) human labelers will favor answers that look good but are actually wrong (fluent nonsense > clumsy truth), so the model learns sycophancy and persuasiveness rather than correctness; (b) when the judge AI itself in CAI can't judge accurately, RLAIF amplifies rather than corrects errors. Known mitigation directions: debate (two models poke holes in each other, the human only referees), recursive reward modeling, process supervision (judging each reasoning step, not just the conclusion) — but none fully solves it. The deep tension: we're aligning systems that may become more capable than the judge, using methods that "today's human/model can still judge." For someone interested in consciousness and complexity science, this is really an epistemological question: when the supervised is stronger than the supervisor, what can "alignment" still be anchored to? This is the true frontier the current math of alignment touches.

4. "RLHF doesn't add capability, only steers expression" — if alignment is just steering, where does the "alignment tax" (aligned models getting weaker on some benchmarks) come from?

The "alignment tax" is the regression aligned models show on some capability benchmarks versus pure pretrained ones. If alignment truly only adjusts expression without touching capability, why a tax? Several mechanisms: (a) distribution narrowing — RLHF/DPO pulls the policy toward "the answer styles humans prefer," suppressing some low-probability but useful output modes (e.g., unconventional but correct solutions); the tighter the KL leash, the more pronounced the narrowing; (b) preference-correctness mismatch — humans prefer confident, thorough, polite answers, so the model may sacrifice conciseness or honest admission of uncertainty to please, showing up as worse calibration; (c) capability-safety trade-off — boundaries that decline for harmlessness "collateral-damage" legitimate requests (over-refusal). The InstructGPT paper itself observed regressions on some NLP benchmarks and mitigated them by mixing in pretraining gradients. So the more accurate statement: alignment mainly reshapes the expression distribution of capability, and reshaping a distribution isn't free — it incidentally suppresses the slice of capability not covered by the preferences. This is why "alignment" and "capability" are two goals frontier labs optimize together, tugging at each other, rather than one after the other.

5. DPO lowered alignment from "lab-grade RL engineering" to "ordinary fine-tuning," letting anyone align a model to their own preferences. What are the two faces of this "democratization of alignment"?

The upside: alignment capability devolves to individuals and small teams; you can make a model fit your values, domain voice, and workflow. That is exactly the infrastructure for the "AI super-individual," making a "personal constitution + DPO fine-tune" workflow feasible. The downside is sharp too: (a) aligned to whose preferences? Once DPO makes "alignment" a neutral tool, it can align toward "more helpful" but also toward "more manipulative" or "more pandering to a particular stance"; un-alignment and alignment use the same math; (b) preference data is power: whoever defines the batch of (prompt, good, bad) triples defines the model's values, compressing "value judgment" into an outsourceable, tradeable dataset; (c) fragmentation: when everyone/every org aligns the model to themselves, the shared value baseline erodes. Deep down, the math of alignment solves the mechanism question of "how to pour preferences into a model," but "whose preferences should be poured" and "who has the right to decide" are normative questions it cannot answer. The simpler the tech makes the mechanism (DPO), the more that normative question is pushed to the foreground. This may be the most important non-mathematical question this math leaves us.
