Distillation Pipeline · Constitutional/RLAIF · Rejection Sampling · Model Collapse
2026-06-10 · BigCat
The ceiling on self-improvement isn't how smart the model is — it's how reliable your verifier is.
// WHY THIS MATTERS
The 2026 reality: high-quality human data is nearly exhausted, yet the small model you want to fine-tune, the agent you want to eval, and the private assistant you want to align are all starved for data. Synthetic data is no longer a compromise for when you can't afford annotation; it's the main engine of frontier training, and also the easiest place to crash. This issue skips "what synthetic data is" and covers four engineering topics: how to distill a usable dataset from a strong model (filtering is the lever, not generation), how to drop Constitutional AI's critique-revise loop into your own inference, which tasks can genuinely self-improve via rejection sampling, and why recursively eating your own outputs makes a model quietly collapse. One thread runs through all of it: self-improvement rests on the asymmetry that verification is cheaper than generation, and it can happen only where that asymmetry holds.
// 01
Distillation: Generation Is Cheap, Filtering Is the Lever
Claim: a synthetic dataset's quality ceiling is set by the filter, not the generator; over-generate 10× and ruthlessly delete 90% is almost always right.
Background & Principle
Distillation has a plain engineering shape: a strong teacher (Opus-class) mass-produces "instruction + answer" pairs, fed to a student (a small model / your fine-tune) for SFT. Self-Instruct (Wang 2022) showed a model can bootstrap its own instruction data, but its real contribution isn't "it can generate" — it's the filtering pipeline: dedup (drop if ROUGE similarity to existing tasks is too high), de-degenerate (too short / too long / repetitive / "as an AI" boilerplate), and de-invalid (unverifiable or self-contradictory). Generating tokens is nearly free; quality comes almost entirely from deletion. A repeatedly confirmed lesson: 10K curated samples beat 100K average ones — the latter pours the teacher's filler and blind spots straight into the student.
Hands-on
# Distillation = over-generate + aggressive filter. Skeleton of the filter.
# seeds, parse, rouge_l, and llm_judge are placeholders you supply.
from anthropic import Anthropic
client = Anthropic()

def gen(seed):  # teacher, high temperature, high volume
    r = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1024, temperature=1.0,
        messages=[{"role": "user",
                   "content": f"From this seed, craft a new {{instruction, answer}}; answer must be checkable:\n{seed}"}])
    return r.content[0].text

def keep(item, pool):
    if len(item.answer) < 40: return False            # de-degenerate
    if rouge_l(item.instr, pool) > 0.7: return False  # dedup against items already kept
    if "as an AI" in item.answer: return False        # drop boilerplate
    return llm_judge(item) >= 4                       # 1-5 scale, delete all <4

# parse() turns the teacher's text into an object with .instr and .answer
raw = [parse(gen(s)) for s in seeds for _ in range(10)]  # 10× over-generate
data = []
for x in raw:
    if keep(x, data):   # pool = survivors so far; a pool that contains x itself would reject x
        data.append(x)  # often <15% survive
Failure mode: the teacher's systematic blind spots get silently inherited — it's consistently wrong on some class, the filter (same-family model) can't tell, so the error is written into the student with high confidence. Fix: filter with heterogeneous signals (judge models from a different family, executable verification, human spot-checks). Don't let the generator and the filter be the same brain.
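The dedup step is the one piece of the filter you can write in an afternoon with no model calls. A minimal sketch, assuming whitespace tokenization and no stemming (a simplification of real ROUGE-L), with the 0.7 threshold taken from the skeleton above. The names `lcs_len`, `rouge_l_f1`, and `dedup` are illustrative, not from any library.

```python
from typing import List


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (assumption: no stemming)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def dedup(instructions: List[str], threshold: float = 0.7) -> List[str]:
    """Greedy dedup: keep an instruction only if it is below the threshold
    against every instruction already kept."""
    kept: List[str] = []
    for instr in instructions:
        if all(rouge_l_f1(instr, k) < threshold for k in kept):
            kept.append(instr)
    return kept
```

Greedy order matters: whichever near-duplicate arrives first wins. If you care which variant survives, sort by judge score before deduplicating.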
// 02
Constitutional AI / RLAIF: A Reusable Critique-Revise Operator
Claim: the core of CAI isn't "training," it's a critique→revise operator — you can use it at the inference layer today, with zero RL.
Background & Principle
Constitutional AI (Bai 2022) replaces "humans labeling which answer is better" with "the model critiques and revises itself against a set of written principles": produce a draft, then list violations against the constitution, then revise accordingly — the SL stage trains on these revised samples; the RL stage (RLAIF) uses AI rather than humans to label preference pairs. For engineers, the valuable insight is: critique-revise is a reusable inference operator, isomorphic to Self-Refine (Madaan 2023). Without any training, you can insert a round of self-critique into your call chain, turning "values / style / safety constraints" from imperative sentences in a prompt into an explicit, auditable check step. The essence: upgrade constraints from "hope the model remembers" to "force the model to verify item by item."
Hands-on
# Inference-time CAI: one principle = one critique-revise operator
# ask(prompt) -> str is a placeholder for your model call; task is the user request
PRINCIPLE = "Answers must give executable steps; no vague advice; flag risks explicitly"
draft = ask(f"{task}")
critique = ask("Audit the answer against this principle; list only violations (NONE if none):\n"
               f"Principle: {PRINCIPLE}\nAnswer: {draft}")
# Exact match, not substring: "NONE of the steps are executable" is a violation
if critique.strip().upper() == "NONE":
    final = draft
else:
    final = ask("Revise per the critique; output the revised answer only:\n"
                f"Answer: {draft}\nCritique: {critique}")
Chain several principles into a pipeline and you have a micro-constitution. Run it over a dataset and the revised outputs become SFT corpus — exactly the SL stage of CAI.
Failure mode: self-critique is unreliable because of backward rationalization (covered in Day 5): the model tends to justify its own draft rather than find faults, so the critique step may spin in place ("looks like criticism, actually upholds the verdict"). Two counters: (1) give critique and generation different role prompts, even different models, to create adversarial pressure; (2) make critique output a structured per-item checklist rather than free text, forcing a yes/no on each principle. An open "do you think this is good?" almost always returns yes.
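Counter (2) is mostly a parsing problem. Here is one way it might look, as a sketch: the critic must answer one line per principle, and a missing or malformed verdict counts as a failure rather than a pass. The line format, the `critic` and `reviser` callables, and every function name here are assumptions for illustration; swap in your own model calls.

```python
import re
from typing import Callable, Dict, List

PRINCIPLES = [
    "Gives executable steps",
    "Contains no vague advice",
    "Flags risks explicitly",
]

# Hypothetical reply format: "<number>: PASS" or "<number>: FAIL - <reason>"
VERDICT = re.compile(r"^\s*(\d+)\s*:\s*(PASS|FAIL)\b\s*-?\s*(.*)$", re.IGNORECASE)


def build_checklist_prompt(answer: str, principles: List[str]) -> str:
    rules = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    return (
        "Audit the answer against each numbered principle.\n"
        "Reply with exactly one line per principle, formatted as "
        "'<number>: PASS' or '<number>: FAIL - <reason>'.\n"
        f"Principles:\n{rules}\nAnswer:\n{answer}"
    )


def parse_checklist(reply: str, n_principles: int) -> Dict[int, str]:
    """Map principle number -> failure reason. No verdict counts as a failure."""
    verdicts = {}
    for line in reply.splitlines():
        m = VERDICT.match(line)
        if m and 1 <= int(m.group(1)) <= n_principles:
            verdicts[int(m.group(1))] = (m.group(2).upper(), m.group(3).strip())
    failures = {}
    for i in range(1, n_principles + 1):
        status, reason = verdicts.get(i, ("FAIL", "no verdict returned"))
        if status == "FAIL":
            failures[i] = reason or "no reason given"
    return failures


def critique_revise(
    draft: str,
    critic: Callable[[str], str],
    reviser: Callable[[str], str],
    principles: List[str] = PRINCIPLES,
) -> str:
    """Return the draft unchanged if every principle passes, else one revision."""
    reply = critic(build_checklist_prompt(draft, principles))
    failures = parse_checklist(reply, len(principles))
    if not failures:
        return draft
    notes = "\n".join(f"- {principles[i - 1]}: {why}" for i, why in sorted(failures.items()))
    return reviser(
        "Revise the answer to fix these violations; output the revised answer only.\n"
        f"Answer: {draft}\nViolations:\n{notes}"
    )
```

Passing `critic` and `reviser` as separate callables makes it trivial to point them at different role prompts or different model families, which is counter (1).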
// 03
Verifiable Self-Improvement: Rejection Sampling / STaR
Claim: whether a model can get stronger on its own hinges on one thing — do you have a verifier cheaper and more reliable than the generator.
Background & Principle
Genuine bootstrapping self-improvement has a minimal formula: over-sample → use a verifier to keep the correct ones → fine-tune only on the correct → repeat. Rejection Sampling Fine-tuning is the one-round version; STaR (Zelikman 2022) is the iterative version — plus a rationalization trick: for problems answered wrong, feed back the correct answer and have the model produce a rationale that reaches it, then add to the training set. The key is not sampling but the existence and asymmetry of the verifier: math has ground-truth answers, code runs unit tests, SQL compares results — here "judging correctness" is far cheaper than "generating a solution," so the loop holds and keeps climbing. Tasks without that asymmetry (open writing, subjective judgment) make the loop amplify noise instead of signal. The training of o1 / R1-class reasoning models is essentially this loop industrialized.
# One round of rejection sampling for code: verifier = run unit tests
# solve, run_tests, dedup, and tasks are placeholders you supply
def harvest(q):
    cands = [solve(q, temperature=1.0) for _ in range(16)]  # over-sample
    good = [c for c in cands if run_tests(c, q.tests)]      # verifier
    return dedup(good)[:2]  # keep ≤2 per problem, avoid skew

sft_set = [(q, sol) for q in tasks for sol in harvest(q)]
# failed problems → rationalize: give gold answer, have model supply reasoning, re-verify
Failure mode: when the verifier has holes, the loop learns to exploit them (reward hacking) — incomplete test coverage means the model produces "passes the tests but is actually wrong" solutions and trains more of them. Second trap: keeping only passing samples narrows the distribution (all the problems the model already handles), so hard ones are never learned. Counter: include adversarial cases in tests and sample stratified by difficulty so easy problems don't drown the set.
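"Sample stratified by difficulty" needs a difficulty estimate, and the loop already produces one for free: the pass rate across the over-sampled candidates. A rough sketch under that assumption; the bucket cut-offs (0.25 and 0.75), the `Harvest` record, and the per-bucket cap are illustrative choices, not values from any paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Harvest:
    problem_id: str
    n_sampled: int        # how many candidates were drawn
    n_passed: int         # how many passed the verifier (before dedup)
    solutions: List[str]  # verified, deduplicated solutions


def bucket(h: Harvest) -> str:
    """Pass rate as a difficulty proxy (assumed cut-offs)."""
    rate = h.n_passed / h.n_sampled
    if rate >= 0.75:
        return "easy"
    if rate >= 0.25:
        return "medium"
    return "hard"


def stratified_sft_set(
    harvests: List[Harvest], per_bucket: int = 100, per_problem: int = 2
) -> List[Tuple[str, str]]:
    """Cap the number of problems per difficulty bucket so easy ones cannot drown the set."""
    by_bucket: Dict[str, List[Harvest]] = defaultdict(list)
    for h in harvests:
        if h.n_sampled > 0 and h.solutions:  # unsolved problems go to rationalization instead
            by_bucket[bucket(h)].append(h)
    out: List[Tuple[str, str]] = []
    for name in ("hard", "medium", "easy"):
        for h in by_bucket[name][:per_bucket]:
            out.extend((h.problem_id, sol) for sol in h.solutions[:per_problem])
    return out
```

Problems with zero passing samples never enter this set; they are the candidates for the rationalization step.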
// 04
Model Collapse: Recursively Eating Your Own Output Quietly Collapses
Claim: the danger of synthetic data isn't getting dumber — it's the distribution's tail silently vanishing while benchmarks still rise and diversity is already dead.
Background & Principle
Shumailov et al. (The Curse of Recursion 2023; Nature 2024) showed that when each generation of models is trained mainly on the previous generation's output, recursively, the distribution's tail disappears generation by generation: rare-but-real patterns go first, and the model eventually converges to a bland high-frequency center. The scary part is its stealth: benchmark scores on common tasks may still rise while long-tail coverage, stylistic diversity, and rare knowledge have collapsed; by the time you notice, the loss is often irreversible. Distinguish two cases: §3's verifier-filtered self-improvement is NOT collapse (signal preserved, noise deleted); collapse comes from indiscriminately feeding generated data back as if it were real data, recursively. Frontier models use vast synthetic data and still improve precisely via three gates: real-data anchoring + verification filtering + accumulate, not replace.
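A toy case gives some intuition for why the spread shrinks. This is an illustration under stated assumptions, not the paper's full analysis: take a one-dimensional Gaussian, fit it by maximum likelihood to n samples, draw n fresh samples from the fit, discard the old data, and repeat. The maximum-likelihood variance estimate has expectation (n-1)/n times the true variance, so the expected variance decays geometrically. The function name is made up for this sketch.

```python
def expected_variance(generations: int, n_samples: int, var0: float = 1.0) -> float:
    """Expected variance after `generations` rounds of fit-then-resample with full
    replacement, for a 1-D Gaussian fitted by maximum likelihood (toy assumption).
    Each round multiplies the expected variance by (n - 1) / n."""
    if n_samples < 2:
        raise ValueError("need at least 2 samples per generation")
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return var0 * ((n_samples - 1) / n_samples) ** generations


if __name__ == "__main__":
    # Smaller per-generation samples shrink the spread faster
    for n in (10, 100, 1000):
        print(n, [round(expected_variance(t, n), 3) for t in (0, 10, 100)])
```

The shrink per round is tiny when n is large, which is the stealth part: no single generation looks broken.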
Hands-on
# Three gates against collapse
# 1) Anchor: real data stays dominant, synthetic is only incremental
mix = 0.7 * real + 0.3 * synthetic  # mixing ratio, in pseudocode
# 2) Accumulate, don't replace: keep every generation of real corpus
corpus = real_v1 + real_v2 + synthetic_filtered  # append, not replace
# 3) Monitor distribution, not just accuracy: diversity can drop while accuracy doesn't
assert distinct_n(outputs, n=3) > THRESH  # n-gram diversity
assert tail_coverage(outputs) > THRESH    # rare-category recall
"Accumulate, don't replace" is the key result of Gerstgrasser et al. 2024: as long as each round's synthetic data is added to the existing corpus, original real data included, rather than replacing it, collapse can be avoided. Change your eval metrics too: don't only watch accuracy; watch diversity and long-tail coverage.
Failure mode: monitoring synthetic-data training with a single metric (accuracy / one benchmark) is a buried landmine — it's nearly insensitive to collapse. By the time users start complaining that "answers are getting samey / formulaic," the distribution has usually already collapsed and is hard to reverse.
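Both monitors named in the Hands-on block are a few lines each. A minimal sketch, assuming whitespace tokens for `distinct_n` and assuming you already have a category label per output for `tail_coverage` (from a classifier or tagging step of your own; that labeling step and the two-argument signature are this sketch's assumptions, not part of the snippet above).

```python
from typing import Iterable, List, Set


def distinct_n(outputs: List[str], n: int = 3) -> float:
    """Unique n-grams divided by total n-grams across all outputs.
    Returns 0.0 when no output is long enough to form an n-gram."""
    grams = []
    for text in outputs:
        toks = text.split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0


def tail_coverage(labels: Iterable[str], rare_categories: Set[str]) -> float:
    """Fraction of known rare categories that show up at least once in the outputs."""
    if not rare_categories:
        return 1.0
    seen = set(labels)
    return len(rare_categories & seen) / len(rare_categories)
```

Track both per training round and compare against the round-zero value on the same prompts; the trend is more informative than any absolute threshold.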
// CAPSTONE · Build a Self-Improving Data Flywheel for Your Private Task
Chain the four points into a weekend project: for a narrow task you actually use (e.g. "turn meeting notes into structured to-dos"), build a small self-augmenting dataset and touch every gate by hand.
Distill (§1): over-generate 200 "input→output" pairs with Opus, filter aggressively (dedup + structure check + judge≥4), aim to keep only ~30 gems.
Critique operator (§2): write 2-3 constitution rules (e.g. "every to-do must have an owner and a due date"), run critique-revise on each sample, using a heterogeneous judge to avoid self-endorsement.
Verifier (§3): define a programmatic check for this task — can the output be parsed by a JSON schema, are fields complete. The part you can write a verifier for is the part fit for self-improvement; the part you can't, honestly leave to human review.
Anti-collapse (§4): when expanding in round two, real samples (your hand-edits) always stay ≥60%, and add rather than replace; monitor distinct-3 diversity and stop if it drops.
Settle up: compare a "distill once" student vs a "distill + verifier, two iterations" student. You'll feel it firsthand — the part that can improve is exactly the part you can verify, not one bit more.
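For the meeting-notes example, the programmatic check from the Verifier step could look something like this. It is a sketch under assumptions: the output is a JSON array of objects, and the field names `task`, `owner`, and `due` (ISO date) are hypothetical, chosen to match the sample rule "every to-do must have an owner and a due date".

```python
import json
from datetime import date
from typing import List

REQUIRED_FIELDS = ("task", "owner", "due")  # hypothetical schema for this sketch


def verify_todos(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list) or not data:
        return ["expected a non-empty JSON array of to-dos"]
    problems: List[str] = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"item {i}: missing or empty '{field}'")
        due = item.get("due")
        if isinstance(due, str) and due.strip():
            try:
                date.fromisoformat(due)
            except ValueError:
                problems.append(f"item {i}: 'due' is not an ISO date")
    return problems
```

Returning reasons instead of a bare boolean pays off twice: the same strings can feed the critique step, and a histogram of them tells you where the generator is weakest.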
After this, you'll meet any "self-improvement / synthetic data" narrative with one question first: where's the verifier? where's the real-data anchor? Without those two, the rest is a castle in the air, ready to collapse.
// DEEP THINKING
RLAIF labels preferences with AI — where do those preferences originate? Does it just launder the base model's bias into something that looks "aligned"?
Yes, that's a real risk. The AI's preferences are still rooted in pretraining data plus how the constitution is written; RLAIF only amplifies and freezes them into a reward signal. If the base model is systematically skewed on some value axis, RLAIF will self-consistently reinforce rather than correct it. So CAI's true lever is the textual quality and coverage of the constitution, not "AI doing the labeling" itself. In practice you need external anchors (human red-teaming, heterogeneous model audits) to puncture the self-consistent loop; otherwise you're scoring the model with the model's own bias.
STaR only works on tasks with a ground-truth verifier (math/code). Can open-ended generation (writing, strategy) self-improve?
It can, but you must reshape "verification." Open tasks have no unique answer, yet often have checkable necessary conditions: writing can verify "contains required points / violates the style spec / self-contradicts," strategy can verify "satisfies constraints / gets refuted by an adversary." Swap a single verifier for a combination of weaker checks, or use debate / an adversary to manufacture a win/lose signal, and the loop partly holds. But the weaker and more subjective the signal, the slower the climb and the easier the reward hacking — exactly why open-task self-improvement is far harder than math.
Frontier models use vast synthetic data yet keep getting stronger — does that contradict "model collapse"?
No contradiction, because they don't run the collapse experiment's setup. Collapse comes from indiscriminate recursion: feeding the prior generation's unfiltered output back as real data, replacing it. Frontier practice is the opposite — synthetic data is first filtered by a verifier / reward model (delete noise, keep signal), and real data stays anchored and accumulated rather than replaced. In other words, collapse is "a closed loop eating itself," controlled self-improvement is "a half-open loop with an external signal." What decides which way you go is precisely whether there's a verification/real anchor independent of the generator.
Can a distilled student exceed its teacher? Under what conditions?
Yes, and it's already routine. Three mechanisms: (1) best-of-N distillation — the teacher samples N times and only the best goes to the student, so the student learns the teacher's "upper bound," not its average; (2) verifier filtering — distilling only correct solutions lets the student learn a distribution better than the teacher's expectation; (3) multi-teacher / multi-source fusion. The common thread: the student learns not the teacher itself but the product of "teacher + a filtering/search process." Pure imitation (learning the teacher's average output) can't surpass the teacher — which is exactly why §1 stresses filtering as the lever.
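Mechanism (1) fits in a dozen lines. A minimal sketch, assuming you supply two callables: `sample(prompt)` draws one teacher answer and `score(prompt, answer)` is whatever verifier or judge you trust. Both are hypothetical stand-ins, as is the optional `min_score` floor.

```python
from typing import Callable, List, Optional, Tuple


def best_of_n(
    prompt: str,
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
    min_score: Optional[float] = None,
) -> Optional[str]:
    """Draw n teacher samples and return the highest-scoring one.
    Returns None if even the best sample falls below min_score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample(prompt) for _ in range(n)]
    best = max(candidates, key=lambda c: score(prompt, c))
    if min_score is not None and score(prompt, best) < min_score:
        return None
    return best


def build_distill_set(
    prompts: List[str],
    sample: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
    min_score: Optional[float] = None,
) -> List[Tuple[str, str]]:
    """One (prompt, best answer) pair per prompt; prompts with no acceptable answer are skipped."""
    pairs: List[Tuple[str, str]] = []
    for p in prompts:
        best = best_of_n(p, sample, score, n=n, min_score=min_score)
        if best is not None:
            pairs.append((p, best))
    return pairs
```

The student can only beat the teacher's average to the extent that `score` is right, so the scorer deserves at least as much care as the sampler.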
"Verification is cheaper than generation" is the bedrock of self-improvement — which important tasks fail this asymmetry?
The asymmetry vanishes when verification is as hard as, or harder than, generation: (1) long-horizon consequences — investment decisions, medical plans, where correctness takes a long time to reveal and can't be verified instantly; (2) subjective value — what counts as good literature or good design, where verification itself needs equal intelligence and lacks consensus; (3) verification requires an irreversible action — you can only run it once in the real world. In these domains self-improvement degenerates into talking to itself; you must introduce humans, time, or real-world feedback as the external verifier. Judging whether a task belongs to this class is the first call on whether to put it on a self-improvement flywheel at all.