Same architecture, same compute — feed different data and you get a model a whole tier apart. Since 2023 the consensus has sharpened: a model's capability ceiling is increasingly set by its data, not its parameter count. Today covers the four core stages that turn "raw web pages" into "a corpus that trains good models" — synthesizing, curating, deduplicating, quality filtering. This is the academic view of "what training data actually is," not a production-ETL tuning manual.
Synthetic data = using a strong model as a "data factory" to mass-produce training samples. It's the same idea as AlphaGo's self-play: instead of being fed external game records, the system generates its own experience and learns from it. The closer backend analogy: generating test fixtures with a faker — except here you're not filling tables with dummy rows, you're producing high-quality "instruction-answer" pairs to actually train the model.
Pain point: human-written high-quality data is scarce, expensive, and not diverse enough. An annotator can't write more than a few hundred good Q&A pairs a day, while a model needs millions. Self-Instruct (Wang et al. 2022) offers a "bootstrapping" mechanism: take a small set of human-written seed tasks as examples (175 in the paper), have a strong model generate many new "instruction + input + output" triples in the same style, filter out invalid and duplicate ones, then fine-tune the model on them. Fine-tuned this way, vanilla GPT-3 improved by 33 points (absolute) on Super-NaturalInstructions, roughly on par with InstructGPT-001, which relied on human annotation.
phi-1 (Gunasekar et al. 2023, "Textbooks Are All You Need") pushed it to the extreme: it combined about 6B tokens of web code filtered for "textbook quality" with about 1B tokens of synthetic textbooks and exercises generated by GPT-3.5, and with only ~7B tokens and a 1.3B-parameter small model hit 50.6% pass@1 on HumanEval. Data quality stood in for scale that would otherwise have to come from more parameters and more tokens. It's one of the strongest pieces of evidence for "data > scale."
```python
from anthropic import Anthropic

client = Anthropic()  # needs ANTHROPIC_API_KEY

# Self-Instruct core: bootstrap many new samples from a few seed tasks
seed = ["rewrite this in a formal tone", "write unit tests for this code"]

def synthesize(seed, n=5):
    prompt = (f"Given these tasks: {seed}\n"
              f"Generate {n} new 'instruction + input + ideal output' triples "
              f"similar in style but novel in content, as a JSON array.")
    resp = client.messages.create(
        model="claude-opus-4-7", max_tokens=2048,
        messages=[{"role": "user", "content": prompt}])
    return resp.content[0].text  # then JSON-parse

raw = synthesize(seed)
# Key: generated != usable. You MUST filter: drop near-seed, malformed, wrong
```
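The filter step is where most of the quality comes from, so here is a minimal sketch of it. Self-Instruct drops a new instruction when its ROUGE-L similarity to an existing one reaches 0.7; the version below uses a simplified ROUGE-L (whitespace tokens, no stemming). The JSON keys `instruction` and `output` are assumptions, since the prompt above does not pin down key names, and `filter_samples` is a hypothetical helper.

```python
import json

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(a: str, b: str) -> float:
    """Simplified ROUGE-L F1 over lowercase whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_len(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ta), lcs / len(tb)
    return 2 * precision * recall / (precision + recall)

def filter_samples(raw: str, pool: list[str], max_sim: float = 0.7) -> list[dict]:
    """Keep well-formed samples whose instruction is novel w.r.t. the pool."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []                       # malformed output: nothing usable
    if not isinstance(items, list):
        return []
    seen = list(pool)                   # seeds plus everything accepted so far
    kept = []
    for item in items:
        if not isinstance(item, dict):
            continue
        instr, out = item.get("instruction"), item.get("output")
        if not isinstance(instr, str) or not instr.strip():
            continue
        if not isinstance(out, str) or not out.strip():
            continue
        if any(rouge_l(instr, s) >= max_sim for s in seen):
            continue                    # too close to a seed or earlier sample
        kept.append(item)
        seen.append(instr)
    return kept
```

Calling `filter_samples(raw, seed)` on the model output would keep only parseable, complete, sufficiently novel triples. Checking whether an answer is actually correct needs a separate step (tests for code, a verifier model, or human spot checks) that this sketch does not cover.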
Data curation is the LLM's ETL pipeline: raw Common Crawl pages are dirty data in the "data lake," and they go through a chain of transforms — extraction, cleaning, filtering, dedup — before becoming a trainable "data warehouse." Each stage is an operator whose gain you can measure independently — the same engineering discipline as attaching a monitoring metric to every stage of a data pipeline.
Pain point: the vast majority of raw web pages are garbage: nav bars, ads, SEO farms, gibberish, templated boilerplate. Train on it directly and you get the classic garbage in, garbage out. FineWeb (Penedo et al. 2024) released a 15-trillion-token dataset together with its full curation pipeline, whose core stages are main-text extraction, language filtering, quality filtering, and deduplication.
FineWeb's real methodological contribution isn't "this flow" but that for every stage added, they trained a small model and quantified its gain on downstream benchmarks, turning "does this cleaning step help?" from a hunch into a falsifiable experiment. Decontamination is especially critical: if test sets leak into training data, benchmark scores are inflated and reward memorization rather than capability.
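```python
import trafilatura  # web main-text extraction, strips nav/ads/footers
from langdetect import detect  # stand-in language detector (pip install langdetect)

# Minimal pipeline: each stage is an independently measurable filter
# (detect_lang / symbol_ratio are illustrative helpers, not FineWeb's own code)

def detect_lang(text: str) -> str:
    return detect(text)  # language code such as "en"

def symbol_ratio(text: str) -> float:
    # share of characters that are neither alphanumeric nor whitespace
    noisy = sum(not (c.isalnum() or c.isspace()) for c in text)
    return noisy / len(text)

def curate(raw_html: str):
    text = trafilatura.extract(raw_html)  # ① extract main text
    if not text or len(text) < 200:       # ② too short: drop
        return None
    if detect_lang(text) != "en":         # ③ language filter
        return None
    if symbol_ratio(text) > 0.1:          # ④ too much symbol/noise: drop
        return None
    return text

# FineWeb methodology: per added stage, train a small model, quantify its gain
```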
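The pipeline above has no decontamination stage, so here is a minimal sketch of one common heuristic: flag a training document if it shares any word n-gram with a benchmark item. The window size `n=8` and the helper names are assumptions for illustration; real setups tune the window and the text normalization.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All contiguous word n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_benchmark_index(benchmark_texts, n: int = 8) -> set:
    """Union of n-grams over every benchmark question/answer string."""
    index = set()
    for t in benchmark_texts:
        index |= ngrams(t, n)
    return index

def is_contaminated(doc: str, index: set, n: int = 8) -> bool:
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(index)

benchmark = ["What is the capital of France? The capital of France is Paris."]
index = build_benchmark_index(benchmark)

leaked = ("Quiz answers: what is the capital of France? "
          "The capital of France is Paris. Easy.")
clean = "Paris is a popular destination for tourists visiting France in the spring."

print(is_contaminated(leaked, index))  # shares long spans with the benchmark
print(is_contaminated(clean, index))   # same topic, no shared 8-gram
```

A short window catches more leaks but also flags common phrases; a long window misses paraphrased leaks. Either way the check is cheap enough to run as one more filter stage.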
Deduplication is the same idea as the content-addressable storage / Git object dedup you know: identical content should be stored once. The hard part is "near-duplicates" — the same article reposted by 1,000 sites with a tweaked title and first paragraph. Exact hashing can't catch those; you need MinHash, which plays a role much like a Bloom filter: it uses a probabilistic signature to cheaply judge "are these two nearly identical?" without expensive token-by-token comparison.
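To make the content-addressable analogy concrete, here is a minimal sketch of exact dedup: hash each document after light normalization and keep only the first copy per hash. The normalization (lowercasing, collapsing whitespace) and the function names are assumptions; what to normalize is a design choice.

```python
import hashlib

def content_key(text: str) -> str:
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def exact_dedup(docs):
    """Keep the first document for each distinct content key."""
    seen = set()
    kept = []
    for doc in docs:
        key = content_key(doc)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    "Nice weather today.",
    "nice   weather today.",   # differs only in case and spacing: removed
    "Nice weather today!",     # one character differs: survives
]
unique = exact_dedup(docs)
```

The third document survives even though a human would call it a duplicate, which is exactly the gap near-dedup has to close.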
Pain point: web corpora are shockingly duplicated. Lee et al. 2021 ("Deduplicating Training Data Makes Language Models Better") found a single 61-word English sentence in C4 repeated over 60,000 times. Duplication causes three harms: verbatim memorization (privacy/copyright risk), wasted compute on redundant samples, and train-test leakage. The paper showed: after dedup, the chance of emitting memorized text dropped to about 1/10, while reaching equal or better accuracy in fewer steps.
The mechanism has two layers: exact dedup via hashing (a whole span must match character-for-character); near dedup via MinHash + LSH. The core intuition is Jaccard similarity: the intersection over union of two documents' word sets. Comparing all N documents pairwise is O(N²), infeasible at scale. MinHash's trick: generate a fixed-length "signature" per document such that the fraction of matching signature positions ≈ their Jaccard similarity, compressing similarity estimation into cheap signature comparison; LSH then buckets likely-similar documents so you only compare within a bucket.
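The claim that matching signature positions approximate Jaccard similarity is easy to check by hand. Below is a from-scratch sketch, not a production implementation: each signature position is the minimum of one seeded hash function over the document's word set, and two sets agree at a position with probability equal to their Jaccard similarity. The seeded BLAKE2b hash and `num_hashes=256` are assumptions chosen for readability.

```python
import hashlib

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def _hash(seed: int, token: str) -> int:
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens: set, num_hashes: int = 256) -> list[int]:
    """Per hash function, the minimum hash over the (non-empty) token set."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions where the two minima coincide."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("nice weather today lets take a walk".split())
b = set("nice weather today lets go for a stroll".split())

exact = jaccard(a, b)  # 5 shared words out of 10 distinct words
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.2f} estimate={approx:.2f}")
```

The estimate's error shrinks roughly with the square root of the signature length, so the signature size is a direct trade-off between storage and precision.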
But less duplication isn't strictly better. Muennighoff et al. 2023 ("Scaling Data-Constrained Language Models") found that under data constraints, intentional repetition (running a few more epochs) up to about 4 epochs gives loss nearly indistinguishable from fresh data, decaying quickly only after that. So dedup should remove accidental web redundancy, not ban all repetition — deliberate epoch repetition within budget is perfectly healthy.
```python
from datasketch import MinHash, MinHashLSH  # pip install datasketch

# Near dedup: estimate Jaccard via MinHash, avoid O(n^2) pairwise compare
def sig(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for tok in set(text.split()):  # signature over the word set
        m.update(tok.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # sim>0.8 = duplicate
docs = {"d1": "nice weather today lets take a walk",
        "d2": "nice weather today lets go for a stroll"}

for name, text in docs.items():
    s = sig(text)
    if lsh.query(s):  # near-duplicate exists -> skip
        continue
    lsh.insert(name, s)  # otherwise index it
```
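Why does bucketing behave like a threshold? In the standard banding scheme, a signature is cut into `bands` groups of `rows` positions, and two documents become candidates if any whole band matches. For Jaccard similarity s that happens with probability 1 − (1 − s^rows)^bands, an S-shaped curve. The sketch below just evaluates that formula; the 16 × 8 split is an assumption for illustration, and datasketch derives its own band layout from the `threshold` argument.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """P(two docs share at least one LSH bucket) at Jaccard similarity s."""
    return 1.0 - (1.0 - s ** rows) ** bands

bands, rows = 16, 8  # 16 * 8 = 128 signature positions (illustrative split)

for s in (0.3, 0.5, 0.7, 0.8, 0.9):
    p = candidate_probability(s, bands, rows)
    print(f"jaccard={s:.1f} -> P(candidate)={p:.3f}")

# The curve is steepest near (1 / bands) ** (1 / rows)
approx_threshold = (1 / bands) ** (1 / rows)
print(f"approximate threshold: {approx_threshold:.3f}")
```

More rows per band pushes the threshold up (fewer false candidates); more bands pushes it down (fewer missed duplicates). Pairs that land in the same bucket are usually re-checked with the full signature before anything is dropped.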
Quality filtering = putting a spam filter / content-scoring gate in front of the training corpus. It's isomorphic to the anti-spam systems you know: first use heuristic rules (keywords, ratio thresholds) to block obvious junk, then use a trained classifier to score the gray zone and let things through by threshold. The only difference is you're filtering not emails, but whether to feed a span of text to the model.
Pain point: even after dedup there's plenty of low-value text: grammatically broken SEO farms, pure link lists, meaningless machine-generated content. These aren't duplicates, but they contribute little to learning and can even hurt. Quality filtering works at two tiers: cheap heuristic rules first, then a trained classifier that scores whatever the rules let through.
Key engineering intuition: using an LLM to score is too expensive; you can't run it over 15T tokens. So the recipe is "LLM as teacher labels a small sample → train a lightweight fastText/embedding classifier → run that over everything," at milliseconds per item. This is the same idea as phi-1's "textbook quality" filtering: the definition of good data gets frozen into a reproducible scorer.
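The rule-based tier needs no model at all. Here is a minimal sketch of what such rules can look like; every threshold below is an illustrative assumption rather than a value taken from FineWeb or any other published pipeline, and `heuristic_ok` is a hypothetical helper.

```python
def heuristic_ok(text: str) -> bool:
    """Cheap rule-based gate: True means 'pass on to the classifier tier'."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    if len(words) < 20:                        # too little content to judge
        return False

    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:                # gibberish or token soup
        return False

    alpha_frac = sum(any(c.isalpha() for c in w) for w in words) / len(words)
    if alpha_frac < 0.8:                       # mostly numbers or symbols
        return False

    link_lines = sum(ln.startswith(("http://", "https://", "www."))
                     for ln in lines)
    if link_lines / len(lines) > 0.5:          # page is mostly a link list
        return False

    return True
```

Rules like these are blunt but cost almost nothing, so they run first and shrink what the classifier has to score.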
```python
import re

from anthropic import Anthropic

client = Anthropic()  # needs ANTHROPIC_API_KEY

# FineWeb-Edu idea: LLM as teacher scores "educational value" (few samples only)
def edu_score(text: str) -> int:
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=8,
        messages=[{"role": "user", "content":
                   f"Rate this text's educational/knowledge value 0-5, reply digit only:\n{text[:1000]}"}])
    reply = resp.content[0].text
    match = re.search(r"[0-5]", reply)  # tolerate whitespace or stray words
    if match is None:
        raise ValueError(f"no 0-5 score in reply: {reply!r}")
    return int(match.group())

# Keep score >=3. These LLM labels train a lightweight fastText classifier;
# filtering the massive corpus runs THAT classifier (ms each), not the LLM
```
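The "train a lightweight classifier on the teacher's labels" step can be sketched without any external dependency. The block below swaps a tiny pure-Python Naive Bayes in for fastText so that it runs anywhere; it is a stand-in for the idea, not the classifier FineWeb-Edu used. The six labeled examples are made up for illustration, with scores as they might come back from `edu_score`.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Bag-of-words multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best, best_logp = None, -math.inf
        for c in self.classes:
            logp = math.log(self.doc_counts[c] / total_docs)   # class prior
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for w in text.lower().split():
                if w in self.vocab:                            # skip unseen words
                    logp += math.log((self.word_counts[c][w] + 1) / denom)
            if logp > best_logp:
                best, best_logp = c, logp
        return best

# Hypothetical teacher output: (text, LLM score 0-5)
labeled = [
    ("photosynthesis converts light energy into chemical energy in plants", 4),
    ("the derivative measures the rate of change of a function", 5),
    ("a hash table maps keys to values using a hash function", 4),
    ("click here to win a free prize now", 0),
    ("buy cheap watches best price click now", 0),
    ("free free free limited offer click here", 1),
]
texts = [t for t, _ in labeled]
keep = [score >= 3 for _, score in labeled]   # same cutoff as above

student = TinyNaiveBayes().fit(texts, keep)
print(student.predict("the function maps energy into a rate of change"))
print(student.predict("click now for a free prize"))
```

In practice the labeled set would be far larger and the student would be fastText or a small model on top of embeddings, but the shape is the same: the expensive teacher runs once on a sample, and the cheap student runs on everything.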