AI/ML Explained: Data Engineering

Day 29 · 2026-06-15 · Difficulty ★★★☆☆
For: engineers with coding experience, non-AI background

Same architecture, same compute — feed different data and you get a model a whole tier apart. Since 2023 the consensus has sharpened: a model's capability ceiling is increasingly set by its data, not its parameter count. Today covers the four core stages that turn "raw web pages" into "a corpus that trains good models" — synthesizing, curating, deduplicating, quality filtering. This is the academic view of "what training data actually is," not a production-ETL tuning manual.

Synthetic Data

data generation · bootstrapping
One-line analogy

Synthetic data = using a strong model as a "data factory" to mass-produce training samples. It's the same idea as AlphaGo's self-play: instead of being fed external game records, the system generates its own experience and learns from it. The closer backend analogy: generating test fixtures with a faker — except here you're not filling tables with dummy rows, you're producing high-quality "instruction-answer" pairs to actually train the model.

Problem it solves + how it works

Pain point: human-written high-quality data is scarce, expensive, and not diverse enough. An annotator can write at most a few hundred good Q&A pairs a day, while a model needs millions. Self-Instruct (Wang et al. 2022) offers a "bootstrapping" mechanism: take a small pool of human-written seed tasks (175 in the paper) as examples, have a language model (in the paper, GPT-3 itself) generate many new "instruction + input + output" triples in the same style, filter out invalid and near-duplicate ones, then fine-tune the model on what survives. With this, vanilla GPT-3 improved by 33 points (absolute) on Super-NaturalInstructions, landing on par with InstructGPT-001, which was trained with human annotations.

phi-1 (Gunasekar et al. 2023, "Textbooks Are All You Need") pushed it to the extreme: it paired web code filtered for "textbook quality" with textbooks and exercises generated by GPT-3.5, and with only ~7B tokens and a 1.3B-parameter small model hit 50.6% pass@1 on HumanEval. Data quality converted directly into effective parameter count. It's one of the strongest pieces of evidence for "data > scale."

Code example
from anthropic import Anthropic
client = Anthropic()  # needs ANTHROPIC_API_KEY

# Self-Instruct core: bootstrap many new samples from a few seed tasks
seed = ["rewrite this in a formal tone", "write unit tests for this code"]

def synthesize(seed, n=5):
    prompt = (f"Given these tasks: {seed}\n"
              f"Generate {n} new 'instruction + input + ideal output' triples "
              f"similar in style but novel in content, as a JSON array.")
    resp = client.messages.create(
        model="claude-opus-4-7", max_tokens=2048,
        messages=[{"role": "user", "content": prompt}])
    return resp.content[0].text  # then JSON-parse

raw = synthesize(seed)
# Key: generated != usable. You MUST filter — drop near-seed, malformed, wrong
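The filtering step that the last comment insists on can be sketched as follows. This is a minimal sketch, not the paper's implementation: Self-Instruct filters on ROUGE-L overlap with existing instructions, whereas this version swaps in the standard library's `difflib.SequenceMatcher` as a rough stand-in. The function name `filter_samples`, the `instruction`/`output` keys, and the 0.7 cutoff applied to this particular similarity measure are all assumptions for illustration.

```python
import json
from difflib import SequenceMatcher

def filter_samples(raw_json: str, seeds: list[str], max_sim: float = 0.7) -> list[dict]:
    """Keep well-formed, novel samples from a model's JSON-array reply."""
    try:
        candidates = json.loads(raw_json)
    except json.JSONDecodeError:
        return []                          # malformed reply: discard the batch
    if not isinstance(candidates, list):
        return []
    kept: list[dict] = []
    seen = [s.lower() for s in seeds]      # instructions we already have
    for item in candidates:
        if not isinstance(item, dict):
            continue
        instr = str(item.get("instruction") or "").strip()
        output = str(item.get("output") or "").strip()
        if not instr or not output:        # a required field is missing
            continue
        # Drop anything too close to a seed or to an already-kept sample.
        if any(SequenceMatcher(None, instr.lower(), s).ratio() >= max_sim
               for s in seen):
            continue
        kept.append(item)
        seen.append(instr.lower())
    return kept

# Hypothetical model reply: one near-seed copy, one novel task, one malformed item.
reply = json.dumps([
    {"instruction": "Rewrite this in a formal tone", "input": "hey", "output": "Hello."},
    {"instruction": "Summarize the changelog in two sentences",
     "input": "v2 notes", "output": "Two-sentence summary."},
    {"instruction": "Translate the error message", "input": "x"},
])
seeds = ["rewrite this in a formal tone", "write unit tests for this code"]
kept = filter_samples(reply, seeds)
print([s["instruction"] for s in kept])
```

Checking each candidate against already-kept samples as well as the seeds is what stops a batch from filling up with paraphrases of itself. Checking that an answer is actually correct is a separate and harder gate that this sketch does not attempt.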
Common misconception + your scenario
"Synthetic data is a free lunch — generate infinitely" — wrong. Training a model on model-generated data risks model collapse: recursively training on synthetic data narrows the distribution and loses tail detail; after a few generations it degrades into "average mush." Synthetic data also inherits the teacher's biases and errors — students can't learn what the teacher doesn't know. So synthetic data is almost always mixed with real data and gated by filtering.
📌 Super-individual scenario: when making practice problems for your kid, don't hand-write each one. Take 3-5 "seed problems" you approve of, have the model bootstrap a batch on the same concepts and difficulty, then you act as the filter and keep the good ones. That's a miniature Self-Instruct — humans set the bar, the model scales output.
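The tail loss behind model collapse can be illustrated with a toy that involves no language model at all. The sketch below assumes a "model" that is nothing more than the empirical token frequencies of its training corpus, and each generation trains only on samples from the previous one. The vocabulary size, corpus size, and Zipf-like weights are arbitrary choices for illustration.

```python
import random
from collections import Counter

def next_generation(corpus: list[str], size: int, rng: random.Random) -> list[str]:
    """'Train' by counting token frequencies, then sample a new corpus from them."""
    counts = Counter(corpus)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=size)

rng = random.Random(0)

# Generation 0: "human" data with a long tail (token i is ~1/(i+1) as likely).
vocab = [f"w{i}" for i in range(200)]
corpus = rng.choices(vocab, weights=[1 / (i + 1) for i in range(200)], k=2000)

support = [len(set(corpus))]               # distinct tokens per generation
for _ in range(10):
    corpus = next_generation(corpus, size=2000, rng=rng)
    support.append(len(set(corpus)))

print(support)
```

A token that fails to be sampled in one generation has zero probability in every later one, so the number of distinct tokens can only stay flat or shrink. Mixing a share of generation-0 data back into every round is the toy equivalent of anchoring on real data.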
Takeaway + question
💡 Synthetic data converts "data scarcity" into "how to generate + how to filter" — the value lives entirely in the filtering, not the generation.
🤔 If a model's training data is increasingly produced by the previous generation of models, which way will the internet's whole corpus drift over the long run?

Data Curation

pipeline · ETL
One-line analogy

Data curation is the LLM's ETL pipeline: raw Common Crawl pages are dirty data in the "data lake," and they go through a chain of transforms — extraction, cleaning, filtering, dedup — before becoming a trainable "data warehouse." Each stage is an operator whose gain you can measure independently — the same engineering discipline as attaching a monitoring metric to every stage of a data pipeline.

Problem it solves + how it works

Pain point: the vast majority of raw web pages are garbage (nav bars, ads, SEO farms, gibberish, templated boilerplate). Train on it directly and you get the classic garbage in, garbage out. FineWeb (Penedo et al. 2024) released a 15-trillion-token corpus together with the full curation pipeline that produced it, with these core stages:

Web corpus curation pipeline (each step drops a lot)

raw Common Crawl
↓ ① text extraction (HTML → text, strip nav/ads/footers)
↓ ② language ID (keep target language only)
↓ ③ quality filtering (heuristics + classifier, see concept 4)
↓ ④ deduplication (remove near-duplicates, see concept 3)
↓ ⑤ decontamination (remove leaked benchmark test sets)

clean training corpus (may be only a few % of the raw volume)

FineWeb's real methodological contribution isn't "this flow" but that for every stage added, they trained a small model and quantified its gain on downstream benchmarks — turning "does this cleaning step help?" from a hunch into a falsifiable experiment. Decontamination is especially critical: if test sets leak into training data, the eval scores are inflated cheating scores.
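Decontamination is commonly done by n-gram overlap: index every n-gram that appears in the benchmark's test items, then drop any training document that shares one. The sketch below is a minimal version under stated assumptions. The function names are hypothetical, whitespace tokenization is a simplification, and the window size `n=8` is an arbitrary choice (published pipelines use various sizes, for example 13-grams in the GPT-3 paper).

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams of a text, lowercased; empty if the text is shorter than n."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(test_items: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Union of the n-grams of every benchmark test item."""
    index: set[tuple[str, ...]] = set()
    for item in test_items:
        index |= ngrams(item, n)
    return index

def is_contaminated(doc: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(doc, n).isdisjoint(index)

# Hypothetical benchmark question and two candidate training documents.
benchmark = ["What is the capital of France and which river flows through it"]
index = build_benchmark_index(benchmark)

leaked = ("Quiz answers: what is the capital of France and which river "
          "flows through it? Paris, on the Seine.")
clean = "Paris is a large city in Europe known for its museums and its cafes."

print(is_contaminated(leaked, index), is_contaminated(clean, index))
```

Exact n-gram matching misses paraphrased or translated leaks, which is one reason imperfect decontamination leaves eval scores hard to interpret.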

Code example
import trafilatura  # web main-text extraction, strips nav/ads/footers

# Minimal pipeline: each stage is an independently measurable filter
# (detect_lang / symbol_ratio shown as illustrative helpers)
def curate(raw_html: str):
    text = trafilatura.extract(raw_html)      # ① extract main text
    if not text or len(text) < 200:           # ② too short: drop
        return None
    if detect_lang(text) != "en":             # ③ language filter
        return None
    if symbol_ratio(text) > 0.1:             # ④ too much symbol/noise: drop
        return None
    return text
# FineWeb methodology: per added stage, train a small model, quantify its gain
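The example above leaves `detect_lang` and `symbol_ratio` undefined. One way to fill them in with only the standard library is sketched below. Both bodies are assumptions for illustration: the stopword list and the 0.1 hit rate are arbitrary, and production pipelines typically use a trained language-ID model instead of a stopword check.

```python
import string

# Tiny, hand-picked English stopword list (an assumption, not a standard set).
_EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"}

def detect_lang(text: str) -> str:
    """Crude stand-in: report 'en' if enough tokens are common English stopwords."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    if not words:
        return "unknown"
    hits = sum(w in _EN_STOPWORDS for w in words)
    return "en" if hits / len(words) >= 0.1 else "unknown"

def symbol_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are neither letters nor digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0                      # empty text counts as pure noise
    return sum(not c.isalnum() for c in chars) / len(chars)

english = "The cat sat on the mat and it looked at the door of the house."
german = "Der schnelle braune Fuchs springt über den faulen Hund"
print(detect_lang(english), detect_lang(german), symbol_ratio("a#b$"))
```

Because each helper is a pure function of the text, each can be switched on or off independently, which is what makes the per-stage measurement described above possible.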
Common misconception + your scenario
"More data is always better, more never hurts": wrong. Post-Chinchilla, the working consensus became that at fixed compute, quality and mixture matter more than raw volume; dirty data not only fails to help, it actively drags the model down and contaminates evals. Curation is fundamentally subtraction: cutting the raw crawl down to a few percent of its volume. What remains is the gold.
📌 Super-individual scenario: before building a personal knowledge base (to feed RAG), apply the "curation pipeline" mindset — extract main text, strip web noise, drop duplicate clippings, bucket by topic. The cleaner the retrieval corpus you give the AI, the better the recall. Dirty in, dirty out holds for personal knowledge bases too.
Takeaway + question
💡 The core discipline of curation is "every step must quantify its gain" — turning data cleaning from craft into falsifiable experiment.
🤔 Decontamination removes test questions from the training set. If it's done imperfectly, does a high eval score prove the model can reason, or just that it memorized the answers?

Deduplication

MinHash · similarity
One-line analogy

Deduplication is the same idea as the content-addressable storage / Git object dedup you know: identical content should be stored once. The hard part is "near-duplicates" — the same article reposted by 1,000 sites with a tweaked title and first paragraph. Exact hashing can't catch those; you need MinHash, which plays a role much like a Bloom filter: it uses a probabilistic signature to cheaply judge "are these two nearly identical?" without expensive token-by-token comparison.

Problem it solves + how it works

Pain point: web corpora are shockingly duplicated. Lee et al. 2021 ("Deduplicating Training Data Makes Language Models Better") found a single 61-word English sentence in C4 repeated over 60,000 times. Duplication causes three harms: verbatim memorization (privacy/copyright risk), wasted compute on redundant samples, and train-test leakage. The paper showed: after dedup, the chance of emitting memorized text dropped to about 1/10, while reaching equal or better accuracy in fewer steps.

The mechanism has two layers: exact dedup via hashing (a whole span must match character-for-character); near dedup via MinHash + LSH. The core intuition is Jaccard similarity, the intersection over union of two documents' word sets (in practice, usually their sets of word n-grams). Comparing all n documents pairwise is O(n²), infeasible at scale. MinHash's trick: generate a fixed-length "signature" per document such that the fraction of matching signature positions ≈ their Jaccard similarity, compressing similarity estimation into cheap signature comparison; LSH then buckets likely-similar documents so you only compare within a bucket.
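The claim that matching signature positions approximate Jaccard similarity can be checked by hand. The sketch below builds MinHash from scratch under one simplifying assumption: instead of true random permutations, hash function number i is simulated by salting SHA-1 with the index i. It is meant to show the idea, not to be fast, and it assumes non-empty token sets.

```python
import hashlib

def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def minhash_signature(tokens: set[str], num_perm: int = 256) -> list[int]:
    """Position i holds the minimum of (salted) hash function i over the token set."""
    signature = []
    for i in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{i}:{tok}".encode("utf8")).digest()[:8], "big")
            for tok in tokens))
    return signature

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog near the river bank".split())
b = set("the quick brown fox leaps over the lazy dog near the river bank".split())

exact = jaccard(a, b)   # 10 shared words out of 12 distinct
estimate = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(round(exact, 3), round(estimate, 3))
```

For any one hash function, the two sets share a minimum exactly when the smallest-hashing element of the union lies in the intersection, which happens with probability equal to the Jaccard similarity. Averaging over more positions tightens the estimate at the cost of a longer signature.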

But less duplication isn't strictly better. Muennighoff et al. 2023 ("Scaling Data-Constrained Language Models") found that under data constraints, intentional repetition (running a few more epochs) up to about 4 epochs gives loss nearly indistinguishable from fresh data, decaying quickly only after that. So dedup should remove accidental web redundancy, not ban all repetition — deliberate epoch repetition within budget is perfectly healthy.

Code example
from datasketch import MinHash, MinHashLSH  # pip install datasketch

# Near dedup: estimate Jaccard via MinHash, avoid O(n^2) pairwise compare
def sig(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for tok in set(text.split()):     # signature over the word set
        m.update(tok.encode("utf8"))
    return m

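lsh = MinHashLSH(threshold=0.8, num_perm=128)  # sim>0.8 = duplicate
# Note: d1 and d2 share 5 of 10 distinct words (Jaccard 0.5), below the 0.8
# threshold, so d2 is very unlikely to be flagged and both get indexed.
# Lower the threshold or use closer texts to see a document skipped.
docs = {"d1": "nice weather today lets take a walk",
        "d2": "nice weather today lets go for a stroll"}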
for name, text in docs.items():
    s = sig(text)
    if lsh.query(s):                  # near-duplicate exists -> skip
        continue
    lsh.insert(name, s)               # otherwise index it
Common misconception + your scenario
"All duplicates must be purged, none left": wrong. Treating every form of repetition as redundancy throws away the legitimate tool of "reusing data under constraints" (Muennighoff: near-lossless up to about 4 epochs). Distinguish accidental redundancy (reposts, boilerplate templates) from intentional repetition (controlled multi-epoch training): remove the former, keep the latter.
📌 Super-individual scenario: near-dedup your material before loading it into RAG / a notes base. If you stored the web, PDF, and clipped versions of the same article, they'll crowd the top-k and dilute diversity at retrieval time. One MinHash dedup pass and results get clean instantly.
Takeaway + question
💡 The essence of dedup is using probabilistic signatures to compress O(n²) similarity comparison to linear — plus telling "redundancy to remove" from "repetition to keep."
🤔 How do you behaviorally distinguish a model "reciting training data verbatim" from "having learned a generalizable rule"? Why does dedup reduce memorization and improve generalization at the same time?

Quality Filtering

classifier · signal-to-noise
One-line analogy

Quality filtering = putting a spam filter / content-scoring gate in front of the training corpus. It's isomorphic to the anti-spam systems you know: first use heuristic rules (keywords, ratio thresholds) to block obvious junk, then use a trained classifier to score the gray zone and let things through by threshold. The only difference is you're filtering not emails, but whether to feed a span of text to the model.

Problem it solves + how it works

Pain point: even after dedup there's plenty of low-value text — grammatically broken SEO farms, pure link lists, meaningless machine-generated content. These aren't duplicates, but they contribute little to learning and can even hurt. Quality filtering works at two tiers:

  • Heuristic rules (e.g. the Gopher set): document word count, mean word length, symbol-to-word ratio, fraction of bullet lines, whether it has enough stopwords — cheap, interpretable, a coarse first pass;
  • Classifier filtering: train a lightweight classifier to judge "does this text look like high-quality reference material (Wikipedia, textbooks)?" FineWeb-Edu used an LLM to score samples' "educational value," then trained a classifier to reproduce that score, distilling out 1.3T tokens of "educational-grade" corpus — yielding clear gains on knowledge/reasoning benchmarks like MMLU and ARC.
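The heuristic tier can be sketched as a single pass/fail function. The thresholds below are approximations of the Gopher rules as commonly cited, so treat every number and the stopword list as assumptions to tune, not as a faithful reproduction. The function name `passes_heuristics` is hypothetical.

```python
# Stopword list and thresholds approximate the Gopher rules (assumed values).
GOPHER_STOPWORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_heuristics(text: str) -> bool:
    """Cheap first-pass filter: True if the document clears every rule."""
    words = text.split()
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not (50 <= len(words) <= 100_000):            # document word count
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_len <= 10):                    # mean word length
        return False
    symbols = text.count("#") + text.count("...")
    if symbols / len(words) > 0.1:                   # symbol-to-word ratio
        return False
    bullets = sum(ln.startswith(("-", "*", "•")) for ln in lines)
    if bullets / len(lines) > 0.9:                   # fraction of bullet lines
        return False
    present = {w.lower().strip(".,;:!?") for w in words}
    if len(GOPHER_STOPWORDS & present) < 2:          # enough stopwords
        return False
    return True

prose = "The model learns to predict the next word and that is all of it. " * 10
link_list = "\n".join(f"- http://example.com/page{i}" for i in range(60))
print(passes_heuristics(prose), passes_heuristics(link_list))
```

Each rule is interpretable on its own, so a rejected document can be traced to the exact rule that rejected it. That is the main advantage this tier has over the classifier tier.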

Key engineering intuition: using an LLM to score is too expensive — you can't run it over 15T tokens. So the recipe is "LLM as teacher labels a few samples → train a lightweight fastText/embedding classifier → run that over everything," milliseconds per item. This is the same idea as phi-1's "textbook quality" filtering — the definition of good data gets frozen into a reproducible scorer.

Code example
import re
from anthropic import Anthropic
client = Anthropic()  # needs ANTHROPIC_API_KEY

# FineWeb-Edu idea: LLM as teacher scores "educational value" (few samples only)
def edu_score(text: str) -> int:
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001", max_tokens=8,
        messages=[{"role": "user", "content":
            f"Rate this text's educational/knowledge value 0-5, reply digit only:\n{text[:1000]}"}])
    # The reply may carry stray words or whitespace: take the first digit 0-5
    match = re.search(r"[0-5]", resp.content[0].text)
    return int(match.group()) if match else 0  # unparseable reply -> lowest score

# Keep score >=3. These LLM labels train a lightweight fastText classifier;
# filtering the massive corpus runs THAT classifier (ms each), not the LLM
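The second half of the recipe, training a cheap student on the teacher's labels, can be sketched without any ML library. The block below swaps in a tiny bag-of-words Naive Bayes as a stand-in for the fastText or embedding classifier named above. The class name `NaiveBayesQuality`, the four training texts, and their labels (1 meaning a teacher score of 3 or more) are made up for illustration, and the sketch assumes both classes appear in the training labels.

```python
import math
from collections import Counter

class NaiveBayesQuality:
    """Tiny bag-of-words Naive Bayes, a stand-in for a fastText student classifier."""

    def fit(self, texts: list[str], labels: list[int]) -> "NaiveBayesQuality":
        self.word_counts = {0: Counter(), 1: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_score(self, words: list[str], label: int) -> float:
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in words:
            # Laplace smoothing so a word unseen in this class does not zero the score.
            score += math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text: str) -> int:
        """1 = keep (looks educational), 0 = drop. Out-of-vocabulary words are ignored."""
        words = [w for w in text.lower().split() if w in self.vocab]
        return int(self._log_score(words, 1) > self._log_score(words, 0))

# Hypothetical teacher-labeled samples (label 1 = teacher score >= 3).
texts = [
    "the derivative measures how a function changes as its input changes",
    "photosynthesis converts light energy into chemical energy in plants",
    "click here buy now cheap deals free shipping click here",
    "best cheap deals buy now limited offer click here",
]
labels = [1, 1, 0, 0]

student = NaiveBayesQuality().fit(texts, labels)
print(student.predict("a function converts energy"),
      student.predict("cheap deals click here now"))
```

The student only ever sees the teacher's labels, so any blind spot or bias in the teacher's scoring is copied into the filter that then runs over the whole corpus.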
Common misconception + your scenario
"More aggressive filtering is better — the higher the bar the cleaner" — wrong. Over-filtering costs diversity: if "quality" is defined as "looks like Wikipedia," you systematically delete dialects, spoken language, niche domains, and non-mainstream writing — the model becomes formal but stiff and narrow. Quality filtering trades off signal-to-noise against diversity; it's not monotonically better to be stricter. The definition of "high quality" itself carries a value judgment.
📌 Super-individual scenario: apply the classifier idea to a reading list — define a few "high-signal" samples you endorse, let the AI learn your taste and score new articles, and auto-archive the low scorers. Spending your scarce attention only on high-SNR inputs is, at its core, the same thing as quality-filtering a training corpus.
Takeaway + question
💡 The engineering recipe for quality filtering is "LLM labels a few, lightweight classifier runs all"; its fundamental tension is signal-to-noise vs diversity — there is no free "cleaner is always better."
🤔 When "high quality" is defined by a classifier, which is itself trained on human preferences — are we filtering out noise, or quietly instilling a particular worldview into the model?

Further Reading

Deep Questions

1. Synthetic data (creating data) and quality filtering (deleting data) look opposite — why are they so often used together in one pipeline as complements?
They're two sides of the same goal: raising the average information quality of the corpus. Quality filtering is subtraction — deleting low-value parts from existing massive real data; synthetic data is addition — supplying high-quality samples where real data is scarce or absent (rare task types, certain reasoning formats). Both rely on the core ability to "judge what counts as good data": filtering uses that judgment to score and delete, synthesis uses it to generate and screen. phi-1 used both — filtering "textbook-quality" real data from the web AND synthesizing textbooks with GPT-3.5, training on the mix. Deeper: as real high-quality data gets mined out (high-quality human text is a finite resource), synthesis shifts from "supplement" to "mainstay," and filtering's gatekeeping becomes even more critical — otherwise model-collapse risk spikes. There's a distributed-systems intuition here: when cache hit-rate won't climb, you can either evict cold data (filter) or pre-warm hot data (synthesize) — you need both hands for stability.
2. Muennighoff found data can be repeated ~4 epochs near-losslessly, but Lee proved dedup makes models better. Do these conclusions contradict each other?
No, because they're about two different kinds of repetition. Lee targets accidental redundancy inside a dataset — one sentence appearing 60,000 times, a severe distribution skew: the model overfits that sentence, recites it verbatim, and those "free" repeats crowd out gradient updates that could have learned diverse content. Muennighoff targets controlled multi-epoch passes over a clean dataset — given the data is already deduplicated and healthily distributed, repetition is applied uniformly across all samples. The distinction: accidental redundancy is local, uneven over-exposure (harmful); epoch repetition is global, uniform reuse (healthy under a constrained budget). Database analogy: a hot key getting hammered (skew, needs governance) vs a whole table being scanned multiple times (normal batch processing). So the correct order is dedup first, then decide epoch count — the two findings snap together into one coherent data strategy.
3. If internet content is increasingly AI-generated, the next generation of models will inevitably train on a web "full of synthetic data." What happens?
This is the real-world version of the model collapse hypothesis, and one of the most-watched open problems right now. The theoretical worry: model-generated text has a narrower distribution than human text, with the tail (rare-but-real expressions, minority views, long-tail facts) systematically weakened. If new models heavily learn from old models' outputs, the distribution keeps narrowing across generations, eventually degrading into a "smooth but impoverished average" — like re-saving a JPEG over and over, detail vanishing pass by pass. But reality isn't this bleak; mitigations are being actively researched: (a) anchoring on real data — always keeping enough raw human data as ballast; (b) provenance labeling and filtering — detecting and controlling the synthetic fraction; (c) quality gating of synthetic data — letting only strictly filtered, high-quality synthetic content in. The deeper question is fascinating for anyone into complexity science: this is a dynamical system with a feedback loop — models shape content, content trains models. Whether it converges to a stable point or keeps degrading depends on whether the external energy input of "new real information added by humans" can outpace the entropy growth of the synthetic loop. At root it's about whether an open system can sustain negentropy.
4. "High-quality data" gets defined by human preference and frozen into a classifier — what value choices does that quietly make at the data layer, and what does it mean for the model's worldview?
This is the most overlooked power question in data engineering. Every filter is an implicit value ranking: treating "looks like Wikipedia/textbook" as high quality defines formal written language, mainstream narratives, English-centrism, academic norms as "good," while branding dialects, spoken language, folk knowledge, non-Western perspectives, and minority expression as "low quality" to be filtered out. The model's eventual worldview is the cumulative projection of these filtering decisions — what it "thinks" the world is like is largely what the curators (usually engineers at a few tech companies) thought the world should be like. Concrete consequences: (a) skewed capability — strong in domains judged high-quality (coding, academic English), weak in filtered-out ones (low-resource languages, casual dialogue); (b) normative homogenization — everyone uses similar "quality classifiers," different models' preferences converge, and diversity erodes at the ecosystem level; (c) hard to audit — the value judgment is encoded in a classifier's weights, invisible and unchallengeable from outside. For anyone pursuing the "AI super-individual," the lesson: don't treat any model's output as neutral truth — it carries the value fingerprint of whoever curated its training data. Staying clear-eyed about "who defined what this model sees as good" is itself a key meta-skill.