Training a frontier model burns tens of millions of dollars and tens of thousands of GPUs for months. That implies a brutal reality: you get almost no chances to iterate—you can't train a GPT-4, see it underperform, and retune. So how do you decide "how big a model, how much data, how much compute"? The answer is Scaling Laws: fit a curve at small scale, extrapolate to large scale, and predict a big model's behavior in advance.
This has been the foundational methodology of AI industrialization for the past six years—turning "alchemy" into "engineering budget planning." Today, four core ideas: power-law scaling (how loss falls with scale), Chinchilla compute-optimal (how to split a fixed budget between params and data), emergent abilities (why some abilities "suddenly appear"), and whether emergence is a mirage (an unsettled debate). One throughline: from a predictable smooth curve, to an unpredictable jump, to questioning whether the jump is real or an artifact of the ruler.
Like capacity-planning load tests: you measure throughput on 1/10/100 machines, find the points form a clean log-log line, and extrapolate "what 1,000 machines would hit"—without actually building 1,000. A scaling law plots "model scale → loss" as the same extrapolable line, letting you predict big models from small ones.
Pain point: big models are too expensive, so you must predict in advance. Scaling Laws for Neural Language Models (Kaplan et al. 2020, arXiv 2001.08361) systematically measured a range of models and found a striking regularity: a language model's cross-entropy loss (how accurately it predicts the next token) falls as a power law in three quantities—parameter count N, dataset size D, and training compute C. For parameters:
$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$
Each symbol: $L$ is loss (lower is better), $N$ is parameters, $N_c$ and $\alpha_N$ are fitted constants ($\alpha_N \approx 0.076$, small). What a power law means: take logs of both sides and $\log L$ vs $\log N$ is a straight line, which is why a log-log plot lets you extrapolate from small to large. Kaplan et al. report that some of these trends stay straight across more than seven orders of magnitude, suspiciously clean for empirical science.
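To get a feel for how small that exponent is, the arithmetic can be worked directly. The sketch below uses only the exponent of about 0.076 quoted above; the constant N_c cancels in a ratio, so it is not needed. The helper name `loss_ratio` is made up for illustration.

```python
# Ratio of losses under L(N) = (N_c / N) ** alpha_N when N grows by `factor`.
# N_c cancels: L(factor * N) / L(N) = factor ** (-alpha_N).
ALPHA_N = 0.076  # exponent quoted in the text (Kaplan et al. 2020)

def loss_ratio(factor, alpha=ALPHA_N):
    """Multiplicative change in loss when parameters are scaled by `factor`."""
    return factor ** (-alpha)

for factor in (2, 10, 1000):
    r = loss_ratio(factor)
    print(f"{factor:>5}x params -> loss x {r:.3f} ({1 - r:.1%} lower)")
```

Under this power law alone, a 10x larger model lowers loss by roughly 16%, which is why the gains look modest per step yet add up over many orders of magnitude.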
The most counterintuitive insight: larger models are more sample-efficient. So under fixed compute, the optimal move is to train a very large model and stop early, before convergence—spend compute on "bigger," not on "more passes over data." This conclusion was half-corrected by Chinchilla (next card).
```python
import numpy as np

# Measured (params N, loss) for a few small models; fit power law, extrapolate
N = np.array([1e6, 1e7, 1e8, 1e9])      # parameter counts
loss = np.array([4.2, 3.6, 3.1, 2.7])   # measured loss

# A power law is a line in log-log: log L = a*log N + b → linear fit
a, b = np.polyfit(np.log(N), np.log(loss), 1)

def predict(n):
    return np.exp(b) * n ** a           # L(N) = e^b * N^a

print(predict(1e11))  # extrapolated loss at 100B params (never trained)
# a is negative (loss drops as N grows); real-world adds an irreducible term
```
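The closing comment mentions an irreducible term. One common way to write it is L(N) = E + A·N^(-α), where E is a loss floor that no amount of scale removes (the same role E plays in the Chinchilla paper's parametric loss). The sketch below is one simple way to fit that shape with NumPy alone, assuming the same four toy measurements: scan candidate floors E, fit a line to log(L - E) against log N for each, and keep the floor with the smallest residual. The grid scan and the names `fit_with_floor` and `predict_with_floor` are illustrative choices, not the fitting procedure used in either paper.

```python
import numpy as np

N = np.array([1e6, 1e7, 1e8, 1e9])      # same toy parameter counts
loss = np.array([4.2, 3.6, 3.1, 2.7])   # same toy losses

def fit_with_floor(N, loss, n_grid=200):
    """Fit L(N) = E + A * N**(-alpha) by scanning the floor E on a grid."""
    x = np.log(N)
    best = None
    # E must stay below the smallest observed loss so log(loss - E) is defined
    for E in np.linspace(0.0, loss.min() * 0.95, n_grid):
        y = np.log(loss - E)
        slope, intercept = np.polyfit(x, y, 1)
        resid = np.sum((slope * x + intercept - y) ** 2)  # log-space residual
        if best is None or resid < best[0]:
            best = (resid, E, np.exp(intercept), -slope)
    _, E, A, alpha = best
    return E, A, alpha

def predict_with_floor(n, E, A, alpha):
    return E + A * n ** (-alpha)

E, A, alpha = fit_with_floor(N, loss)

# Pure power law (floor fixed at zero) for comparison
a, b = np.polyfit(np.log(N), np.log(loss), 1)
pure = np.exp(b) * 1e11 ** a
with_floor = predict_with_floor(1e11, E, A, alpha)

print(f"floor E ~ {E:.2f}, alpha ~ {alpha:.3f}")
print(f"100B prediction: with floor {with_floor:.2f}, pure power law {pure:.2f}")
```

On this toy data the fitted floor is positive and the 100B prediction comes out higher than the pure power-law extrapolation, which is the practical effect of the irreducible term: the plain log-log line is optimistic far beyond the measured range. With only four points the fitted floor is poorly constrained, so treat the numbers as illustrative.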
Like splitting a fixed budget between CPU and RAM: you have fixed server spend and must divide it. Kaplan's result was misread as "spend it all on CPU (a bigger model)"; Chinchilla, redoing the experiments, found CPU and RAM (model size and data) should each grow about equally—lopsided spending is waste.
Pain point: given a fixed compute budget C (C ≈ 6·N·D, a function of params × data), how do you split between "bigger model" and "more data" to minimize loss? Training Compute-Optimal Large Language Models (Hoffmann et al. 2022, arXiv 2203.15556, the Chinchilla paper) trained over 400 models to scan this, concluding:
At the optimum: $N \propto C^{0.5}$, $D \propto C^{0.5}$ → double the params, double the data
Meaning: params N and data D should be scaled equally, with a practical rule of thumb of about 20 training tokens per parameter. This directly contradicted the prevailing practice—GPT-3 (175B params trained on only ~300B tokens) was severely "undertrained." Using the same compute, Chinchilla built a smaller (70B) model fed 4× the data (~1.4T tokens) and beat the larger Gopher on benchmarks like MMLU.
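The "20 tokens per parameter" rule and the "same compute" claim can be checked with back-of-the-envelope arithmetic. The sketch below uses the GPT-3 and Chinchilla figures quoted above; the Gopher figures (280B parameters, about 300B tokens) come from the Chinchilla paper rather than from this text, and C ≈ 6·N·D is only an approximation of training FLOPs. The helper names are made up for illustration.

```python
# (params, training tokens) per model
models = {
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),   # from Hoffmann et al. 2022, not stated above
    "Chinchilla": (70e9, 1.4e12),
}

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    # Common approximation: about 6 FLOPs per parameter per training token
    return 6 * n_params * n_tokens

for name, (n, d) in models.items():
    print(f"{name:<10} tokens/param = {tokens_per_param(n, d):5.1f}   "
          f"C ~ {train_flops(n, d):.2e} FLOPs")
```

Under these figures GPT-3 and Gopher sit below 2 tokens per parameter while Chinchilla sits at 20, and the estimated compute for Gopher and Chinchilla lands within roughly 20% of each other.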
Why did Kaplan and Chinchilla differ? According to Hoffmann et al., mainly because Kaplan's experiments used the same learning-rate schedule and number of training tokens for every model, which under-weighted the "data dimension." After the correction, the field shifted from "blindly stacking params" to "params and data in balance," which is why later models' param counts stopped exploding while training-data volume surged.
```python
# Given a compute budget, estimate compute-optimal params and data
# Approx: training compute C ≈ 6 * N * D (FLOPs), and optimal D ≈ 20 * N
def compute_optimal(C):
    # sub D=20N → C ≈ 6 * N * 20N = 120 N^2
    N = (C / 120) ** 0.5   # optimal param count
    D = 20 * N             # optimal token count
    return N, D

N, D = compute_optimal(C=1e23)  # give a compute budget (FLOPs)
print(f"params ~{N:.1e}, tokens ~{D:.1e}")
# want 70B params → about 1.4T tokens, exactly the Chinchilla recipe
```
Like water boiling at 100°C: temperature rises smoothly, but at a critical point the state jumps. Some abilities (multi-step arithmetic, following examples to reason) are zero/random in small models, then shoot up once scale crosses a threshold—"invisible in small models, appears only when scaled" is emergence. Backend analogy: collective emergent behaviors invisible on a single node (cache stampede, herd effects) that only surface at a certain node count.
Tension: the earlier cards said loss falls smoothly and is extrapolable. But Emergent Abilities of Large Language Models (Wei et al. 2022, arXiv 2206.07682) noted that downstream task performance isn't necessarily smooth. The paper cataloged a set of abilities: below a certain param scale, the model is near random chance on the task (multi-step arithmetic, exam-style QA, few-shot learning from examples); past the critical scale, accuracy jumps sharply.
Why does this matter? Because it means you can't simply extrapolate downstream ability from small models—you measure random performance on a task at 1B and 10B and might conclude "this won't work," yet at 100B it might suddenly work. This gives "keep scaling" a strong motive: maybe the next ability hides in the next order of magnitude. The effectiveness of CoT (chain-of-thought prompting) was also observed to be scale-dependent—useless or worse in small models, only large models benefit.
```python
import numpy as np

# Use exact-match (all-or-nothing) to score a task
sizes = np.array([1e8, 1e9, 1e10, 5e10, 1e11])
# Needs 5 correct steps in a row; per-step accuracy rises smoothly with scale
per_step = np.array([0.30, 0.45, 0.60, 0.80, 0.92])
# Task score = prob of all 5 steps right = per_step^5 → near zero until scaled up
task_acc = per_step ** 5

for s, a in zip(sizes, task_acc):
    print(f"{s:.0e}: {a:.2%}")
# 0.24% → 1.85% → 7.78% → 32.77% → 65.91%
# Looks like "sudden emergence", but that's a setup for the next card
```
Like using the wrong progress bar: track a project by "done only when all tests pass" and you see a fake jump "0% → 100% one day"; switch to a continuous metric "tests passed / total" and progress was climbing smoothly all along. The jump is made by the ruler, not by the thing itself.
The reversal: Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al. 2023, arXiv 2304.15004) sharply questioned the previous card's "emergence"—many reported emergences are caused by the researcher's choice of metric, not by a real jump in model behavior.
Mechanism: the key is whether the metric is nonlinear/discontinuous or linear/continuous. With exact-match (full credit only if all steps right), smooth improvement in per-step ability gets amplified by a power into "looks like a sudden jump" (exactly the per_step ** 5 effect in the last card's code). But switch to a continuous metric—per-token loss, or partial credit—and on the same outputs from the same models, the curve becomes smooth, predictable, and emergence disappears. The paper's core claim: the metric choice, not model capability, manufactures the appearance of "emergence."
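To make the "same outputs, different ruler" point concrete, here is a minimal sketch that scores one set of answers two ways. The predictions and target are invented for illustration, characters stand in for tokens, and `exact_match` and `token_accuracy` are simplified stand-ins for the kinds of metrics discussed, not the paper's evaluation code.

```python
def exact_match(pred, target):
    """All-or-nothing: 1.0 only if the whole answer matches."""
    return 1.0 if pred == target else 0.0

def token_accuracy(pred, target):
    """Partial credit: fraction of target positions predicted correctly."""
    hits = sum(p == t for p, t in zip(pred, target))
    return hits / len(target)

# Hypothetical outputs for the 5-digit answer "40713" from three model sizes
target = "40713"
outputs = {"small": "41923", "medium": "40723", "large": "40713"}

for size, pred in outputs.items():
    print(f"{size:<6} exact={exact_match(pred, target):.0f}  "
          f"partial={token_accuracy(pred, target):.1f}")
```

Exact match reads 0, 0, 1 across the three sizes, while partial credit reads 0.4, 0.8, 1.0 for the very same answers: one ruler shows a jump, the other a steady climb.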
But be honest: this does not mean "emergence doesn't exist at all." The Mirage paper showed many reported emergences are metric artifacts, but it didn't—and can't—prove every jump is fake. This is a still-active open debate—its real contribution is forcing the whole field to re-examine its rulers.
```python
import numpy as np

sizes = np.array([1e8, 1e9, 1e10, 5e10, 1e11])
per_step = np.array([0.30, 0.45, 0.60, 0.80, 0.92])  # per-step accuracy: smooth

# Ruler A: all 5 steps right (nonlinear) → manufactures a "jump"
discontinuous = per_step ** 5
# Ruler B: continuous metric, just per-step accuracy (already smooth)
continuous = per_step

print("nonlinear:", np.round(discontinuous, 3))  # [.002 .018 .078 .328 .659] looks emergent
print("continuous:", np.round(continuous, 3))    # [.30 .45 .60 .80 .92] smooth, extrapolable
# Same underlying ability, swap the ruler → "emergence" appears or vanishes
```