AI/ML Explained: Scaling Laws

Day 14 · 2026-05-31 · Difficulty ★★★★☆

For: engineers with coding experience, non-AI background

Intro: Why is "how much intelligence your money buys" predictable? · Why Scaling Laws Matter

Training a frontier model burns tens of millions of dollars and tens of thousands of GPUs for months. That implies a brutal reality: you get almost no chances to iterate—you can't train a GPT-4, see it underperform, and retune. So how do you decide "how big a model, how much data, how much compute"? The answer is Scaling Laws: fit a curve at small scale, extrapolate to large scale, and predict a big model's behavior in advance.

This has been the foundational methodology of AI industrialization for the past six years—turning "alchemy" into "engineering budget planning." Today, four core ideas: power-law scaling (how loss falls with scale), Chinchilla compute-optimal (how to split a fixed budget between params and data), emergent abilities (why some abilities "suddenly appear"), and whether emergence is a mirage (an unsettled debate). One throughline: from a predictable smooth curve, to an unpredictable jump, to questioning whether the jump is real or an artifact of the ruler.

Neural Scaling Laws · Power-Law Scaling

Kaplan 2020 · Power law · Extrapolable

One-line analogy

Like capacity-planning load tests: you measure throughput on 1/10/100 machines, find the points form a clean log-log line, and extrapolate "what 1,000 machines would hit"—without actually building 1,000. A scaling law plots "model scale → loss" as the same extrapolable line, letting you predict big models from small ones.

What it solves + how it works

Pain point: big models are too expensive, so you must predict in advance. Scaling Laws for Neural Language Models (Kaplan et al. 2020, arXiv 2001.08361) systematically measured a range of models and found a striking regularity: a language model's cross-entropy loss (how accurately it predicts the next token) falls as a power law in three quantities—parameter count N, dataset size D, and training compute C. For parameters:

L(N) ≈ (N_c / N)^α_N

Each symbol: L is loss (lower is better), N is the parameter count, and N_c and α_N are fitted constants (α_N ≈ 0.076, a small exponent). What a power law means: take logs of both sides and log L versus log N is a straight line with slope −α_N, which is why a log-log plot lets you extrapolate from small to large. Kaplan et al. report trends that stay straight across as many as seven orders of magnitude, suspiciously clean for empirical science.
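Written out, taking logs of the parameter power law gives the straight line directly. This is a short derivation using only the symbols defined above; natural logs are assumed, though any base works.

```latex
\begin{aligned}
L(N) &\approx \left(\frac{N_c}{N}\right)^{\alpha_N} \\
\log L &\approx \alpha_N \log N_c \;-\; \alpha_N \log N
\end{aligned}
```

On log-log axes the slope is −α_N and the intercept is α_N log N_c, so a linear fit recovers both constants.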

The most counterintuitive insight: larger models are more sample-efficient. So under fixed compute, the optimal move is to train a very large model and stop early, before convergence—spend compute on "bigger," not on "more passes over data." This conclusion was half-corrected by Chinchilla (next card).

On a log-log plot, loss is a straight line (illustrative)

loss
hi
    ·
       ·
          ·
lo            · → approaches an irreducible floor
  small     model scale N (log)    large

Straight → extrapolable; but never reaches 0 (irreducible loss)

Code example

import numpy as np
# Measured (params N, loss) for a few small models; fit power law, extrapolate
N    = np.array([1e6, 1e7, 1e8, 1e9])      # parameter counts
loss = np.array([4.2, 3.6, 3.1, 2.7])      # measured loss
# A power law is a line in log-log: log L = a*log N + b → linear fit
a, b = np.polyfit(np.log(N), np.log(loss), 1)
def predict(n):
    return np.exp(b) * n ** a       # L(N) = e^b * N^a
print(predict(1e11))   # extrapolated loss at 100B params (never trained)
# a is negative (loss drops as N grows); real-world adds an irreducible term

Common misconception + use case

"A power law means infinite params drive loss to 0": wrong. The fuller formula has an irreducible constant term, corresponding to language's inherent randomness (the next word genuinely has many valid options). The curve approaches that floor, not 0. Returns to scale diminish steeply: under L ∝ N^(−α_N), halving the reducible loss multiplies the required parameter count by 2^(1/α_N), roughly 9,000× for α_N ≈ 0.076.

📌 Judgment case: scaling laws give BigCat the power to read the marginal returns of "compute vs data vs model size." When a team claims "our model is stronger because it has more params," ask: on that power-law curve, is the loss drop they bought with extra compute worth it? Much of "bigger" is just sliding down the diminishing-returns tail.
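To see why the irreducible floor matters for extrapolation, here is a minimal sketch. The constants E, A, and alpha are made up for illustration (they are not values from Kaplan et al.). Synthetic losses follow L(N) = E + A·N^(−alpha), and a floor-free power law is then fit to them with the same log-log linear fit used in the code example earlier in this card.

```python
import numpy as np

# Hypothetical constants for illustration only (not fitted to any real model)
E, A, alpha = 1.7, 20.0, 0.2          # floor, scale factor, exponent

def true_loss(n):
    """Synthetic loss curve with an irreducible floor E."""
    return E + A * n ** (-alpha)

# "Measure" four small models on the synthetic curve
N_small = np.array([1e6, 1e7, 1e8, 1e9])
measured = true_loss(N_small)

# Fit a pure power law (no floor): a straight line in log-log space
slope, intercept = np.polyfit(np.log(N_small), np.log(measured), 1)

def pure_power_law(n):
    """Floor-free extrapolation: L(N) = e^intercept * N^slope."""
    return np.exp(intercept) * n ** slope

N_big = 1e12
predicted = pure_power_law(N_big)
actual = true_loss(N_big)
print(f"pure power-law prediction at 1e12 params: {predicted:.2f}")
print(f"synthetic 'true' loss at 1e12 params:     {actual:.2f}")
print(f"irreducible floor E:                      {E:.2f}")
```

In this toy setup the floor-free fit extrapolates to a loss below E, a value the synthetic curve can never reach. The further out you extrapolate, the larger that gap becomes.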

Takeaway + question

💡 Scaling laws = loss falls as a power law in scale, a straight extrapolable line in log-log—turning "alchemy" into "budget planning."
🤔 Why does the industry treat a clean 7-orders-of-magnitude line as iron law even without a "theory"? Which "know-that-not-why" empirical laws have you met in systems performance (e.g. queueing theory, Amdahl's law)?

Compute-Optimal / Chinchilla · Compute-Optimal Scaling

Hoffmann 2022 · Params vs data · ~20 tokens/param

One-line analogy

Like splitting a fixed budget between CPU and RAM: you have fixed server spend and must divide it. Kaplan's result was misread as "spend it all on CPU (a bigger model)"; Chinchilla, redoing the experiments, found CPU and RAM (model size and data) should each grow about equally—lopsided spending is waste.

What it solves + how it works

Pain point: given a fixed compute budget C (C ≈ 6·N·D, a function of params × data), how do you split between "bigger model" and "more data" to minimize loss? Training Compute-Optimal Large Language Models (Hoffmann et al. 2022, arXiv 2203.15556, the Chinchilla paper) trained over 400 models to scan this, concluding:

At the optimum: N ∝ C^0.5, D ∝ C^0.5 → 4× the compute buys 2× the params and 2× the data

Meaning: params N and data D should be scaled equally, with a practical rule of thumb of about 20 training tokens per parameter. This directly contradicted the prevailing practice: GPT-3 (175B params trained on only ~300B tokens, under 2 tokens per parameter) was severely "undertrained." Using the same compute as the 280B-parameter Gopher, Chinchilla trained a smaller (70B) model on over 4× the data (~1.4T tokens) and beat Gopher on benchmarks like MMLU.
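The arithmetic behind "undertrained" can be checked in a few lines. This is a rough sketch: the GPT-3 and Chinchilla figures are the ones quoted above, Gopher's row (280B params, about 300B tokens) uses commonly cited figures and is included as an assumption, and C ≈ 6·N·D is the rule-of-thumb approximation rather than an exact FLOP count.

```python
# (params, training tokens); Gopher's row is an outside, approximate figure
models = {
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9,  1.4e12),
}

def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def train_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

for name, (n, d) in models.items():
    print(f"{name:<10} tokens/param = {tokens_per_param(n, d):5.1f}   "
          f"C ≈ {train_flops(n, d):.1e} FLOPs")
```

Under this approximation Gopher and Chinchilla land at roughly the same training compute, while their tokens-per-parameter ratios differ by nearly 20×. That is the "same budget, different split" comparison the paper makes.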

Why did Kaplan and Chinchilla differ? Mainly because settings like learning-rate schedules in Kaplan's experiments under-weighted the "data dimension." After the correction, the field shifted from "blindly stacking params" to "params and data in balance"—which is why later models' param counts stopped exploding while training-data volume surged.

Same compute budget, two allocations

Kaplan route: params ████ huge data █ scarce → undertrained
Chinchilla: params ██ medium data ███ ample → better at same compute

Rule of thumb: tokens ≈ 20 × params (compute-optimal point)

Code example

# Given a compute budget, estimate compute-optimal params and data
# Approx: training compute C ≈ 6 * N * D (FLOPs), and optimal D ≈ 20 * N
def compute_optimal(C):
    # sub D=20N → C ≈ 6 * N * 20N = 120 N^2
    N = (C / 120) ** 0.5          # optimal param count
    D = 20 * N                    # optimal token count
    return N, D

N, D = compute_optimal(C=1e23)   # give a compute budget (FLOPs)
print(f"params ~{N:.1e}, tokens ~{D:.1e}")
# Sanity check on the 20x rule: 70B params → about 1.4T tokens, the Chinchilla recipe

Common misconception + use case

"20 tokens/param is a law you must obey"—no. It's training-compute-optimal and ignores inference cost entirely. If a model will be deployed and called billions of times, inference cost dwarfs training—then you should deliberately "over-train" a smaller model (feed far more than 20× tokens). Small models are cheap to serve, and the extra training is amortized once. The Llama family took exactly this route: small models trained on data far beyond Chinchilla-optimal.

📌 Cross-disciplinary case: Chinchilla is a constrained-optimization problem (split a fixed budget between two variables), isomorphic to capacity planning and portfolio allocation. The transfer lesson for BigCat: "optimal ratio" always demands first asking "which objective function are we optimizing"—training-optimal ≠ deployment-optimal ≠ total-cost-optimal. Wrong objective, wrong ratio.
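The training-versus-deployment tradeoff can be sketched with a lifetime-cost estimate. Everything here is an assumption for illustration: the two recipes are hypothetical, they are assumed (not shown) to reach comparable quality, training is approximated as 6·N·D FLOPs, and inference as roughly 2·N FLOPs per token served.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training (≈ 6*N*D) plus inference (≈ 2*N per token served)."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

# Two hypothetical recipes, ASSUMED to reach similar quality
big   = dict(n_params=70e9, train_tokens=1.4e12)   # about 20 tokens/param
small = dict(n_params=8e9,  train_tokens=15e12)    # over-trained, ~1,900 tokens/param

for served in (1e11, 1e12, 1e13, 1e14):
    cost_big = lifetime_flops(served_tokens=served, **big)
    cost_small = lifetime_flops(served_tokens=served, **small)
    cheaper = "small" if cost_small < cost_big else "big"
    print(f"served {served:.0e} tokens: big {cost_big:.2e}, "
          f"small {cost_small:.2e} → {cheaper} is cheaper")

# Served-token volume at which the two recipes cost the same
break_even = (6 * (small["n_params"] * small["train_tokens"]
                   - big["n_params"] * big["train_tokens"])
              / (2 * (big["n_params"] - small["n_params"])))
print(f"break-even at about {break_even:.1e} served tokens")
```

With these made-up numbers the larger model wins at low serving volume and the over-trained small model wins once usage passes roughly a trillion served tokens. The crossover point moves with the assumptions, which is the point: the "optimal" ratio depends on the objective being optimized.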

Takeaway + question

💡 Chinchilla = under fixed compute, scale params and data equally (≈20 tokens/param); it corrected "blindly stacking params," but is optimal only for "training cost."
🤔 The same "training vs inference" cost tradeoff decides "how big to train." If your app is low-frequency high-value (a few complex decisions/day) vs high-frequency low-value (thousands/sec), how does this ratio flip?

Emergent Abilities · Phase Transitions in Capability

Wei 2022 · Phase transition · Non-extrapolable

One-line analogy

Like water boiling at 100°C: temperature rises smoothly, but at a critical point the state jumps. Some abilities (multi-step arithmetic, reasoning from a few in-context examples) sit at zero or chance level in small models, then shoot up once scale crosses a threshold. "Invisible in small models, appears only when scaled" is what the term emergence refers to. Backend analogy: collective behaviors that are invisible on a single node (cache stampedes, thundering herds) and only surface at a certain node count.

What it solves + how it works

Tension: the first card said loss falls smoothly and is extrapolable. But Emergent Abilities of Large Language Models (Wei et al. 2022, arXiv 2206.07682) noted that downstream task performance isn't necessarily smooth. The paper cataloged a set of abilities (multi-step arithmetic, exam-style QA, few-shot learning from examples) where, below a certain scale, the model is near random chance on the task; past the critical scale, accuracy jumps sharply.

Why does this matter? Because it means you can't simply extrapolate downstream ability from small models—you measure random performance on a task at 1B and 10B and might conclude "this won't work," yet at 100B it might suddenly work. This gives "keep scaling" a strong motive: maybe the next ability hides in the next order of magnitude. The effectiveness of CoT (chain-of-thought prompting) was also observed to be scale-dependent—useless or worse in small models, only large models benefit.

Smooth decline vs emergent jump

loss (pretraining): smooth, extrapolable

some downstream accuracy:
chance ····················▕ jumps past threshold
1B 10B ↑threshold 100B

Extrapolating from small models → misjudge "this task is impossible"

Code example

import numpy as np
# Use exact-match (all-or-nothing) to score a task
sizes = np.array([1e8, 1e9, 1e10, 5e10, 1e11])
# Needs 5 correct steps in a row; per-step accuracy rises smoothly with scale
per_step = np.array([0.30, 0.45, 0.60, 0.80, 0.92])
# Task score = prob of all 5 steps right = per_step^5 → near zero until per-step accuracy is high
task_acc = per_step ** 5
for s, a in zip(sizes, task_acc):
    print(f"{s:.0e}: {a:.2%}")   # 0.24% → 1.85% → 7.78% → 32.77% → 65.91%
# Looks like "sudden emergence", but that's a setup for the next card

Common misconception + use case

"Emergence = the model suddenly gained intelligence / consciousness"—don't mystify it. Emergence describes a nonlinear jump in one specific task metric, a statistical phenomenon of capability, not a "flash of awakening." Romanticizing it as "AI waking up" misuses the technical term.

📌 Cross-disciplinary case: emergence is a core motif of complex-systems science (among BigCat's interests)—ant-colony collective intelligence, neurons-to-consciousness, individual-to-macro markets. LLM emergence offers a quantifiable sample. But beware: "looks emergent" and "truly a phase transition" are different things—the next card exists to unmask exactly this.

Takeaway + question

💡 Emergent abilities = some downstream abilities are near zero in small models and appear suddenly past a critical scale, non-extrapolable—the core motive for "keep scaling."
🤔 If an ability is "non-extrapolable, only known by scaling up," what does that say about the scientific method of "predicting big systems from small experiments"? Which fields share this "invisible below scale" phenomenon?

Is emergence a mirage? · The Mirage Critique

Schaeffer 2023 · Metric artifact · Open debate

One-line analogy

Like using the wrong progress bar: track a project by "done only when all tests pass" and you see a fake jump "0% → 100% one day"; switch to a continuous metric "tests passed / total" and progress was climbing smoothly all along. The jump is made by the ruler, not by the thing itself.

What it solves + how it works

The reversal: Are Emergent Abilities of Large Language Models a Mirage? (Schaeffer et al. 2023, arXiv 2304.15004) sharply questioned the previous card's "emergence"—many reported emergences are caused by the researcher's choice of metric, not by a real jump in model behavior.

Mechanism: the key is whether the metric is nonlinear/discontinuous or linear/continuous. With exact-match (full credit only if all steps right), smooth improvement in per-step ability gets amplified by a power into "looks like a sudden jump" (exactly the per_step ** 5 effect in the last card's code). But switch to a continuous metric—per-token loss, or partial credit—and on the same outputs from the same models, the curve becomes smooth, predictable, and emergence disappears. The paper's core claim: the metric choice, not model capability, manufactures the appearance of "emergence."

But be honest: this does not mean "emergence doesn't exist at all." The Mirage paper showed many reported emergences are metric artifacts, but it didn't—and can't—prove every jump is fake. This is a still-active open debate—its real contribution is forcing the whole field to re-examine its rulers.

Same model outputs, two rulers, two conclusions

nonlinear metric (all-or-nothing): ·········· "emergence!"

continuous metric (per-token / partial credit): smooth, predictable

Data unchanged; only the measurement changed → conclusion flips

Code example

import numpy as np
sizes    = np.array([1e8, 1e9, 1e10, 5e10, 1e11])
per_step = np.array([0.30, 0.45, 0.60, 0.80, 0.92])   # per-step accuracy: smooth
# Ruler A: all 5 steps right (nonlinear) → manufactures a "jump"
discontinuous = per_step ** 5
# Ruler B: continuous metric, just per-step accuracy (already smooth)
continuous = per_step
print("nonlinear:", np.round(discontinuous, 3))  # [.002 .018 .078 .328 .659] looks emergent
print("continuous:", np.round(continuous, 3))    # [.30 .45 .60 .80 .92] smooth, extrapolable
# Same underlying ability, swap the ruler → "emergence" appears or vanishes

Common misconception + use case

"The Mirage paper proves emergence is a hoax / LLMs have no real ability"—over-reading. It proves the "shape of emergence" is sensitive to the metric, a reminder not to be scared or fooled by a steep curve; but models really are stronger at scale—that's a fact. Reading it as "emergence is all fake" and reading emergence as "AI awakening" are the same laziness—the truth is "it depends how you measure."

📌 Judgment case: this is the most valuable lesson for BigCat—a metric is a stance. Faced with any "AI capability surging / plateauing" benchmark chart, first ask: is this a continuous metric or all-or-nothing? Would swapping rulers flip the conclusion? The same vigilance applies to your own agent evals, A/B tests, and OKRs—pick the wrong ruler and you'll "see" an inflection point that doesn't exist.

Takeaway + question

💡 The Mirage critique = many "emergences" are appearances manufactured by nonlinear metrics; switch to a continuous metric and they smooth out—not "emergence doesn't exist," but "it depends what ruler you measure with."
🤔 "Measurement decides what you see"—is this the same as the "observation changes the object" claim in quantum or psychological measurement, or fundamentally different? To what extent can a pure metric choice count as an "objective finding"?

Deep Questions

1. Threading the four concepts together, what cognitive evolution about "scale" do they tell?

Throughline: from "predictable smoothness," to "unpredictable jump," to "is the jump real or made by the ruler." (1) Kaplan: loss falls as a smooth power law, extrapolable, turning alchemy into budget planning. (2) Chinchilla: refines within the predictable frame—under fixed budget, scale params and data equally. Both say "scale is controllable, computable." (3) Emergence: cold water—downstream ability isn't necessarily smooth, some appear only past a critical scale, non-extrapolable. (4) Mirage: reversal again—many "jumps" are artifacts of nonlinear metrics; switch to continuous and they smooth out.

At heart it's a tug-of-war over "can we predict big models": yes → no → maybe yes, the ruler was just wrong. The meta-lesson for BigCat: when you meet a "discontinuity narrative," be wary—rule out measurement artifact before concluding "qualitative change." That's the watershed between mature engineering judgment and hype.

2. Chinchilla says "20 tokens/param is optimal," but Llama deliberately violated it with far more than 20×. Contradiction? What deeper principle is behind it?

No contradiction—they optimize different objective functions. Chinchilla optimizes "lowest loss under fixed training compute," ignoring the model's fate after training. Llama optimizes "price-performance over training + massive inference": if the model is called billions of times, inference cost dwarfs training, so deliberately over-training a smaller model pays off—small models are cheap per call, and the extra training amortizes once.

The deeper principle: "optimal" is always relative to an objective function; optimality without an objective is meaningless. A database's "optimal index" depends on read/write ratio; a cache's "optimal size" on the hit curve and RAM price. The Chinchilla → Llama shift is essentially the industry switching its objective from "training-optimal" to "deployment economics." For BigCat: whenever someone hands you an "optimal configuration," the first reaction should be "which objective is it optimizing? Is that my objective?"—that one question blocks most wrong transfers.

3. The emergence vs Mirage debate is still unsettled. As an engineer, how do you act while there's no conclusion?

The key is separating "what's consensus vs what's still contested," acting on consensus while staying open on the dispute. Consensus: (a) big models really are much stronger on many tasks, whatever the curve shape is called; (b) the choice of eval metric dramatically changes the curve shape you see—Mirage nailed this, no dispute. Still contested: whether there exist "true phase transitions that survive any reasonable continuous metric."

Action rules: (1) for decisions, prefer continuous, extrapolable metrics (loss, partial credit); treat exact-match as "for display," not "for planning." (2) don't declare "qualitative change" or "we hit a wall" from one steep curve—re-verify with a different ruler. (3) admit extrapolation has limits, and reserve budget to "verify after scaling." This is a mature stance: where science is undecided, don't fake an answer; design strategies robust to both possibilities. Same for BigCat's AI super-individual decisions—rather than bet on a narrative, build workflows that "don't lose whether emergence is real or not."
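Rule (2), re-verifying with a different ruler, can be as simple as scoring the same outputs two ways. A minimal sketch follows; the task, the reference steps, and the three model outputs are all hypothetical, and `exact_match` and `partial_credit` are illustrative helper names rather than functions from any eval library.

```python
def exact_match(pred_steps, gold_steps):
    """All-or-nothing: 1.0 only if every step matches the reference."""
    return 1.0 if pred_steps == gold_steps else 0.0

def partial_credit(pred_steps, gold_steps):
    """Fraction of positions where the predicted step matches the reference."""
    hits = sum(p == g for p, g in zip(pred_steps, gold_steps))
    return hits / len(gold_steps)

gold = ["a", "b", "c", "d", "e"]
# Hypothetical outputs from three model sizes on the same 5-step task
outputs = {
    "small":  ["a", "x", "x", "x", "x"],
    "medium": ["a", "b", "c", "x", "x"],
    "large":  ["a", "b", "c", "d", "e"],
}

for name, pred in outputs.items():
    print(f"{name:<6} exact={exact_match(pred, gold):.1f}  "
          f"partial={partial_credit(pred, gold):.1f}")
```

Exact-match reads 0, 0, 1 across the three sizes, which looks like a jump; partial credit reads 0.2, 0.6, 1.0 on the same outputs, which looks like steady progress. For planning, the second series is the one you can extrapolate.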

4. Scaling laws are purely empirical fitted curves with no first principles. Is it safe to build industrial decisions on them?

Scaling laws are like the empirical laws of early thermodynamics: the gas laws were used in engineering well before kinetic theory explained them, because within their range they were extremely reliable. Same here: a clean line across 7 orders of magnitude, predictive enough to guide tens of millions in investment. But the danger of empirical laws is extrapolating past their range: you don't know where the line bends.

Concrete risks: (1) data wall—the power law assumes infinite data supply, but high-quality text is finite, so the curve may deviate after saturation. (2) architecture shift—the constants are fit to a specific architecture; change it and the curve changes. (3) downstream ability ≠ loss—smooth pretraining loss doesn't mean the ability you care about is smooth.

Safe usage: treat it as "a reliable engineering tool within its range," not "cosmic truth." Inside the range, use it boldly for budget planning; near the edges (data exhaustion, architecture turnover) stay wary and reserve verification budget. Same as using Amdahl's law or queueing models for capacity planning—useful, but know when the assumptions break.

5. If scaling laws keep holding, will "more compute = stronger model" always be true? What might the end of this road be?

The power law already contains its own endgame: diminishing returns. Each halving of the reducible loss multiplies the required scale by a large constant factor, 2^(1/α), and loss has an irreducible floor. So "infinite compute for infinite intelligence" fails mathematically: you pay geometrically growing cost to approach a finite limit.
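To put a number on those diminishing returns, here is a small sketch under the pure power law L ∝ N^(−α). It ignores the irreducible floor, so it describes only the reducible part of the loss, and it reuses the parameter exponent α_N ≈ 0.076 quoted in the first card. The function name is illustrative.

```python
def scale_multiplier_to_halve(alpha):
    """Under L ∝ N^(-alpha), halving L requires multiplying N by 2^(1/alpha)."""
    return 2 ** (1 / alpha)

alpha_N = 0.076   # parameter exponent quoted earlier from Kaplan et al.
factor = scale_multiplier_to_halve(alpha_N)
print(f"params must grow about {factor:,.0f}x to halve the reducible loss")

# Check: plugging the multiplier back into the power law halves the loss
N0 = 1e9
ratio = ((N0 * factor) ** -alpha_N) / (N0 ** -alpha_N)
print(f"loss ratio after scaling: {ratio:.3f}")
```

The multiplier comes out on the order of 9,000×, and every further halving costs that same factor again. Two halvings therefore need roughly 80 million times the parameters, which is why the tail of the curve is so expensive.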

Several real walls: (1) data wall—you must scale data equally, but high-quality human text is running short, forcing synthetic data and the like. (2) compute/energy wall—frontier training is approaching physical and economic limits. (3) pretraining loss ≠ real use—pushing loss lower doesn't proportionally convert into reasoning, alignment, reliability.

The industry's bets have already shifted: from simply scaling pretraining to test-time compute (let the model think more steps, the o1 route), data quality, and new architectures. A reasonable read: pure-pretraining dividends enter the diminishing-returns tail, but "scaling thinking" migrates to a new axis—compute is still core, just spent on "thinking at inference" instead of "params at training." For BigCat: watch not "how big can models get," but which direction, and how fast, the "cost-per-unit-intelligence" curve moves down—every notch it loosens redraws the boundary of automatable work.
