Picture refactoring under a "code-size budget." The model first solves every training example with a giant hardcoded lookup table — 100% on the training set, every new input wrong. A background linter (this is weight decay) keeps penalizing "code size." Under sustained pressure the system eventually "discovers" the real algorithm (a compact function) that is both smaller and generalizes. The jump is sudden because the compact algorithm only works once it is fully formed — a half-built algorithm is worse than the lookup table.
Common sense says "training past overfitting only makes things worse." Grokking (Power et al. 2022, OpenAI; "grok" is sci-fi slang for "deeply understand") gives a counterexample on small algorithmic datasets (e.g. "addition mod 97"): training accuracy hits 100% within a few hundred steps, validation accuracy stays flat at chance level, then — after orders of magnitude more steps — validation accuracy suddenly jumps from ~0 to ~100%.
Mechanism (intuition): the network has two ways to solve the task: (1) memorize: store each training example as a lookup table in the weights (large norm, no generalization); (2) generalize: learn the underlying structure (for modular addition, a set of Fourier-style "trigonometric-identity circuits"; small norm, correct on unseen pairs). SGD reaches the easier memorization solution first; weight decay then keeps applying a norm penalty and slowly pushes the weights toward the more compact generalizing solution. Nanda et al. 2023 reverse-engineered exactly this circuit with mechanistic interpretability, showing that the generalizing circuit grows gradually and stays masked until the memorized solution is cleaned up, so macro-level metrics look like a sudden jump. In their setup, a model trained without weight decay stays at the memorization solution and does not grok, which makes regularization the key variable here.
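    import torch, torch.nn as nn

    # addition mod 97, (a + b) % 97: grokking is reliably reproduced mainly on such algorithmic tasks (sketch)
    P = 97
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))   # all (a, b) inputs, shape (P*P, 2)
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(P * P)
    n_train = int(0.3 * P * P)                                       # train fraction is an assumption
    x_train, y_train = pairs[perm[:n_train]], labels[perm[:n_train]]
    x_val, y_val = pairs[perm[n_train:]], labels[perm[n_train:]]     # held out to watch for the late jump

    model = nn.Sequential(nn.Embedding(P, 128), nn.Flatten(),        # (N, 2) -> (N, 2, 128) -> (N, 256)
                          nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, P))
    loss_fn = nn.CrossEntropyLoss()
    # key: weight_decay > 0 is what drives grokking here; without it the model tends to stay at memorization
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    for step in range(100_000):                  # steps must far exceed the moment train acc hit 100%
        loss = loss_fn(model(x_train), y_train)
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 1000 == 0:
            with torch.no_grad():
                val_acc = (model(x_val).argmax(-1) == y_val).float().mean().item()
            print(step, loss.item(), val_acc)
    # expected pattern: train acc hits 100% in hundreds of steps; val acc stays flat for tens of
    # thousands, then "suddenly" jumps to ~100% (exact timing depends on the hyperparameters)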
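The lookup-table analogy at the top of this card can be made concrete without any training. The sketch below is not a neural network: it compares a hypothetical `memorizer` (a dict with one entry per training example) against the closed-form rule on addition mod 97, to show what "100% on train, chance level on validation" looks like in miniature. The 30% train split, the fixed fallback guess of 0, and all function names are assumptions made for illustration.

```python
import random

P = 97  # modulus from the text; the 30% train fraction below is an assumption
pairs = [(a, b) for a in range(P) for b in range(P)]
random.Random(0).shuffle(pairs)
n_train = int(0.3 * len(pairs))
train, val = pairs[:n_train], pairs[n_train:]

# "Memorize": one stored entry per training example (a stand-in for the large-norm solution).
table = {(a, b): (a + b) % P for a, b in train}

def memorizer(a, b):
    # Unseen inputs fall back to a fixed guess, which is right only about 1/P of the time.
    return table.get((a, b), 0)

def algorithm(a, b):
    # "Generalize": the compact rule itself (a stand-in for the small-norm solution).
    return (a + b) % P

def accuracy(predict, data):
    return sum(predict(a, b) == (a + b) % P for a, b in data) / len(data)

mem_train, mem_val = accuracy(memorizer, train), accuracy(memorizer, val)
alg_train, alg_val = accuracy(algorithm, train), accuracy(algorithm, val)
print(f"memorizer: train={mem_train:.3f} val={mem_val:.3f} stored entries={len(table)}")
print(f"algorithm: train={alg_train:.3f} val={alg_val:.3f} stored entries=0")
```

Both solutions are indistinguishable on the training set; only the held-out pairs separate them, which is why training accuracy alone cannot reveal when the switch happens.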
Like a connection pool or a hash table, "just barely enough" is the worst place to be. When model capacity exactly equals the number of training samples (the interpolation threshold), the system is forced into a single solution that must contort itself through every noisy point — like forcing a high-degree polynomial through N points, violently oscillating. With capacity far above the sample count, infinitely many fitting solutions exist, and SGD's implicit bias picks the smoothest, lowest-norm one — which is actually more robust. The worst spot is "not too much, not too little."
The classical bias-variance tradeoff predicts test error as a U-shape in model capacity: too small underfits, too large overfits. But in deep learning, models with billions of parameters are overparameterized to the point of memorizing the whole dataset yet still generalize well, which directly contradicts the U-shape.
Belkin et al. 2019 and Nakkiran et al. 2019 (Deep Double Descent) give a unified picture: test error first falls then rises (classical U), peaks at the interpolation threshold (where the model can just achieve 0 training error), and then descends again as capacity grows further — hence double descent.
Mechanism (intuition): at the interpolation threshold, the model's degrees of freedom exactly equal the constraints, so the solution is unique and forced to fit the noise too, and variance explodes. Past the threshold, the extra degrees of freedom give SGD room to choose, and its implicit regularization favors low-norm / flat solutions, which acts like an automatic Occam's razor. Nakkiran et al. also show that double descent appears not only with model size but with the number of training epochs (epoch-wise), and that in some regimes more data actually hurts.
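    import numpy as np
    from sklearn.linear_model import Ridge

    # random-feature regression: sweep #features (= capacity), watch double descent in test error
    rng = np.random.default_rng(0)
    n_train, n_test, d = 100, 1000, 20       # fix 100 training samples; n_test and d are illustrative
    beta = rng.standard_normal(d)            # assumed ground truth: a noisy linear target
    X_tr, X_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
    y_tr = X_tr @ beta + 0.5 * rng.standard_normal(n_train)
    y_te = X_te @ beta + 0.5 * rng.standard_normal(n_test)

    errs = []
    for n_feat in [10, 50, 90, 100, 110, 300, 1000]:        # capacity crosses the interpolation threshold (= 100)
        W = rng.standard_normal((d, n_feat)) / np.sqrt(d)   # random feature map
        # ReLU added so the feature rank can exceed d; a purely linear map caps capacity at d
        Z_tr, Z_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
        m = Ridge(alpha=1e-6, fit_intercept=False).fit(Z_tr, y_tr)   # near-zero reg, close to the interpolating solution
        errs.append(((m.predict(Z_te) - y_te) ** 2).mean())
        print(n_feat, errs[-1])
    # errs typically peaks near n_feat ≈ 100 (= n_train) and is lower on both sides → double descent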
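The "room to choose" past the interpolation threshold can be seen in the smallest possible case: one training constraint and two weights. This sketch uses made-up numbers and hypothetical helper names; it lists several weight vectors that all fit the single sample exactly and shows that the minimum-norm one, the kind of solution the text says SGD's implicit bias favors, is one particular point among them. It illustrates the geometry only and does not run SGD.

```python
import math

# One "training sample": features x = (1, 2), target y = 5 (illustrative numbers).
x, y = (1.0, 2.0), 5.0

# Minimum-norm interpolating solution: w_min = x * y / (x . x)
xx = x[0] ** 2 + x[1] ** 2
w_min = (x[0] * y / xx, x[1] * y / xx)

# Every other interpolating solution is w_min + t * n, where n is orthogonal to x.
n = (-x[1], x[0])

def solution(t):
    return (w_min[0] + t * n[0], w_min[1] + t * n[1])

def fit_error(w):
    return abs(w[0] * x[0] + w[1] * x[1] - y)

def norm(w):
    return math.hypot(w[0], w[1])

for t in (-2.0, -1.0, 0.0, 1.0, 2.0):
    w = solution(t)
    print(f"t={t:+.0f}  w=({w[0]:+.2f}, {w[1]:+.2f})  "
          f"fit_error={fit_error(w):.1e}  norm={norm(w):.3f}")
```

Every row has zero training error, so the data cannot distinguish the solutions; the preference for t = 0 has to come from somewhere else, which is the role the text assigns to implicit regularization.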
Like percolation in a random graph. Add edges one by one and connectivity grows smoothly — until at some critical edge density, a "giant connected component" spanning the whole graph suddenly appears. No single edge is special, but macroscopic connectivity undergoes a phase transition at the critical point. Some LLM abilities behave exactly this way with scale: add "edges" smoothly (parameters / data / compute), and past some critical scale the ability emerges.
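The percolation picture is easy to simulate. The following is a minimal sketch, assuming a simple random-graph model where each edge joins two uniformly chosen nodes (self-loops and repeated edges are allowed for simplicity); the function name, graph size, and list of mean degrees are illustrative choices. It tracks the share of nodes in the largest connected component with a union-find structure.

```python
import random

def largest_component_fraction(n, m, seed=0):
    """Add m random edges to n isolated nodes; return the largest component's share of nodes."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    largest = 1
    for _ in range(m):
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a            # merge the smaller tree into the larger one
            size[a] += size[b]
            largest = max(largest, size[a])
    return largest / n

n = 20_000  # illustrative graph size
for mean_degree in (0.5, 0.9, 1.0, 1.1, 1.5, 2.0):
    m = int(mean_degree * n / 2)     # mean degree = 2 * edges / nodes
    print(mean_degree, round(largest_component_fraction(n, m), 3))
```

In this model the critical point is a mean degree of 1: below it the largest component is a vanishing share of the graph, above it a single component holds a finite share, even though each individual edge is added in exactly the same way.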
Scaling laws (Day 14) say pretraining loss falls smoothly and predictably with scale. But Wei et al. 2022 (Emergent Abilities) found that some downstream abilities (multi-digit arithmetic, chain-of-thought reasoning, instruction following) stay at chance level on small models and appear suddenly only past a certain scale threshold — they can't be extrapolated from smaller models. Beneath the smooth loss curve hides a non-smooth ability curve.
Mechanism (intuition): by analogy with physical phase transitions — water temperature rises smoothly, but at 0°C / 100°C there's a discontinuous change of state. The "physical basis" of emergence is still debated, but one explanation is: a complex ability needs several sub-circuits in place at once, and missing any one fails the whole; scale smoothly raises each sub-circuit's success probability p, but their conjunction (AND) is highly nonlinear, so the joint success rate spikes at some point. The code below computes exactly this intuition — "smooth p amplified by AND into a cliff" — and it leads straight into the next card's debate.
    import numpy as np

    scale = np.linspace(1, 10, 100)          # model scale (a proxy on a log axis)
    p = 1 / (1 + np.exp(-(scale - 5)))       # per-subskill success rate: rises "smoothly" with scale
    for k in [1, 5, 20]:                     # task needs k subskills to all hold (AND)
        task_acc = p ** k                    # conjunction → highly nonlinear
        # larger k makes task_acc look more like a "cliff" at some scale:
        # smooth p manufactures the appearance of emergence
        print(k, np.round(task_acc[[30, 50, 70]], 3))
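One way to read the AND model is to solve for the scale at which the task first succeeds half the time. Under the same toy logistic p as in the code above (midpoint 5, unit slope, both arbitrary choices), p^k = 0.5 requires p = 0.5^(1/k), and inverting the logistic gives the corresponding scale. The function name `half_point_scale` is hypothetical.

```python
import math

def half_point_scale(k, midpoint=5.0):
    """Scale at which p**k == 0.5, for p = 1 / (1 + exp(-(scale - midpoint)))."""
    p_needed = 0.5 ** (1.0 / k)                               # per-subskill rate required
    return midpoint + math.log(p_needed / (1.0 - p_needed))   # inverse of the logistic

for k in (1, 5, 20):
    s = half_point_scale(k)
    p_at_s = 1.0 / (1.0 + math.exp(-(s - 5.0)))
    print(f"k={k:2d}  needs p={0.5 ** (1 / k):.3f}  "
          f"half-point scale={s:.2f}  check p**k={p_at_s ** k:.3f}")
```

The half-point moves to larger scales as k grows, so across the range where each subskill is already visibly improving, the task-level score can still sit near zero.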
Like a threshold alert masking a smooth signal. Underlying latency creeps smoothly from 90ms to 110ms, but your "p99 < 100ms" boolean SLA flips from green to red suddenly — it's the metric's quantization that manufactures the "cliff"; the underlying signal is actually continuous. Schaeffer's claim is exactly this: ability improves smoothly, but quantize it with an "all-or-nothing" discrete metric and a fake cliff appears.
If emergence really is unpredictable, it's a nightmare for safety and planning — you can't know what ability the next scale will produce. Schaeffer et al. 2023 (Are Emergent Abilities a Mirage?, NeurIPS Outstanding Paper) argues that many "emergences" are caused by the researcher's chosen metric, not by a fundamental jump in model behavior.
Mechanism (intuition): take 5-digit addition, where the per-token correctness probability p rises smoothly with scale. Use exact match (score only if all 5 digits are correct) and the score ≈ p^5; as p moves smoothly from 0.7 to 0.9, p^5 jumps from 0.17 to 0.59, and with more digits the discrete metric shows a steep jump. Switch to a continuous metric (token edit distance, log-likelihood) and the same model's curve becomes smooth. But the debate is unsettled: Wei and others respond that some abilities still show nonlinear kinks even under continuous metrics; and "can it be predicted" vs "is the metric smooth" are two separate questions. The middle-ground consensus: some emergence is a measurement artifact, some may be a real ability phase transition, and you have to check case by case with multiple metrics.
    import numpy as np

    scale = np.linspace(1, 10, 100)
    p = 1 / (1 + np.exp(-(scale - 5)))   # per-token correctness: rises smoothly
    exact_match = p ** 5                 # discrete metric: all 5 tokens correct → fake cliff
    token_acc = p                        # continuous metric: per-token → smooth
    # same underlying model: exact_match looks "emergent", token_acc looks like a smooth ramp
    print("discrete", np.round(exact_match[::25], 3))
    print("continuous", np.round(token_acc[::25], 3))
    # the curve's shape is half the model, half the ruler you picked
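The 0.17 → 0.59 arithmetic from the mechanism paragraph can be checked directly and extended to longer answers. This is a small sketch that assumes each token is correct independently with the same probability p, which is a simplification; the function names are hypothetical. It prints exact match next to the log-likelihood of the fully correct answer for the same two values of p.

```python
import math

def exact_match(p, n_tokens):
    """Probability that all n_tokens are right, assuming independent per-token accuracy p."""
    return p ** n_tokens

def sequence_log_likelihood(p, n_tokens):
    """Log-probability of the fully correct answer under the same independence assumption."""
    return n_tokens * math.log(p)

for n_tokens in (1, 5, 10, 20):
    lo, hi = exact_match(0.7, n_tokens), exact_match(0.9, n_tokens)
    print(f"{n_tokens:2d} tokens: exact match {lo:.3f} -> {hi:.3f}   "
          f"log-likelihood {sequence_log_likelihood(0.7, n_tokens):.2f} -> "
          f"{sequence_log_likelihood(0.9, n_tokens):.2f}")
```

In this toy the log-likelihood is just the logarithm of exact match, so both columns carry the same information about the model. Only the ruler differs: the log scale spreads out the low region that exact match compresses toward zero.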