AI/ML Explained: Anomalies in Training

Day 44 · 2026-06-30 · Level: Advanced / Theory
For: engineers with coding experience, not specialized in AI
Today is about mechanism and phenomena: why training curves show counterintuitive kinks like "memorize for ages, then suddenly click," "bigger is better," and "abilities that appear out of nowhere." These complement Day 14's scaling laws (smooth and predictable) — they are the non-smooth cracks in that otherwise smooth picture.

Grokking: Delayed Generalization

phase transition · regularization
One-line analogy

Picture refactoring under a "code-size budget." The model first solves every training example with a giant hardcoded lookup table — 100% on the training set, every new input wrong. A background linter (this is weight decay) keeps penalizing "code size." Under sustained pressure the system eventually "discovers" the real algorithm (a compact function) that is both smaller and generalizes. The jump is sudden because the compact algorithm only works once it is fully formed — a half-built algorithm is worse than the lookup table.

What problem it solves + how it works

Common sense says "training past overfitting only makes things worse." Grokking (Power et al. 2022, OpenAI; "grok" is sci-fi slang for "deeply understand") gives a counterexample on small algorithmic datasets (e.g. "addition mod 97"): training accuracy hits 100% within a few hundred steps, validation accuracy stays flat at chance level, and then, after orders of magnitude more steps, validation accuracy suddenly jumps from chance (~1%) to ~100%.

Figure: accuracy vs. training steps (log axis).
train: reaches 100% within hundreds of steps (memorization, overfit)
val: stays at chance (~1%) for tens of thousands of steps, then jumps to 100% (grokking: the generalizing circuit forms)

Mechanism (intuition): the network has two ways to solve the task. (1) Memorize: store each training example as a lookup table in the weights; large norm, no generalization. (2) Generalize: learn the underlying structure (for modular addition, a set of Fourier-style "trigonometric-identity circuits"); small norm, generalizes to held-out pairs. SGD reaches the easier memorization solution first; weight decay then keeps applying a norm penalty and slowly pushes the weights toward the more compact generalizing solution. Nanda et al. 2023 reverse-engineered exactly this circuit with mechanistic interpretability, showing that the generalizing circuit grows gradually and is only masked by the "memorize → clean up" process, which is why macro-level metrics look like a sudden jump. In these experiments, a model trained without weight decay stays at the memorization solution for the whole training budget and does not grok, which makes regularization the key variable here.
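To make the "trigonometric-identity circuit" concrete, here is a minimal hand-written sketch of that style of algorithm in NumPy. It is not extracted from any trained model: the frequency set `FREQS` and the function name `fourier_mod_add` are hypothetical, chosen only for illustration. It shows that angle-addition identities plus an argmax are enough to compute (a + b) mod 97 with a handful of numbers instead of a lookup table.

```python
import numpy as np

P = 97                          # modulus from the text
FREQS = np.array([1, 5, 14])    # hypothetical "key frequencies"; a trained model picks its own


def fourier_mod_add(a: int, b: int) -> int:
    """Compute (a + b) % P using only trig identities and an argmax."""
    w = 2 * np.pi * FREQS / P
    # Angle-addition identities: build cos/sin of w*(a+b) from per-input features.
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    c = np.arange(P)[:, None]   # every candidate answer, shape (P, 1)
    # logits[c] = sum_k cos(w_k * (a + b - c)), which is maximal when c == (a + b) mod P.
    logits = (cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)).sum(axis=1)
    return int(np.argmax(logits))


print(fourier_mod_add(40, 80), (40 + 80) % P)
```

The whole "program" is three frequencies and a few multiplications, which is the sense in which the generalizing solution is a shorter description than a table of every (a, b) pair.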

Code example
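import torch, torch.nn as nn
# addition mod 97, (a + b) % 97: grokking is most reliably reproduced on small algorithmic tasks (sketch)
P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))   # all 97*97 input pairs (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm, n_tr = torch.randperm(len(pairs)), int(0.3 * len(pairs))   # assumed split: 30% train, 70% val
x_train, y_train = pairs[perm[:n_tr]], labels[perm[:n_tr]]
x_val, y_val = pairs[perm[n_tr:]], labels[perm[n_tr:]]
model = nn.Sequential(nn.Embedding(P, 128), nn.Flatten(),        # (N, 2) -> (N, 2, 128) -> (N, 256)
                      nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, P))
loss_fn = nn.CrossEntropyLoss()
# key: nonzero weight_decay is what drives grokking in this kind of setup; without it the model tends to stay memorizing
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

for step in range(100_000):          # steps must far exceed the moment train acc hit 100%
    loss = loss_fn(model(x_train), y_train)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 1000 == 0:             # log validation accuracy so the delayed jump is visible
        with torch.no_grad():
            val_acc = (model(x_val).argmax(-1) == y_val).float().mean().item()
        print(step, round(loss.item(), 4), round(val_acc, 3))
    # pattern to look for in the grokking regime: train acc hits 100% in hundreds of steps,
    # val acc stays flat for tens of thousands, then "suddenly" jumps to ~100%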
Common misconception + practical scenario
Misconception: "validation accuracy isn't moving = it can't be trained, time to stop." In the grokking regime this stops you right at the memorization stage and misses the later generalization jump. But don't over-read it either: grokking appears under specific conditions (small data, strong regularization, clean algorithmic structure) — it doesn't mean every task should be "trained forever." On large real tasks early stopping is still common.
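The stopping trap can be sketched on a purely synthetic validation curve. Everything here is an assumption for illustration: `val_acc_at` is a made-up step function with the jump placed at step 60,000, and `early_stop_step` is a generic patience rule, not any particular library's callback.

```python
def val_acc_at(step: int, grok_step: int = 60_000) -> float:
    """Synthetic validation curve: chance (~1%) until grok_step, then ~100%."""
    return 0.01 if step < grok_step else 0.99


def early_stop_step(patience: int, eval_every: int = 1_000, max_steps: int = 100_000):
    """Return the step where patience-based early stopping halts, or None if it never fires."""
    best, bad_evals = float("-inf"), 0
    for step in range(0, max_steps + 1, eval_every):
        acc = val_acc_at(step)
        if acc > best:
            best, bad_evals = acc, 0      # improvement: reset the counter
        else:
            bad_evals += 1                # no improvement at this evaluation
            if bad_evals >= patience:
                return step
    return None


print(early_stop_step(patience=10))    # a short patience halts during the flat stretch
print(early_stop_step(patience=100))   # a long patience survives until the jump
```

The rule itself is fine; what changes the outcome is whether the patience window is shorter than the plateau, which you usually cannot know in advance on a real task.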
📌 Cross-disciplinary scenario: grokking is a mathematical model of "insight." No visible progress for a long time (rote memorization), then a sudden click (grasping the underlying structure). Compare it with the Buddhist distinction between "gradual cultivation" and "sudden awakening" — awakening may be exactly that phase transition appearing once continuous effort crosses a "norm threshold": steady work shows up as a discontinuous experience at the critical point.
Takeaway + question
💡 Generalization isn't "remembering more" — it's "finding a shorter description," the algorithmic version of Occam's razor, quietly enforced by weight decay.
🤔 When you self-study a new field, do you also see this lag between the "rote" stage and the "click" stage? What is the weight decay quietly pressing on your own brain?

Double Descent

generalization theory · overparameterization
One-line analogy

Like a connection pool or a hash table, "just barely enough" is the worst place to be. When model capacity exactly equals the number of training samples (the interpolation threshold), the system is forced into a single solution that must contort itself through every noisy point — like forcing a high-degree polynomial through N points, violently oscillating. With capacity far above the sample count, infinitely many fitting solutions exist, and SGD's implicit bias picks the smoothest, lowest-norm one — which is actually more robust. The worst spot is "not too much, not too little."

What problem it solves + how it works

The classical bias-variance tradeoff predicts test error as a U-shape in model capacity: too small underfits, too large overfits. But in deep learning, models with billions of parameters are overparameterized to the point of memorizing the whole dataset yet still generalize well, which directly contradicts the U-shape.

Belkin et al. 2019 and Nakkiran et al. 2019 (Deep Double Descent) give a unified picture: test error first falls then rises (classical U), peaks at the interpolation threshold (where the model can just achieve 0 training error), and then descends again as capacity grows further — hence double descent.

Figure: test error vs. model capacity.
small: underfit
medium: classical U-valley
≈ #samples: interpolation threshold = peak (worst!)
larger: second descent
huge: overparameterized, optimal
↑ the danger zone isn't "too big"; it's "just barely enough"

Mechanism (intuition): at the interpolation threshold, the model's degrees of freedom exactly equal the constraints, the solution is unique and forced to fit the noise too, so variance explodes. Past the threshold, the extra degrees of freedom give SGD room to choose, and its implicit regularization favors low-norm / flat solutions — effectively automatic Occam. Nakkiran also shows double descent appears not only with model size but with the number of training epochs (epoch-wise), and that in some regimes more data actually hurts.
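One way to see why variance explodes at the threshold is to look at how ill-conditioned the feature matrix becomes there. The sketch below assumes a plain Gaussian random matrix as a stand-in for the features (an assumption, not the setup of either paper) and compares the condition number when the feature count is half, equal to, and double the sample count. A least-squares fit amplifies label noise by roughly 1 / (smallest singular value), so a large condition number means a noise-sensitive solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 100                                   # fixed number of training samples


def condition_number(n_feat: int) -> float:
    """Ratio of largest to smallest singular value of an n_train x n_feat Gaussian matrix."""
    Z = rng.standard_normal((n_train, n_feat))
    s = np.linalg.svd(Z, compute_uv=False)      # singular values, sorted descending
    return float(s[0] / s[-1])


conds = {n_feat: condition_number(n_feat) for n_feat in (50, 100, 200)}
print({k: round(v, 1) for k, v in conds.items()})
```

The square case (100 features for 100 samples) is the badly conditioned one; both the clearly smaller and the clearly larger matrix are far better behaved, which mirrors the peak-in-the-middle shape of the test-error curve.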

Code example
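import numpy as np
from sklearn.linear_model import Ridge
# random-feature regression: sweep #features (=capacity), watch double descent in test error
# synthetic data below is an assumption for illustration: noisy linear target in d = 20 dimensions
rng = np.random.default_rng(0)
n_train, n_test, d, errs = 100, 1000, 20, []   # fix 100 training samples
X_tr, X_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
w_true = rng.standard_normal(d)
y_tr = X_tr @ w_true + 0.5 * rng.standard_normal(n_train)   # label noise is what the peak amplifies
y_te = X_te @ w_true
feats = [10, 50, 90, 100, 110, 300, 1000]      # capacity crosses interpolation threshold (=100)
for n_feat in feats:
    W = rng.standard_normal((d, n_feat)) / np.sqrt(d)       # random feature map
    # ReLU makes the features nonlinear; a purely linear map would cap capacity at rank d
    Z_tr, Z_te = np.maximum(X_tr @ W, 0), np.maximum(X_te @ W, 0)
    m = Ridge(alpha=1e-6, fit_intercept=False).fit(Z_tr, y_tr)   # near-zero reg, forces the interpolating solution
    errs.append(((m.predict(Z_te) - y_te)**2).mean())
print(dict(zip(feats, np.round(errs, 3))))
# errs typically peaks near n_feat ≈ 100 (= n_train) and is lower on both sides → double descent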
Common misconception + practical scenario
Misconception: "bigger models overfit more easily." In the overparameterized regime it's often the opposite — on fixed data, bigger tends to be better. But that doesn't mean "blindly scale up" always wins: the real danger zone is the medium capacity that "just barely fits," where variance is highest and sensitivity to noise is greatest. What you want to avoid is the critical point, not large models.
📌 Decision-support scenario: double descent offers a counterintuitive mental model — "stuck at the critical point" is often the most fragile. Many systems share this structure: team size exactly equal to the number of tasks, a portfolio concentrated at a critical count, a connection pool exactly equal to peak concurrency... "not too much, not too little" is frequently the worst choice. Either keep ample slack or deliberately constrain.
Takeaway + question
💡 "Just barely enough" is often the worst engineering point: either clearly under-resourced (controlled underfitting) or substantially over-resourced (let implicit regularization work).
🤔 In your systems, which "exactly matched" design points are sitting on their own "interpolation threshold" — most fragile yet unnoticed?

Phase Transitions & Emergence

scale · complexity
One-line analogy

Like percolation in a random graph. Add edges one by one and connectivity grows smoothly — until at some critical edge density, a "giant connected component" spanning the whole graph suddenly appears. No single edge is special, but macroscopic connectivity undergoes a phase transition at the critical point. Some LLM abilities behave exactly this way with scale: add "edges" smoothly (parameters / data / compute), and past some critical scale the ability emerges.
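The percolation picture is easy to simulate. The sketch below is a minimal, assumption-laden version: a random graph on 2,000 nodes where edges are drawn uniformly at random, tracked with a small union-find. The function name `largest_component_fraction` and the parameter choices are illustrative only. It reports what fraction of nodes sits in the largest connected component as the mean degree grows.

```python
import random


def largest_component_fraction(n: int, mean_degree: float, seed: int = 0) -> float:
    """Add mean_degree * n / 2 random edges to n nodes; return largest component size / n."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving keeps trees shallow
            x = parent[x]
        return x

    for _ in range(int(mean_degree * n / 2)):   # each edge contributes 2 to the total degree
        a, b = find(rng.randrange(n)), find(rng.randrange(n))
        if a != b:                              # merge the smaller component into the larger
            if size[a] < size[b]:
                a, b = b, a
            parent[b] = a
            size[a] += size[b]
    return max(size[find(x)] for x in range(n)) / n


for c in (0.5, 1.0, 2.0, 4.0):
    print(c, round(largest_component_fraction(2000, c), 3))
```

Each added edge is as unremarkable as the last, yet the largest-component fraction moves from a tiny sliver to most of the graph once the mean degree passes roughly 1, the classical threshold for random graphs of this kind.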

What problem it solves + how it works

Scaling laws (Day 14) say pretraining loss falls smoothly and predictably with scale. But Wei et al. 2022 (Emergent Abilities) found that some downstream abilities (multi-digit arithmetic, chain-of-thought reasoning, instruction following) stay at chance level on small models and appear suddenly only past a certain scale threshold — they can't be extrapolated from smaller models. Beneath the smooth loss curve hides a non-smooth ability curve.

Mechanism (intuition): by analogy with physical phase transitions — water temperature rises smoothly, but at 0°C / 100°C there's a discontinuous change of state. The "physical basis" of emergence is still debated, but one explanation is: a complex ability needs several sub-circuits in place at once, and missing any one fails the whole; scale smoothly raises each sub-circuit's success probability p, but their conjunction (AND) is highly nonlinear, so the joint success rate spikes at some point. The code below computes exactly this intuition — "smooth p amplified by AND into a cliff" — and it leads straight into the next card's debate.

Code example
import numpy as np
scale = np.linspace(1, 10, 100)             # model scale (a proxy on a log axis)
p = 1 / (1 + np.exp(-(scale - 5)))         # per-subskill success rate: rises "smoothly" with scale

for k in [1, 5, 20]:                       # task needs k subskills to all hold (AND)
    task_acc = p ** k                       # conjunction → highly nonlinear
    # larger k makes task_acc look more like a "cliff" at some scale
    # — smooth p manufactures the appearance of emergence
    print(k, np.round(task_acc[[30, 50, 70]], 3))
Common misconception + practical scenario
Misconception: "emergence = the model suddenly acquired a qualitatively new cognition / proof of AGI awakening." This is over-mystification. Emergence is first of all an empirical observation; whether it's "real, and why it happens" is an open question (see the next card). Reading a kink in a curve as the emergence of consciousness mistakes a measurement phenomenon for a metaphysical event.
📌 Complexity-science scenario: emergence, criticality, and self-organized criticality all meet here — right in the bullseye of cross-disciplinary interest. But precisely because it's so seductive, stay vigilant — the next card explains that the "emergence" you observe may be partly a measurement artifact, not entirely a phase transition in the world itself.
Takeaway + question
💡 Emergence may be a true phase transition, or a measurement prism — both look identical on the curve, so you must distinguish them with mechanism and multiple metrics.
🤔 How do you tell apart "the system really underwent a phase transition" from "your chosen metric merely manufactured a cliff"?

Is Emergence a Mirage? The Mirage Debate

metrics · debate
One-line analogy

Like a threshold alert masking a smooth signal. Underlying latency creeps smoothly from 90ms to 110ms, but your "p99 < 100ms" boolean SLA flips from green to red suddenly — it's the metric's quantization that manufactures the "cliff"; the underlying signal is actually continuous. Schaeffer's claim is exactly this: ability improves smoothly, but quantize it with an "all-or-nothing" discrete metric and a fake cliff appears.

What problem it solves + how it works

If emergence really is unpredictable, it's a nightmare for safety and planning — you can't know what ability the next scale will produce. Schaeffer et al. 2023 (Are Emergent Abilities a Mirage?, NeurIPS Outstanding Paper) argues that many "emergences" are caused by the researcher's chosen metric, not by a fundamental jump in model behavior.

Figure: same underlying model, two rulers tell two stories.
continuous metric (token accuracy / log-likelihood): smooth ramp
discrete metric (exact match: all-correct only = p^5): fake cliff (looks emergent)

Mechanism (intuition): take 5-digit addition, where the per-token correctness probability p rises smoothly with scale. Use exact match (score only if all 5 digits are correct) and the score ≈ p^5; as p moves smoothly from 0.7 to 0.9, p^5 jumps from 0.17 to 0.59, and the more digits the task has, the steeper the jump the discrete metric shows. Switch to a continuous metric (token edit distance, log-likelihood) and the same model's curve becomes smooth. But the debate is unsettled: Wei and others respond that some abilities still show nonlinear kinks even under continuous metrics, and "can it be predicted" vs "is the metric smooth" are two separate questions. The middle-ground consensus: some emergence is a measurement artifact, some may be a real ability phase transition, and you have to check case by case with multiple metrics.
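The link between a smooth loss curve and a cliff-shaped exact-match curve can be sketched directly. The assumptions are loud ones: the power-law constants `N_C` and `ALPHA` are made up for illustration, and the sketch treats every token as having the same cross-entropy loss, so the per-token probability of the correct token is exp(-loss) and a k-token exact match succeeds with probability exp(-k * loss).

```python
import numpy as np

# Hypothetical power-law constants, chosen only so the numbers are easy to read.
N_C, ALPHA = 1e9, 0.5


def per_token_loss(n_params):
    """Smooth, power-law-shaped cross-entropy loss as a function of model size."""
    return (N_C / n_params) ** ALPHA


n_params = np.logspace(8, 12, 9)               # model sizes from 1e8 to 1e12
p_token = np.exp(-per_token_loss(n_params))    # probability of the correct token

for k in (1, 5, 20):                           # answer length in tokens
    exact_match = p_token ** k                 # all k tokens correct = exp(-k * loss)
    print(k, np.round(exact_match, 3))
```

Nothing in the underlying loss has a kink. The longer the required answer, the longer the exact-match score sits near zero before rising, which is the shape usually read as "emergent."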

Code example
import numpy as np
scale = np.linspace(1, 10, 100)
p = 1 / (1 + np.exp(-(scale - 5)))   # per-token correctness: rises smoothly

exact_match = p ** 5                  # discrete metric: all 5 tokens correct → fake cliff
token_acc   = p                       # continuous metric: per-token → smooth
# same underlying model: exact_match looks "emergent", token_acc looks like a smooth ramp
print("discrete", np.round(exact_match[::25], 3))
print("continuous", np.round(token_acc[::25], 3))
# the curve's shape is half the model, half the ruler you picked
Common misconception + practical scenario
Misconception: reading this debate as "emergence has been debunked" or "emergence is real" — both oversimplify. The correct reading: the apparent suddenness of emergence is highly sensitive to the choice of metric; this neither denies that ability is improving nor proves a mysterious phase transition. Its real lesson is about measurement, not about whether the model has a "soul."
📌 AI-super-individual / decision scenario: when evaluating models, teams, or even yourself, beware of fake cliffs and fake plateaus manufactured by discrete metrics. A person's ability may be accumulating smoothly, but a pass/fail KPI makes it look like "stuck for ages, then a sudden breakout." To see the true slope, switch to a continuous ruler.
Takeaway + question
💡 A "cliff" is often carved by your ruler, not the shape of the world — switch rulers and the cliff may become a gentle slope.
🤔 Which pass/fail metrics in your daily life are disguising smooth progress as stagnation or a sudden jump?

Further Reading

Deep Questions

1. Grokking and double descent are both sudden changes in the "late-training / overparameterized" regime. Are they two facets of the same phenomenon?
They share a deep mechanism: among many feasible solutions, the training dynamics eventually favor the "simpler / smoother" one. Double descent looks along the capacity axis — past the interpolation threshold, the extra degrees of freedom let SGD's implicit regularization pick a low-norm solution, so error descends a second time. Grokking looks along the time axis — in a fixed overparameterized model, weight decay slowly moves the weights from the "memorization solution" to the "generalization solution" over many steps. They're two sides of one coin: double descent says "when capacity is large enough, good solutions exist"; grokking says "when a good solution exists, it takes time + regularization to push you across to it." Viewed together, the core in both is implicit bias performing Occam selection in overparameterized space — one viewed across capacity, the other across time. That's why recent work (including epoch-wise double descent) increasingly braids these two threads.
2. If emergence is partly a measurement artifact, how do scaling laws' "predictability" and emergence's "unpredictability" coexist?
The key is to separate two layers: "pretraining loss" and "downstream task metric." Scaling laws describe pretraining cross-entropy loss — it does fall smoothly and extrapolably with scale, because loss is a continuous, per-token averaged quantity. Downstream "unpredictable emergence" usually appears on discrete, thresholded task metrics (exact match, accuracy). Schaeffer's argument: the underlying layer (loss / per-token probability) is smooth and predictable all along; it's the discrete ruler upstairs that turns smoothness into a cliff. So the two don't conflict — they measure different layers. What remains open: can all emergence be "reduced" to a smooth underlying layer this way? Apparently not fully — some abilities still show nonlinear kinks under continuous metrics, possibly involving a real representational phase transition. The honest state of affairs: loss-predictable ≠ ability-predictable, and that gap is the technical root of the "capability surprise" worry in AI safety.
3. SGD's "implicit regularization" is the lead actor in both grokking and double descent — what exactly does it prefer? Why does nobody write the term explicitly, yet it works everywhere?
This is modern deep learning theory's most fascinating and most unfinished question. Empirically, SGD (especially with weight decay) in overparameterized networks favors low-norm, flat minima, low-complexity solutions — even with infinitely many solutions driving training error to 0, it systematically lands on the "simpler" ones. Why? A few clues: (a) gradient noise acts like temperature, biasing optimization toward wide basins rather than sharp spikes (sharp solutions are perturbation-sensitive and generalize poorly — the sharpness-generalization link from Day 43); (b) for linear / matrix-factorization models one can prove gradient descent converges to the minimum-norm solution; (c) weight decay explicitly amplifies this preference. But "nobody writes the term" is precisely the point — it isn't an explicit regularizer in the loss, but an implicit constraint induced jointly by the optimizer + architecture + initialization. Fully characterizing it (on general nonlinear networks) is still open. For BigCat's distributed intuition: it's a bit like "eventual consistency" — not hardcoded in one line, but an emergent property of protocol dynamics — the behavior lives in the process, not in a declaration.
4. From complexity science, are the "phase transitions" in neural network training truly isomorphic to physical phase transitions (Ising model, percolation), or just a seductive analogy?
Somewhere in between, and the boundary is sharpening fast. At the analogy level: the "memorization phase → generalization phase" in the loss landscape, and the interpolation threshold on the capacity axis, do formally resemble physics' critical points, order parameters, and symmetry breaking — all "macroscopic behavior changes abruptly once a control parameter crosses a critical value." Evidence for true isomorphism: some simplified models (random features, deep linear nets, certain SSMs) admit approximate analytic phase diagrams with computable critical exponents that connect to the renormalization group and random matrix theory in statistical physics; percolation in particular has a concrete sub-circuit-connectivity model supporting "ability emergence." But beware: real deep nonlinear networks are extremely high-dimensional, the loss is non-convex, and the dynamics are far from equilibrium — many strict premises of physical phase transitions (thermal equilibrium, a definable free energy) don't hold. The pragmatic judgment: physics is the best scaffolding language we have — giving us falsifiable concepts like order parameter, criticality, universality class — but "isomorphism" must be verified case by case, not assumed because the pictures look alike. This is exactly the frontier where BigCat's interests across physics / complexity / distributed systems can shine.
5. What do these anomalies mean for "do we actually understand deep learning"?
They jointly point to a humbling fact: we can build deep learning systems long before we can explain them. Grokking shows "a flat training curve" doesn't mean "nothing is being learned"; double descent shows decades of bias-variance intuition fail in the overparameterized regime; the emergence-vs-metric debate shows we're still arguing over how to measure whether an ability even exists and when it appears. These aren't footnotes but three cracks, reminding us that current generalization theory is more post-hoc description than ahead-of-time prediction. Flip it around, though, and they're the best research entry points: each anomaly is a window, and mechanistic interpretability (e.g. Nanda reverse-engineering the grokking circuit) drills in through exactly these windows. For someone pursuing the "AI super-individual," real literacy isn't memorizing the names of these phenomena, but cultivating an instinct — when you see a beautiful curve, first ask "is this the shape of the world, or the shape of my ruler?" That skepticism outlasts any single conclusion.
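Clue (b) in question 3, that gradient descent on a linear model converges to the minimum-norm solution, can be checked numerically. This is a minimal sketch under stated assumptions: a random underdetermined least-squares problem (20 equations, 100 unknowns), plain full-batch gradient descent started from zero, and `np.linalg.pinv` as the reference for the minimum-norm interpolant. No regularization term appears anywhere in the loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                  # 20 equations, 100 unknowns: many exact solutions
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

w = np.zeros(d)                                 # starting at the origin matters for this result
lr = 1.0 / np.linalg.norm(X, 2) ** 2            # step size 1 / sigma_max^2, small enough to converge
for _ in range(5_000):
    w -= lr * X.T @ (X @ w - y)                 # gradient of 0.5 * ||Xw - y||^2

w_min = np.linalg.pinv(X) @ y                   # minimum-norm solution of Xw = y
print(np.linalg.norm(X @ w - y), np.linalg.norm(w - w_min))
```

Gradient descent from zero only ever moves within the row space of X, so it cannot pick up any null-space component; that is the whole "implicit regularizer" in this toy case. The open question in the text is how far this picture carries over to deep nonlinear networks.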