Day 10 covered how to walk — AdamW, learning rates, the basic mechanics of gradient descent. Today we go one layer deeper: what does the terrain of training actually look like? Why do some valleys generalize well and others poorly? And when Adam isn't enough, from which directions do frontier optimizers push? This is an issue about the geometry of optimization.
Training a neural net = finding a valley floor in a terrain with hundreds of millions of dimensions. But valleys come in two kinds: a wide, flat basin, or a narrow, deep crack. By analogy to your system configs — a flat minimum is like a high-tolerance config (one parameter drifts a bit, it still works), a sharp minimum is like a config balanced on a tightrope (any parameter off by a hair and the system collapses). How well a model generalizes depends largely on which kind of valley it lands in.
The loss function L(w) is a function of the parameters w (often hundreds of millions of dimensions); training is gradient descent searching for a minimum of L. But here's the puzzle: two models with training loss both near 0 — why does one generalize well on the test set and the other badly?
Key insight (the "flat minima" notion goes back to Hochreiter & Schmidhuber in the 1990s): the test-set loss surface is slightly shifted relative to the training surface (because train/test distributions differ a little). If the minimum you sit in is "flat," you're still in a low-loss region after the shift — good generalization. If it's "sharp," a small shift sends the loss soaring — poor generalization. Flat = robust to perturbations of the data distribution.
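To make the shift argument concrete, here is a minimal toy sketch in one dimension. The quadratic valley, the two curvature values and the shift size are all assumptions picked for illustration, not measurements from any real model.

```python
# Toy 1D picture: two minima with the same training loss (0) but different curvature.
def quadratic_loss(w, center, curvature):
    """Loss of a single quadratic valley: 0.5 * curvature * (w - center)^2."""
    return 0.5 * curvature * (w - center) ** 2

SHIFT = 0.1              # assumed train -> test shift of the surface along w
FLAT_CURVATURE = 1.0     # wide basin (assumed value)
SHARP_CURVATURE = 100.0  # narrow crack (assumed value)

# Both models sit exactly at their training minimum (w = 0, training loss 0).
# On the test surface the valley has moved by SHIFT, so we evaluate at the old w.
flat_test_loss = quadratic_loss(0.0, center=SHIFT, curvature=FLAT_CURVATURE)
sharp_test_loss = quadratic_loss(0.0, center=SHIFT, curvature=SHARP_CURVATURE)

print(f"flat  minimum: test loss = {flat_test_loss:.3f}")
print(f"sharp minimum: test loss = {sharp_test_loss:.3f}")
```

With these assumed numbers the same shift costs the flat minimum 0.005 and the sharp one 0.500: the penalty grows in proportion to the curvature.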
Li et al. 2018 used a technique called filter normalization to plot 2D slices of the high-dimensional loss surface and made a striking finding: ResNet's skip connections turn a terrain that is otherwise sharp, chaotic and ravine-ridden into a smooth basin. That gives a geometric explanation for why very deep nets became trainable once skip connections were added.
import copy
import torch

# Quantify the "sharpness" of the minimum a trained model sits in:
# randomly perturb the weights, see how much loss can rise
def sharpness(model, loss_fn, batch, rho=0.05, trials=10):
    with torch.no_grad():
        base = loss_fn(model, batch).item()
        worst = base
        for _ in range(trials):
            m2 = copy.deepcopy(model)
            for p in m2.parameters():
                # random direction, rescaled so the perturbation's norm is
                # rho times the parameter tensor's own norm
                d = torch.randn_like(p)
                p.add_(d * (rho * p.norm() / (d.norm() + 1e-12)))
            worst = max(worst, loss_fn(m2, batch).item())
    return worst - base  # how much higher the worst neighbor is = sharpness

# lower sharpness → flatter minimum → usually better generalization
A practical use for the sharpness probe above: at equal validation performance, the lower-sharpness version is likely to be more robust to live data drift and deserves priority for going live.

If flat minima are better, can we write "find flatness" directly into the training objective? That's exactly what SAM (Sharpness-Aware Minimization) does. It's like chaos engineering or fault injection: not content with "the current config runs," it actively asks "does it still run under the worst perturbation around me?" It deliberately picks the worst point in the neighborhood to optimize, which forces the model into a wide basin.
Plain SGD only minimizes the loss L(w) at the current point. But Keskar et al. 2016 found a famous phenomenon — the large-batch generalization gap: large-batch training tends to fall into sharp minima (because large batches have low gradient noise, removing the randomness that "shakes" the model out of cracks), hurting generalization.
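The "large batches have low gradient noise" part can be checked on a toy problem. The sketch below is an illustration under assumptions (a synthetic 1D dataset, a squared-error loss, hypothetical helper names); it only shows how the spread of the mini-batch gradient shrinks as the batch grows, not the generalization gap itself.

```python
import random
import statistics

def minibatch_gradient(w, data, batch_size, rng):
    """Gradient of mean(0.5 * (w - x)^2) over a random mini-batch: w - mean(batch)."""
    batch = rng.sample(data, batch_size)
    return w - sum(batch) / batch_size

def gradient_noise(batch_size, trials=2000, seed=0):
    """Standard deviation of the mini-batch gradient at a fixed w."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic dataset (assumed)
    grads = [minibatch_gradient(0.5, data, batch_size, rng) for _ in range(trials)]
    return statistics.stdev(grads)

small_batch_noise = gradient_noise(batch_size=8)
large_batch_noise = gradient_noise(batch_size=512)
print(f"batch   8: gradient std = {small_batch_noise:.3f}")
print(f"batch 512: gradient std = {large_batch_noise:.3f}")
```

The noise falls roughly as one over the square root of the batch size, so the large batch sees a much quieter gradient and gets far less of the "shaking" described above.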
Foret et al. 2020 proposed SAM, changing the objective from "minimize current loss" to a min-max problem:

$$\min_{w} \; \max_{\|\varepsilon\| \le \rho} L(w + \varepsilon)$$
Symbol by symbol: w is the parameters, ε is a perturbation vector added to the weights, ρ (rho) is the neighborhood radius (a hyperparameter, e.g. 0.05), ‖·‖ is a vector norm. The inner max means "within a ball of radius ρ, how high can the loss go"; the outer min means "minimize that worst-case value." Intuition: stop asking how low the loss is under your feet; ask how low the highest loss in the neighborhood is — only a wide, flat basin can keep the "neighborhood worst case" low too.
In practice each step needs two forward-backward passes: the first computes the gradient g and the ascent perturbation ε̂ = ρ·g/‖g‖ (one step of length ρ up the steepest slope); the second recomputes the gradient at w+ε̂, and that gradient is what actually updates the original weights w. The cost is roughly 2× the compute per step, traded for better generalization.
import torch

# One SAM step: climb to the worst neighbor first, then descend (two passes)
def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    optimizer.zero_grad()  # start from clean gradients
    loss_fn(model, batch).backward()
    # 1) compute perturbation ε̂ = ρ·g/‖g‖, push weights toward the "sharpest" direction
    params = [p for p in model.parameters() if p.grad is not None]
    gn = torch.norm(torch.stack([p.grad.norm() for p in params]))
    eps = {}
    with torch.no_grad():
        for p in params:
            e = p.grad * rho / (gn + 1e-12)
            eps[p] = e
            p.add_(e)
    optimizer.zero_grad()
    # 2) recompute gradient at the perturbed point: this is the real update direction
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p in params:
            p.sub_(eps[p])  # step back to origin
    optimizer.step()  # update using the "worst neighbor's" gradient → toward flat region
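As a sanity check on the ascent step ε̂ = ρ·g/‖g‖, the sketch below compares it with a brute-force search for the worst point at distance ρ. The 2D quadratic, its curvatures and the helper names are assumptions for illustration; on a real network the first-order step is only an approximation of the inner max.

```python
import math
import random

A, B = 10.0, 1.0  # curvatures of a toy 2D quadratic (assumed values)

def loss(w):
    return 0.5 * (A * w[0] ** 2 + B * w[1] ** 2)

def grad(w):
    return (A * w[0], B * w[1])

def sam_perturbation(w, rho):
    """First-order answer to the inner max: eps_hat = rho * g / ||g||."""
    g = grad(w)
    norm = math.hypot(g[0], g[1]) + 1e-12
    return (rho * g[0] / norm, rho * g[1] / norm)

def worst_on_sphere(w, rho, samples=5000, seed=0):
    """Brute-force the inner max by sampling points at distance rho from w."""
    rng = random.Random(seed)
    worst = loss(w)
    for _ in range(samples):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        candidate = (w[0] + rho * math.cos(theta), w[1] + rho * math.sin(theta))
        worst = max(worst, loss(candidate))
    return worst

w = (1.0, 1.0)
rho = 0.05
eps_hat = sam_perturbation(w, rho)
sam_loss = loss((w[0] + eps_hat[0], w[1] + eps_hat[1]))
brute_force_loss = worst_on_sphere(w, rho)
print(f"loss at w            : {loss(w):.4f}")
print(f"loss at w + eps_hat  : {sam_loss:.4f}")
print(f"brute-force worst    : {brute_force_loss:.4f}")
```

For a small radius the single gradient step lands very close to the sampled worst case, which is why SAM can afford to replace the inner max with one extra backward pass.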
First-order methods (SGD / Adam) walk by looking only at the slope under their feet — in a long, narrow "ravine" they zigzag left and right, taking forever to reach the bottom. Second-order methods look at one more thing: curvature (how the terrain bends), so they can cut diagonally straight toward the floor. Analogy: first-order is like congestion control that only sees the instantaneous gradient; second-order is like also knowing "the curvature characteristics of this link" and cutting straight to it.
The gradient g is the first derivative (slope). Newton's method, the theoretical ideal, uses the inverse of the Hessian matrix H (the second derivative, describing curvature) as a preconditioner:

$$w \leftarrow w - H^{-1} g$$
It corrects "ill-conditioned" ravines, the kind where directions have wildly different scales and first-order methods like Adam zigzag. But the pain point is fatal: H is n×n, and with n in the hundreds of millions it can neither be stored (O(n²) memory) nor inverted (O(n³) time). It's like how you'd never store a fully-connected adjacency matrix over hundreds of millions of nodes.
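A small sketch of the ravine picture, assuming a hand-picked 2×2 Hessian with eigenvalues 100 and 1 and a hypothetical starting point. On an exactly quadratic loss a Newton step lands on the minimum in one move, while plain gradient descent is held back by the stiff direction's step-size limit.

```python
H = ((50.5, 49.5), (49.5, 50.5))  # eigenvalues 100 and 1: an ill-conditioned ravine

def grad(w):
    """Gradient of L(w) = 0.5 * w^T H w, i.e. H @ w."""
    return (H[0][0] * w[0] + H[0][1] * w[1], H[1][0] * w[0] + H[1][1] * w[1])

def loss(w):
    g = grad(w)
    return 0.5 * (w[0] * g[0] + w[1] * g[1])

def gradient_descent_step(w, lr):
    g = grad(w)
    return (w[0] - lr * g[0], w[1] - lr * g[1])

def newton_step(w):
    """w - H^{-1} g, using the closed-form inverse of a 2x2 matrix."""
    g = grad(w)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    inv = ((H[1][1] / det, -H[0][1] / det), (-H[1][0] / det, H[0][0] / det))
    step = (inv[0][0] * g[0] + inv[0][1] * g[1], inv[1][0] * g[0] + inv[1][1] * g[1])
    return (w[0] - step[0], w[1] - step[1])

start = (1.0, 0.0)
w_gd = start
for _ in range(50):
    w_gd = gradient_descent_step(w_gd, lr=0.019)  # must stay below 2/100 to be stable
w_newton = newton_step(start)

print(f"start loss            : {loss(start):.4f}")
print(f"after 50 GD steps     : {loss(w_gd):.4f}")
print(f"after 1 Newton step   : {loss(w_newton):.2e}")
```

The 2×2 inverse is trivial here; the point of the paragraph above is that the same inverse is out of reach when H has hundreds of millions of rows.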
So there are two routes that "use structural assumptions to cut O(n²) down to something affordable":
K-FAC approximates each layer's curvature block as a Kronecker product A⊗B: factor a big block into two small factors, storable and invertible. Shampoo maintains a small preconditioner per tensor dimension instead of one n×n giant.

Both are essentially "approximate the full curvature via a structured factorization." In 2024–2025 these methods (and variants like SOAP) resurged in large-model pre-training, because they converge in fewer steps, and at massive scale, steps saved are real wall-clock time saved.
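To see why factoring a big block into two small factors makes storage affordable, here is a back-of-the-envelope count for one weight matrix. The 4096×4096 layer shape is an assumption, and the function names are hypothetical; the count covers entries only and ignores how the factors are estimated.

```python
def full_block_entries(rows, cols):
    """Entries in the full curvature block of one rows x cols weight matrix."""
    n = rows * cols
    return n * n

def kronecker_factor_entries(rows, cols):
    """Entries in two small factors, one rows x rows and one cols x cols."""
    return rows * rows + cols * cols

rows, cols = 4096, 4096  # assumed layer shape
full = full_block_entries(rows, cols)
factored = kronecker_factor_entries(rows, cols)

print(f"full block : {full:.3e} entries")
print(f"two factors: {factored:.3e} entries")
print(f"reduction  : {full / factored:.1e}x")
```

The two factors are also cheap to invert separately, which is what turns the structural assumption into a usable preconditioner.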
# Adam only uses a diagonal "adaptivity" (treats each param as independent);
# second-order methods precondition with curvature, sensing coupling between params.
from torch_optimizer import Shampoo  # pip install torch-optimizer

opt = Shampoo(
    model.parameters(),
    lr=1e-3,
    # maintain a small preconditioner per tensor dimension,
    # not one n×n giant: factor "full curvature" into small factors
    update_freq=20)  # refresh the preconditioner only every 20 steps (amortize cost)

for batch in loader:
    opt.zero_grad()
    loss_fn(model, batch).backward()
    opt.step()
After Adam reigned for nearly a decade, two challengers appeared in 2023. Lion is like a minimalist refactor — cut half of Adam's state (keep only momentum), and take only the sign of the direction in the update; Sophia is like lightweight second-order — quietly estimate a little curvature, but at tiny cost.
Adam stores two states per parameter: the first moment m (momentum) and the second moment v (a moving average of squared gradients). Memory ≈ 2× the parameter count. Pain point: at tens of billions of parameters, the optimizer state itself eats enormous memory. The two challengers seek cheaper update rules from different directions:
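Lion keeps only the momentum m and drops v, so optimizer state is halved. Its update is update = sign(β₁·m + (1−β₁)·g). The sign means every parameter takes the same step magnitude (only direction differs), which is itself an implicit regularizer. Because the sign update gives a larger effective step, the paper notes Lion typically needs a learning rate 3–10× smaller than Adam. Sophia keeps a cheap estimate of the Hessian's diagonal, refreshes it only occasionally, and clips the update, aiming for second-order benefit at Adam-level cost.

Common theme: in the gap where "Adam is good enough, but not lean enough / not fast enough," each found a way out, from memory (Lion) and from curvature (Sophia).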
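The memory side of the argument is simple arithmetic. The sketch below assumes a 30-billion-parameter model and fp32 optimizer state (both assumptions; real setups may shard or quantize state), and counts two stored values per parameter for Adam versus one for a momentum-only optimizer such as Lion.

```python
BYTES_PER_FP32 = 4

def optimizer_state_gb(num_params, states_per_param, bytes_per_value=BYTES_PER_FP32):
    """Optimizer state size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * states_per_param * bytes_per_value / 1e9

num_params = 30e9  # assumed model size: 30 billion parameters

adam_gb = optimizer_state_gb(num_params, states_per_param=2)  # m and v
lion_gb = optimizer_state_gb(num_params, states_per_param=1)  # m only

print(f"Adam state: {adam_gb:.0f} GB")
print(f"Lion state: {lion_gb:.0f} GB")
```

Under these assumptions that is 240 GB versus 120 GB of state before counting the weights or gradients themselves.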
from lion_pytorch import Lion  # pip install lion-pytorch

# Lion stores only momentum (not Adam's second moment v) → optimizer memory halved
opt = Lion(model.parameters(),
           lr=1e-4,            # typically 1/3 to 1/10 of Adam's
           weight_decay=1e-2)

# Core update rule (conceptual):
#   update = sign(β1·m + (1-β1)·g)   ← take only the sign, uniform step magnitude
#   m      = β2·m + (1-β2)·g         ← momentum updated with a different coefficient
for batch in loader:
    opt.zero_grad()
    loss_fn(model, batch).backward()
    opt.step()
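The conceptual rule in the comments above can be written out by hand on plain Python lists. This is a minimal sketch for intuition, not the library's implementation: `lion_step` is a hypothetical helper, the β values are assumed defaults, and the decoupled weight-decay term is an assumption about how decay is usually applied.

```python
def sign(x):
    """Return -1, 0 or 1."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion-style update over lists of scalars; returns new params and momentum."""
    new_params, new_momentum = [], []
    for w, g, m in zip(params, grads, momentum):
        direction = sign(beta1 * m + (1 - beta1) * g)        # only the sign survives
        new_params.append(w - lr * (direction + weight_decay * w))
        new_momentum.append(beta2 * m + (1 - beta2) * g)     # different coefficient
    return new_params, new_momentum

params = [1.0, -2.0, 0.5]
grads = [0.3, -40.0, 0.0001]   # wildly different gradient magnitudes
momentum = [0.0, 0.0, 0.0]

new_params, new_momentum = lion_step(params, grads, momentum, lr=0.01)
steps = [old - new for old, new in zip(params, new_params)]
print("steps   :", steps)
print("momentum:", new_momentum)
```

Gradients of 0.3, -40 and 0.0001 all produce a step of the same size, which is the "uniform step magnitude" property and the reason the learning rate has to be set lower than Adam's.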
Adam's second moment v (moving average of squared gradients) can be seen as a crude estimate of the diagonal of the Fisher / Hessian. It gives each parameter an adaptive step, roughly equivalent to "keep only the diagonal of the curvature matrix, ignore all coupling between parameters." True second-order methods (K-FAC / Shampoo) keep the off-diagonal coupling too, and are in theory more accurate. But Adam reigned for a decade on sheer cost-effectiveness: the diagonal approximation keeps per-step cost close to SGD's, it's robust to hyperparameters, and it works almost everywhere.

The coupling info is valuable, but the cost of "computing + storing the preconditioner" often eats up the gain from "fewer steps." Only in massive-scale pre-training, where steps are extremely expensive, does this account flip positive. It's a classic engineering trade-off: approximation accuracy vs cost per step, with no universal optimum, only "which pays off at your scale." It also explains Sophia's design philosophy: don't seek the full Hessian; just diagonal + occasional updates + clipping, squeezing second-order's benefit into Adam-level cost.
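The "diagonal + clipping" idea can be sketched in a few lines. This is an illustration under assumptions, not Sophia's reference implementation: `sophia_like_step` is a hypothetical helper, the diagonal curvature is handed in directly rather than estimated and refreshed every few steps, and the `gamma`, `eps` and learning-rate values are made up for the example.

```python
def clip(x, bound):
    """Clamp x to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))

def sophia_like_step(params, momentum, hessian_diag, lr=0.1, gamma=1.0, eps=1e-12):
    """Clipped, diagonally preconditioned step: w -= lr * clip(m / max(gamma*h, eps), 1)."""
    return [w - lr * clip(m / max(gamma * h, eps), 1.0)
            for w, m, h in zip(params, momentum, hessian_diag)]

# Three coordinates: a stiff one, a soft one, and one whose curvature estimate is ~0.
params = [0.5, 0.5, 1.0]
hessian_diag = [100.0, 1.0, 0.0]   # assumed diagonal curvature estimates
momentum = [50.0, 0.5, 0.5]        # stands in for the smoothed gradient

new_params = sophia_like_step(params, momentum, hessian_diag)
print(new_params)
```

The stiff and soft coordinates get the same curvature-corrected step despite gradients that differ by 100×, while the coordinate with an unusable curvature estimate falls back to a bounded, sign-like step instead of blowing up.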