AI/ML Explained: The Geometry of Optimization

Day 43 · 2026-06-29 · Level: Advanced (math-heavy)
For: engineers with coding experience, non-AI background

Day 10 covered how to walk — AdamW, learning rates, the basic mechanics of gradient descent. Today we go one layer deeper: what does the terrain of training actually look like? Why do some valleys generalize well and others poorly? And when Adam isn't enough, from which directions do frontier optimizers push? This is an issue about the geometry of optimization.

Loss Landscape & Flat MinimaLoss Landscape & Flat Minima

geometric intuitiongeneralization
One-line analogy

Training a neural net = finding a valley floor in a terrain with hundreds of millions of dimensions. But valleys come in two kinds: a wide, flat basin, or a narrow, deep crack. By analogy to your system configs — a flat minimum is like a high-tolerance config (one parameter drifts a bit, it still works), a sharp minimum is like a config balanced on a tightrope (any parameter off by a hair and the system collapses). How well a model generalizes depends largely on which kind of valley it lands in.

What it solves + how it works

The loss function L(w) is a function of the parameters w (often hundreds of millions of dimensions); training is gradient descent searching for a minimum of L. But here's the puzzle: two models with training loss both near 0 — why does one generalize well on the test set and the other badly?

Key insight (the "flat minima" notion goes back to Hochreiter & Schmidhuber in the 1990s): the test-set loss surface is slightly shifted relative to the training surface (because train/test distributions differ a little). If the minimum you sit in is "flat," you're still in a low-loss region after the shift — good generalization. If it's "sharp," a small shift sends the loss soaring — poor generalization. Flat = robust to perturbations of the data distribution.

Li et al. 2018 used a technique called filter normalization to project the billion-dimensional surface down to 2D and made a beautiful finding: ResNet's skip connections dramatically smooth a terrain that's otherwise sharp, chaotic and ravine-ridden into a smooth basin — a geometric explanation for why deep nets suddenly became trainable once skip connections were added.

Loss cross-section along one dimension (solid = train, dotted = test, slightly shifted)

Flat minimum (generalizes well)
\________/ ← train valley floor
 \._._._._._/ ← after the test shift, still in low-loss region

Sharp minimum (generalizes poorly)
\  / ← train valley floor (just as low)
 \./  ↑ test shifts a bit, loss soars
↑ Same training loss; geometric shape decides whether generalization differs wildly
Code example
import torch, copy

# Quantify the "sharpness" of the minimum a trained model sits in:
# randomly perturb the weights, see how much loss can rise
def sharpness(model, loss_fn, batch, rho=0.05, trials=10):
    base = loss_fn(model, batch).item()
    worst = base
    for _ in range(trials):
        m2 = copy.deepcopy(model)
        with torch.no_grad():
            for p in m2.parameters():
                # perturbation radius scaled to the parameter's own size
                p.add_(torch.randn_like(p) * rho * p.norm())
        worst = max(worst, loss_fn(m2, batch).item())
    return worst - base   # how much higher the worst neighbor is = sharpness
# lower sharpness → flatter minimum → usually better generalization
Common misconception + practical scenario
Misconception: "lower training loss is always better." Wrong. Two models both pushing loss toward 0 can generalize wildly differently — the deciding factor is the geometric shape of the minimum, not the loss value itself. Grinding training loss to the extreme can actually drop you into a sharp, overfitted valley.
📌 Super-individual scenario: when BigCat fine-tunes a small model for deployment, beyond validation loss, run the sharpness probe above — at equal validation performance, the lower-sharpness version is more robust to live data drift and deserves priority for going live.
Takeaway + question
💡 A network's generalization is written in the geometric shape of the minimum, not just in the loss number.
🤔 Is the "flat = robust" intuition the same thing as your engineering philosophy in distributed systems — deliberately leaving tolerance margins, never pinning parameters at a critical edge?

Sharpness-Aware MinimizationSharpness-Aware Minimization (SAM)

optimization objectiverobustness
One-line analogy

If flat minima are better, can we write "find flatness" directly into the training objective? That's exactly what SAM does. It's like chaos engineering / fault injection — not content with "the current config runs," it actively asks "does it still run under the worst perturbation around me?", deliberately picking the worst point in the neighborhood to optimize, forcing itself into a wide basin.

What it solves + how it works

Plain SGD only minimizes the loss L(w) at the current point. But Keskar et al. 2016 found a famous phenomenon — the large-batch generalization gap: large-batch training tends to fall into sharp minima (because large batches have low gradient noise, removing the randomness that "shakes" the model out of cracks), hurting generalization.

Foret et al. 2020 proposed SAM, changing the objective from "minimize current loss" to min-max:

minw   max‖ε‖≤ρ   L(w + ε)

Symbol by symbol: w is the parameters, ε is a perturbation vector added to the weights, ρ (rho) is the neighborhood radius (a hyperparameter, e.g. 0.05), ‖·‖ is a vector norm. The inner max means "within a ball of radius ρ, how high can the loss go"; the outer min means "minimize that worst-case value." Intuition: stop asking how low the loss is under your feet; ask how low the highest loss in the neighborhood is — only a wide, flat basin can keep the "neighborhood worst case" low too.

In practice each step needs two forward passes: the first computes the gradient and finds the "sharpest direction" ε̂ = ρ·g/‖g‖ (climb one step toward the steepest slope); the second recomputes the gradient at w+ε̂, and that gradient is what actually updates the weights. The cost is roughly 2× training time, traded for better generalization.

SAM's two steps (inside a sharp valley)

 /\  current point w at the sharp floor
/ ●\
   climb along the gradient to the worst point in the neighborhood w+ε̂ (see how sharp this valley is)
   update using that point's gradient → push w out of the sharp valley, toward a wide basin
↑ Plain SGD only looks at the gradient at ●; SAM looks at the "worst neighbor's" gradient
Code example
# One SAM step: climb to the worst neighbor first, then descend (two passes)
def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    loss_fn(model, batch).backward()
    # 1) compute perturbation ê = ρ·g/‖g‖, push weights toward the "sharpest" direction
    gn = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters()]))
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            e = p.grad * rho / (gn + 1e-12); eps[p] = e; p.add_(e)
    optimizer.zero_grad()
    # 2) recompute gradient at the perturbed point — this is the real update direction
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p in model.parameters(): p.sub_(eps[p])  # step back to origin
    optimizer.step()      # update using the "worst neighbor's" gradient → toward flat region
Common misconception + practical scenario
Misconception: "SAM always helps, just bolt it on." Not necessarily. It doubles compute cost; on already-flat tasks the gain is tiny; the gain is largest on small-data, overfitting-prone settings (e.g. small-dataset vision tasks).
📌 Personal-project scenario: when BigCat fine-tunes a model on a small dataset (most prone to overfitting), SAM is a high-value "generalization insurance" — spend twice the training time, get a model that's steadier on new data.
Takeaway + question
💡 SAM turns "find a robust solution" from post-hoc prayer into the optimization objective itself.
🤔 Is the min-max "optimize for the worst case" idea — also seen in adversarial training, and in designing your systems for p99 rather than the mean in capacity planning — the same mathematical skeleton?

Second-Order MethodsSecond-Order Methods (K-FAC / Shampoo)

curvaturepreconditioning
One-line analogy

First-order methods (SGD / Adam) walk by looking only at the slope under their feet — in a long, narrow "ravine" they zigzag left and right, taking forever to reach the bottom. Second-order methods look at one more thing: curvature (how the terrain bends), so they can cut diagonally straight toward the floor. Analogy: first-order is like congestion control that only sees the instantaneous gradient; second-order is like also knowing "the curvature characteristics of this link" and cutting straight to it.

What it solves + how it works

The gradient g is the first derivative (slope). The theoretically optimal Newton's method uses the Hessian matrix H (second derivative, describing curvature) as a preconditioner:

Δw = − H−1 g

It corrects "ill-conditioned" ravines — the kind where directions have wildly different scales and Adam zigzags. But the pain point is fatal: H is n×n, with n = hundreds of millions of parameters → it can neither be stored nor inverted (complexity O(n²)–O(n³)). It's like how you'd never store a fully-connected adjacency matrix over hundreds of millions of nodes.

So there are two routes that "use structural assumptions to cut O(n²) down to something affordable":

  • K-FAC (Martens & Grosse 2015): approximate each layer's Fisher matrix (a form of curvature) as a Kronecker product of two small matrices A⊗B — factor a big block into two small factors, storable and invertible.
  • Shampoo (Gupta et al. 2018): maintain a small preconditioner matrix per dimension of the parameter tensor, instead of one giant matrix across all parameters.

Both are essentially "approximate the full curvature via a structured factorization." In 2024–2025 these methods (and variants like SOAP) re-surged in large-model pre-training, because they converge in fewer steps — and at massive scale, steps saved are real wall-clock saved.

Top view of an ill-conditioned ravine (steep across, gentle along)

First-order (SGD/Adam)   Second-order (K-FAC/Shampoo)
●╲          
 ╱●           ╲
●╲ ← zigzag       ╲ ← curvature straightens it
 ╱●             ◎ straight to the floor
↑ Curvature info shrinks steps in steep directions, lengthens them in gentle ones — fewer detours
Code example
# Adam only uses a diagonal "adaptivity" (treats each param as independent);
# second-order methods precondition with curvature, sensing coupling between params.
from torch_optimizer import Shampoo  # pip install torch-optimizer

opt = Shampoo(
    model.parameters(), lr=1e-3,
    # maintain a small preconditioner per tensor dimension,
    # not one n×n giant — factor "full curvature" into small factors
    update_freq=20)   # refresh the preconditioner only every 20 steps (amortize cost)

for batch in loader:
    opt.zero_grad()
    loss_fn(model, batch).backward()
    opt.step()
Common misconception + practical scenario
Misconception: "second-order is always faster than Adam." Not necessarily. Computing and storing the preconditioner per step is expensive; it only pays off when "steps saved" exceed "the extra cost per step." It's also more complex to implement and more sensitive to hyperparameters. So it isn't universal — it's a weapon for the "steps are extremely expensive" regime of massive-scale pre-training.
📌 Cross-disciplinary scenario: when BigCat reads that a big lab's pre-training used Shampoo / SOAP, you can see through it at a glance — it's not showing off, it's the engineering trade of "more compute per step for fewer total steps" at massive scale.
Takeaway + question
💡 The whole art of second-order methods is "use curvature information as much as possible without storing the full curvature matrix."
🤔 Is the Kronecker-factorization idea of "use a structural assumption to cut O(n²)" the same compression philosophy you apply with sparse / block / low-rank approximations of large matrices?

Sign Momentum & Modern OptimizersLion & Sophia

frontieroptimizers
One-line analogy

After Adam reigned for nearly a decade, two challengers appeared in 2023. Lion is like a minimalist refactor — cut half of Adam's state (keep only momentum), and take only the sign of the direction in the update; Sophia is like lightweight second-order — quietly estimate a little curvature, but at tiny cost.

What it solves + how it works

Adam stores two states per parameter: the first moment m (momentum) and the second moment v (a moving average of squared gradients). Memory ≈ 2× the parameter count. Pain point: at tens of billions of parameters, the optimizer state itself eats enormous memory. The two challengers seek cheaper update rules from different directions:

  • Lion (Chen et al. 2023, "sign momentum"): it was itself discovered automatically via program search (treating "find an optimization algorithm" as searching a space of programs), not hand-designed. It stores only momentum m (memory halved), and takes the sign in the update: update = sign(β₁·m + (1−β₁)·g). The sign means every parameter takes the same step magnitude (only direction differs), which is itself an implicit regularizer. Because the sign update gives a larger effective step, the paper notes Lion typically needs a learning rate 3–10× smaller than Adam.
  • Sophia (Liu et al. 2023): designed for LLM pre-training, it lightly estimates a diagonal Hessian as preconditioner only every few steps, then applies element-wise clipping to the update to control the worst-case step size — trading a little second-order info for faster convergence.

Common theme: in the gap where "Adam is good enough, but not lean enough / not fast enough," each found a way out — from memory (Lion) and from curvature (Sophia).

Optimizer state comparison (what's stored per parameter)

SGD 0 (leanest, but slow/rough)
Adam 2: momentum m + second moment v (steadiest default)
Lion 1: momentum m only (memory halved, sign update)
Sophia ~2: m + occasionally-updated diagonal Hessian (lightweight 2nd-order)
↑ Optimizer evolution = reallocating budget among memory / compute / generalization
Code example
from lion_pytorch import Lion  # pip install lion-pytorch

# Lion stores only momentum (not Adam's second moment v) → optimizer memory halved
opt = Lion(model.parameters(),
           lr=1e-4,            # typically 1/3 to 1/10 of Adam's
           weight_decay=1e-2)

# Core update rule (conceptual):
#   update = sign(β1·m + (1-β1)·g)   ← take only the sign, uniform step magnitude
#   m      = β2·m + (1-β2)·g          ← momentum updated with a different coefficient
for batch in loader:
    opt.zero_grad()
    loss_fn(model, batch).backward()
    opt.step()
Common misconception + practical scenario
Misconception: "a new optimizer must dominate Adam across the board." Reality: Lion / Sophia only win on specific scales / tasks; AdamW is still the steadiest default. And switching optimizers requires re-tuning the learning rate (Lion especially sensitive) — blindly copying settings often does worse than Adam.
📌 Decision-support scenario: when BigCat reads "a new model used Lion to save memory," you can see it's not voodoo, but a clear trade-off of "cut half the optimizer state + sign update" — and judge from there whether it fits your own resource constraints.
Takeaway + question
💡 The history of optimizers is a history of repeatedly reallocating budget among memory, compute, and generalization.
🤔 Lion was discovered by program search, not hand-designed — when "designing the algorithm" itself can be automated, how does the researcher's role change?

Further ReadingFurther Reading

Deep QuestionsDeep Questions

1. Is "flat minima always generalize well" settled? Are there counterexamples?
Not settled — in fact it's an active debate. Dinh et al. 2017 (Sharp Minima Can Generalize For Deep Nets) gave a sharp rebuttal: exploiting the positive scale invariance of ReLU nets — multiply one layer's weights by k and divide the next by k, the function is unchanged but the curvature (sharpness) in parameter space can be arbitrarily inflated or shrunk. That is, the same function can be made to look "sharp" or "flat" depending on how you parameterize it. This shows a naive "sharpness" definition is not a reliable measure of generalization. Later work (adaptive sharpness, reparameterization-invariant sharpness measures) tries to patch the hole. The deeper lesson: geometric intuition is powerful, but "in which coordinate system you measure" is itself a trap — the same class of problem as "curvature depends on the metric" in physics. So SAM's effectiveness is perhaps better explained not as "it found an objectively flat point" but as "its perturbation mechanism acts as a regularizer in practice."
2. Why is "gradient noise" actually good for generalization? What is small-batch randomness really doing?
One of the core counterintuitions of modern optimization. View SGD as "gradient descent + noise": each batch's gradient ≈ true gradient + a random perturbation, and smaller batches mean larger noise. This noise has two effects: (1) escaping sharp minima — sharp valleys are narrow and steep, noise easily "shakes" the model out, so only wide flat basins can "trap" a noisy trajectory; SGD thus naturally prefers flat solutions (echoing Keskar's finding: large batch → low noise → stuck in sharp valleys → poor generalization); (2) something like annealing — large noise early explores broadly, then as the learning rate decays the noise shrinks for fine convergence. In other words, gradient noise isn't a bug to eliminate but a feature of implicit regularization. This explains why "just increase batch size and scale up the learning rate proportionally" can't speed up training losslessly — you simultaneously weaken this beneficial noise. It's isomorphic to wisdom you've seen in distributed systems: "a bit of jitter actually avoids thundering herds and resonance."
3. Adam is a "diagonal-approximate second-order method." So why hasn't true second-order replaced it?
Adam's v (moving average of squared gradients) can be seen as a crude estimate of the diagonal of the Fisher / Hessian — it gives each parameter an adaptive step, equivalent to "keep only the diagonal of the curvature matrix, ignore all coupling between parameters." True second-order methods (K-FAC / Shampoo) keep the off-diagonal coupling too, and are in theory more accurate. But Adam reigned for a decade on sheer cost-effectiveness: the diagonal approximation makes per-step cost nearly identical to SGD, it's robust to hyperparameters, and it works almost everywhere. The coupling info is valuable, but the cost of "computing + storing the preconditioner" often eats up the gain from "fewer steps" — only in massive-scale pre-training, where steps are extremely expensive, does this account flip positive. It's a classic engineering trade-off: approximation accuracy vs cost per step, with no universal optimum, only "which pays off at your scale." It also explains Sophia's design philosophy — don't seek the full Hessian; just diagonal + occasional updates + clipping, squeezing second-order's benefit into Adam-level cost.
4. Lion was discovered by an "algorithm that searches for algorithms." If optimizers can be searched, where's the boundary?
Lion comes from a fascinating paradigm: formalize "designing an optimization algorithm" as searching a space of programs, letting the machine evolve update rules itself. Its success poses a sharp question: which things once belonging to "researcher creativity" are becoming searchable optimization problems? The boundary currently lies in a few places: (1) designing the search space is still human-defined — you can't search out an algorithm beyond the operator combinations you allow; framing the space is itself creative work; (2) the generalization gap — an algorithm searched on a small proxy task may not stay good at scale; the paper spends real effort on "program selection and simplification" to cross this gap; (3) interpretability — a discovered rule (like sign momentum) still needs humans to understand "why it works," else it's just a black box. The deeper implication: this is AI research's self-referential moment — using machine learning to improve the tools of machine learning. It won't make researchers obsolete, but shifts their role from "hand-crafting specific algorithms" up to "designing the search space, defining the objective, interpreting the discovery." This aligns with the core of BigCat's pursuit of the "AI super-individual": push down what can be automated, move yourself up to a higher level of abstraction.
5. Tying it together: from "terrain" to "optimizers," what unifying view runs through this issue?
The four concepts are facets of one story, whose through-line is "geometry decides everything." The loss landscape (concept 1) is the terrain itself — its flatness/sharpness decides generalization; SAM (concept 2) is actively choosing where to land — not letting you stop in a sharp valley; second-order methods (concept 3) are reading the terrain's curvature — so the steps walk smarter; Lion/Sophia (concept 4) are practical approximations to the terrain under resource constraints. Connect them and "training a neural net" turns from a vague "alchemy of tuning" into a clear proposition: on a billion-dimensional terrain whose shape is set by the architecture, using finite memory and compute, find a wide basin that's robust to perturbation. Each tool answers one word of that sentence — how the architecture shapes the terrain (skip connections smooth it), how to actively find a wide basin (SAM), how to walk fast (second-order), how to walk lean (Lion). This way of thinking — "reduce a complex phenomenon to one unified framework of geometry + resources" — is the underlying aesthetic shared by complexity science, distributed systems, and even the Buddhist view of "dependent origination": appearances vary endlessly, structure can be astonishingly simple.