AI/ML Explained: Activations & Normalization

Day 32 · 2026-06-18
For: engineers with coding experience, not from an AI background

Activation Functions & ReLU

non-linearity · foundation
One-line analogy

An activation function is the network's non-linear switch. Without it, stacking 100 layers is exactly equivalent to one layer — like chaining 100 middlewares that only do linear transforms (pure forwarding, pure scaling); the compiler collapses them all into a single matrix. ReLU = max(0, x), the same as SQL's GREATEST(0, x): zero out negatives, pass positives — the most primitive gate there is.

Problem it solves + how it works

The core pain point is linear collapsibility: a layer computes y = Wx + b (matmul + bias). Two stacked linear layers W₂(W₁x) = (W₂W₁)x are still a single linear map — no matter how deep, the expressive power equals one layer, fitting only lines/hyperplanes. Activations inject non-linearity between layers so the network can approximate any complex function (the intuition behind the "universal approximation theorem").

Why non-linearity is mandatory

No activation: W₁ → W₂ → W₃ ≡ a single W (depth wasted)
With activation: W₁ → ReLU → W₂ → ReLU → W₃ ≠ one layer (depth matters)
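
The collapse can be checked by hand on a toy case. The sketch below is plain Python rather than PyTorch so it runs without dependencies; the 2×2 weights and the input are made-up values, and biases are left out for brevity.

```python
# Pure-Python sketch: matrices are nested lists, vectors are flat lists.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A @ B for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights and input, chosen only for illustration (no biases).
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 1.0]

stacked = matvec(W2, matvec(W1, x))           # two linear layers in series
collapsed = matvec(matmul(W2, W1), x)         # one pre-multiplied layer
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU between the layers

print(stacked, collapsed, with_relu)
```

The first two results match, so the second layer added nothing; the third differs because ReLU zeroed a negative intermediate value before the second matrix saw it.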

Early nets used Sigmoid / Tanh (S-shaped curves), but they "saturate" at the extremes — when input is very large or small the gradient approaches 0, and in backprop gradients multiplied layer by layer decay exponentially, so deep nets can't learn (vanishing gradients). ReLU's revolution: in the positive region the gradient is constant 1, never decaying, making deep nets trainable. The cost is Dying ReLU: if a neuron's input stays negative for long, its output is stuck at 0, its gradient is 0, and it's permanently dead — its most famous failure mode.
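
A rough way to see both effects numerically: the sketch below multiplies the same local gradient across 20 layers. It is a deliberately simplified model that ignores the weight matrices entirely, and the pre-activation values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # peaks at 0.25 when z == 0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # 1 in the positive region, 0 otherwise

def chained_grad(local_grad, z, depth):
    """Product of `depth` identical local gradients (weights ignored)."""
    return local_grad(z) ** depth

depth = 20
best_case_sigmoid = chained_grad(sigmoid_grad, 0.0, depth)   # 0.25 ** 20
saturated_sigmoid = chained_grad(sigmoid_grad, 5.0, depth)   # far smaller still
positive_relu = chained_grad(relu_grad, 5.0, depth)          # stays at 1.0

# Dying ReLU: if every pre-activation a neuron sees is negative,
# each local gradient is 0 and no learning signal reaches its weights.
pre_activations = [-3.2, -0.7, -1.5, -4.0]   # hypothetical values
dead_neuron_signal = sum(relu_grad(z) for z in pre_activations)

print(best_case_sigmoid, saturated_sigmoid, positive_relu, dead_neuron_signal)
```

Even in Sigmoid's best case the chained gradient is below 1e-12 after 20 layers, while ReLU's positive region passes it through unchanged.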

Code example
import torch, torch.nn as nn

# Intuition: two stacked "no-activation" linears == one linear
x = torch.randn(4, 8)
lin = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 8))
# Two Linears in series is still a linear map; an equivalent matrix replaces it

relu_net = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),          # ← inject non-linearity; now depth "counts"
    nn.Linear(16, 8),
)
print(relu_net(x).shape)  # torch.Size([4, 8])

# ReLU itself is this simple:
def relu(x): return torch.clamp(x, min=0)  # max(0, x)
Common pitfall + practical scenario
"Activations exist to mimic biological neuron firing" — that's an after-the-fact analogy, not the design motive. The real motive is pure math: break linear collapsibility. Reading it as "the non-linear operator that makes depth meaningful" is far more accurate than "simulating the brain" — the latter breeds wrong intuitions when you tune.
📌 BigCat scenario: when you read that a model "gains points by swapping the activation," don't treat it as magic. It's fundamentally changing how gradients flow and which neurons fire — the same class of "alter signal-transmission characteristics" decision as tuning the rate-limit threshold / circuit-breaker curve in a distributed system.
Takeaway + question
💡 Activations aren't decoration; they're the precondition for the "deep" in deep learning — without them a 10,000-layer net equals one layer.
🤔 If non-linearity is the source of expressive power, is "stronger non-linearity" always better? Why do practitioners favor a nearly linear, simple function like ReLU over a highly curved, complex one?

GELU & SwiGLU

modern activation · gating · LLM standard
One-line analogy

ReLU is a hard if-else router: negatives are cut to zero. GELU is a probabilistic soft router: it smoothly weights through inputs by "how likely they're worth keeping." SwiGLU goes further — a learnable dynamic gate, like a smart load balancer that decides how much traffic to let through based on request content, with the gate opening set jointly by data and parameters.

Problem it solves + how it works

ReLU's hard cutoff has two flaws: ① it's non-differentiable at 0 (a kink); ② Dying ReLU. GELU (Hendrycks & Gimpel, 2016) replaces the kink with a smooth curve. It's defined as GELU(x) = x · Φ(x), where Φ(x) is the standard normal cumulative distribution function (CDF) — intuitively, Φ(x) is "the probability that a standard normal random number is less than x." The larger x is, the closer this probability is to 1 (pass almost everything); the more negative x is, the closer to 0 (block almost everything), with a smooth transition in between. So GELU weights probabilistically by the input's relative magnitude, rather than hard-gating by sign like ReLU. The negative region keeps a sliver of signal (never fully dead), and gradients are smoother.

Behavior of ReLU vs GELU in the negative/positive regions
input x:    -3       -1      0     1      3
ReLU         0        0      0     1      3       hard cutoff, has a kink
GELU      -0.004   -0.16     0    0.84   2.996    smooth, slight negative leak
↑ GELU keeps a little signal in the negative region and is differentiable everywhere
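
The GELU row can be reproduced from the definition GELU(x) = x · Φ(x). This is a plain-Python sketch using the exact CDF via `math.erf`; library implementations may also offer a tanh-based approximation that gives slightly different digits.

```python
import math

def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """GELU(x) = x * Phi(x)."""
    return x * phi(x)

def relu(x):
    return max(0.0, x)

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"x={x:+.0f}  relu={relu(x):.3f}  gelu={gelu(x):+.3f}")
```

Note how the negative inputs come out small but nonzero, which is the "slight leak" in the table.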

SwiGLU (from Shazeer 2020, GLU Variants Improve Transformer) isn't a single activation but a redesign of the Transformer's feed-forward layer (FFN). A plain FFN: W₂ · act(W₁x). GLU-style structures add gating: two linear projections — one as "content," one passed through an activation as the "gate" — multiplied element-wise: (W_v·x) ⊙ Swish(W_g·x). Here ⊙ is element-wise multiplication, and the gate opening is decided dynamically by the input — one extra layer of data-dependent modulation beyond a fixed activation. This is the standard FFN in LLaMA, PaLM, and other modern LLMs.

Code example
import torch, torch.nn as nn, torch.nn.functional as F

# GELU: built into PyTorch, one line
x = torch.randn(2, 8)
y = F.gelu(x)              # x * Φ(x), a smooth ReLU

# A SwiGLU-style FFN (same idea as LLaMA)
class SwiGLU_FFN(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)  # gate
        self.w_val  = nn.Linear(dim, hidden, bias=False)  # content
        self.w_out  = nn.Linear(hidden, dim, bias=False)
    def forward(self, x):
        # Swish(gate) decides how much value passes per dimension
        return self.w_out(F.silu(self.w_gate(x)) * self.w_val(x))
# F.silu == Swish == x*sigmoid(x); * is the element-wise gating multiply
Common pitfall + practical scenario
"SwiGLU is 'more non-linear' than ReLU, hence stronger" — inaccurate. SwiGLU's gain comes mainly from the multiplicative interaction the gate introduces (data-dependently amplifying/suppressing features), not a steeper non-linearity. Note: GLU-style structures split a layer into two projections, adding parameters, so in practice the hidden dim is shrunk to about 2/3 to keep total params equal — a detail from the paper, otherwise the comparison is unfair.
📌 BigCat scenario: understanding the gating mechanism matters more than memorizing names — it's isomorphic to the feature flags / dynamic routing / content-based conditional pass-through you already know. LLMs are full of this "let the data decide the signal path" design; attention is one instance.
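
The "shrink the hidden dim to about 2/3" detail is simple arithmetic on parameter counts. A back-of-the-envelope sketch, assuming bias-free projections, a hypothetical model width of 4096, and the common 4× expansion for the plain FFN:

```python
def plain_ffn_params(dim, hidden):
    """Two projections: dim -> hidden and hidden -> dim (no biases)."""
    return 2 * dim * hidden

def swiglu_ffn_params(dim, hidden):
    """Three projections: gate and value (dim -> hidden), out (hidden -> dim)."""
    return 3 * dim * hidden

dim = 4096                                # hypothetical model width
plain_hidden = 4 * dim                    # assumed 4x expansion
swiglu_hidden = (2 * plain_hidden) // 3   # shrink to roughly 2/3

plain = plain_ffn_params(dim, plain_hidden)
gated_same_hidden = swiglu_ffn_params(dim, plain_hidden)   # 1.5x the params
gated_shrunk = swiglu_ffn_params(dim, swiglu_hidden)       # back to ~1.0x

print(plain, gated_same_hidden, gated_shrunk)
```

With the same hidden size the gated FFN carries 50% more parameters; after the 2/3 shrink the two budgets agree to within a fraction of a percent, which is what makes the comparison fair. Real models often round the shrunk size to a hardware-friendly multiple.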
Takeaway + question
💡 From ReLU to GELU to SwiGLU, the through-line is "hard switch → soft weighting → learnable gate" — increasingly letting the data decide how signal flows.
🤔 Gating (multiplicative interaction) and going deeper (stacking more layers) both boost expressive power. Under a fixed parameter budget, how would you trade off "a wider gated FFN" against "deeper simple layers"?

BatchNorm vs LayerNorm

normalization · training stability
One-line analogy

Normalization = a standardization (z-score) applied before data enters the next layer, pulling activations back to a uniform "mean 0, variance 1" scale — like normalizing features before a query so a feature with a huge magnitude doesn't drown out the rest. The difference is which dimension you compute statistics over: BatchNorm relies on global statistics across the whole batch (like a rate limiter keyed off a global counter — jittery at small batch); LayerNorm computes per sample (like per-request local normalization, ignoring neighbors).

Problem it solves + how it works

Pain point: in deep nets, each layer's input distribution drifts sharply as earlier layers update, forcing later layers to "chase a moving target," making training slow and unstable. Normalization forces each layer's activations back to a stable distribution. The core formula is y = γ · (x − μ) / √(σ² + ε) + β, term by term: μ is the mean, σ² the variance (first standardize to 0 mean, 1 variance); ε is a tiny constant to avoid division by zero; γ, β are learnable scale and shift — this step is crucial: normalize first, then let the model learn back the scale it needs, so normalization doesn't sacrifice expressive power.
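
To make the formula concrete, here is a plain-Python sketch of y = γ · (x − μ) / √(σ² + ε) + β over one group of values. The input numbers, the γ/β settings, and the ε default are illustrative assumptions; in a real layer γ and β are learned.

```python
import math

def normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """y = gamma * (x - mu) / sqrt(var + eps) + beta over one group of values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)   # biased variance
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in values]

raw = [100.0, 102.0, 98.0, 100.0]                # hypothetical activations, large offset
standardized = normalize(raw)                    # gamma=1, beta=0: plain z-score
rescaled = normalize(raw, gamma=2.0, beta=5.0)   # the model "learns back" scale and shift

print(standardized)
print(rescaled)
```

The first output has mean 0 and variance close to 1 regardless of the offset of 100 in the input; the second shows that γ and β can restore any scale and shift the next layer prefers.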

Key difference: which axis you compute μ and σ over
tensor [batch=N, features=D]

BatchNorm: for each feature, across the whole batch (↓ slice vertically)
  sample1, sample2, sample3 → same column normalized together

LayerNorm: for each sample, across all its features (→ slice horizontally)
  sample1's D features → normalized within this row, independent of others

Why do Transformers / LLMs use LayerNorm instead of BatchNorm? Three hard reasons: ① sequence lengths vary and samples in a batch aren't aligned, so cross-batch statistics are ill-defined; ② BatchNorm is statistically noisy at small batch and behaves inconsistently between train and inference (inference uses running global statistics); ③ at inference the batch may be 1, and BatchNorm's batch statistics simply break. LayerNorm computes per sample, so training and inference are identical, fully decoupled from batch size and sequence length — a natural fit for NLP. BatchNorm remains the workhorse in CNN vision.
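
Reason ③ is easy to reproduce in miniature. The sketch below is plain Python modelling only the statistics (it is not `nn.BatchNorm1d`, and it has no γ/β or running averages); the two samples are made-up values.

```python
import math

def zscore(values, eps=1e-5):
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def batch_stat_norm(batch):
    """Normalize each feature (column) across the samples in the batch."""
    columns = [zscore(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_stat_norm(batch):
    """Normalize each sample (row) across its own features."""
    return [zscore(row) for row in batch]

sample = [1.0, 2.0, 3.0, 4.0]      # hypothetical 4-feature sample
other = [10.0, 0.0, -5.0, 7.0]     # hypothetical batch neighbor

alone_bn = batch_stat_norm([sample])[0]           # batch of 1
paired_bn = batch_stat_norm([sample, other])[0]   # same sample, with a neighbor
alone_ln = layer_stat_norm([sample])[0]
paired_ln = layer_stat_norm([sample, other])[0]

print(alone_bn, paired_bn)
print(alone_ln, paired_ln)
```

With a batch of one, every column's mean equals its only value, so batch statistics wipe the sample out to zeros, and adding a neighbor changes the result. The per-sample version returns the same output in both cases.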

Code example
import torch, torch.nn as nn

x = torch.randn(32, 512)   # [batch=32, features=512]

bn = nn.BatchNorm1d(512)   # across batch: per-feature statistics
ln = nn.LayerNorm(512)     # across features: per-sample statistics

# Verify the difference in stat dimension
# BatchNorm: each "column" (feature) pulled to 0 mean → mean(dim=0)≈0
print(bn(x).mean(dim=0).abs().max())   # ≈ 0
# LayerNorm: each "row" (sample) pulled to 0 mean → mean(dim=1)≈0
print(ln(x).mean(dim=1).abs().max())   # ≈ 0

# At inference with batch=1: BatchNorm needs eval() for global stats, LayerNorm doesn't care
bn.eval(); print(bn(torch.randn(1, 512)).shape)
Common pitfall + practical scenario
"Normalization works by removing 'internal covariate shift'" — that's the original paper's explanation, later disputed. Santurkar et al. (2018) argue normalization's real effect is more likely smoothing the loss landscape (steadier gradients, larger usable learning rate) rather than reducing distribution drift. So don't treat "internal covariate shift" as gospel — it's a contested motivating hypothesis.
📌 BigCat scenario: seeing "BatchNorm" on a model card basically tells you it's a vision model; "LayerNorm/RMSNorm" basically signals a Transformer/LLM — a reliable shortcut for classifying the architecture family, just like seeing "Raft/Paxos" tells you it's a consensus system.
Takeaway + question
💡 Normalization doesn't change information, only stabilizes scale; choosing BatchNorm vs LayerNorm is really asking "should statistics be computed across samples or within a sample" — decided by the task's batch/sequence properties.
🤔 BatchNorm quietly introduces coupling between samples at training time (one sample's normalization depends on others in the batch). That coupling is both a source of regularization and a breeding ground for bugs — where do you think it goes wrong?

Root Mean Square Norm (RMSNorm)

normalization · simplified · modern LLM
One-line analogy

RMSNorm is a stripped-down LayerNorm: the authors found that LayerNorm's "subtract the mean (re-centering)" step is actually dispensable, so they cut it, keeping only "divide by magnitude (re-scaling)." It's like removing a field from an RPC protocol you found nobody actually uses — saving serialization overhead with no accuracy loss. The entire LLaMA family and most modern LLMs switched to it.

Problem it solves + how it works

LayerNorm does two things: re-centering (subtract mean μ to center the data) and re-scaling (divide by std to unify magnitude). Zhang & Sennrich (2019) hypothesize that only re-scaling matters; re-centering is redundant. RMSNorm therefore divides only by the Root Mean Square (RMS): y = γ · x / RMS(x), where RMS(x) = √( (1/D)·Σxᵢ² ). Intuitively RMS is "how large this vector is overall" (average of squared components, then square root), and it needs no mean computation, no subtraction, and no β bias.

LayerNorm vs RMSNorm: which steps are saved

LayerNorm: compute μ → subtract μ → compute var → divide by σ → ×γ → +β

RMSNorm: (skip centering) → compute RMS → divide by RMS → ×γ (one fewer mean, one fewer subtract, no β)
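
A small numeric check of what is and isn't dropped, in plain Python with γ = 1 and β = 0 left out. The inputs are hypothetical; the point is only that the two norms coincide when the input already has zero mean and diverge when it does not.

```python
import math

def layer_norm(x, eps=1e-6):
    """Re-center, then re-scale by the standard deviation."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """Re-scale only: divide by the root mean square."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

centered = [-1.5, -0.5, 0.5, 1.5]        # hypothetical input, mean exactly 0
shifted = [v + 10.0 for v in centered]   # same shape, mean 10

same = max(abs(a - b) for a, b in zip(layer_norm(centered), rms_norm(centered)))
diff = max(abs(a - b) for a, b in zip(layer_norm(shifted), rms_norm(shifted)))

print(same, diff)
```

On zero-mean input the variance equals the mean square, so the outputs are identical. On shifted input RMSNorm keeps the offset while LayerNorm removes it; the hypothesis in the text is that the network does not need that removal, not that the two are numerically the same.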

Payoff: less compute, faster (the paper reports a notable speedup), with quality on par with — or better than — LayerNorm at large scale. In LLMs with hundreds of billions of parameters, where every layer normalizes countless times, this per-step saving is amplified into a meaningful drop in total training/inference cost — the practical reason for its wide adoption.

A related mechanism: where normalization goes. Early Transformers used Post-LN (norm after the residual add), which trained unstably when deep and needed warmup. Modern designs generally moved to Pre-LN (norm before the sub-layer), giving steadier gradients and removing the fiddly warmup — one of the key changes behind "why today's Transformers are easier to train."
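
The two placements differ only in where the norm sits relative to the residual add. A plain-Python sketch, using an RMS-style norm for brevity and a hypothetical sub-layer that outputs zeros so the residual path is easy to see:

```python
import math

def rms_norm(x, eps=1e-6):
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def post_ln_block(x, sublayer, norm=rms_norm):
    """Post-LN: normalize after the residual add."""
    return norm(add(x, sublayer(x)))

def pre_ln_block(x, sublayer, norm=rms_norm):
    """Pre-LN: normalize only the sub-layer input; the residual path is untouched."""
    return add(x, sublayer(norm(x)))

def zero_sublayer(x):
    """Hypothetical stand-in for attention/FFN that contributes nothing."""
    return [0.0 for _ in x]

x = [3.0, -4.0, 12.0]
pre_out = pre_ln_block(x, zero_sublayer)     # x passes through unchanged
post_out = post_ln_block(x, zero_sublayer)   # x is rescaled even with no update

print(pre_out, post_out)
```

In the Pre-LN form the input reaches the output through a pure identity path, which is one intuition for why gradients stay steadier as blocks are stacked; in the Post-LN form every block rescales the running signal.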

Code example
import torch, torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # only γ, no β
        self.eps = eps
    def forward(self, x):
        # RMS = sqrt(mean(x^2)); rsqrt returns 1/RMS directly. Note: no mean subtraction
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight   # divide by magnitude, then scale by γ

x = torch.randn(2, 512)
print(RMSNorm(512)(x).shape)   # torch.Size([2, 512])
# torch 2.4+ also has built-in nn.RMSNorm(512); use the official one in prod
Common pitfall + practical scenario
"RMSNorm is an approximation/degenerate LayerNorm, so it's weaker" — backwards. It's a deliberate simplification built on the "re-centering is redundant" hypothesis, and it measures as good or better while being faster. The pitfall's root is equating "fewer steps" with "less capable" — in deep learning, removing parts proven redundant is a common form of progress (Occam's razor).
📌 BigCat scenario: the RMSNorm story is a lovely engineering-philosophy sample — "question whether a step that's been carried along by default is actually necessary." That's the same instinct as cutting a habitual sync barrier in a distributed system and finding it faster yet still correct. Worth carrying into your own technical decisions.
Takeaway + question
💡 RMSNorm = LayerNorm minus "subtract the mean"; its wide adoption suggests that at huge scale, small per-step savings add up to a meaningful advantage.
🤔 "Re-centering is redundant" is an empirical hypothesis, not a theorem. Could it fail under a different architecture or data distribution? How would you design an experiment to verify "a default step can actually be removed"?

Deep Questions

1. Activations (inject non-linearity) and normalization (stabilize scale) seem like two separate things — why do they always appear paired, packed right next to each other?
Because they address complementary problems on the same data stream. Activations inject non-linearity for expressive power, but non-linear operators (especially saturating ones) are highly sensitive to input scale — too large or too small and the input lands in a near-zero-gradient region, making deep nets hard to train. Normalization pulls the data scale entering/exiting the activation back to the range where the activation works well; only with both does a deep net get expressive power and stable training. So a typical layer is the sandwich Norm → Linear → activation repeated. Another angle: the activation decides "how the signal transforms," normalization decides "at what scale the signal enters the next transform" — one governs shape, the other governs magnitude. Grasp this and you see why swapping the activation often forces a normalization re-tune: they're coupled on the same "gradient health" objective.
2. The ReLU → GELU → SwiGLU evolution has a hidden thread of "let the data decide the signal path." Is the attention mechanism the ultimate version of the same idea?
Yes: attention is the "apex" of this line. Rank these by "degree of data dependence": ReLU's gate is decided purely by sign (x>0 open, else closed), parameter-free, the same fixed rule for every sample; GELU's gate is smoothly decided by input magnitude, still element-wise, no interaction; SwiGLU's gate is computed by a learnable projection, opening based on input content, with multiplicative modulation across dimensions; attention generalizes the "gate" into a weight matrix computed dynamically from token-to-token similarity, so what each position attends to, and how much, is fully decided by current input content. The common thread: moving from a "fixed computation graph" to an "input-conditioned computation graph." This is why modern architectures are so expressive: they dynamically reassemble information pathways at inference based on data, rather than running a rigid fixed transform. BigCat, picture it as the evolution from a "static routing table" to "dynamic routing computed in real time from request content."
3. RMSNorm still works after dropping "subtract the mean" — what does this say about how many "taken-for-granted steps" in deep learning are actually redundant?
A deep lesson: many deep-learning defaults are products of historical inertia + one-off experience, not necessities derived from first principles. RMSNorm, Pre-LN replacing Post-LN, dropping bias terms, simpler position encodings replacing complex ones — recent improvements share a pattern: "question whether a default component is necessary, remove it, find it simpler and no worse." Behind this is a transferable research methodology: (a) ablation — systematically delete components one by one to see which can go harmlessly; (b) suspicion of "because everyone does it" designs; (c) re-evaluating small-scale conclusions at large scale (many defaults were set in the small-model era; with scale, conclusions can flip). But be honest: this doesn't make "simpler is always better" a law — simplification works because that step happened to be redundant; under a different data distribution/architecture it may not hold. The real skill is knowing how to verify whether a step is redundant, not worshipping minimalism. That ability to "dare to remove, yet rigorously verify it's right" is the divide between a senior engineer and a novice — in AI as in distributed systems.
4. BatchNorm makes one sample's output depend on others in the same batch, a coupling that is both regularization and a bug source. What engineering lesson does it share with "shared state" in distributed systems?
It's fundamentally the same problem: introducing implicit shared state brings coupling you didn't expect. BatchNorm's "side-effect regularization" is much like "performance you got by accident from some shared cache" in distributed systems — handy, but fragile and hard to reason about. Its bug modes map closely onto the classic shared-state traps: (a) train/inference inconsistency — training uses batch statistics, inference uses a running average, equivalent to bugs from "test and prod drawing state from different sources"; (b) noisy at small batch — when the shared sample size is too small the statistics distort, like a rate limiter whose threshold jitters at low traffic; (c) statistics must be synced across devices in distributed training (SyncBatchNorm) — literally a "distributed shared-state consistency" problem, with communication overhead and complexity; (d) outright failure in contrastive learning / small-batch tasks, where sample coupling pollutes representations that should be independent. The common lesson: shared state is a false "free lunch" — it often brings short-term gains and long-term debt. Part of why LayerNorm/RMSNorm won in the Transformer era is precisely that they have no inter-sample coupling and no hidden global state, behave identically in train and inference, and are easy to reason about and scale out — exactly the distributed wisdom that "stateless design composes better," echoing in deep learning.
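
Bug mode (a) can be shown with a toy single-feature batch norm in plain Python. This is a simplified sketch, not PyTorch's implementation: the momentum of 0.1 and the initial running statistics mirror common defaults, the variance estimator is the biased one throughout, γ/β are omitted, and the batch values are hypothetical.

```python
import math

class TinyBatchNorm:
    """Single-feature batch norm sketch with running mean/variance."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean, self.running_var = 0.0, 1.0
        self.training = True

    def __call__(self, batch):
        if self.training:
            mu = sum(batch) / len(batch)
            var = sum((v - mu) ** 2 for v in batch) / len(batch)
            # Exponential moving average of the batch statistics (hidden state).
            self.running_mean += self.momentum * (mu - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mu, var = self.running_mean, self.running_var
        return [(v - mu) / math.sqrt(var + self.eps) for v in batch]

bn = TinyBatchNorm()
batch = [10.0, 12.0, 8.0, 10.0]   # hypothetical activations for one feature
train_out = bn(batch)             # normalized with this batch's own statistics
bn.training = False
eval_out = bn(batch)              # normalized with the running statistics

print(train_out, eval_out)
```

The same input produces very different outputs in the two modes because the running statistics have barely moved after one step. That gap closes only after enough training batches, and it is exactly the "test and prod drawing state from different sources" trap described above.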