AI/ML Explained: Frontier Architectures

Day 34 · 2026-06-20

For: engineers with coding experience, non-AI background

Engineering counterpart → super-individual D16: Cost (inference-cost trade-offs of MoE / long context)

The thread of today

The standard Transformer has two congenital ailments: attention costs O(n²) in sequence length (double the sequence, quadruple the compute), and computation is dense (every token activates every parameter). Today's four concepts are two lines of attack on these ailments plus one workaround: MoE attacks "dense," growing parameters while holding per-token compute roughly fixed; SSM and Mamba attack "O(n²)" by borrowing RNN ideas to recover linear complexity; long context asks "how do we stretch n without changing the architecture?" They all answer the same question: when we want bigger and longer, what about the hardware bill?

Mixture of Experts (MoE)

sparse activation · conditional compute

One-line analogy

MoE is database sharding + query routing for a neural net. A dense model is like a single-node database that scans the whole table on every request; MoE splits one giant feed-forward layer (FFN) into N "expert shards" and adds a router that forwards each token only to its 1–2 most relevant shards. Total capacity grows linearly with shard count, while per-query compute stays roughly fixed — exactly sharding's core payoff: trade space for throughput, not compute for capacity.

What problem + how it works

Pain point: the most direct way to a stronger model is more parameters, but in a dense model parameter count and compute are welded together: double the params, double the FLOPs per token, and inference cost rises in lockstep. We want "huge capacity but cheap to run," and a dense architecture can't have both.

MoE's mechanism is conditional computation: replace each Transformer layer's FFN with N parallel small FFNs (experts) plus a lightweight router. For each token, the router computes a score distribution and activates only the top-k experts (usually k=1 or 2); the rest do zero work this step. So:

Total params (capacity) = N × per-expert size — can scale to trillions;
Active params (cost) = k × per-expert size — depends only on k, not N.
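The two formulas above can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `moe_param_counts` and the 100M-parameter expert size are assumptions chosen for illustration, and it counts only expert parameters (a real model also has shared attention and embedding weights).

```python
def moe_param_counts(n_experts: int, top_k: int, params_per_expert: int) -> tuple[int, int]:
    """Return (total, active) expert parameters for one MoE layer."""
    total = n_experts * params_per_expert   # capacity: every expert is stored
    active = top_k * params_per_expert      # cost: only the routed experts run
    return total, active

PER_EXPERT = 100_000_000  # hypothetical expert size, chosen only for illustration

for n in (8, 64, 512):
    total, active = moe_param_counts(n, top_k=2, params_per_expert=PER_EXPERT)
    print(f"N={n:>3}  total={total:>14,}  active={active:>12,}")
```

As N grows from 8 to 512 the total column grows 64-fold while the active column does not move, which is the decoupling the section describes.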

One token through an MoE layer (N=8 experts, top-2)

token → Router scores → pick the 2 highest

E1 ✓ | E2 | E3 | E4 ✓ | E5 | E6 | E7 | E8
└ only E1, E4 are activated & weighted-summed; the other 6 shards do zero compute ┘

capacity = 8 sets of params　|　cost ≈ 2 sets of params

The key design difficulty is load balancing: if the router keeps sending tokens to a few "star experts," the rest never train yet still occupy memory. Shazeer et al.'s foundational 2017 paper introduced an auxiliary balancing loss to penalize this skew; Switch Transformer (2021) further simplified k to 1, showing single-expert routing preserves quality and cuts routing overhead. Mixtral 8×7B (released in December 2023) turned this into an open-weight flagship: 8 experts, 2 activated per token.
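To make the balancing idea concrete, here is a sketch of an auxiliary loss in the style of the Switch Transformer formulation, N · Σ f_i · P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. It is not the exact loss from the 2017 paper, the function name `switch_balance_loss` is an assumption, and the weighting coefficient that scales this term in training is omitted.

```python
def switch_balance_loss(router_probs, assignments, n_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * P_i.

    router_probs: per-token list of softmax probabilities over experts
    assignments:  per-token index of the expert actually chosen (top-1)
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i
    f = [sum(1 for a in assignments if a == i) / n_tokens for i in range(n_experts)]
    # P_i: mean router probability given to expert i
    p = [sum(probs[i] for probs in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: 4 tokens spread evenly over 4 experts
balanced_probs = [[0.25] * 4 for _ in range(4)]
balanced = switch_balance_loss(balanced_probs, [0, 1, 2, 3], 4)

# Skewed routing: every token goes to expert 0 with high confidence
skewed_probs = [[0.97, 0.01, 0.01, 0.01] for _ in range(4)]
skewed = switch_balance_loss(skewed_probs, [0, 0, 0, 0], 4)

print(balanced, skewed)
```

The balanced case evaluates to 1.0 and the skewed case to about 3.88, so adding this term to the training loss pushes the router away from a "star expert."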

Code example

import torch, torch.nn.functional as F
# Minimal MoE layer: N experts + top-k routing (for intuition, not production)
class MoE(torch.nn.Module):
    def __init__(self, d=512, n_exp=8, k=2):
        super().__init__()
        self.k = k
        self.router  = torch.nn.Linear(d, n_exp)        # router: token → score per expert
        self.experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_exp)])

    def forward(self, x):                          # x: [tokens, d]
        scores = self.router(x)                       # score every token against all experts
        w, idx = scores.topk(self.k, dim=-1)        # keep only top-k experts
        w = F.softmax(w, dim=-1)                     # normalize weights over the chosen k
        out = torch.zeros_like(x)
        for j in range(self.k):                      # weighted sum of activated experts
            for e in range(len(self.experts)):
                m = idx[:, j] == e                    # mask of tokens routed to expert e
                if m.any(): out[m] += w[m, j:j+1] * self.experts[e](x[m])
        return out                                    # N=8 capacity, but each token computes only 2 experts

Common misconception + your scenario

Misconception: "MoE splits tasks by domain — a math expert, a code expert, a poetry expert." Wrong. Routing is learned at token granularity in latent space; expert specialization is uninterpretable and does not map to human domains — research finds routing correlates more with syntax/surface patterns than with "subjects." Think of experts as "hash buckets of a DB shard," not "microservices split by business line."

📌 BigCat scenario: when picking an open model for your workflow, don't misread "Mixtral 8×7B" as a "56B dense model." It has ~47B total params but only ~13B active per token — meaning its memory footprint is sized like 47B but its speed is like 13B. Grasping this "capacity/cost decoupling" is how you correctly estimate whether one GPU can run it and what throughput to expect.
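A back-of-the-envelope check for the "will it fit on one GPU" question might look like the following. It is only a sketch: it counts weight memory alone (no KV cache, activations, or framework overhead), uses the approximate parameter figures quoted above, and the helper name `weight_memory_gb` is an assumption.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

TOTAL_PARAMS = 47e9    # ~47B total, the figure quoted above
ACTIVE_PARAMS = 13e9   # ~13B active per token, the figure quoted above

for name, nbytes in (("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)):
    print(f"{name:>9}: weights ≈ {weight_memory_gb(TOTAL_PARAMS, nbytes):.1f} GB")

# Per-token compute scales with the active parameters, not the total
print(f"active fraction ≈ {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

At 16-bit precision the weights alone come to roughly 94 GB, so memory has to be budgeted for all experts even though only about a quarter of the parameters do work on any given token.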

Takeaway + question

💡 MoE's essence is decoupling "parameter capacity" from "compute cost" — the same engineering philosophy as sharding trading space for throughput.
🤔 In your distributed experience, the biggest pitfall of "route by key to a shard" is the hot shard. MoE's "star expert" is the same disease — how would you design the balancing mechanism?
Engineering counterpart → super-individual D16 (Cost: the inference-cost ledger of MoE)

State Space Models (SSM / S4)

linear complexity · sequence modeling

One-line analogy

An SSM processes a sequence like a streaming processor with a fixed-size memory. A Transformer reads the whole log into memory and compares everything pairwise (O(n²)); an SSM is like a Kafka consumer — read one record at a time, compress history into a fixed-dimension "state variable," then update the state and drop the raw record. Memory does not grow with stream length — exactly streaming's core edge over batch.

What problem + how it works

Pain point: attention must store every past token's Key/Value (KV cache grows linearly with length) and compute pairwise correlations (O(n²)). At tens or hundreds of thousands of tokens it's both slow and memory-hungry. Can we be "O(n) linear, constant memory" like an RNN, yet keep the Transformer's parallel training?

SSMs borrow classic equations from control theory. The core keeps a hidden state h, updated once per input x_t:

The SSM recurrence (two lines)

h_t = A·h_{t-1} + B·x_t # new state = decayed old state + injected input
y_t = C·h_t # output = read out from the state

Intuition: A is the "forget/retain matrix" — how old memory decays; B is the "write gate" — how much of the current input is injected; C is the "read gate" — what to extract from the state. This echoes LSTM gating, but the beauty of SSMs is: when A, B, C are input-independent (linear time-invariant), the whole recurrence can be unrolled mathematically into a convolution, so training runs fully parallel like a CNN, while inference updates step-by-step O(1) like an RNN — best of both worlds.
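The "unrolls into a convolution" claim can be verified on a scalar toy. Substituting the recurrence into itself gives y_t = Σ_k (c·a^k·b)·x_{t-k}, so the kernel K[k] = c·a^k·b can be computed once and applied to all positions independently. The sketch below uses a one-dimensional state and arbitrary values for a, b, c; real S4 uses matrices and an FFT-based convolution, neither of which is shown here.

```python
def ssm_recurrent(x, a, b, c):
    """Step-by-step mode: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t (scalar state)."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys

def ssm_convolution(x, a, b, c):
    """Unrolled mode: y_t = sum_k K[k] * x[t-k] with kernel K[k] = c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(x))]
    return [sum(kernel[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

x = [1.0, 2.0, -1.0, 0.5, 3.0]
a, b, c = 0.9, 0.5, 2.0     # arbitrary illustrative values
y_rec = ssm_recurrent(x, a, b, c)
y_conv = ssm_convolution(x, a, b, c)
print(all(abs(u - v) < 1e-9 for u, v in zip(y_rec, y_conv)))
```

Each output of `ssm_convolution` depends only on the inputs and the precomputed kernel, never on a previous output, which is what allows training to run in parallel.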

The difficulty: a naive A matrix numerically explodes or vanishes over long sequences (the same root as RNN gradient issues). Gu et al.'s 2021 S4 paper used a specially structured A matrix (HiPPO-theory initialization + low-rank correction) to fix this stability, letting SSMs for the first time beat Transformers across long-sequence benchmarks (Long Range Arena) and generate dozens of times faster.
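The explode-or-vanish problem is easy to see with a scalar stand-in for A. With no input, the state after n steps is a^n times its starting value, so anything slightly off 1 compounds quickly. This sketch only illustrates the problem; it does not implement the HiPPO-based fix, and the function name is an assumption.

```python
def state_norm_after(a: float, steps: int) -> float:
    """Magnitude of a unit initial state after `steps` updates h <- a * h (no input)."""
    return abs(a) ** steps

for a in (0.99, 1.00, 1.01):
    print(f"a={a:.2f}  after 1,000 steps: {state_norm_after(a, 1000):.3e}"
          f"  after 10,000 steps: {state_norm_after(a, 10000):.3e}")
```

A 1% deviation in either direction either erases the state or blows it up long before a sequence reaches tens of thousands of steps.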

Code example

import torch
# The SSM "recurrent mode" — showing constant-memory streaming (teaching version, not full S4)
def ssm_scan(x, A, B, C):
    # x: [seq_len, d]  A,B,C are learned state matrices
    h = torch.zeros(A.shape[0])          # hidden state: fixed dim, does not grow with seq_len
    ys = []
    for t in range(x.shape[0]):         # streaming, token by token
        h = A @ h + B @ x[t]            # decay old state + write new input
        ys.append(C @ h)                # read current output from the state
    return torch.stack(ys)
# Key: whether seq_len is 1k or 1M, h's size is unchanged → O(1) state memory (ys grows only because we keep every output)
# In training this recurrence unrolls into a convolution → fully parallel (omitted)

Common misconception + your scenario

Misconception: "SSMs are linear, so they beat Transformers everywhere." Not quite. A fixed-size state is a double-edged sword: it lossily compresses all history into one vector, so it's inherently weaker at "exactly recalling a specific distant token" (needle-in-a-haystack, verbatim copying) than attention, which can look back token by token. This is exactly the weakness the next concept, Mamba, targets (and narrows without fully closing).

📌 BigCat scenario: when evaluating "very long sequence" tasks (genomic sequences, long audio, whole books), remember this divide: need globally precise retrieval → attention wins; need efficient streaming over very long signals → SSM wins. The logic is isomorphic to your "batch vs stream processing" framework choice.

Takeaway + question

💡 SSMs beat O(n²) back to O(n) with "fixed state + streaming update," at the cost of lossily compressing history — the classic "constant memory vs precise look-back" trade-off.
🤔 Think of hidden state h as a database "materialized view": continuously updated incrementally, storing no raw events. When is a materialized view enough, and when must you query the raw log?

Mamba / Selective SSM

input-dependent · selectivity · 2023

One-line analogy

S4's state matrices are a static config — like a hard-coded cache policy that applies the same "keep/drop" rule regardless of the data. Mamba upgrades it to a content-aware dynamic cache: it makes the "write gate B, read gate C, and forget step" all functions of the current input. The cache policy can now judge for itself "this one matters, keep it longer; that one's noise, forget it fast" — upgrading from a fixed TTL to an adaptive TTL.

What problem + how it works

Pain point: the last section noted SSMs' Achilles' heel, "indiscriminate compression of history." Because A, B, C are fixed, the model cannot decide by content whom to remember and whom to ignore; this is the lack of "content-based reasoning." A task like "skip all whitespace, keep only content words" is something a fixed SSM handles poorly, because the right keep/drop decision depends on each token's content.

Mamba's (Gu & Dao, 2023) core innovation is the selection mechanism: it makes the SSM's B, C, and discretization step Δ functions computed from input x_t. So every token can dynamically control "how much I write into the state, how much I read, how fast I decay old memory." Intuitively, the model gains selective memory: reset/strongly-write the state on key info, let noise flow past.

S4 vs Mamba: do parameters vary with input?

S4: A, B, C fixed → updates state identically for every token
Mamba: B, C, Δ = f(x_t) → each token decides how much/how long to keep

cost: params vary with input → convolution unroll fails → no direct parallelism
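The contrast in the box can be reproduced with a scalar toy. The sketch below uses the zero-order-hold discretization of dh/dt = -h + x (that is, A = -1, B = 1), which gives h_t = e^{-Δ}·h_{t-1} + (1 - e^{-Δ})·x_t, and compares a fixed Δ with a Δ computed from the token. Everything specific here is an assumption made for illustration: the (value, flag) token format, the hand-set gain of 10, and the function names. In Mamba itself Δ comes from a learned projection of x_t, and B and C are input-dependent as well.

```python
import math

def softplus(z: float) -> float:
    return math.log1p(math.exp(z))

def scan(tokens, delta_fn):
    """Run h_t = exp(-Δ_t) * h_{t-1} + (1 - exp(-Δ_t)) * value_t over the sequence.

    tokens:   list of (value, flag) pairs; flag is +1 for "important", -1 for "noise"
    delta_fn: maps a token to its step size Δ_t
    """
    h = 0.0
    for value, flag in tokens:
        delta = delta_fn((value, flag))
        keep = math.exp(-delta)            # how much old state survives this step
        h = keep * h + (1.0 - keep) * value
    return h

tokens = [(5.0, +1), (99.0, -1), (99.0, -1)]   # one key value followed by noise

fixed = scan(tokens, lambda tok: 1.0)                          # S4-like: same Δ for every token
selective = scan(tokens, lambda tok: softplus(10.0 * tok[1]))  # Mamba-like: Δ depends on the token

print(f"fixed Δ: h = {fixed:.2f}   selective Δ: h = {selective:.2f}")
```

With a fixed Δ the noise overwrites most of the state, while the input-dependent Δ writes the key value strongly and lets the noise pass almost untouched, leaving the state close to 5.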

But here's the engineering paradox: once parameters depend on input, the system is no longer "linear time-invariant," so the "unroll into a convolution for parallel training" trick stops working. Mamba's second contribution is a hardware-aware parallel scan algorithm: a prefix-sum-like parallel primitive plus carefully designed GPU memory I/O (in the spirit of FlashAttention, keeping intermediate state in fast SRAM), so this "input-dependent recurrence" still runs efficiently in parallel on a GPU. The paper reports Mamba reaching ~5× the inference throughput of a same-size Transformer in language modeling, with cost scaling linearly in sequence length and quality continuing to improve up to million-length sequences.
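Why a recurrence can be parallelized at all: each step h → a_t·h + b_t is an affine map, and composing affine maps is associative, so prefixes can be combined in a tree instead of one after another. The sketch below shows this with a Hillis-Steele style scan in plain Python. It demonstrates the algebra only; it is not Mamba's fused GPU kernel, and the function names are assumptions.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b, applying `left` first and `right` second."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(steps):
    """Reference: run the recurrence h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = 0.0, []
    for a, b in steps:
        h = a * h + b
        out.append(h)
    return out

def parallel_style_scan(steps):
    """Hillis-Steele inclusive scan: about log2(n) rounds, each round parallelizable."""
    items = list(steps)
    offset = 1
    while offset < len(items):
        # Every position in this round depends only on the previous round's values
        items = [
            combine(items[i - offset], items[i]) if i >= offset else items[i]
            for i in range(len(items))
        ]
        offset *= 2
    return [b for _, b in items]   # with h_0 = 0, the state is the offset term b

steps = [(0.9, 1.0), (0.5, 2.0), (0.1, -1.0), (0.8, 0.5), (0.7, 3.0)]  # input-dependent (a_t, b_t)
seq = sequential_scan(steps)
par = parallel_style_scan(steps)
print(all(abs(u - v) < 1e-9 for u, v in zip(seq, par)))
```

Because the per-step coefficients may differ at every position, this works for input-dependent parameters, which is exactly the case the convolution trick cannot handle.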

Since 2024 the practical consensus is hybrid architectures: interleave a few attention layers with many Mamba layers — use attention to patch the "precise recall" weakness and Mamba for "long-sequence efficiency," taking the best of both.
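As a rough picture of what "a few attention layers among many Mamba layers" means structurally, a layer layout could be generated like this. The 1-in-8 ratio, the 16-layer depth, and the function name `hybrid_layout` are hypothetical choices for illustration, not taken from any specific model.

```python
def hybrid_layout(n_layers: int, attention_every: int) -> list[str]:
    """Place one attention layer after every (attention_every - 1) Mamba layers."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

layout = hybrid_layout(n_layers=16, attention_every=8)   # hypothetical ratio, for illustration only
print(layout.count("mamba"), "Mamba layers,", layout.count("attention"), "attention layers")
```

Only the attention layers carry a KV cache that grows with sequence length, so the ratio directly trades recall ability against long-sequence memory cost.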

Code example

# Official implementation: pip install mamba-ssm (needs CUDA)
import torch
from mamba_ssm import Mamba

batch, seq_len, dim = 2, 4096, 512
x = torch.randn(batch, seq_len, dim).to("cuda")

model = Mamba(
    d_model=dim,     # input/output dimension
    d_state=16,      # hidden-state dim — fixed, independent of seq_len
    d_conv=4,        # local conv window (captures short-range patterns)
    expand=2,        # internal expansion factor
).to("cuda")

y = model(x)           # output same shape as input [2, 4096, 512]
print(y.shape)        # B, C, Δ are computed from x internally → selective memory
# However long the sequence, d_state stays 16, and the layer runs efficiently via the parallel scan

Common misconception + your scenario

Misconception: "Mamba is out, the Transformer is obsolete." Overstated. The 2024–2025 evidence: pure Mamba still trails attention on tasks needing precise in-context retrieval (in-context learning, copying long strings); leading frontier models take a hybrid route, not a wholesale replacement. Treat it as "a handy new weapon in the toolbox," not a silver bullet.

📌 BigCat scenario: in cross-disciplinary thinking, Mamba's "selectivity" is a lovely cognitive metaphor — the brain's attention is also "selective writing to long-term memory," with the vast majority of sensory input forgotten instantly. When you design your own "information-flow filter" (what to read, keep, forget each day), Mamba's Δ (adaptive forget step) is a borrowable design variable.

Takeaway + question

💡 Mamba = giving SSMs "content-aware selective memory" + "hardware-aware parallel scan," letting a linear architecture approach attention on language for the first time.
🤔 Attention is "losslessly keep all history, retrieve on demand"; Mamba is "selectively compress history, update streaming." Isn't this the database debate of "keep full detail vs maintain an aggregate state"? Will both paradigms coexist long-term, or will one win?

Long Context

extrapolation · position encoding

One-line analogy

"Stretch context from 4K to 1M" is not flipping a config flag; it's more like horizontally scaling a system designed for small data: you can't just change one max_length line — you must simultaneously solve compute (O(n²) blowup), memory (KV cache grows linearly), and generalization (the model never saw positions this far out) — just as scaling a database from single-node to distributed redoes indexing, caching, and consistency all at once.

What problem + how it works

Pain point: the original Transformer trains at a fixed length (e.g. 2K, 4K); feeding it 100K tokens directly hits three barriers. Long context is a collection of techniques, each attacking one:

① Compute wall (O(n²)): attention grows quadratically with length. Solutions: efficient attention kernels (e.g. FlashAttention, which tiles the computation and never materializes the full n×n score matrix, cutting attention memory from O(n²) to O(n); the FLOPs stay O(n²), but far fewer slow memory reads and writes make it much faster in practice), plus linear architectures like the SSM/Mamba above.
② Memory wall (KV cache): every past token's Key/Value must stay resident in memory. Solutions: GQA / MQA (multiple attention heads share one set of K/V, shrinking the cache several-fold), KV cache quantization, etc.
③ Generalization wall (position extrapolation): the model never saw "the 500,000th position" during training, so position encoding fails. This is the subtlest one — expanded below.
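The tiling trick behind item ① rests on an "online softmax": keys are processed block by block while only a running maximum, a running denominator, and a running weighted sum are kept, so the full row of attention weights never has to exist at once. The sketch below shows this for a single query with scalar values; it illustrates the arithmetic only, is not FlashAttention's actual kernel, and the function names and numbers are made up.

```python
import math

def naive_attention(scores, values):
    """Reference: materialize all softmax weights, then take the weighted sum."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def tiled_attention(scores, values, block=2):
    """Process keys block by block, keeping only a running max, denominator, and sum."""
    running_max = float("-inf")
    denom = 0.0
    acc = 0.0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        rescale = math.exp(running_max - new_max)   # shrink earlier partial results
        exps = [math.exp(s - new_max) for s in s_blk]
        denom = denom * rescale + sum(exps)
        acc = acc * rescale + sum(e * v for e, v in zip(exps, v_blk))
        running_max = new_max
    return acc / denom

scores = [0.3, 2.0, -1.0, 4.5, 0.0]   # one query's scores against 5 keys (made-up numbers)
values = [1.0, -2.0, 0.5, 3.0, 10.0]  # scalar values, to keep the sketch small
print(abs(naive_attention(scores, values) - tiled_attention(scores, values)) < 1e-9)
```

The working set in `tiled_attention` is one block plus three scalars regardless of how many keys there are, which is where the O(n) memory figure comes from.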

The mainstream fix for the third revolves around RoPE (Rotary Position Embedding) (Day 13 covered its mechanism). RoPE encodes position as a "rotation angle"; higher-frequency dimensions rotate faster. Extrapolating directly to unseen lengths produces rotation angles outside anything seen in training, most visibly in the slowly rotating low-frequency dimensions, many of which never complete a full turn within the training window, and this confuses the model. Position Interpolation's trick: rather than show the model unseen large angles, compress all positions proportionally back into the trained range, like re-marking a ruler designed for 30cm so it measures 3 meters: the ticks get denser but all stay within the "known interval," needing only a little fine-tuning.

The three walls of long context and their counters

① Compute O(n²) → FlashAttention / linear SSM architectures
② Memory KV cache → GQA/MQA shared K/V · cache quantization
③ Position extrapolation → RoPE + position interpolation / NTK scaling

long context = clear all three walls at once, none can be skipped
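To put numbers on the memory wall (②), the KV cache for one sequence is roughly 2 (K and V) × layers × KV heads × head dimension × sequence length × bytes per value. The model shape below is hypothetical and chosen only to make the arithmetic round; the function name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes to cache Keys and Values for one sequence (the leading 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, chosen only to make the arithmetic concrete
cfg = dict(n_layers=32, head_dim=128)
seq_len = 128 * 1024     # 131,072 tokens

mha = kv_cache_bytes(seq_len, n_kv_heads=32, **cfg)   # every head keeps its own K/V
gqa = kv_cache_bytes(seq_len, n_kv_heads=8, **cfg)    # groups of 4 query heads share K/V
mqa = kv_cache_bytes(seq_len, n_kv_heads=1, **cfg)    # all heads share one K/V

GIB = 1024 ** 3
print(f"MHA: {mha / GIB:.0f} GiB   GQA: {gqa / GIB:.0f} GiB   MQA: {mqa / GIB:.0f} GiB")
```

For this shape at 16-bit precision, a single 128K-token sequence needs 64 GiB of cache with full multi-head attention, 16 GiB with 8 shared KV heads, and 2 GiB with one, which is why K/V sharing is close to mandatory at long lengths.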

Code example

import torch
# Core idea of position interpolation: "compress" very long positions back into the trained range
train_len   = 4096        # model's original training length
target_len  = 32768       # length we want to extend to
scale = train_len / target_len     # scaling factor = 1/8

pos = torch.arange(target_len).float()
# Key step: position ×scale → squeeze [0, 32768) back into the "known" [0, 4096)
pos_interpolated = pos * scale     # model only needs to "know" positions it has seen

# These interpolated positions feed RoPE to compute rotation angles (RoPE: see Day 13)
# In practice: after interpolating, fine-tune briefly on long text to adapt stably
print(pos_interpolated.max())   # tensor(4095.8750): the largest position still falls below 4096
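To see the effect on the rotation angles themselves, the sketch below computes RoPE angles using the common formulation angle_i = position × base^(-2i/d) with base 10000, and compares direct extrapolation against interpolation. The head dimension of 64 and the function name `rope_angles` are assumptions for illustration.

```python
def rope_angles(position: float, head_dim: int, base: float = 10000.0):
    """Rotation angle of each 2-D pair: position * base**(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

train_len, target_len, head_dim = 4096, 32768, 64   # head_dim is a hypothetical choice
scale = train_len / target_len

bound = rope_angles(train_len, head_dim)                      # upper bound on angles seen in training
raw = rope_angles(target_len - 1, head_dim)                   # last position, direct extrapolation
interp = rope_angles((target_len - 1) * scale, head_dim)      # last position, interpolated

print("raw exceeds trained range:   ", all(r > b for r, b in zip(raw, bound)))
print("interp within trained range: ", all(p < b for p, b in zip(interp, bound)))
```

Without scaling, the last position's angle exceeds the trained bound in every dimension; after interpolation every angle falls back under it, at the price of neighbouring positions sitting 8× closer together.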

Common misconception + your scenario

Misconception: "Supports 1M context = can use 1M context well." Distinguish two things: "the architecture can fit it" is physical capacity; "the model actually sees the middle" is effective utilization — the latter is constrained by Day 8's Lost in the Middle, where recall drops sharply for the middle of a long context. "Max window" is a spec ceiling, not actual usable quality. (Note: how to arrange context and when to use RAG instead are engineering topics — see super-individual.)

📌 BigCat scenario: when a launch event shouts "2M-token context," you can now unpack the engineering cost behind it: compute scaling as O(n²) or linearly, memory growing with the KV cache, positions stretched via interpolation. This lets you ask the right questions when choosing ("what's the effective length? what's the latency and price for long inputs?") rather than being bluffed by the spec number.

Takeaway + question

💡 Long context isn't one technique; it's an ensemble of "efficient attention + KV compression + position extrapolation" — fail any one wall and the length won't hold.
🤔 Across today's three lines of attack: MoE targets "parameter-capacity cost," SSM targets "sequence-compute cost," and long context targets "sequence-length cost." If you could bet on only one line in your product, how would you choose based on your business's "sequence characteristics"?
Engineering counterpart → super-individual D16 (Cost: the cost & latency ledger of long context)

Deep Questions

1. MoE and SSM/Mamba attack two different "ailments" of the Transformer — can they stack?

Yes, and this is a frontier direction. They're orthogonal: MoE optimizes the "per-layer width axis" — swapping a dense FFN for sparse experts, decoupling capacity from compute; SSM/Mamba optimizes the "sequence-length axis" — swapping O(n²) for O(n) recurrence. One governs "how many params per step," the other "how the sequence is scanned"; they don't conflict. Research already puts MoE inside Mamba/hybrid architectures: big capacity, long sequences, cheap steps. For your distributed intuition: it's "sharding (MoE)" stacked with "stream processing (SSM)" — two scaling strategies meant to combine, not to pick from. The difficulty is tuning both the load balancing and the hardware kernels together.

2. Why did "hybrid architectures" (few attention + many Mamba) become mainstream rather than pure Mamba?

The core is complementary capability: Mamba's fixed state is inherently weak at "precisely recalling a specific distant token," while attention excels at "lossless look-back + on-demand retrieval." Evidence repeatedly shows pure SSMs drop points on in-context learning and needle-in-a-haystack tasks. Hybrids use a few attention layers to patch the weakness and many Mamba layers for efficiency. This "a few critical components carry overall quality" pattern is everywhere: a few indexes cover most queries, a few hot caches serve most requests, a few coordination nodes govern many data nodes. The deeper insight: heterogeneity often beats homogeneity — placing components of different cost/ability where each excels is cheaper than "one universal component everywhere."

3. SSMs "lossily compress" history; attention "losslessly keeps" it. From an information-theory view (Day 33), is lossless always better?

The essence is an information bottleneck trade-off. Lossless isn't always better: (1) real sequences are highly redundant — most tokens are useless for future prediction, so keeping them losslessly is waste, and lossy compression acts like implicit denoising/regularization; (2) cognitive analogy: the brain is also aggressively lossy, discarding most sensory input instantly — not a defect but a feature; forgetting enables abstraction and generalization. The real question is never "lossless vs lossy" but "is what's dropped the right thing to drop" — exactly the value of Mamba's selection mechanism: it makes "what to drop" learnable and content-aware. Same for designing your info flow: the goal isn't "remember everything" but "intelligently decide what to remember."

4. MoE routing is "uninterpretable, not mapped to human domains" — should we accept "effective but uninterpretable" mechanisms?

A pragmatic stance splits by context: (1) capability level — as long as balancing mechanisms keep it working well, "uninterpretable but effective specialization" is acceptable, just as we used not-fully-understood stochastic gradient descent for decades; (2) high-risk level — in medicine, justice, alignment, uninterpretability is real risk, needing Day 27's mechanistic-interpretability tools for post-hoc probing. The deeper question: is "interpretability" a human-centric obsession? Human neurons don't map to "subjects" either, and forcing AI to think in human categories may limit ability. The healthy stance may be: pursue "supervisable and steerable" rather than "fully understandable" — being able to correct it when it errs matters more, and is more realistic, than explaining every step.

5. Stepping back: how does architecture innovation relate to "Scaling Law's brute-force compute" (Day 14)? Which matters more?

They're complementary legs. Scaling Law is "same architecture, add compute/data → steady gains" — going further along a fixed path; architecture innovation (MoE/Mamba/long context) changes the path's slope — raising the "ability per unit of compute" exchange rate. Scaling is the engine, architecture is the gearbox. When dense scaling's marginal cost hits the wall (compute/energy/memory), innovations like MoE become the lifeline — buying back more scaling room with sparsity/linearity. Currently (2026) it's both legs together. As a metaphor for the "super-individual": both "work harder with existing methods (scaling)" and "switch to a more efficient method (architecture innovation)" are needed — but when you feel diminishing returns, it's often the signal to switch architecture rather than "push harder."
