AI/ML Deep Dive: LLM Foundations

Day 1 · 2026-05-19
For: engineers with coding experience but no AI background
Engineering counterpart → super-individual D2: Context Engineering (engineering Attention within the context window)

Transformer Architecture

LLM · Theory
One-line intuition

Trade the "read one word at a time" serial pipeline for a "whole-team-in-one-meeting" parallel architecture — every token sees the full text at once before deciding what to say.

What problem it solves

Before 2017, RNNs and LSTMs dominated text processing. They were essentially a for-loop: you couldn't compute token t+1 until you finished token t. Two pain points followed: (1) training couldn't be parallelized across positions, so it was slow; (2) long-range dependencies tended to fade away (vanishing gradients). The Transformer uses self-attention to connect any two positions directly in a single step, so GPUs can process all positions in parallel and long-range signal no longer has to survive a long chain of recurrent updates.

How it works (intuition)

Picture each token as an object with three properties: Query (what I'm looking for), Key (what I can offer), Value (what I actually carry). Each layer roughly does this:

# Pseudocode: one Transformer block (LayerNorm omitted for brevity)
for token in sequence:
    # 1. Score Q against every Key in the sequence (like a database JOIN)
    scores = token.Q @ all_tokens.K.T / sqrt(d)
    weights = softmax(scores)
    # 2. Pull others' Value vectors back, weighted, and add the residual
    token.mid = token + weights @ all_tokens.V
    # 3. Pass through a feed-forward layer (a per-token nonlinear transform)
    token.out = token.mid + FFN(token.mid)  # second residual connection

Stack anywhere from 12 of these blocks (BERT-base, GPT-2 small) to 96 (GPT-3) or more. Each layer lets every token "re-ask the whole sequence," and the representation gets richer with every pass.
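
The block described above can be made concrete with a small NumPy sketch. It assumes a single attention head, random untrained weights, a pre-norm layout, and hypothetical names (`transformer_block`, `layer_norm`, the `params` dict); real models add multiple heads, learned norm parameters, and masking.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance (no learned scale/shift)
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def transformer_block(x, p):
    # x: [seq_len, d_model]; p: dict of weight matrices (hypothetical layout)
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # [seq_len, seq_len]
    x = x + (weights @ v) @ p["Wo"]                      # attention + residual
    h = layer_norm(x)
    ffn = np.maximum(0.0, h @ p["W1"]) @ p["W2"]         # two-layer ReLU feed-forward
    return x + ffn                                       # feed-forward + residual

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model), "Wv": (d_model, d_model),
          "Wo": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, d_model))
for _ in range(4):        # stack four blocks (weights shared here only for brevity)
    x = transformer_block(x, params)
print(x.shape)            # (5, 16)
```

The shape going in equals the shape coming out, which is what makes stacking possible: each block refines the same [seq_len, d_model] table of token vectors.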

Code example
# Load a Transformer with the transformers library — the whole picture in 10 lines
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Transformers changed NLP.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.last_hidden_state shape: [batch, seq_len, hidden_dim]
# Each token is now a 768-dim vector encoding its meaning in context
print(out.last_hidden_state.shape)  # torch.Size([1, seq_len, 768]); seq_len counts [CLS], [SEP], and every subword piece
Common misconception
Myth: "Transformer = ChatGPT." Not quite. The Transformer is just the skeleton. The original paper targeted translation (Encoder-Decoder). GPT uses a Decoder-only variant; BERT uses Encoder-only. Architecture is the foundation — behavior comes from training objective and data.
Where you see it

Canonical: every modern LLM (GPT, Claude, Gemini, Llama) is a Transformer. Your day-to-day: when you paste a 5,000-word meeting transcript into Claude and ask for a summary, the Transformer's parallel attention is what lets the model scan the whole document in one pass instead of word by word.

English Summary
Transformers replaced recurrent loops with parallel self-attention, letting every token attend to every other token in one pass. This unlocked GPU-scale training and is the backbone of all modern LLMs (GPT, Claude, Llama).
Think it through
Transformers beat RNNs because of parallelism — clearly true at training time. But at inference time (autoregressive generation), can they still parallelize? Why or why not?
Not across tokens. Generating token N+1 requires token N first, so inference collapses back into a serial loop. That's why LLM inference is slow and needs KV Cache and Speculative Decoding to claw performance back. But there's a key nuance: at training time the whole sequence is already known, so all N positions are processed in one parallel pass (the total attention work is still O(n²), it just isn't serialized); at inference, even though the outer loop is serial, the attention inside each step still runs in parallel over every cached position. The takeaway: modern inference optimization isn't about "parallelizing tokens," it's about "shrinking the cost per step": reuse the KV cache, cut memory traffic with FlashAttention's tiled attention kernel, and so on.
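A minimal NumPy sketch of the KV-cache idea mentioned above, assuming a single head and random vectors. For brevity Q, K, and V are generated up front; in a real model each row would be computed from the newly generated token at its step.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    # Full recompute: every position attends to itself and earlier positions
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

full = causal_attention(Q, K, V)

# Incremental decoding: keep K and V of past tokens, handle one new query per step
k_cache, v_cache, steps = [], [], []
for t in range(n):
    k_cache.append(K[t])
    v_cache.append(V[t])
    Kc, Vc = np.stack(k_cache), np.stack(v_cache)   # [t+1, d]
    w = softmax(Q[t] @ Kc.T / np.sqrt(d))           # one row of weights, cost O(t)
    steps.append(w @ Vc)

print(np.allclose(full, np.stack(steps)))   # True
```

The outer loop is still serial, but each step only computes one new row of attention instead of redoing the whole matrix.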
If attention can connect any two positions, why stack 96 layers? Wouldn't one suffice? Compare it to chained SQL JOINs.
One layer isn't enough. A single attention layer does a one-hop lookup: A can read B and C as they were before the layer ran, but it can't use what B learned from C in that same layer. Multiple layers compose hops: at layer 2, B has already absorbed C, so A can pull C's information through B; at layer 3, A reaches even further. It's like chained SQL JOINs: a single JOIN connects two tables, but "what does my friend's friend's friend like" takes a chain of them. Depth promotes the model from "matching" to "reasoning chains." Returns do diminish, and depth competes with width, data, and compute for the same budget. That's why even Llama-3 405B stops at 126 layers instead of stacking forever.
Transformer's O(n²) cost is original sin — what application boundaries does it set? How does GPT-4 still serve 128K context?
Naive Transformers struggle with long sequences: double the length, quadruple the attention compute and memory. That put whole-genome analysis or full-book modeling out of reach. 128K context works thanks to a stack of engineering fixes: (1) FlashAttention removes memory-traffic bottlenecks so attention is affordable; (2) GQA/MQA share KV across heads to cut memory pressure; (3) RoPE, typically paired with position-scaling methods, stretches the usable position range; (4) chunked prefill and KV-cache reuse at inference time. Caveat: "fits in 128K" is not "uses 128K well." Many models lose retrieval accuracy mid-context (the lost-in-the-middle problem). When engineering bypasses an architectural bottleneck, it usually introduces new capability traps.
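The quadratic growth is easy to check with arithmetic. This sketch assumes 32 heads and fp16 scores (illustrative numbers, not a specific model) and counts only the memory needed to materialize the score matrices of one layer; `attention_score_bytes` is a hypothetical helper.

```python
def attention_score_bytes(seq_len, n_heads=32, bytes_per_value=2):
    # One seq_len x seq_len score matrix per head, for one layer, stored in fp16
    return n_heads * seq_len * seq_len * bytes_per_value

for seq_len in (4_096, 8_192, 131_072):
    gib = attention_score_bytes(seq_len) / 2**30
    print(f"{seq_len:>7,} tokens -> {gib:,.0f} GiB of scores per layer")
# 4,096 -> 1 GiB, 8,192 -> 4 GiB, 131,072 -> 1,024 GiB
```

This is the cost FlashAttention avoids: it computes the same result in tiles without ever storing the full matrix.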
BERT (Encoder-only), GPT (Decoder-only), T5 (Encoder-Decoder) — why is Decoder-only the only survivor by 2024?
The win comes from task unification. Decoder-only uses next-token prediction as a single universal interface — translation, QA, summarization, classification all become "continue the text." That unification compounds best with scale. Encoder-only (BERT) understands but can't generate, so its surface area is narrow. Encoder-Decoder (T5) doubles parameter count for limited gain. There's also infrastructure pull: KV-cache-friendly autoregressive inference has a mature stack. But post-2024, small specialist models (embeddings, rerankers) are returning to Encoder designs — architecture choice is task-driven, and "winner takes all" isn't the final word. See Hyung Won Chung's "Decoder-only is all you need" talk.
Residual connections + LayerNorm look like engineering trivia, but remove them and training collapses. What does this reveal about why deep learning can scale at all?
They're the gradient highway. Residuals let gradients flow across layers without multiplicative decay; LayerNorm keeps per-layer input distributions stable. Without them, a 96-layer network's gradients either vanish or explode — it simply can't converge. The deeper insight: deep learning scales not because "deeper networks are stronger," but because "we can train deeper networks stably." That's a subtle shift — the Transformer's success is half thanks to its auxiliary plumbing (Norm + Residual). Analogous to distributed systems: they scale not because "more machines," but because of invisible infrastructure like consensus protocols, heartbeats, and failover.

Attention Mechanism

LLM · Core
One-line intuition

Like a SQL JOIN — every word carries its own "query," matches it against the full text, finds the most relevant terms, and pulls a weighted copy of their content back.

What problem it solves

"The cat sat on the mat because it was warm." Does "it" refer to the mat or the cat? Traditional models could only guess by proximity. Attention lets the position holding "it" actively poll the whole sentence: who is most relevant to me? It then pulls in the mat's vector with high weight, baking "it = mat" into its own representation. The essence: solving long-range coreference and contextual focus.

How it works (intuition)

The core formula: Attention(Q,K,V) = softmax(Q·Kᵀ / √d) · V. In plain English: take dot products between Q and every Key to get similarity, normalize into weights, then take a weighted sum of all Values. A weighted lookup, nothing more.

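# Attention in a few lines of numpy
import numpy as np
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled similarity, shape [n_q, n_k]
    scores -= scores.max(axis=-1, keepdims=True)     # subtract the row max so exp() cannot overflow
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # weighted sum of values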

Multi-Head Attention = running several independent query channels in parallel, each head focusing on a different dimension of relation (syntax, semantics, coreference, etc.).
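
A rough NumPy sketch of the multi-head idea, assuming random untrained projection matrices and a hypothetical function name `multi_head_attention`. It shows only the mechanics: project once, split the feature axis into heads, attend per head, then concatenate.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # [seq_len, d_model] -> [n_heads, seq_len, d_head]
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # [n_heads, seq_len, seq_len]
    weights = softmax(scores)                              # each head gets its own weights
    heads = weights @ v                                    # [n_heads, seq_len, d_head]
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return merged @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 16, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(out.shape, weights.shape)   # (4, 16) (8, 4, 4)
```

Note that the parameter count is the same as for one 16-dimensional head; the split only changes how the dimensions are grouped when scores are computed.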

Code example
# See what GPT-2 is actually attending to
from transformers import GPT2Tokenizer, GPT2Model
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tok("The cat sat on the mat", return_tensors="pt")
out = model(**inputs)
# out.attentions[layer][batch, head, query_pos, key_pos]
attn = out.attentions[-1][0, 0]   # last layer, head 0
print(attn.shape)  # torch.Size([6, 6]); row i holds token i's weights over tokens 0..i (GPT-2 is causal, so later positions get 0)
Common misconception
Myth: "Attention is just a weighted average." The difference is that the weights are computed dynamically from the input — they're not fixed parameters. Every new sentence reshuffles them. That's exactly what lets the model "understand context."
Where you see it

Canonical: in machine translation, each target word uses attention to look at the corresponding source words. Your day-to-day: when you read a bedtime story to your kid through Claude, the reason the model remembers that "little fox" from paragraph 1 is still the same fox in paragraph 3 is attention building coreference links across paragraphs.

English Summary
Attention computes a weighted sum of values, where weights come from query-key similarity. It lets each token dynamically pull context from any position, replacing fixed locality with content-based routing.
Think it through
Comparing attention to a SQL JOIN — where does the analogy break? And what does KV Cache correspond to in this analogy?
It breaks in three places: (1) SQL JOIN does discrete equality/range matching, while attention does continuous similarity weighting — every Key participates, just at different weights; (2) JOIN returns a table, attention returns an aggregated vector; (3) JOIN happens once, attention re-runs across many stacked layers. KV Cache corresponds to a materialized view: cache previously computed K and V, and new tokens only do incremental updates instead of recomputing the whole table. The value of the analogy isn't precision — it's building the intuition of content routing. Like grep or JOIN, attention is "look up things by content" at a different granularity and a different math.
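The first break in the analogy (discrete match versus continuous weighting) can be sketched with a toy "soft lookup." The `soft_lookup` helper and its `sharpness` parameter are hypothetical, introduced only to show that an exact-match lookup is the limiting case of attention.

```python
import numpy as np

def soft_lookup(query, keys, values, sharpness=1.0):
    # Every key gets a weight; nothing is filtered out the way a JOIN predicate would
    scores = sharpness * (keys @ query)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values, weights

keys = np.eye(3)                        # three orthogonal "rows" to match against
values = np.array([10.0, 20.0, 60.0])   # payload carried by each row
query = np.array([0.0, 1.0, 0.0])       # identical to key 1

blended, w_soft = soft_lookup(query, keys, values, sharpness=1.0)
exact, w_sharp = soft_lookup(query, keys, values, sharpness=50.0)

print(w_soft.round(3), round(float(blended), 2))   # weights near [0.212 0.576 0.212]: a blend
print(w_sharp.round(3), round(float(exact), 2))    # weights near [0. 1. 0.]: behaves like an exact match
```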
Q, K, V are all linear projections of the token embedding. Why not just have the embedding attend to itself? What's the essential payoff of training three separate matrices?
Role separation. The features a word wants when querying are usually different from the features it offers when being queried. "It" as Q cares about "who am I referring to?"; "it" as K cares about "am I a candidate antecedent?" If a single vector served both, the model would have to compromise between asking and answering. Three matrices grant the freedom to learn asymmetric relations — A attending to B doesn't imply B attends to A. Same idea as separating primary keys, foreign keys, and business columns in database design: physically they could be merged, but separation lets indexing, querying, and modification each optimize independently. Note: later architectures (MQA/GQA) share K and V partially to save memory — an engineering trade in the other direction.
Why divide by √d? What happens if you skip it? How does this mirror normalization in database index design?
Without it, dot-product magnitude grows with dimension d: for components with zero mean and unit variance, the dot product has variance d, so typical scores scale like √d. softmax inputs blow up → output collapses to one-hot → gradients vanish → no training. Dividing by √d brings score variance back to ~1 so softmax stays meaningfully "soft." Analogous to balancing hash buckets in databases: skewed distributions overload one bucket (softmax collapsing to one-hot is "all queries hit one bucket"). Deeper lesson: many "incidental" details in deep learning (the √d divide, LayerNorm's ε, Adam's β1=0.9) are really about managing numerical distributions: an engineering craft, not a mathematical necessity.
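A quick empirical check of the scaling argument, assuming standard-normal query and key components (an idealization; trained vectors are not exactly like this). No exact outputs are quoted because they depend on the random draw.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
spread = {}
for d in (16, 256, 1024):
    q = rng.normal(size=(2000, d))
    k = rng.normal(size=(2000, d))
    dots = (q * k).sum(axis=-1)                 # 2,000 sample dot products
    spread[d] = (dots.std(), (dots / np.sqrt(d)).std())
    print(f"d={d:5d}  raw std={spread[d][0]:6.1f}  scaled std={spread[d][1]:.2f}")

# Effect on softmax: one query against 8 random keys at d=1024
d = 1024
query, keys = rng.normal(size=d), rng.normal(size=(8, d))
raw = softmax(keys @ query)
scaled = softmax(keys @ query / np.sqrt(d))
print(f"largest weight, unscaled: {raw.max():.3f}  scaled: {scaled.max():.3f}")
```

The raw standard deviation tracks √d (about 4, 16, 32), while the scaled one stays near 1. The unscaled softmax puts almost all weight on a single key.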
Multi-Head with 8 heads beats one big head at roughly equal parameter count. Why does it work better? How does it relate to ensemble learning?
It enforces diversity. One large head only learns one optimal attention pattern in training; splitting into 8 smaller heads forces each to look at a different subspace, naturally specializing in different relations (syntax, semantics, coreference, local neighbors). Same family as bagging and ensembles: use structural constraints to force diversity, then merge. The difference: Multi-Head is trained end-to-end, and diversity emerges as a byproduct rather than being hand-engineered. Mechanistic interpretability has indeed found heads specialized as induction heads, name movers, duplicate-token detectors, and so on — so "multi-head" isn't marketing, it reflects genuine functional division.
Linear/Sparse Attention tries to drop O(n²) to O(n) — why hasn't either replaced standard attention? What do they sacrifice?
They sacrifice global, exact matching. Standard attention gives every token an exact weight against every other token. Linear Attention approximates with kernel tricks (sacrificing precision); Sparse Attention restricts to windows or fixed patterns (sacrificing global reach). At short and medium context (under 32K), the precision loss visibly degrades capability — industry prefers squeezing the constant with FlashAttention over switching. At ultra-long context (1M+), O(n²) becomes infeasible, and Mamba/SSM-style linear methods are an active frontier. Algorithmic complexity isn't a standalone metric — constants, precision loss, and hardware fit all play. The likely future is hybrid: early Attention layers for global structure, later linear/SSM layers for local details.

Tokenization

LLM · Engineering
One-line intuition

Like a compiler lexing source code into tokens, an LLM must first slice a character stream into chunks and map each chunk to an integer ID. The model never sees characters — only IDs.

What problem it solves

Vocabularies can't be too big (one ID per word → millions of entries → exploding matrices) nor too small (one ID per letter → sequences too long, semantics lost). The fix: BPE / WordPiece / SentencePiece automatically learn a vocabulary of around 50K subwords from high-frequency byte combinations. Common words stay whole; rare words get split into pieces.

How it works (intuition)

BPE training is greedy compression: start from single characters, repeatedly merge the most frequent adjacent pair into a new token, and stop when the vocabulary hits target size.

# BPE intuition
# Start: ['l','o','w','e','r']
# 'l'+'o' is most frequent → merge → ['lo','w','e','r']
# 'lo'+'w' is most frequent → merge → ['low','e','r']
# High-frequency whole words survive; rare words break into subword pieces
# "tokenization" might become ["token", "ization"]

Result: in English, 1 token ≈ 4 characters ≈ 0.75 words; in Chinese, one character can cost 2-3 tokens, depending on the tokenizer. This directly drives your bill and your context budget.
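
The greedy merge loop described above can be sketched in plain Python. This is a toy character-level trainer on a made-up four-word corpus, with a hypothetical `train_bpe` helper; production tokenizers work on bytes, pre-split text with regexes, and handle ties and special tokens differently.

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a tuple of single characters, weighted by its frequency
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair (first seen wins ties)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])   # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

corpus = ["low"] * 5 + ["lower"] * 2 + ["lowest"] * 2 + ["newer"] * 3
merges, vocab = train_bpe(corpus, num_merges=3)
print(merges)        # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
print(dict(vocab))   # 'low' is now a single symbol; rarer words stay in pieces
```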

Code example
import tiktoken  # OpenAI's official tokenizer

enc = tiktoken.get_encoding("cl100k_base")  # used by GPT-4

text_en = "Tokenization is fun!"
text_zh = "分词其实很有意思!"

print(enc.encode(text_en))         # [3404, 2065, 374, 2523, 0]
print(len(enc.encode(text_en)))    # 5 tokens
print(len(enc.encode(text_zh)))    # 11 tokens (Chinese is pricier!)

# Decode back
print(enc.decode([3404, 2065]))   # the first two IDs above, joined back into text; a single ID can be just a word fragment
Common misconception
Myth: "1 token = 1 word." Wrong. "strawberry" is split into subword chunks inside GPT, for example ["straw","berry"] (the exact split depends on the tokenizer). That's exactly why early GPT models couldn't count the r's in "strawberry": they never saw individual letters, only token chunks.
Where you see it

Canonical: evaluating API cost (OpenAI bills by token). Your day-to-day: the same meeting transcript costs 2-3x more in Chinese than in English. For high-frequency automation tasks (daily news summarization), writing the prompt in English while letting the model reply in Chinese saves money.

English Summary
Tokenization splits text into subword units (via BPE/WordPiece) so the vocabulary stays manageable while rare words remain encodable. Token counts drive both context limits and API pricing.
Think it through
GPT can't count the r's in strawberry. Beyond the tokenization root cause, what prompt engineering tricks sidestep it? And what does this teach you about "systemic bugs"?
Workarounds: (1) ask the model to list each letter first (once characters are explicit, counting works); (2) call a Python code interpreter; (3) ask the model to space the letters apart before processing. Systemic-bug lesson: the root cause lives in a low-level abstraction (token representations) invisible from the high-level interface. The right fix is to bypass the abstraction layer, not to patch within it. Like software: you can't fix SQL injection at the ORM layer — you go down to SQL; same here, you can't fix token blindness at the text layer, you reshape the task into a token-friendly form. That's why prompt engineering is fundamentally about identifying the model's blind spots and routing around them.
Chinese tokenization costs 2-3x English. Is that a technical issue or a commercial one? If you had to design a tokenizer with equal efficiency for both, how would you do it?
Mostly technical: the tokenizer's training data skews heavily toward English (the exact language mix of GPT-4's corpus isn't public), so common English words merge into one or two tokens while a Chinese character, which takes 3 bytes in UTF-8, often stays split across 2-3 tokens. There's commercial inertia too: a vendor whose main market is English has little incentive to optimize for Chinese. Design approach: (1) force balanced language proportions in the training corpus; (2) use SentencePiece + Unigram LM instead of BPE, which handles whitespace-free languages better; (3) seed the vocabulary with a curated set (e.g. 8,000 common Chinese characters as standalone tokens). DeepSeek and Qwen's tokenizers do handle Chinese far better than GPT's. Deeper point: this exposes the language inequality hidden inside "neutral" technical decisions, the same flavor of bias as "UTF-8 favors ASCII over CJK."
What's wrong with a vocabulary of 256 (too small) vs 500K (too large)? Compare it to choosing hash-table bucket counts.
Too small (256, essentially byte-level): one token per character makes sequences extremely long, the model spends compute building "letters → words," and long-range learning is harder. Upside: nearly no OOV (out-of-vocabulary). Too large (500K): every common word gets its own ID, sequences shrink, but the embedding matrix balloons (500K × 4096 ≈ 2B parameters spent on the vocabulary), training is inefficient, and rare words don't learn well. The sweet spot sits between 32K and 128K. Hash buckets: too many wastes memory and causes cache misses; too few causes heavy collisions and O(n) lookups. Vocabulary size is essentially the discretization granularity of your frequency distribution — like hashing, it has to match the data.
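The embedding-size arithmetic above can be reproduced directly. The 4096 hidden size comes from the text; `embedding_params` is a hypothetical helper, and the count covers the input embedding table only.

```python
def embedding_params(vocab_size, d_model):
    # One d_model-sized vector per vocabulary entry
    return vocab_size * d_model

for vocab_size in (256, 32_000, 128_000, 500_000):
    n = embedding_params(vocab_size, 4096)
    print(f"{vocab_size:>7,} tokens -> {n / 1e9:.2f}B embedding parameters")
# 500,000 tokens -> 2.05B embedding parameters
```

If the output projection is not tied to the input embedding, the vocabulary cost doubles.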
Different vendors use incompatible tokenizers. How is this similar to the character-encoding wars (UTF-8 vs UTF-16)?
Four parallels: (1) no enforced standard — each vendor picks based on training data and engineering preference; (2) high switching cost — model weights are bound to a particular tokenizer, so swapping it is essentially retraining; (3) first-mover lock-in — GPT-2's BPE has been reused for years, just as UTF-8 came to dominate the web through ASCII compatibility; (4) invisible costs — you pay per token, but the token count for the same text differs across vendors, similar to UTF-8 vs UTF-16 storage differences for Chinese. The difference: character sets eventually converged on Unicode, but tokenizers likely won't, because they're tightly coupled to model weights. Standardization requires interfaces that can evolve independently — tokenizers lack that property.
If byte-level tokenization (one token per byte) takes over, where does it win? What does it sacrifice?
Wins: (1) multilingual / low-resource languages: no per-language tokenizer tuning; (2) code and binary data: character-level handling is more precise; (3) letter-level tasks (spelling, counting r's): no more strawberry problem; (4) robustness: better tolerance for typos and informal input. Sacrifices: (1) sequence length explodes 4-8x, and O(n²) becomes unbearable; (2) long-range dependencies get harder (the model has to rebuild the concept of "word"); (3) training data efficiency drops. Meta's BLT (Byte Latent Transformer, 2024) uses dynamic patching (fine-split where entropy is high, merge where it's low) to try to have it both ways. If architectures keep cracking O(n²) (linear attention, Mamba), byte-level may well become the default.
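A byte-level tokenizer is simple enough to sketch with the standard library alone. `byte_tokens` is a hypothetical name; the example strings are made up for illustration.

```python
def byte_tokens(text):
    # Byte-level "tokenizer": every UTF-8 byte becomes its own token ID (0-255)
    return list(text.encode("utf-8"))

en = "Tokenization is fun!"
zh = "分词很有意思"

print(len(en), len(byte_tokens(en)))   # 20 20  (ASCII: one byte per character)
print(len(zh), len(byte_tokens(zh)))   # 6 18   (these characters take 3 bytes each)

# Lossless round trip, and no out-of-vocabulary case is possible
print(bytes(byte_tokens(zh)).decode("utf-8") == zh)   # True
```

The vocabulary never exceeds 256 entries, but the sequence is as long as the byte count, which is where the length explosion comes from.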

Positional Encoding

LLM · Theory
One-line intuition

Attention is an unordered-set operation (like a Python set) — it has no idea which token comes first. Positional encoding stamps each token with a "position ID" so the model can tell "dog bites man" from "man bites dog."

What problem it solves

The Q·Kᵀ computation in self-attention has no built-in notion of order. It is permutation-equivariant: shuffle the tokens, recompute, and every token gets exactly the same output vector as before, just in its shuffled slot. Obviously unacceptable. The fix: inject position information into the token embeddings so position 0 and position 5 start from different vectors.
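
A quick NumPy check of the order-blindness described above, assuming single-head self-attention with random weights and no positional signal. Shuffling the input rows only shuffles the output rows.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                      # 5 tokens, no position information
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
perm = np.array([3, 0, 4, 1, 2])                 # an arbitrary reordering

out = self_attention(x, Wq, Wk, Wv)
out_shuffled = self_attention(x[perm], Wq, Wk, Wv)

# Each token's output is unchanged; it simply moves with the token
print(np.allclose(out[perm], out_shuffled))   # True
```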

How it works (intuition)

Three generations of design:

  • Sinusoidal (original Transformer): use sin/cos at different frequencies to generate a unique vector per position, added to the token embedding. Upside: in principle it can extrapolate to lengths never seen in training.
  • Learned (GPT-2 / BERT): simply learn a [max_len, dim] lookup matrix. Simple, but no extrapolation.
  • RoPE — Rotary Position Embedding (Llama / Qwen / today's standard): turn position into a rotation angle. Each token rotates by a position-dependent angle in a high-dimensional space, so the Q·K dot product naturally encodes relative distance. This is the de facto standard for long-context models.
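# Intuition: RoPE rotates each (even, odd) pair of a position-m vector by m·θ_i
import numpy as np
def rope(x, pos):
    # x is one token's Q or K vector (even length), processed in pairs
    dim = x.shape[-1]
    out = np.empty_like(x, dtype=float)
    for i in range(dim // 2):
        a, b = x[2*i], x[2*i + 1]
        theta = pos / 10000 ** (2*i / dim)   # pair i rotates at its own frequency
        out[2*i]     = a * np.cos(theta) - b * np.sin(theta)
        out[2*i + 1] = a * np.sin(theta) + b * np.cos(theta)
    return out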
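The claim that the Q·K dot product "naturally encodes relative distance" can be checked numerically. This is a vectorized sketch of the same rotation, assuming base 10000 and random vectors; the `base` parameter name is an assumption for illustration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each (even, odd) pair of x by pos * base**(-2i/dim)
    dim = x.shape[-1]
    theta = pos / base ** (np.arange(0, dim, 2) / dim)
    a, b = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = a * np.cos(theta) - b * np.sin(theta)
    out[1::2] = a * np.sin(theta) + b * np.cos(theta)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

near = rope(q, 3) @ rope(k, 1)          # positions (3, 1): offset 2
far = rope(q, 1003) @ rope(k, 1001)     # positions (1003, 1001): same offset 2

print(np.isclose(near, far))            # True: only the offset matters
print(np.isclose(np.linalg.norm(rope(q, 7)), np.linalg.norm(q)))   # True: rotation keeps length
```

Absolute positions cancel out of the dot product, which is why the same phrase scores the same way wherever it appears in the context.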
Code example
# Watch how positional encoding changes the output
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Same token IDs, different position IDs → different hidden states
ids = tok("dog bites man", return_tensors="pt").input_ids
pos_a = torch.arange(ids.shape[1]).unsqueeze(0)   # positions 0, 1, 2 (the default)
pos_b = pos_a + 5                                 # same tokens placed at positions 5, 6, 7

with torch.no_grad():
    h_a = model.transformer(ids, position_ids=pos_a).last_hidden_state
    h_b = model.transformer(ids, position_ids=pos_b).last_hidden_state

# Identical tokens in identical order; only the position embeddings differ
print(torch.allclose(h_a, h_b))   # False
Common misconception
Myth: "Positional encoding just adds 1, 2, 3 to the embedding." That would let position magnitudes overwhelm semantics. Real approaches use high-dimensional periodic functions or rotations so that "relative distance" has geometric meaning in vector space. That geometry is part of what lets RoPE-based models reach 128K context.
Where you see it

Canonical: any order-sensitive task (translation, code generation, reasoning chains). Your day-to-day: when you ask Claude to analyze a 50-page PDF report, the model has to keep page 3 distinct from page 47 — that's positional encoding (RoPE) maintaining positional awareness across a long context.

English Summary
Self-attention on its own is order-blind (permutation-equivariant), so positional encodings inject order information into token embeddings. Modern LLMs use Rotary Position Embedding (RoPE), which encodes relative positions via rotation and, combined with scaling techniques, supports long contexts.
Think it through
RoPE can extrapolate past the training length — but performance collapses past roughly 2-4x. Why? What deeper challenge of "positional extrapolation" does this expose?
The collapse comes from frequency-coverage blind spots: RoPE assigns different sinusoidal periods to different dimensions, and during training the model only sees phase combinations within a certain range. High-frequency dimensions wrap around many times within the training length, but low-frequency dimensions never complete a full period, so when extrapolating they rotate into phase regions the model has never seen, and the resulting Q·K patterns are unfamiliar. The deeper challenge: positional encoding is a prior injection, but the model only learns specific behaviors against that prior and doesn't generalize across unseen priors. Mitigations: (1) YaRN / NTK Scaling rescale positions or frequencies so extrapolation looks more like interpolation; (2) deliberately mix in very long sequences during continual pretraining. Lesson: neural networks extrapolate poorly. "Theoretically extrapolable" and "practically extrapolable" are different worlds, which is why long context takes real engineering effort rather than just a clever formula.
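The interpolation idea in mitigation (1) can be sketched as plain position scaling, assuming a hypothetical 4K training length and 16K target length. This shows linear position interpolation only; NTK-aware scaling and YaRN adjust frequencies per dimension and are not reproduced here. `rope_angles` is a hypothetical helper.

```python
import numpy as np

def rope_angles(pos, dim=128, base=10000.0, scale=1.0):
    # Rotation angle (radians) of every dimension pair at a given position.
    # scale < 1 squeezes positions together (linear position interpolation).
    return (pos * scale) / base ** (np.arange(0, dim, 2) / dim)

train_len, target_len = 4096, 16384          # hypothetical lengths
scale = train_len / target_len               # 0.25

trained = rope_angles(train_len - 1)         # largest angles produced during training
raw = rope_angles(target_len - 1)            # naive extrapolation
squeezed = rope_angles(target_len - 1, scale=scale)

print("fastest pair (radians):", trained[0], raw[0], squeezed[0])
print("slowest pair (radians):", trained[-1], raw[-1], squeezed[-1])
```

The slowest pair covers well under one full turn (2π) during training, so raw extrapolation produces angles it has never generated, while the squeezed positions stay inside the trained range.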
Across three generations (Sinusoidal / Learned / RoPE), what's the underlying evolution? Why is "relative position" better than "absolute"? Compare git relative paths vs absolute paths.
The trajectory: from explicit position labels (Sinusoidal, Learned, each token tagged with an absolute number) to implicit relative encoding (RoPE makes dot products naturally reflect relative distance). Why relative is better: (1) language rules are relative — "distance between subject and predicate" matters more than "subject is at index 5"; (2) the same phrase shifts position across contexts without changing meaning, and relative encoding stays invariant; (3) extrapolation is easier. Git analogy: a commit hash is an absolute reference (precise but rigid), while HEAD~3 is relative (semantically clear, portable). Same as using ../config in code instead of hard-coded /Users/cissy/.... A recurring design philosophy: prefer relations over identifiers.
Adding positional encoding to the embedding — doesn't it pollute semantics? How does the model separate "position" from "meaning"?
There's interference, but the model learns to separate. Two reasons: (1) the embedding space is large enough (768–4096 dims) that position signals and semantic signals can occupy different subspaces; (2) training itself sorts which dimensions hold which information — the model maps positional cues into whichever subspace is useful. RoPE goes further: it doesn't add position in, it multiplies it through rotation on Q and K — a multiplicative injection that pollutes less than addition. A real side effect remains: shallow layers tend to learn positional patterns first; you can literally see attention maps in the first layer dominated by diagonal "look-at-nearby-tokens" patterns, while deeper layers do semantic work.
Three bottlenecks for long context — positional encoding, attention complexity, KV cache memory. Which gets cracked first, and why?
My bet: KV cache memory. It's the most direct engineering pressure with the most active mitigation paths: (1) GQA/MQA sharing KV across heads; (2) KV quantization compressing fp16 to int8/int4; (3) PagedAttention (vLLM) treating KV like virtual-memory paging; (4) sliding-window or recurrent designs dropping distant KV entries. Attention complexity needs an algorithmic breakthrough (Mamba-class), and the precision cost makes full replacement hard. Positional extrapolation is fundamentally a generalization problem, the hardest of all. Priority: memory (engineering) > complexity (algorithm + hardware) > extrapolation (theory). Lesson: when assessing bottlenecks, the existence of a clear engineering path is a powerful heuristic — pure algorithmic or theoretical problems usually move slowest.
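The memory pressure is straightforward to estimate. The shape below (32 layers, 32 heads of size 128, 128,000 tokens) is an illustrative assumption, not a specific model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # 2x for K and V; one vector per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

mha = kv_cache_bytes(32, 32, 128, seq_len=128_000)                  # every head keeps its own KV
gqa = kv_cache_bytes(32, 8, 128, seq_len=128_000)                   # 8 shared KV heads (GQA)
gqa_int8 = kv_cache_bytes(32, 8, 128, seq_len=128_000, bytes_per_value=1)

for name, size in [("MHA fp16", mha), ("GQA fp16", gqa), ("GQA int8", gqa_int8)]:
    print(f"{name}: {size / 2**30:.1f} GiB per sequence")
# MHA fp16: 62.5 GiB, GQA fp16: 15.6 GiB, GQA int8: 7.8 GiB
```

Two of the mitigations listed above already stack to an 8x reduction, which is part of why this bottleneck looks the most tractable.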
For video frame sequences or code AST trees, is 1D RoPE still sufficient? What hidden assumption about "position" does this expose?
No. Video is a time × height × width 3D structure; an AST is a tree. 1D positional encoding assumes "positions are linearly comparable," which loses dimensional information for these structures. Retrofits: (1) video uses 2D/3D RoPE with independent rotations per axis, and models such as Sora reportedly rely on joint spatiotemporal encoding; (2) AST and graph use Graph Positional Encoding (graph Laplacian eigenvectors, shortest-path distance). Hidden assumption exposed: when we say "position," we really mean "coordinates in some metric space." Text happens to be a 1D sequence, so 1D is enough; once data has richer structure, "position" must be redefined. That's a core difficulty in multimodal models: unifying positional encoding so image patches, video frames, text tokens, and audio segments live in one comparable space.