Attention's core operation is a weighted sum over a set of tokens. Shuffle the tokens and each token still sees exactly the same context, so the outputs are simply shuffled along with the inputs. This is usually called permutation invariance (strictly speaking, self-attention is permutation-equivariant). RNNs and CNNs don't have this problem (their time steps and convolution windows carry order natively), but pure Attention sees "I love you" and "you love me" as the same bag of tokens.
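The claim above can be checked with a toy example. This is a minimal sketch in plain Python, assuming a single head with Q = K = V = the raw embeddings, no learned weights, and no positional signal; the 2-d embeddings are made-up values for illustration only.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with Q = K = V = tokens (no weights, no PE)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

# Hypothetical 2-d embeddings, chosen only for illustration
emb = {"I": [1.0, 0.0], "love": [0.0, 1.0], "you": [1.0, 1.0]}

sent_a = ["I", "love", "you"]
sent_b = ["you", "love", "I"]  # same tokens, different order

out_a = dict(zip(sent_a, self_attention([emb[t] for t in sent_a])))
out_b = dict(zip(sent_b, self_attention([emb[t] for t in sent_b])))

# Each token gets the same output vector regardless of where it sits
same = all(
    math.isclose(x, y, abs_tol=1e-9)
    for t in sent_a
    for x, y in zip(out_a[t], out_b[t])
)
print(same)  # True
```

Without a position signal, reordering the sentence changes nothing about what each token computes.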
So we must inject an "I'm at position N" signal externally. That's Positional Encoding (PE). Today we cover four flavors: fixed sinusoidal (the 2017 original), learned absolute (used by GPT-2/BERT), RoPE (today's default in Llama / Qwen / DeepSeek), and ALiBi (extrapolation-friendly bias). They map onto a very clear evolution: from "a tag added to the input" to "geometry baked into attention itself."
Sinusoidal PE is like handing each position a multi-band "clock reading" card: at position 0 every hand sits at its start mark, position 1 advances each hand by one tick, position 1000 by a thousand ticks. Low-frequency hands move slowly and handle "paragraph-scale distance"; high-frequency hands move fast and handle "neighbor precision." Backend analogy: imagine stuffing a multi-resolution composite timestamp (year/month/day/second) into every record, so the model picks which scale to use.
Pain point: Attention is permutation-invariant, so we must tell it each token's position externally. The most naive approach: encode "position i" as a d-dim vector and add it to the token embedding. The question is how. Attention Is All You Need (Vaswani 2017) gave an elegant parameter-free answer:
$$\mathrm{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$\mathrm{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
Symbols: pos is the token's index in the sequence (0,1,2,…), i is the dimension-pair index (0…d/2−1), d is total embedding dim. Pair up dimensions; each pair uses one (sin,cos) couple, with wavelengths growing geometrically from 2π up to just under 10000·2π. Low i = high frequency → sensitive to neighbors; high i = low frequency → sensitive to long-range.
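To make the wavelength range concrete, here is a small sketch in plain Python that lists the wavelength of every (sin, cos) pair. The helper name `wavelengths` and the choice d=128 are assumptions for illustration; the formula itself is the one above.

```python
import math

def wavelengths(d, base=10000.0):
    """Wavelength (in positions) of each (sin, cos) pair i = 0 .. d/2 - 1."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

wl = wavelengths(d=128)
print(len(wl))           # 64 pairs for d = 128
print(round(wl[0], 2))   # 6.28, i.e. 2*pi for the fastest pair
print(wl[-1] < 10000 * 2 * math.pi)  # True: the slowest pair stays just below 10000 * 2*pi
```

Consecutive wavelengths differ by the constant factor 10000^(2/d), which is what "growing geometrically" means here.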
Why sin/cos? Key property: for any fixed offset k, PE(pos+k) can be obtained from PE(pos) via a linear transform that depends only on k (the sum-of-angles identity). This gives the model a "latent relative-position learnability"—if it wants to, it can implicitly extract relative distance from two absolute encodings. Another bonus: positions never seen during training can still be computed (formula plugs in directly), so it's theoretically extrapolatable. In practice, extrapolation is weak.
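The linear-transform property can be verified numerically. The sketch below, in plain Python, builds PE(pos) from the formula and applies a per-pair 2×2 matrix that depends only on the offset k; the function names `pe` and `shift` and the sizes used are illustrative assumptions.

```python
import math

def pe(pos, d, base=10000.0):
    """Sinusoidal encoding of one position as a flat list [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d // 2):
        w = base ** (-2 * i / d)
        vec += [math.sin(pos * w), math.cos(pos * w)]
    return vec

def shift(vec, k, d, base=10000.0):
    """Apply the block-diagonal matrix M_k; it uses k and the frequencies, never pos."""
    out = []
    for i in range(d // 2):
        w = base ** (-2 * i / d)
        s, c = vec[2 * i], vec[2 * i + 1]
        ck, sk = math.cos(k * w), math.sin(k * w)
        # sum-of-angles: sin(a+b) = sin a cos b + cos a sin b, cos(a+b) = cos a cos b - sin a sin b
        out += [s * ck + c * sk, c * ck - s * sk]
    return out

d, k = 16, 5
max_err = 0.0
for pos in (0, 7, 300):
    predicted = shift(pe(pos, d), k, d)
    actual = pe(pos + k, d)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(predicted, actual)))

print(max_err < 1e-9)  # True: the same M_k maps PE(pos) to PE(pos + k) for every pos
```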
```python
import math
import torch

def sinusoidal_pe(max_len, d):
    # d must be even so dimensions can be paired
    pe = torch.zeros(max_len, d)
    pos = torch.arange(max_len).unsqueeze(1)  # (L, 1)
    # Geometrically spaced frequencies: higher dim -> lower freq
    div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
    pe[:, 0::2] = torch.sin(pos * div)  # even dims = sin
    pe[:, 1::2] = torch.cos(pos * div)  # odd dims = cos
    return pe

pe = sinusoidal_pe(max_len=512, d=128)
print(pe.shape)  # torch.Size([512, 128])
# Usage: x = token_embedding + pe[:seq_len]  (addition, not concat)
```
Treat "position" as a lookup table on equal footing with tokens: the vocab table has vocab_size word vectors, the position table has max_len position vectors, and both are learned together. Backend analogy: like building an extra "position dimension table" and joining by row_id—crude but workable.
Pain point: Sinusoidal encoding uses "human-designed frequencies"—nobody guarantees sin/cos is the optimal basis. The learnable approach drops the assumption: "if neural networks are good at learning, let them learn positions too."
Mechanism: build an nn.Embedding(max_len, d), a table of max_len rows, each a d-dim vector. For position i, look up row i and add it to the token embedding. BERT and GPT-1/2 use this. Key limitation: max_len must be fixed at training time (BERT used 512). Sequences of up to 512 tokens work (positions 0 to 511); the 513th token would need position 512, which has no corresponding vector. The model literally hasn't learned what "that position" looks like, so there is zero extrapolation. This is the root reason BERT-style models must chunk long documents.
```python
import torch
import torch.nn as nn

class LearnedPE(nn.Module):
    def __init__(self, max_len=512, d=128):
        super().__init__()
        # A learnable (max_len, d) table, peer to the word vector table
        self.pos = nn.Embedding(max_len, d)

    def forward(self, x):  # x: (batch, L, d)
        L = x.size(1)
        ids = torch.arange(L, device=x.device)  # 0, 1, ..., L-1
        return x + self.pos(ids)  # (L, d) broadcasts over the batch

pe = LearnedPE(max_len=512, d=128)
print(pe(torch.randn(2, 128, 128)).shape)  # torch.Size([2, 128, 128])
# Feeding L=600 raises IndexError on CPU: rows 512..599 do not exist.
# Literally cannot extrapolate.
```
If a model config sets max_position_embeddings together with position_embedding_type=absolute, it's this approach: the context window is hard-capped, and exceeding it just errors out. Extending it requires retraining.
Don't tag the input; rotate the Q and K vectors by an angle determined by position. When two tokens take their dot product, the rotation difference IS their relative distance, so position lives directly inside attention as a geometric angle. Backend analogy: upgrading from "stuff a timestamp column in every record" to "rotate coordinates by timestamp", so that joins naturally encode time differences.
Pain point: The previous two PEs encode absolute position, but linguistic relationships are fundamentally relative—"an adjective modifies the noun that immediately follows it" or "subject and verb are two words apart" has nothing to do with being at sentence position 7 vs. 700. With absolute PE added to tokens, the model still has to laboriously infer relative distance from two absolute positions. Can we encode relative distance directly?
RoPE's insight (Su et al., RoFormer 2021): leave token embeddings alone, modify Q and K instead. Pair up the d dims of Q/K into d/2 2D planes, and rotate each plane by angle m·θi (m is token position; θi is the plane's intrinsic frequency, geometrically spaced just like sinusoidal PE). Core formula (one 2D plane):
$$\mathrm{RoPE}(q, m) = R_m\, q, \qquad R_m = \begin{bmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{bmatrix}$$
Symbols: q is the 2D sub-vector for this dim pair; m is token position; θ is this pair's frequency (high dims = low freq, same design as sinusoidal). Rm is the standard 2D rotation matrix.
Key property (RoPE's magic): after rotating, take dot product:
$$(R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$
The right side depends only on n−m (relative distance)—the absolute position cancels automatically after the dot product, leaving only relative position. A seemingly trivial rotation cleanly converts "absolute encoding" into "relative distance encoding." Llama, Qwen, DeepSeek, Mistral all use it.
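The cancellation of absolute position can be checked on a single 2D plane. This is a minimal sketch in plain Python; the vectors `q` and `k` and the frequency `theta` are arbitrary values chosen for illustration.

```python
import math

def rotate(vec, pos, theta):
    """Rotate a 2D vector by the angle pos * theta (the matrix R_pos)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Arbitrary illustrative values
q, k, theta = (0.3, -1.2), (0.8, 0.5), 0.1

def score(m, n):
    """Attention logit between a query at position m and a key at position n."""
    return dot(rotate(q, m, theta), rotate(k, n, theta))

s_near_start = score(2, 5)        # offset n - m = 3
s_far_away = score(1000, 1003)    # same offset, very different absolute positions
s_other_offset = score(2, 9)      # offset 7

print(math.isclose(s_near_start, s_far_away, abs_tol=1e-9))      # True
print(math.isclose(s_near_start, s_other_offset, abs_tol=1e-9))  # False
```

Only the offset n − m affects the score; moving both tokens by the same amount leaves it unchanged.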
```python
import torch

def apply_rope(x, theta_base=10000.0):
    # x: (batch, seq, d); d must be even. Minimal readable version.
    b, L, d = x.shape
    half = d // 2
    # Per-pair intrinsic frequencies, geometric: high dim = low freq
    freq = 1.0 / (theta_base ** (torch.arange(0, half, device=x.device).float() / half))
    pos = torch.arange(L, device=x.device).float()
    ang = torch.outer(pos, freq)  # (L, d/2)
    cos, sin = ang.cos(), ang.sin()
    # Split d into two halves; pair i is (x1[..., i], x2[..., i])
    x1, x2 = x[..., :half], x[..., half:]
    # 2D rotation: (x1, x2) -> (x1*cos - x2*sin, x1*sin + x2*cos)
    out1 = x1 * cos - x2 * sin
    out2 = x1 * sin + x2 * cos
    return torch.cat([out1, out2], dim=-1)

# In practice: apply once each to Q and K inside attention; V is left alone
q = torch.randn(1, 8, 64)
q_rot = apply_rope(q)
print(q_rot.shape)  # torch.Size([1, 8, 64]): shape unchanged, position now in the geometry
```
More radical: drop PE as a "separate signal" entirely, and just subtract a distance penalty from the attention similarity score—the farther apart, the bigger the penalty. Backend analogy: like appending − DISTANCE_PENALTY to a SQL ORDER BY so "closer first" is built into the sort, with no extra position dimension table needed.
Pain point: All prior approaches—sinusoidal, learned, RoPE—first construct a position vector, then merge it into the computation. ALiBi (Press et al. 2022) asked the inverse question: can we skip the PE vector and just add a distance-based bias to the attention score?
Naive attention computes $\mathrm{score}(i,j) = q_i \cdot k_j / \sqrt{d}$. ALiBi changes this to:
$$\mathrm{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d}} - m_h \cdot |i - j|$$
Symbols: |i−j| is the relative distance (in causal models, only j ≤ i). m_h is each head's fixed slope. The paper uses a geometric sequence, which for 8 heads is 1/2, 1/4, 1/8, …, 1/256, so each head gets a different slope: large-slope heads focus tightly on neighbors, small-slope heads can look much farther. Token embeddings are unchanged and Q/K are unchanged; just subtract a distance-proportional number from the pre-softmax score, and position is injected.
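As a worked example of the slope schedule, the sketch below computes the slopes for 8 heads and one row of the resulting bias. It is plain Python and assumes the number of heads is a power of two; the helper names `alibi_slopes` and `bias_row` are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes 2^(-8*h/n_heads) for h = 1 .. n_heads (n_heads a power of two)."""
    return [2.0 ** (-8.0 * h / n_heads) for h in range(1, n_heads + 1)]

def bias_row(i, slope):
    """Bias added to the scores of query i against keys j = 0 .. i (causal)."""
    return [slope * (j - i) for j in range(i + 1)]

slopes = alibi_slopes(8)
print(slopes)
# [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

print(bias_row(4, slopes[0]))
# [-2.0, -1.5, -1.0, -0.5, 0.0]
```

The steepest head (slope 1/2) already subtracts 2.0 from a key four positions back, while the flattest head (slope 1/256) subtracts under 0.02 at the same distance.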
Why does it extrapolate well? Because the bias is just m·|i−j|, defined and behaving consistently for arbitrarily large i−j—the model has never seen length 8192, but it learned the monotonic rule "distance 8000 gets a bigger penalty than distance 4000." The paper reports: trained on 1024 and tested at 2048 directly, ALiBi matches or beats a sinusoidal model trained natively to 2048. Used by BLOOM, MPT.
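A toy illustration of why the bias behaves consistently at unseen lengths: if all content scores are set to zero (an assumption made purely to isolate the bias), the attention weights of the last query depend on distance alone. The sketch is plain Python and says nothing about model quality; it only shows that the bias term needs no change when the sequence grows.

```python
import math

def alibi_weights(seq_len, slope):
    """Softmax weights of the last query over all keys, with content scores all 0."""
    i = seq_len - 1
    scores = [-slope * (i - j) for j in range(seq_len)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

short = alibi_weights(1024, slope=0.5)
longer = alibi_weights(8192, slope=0.5)

# Weight on the nearest key is the same at both lengths
print(math.isclose(short[-1], longer[-1], rel_tol=1e-9))  # True

# Each extra step of distance scales the weight by exp(-slope), at any length
print(math.isclose(longer[-2] / longer[-1], math.exp(-0.5), rel_tol=1e-9))  # True
```

For a steep head like this one, the weight profile over nearby tokens at length 8192 is numerically identical to the one at length 1024; flatter heads see a small change because their penalty decays over a longer range.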
```python
import torch

def alibi_bias(seq_len, n_heads):
    # Per-head fixed slopes, geometric: 2^(-8(h+1)/n) for h = 0 .. n-1
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    # Distance matrix |i - j|
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    dist = (i - j).abs().float()
    # Each head gets one (L, L) bias, broadcast to (n_heads, L, L).
    # The minus sign matches score - m_h * |i - j|.
    bias = -slopes[:, None, None] * dist[None, :, :]
    # Add to the score before softmax; the causal mask (j <= i) is applied separately
    return bias

bias = alibi_bias(seq_len=8, n_heads=4)
print(bias.shape)  # torch.Size([4, 8, 8]), one matrix per head
# No position table to outgrow: the same function works at any seq_len
# (e.g. train at 1024, run inference at 8192)
```