The Transformer didn't appear out of nowhere. Before it, the workhorse for processing "sequences" (a sentence, a clip of speech, a stream of timestamps) was the RNN (Recurrent Neural Network). Today we walk the 2014–2015 evolutionary line: LSTM → GRU → Seq2Seq → the origin of Attention. Trace it and you'll see the Transformer's attention isn't magic — it's a precise answer to one specific RNN bottleneck. And that bottleneck can be stated in one phrase you already know: lossy compression.
A plain RNN is like a running aggregate with a single variable: every new word gets mashed into the same state, repeatedly overwriting older information. LSTM adds a separate "conveyor belt" (cell state) alongside it — much like a database WAL (write-ahead log): information travels forward untouched by default, and is only read, written, or erased when a "gate" explicitly approves. That belt lets distant information travel far with almost no decay.
First, the RNN's root disease: the vanishing gradient. Each RNN step multiplies information by a weight matrix; during training, backpropagating the error has to multiply these matrices dozens or hundreds of times. Multiply numbers below 1 and you get exponential decay toward zero — distant gradients vanish (above 1 and they explode). The result: RNNs fail to learn long dependencies like "The clouds are in the ___" where the cue is many words back — like a signal that decays after too many hops.
LSTM (Hochreiter & Schmidhuber, 1997) solves this by separating "memory" from "computation." Updates on the cell-state belt use addition rather than repeated multiplication — and on an additive path, gradients don't decay exponentially. That's the mathematical core of LSTM's long memory. Three gates control the belt:
Each gate is σ(W·[h, x] + b) — the sigmoid squashes the result to 0~1, intuitively "how far the switch is open": 0 = fully closed, 1 = fully open. The gates aren't hand-written rules; they are learned.
import torch, torch.nn as nn # PyTorch's built-in LSTM — all the gate logic is packaged lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True) x = torch.randn(4, 10, 16) # (batch=4, seq_len=10, 16 features per step) out, (h_n, c_n) = lstm(x) # c_n is the "conveyor belt" cell state print(out.shape) # (4, 10, 32) an output at every time step print(h_n.shape) # (1, 4, 32) last hidden state — often used as the whole-sequence "summary" print(c_n.shape) # (1, 4, 32) last cell state
GRU is a slimmed-down LSTM. If LSTM is "three-replica strong consistency" storage, GRU is "two-replica" — it drops the separate cell state, merges "memory" and "output" into one hidden state, and cuts the gates from 3 to 2. The payoff: fewer parameters, faster training, with performance on most tasks roughly tied with LSTM. A textbook engineering trade-off: spend a little theoretical expressiveness to buy real speed.
LSTM's three gates plus a separate cell state mean many parameters and heavy compute. GRU (Cho et al., 2014) asked: can we do it cheaper? Its design:
The key intuition lives in the update gate's interpolation structure: h_t = (1−z)·h_{t-1} + z·h̃_t. Read it as "new state = (1−z) parts old + z parts new." When z→0, h passes through almost untouched — the same trick as LSTM's belt for fighting vanishing gradients: leave a near-identity pass-through channel.
import torch, torch.nn as nn gru = nn.GRU(16, 32, batch_first=True) lstm = nn.LSTM(16, 32, batch_first=True) # Compare parameter counts directly: GRU uses 3 gate weight groups, LSTM 4 n_gru = sum(p.numel() for p in gru.parameters()) n_lstm = sum(p.numel() for p in lstm.parameters()) print(n_gru, n_lstm) # 4800 vs 6400 — GRU ~1/4 fewer params out, h_n = gru(torch.randn(4, 10, 16)) # note: GRU returns only h, no c
Seq2Seq splits "input sequence → output sequence" into two halves: an encoder reads the whole sentence and compresses it into one fixed-length vector; a decoder takes that vector and generates the output word by word. Like an RPC call: the client (encoder) serializes the whole request into one fixed-length payload, sends it to the server (decoder) which deserializes the result. The problem is immediate — cramming an arbitrarily long sentence into a fixed-size vector is necessarily lossy.
Pain point: translation, summarization, dialogue — tasks where input and output lengths are both variable and unaligned (5 Chinese words may become 8 English words). A plain classification network can't do "variable in, variable out." Seq2Seq (Sutskever, Vinyals & Le, 2014) solves it elegantly:
This is the first time "understanding" and "generation" were cleanly split into two modules — every encoder-decoder architecture today (including the original Transformer) inherits this skeleton. The paper also had a counterintuitive trick: feeding the input sentence in reverse boosts scores, because it brings the source's start and the translation's start closer in time, easing long-distance decay.
import torch, torch.nn as nn class Seq2Seq(nn.Module): def __init__(self, vocab, dim=64): super().__init__() self.emb = nn.Embedding(vocab, dim) self.encoder = nn.LSTM(dim, dim, batch_first=True) self.decoder = nn.LSTM(dim, dim, batch_first=True) self.out = nn.Linear(dim, vocab) def forward(self, src, tgt): _, state = self.encoder(self.emb(src)) # encode: keep only final state as the "summary" dec, _ = self.decoder(self.emb(tgt), state) # init decoder with the summary return self.out(dec) # predict the next word at each step
Attention turns Seq2Seq's "send only one summary" into "keep all the original records + an on-demand retrieval index." Every word the decoder generates, it runs a weighted retrieval over all the encoder's hidden states — like a database JOIN, but with relevance scoring instead of exact matching, and like RAG dynamically recalling the most relevant snippets at each step. It eliminates the fixed bottleneck entirely, and it's the direct ancestor of the later Transformer's "Attention is All You Need."
The pain is the previous card's bottleneck: the decoder gets only a fixed summary, so a long sentence's details are lost. Bahdanau, Cho & Bengio (2014) solve it with soft alignment — when the decoder generates word i, it no longer uses just that one vector, but computes a context vector specific to this step:
Symbol by symbol (the one formula worth chewing on today):
Core intuition: when translating each word, the model learns on its own which source words to look back at. Translating "cats," weight concentrates on "猫"; translating "love," on "爱." This score→softmax→weighted-sum trio is the prototype of today's Transformer self-attention — the only difference is the Transformer generalized it from between-encoder-and-decoder to the sequence attending to itself.
import torch, torch.nn.functional as F # The 3 core steps of additive (Bahdanau) attention, stripped of packaging def attention(s_prev, enc_h): # s_prev:(B,d) decoder prev step; enc_h:(B,T,d) all source words score = (enc_h * s_prev.unsqueeze(1)).sum(-1) # (1) score (B,T) alpha = F.softmax(score, dim=-1) # (2) normalize to weights summing to 1 ctx = (alpha.unsqueeze(-1) * enc_h).sum(1) # (3) weighted sum (B,d) return ctx, alpha # alpha also visualizes "which word the model is looking at" ctx, a = attention(torch.randn(2,8), torch.randn(2,5,8)) print(ctx.shape, a.shape) # (2,8) (2,5) — one custom context per step + one row of attention