AI/ML Explained: RNNs & Sequences

Day 19 · 2026-06-05 · Difficulty ★★★☆☆
For: engineers with coding experience, but not from an AI background

The Transformer didn't appear out of nowhere. Before it, the workhorse for processing "sequences" (a sentence, a clip of speech, a stream of timestamped events) was the RNN (Recurrent Neural Network). Today we walk the evolutionary line that runs from 1997 to 2014: LSTM → GRU → Seq2Seq → the origin of Attention. Trace it and you'll see the Transformer's attention isn't magic; it's a precise answer to one specific RNN bottleneck. And that bottleneck can be stated in one phrase you already know: lossy compression.

Long Short-Term Memory · LSTM

gating · long dependency
One-line analogy

A plain RNN is like a running aggregate with a single variable: every new word gets mashed into the same state, repeatedly overwriting older information. LSTM adds a separate "conveyor belt" (cell state) alongside it — much like a database WAL (write-ahead log): information travels forward untouched by default, and is only read, written, or erased when a "gate" explicitly approves. That belt lets distant information travel far with almost no decay.

What problem it solves + how it works

First, the RNN's root disease: the vanishing gradient. Each RNN step multiplies information by a weight matrix; during training, backpropagating the error has to multiply by these matrices dozens or hundreds of times. Repeatedly multiply by factors below 1 and the product decays exponentially toward zero, so distant gradients vanish (above 1, they explode). The result: RNNs fail to learn long dependencies such as "I grew up in France ... I speak fluent ___", where the cue sits many words back, like a signal that decays after too many hops.
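To make the decay concrete, here is a minimal sketch that stands in a single scalar factor for the weight matrix. That is a simplifying assumption: a real RNN multiplies by Jacobian matrices, and the helper name `gradient_scale` is purely illustrative.

```python
def gradient_scale(factor: float, steps: int) -> float:
    """Scale applied to a gradient after `steps` multiplications by `factor`."""
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale

# Compare a shrinking, a neutral, and a growing per-step factor.
for factor in (0.9, 1.0, 1.1):
    scales = [gradient_scale(factor, n) for n in (10, 50, 100)]
    print(factor, scales)
```

With a factor of 0.9 the scale falls to roughly 2.7e-5 after 100 steps, while 1.1 grows past 10,000. Only a factor of exactly 1 leaves the signal intact, which is the property the gated designs below try to approximate.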

LSTM (Hochreiter & Schmidhuber, 1997) solves this by separating "memory" from "computation." Updates on the cell-state belt use addition rather than repeated multiplication — and on an additive path, gradients don't decay exponentially. That's the mathematical core of LSTM's long memory. Three gates control the belt:

  • Forget gate f: decides which old information on the belt to erase (e.g. a new subject appears, so the old subject's gender info is dropped);
  • Input gate i: decides which new information gets written to the belt;
  • Output gate o: decides which part of the belt to read out as this step's output.

Each gate is σ(W·[h_{t-1}, x_t] + b), with its own learned W and b. The sigmoid squashes the result into the range 0 to 1, intuitively "how far the switch is open": 0 = fully closed, 1 = fully open. The gates aren't hand-written rules; they are learned.

cell-state belt (info goes straight by default; gates read/write on demand)

C_{t-1} ──×forget──+input──→ C_t (additive update, gradient doesn't decay)
        ↓ output gate
        h_t this step's output

vs RNN: only one h, multiplied by a weight matrix every step → distant gradients vanish
Code example
import torch, torch.nn as nn

# PyTorch's built-in LSTM — all the gate logic is packaged
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(4, 10, 16)   # (batch=4, seq_len=10, 16 features per step)
out, (h_n, c_n) = lstm(x)     # c_n is the "conveyor belt" cell state

print(out.shape)    # (4, 10, 32) an output at every time step
print(h_n.shape)    # (1, 4, 32) last hidden state — often used as the whole-sequence "summary"
print(c_n.shape)    # (1, 4, 32) last cell state
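To see what the packaged gate logic does, the sketch below recomputes one step by hand from the weights of PyTorch's `nn.LSTMCell` and checks it against the built-in result. The helper `manual_lstm_step` is a hypothetical name for this example. The chunk order (input, forget, candidate, output) follows PyTorch's documented weight layout, not anything specific to the original paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=16, hidden_size=32)

def manual_lstm_step(x, h_prev, c_prev, cell):
    """Recompute one LSTMCell step by hand from the cell's own weights."""
    # One affine map of x and h_prev yields all four pre-activations at once.
    pre = (x @ cell.weight_ih.T + cell.bias_ih
           + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stacks the chunks in the order: input, forget, candidate, output.
    i, f, g, o = pre.chunk(4, dim=1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)              # candidate values to write
    c = f * c_prev + i * g         # additive update of the cell state ("belt")
    h = o * torch.tanh(c)          # gated read-out as this step's output
    return h, c

x = torch.randn(4, 16)
h0, c0 = torch.randn(4, 32), torch.randn(4, 32)
with torch.no_grad():
    h_ref, c_ref = cell(x, (h0, c0))
    h_man, c_man = manual_lstm_step(x, h0, c0, cell)

# Both comparisons should agree up to floating-point tolerance.
print(torch.allclose(h_man, h_ref, atol=1e-5),
      torch.allclose(c_man, c_ref, atol=1e-5))
```

The line `c = f * c_prev + i * g` is the whole "conveyor belt": the old state is scaled by the forget gate and the new content is added, rather than everything being pushed through a weight matrix.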
Pitfall + where you'd use it
Pitfall: "LSTM can remember infinitely far." Wrong. The additive path mitigates vanishing gradients but doesn't eliminate them — in practice, effective LSTM memory is typically tens to a couple hundred steps, no more. For truly long context you need attention (read on). This is why Transformers replaced LSTMs after 2017.
📌 Super-individual scenario: understanding the "gating" abstraction itself is worth more than memorizing the formula. When you design a stateful Agent workflow, "what to forget, what to write to long-term memory, what to output" is the engineering version of forget/input/output gates — LSTM gives you a native vocabulary for thinking about memory management.
Takeaway + question
💡 LSTM's core innovation isn't "more complex" — it's opening an additive highway for information, bypassing the vanishing gradient caused by repeated multiplication.
🤔 Where in systems you know is "default pass-through + explicit gating" used to fight information decay? (Hint: cache penetration, message-queue acks)

Gated Recurrent Unit · GRU

gating · lightweight
One-line analogy

GRU is a slimmed-down LSTM. If LSTM is "three-replica strong consistency" storage, GRU is "two-replica" — it drops the separate cell state, merges "memory" and "output" into one hidden state, and cuts the gates from 3 to 2. The payoff: fewer parameters, faster training, with performance on most tasks roughly tied with LSTM. A textbook engineering trade-off: spend a little theoretical expressiveness to buy real speed.

What problem it solves + how it works

LSTM's three gates plus a separate cell state mean many parameters and heavy compute. GRU (Cho et al., 2014) asked: can we do it cheaper? Its design:

  • Update gate z: one knob controlling both "how much old state to keep" and "how much new state to write" — merging LSTM's forget and input gates (keep = 1 − write);
  • Reset gate r: decides whether to ignore the old state when computing the new candidate (r→0 means "forget the past, look only at the current input");
  • No separate cell state: a single hidden state h serves as both memory and output.

The key intuition lives in the update gate's interpolation structure: h_t = (1−z)·h_{t-1} + z·h̃_t. Read it as "new state = (1−z) parts old + z parts new." When z→0, h passes through almost untouched — the same trick as LSTM's belt for fighting vanishing gradients: leave a near-identity pass-through channel.
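The interpolation can be tried on plain numbers. This is a scalar sketch under the convention used in this card, where z weights the new candidate; the function name `gru_blend` is illustrative. Note that PyTorch's `nn.GRU` documents the opposite convention, with z multiplying the old state, so the roles of z and 1−z are swapped there.

```python
def gru_blend(h_prev: float, h_cand: float, z: float) -> float:
    """New state as an interpolation between the old state and the candidate."""
    return (1.0 - z) * h_prev + z * h_cand

h_prev, h_cand = 1.0, -1.0
for z in (0.0, 0.1, 0.5, 1.0):
    # z = 0 keeps the old state untouched; z = 1 overwrites it completely.
    print(z, gru_blend(h_prev, h_cand, z))
```

At z = 0 the old state passes through unchanged, which is the near-identity channel that keeps gradients from shrinking at that step.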

Code example
import torch, torch.nn as nn

gru  = nn.GRU(16, 32, batch_first=True)
lstm = nn.LSTM(16, 32, batch_first=True)

# Compare parameter counts directly: GRU has 3 weight groups (2 gates + candidate), LSTM has 4 (3 gates + candidate)
n_gru  = sum(p.numel() for p in gru.parameters())
n_lstm = sum(p.numel() for p in lstm.parameters())
print(n_gru, n_lstm)   # 4800 vs 6400: GRU has 1/4 fewer params

out, h_n = gru(torch.randn(4, 10, 16))  # note: GRU returns only h, no c
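The two totals can also be derived by hand. The sketch below assumes a single-layer, unidirectional module with PyTorch's layout, in which each weight group holds an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors. The function name `recurrent_param_count` is hypothetical.

```python
def recurrent_param_count(input_size: int, hidden_size: int, groups: int) -> int:
    """Parameters in one recurrent layer with `groups` weight groups."""
    per_group = (hidden_size * input_size      # input-to-hidden weights
                 + hidden_size * hidden_size   # hidden-to-hidden weights
                 + 2 * hidden_size)            # two bias vectors
    return groups * per_group

print(recurrent_param_count(16, 32, groups=3))  # GRU:  2 gates + candidate -> 4800
print(recurrent_param_count(16, 32, groups=4))  # LSTM: 3 gates + candidate -> 6400
```

Each group costs 1,600 parameters here, so the GRU's saving is exactly one group, one quarter of the LSTM's total.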
Pitfall + where you'd use it
Pitfall: "GRU is newer than LSTM, so it's better." Wrong. There's no universal winner — with little data or a need for speed GRU often wins; on very long dependencies or big data LSTM is sometimes more stable. Run the comparison, don't pick by reputation. This is the norm in classic ML: architecture choice is data-driven.
📌 Decision-support scenario: GRU vs LSTM is a small instance of "is the simplification worth it?" When you pick a tech stack you ask the same thing: does cutting a component save more than the capability it costs? GRU's answer — "worth it in most cases" — is itself a useful prior.
Takeaway + question
💡 GRU = interpolating LSTM's "keep vs write" with a single update gate, losing one belt and one gate to trade tiny expressiveness for speed.
🤔 What's the deep connection between "(1−z)·old + z·new" and the exponential moving average (EMA) / weighted sliding window you've seen?

Sequence to Sequence · Seq2Seq · Encoder-Decoder

architecture · encode-decode
One-line analogy

Seq2Seq splits "input sequence → output sequence" into two halves: an encoder reads the whole sentence and compresses it into one fixed-length vector; a decoder takes that vector and generates the output word by word. Like an RPC call: the client (encoder) serializes the whole request into one fixed-length payload, sends it to the server (decoder) which deserializes the result. The problem is immediate — cramming an arbitrarily long sentence into a fixed-size vector is necessarily lossy.

What problem it solves + how it works

Pain point: translation, summarization, dialogue — tasks where input and output lengths are both variable and unaligned (5 Chinese words may become 8 English words). A plain classification network can't do "variable in, variable out." Seq2Seq (Sutskever, Vinyals & Le, 2014) solves it elegantly:

  • The encoder (an LSTM) reads the input in order; its final hidden state = the whole sentence's "summary vector" (context vector, aka thought vector);
  • The decoder (another LSTM) starts from that vector and autoregressively emits one word at a time until an end token.

Along with Cho et al.'s RNN Encoder-Decoder from the same year, this was among the first designs to cleanly split "understanding" and "generation" into two modules, and every encoder-decoder architecture today (including the original Transformer) inherits this skeleton. The paper also had a counterintuitive trick: feeding the input sentence in reverse improves translation quality, because it brings the source's start and the translation's start closer in time, easing long-distance decay.

Seq2Seq: the whole sentence is squeezed into one fixed-length vector (the bottleneck)

I love cats → Encoder LSTM → [one fixed-length vector] → Decoder LSTM → J'aime les chats

↑ No matter how long the sentence, all info must squeeze through this one vector → the longer the sentence, the more is lost
Code example
import torch, torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.emb(src))   # encode: keep only final state as the "summary"
        dec, _ = self.decoder(self.emb(tgt), state)  # init decoder with the summary
        return self.out(dec)                      # predict the next word at each step
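The forward pass above feeds the whole target sequence in at once, which only works during training. At inference time the decoder has to generate autoregressively. Below is a minimal greedy-decoding sketch on an untrained toy model, so the generated ids are arbitrary. The names `TinySeq2Seq` and `greedy_decode` and the token ids `bos_id` and `eos_id` are assumptions made for this example.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Same layers as the card's Seq2Seq; repeated so the block is self-contained."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len=20):
    """Generate one token per step, feeding each prediction back in."""
    _, state = model.encoder(model.emb(src))       # (h_n, c_n): the fixed summary
    token = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        dec, state = model.decoder(model.emb(token), state)
        # Pick the highest-scoring word as the next input token, shape (B, 1).
        token = model.out(dec[:, -1]).argmax(dim=-1, keepdim=True)
        generated.append(token)
        if (token == eos_id).all():                # simplification: stop only when
            break                                  # every sequence emits EOS together
    return torch.cat(generated, dim=1)

torch.manual_seed(0)
model = TinySeq2Seq(vocab=100).eval()
src = torch.randint(0, 100, (2, 7))                # batch of 2 source sentences
ids = greedy_decode(model, src, bos_id=1, eos_id=2)
print(ids.shape)
```

The only thing the decoder ever receives from the source side is `state`, which is where the bottleneck discussed below comes from.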
Pitfall + where you'd use it
Pitfall: "just make the vector dimension bigger and no info is lost." That treats the symptom, not the cause. The bottleneck isn't dimension size — it's the structural assumption of "representing arbitrarily long input with one fixed vector": a 50-word sentence and a 5-word sentence are forced into the same-sized space. The real cure is structural: let the decoder stop relying on a summary and instead look back at every spot in the source — that's attention.
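A quick way to see the structural constraint is to encode inputs of very different lengths and compare the summaries. This is a small sketch with arbitrary sizes; the variable name `summary_shapes` is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

summary_shapes = {}
with torch.no_grad():
    for seq_len in (5, 50, 500):
        x = torch.randn(1, seq_len, 8)      # one "sentence" of seq_len steps
        _, (h_n, c_n) = encoder(x)          # keep only the final state
        summary_shapes[seq_len] = tuple(h_n.shape)

print(summary_shapes)   # every entry is (1, 1, 16), regardless of length
```

A larger `hidden_size` widens every summary equally; it does not give the 500-step input any more room relative to the 5-step one.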
📌 Cross-disciplinary thought: this "fixed-summary bottleneck" resembles human memory uncannily — after reading a book your brain keeps only a fuzzy "summary," and details require flipping back. Attention is to Seq2Seq as "keep the book handy and look things up" is to "rely on recollection alone."
Takeaway + question
💡 Seq2Seq established the encoder-decoder skeleton, but squeezing the whole sentence into one fixed vector is a structural, lossy bottleneck.
🤔 Without changing the RNN, only the information flow, how would you let the decoder "remember" the input's details instead of just getting a summary? (That's exactly the next card's answer.)

The Origin of Attention · Attention · Bahdanau 2014

alignment · milestone
One-line analogy

Attention turns Seq2Seq's "send only one summary" into "keep all the original records + an on-demand retrieval index." For every word the decoder generates, it runs a weighted retrieval over all the encoder's hidden states: like a database JOIN, but with relevance scoring instead of exact matching, and like RAG dynamically recalling the most relevant snippets at each step. It removes the fixed-vector bottleneck, and it's the direct ancestor of the Transformer introduced in "Attention Is All You Need."

What problem it solves + how it works

The pain is the previous card's bottleneck: the decoder gets only a fixed summary, so a long sentence's details are lost. Bahdanau, Cho & Bengio (2014) solve it with soft alignment — when the decoder generates word i, it no longer uses just that one vector, but computes a context vector specific to this step:

c_i = Σ_j α_{ij} · h_j
α_{ij} = softmax_j( score(s_{i-1}, h_j) )

Symbol by symbol (the one formula worth chewing on today):

  • h_j: the encoder's hidden state for source word j, the "original record" we keep instead of discarding;
  • score(·): a scoring function measuring how relevant the decoder's current state s_{i-1} is to the source state h_j, i.e. whether to attend to this source word;
  • α_{ij}: pass all scores through softmax (over j) to normalize them into weights that sum to 1, read as "how much attention to give source word j at this step";
  • c_i: a weighted sum of all h_j with those weights, giving context tailored to this particular step.
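The formula can be worked through on tiny hand-picked vectors using only the standard library. One assumption to flag: the score below is a plain dot product chosen for readability, whereas the Bahdanau paper uses a small learned network. The names `attend` and `softmax` are illustrative.

```python
import math

def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)                                   # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(s_prev, enc_h):
    """Return (context vector, weights) for one decoder step."""
    # (1) score each encoder state against the decoder state (dot product here)
    scores = [sum(a * b for a, b in zip(s_prev, h)) for h in enc_h]
    # (2) normalize the scores into attention weights
    alpha = softmax(scores)
    # (3) weighted sum of encoder states, one output value per feature
    dim = len(enc_h[0])
    ctx = [sum(alpha[j] * enc_h[j][k] for j in range(len(enc_h))) for k in range(dim)]
    return ctx, alpha

enc_h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three encoder states
s_prev = [2.0, 2.0]                             # previous decoder state
ctx, alpha = attend(s_prev, enc_h)
print([round(a, 2) for a in alpha])             # [0.11, 0.11, 0.79]
print([round(c, 2) for c in ctx])               # [0.89, 0.89]
```

The raw scores are 2, 2, and 4, so the third state takes about 79% of the weight and the context vector is pulled toward it.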

Core intuition: when translating each word, the model learns on its own which source words to look back at. Generating "chats," the weight concentrates on "cats"; generating "aime," on "love." This score→softmax→weighted-sum trio is the prototype of today's Transformer attention. The main conceptual change is that the Transformer generalized it from between-encoder-and-decoder to the sequence attending to itself (self-attention), swapping in a scaled dot-product score along the way.

Attention: the decoder looks back at all source words each step, weighted by relevance

source hidden states: h(I)   h(love)   h(cats)
                      ↘ 0.1    ↘ 0.1     ↘ 0.8
decoding "chats" → weighted sum c (attends mostly to "cats")

weights α are learned and change dynamically each step → no fixed bottleneck
Code example
import torch, torch.nn.functional as F

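# The 3 core steps of attention, stripped of packaging.
# Note: the score below is a plain dot product for brevity; Bahdanau's original
# score is additive: v^T tanh(W_s s + W_h h).
def attention(s_prev, enc_h):       # s_prev:(B,d) decoder prev step; enc_h:(B,T,d) all source words
    score = (enc_h * s_prev.unsqueeze(1)).sum(-1)  # (1) dot-product score (B,T)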
    alpha = F.softmax(score, dim=-1)              # (2) normalize to weights summing to 1
    ctx   = (alpha.unsqueeze(-1) * enc_h).sum(1)   # (3) weighted sum (B,d)
    return ctx, alpha             # alpha also visualizes "which word the model is looking at"

ctx, a = attention(torch.randn(2,8), torch.randn(2,5,8))
print(ctx.shape, a.shape)   # (2,8) (2,5) — one custom context per step + one row of attention
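Bahdanau's paper scores each (decoder state, encoder state) pair with a small feed-forward network, score(s, h) = v^T tanh(W_s s + W_h h), rather than a product summed over features. Here is a minimal sketch of that additive scoring as a module. The class name `AdditiveAttention`, the projection names, and the size `attn_dim` are assumptions for illustration, and details such as biases are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Additive scoring: score(s, h) = v^T tanh(W_s s + W_h h)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # projects decoder state
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # projects encoder states
        self.v = nn.Linear(attn_dim, 1, bias=False)          # reduces to a scalar score

    def forward(self, s_prev, enc_h):
        # s_prev: (B, dec_dim); enc_h: (B, T, enc_dim)
        energy = torch.tanh(self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_h))  # (B, T, attn_dim)
        score = self.v(energy).squeeze(-1)                     # (B, T)
        alpha = F.softmax(score, dim=-1)                       # weights sum to 1 over T
        ctx = torch.bmm(alpha.unsqueeze(1), enc_h).squeeze(1)  # (B, enc_dim)
        return ctx, alpha

torch.manual_seed(0)
attn = AdditiveAttention(dec_dim=8, enc_dim=8, attn_dim=16)
ctx, alpha = attn(torch.randn(2, 8), torch.randn(2, 5, 8))
print(ctx.shape, alpha.shape)
```

Unlike a raw dot product, the learned projections let the decoder and encoder states have different sizes, at the cost of extra parameters and a tanh per pair.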
Pitfall + where you'd use it
Pitfall: "attention was invented by the Transformer." Not so. Bahdanau attention existed in 2014, 3 years earlier, where it was still grafted onto an RNN as an add-on. The Transformer's (2017) truly radical move was to throw out the RNN entirely and keep only attention — the title "Attention is All You Need" is aimed squarely at "no more recurrence." Grasp this history and you'll see attention isn't just another trick — it untied the 20-year-old knot of long dependencies in sequence modeling.
📌 Super-individual scenario: the attention weights α are interpretable — you can literally plot "where in the source the model is looking while translating this word." This "make the model tell you what it's attending to" idea is exactly what you should demand in AI collaboration: not just the answer, but where its attention landed, so you can judge whether the reasoning is trustworthy.
Takeaway + question
💡 Attention = score → softmax → weighted sum, letting the decoder dynamically look back at all source info, killing the fixed bottleneck and directly birthing the Transformer.
🤔 If attention is so powerful, why did it take 3 years (2014→2017) before anyone dared delete the RNN entirely? Where might the resistance lie? (Hint: parallelization, inductive bias, compute)

Further Reading

Deep Questions

1. Are LSTM's "additive belt" and ResNet's "residual connection" (Day 18) the same idea?
Essentially yes, the same move: both leave a near-identity pass-through path inside the network so gradients can flow through many layers/steps without decay. LSTM's cell state updates as C_t = f·C_{t-1} + i·C̃; when the forget gate f≈1 it's "C_t ≈ C_{t-1} + increment." ResNet's y = x + F(x) is "output = input + residual." Both give backprop a direct "+1" path, bypassing the exponential decay of repeated multiplication — the only difference is LSTM goes along the time axis (same layer, different time steps), ResNet along the depth axis (same moment, different layers). GRU's (1−z)·h + z·h̃ interpolation is the same. So "leave an identity shortcut to fight vanishing gradients" is a core motif running through deep learning, and LSTM (1997) used it 18 years before ResNet (2015) — good ideas get reinvented again and again in different places.
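The parallel can be written out as two derivatives. This is a sketch of the direct path only: it treats the gates as constants with respect to C_{t-1} and ignores the indirect routes through h_{t-1}, so it is an approximation rather than the full gradient.

```latex
\begin{align*}
% LSTM: direct path along the time axis
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
  &&\Longrightarrow&
  \left.\frac{\partial C_T}{\partial C_0}\right|_{\text{direct}}
  &= \prod_{t=1}^{T} \operatorname{diag}(f_t) \\
% ResNet: skip connection along the depth axis
y &= x + F(x)
  &&\Longrightarrow&
  \frac{\partial y}{\partial x}
  &= I + \frac{\partial F}{\partial x}
\end{align*}
```

When every f_t is close to 1, the LSTM product stays close to the identity, and the ResNet Jacobian contains the identity outright. In both cases there is a term that no weight matrix multiplies.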
2. If attention existed in 2014 and was so effective, why did no one dare delete the RNN until 2017?
Three sources of resistance. (1) The incentive for parallelization wasn't painful enough yet: RNNs must compute step by step (step t depends on step t−1) and can't parallelize, but in 2014 sequences and models were still small, so the slowness was tolerable. Removing recurrence lets the Transformer compute all positions simultaneously, perfectly saturating GPUs — it was compute growth that turned "parallelizable" from nice-to-have into decisive. (2) Belief in inductive bias: an RNN's "process in order" natively encodes the prior "sequences have order," and people intuitively felt deleting it would lose the sense of time. The Transformer plugged that gap with positional encoding (Day 13), proving the prior can be injected explicitly rather than imposed structurally. (3) It needed a bundle of innovations together: self-attention, multi-head, positional encoding, residual + LayerNorm — none could be missing. Attention alone can't delete the RNN. History is often like this: a key part A exists early, but B, C, D must arrive and external conditions (compute/data) mature before a paradigm leap happens. The 3-year gap from attention to Transformer is the classic "parts assembled but one final push short."
3. RNNs are serial over time, Transformers fully parallel — is this the same as "stream vs batch processing"?
There's a deep echo, though not an exact match. An RNN is like stream processing: data arrives one item at a time and you maintain a rolling state (the hidden state), which naturally suits unbounded / online scenarios with constant memory. But it can't parallelize across time steps, and distant info decays. A Transformer is like batch processing: load the whole window at once, let all positions interact pairwise, parallelizable and detail-preserving. The cost is O(n²) compute/memory in the sequence length n and a hard window limit that can't fit an "infinite stream." Interestingly, recent state-space models (SSMs) such as Mamba (2023; covered on Day 34) want both ends: maintain a constant-size state like an RNN, stay streamable with linear complexity, yet approach attention's long-range power through clever design. So it's not "who replaces whom" but a long tug-of-war between serial constant state and parallel full connectivity, isomorphic to distributed systems' "stateful stream vs stateless batch" trade-off. Your distributed background transfers directly here: throughput vs latency, memory vs compute, online vs offline.
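A back-of-the-envelope comparison puts numbers on the quadratic cost. The sketch assumes float32 values, a single attention head, a naively materialized n×n score matrix (optimized kernels avoid storing it in full), and an illustrative recurrent state of 1,024 values. The function names are hypothetical.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one full n x n attention score matrix."""
    return seq_len * seq_len

def recurrent_state_entries(hidden_size: int) -> int:
    """Entries in a recurrent hidden state; independent of sequence length."""
    return hidden_size

BYTES_PER_FLOAT32 = 4
for n in (1_000, 4_000, 16_000):
    attn_mb = attention_score_entries(n) * BYTES_PER_FLOAT32 / 1e6
    state_kb = recurrent_state_entries(1024) * BYTES_PER_FLOAT32 / 1e3
    print(f"n={n}: score matrix {attn_mb:.0f} MB, recurrent state {state_kb:.1f} KB")
```

Quadrupling the window multiplies the score matrix by sixteen (4 MB, 64 MB, 1024 MB), while the recurrent state stays at about 4 KB throughout.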
4. Attention weights claim to be "interpretable," telling you which word the model is looking at — but is that explanation trustworthy?
Be cautiously optimistic. Attention weights do offer a tempting explanation: plot the α matrix and you see "chats" lighting up "cats," which intuitively reads as "the model is aligning." But the field has a long-running debate over whether attention is genuinely an explanation (Jain & Wallace 2019, "Attention is not Explanation," vs. Wiegreffe & Pinter 2019, "Attention is not not Explanation"). The core doubts: (a) high weight ≠ causal importance: you can construct different attention distributions that yield nearly identical output, suggesting the weights aren't necessarily the model's true basis for its decisions; (b) after stacking many heads and layers, a single layer's weights are diluted and can't be naively read as "the model's attention." Pragmatic conclusion: attention weights are a useful debugging clue and hypothesis source, but not an authoritative confession of the model's reasoning. This has a direct lesson for AI collaboration: when a tool "tells you its reasons," is that reason a post-hoc narrative or true causality? The two often differ, which demands cross-verification rather than blind acceptance.
5. Over the decade from RNN to Transformer, did things get "more complex" or "simpler"?
On the surface, more complex (parameters from millions to trillions), but the core abstraction actually got simpler and more unified. RNN era: processing sequences required carefully designed gating, managing hidden state's temporal transfer, and wrestling vanishing gradients — the structure was packed with human priors about "how sequences should be processed." The Transformer drastically cuts these: no recurrence, no gating, no temporal assumptions, leaving just "all positions attend to each other + feed-forward + positional encoding," a highly uniform, stackable set of blocks. This is a recurring law in deep learning — the Bitter Lesson (Sutton): in the long run, simple general architectures that cut human priors and cede the space to data and compute beat carefully engineered complex structures. LSTM's gating is a crystallization of human insight, but the Transformer proved: given enough data and compute, a simpler, more parallelizable, fewer-assumption architecture learns better. For super-individuals this is a profound metaphor — real leverage often comes from cutting to the simple and letting general mechanisms emerge, not from piling up special cases. Designing any system (workflow, Agent, decision framework), it's worth asking: is the structure I'm adding a necessary inductive bias, or a human assumption data will eventually wash away?