AI/ML Deep Dive: Encoder Models

Day 24 · 2026-06-10
For: engineers with coding experience but no ML background
Engineering counterpart → super-individual D11: RAG in Practice (choosing embedding / re-rank)

The past days were all about decoders (generative models like GPT). Today we switch tracks to the encoder—it doesn't generate text, it compresses text into vectors, and it's the backbone of RAG retrieval, semantic search, and classification. Four concepts form an evolutionary line: BERT lays the foundation → RoBERTa fixes the recipe → Sentence-BERT makes vectors comparable → ColBERT balances accuracy and speed.

BERT — Bidirectional Encoder

foundational · representation
One-line analogy

GPT is like streaming log processing—it can only read left to right, and when predicting the next item it can't see the future. BERT is the opposite: it loads the whole table into memory for a full scan, where every token sees the entire left and right context at once. The price is that BERT can't generate (it has no "next word" task), but it understands a sentence better—because understanding inherently needs both sides of context.

Problem + mechanism

Before 2018, understanding tasks (classification, NER, QA) mostly used unidirectional language models or shallow word vectors. The pain: one direction can't see everything. "Apple launched a new phone" vs "she peeled the apple"—the meaning of "apple" depends on what comes after, which a left-to-right model can't see when it reaches the word. BERT (Devlin et al. 2018) solves this with an encoder-only architecture plus two self-supervised pretraining tasks:

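  • MLM (Masked Language Modeling): randomly select 15% of tokens as prediction targets (most are replaced with [MASK]; a few are swapped for a random token or left unchanged) and have the model recover them from bidirectional context. It's a fill-in-the-blank task: because the answer sits in the middle, the model is forced to use both sides, which is why the causal mask is dropped.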
  • NSP (Next Sentence Prediction): decide whether two sentences were adjacent in the original text, teaching inter-sentence relationships.
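The MLM corruption step can be sketched in a few lines of plain Python. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mlm_corrupt`, the toy vocabulary, and the per-token coin flip (instead of picking exactly 15% of positions) are assumptions made for brevity. The 80/10/10 split between [MASK], a random token, and the unchanged token follows the BERT paper.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else (no loss is computed there).
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # not selected as a target
        labels[i] = tok                       # selected: model must recover it
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = "apple launched a new phone and she peeled the apple".split()
corrupted, labels = mlm_corrupt(tokens, vocab=sorted(set(tokens)), seed=1)
print(corrupted)
print(labels)
```

Only positions with a non-None label contribute to the loss, which is why the model cannot get away with copying its input.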

After pretraining, BERT has learned general language representations; downstream you just attach a small classification head and fine-tune, which at release was enough to hit SOTA on classification / NER / extractive QA. BERT did not invent "pretrain + fine-tune" (ULMFiT and GPT-1 came slightly earlier), but it is the model that made the paradigm the default for NLP.

Same input "Apple launched new [MASK]", two reading styles

GPT (decoder, causal):
Apple launched new ?  each step sees only the left

BERT (encoder, bidirectional):
Apple launched new [MASK]
└── each token absorbs all left + right context ──┘ → output vector
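The two reading styles in the diagram come down to which attention mask is applied. The sketch below prints both masks for the four-token example; `attention_mask` is a hypothetical helper written for this illustration, not a library function.

```python
def attention_mask(n, causal):
    """mask[i][j] == 1 means token i is allowed to attend to token j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

tokens = ["Apple", "launched", "new", "[MASK]"]
for name, causal in [("GPT (causal)", True), ("BERT (bidirectional)", False)]:
    print(name)
    for tok, row in zip(tokens, attention_mask(len(tokens), causal)):
        print(f"  {tok:>9} sees {row}")
```

In the causal case the matrix is lower-triangular, so "Apple" only sees itself; in the bidirectional case every row is all ones.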
Code
from transformers import pipeline

# The most natural BERT demo is its pretraining task: fill-in-the-blank
fill = pipeline("fill-mask", model="bert-base-uncased")

# BERT completes [MASK] using bidirectional context
for r in fill("The capital of France is [MASK].")[:3]:
    print(f"{r['token_str']:>8}  {r['score']:.3f}")
# Top hit should be "paris", with other French cities far behind: it infers from context, not "continuation"

# Get the sentence's hidden vectors (raw material for downstream tasks)
from transformers import AutoModel, AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mdl = AutoModel.from_pretrained("bert-base-uncased")
out = mdl(**tok("hello world", return_tensors="pt"))
print(out.last_hidden_state.shape)  # [1, seq_len, 768] one vector per token
Pitfall + use case
"BERT is ancient, why bother now that we have GPT?"—wrong. Generation (writing, dialogue) uses decoders; but understanding / encoding (turning text into vectors for retrieval, classification, clustering) is still the encoder's home turf. That embedding model in your RAG system is, by lineage, a descendant of BERT. They're a division of labor, not old vs new.
📌 BigCat scenario: auto-tagging / classifying your personal knowledge base (sorting notes into "distributed systems", "Buddhism", "parenting")—a classic encoder fine-tuning task. A few hundred labeled examples train a fast, accurate small classifier, orders of magnitude cheaper than calling an LLM each time.
Takeaway + question
💡 Decoders "speak", encoders "comprehend"—bidirectionality is the prerequisite for understanding; the causal mask is the price of generation.
🤔 Of the things you do with an LLM today, which only need "understanding / classification" rather than "generation"? Wouldn't an encoder be faster and cheaper for those?

RoBERTa — A Better Recipe (Robustly Optimized BERT)

training recipe · replication
One-line analogy

RoBERTa (Liu et al. 2019) has the exact same architecture as BERT—no new CPU, no schema change. It just re-tuned the training hyperparameters and data recipe and decisively beat the original. Like getting a slow database and, without new hardware, just enlarging the buffer pool, killing a useless background job, and feeding it 10× more data—performance doubles. The conclusion: BERT was severely undertrained.

Problem + mechanism

After BERT, everyone assumed "to get stronger you must change the architecture." RoBERTa is a rigorous ablation study proving much of the gain comes not from new structure but from getting the old recipe right. Key changes:

  • Drop NSP: experiments found NSP barely helps and can even hurt. Removing it and using long contiguous text works better—a widely-assumed task debunked.
  • Dynamic masking: BERT decides which tokens to mask once during preprocessing (static), reused for all of training. RoBERTa re-randomizes the mask each time data is fed, exposing the model to more diverse fill-in-the-blanks and preventing memorization.
  • Bigger batches + more data + longer training: data scaled from ~16GB to ~160GB, with larger batches and more steps.
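  • Larger byte-level BPE vocabulary: about 50K entries versus BERT's roughly 30K WordPiece tokens. Because it operates on bytes, any input can be encoded without falling back to an unknown token.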
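The difference between static and dynamic masking is easy to see in a toy simulation. This is a sketch under simplifying assumptions: `pick_positions` is a hypothetical helper, a "sequence" is just 20 position indices, and static masking is reduced to a single mask reused every epoch.

```python
import random

def pick_positions(n_tokens, rng, mask_prob=0.15):
    """Choose which positions to mask (at least one, so there is a target)."""
    k = max(1, round(n_tokens * mask_prob))
    return sorted(rng.sample(range(n_tokens), k))

n_tokens, n_epochs = 20, 4

# Static: positions are chosen once during preprocessing, then reused.
static_positions = pick_positions(n_tokens, random.Random(0))
static = [static_positions for _ in range(n_epochs)]

# Dynamic: positions are re-sampled each time the sequence is fed to the model.
rng = random.Random(0)
dynamic = [pick_positions(n_tokens, rng) for _ in range(n_epochs)]

for epoch, (s, d) in enumerate(zip(static, dynamic)):
    print(f"epoch {epoch}: static={s}  dynamic={d}")
```

With the static scheme the model sees the same blanks for a given sequence every epoch; with the dynamic scheme the blanks move, so the same text yields new training signal.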

Not one change is a "new architecture", yet together they set new state-of-the-art results on benchmarks like GLUE / SQuAD. RoBERTa's real contribution is methodological: before claiming "I invented a better architecture", first confirm you trained the baseline thoroughly. Otherwise you're comparing "an undertrained old model vs a fully-trained new one", and the conclusion is invalid.

Code
from transformers import pipeline

# RoBERTa's interface is identical to BERT—same architecture, swap weights
# Note: RoBERTa has no NSP; its mask token is <mask> not [MASK]
fill = pipeline("fill-mask", model="roberta-base")

for r in fill("Better data beats a fancier <mask>.")[:3]:
    print(f"{r['token_str']:>12}  {r['score']:.3f}")

# In practice RoBERTa is a common base for fine-tuning "understanding" tasks
# e.g. sentiment classification, NLI — same fine-tune workflow
clf = pipeline("sentiment-analysis",
               model="cardiffnlp/twitter-roberta-base-sentiment-latest")
print(clf("Just tuning the recipe shot the scores up. Elegant."))
Pitfall + use case
"Model underperforms = architecture is bad, switch to something fancier"—the exact intuition trap RoBERTa refutes. Often the problem is undertraining / insufficient data / a baseline that was never tuned. Before changing architecture, ask: is the current approach trained to saturation? (This applies to any system tuning—don't rush to rewrite before confirming the old approach was used to its full.)
📌 BigCat scenario: treat it as an epistemic principle for cross-disciplinary thinking—when evaluating any "new method crushes old method" paper or pitch, first ask "was the control group deliberately weakened?" RoBERTa reminds us many "breakthroughs" vanish once the baseline is trained properly.
Takeaway + question
💡 The recipe is often worth more than the architecture—RoBERTa changed not a line of architecture code and took SOTA just by "training the baseline right".
🤔 Which "underperforming" system of yours has never been seriously tuned or fully fed with data? Are you sure the bottleneck is the design, not under-utilization?

Sentence Embeddings — Sentence-BERT / Sentence Transformers

vector retrieval · contrastive
One-line analogy

To compare two sentences, vanilla BERT must feed both, concatenated, through one forward pass, like doing a full-table JOIN per query. Finding the most similar pair among 10,000 sentences takes n(n−1)/2 ≈ 50 million BERT inferences, about 65 hours on a single GPU according to the Sentence-BERT paper. Sentence-BERT instead builds an index: encode each sentence independently into one fixed vector, precompute and store them, then compare via cosine similarity. That moves the work from "compute at JOIN time" to "precompute + lookup", and the same search drops to about 5 seconds.
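The pair count behind that comparison is simple arithmetic, worked through below. Only the number of model forward passes is counted; wall-clock time depends on hardware and is not modeled here.

```python
from math import comb

n = 10_000
cross_encoder_passes = comb(n, 2)  # every unordered pair needs its own forward pass
bi_encoder_passes = n              # each sentence is encoded exactly once
cosine_comparisons = comb(n, 2)    # still one comparison per pair, but each is just a dot product

print(f"cross-encoder forward passes: {cross_encoder_passes:,}")  # 49,995,000
print(f"bi-encoder forward passes:    {bi_encoder_passes:,}")     # 10,000
print(f"reduction in model calls:     {cross_encoder_passes / bi_encoder_passes:,.1f}x")  # 4,999.5x
```

The number of comparisons does not shrink; what changes is that the expensive transformer pass runs n times instead of roughly n²/2 times.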

Problem + mechanism

The pain has two layers. One is speed: the combinatorial explosion above. The other is quality: many assume "BERT's output vectors can be compared by cosine directly"—badly wrong. The Sentence-BERT paper (Reimers & Gurevych 2019) measured that averaging BERT's token vectors or taking the [CLS] vector for sentence similarity is even worse than averaging GloVe word vectors. The reason: BERT's pretraining objective (fill-in-the-blank) never required "semantically similar sentences to have similar vectors", so its vector space isn't built for cosine comparison.

The fix is a Siamese network + contrastive fine-tuning:

  • Both sentences pass through the same BERT (shared weights, hence "Siamese"), each pooled (usually mean over token vectors) into sentence vectors u, v;
  • Train on labeled sentence pairs: pull similar pairs together, push dissimilar ones apart (classification / regression / triplet objectives). This is contrastive learning—not teaching "what this sentence is", but "which two should sit next to each other";
  • After training, semantically close sentences are also close in cosine distance, ready to feed a vector database.
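The pooling and training signal described above can be sketched without any ML library. Assumptions to note: the 2-D "token vectors" are made up, and the triplet loss here uses cosine distance with a margin of 0.5 for readability, whereas the Sentence-BERT paper's triplet objective uses Euclidean distance. Function names are illustrative.

```python
import math

def mean_pool(token_vecs, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [v for v, m in zip(token_vecs, attention_mask) if m]
    return [sum(col) / len(kept) for col in zip(*kept)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = 1 - cosine(anchor, positive)
    d_neg = 1 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy token vectors; the last position of each sequence is padding and is ignored.
anchor   = mean_pool([[1.0, 0.0], [0.8, 0.2], [9.0, 9.0]], [1, 1, 0])
positive = mean_pool([[0.9, 0.1], [0.7, 0.3], [9.0, 9.0]], [1, 1, 0])
negative = mean_pool([[0.0, 1.0], [0.2, 0.8], [9.0, 9.0]], [1, 1, 0])

print(triplet_loss(anchor, positive, negative))  # already well separated, so the loss is 0.0
print(triplet_loss(anchor, negative, positive))  # roles swapped: positive loss, gradients would push apart
```

During real training the same encoder produces all three vectors, so minimizing this loss reshapes one shared vector space.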

This is the direct ancestor of most modern embedding models / RAG retrievers, and the origin of the sentence-transformers library. Day 22's dense retrieval and Day 4's RAG embeddings both sit on it.

Two ways to compare sentences

Cross-encoder (vanilla BERT): A+B → one BERT → similarity
accurate, but computed per pair → N sentences need N(N−1)/2 ≈ N²/2 runs → not precomputable

Bi-encoder (Sentence-BERT):
A → BERT → vec u    B → BERT → vec v
└─ encoded independently, stored ahead ─┘ → query-time cos(u,v), millisecond
Code
from sentence_transformers import SentenceTransformer, util

# A lightweight sentence embedder (384-dim), trained for cosine comparison
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["consistency trade-offs in distributed systems",
        "how to explain quantum mechanics to a child",
        "the CAP theorem and eventual consistency"]

# 1) Offline: encode all docs into vectors, store in a vector DB (in-memory here)
doc_emb = model.encode(docs, convert_to_tensor=True)

# 2) Online: encode the query, then just one cosine-similarity lookup
q_emb = model.encode("how to trade off consistency vs partition tolerance", convert_to_tensor=True)
hits = util.cos_sim(q_emb, doc_emb)[0]
for i in hits.argsort(descending=True)[:2]:
    print(f"{hits[i]:.3f}  {docs[i]}")
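# Expect the two consistency-related docs on top and the quantum one last.
# "partition tolerance" appears in no doc, so recalling the CAP doc relies on meaning, not only keyword overlap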
Pitfall + use case
"Any BERT / LLM hidden vector works for semantic retrieval"—wrong. Without contrastive fine-tuning, cosine similarity in the vector space is often unreliable (semantically close sentences aren't necessarily close in vector space). Always use a purpose-trained embedding model (the sentence-transformers family, OpenAI / Cohere embeddings, etc.)—don't pull hidden states straight out of a general LLM.
📌 BigCat scenario: build a semantic search over years of notes / bookmarks—no longer keyword-based (searching "consistency" misses a note that only wrote "CAP"), but recall by meaning. Encode all notes into a database offline, and every query returns in milliseconds—the most practical encoder application for a personal knowledge base.
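The "encode offline, look up online" split can be sketched as a tiny brute-force index. Everything here is illustrative: `NoteIndex` is a hypothetical class, and the 3-D vectors stand in for real embeddings (for example the output of `model.encode`). A real system would use a vector database or an approximate nearest-neighbor library instead of a linear scan.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class NoteIndex:
    """Hypothetical brute-force index: unit vectors stored offline, dot product online."""

    def __init__(self):
        self.titles, self.vectors = [], []

    def add(self, title, vector):
        self.titles.append(title)
        self.vectors.append(normalize(vector))  # normalize once, at indexing time

    def search(self, query_vector, k=2):
        q = normalize(query_vector)
        # the dot product of two unit vectors equals their cosine similarity
        scored = [(sum(a * b for a, b in zip(q, v)), t)
                  for v, t in zip(self.vectors, self.titles)]
        return sorted(scored, reverse=True)[:k]

index = NoteIndex()
index.add("CAP theorem notes", [0.9, 0.1, 0.0])
index.add("chocolate cake recipe", [0.0, 0.1, 0.9])
index.add("eventual consistency", [0.8, 0.3, 0.1])

results = index.search([1.0, 0.2, 0.0], k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")
```

Adding a note costs one encoder pass; answering a query costs one encoder pass plus cheap dot products, which is where the millisecond latency comes from.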
Takeaway + question
💡 "Can encode" ≠ "vectors are comparable"—the vector space is shaped by the training objective; for cosine to be meaningful, you must train with a contrastive objective.
🤔 Once you train in the constraint "semantically close = vectors close", you've handed the definition of "similar" to your training data. In your retrieval system, who actually defines "similar"?
Engineering counterpart → super-individual D11: RAG in Practice (choosing embeddings)

ColBERT — Late Interaction

retrieval arch · multi-vector · accuracy/speed
One-line analogy

The two methods above are extremes: bi-encoders are fast but coarse (the whole sentence squashed into one vector, details lost, like hashing a full row into one value), cross-encoders are accurate but slow (every query-doc pair computed fresh, not precomputable). ColBERT takes the middle road: it stores one vector per document token (like building a fine-grained column index), and at query time does a lightweight token-level match. It keeps token-level detail yet still precomputes the heavy document-side work.

Problem + mechanism

Single-vector bi-encoders have a fundamental bottleneck: compressing a whole passage into one 768-dim vector inevitably averages away the detail of long or multi-topic documents, so the signal of a query keyword precisely hitting one sentence gets diluted in the global average. Cross-encoders don't have this issue (query and doc tokens interact fully), but the cost is that nothing can be precomputed: there is no standalone document vector, every score needs a joint forward pass over the query and the document, so each query re-runs the model over every candidate in the corpus.

ColBERT's (Khattab & Zaharia 2020) key term is "late interaction": defer the interaction step to the very end and make it cheap. The flow:

  • Offline: documents pass through BERT, keeping every token's vector (not pooled into one), and the whole corpus's token vectors are indexed;
  • Online: the query is also encoded into a sequence of token vectors; scoring uses MaxSim—for each query token, find its most similar token among all doc tokens and take that max similarity; then sum the maxes across all query tokens for the total score.
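Keeping every token vector has a storage cost that is worth estimating before committing to this design. The back-of-the-envelope calculation below is a sketch under stated assumptions: a corpus of 1,000,000 documents averaging 100 tokens each is made up for illustration, the single-vector baseline is 768-dim float32, and the compact multi-vector case assumes 128-dim vectors at 2 bytes per value.

```python
GB = 1e9
n_docs, tokens_per_doc = 1_000_000, 100   # assumed corpus shape

def index_gb(vectors_per_doc, dim, bytes_per_value):
    """Raw vector storage in gigabytes, ignoring index overhead."""
    return n_docs * vectors_per_doc * dim * bytes_per_value / GB

single = index_gb(1, 768, 4)                      # one float32 vector per doc
naive_multi = index_gb(tokens_per_doc, 768, 4)    # one full-size vector per token
compact_multi = index_gb(tokens_per_doc, 128, 2)  # smaller, lower-precision token vectors

print(f"single-vector index:       {single:8.1f} GB")
print(f"naive per-token index:     {naive_multi:8.1f} GB ({naive_multi / single:.0f}x)")
print(f"compact per-token index:   {compact_multi:8.1f} GB ({compact_multi / single:.1f}x)")
```

Under these assumptions the per-token index is between roughly 8x and 100x larger than the single-vector one, depending on how aggressively the token vectors are shrunk.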

MaxSim formula: $S(q,d) = \sum_{i \in q} \max_{j \in d} \left( E_{q,i} \cdot E_{d,j} \right)$. Intuition: i iterates over each query token, the inner max over j is "how well does this query token's best-matching spot in the document match", and the outer Σ sums each query token's best match. It's essentially "soft keyword matching": the precision of BM25-style exact term hits, but done in semantic vector space (synonyms match too). Document token vectors can be prestored, and interaction is just lookup-and-max, so it's about two orders of magnitude faster than a cross-encoder while approaching its accuracy.
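A worked example with hand-made vectors makes the formula concrete. The 3-D unit vectors below are invented for illustration (think of the axes as rough "consistency", "database", and "baking" directions); real token vectors come from a trained model.

```python
def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )

query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]                    # two query tokens
doc_a = [[0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # on-topic, plus one stray token
doc_b = [[0.0, 0.0, 1.0], [0.0, 0.6, 0.8]]                    # mostly about baking

print(maxsim(query, doc_a))  # best matches are 0.8 and 1.0, total 1.8
print(maxsim(query, doc_b))  # best matches are 0.0 and 0.6, total 0.6
```

Note that the stray "baking" token in doc_a costs nothing: each query token only looks at its single best match, so irrelevant document tokens do not dilute the score.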

Trade-offs of three retrieval architectures

Architecture    Stored per doc          Precomputable?   Speed     Accuracy
Bi-encoder      1 vector                yes              fastest   loses detail
ColBERT         N vectors (per token)   doc side only    fast      keeps token detail
Cross-encoder   nothing (joint pass)    no               slowest   most accurate

MaxSim: each query token → find best match in doc → sum them all
Code
# Use MaxSim to illustrate ColBERT's scoring logic (simplified)
import torch
from sentence_transformers import SentenceTransformer

enc = SentenceTransformer("all-MiniLM-L6-v2")

def token_vecs(text):
    # per-token vectors, normalized (real ColBERT uses a purpose-trained model)
    feats = enc.tokenize([text])
    feats = {k: v.to(enc.device) for k, v in feats.items()}  # match the model's device
    with torch.no_grad():  # inference only, no gradients needed
        out = enc[0].auto_model(**feats)
    return torch.nn.functional.normalize(out.last_hidden_state[0], dim=-1)

def maxsim(q, d):
    Q, D = token_vecs(q), token_vecs(d)
    sim = Q @ D.T            # [query_tok, doc_tok] all pairwise similarities
    return sim.max(dim=1).values.sum().item()  # max per q-token, then sum

q = "how to keep data consistent"
print(maxsim(q, "eventual consistency in distributed databases"))
print(maxsim(q, "a recipe for chocolate cake"))  # expected to score lower
Pitfall + use case
"ColBERT is fast and accurate, so replace single-vector retrieval entirely"—watch its storage cost: each document stores not 1 vector but dozens to hundreds (one per token), making the index one or two orders of magnitude larger. ColBERTv2 (2021) eases this with residual compression, but it's still far heavier than a bi-encoder. It fits scenarios with high recall-precision demands and a manageable corpus, not blind wholesale replacement.
📌 BigCat scenario: when RAG recall quality disappoints, use ColBERT as a tier that's both retriever and reranker—especially when documents are full of technical terms / precise nouns (tech docs, papers), where its token-level matching catches the "that exact keyword hit" signal better than a single vector. Understanding the mechanism, you'll know you're trading "storage / speed" for "accuracy".
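A two-stage "retrieve, then rerank" setup can be sketched with toy vectors. This is an illustration under assumptions, not ColBERT's actual pipeline: the function names and the three-document corpus are invented, stage 1 uses mean-pooled single vectors with cosine similarity, and stage 2 reranks the survivors with MaxSim.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vec(token_vecs):
    return [sum(col) / len(token_vecs) for col in zip(*token_vecs)]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def maxsim(query_vecs, doc_vecs):
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def first_stage(query_vecs, corpus, k):
    """Single-vector retrieval: rank by cosine of mean-pooled vectors."""
    q_single = mean_vec(query_vecs)
    ranked = sorted(corpus, key=lambda name: cosine(q_single, mean_vec(corpus[name])),
                    reverse=True)
    return ranked[:k]

def retrieve_then_rerank(query_vecs, corpus, k=2):
    """Rerank only the first-stage survivors with token-level MaxSim."""
    candidates = first_stage(query_vecs, corpus, k)
    return sorted(candidates, key=lambda name: maxsim(query_vecs, corpus[name]),
                  reverse=True)

query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
corpus = {
    "focused":          [[0.8, 0.6, 0.0], [0.6, 0.8, 0.0]],
    # exact hits for both query tokens, buried among off-topic tokens
    "long_multi_topic": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]] + [[0.0, 0.0, 1.0]] * 4,
    "off_topic":        [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}

print(first_stage(query, corpus, k=2))           # ['focused', 'long_multi_topic']
print(retrieve_then_rerank(query, corpus, k=2))  # ['long_multi_topic', 'focused']
```

The long document's exact hits are diluted in its mean vector, so stage 1 ranks it second; MaxSim recovers them and promotes it, which is the "exact keyword hit" signal the scenario relies on.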
Takeaway + question
💡 Retrieval architecture is essentially a trade-off of "how early the interaction happens vs how precomputable it is": earlier interaction = more accurate, later = faster. ColBERT pushes interaction to the end and makes it light, landing on the sweet spot.
🤔 Does the "early interaction = accurate, late = fast" axis also hold in systems you know (DB join strategies, cache precomputation)? The same trade-off recurring in different clothes—what does that tell you?
Engineering counterpart → super-individual D11: RAG in Practice (choosing rerankers)

Further Reading

Deep Questions

1. Since decoders (the GPT family) can also produce text vectors, why are RAG / semantic search still dominated by encoders (the BERT family)? Where does the difference in "understanding" lie?
The first difference is whether attention is bidirectional. Encoders have no causal mask, so each token, when encoded, absorbs all left and right context, which is naturally suited to "compress a passage into one understood representation". Decoders are constrained by the causal mask: token i sees only what precedes it, so using the last token's hidden state to represent the whole sentence is inherently biased, because earlier words "can't see" later ones. The second difference is the training objective: GPT learns "predict the next word" and is never required to make "similar sentences have similar vectors"; to use a decoder as an embedder, you must add contrastive fine-tuning like Sentence-BERT (recent LLM-based embedders such as E5-mistral do exactly this, very effectively). So it's not that "decoders can't", but that encoders are cheaper and more fit-for-purpose: understanding tasks can use models an order or two of magnitude smaller that run faster, and the bidirectional architecture was built for representation. Since 2024 the trend has been two-track: small tasks use the BERT family, while top retrieval quality uses contrastively fine-tuned LLM embeddings.
2. RoBERTa dropping NSP and getting better has methodological meaning far beyond NLP. What is it reminding researchers of?
At least three things. First, "a reasonable-looking design" ≠ "a useful design": NSP was well-motivated in the BERT paper (learn inter-sentence relations) yet didn't survive a strict ablation. Any component added by intuition should be asked "would removing it hurt?" Second, the control group must be fully trained: RoBERTa revealed that many prior "beat-BERT" works compared against an undertrained BERT, and once the baseline was fed properly the advantage shrank or vanished—a fairness issue in scientific comparison, pervasive in AI papers. Third, negative results have value too: RoBERTa has almost no "new invention"; its contribution is cleanly isolating variables to tell the community "where the gains come from and where they don't". The transfer for BigCat's technical decisions: seeing any "new beats old" comparison, your first reaction should be "was the control group configured fairly, trained/tuned properly?"
3. Bi-encoder → ColBERT → cross-encoder is a continuum of "interaction earlier, computation heavier". Mapped onto database / cache trade-offs you know, what shared structure do you see?
It's the same trade-off in different clothing: "precompute (materialize) vs compute on demand". Bi-encoder ≈ materialized view / inverted index: do the expensive work offline, just look up at query time—blazing fast, but the index "freezes" once built and can't be tailored to the current query, losing accuracy. Cross-encoder ≈ full JOIN computed at query time: uses all of the current query's info each time, most accurate, but utterly non-precomputable, expensive and slow. ColBERT ≈ fine-grained index + lightweight join: the heavy doc-side work is done offline (per-token materialized), interaction deferred to query time but only a cheap MaxSim, landing on the sweet spot. The shared structure: redistribute computation along the time axis—push everything decoupled from the query forward into precomputation, keep the query-dependent part as light and as late as possible. This principle recurs in CDNs, prepared statements, KV cache (Day 9), and query optimizers.
4. Sentence-BERT outsources the definition of "similar" to its training data (which pairs were labeled close). What risk does this pose when you design a personal semantic search?
It means "similar" is not objective but shaped by the training set's labeling preferences. General embedding models train mostly on public corpora (news, QA, NLI), so their notion of "similar" is the similarity in that data. Applied to your personal knowledge base it may misalign: you may believe "Buddhism's 'impermanence' and distributed systems' 'eventual consistency' are about the same thing", but no one in the training data paired them, so the model won't recall them together—your most prized cross-disciplinary connections are precisely the blind spot of generic similarity. Concrete risks: (1) domain terms get over-generalized ("consistency" is CAP to you, the model may drift toward "personality consistency"); (2) your unique conceptual links can't be retrieved. Mitigations: for high-value cases, fine-tune with your own notes / labeled positive pairs to teach the model "your definition of similar"; or add a hand-maintained concept graph at the retrieval layer. The deeper lesson: outsourcing "similarity judgment" outsources part of "how you organize the world"—worth being wary of which connections you're unwilling to give up.
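The "hand-maintained concept graph at the retrieval layer" mitigation can be as simple as expanding a query with personally linked concepts before it is embedded. The sketch below is purely illustrative: `CONCEPT_LINKS`, its entries, and `expand_query` are hypothetical, and matching is a plain case-insensitive substring test.

```python
# Hypothetical hand-maintained links; the entries are illustrative only.
CONCEPT_LINKS = {
    "impermanence": {"eventual consistency"},
    "eventual consistency": {"impermanence", "CAP"},
}

def expand_query(query, links=CONCEPT_LINKS):
    """Return the original query plus one extra query per linked concept it mentions."""
    lowered = query.lower()
    expansions = []
    for concept, neighbours in links.items():
        if concept.lower() in lowered:
            expansions.extend(sorted(neighbours))
    # dict.fromkeys de-duplicates while preserving insertion order
    return list(dict.fromkeys([query, *expansions]))

print(expand_query("notes on impermanence"))
# ['notes on impermanence', 'eventual consistency']
```

Each expanded query would then be embedded and searched separately, so a link the generic model never learned can still surface notes from the other discipline.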
5. These four models appeared within about two years (2018→2020), yet as of 2024 none is "obsolete" relative to another. They're more like a coexisting toolbox than a succession. What does this say about the real shape of AI progress?
It shows "AI progress = the old gets replaced" is a misconception. The real shape is more like niche differentiation: each model occupies a "cost/accuracy/scenario" niche, and new models mostly open a new niche or optimize one trade-off, rather than wholesale replacing. BERT holds "general understanding base", RoBERTa is a fitter individual in the same niche, Sentence-BERT opens the "comparable vectors" niche, ColBERT inserts a middle tier on the "accuracy/speed" axis. Even 2024's big LLMs haven't put BERT out of work—"classifying with a 340M model" and "classifying with GPT" are different cost magnitudes, the former irreplaceable in high-throughput, low-latency settings. The lesson for BigCat's pursuit of the "AI super-individual": don't just chase the newest and biggest model; build a layered toolbox—cheap small encoders for high-frequency understanding tasks, expensive big models reserved for genuine reasoning/generation. Knowing how to "pick the right tool for the niche" is closer to the essence of a super-individual than "always use the strongest model".