The past days were all about decoders (generative models like GPT). Today we switch tracks to the encoder—it doesn't generate text, it compresses text into vectors, and it's the backbone of RAG retrieval, semantic search, and classification. Four concepts form an evolutionary line: BERT lays the foundation → RoBERTa fixes the recipe → Sentence-BERT makes vectors comparable → ColBERT balances accuracy and speed.
GPT is like streaming log processing—it can only read left to right, and when predicting the next item it can't see the future. BERT is the opposite: it loads the whole table into memory for a full scan, where every token sees the entire left and right context at once. The price is that BERT can't generate (it has no "next word" task), but it understands a sentence better—because understanding inherently needs both sides of context.
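The "streaming vs full scan" contrast comes down to which positions each token is allowed to attend to. Below is a minimal sketch of the two visibility patterns; the helper name `visibility` and the example sentence are illustrative assumptions, not taken from any library.

```python
def visibility(n_tokens: int, causal: bool) -> list[list[int]]:
    """Row i marks which positions token i may attend to (1 = visible)."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]

tokens = ["she", "peeled", "the", "apple"]
decoder_mask = visibility(len(tokens), causal=True)   # GPT-style: left context only
encoder_mask = visibility(len(tokens), causal=False)  # BERT-style: full context

for name, mask in [("decoder", decoder_mask), ("encoder", encoder_mask)]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>7} sees {row}")
```

In the decoder pattern the first token sees only itself, while in the encoder pattern every row is all ones.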
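Before 2018, understanding tasks (classification, NER, QA) mostly relied on unidirectional language models or shallow word vectors. The pain: a single direction can't see everything. Take "Apple launched a new phone" vs "she peeled the apple". In the first sentence the clue that "Apple" is a company comes after the word, so a left-to-right model hasn't seen it yet when it reaches "Apple"; in the second sentence the clue ("peeled") comes before it. A model that reads in only one direction will always miss one of these cases. BERT (Devlin et al. 2018) solves this with an encoder-only architecture plus two self-supervised pretraining tasks: masked language modeling (MLM), which hides about 15% of the tokens and asks the model to recover them from the context on both sides, and next sentence prediction (NSP), which asks whether sentence B actually follows sentence A in the source text.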
After pretraining, BERT has learned general-purpose language representations. Downstream you attach a small task-specific head and fine-tune; at release this set state-of-the-art results on classification, NER, and extractive QA benchmarks. This is what made the "pretrain + fine-tune" paradigm mainstream for understanding tasks.
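BERT's fill-in-the-blank objective (masked language modeling) can be sketched in a few lines. This is a simplified illustration, assuming the 15% selection rate and the 80/10/10 replacement rule described in the BERT paper; the toy `VOCAB` list and the function name `mlm_mask` are hypothetical.

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "paris", "apple", "runs"]  # toy vocabulary (hypothetical)

def mlm_mask(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: leave the token unchanged
        else:
            corrupted.append(tok)                    # not selected: no loss at this position
            labels.append(None)
    return corrupted, labels

sentence = "the capital of france is paris".split()
# mask_prob is raised here only so that a six-token demo visibly masks something
print(mlm_mask(sentence, mask_prob=0.5, seed=1))
```

The loss is computed only at positions where the label is not `None`, which is why the model has to use both the left and the right context to fill each blank.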
    from transformers import AutoModel, AutoTokenizer, pipeline

    # The most natural BERT demo is its pretraining task: fill-in-the-blank
    fill = pipeline("fill-mask", model="bert-base-uncased")

    # BERT completes [MASK] using bidirectional context
    for r in fill("The capital of France is [MASK].")[:3]:
        print(f"{r['token_str']:>8} {r['score']:.3f}")
    # "paris" should rank first by a wide margin: it is inferred from the
    # context on both sides of the blank, not by "continuation"

    # Get the sentence's hidden vectors (raw material for downstream tasks)
    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mdl = AutoModel.from_pretrained("bert-base-uncased")
    out = mdl(**tok("hello world", return_tensors="pt"))
    print(out.last_hidden_state.shape)  # [1, seq_len, 768]: one vector per token
RoBERTa (Liu et al. 2019) has the exact same architecture as BERT—no new CPU, no schema change. It just re-tuned the training hyperparameters and data recipe and decisively beat the original. Like getting a slow database and, without new hardware, just enlarging the buffer pool, killing a useless background job, and feeding it 10× more data—performance doubles. The conclusion: BERT was severely undertrained.
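After BERT, many assumed "to get stronger you must change the architecture." The RoBERTa paper is a careful replication and ablation study showing that much of the gain comes not from new structure but from getting the old recipe right. Key changes: train longer with much larger batches; use roughly 10× more data (about 160 GB of text vs 16 GB); drop the NSP objective; train on longer sequences; and replace static masking with dynamic masking, so each pass over a sentence sees a freshly sampled mask.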
Not one change is a "new architecture", yet together they set new state-of-the-art results on benchmarks like GLUE / SQuAD. RoBERTa's real contribution is methodological: before claiming "I invented a better architecture", first confirm you trained the baseline thoroughly. Otherwise you're comparing "an undertrained old model vs a fully-trained new one", and the conclusion is invalid.
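One of the recipe changes reported in the RoBERTa paper is dynamic masking: instead of fixing the masked positions once during preprocessing, a new mask is sampled each time a sequence is fed to the model. Here is a simplified sketch of the difference. The helper `mask_positions` is hypothetical, and the 0.3 rate is exaggerated so the effect is visible on a ten-token toy sentence.

```python
import random

def mask_positions(n_tokens, mask_prob, rng):
    """Sample the set of token positions to mask for one pass."""
    return {i for i in range(n_tokens) if rng.random() < mask_prob}

tokens = "better data beats a fancier architecture most of the time".split()

# Static masking: sample once, then reuse the same positions every epoch
static_mask = mask_positions(len(tokens), 0.3, random.Random(0))
static_epochs = [static_mask for _ in range(3)]

# Dynamic masking: draw a new sample on every epoch
rng = random.Random(0)
dynamic_epochs = [mask_positions(len(tokens), 0.3, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(f"epoch {epoch}: static={sorted(s)} dynamic={sorted(d)}")
```

With static masking the model is asked the same fill-in-the-blank questions about a sentence every epoch; with dynamic masking the same sentence keeps producing new questions, which matters more the longer you train.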
    from transformers import pipeline

    # RoBERTa's interface is identical to BERT's: same architecture, swap weights
    # Note: RoBERTa has no NSP; its mask token is <mask>, not [MASK]
    fill = pipeline("fill-mask", model="roberta-base")
    for r in fill("Better data beats a fancier <mask>.")[:3]:
        print(f"{r['token_str']:>12} {r['score']:.3f}")

    # In practice RoBERTa is a common base for fine-tuning "understanding" tasks,
    # e.g. sentiment classification or NLI, with the same fine-tune workflow
    clf = pipeline("sentiment-analysis",
                   model="cardiffnlp/twitter-roberta-base-sentiment-latest")
    print(clf("Just tuning the recipe shot the scores up. Elegant."))
To compare two sentences, vanilla BERT must feed the pair, concatenated, through one forward pass, like doing a full-table JOIN per query. Comparing 10,000 sentences pairwise takes 10,000 × 9,999 / 2 ≈ 50 million BERT inferences, which the Sentence-BERT paper puts at roughly 65 hours. Sentence-BERT instead builds an index: encode each sentence independently into one fixed vector, precompute and store them, then compare via cosine similarity. That moves the work from "compute at JOIN time" to "precompute + lookup", and the same job drops from tens of hours to seconds.
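The arithmetic behind that blow-up is worth spelling out. A small sketch, counting only transformer forward passes (the variable names are illustrative):

```python
from math import comb

n_sentences = 10_000

# Cross-encoder style: one forward pass per unordered sentence pair
pair_passes = comb(n_sentences, 2)

# Bi-encoder style: one forward pass per sentence, computed once and stored
single_passes = n_sentences

print(pair_passes)                   # 49995000
print(pair_passes // single_passes)  # 4999
```

The bi-encoder still performs about 50 million comparisons, but each one is a cosine between two stored vectors rather than a full forward pass, which is where the speedup comes from.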
The pain has two layers. One is speed: the combinatorial explosion above. The other is quality: many assume "BERT's output vectors can be compared by cosine directly"—badly wrong. The Sentence-BERT paper (Reimers & Gurevych 2019) measured that averaging BERT's token vectors or taking the [CLS] vector for sentence similarity is even worse than averaging GloVe word vectors. The reason: BERT's pretraining objective (fill-in-the-blank) never required "semantically similar sentences to have similar vectors", so its vector space isn't built for cosine comparison.
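The fix is a Siamese network + contrastive fine-tuning: the same BERT weights encode each sentence separately, a pooling layer (mean pooling by default) collapses the token vectors into one fixed-size sentence vector, and the model is fine-tuned on labeled sentence pairs so that semantically close sentences end up close under cosine similarity. The paper trains with classification, regression, and triplet objectives on NLI and STS data.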
This is the direct ancestor of every modern embedding model / RAG retriever, and the origin of the sentence-transformers library. Day 22's dense retrieval and Day 4's RAG embeddings both sit on it.
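The "one fixed vector per sentence, then cosine" step can be sketched without any model at all. The example below uses toy 2-dimensional token vectors and assumes mean pooling over non-padding tokens, which the Sentence-BERT paper uses as its default; the helper names are illustrative.

```python
import math

def mean_pool(token_vecs, attention_mask):
    """Average only the real tokens (mask == 1), ignoring padding positions."""
    kept = [v for v, m in zip(token_vecs, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[k] for v in kept) / len(kept) for k in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy token vectors; the last row of sent_a is padding and must not affect the result
sent_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
sent_b = [[2.0, 2.0]]
mask_b = [1]

vec_a = mean_pool(sent_a, mask_a)
vec_b = mean_pool(sent_b, mask_b)
print(vec_a, vec_b, round(cosine(vec_a, vec_b), 6))
```

Pooling is the cheap part. What makes the resulting vectors meaningful under cosine is the fine-tuning, which this sketch does not attempt to show.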
    from sentence_transformers import SentenceTransformer, util

    # A lightweight sentence embedder (384-dim), trained for cosine comparison
    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["consistency trade-offs in distributed systems",
            "how to explain quantum mechanics to a child",
            "the CAP theorem and eventual consistency"]

    # 1) Offline: encode all docs into vectors, store in a vector DB (in-memory here)
    doc_emb = model.encode(docs, convert_to_tensor=True)

    # 2) Online: encode the query, then just one cosine-similarity lookup
    q_emb = model.encode("how to trade off consistency vs partition tolerance",
                         convert_to_tensor=True)
    hits = util.cos_sim(q_emb, doc_emb)[0]
    for i in hits.argsort(descending=True)[:2]:
        i = int(i)  # tensor index -> plain int for list indexing
        print(f"{hits[i]:.3f} {docs[i]}")
    # Expected: the two consistency-related docs outrank the quantum-mechanics one.
    # The query never mentions "CAP theorem", yet that doc should still score high:
    # the match is made in vector space, not by exact keyword overlap
The two methods above are extremes: bi-encoders are fast but coarse (the whole sentence squashed into one vector, details lost, like hashing a full row into one value), cross-encoders are accurate but slow (every query-doc pair computed fresh, not precomputable). ColBERT takes the middle road: it stores one vector per document token (like building a fine-grained column index), and at query time does a lightweight token-level match. It keeps token-level detail yet still precomputes the heavy document-side work.
Single-vector bi-encoders have a fundamental bottleneck: compressing a whole passage into one 768-dim vector inevitably averages away the detail of long or multi-topic documents—the signal of a query keyword precisely hitting one sentence gets diluted in the global average. Cross-encoders don't have this issue (query and doc tokens interact fully), but at the cost of being not precomputable: document vectors depend on the current query, so each query re-runs the whole corpus.
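ColBERT's (Khattab & Zaharia 2020) key term is "late interaction": defer the interaction step to the very end and make it cheap. The flow: offline, run every document through the encoder and store one vector per token; online, encode only the query into its own per-token vectors; then score each candidate document with MaxSim over the stored vectors and rank by that score.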
MaxSim formula: $S(q,d) = \sum_{i \in q} \max_{j \in d} \left( E_{q,i} \cdot E_{d,j} \right)$. Intuition: $i$ iterates over each query token, $\max_j$ is "how well does this query token's best-matching spot in the document match", and the outer $\sum$ sums each query token's best match. It's essentially "soft keyword matching": the precision of BM25-style exact term hits, but done in semantic vector space (synonyms match too). Document token vectors can be prestored, and interaction is just lookup-and-max, so it's about two orders of magnitude faster than a cross-encoder while approaching its accuracy.
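To make the formula concrete, here is MaxSim computed by hand on toy unit vectors, with no model involved. The vectors are made up for illustration: each query "token" either has an exact counterpart in the document or it doesn't.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def maxsim(query_vecs, doc_vecs):
    """For each query token take its best match in the document, then sum."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# Toy unit-length token vectors (hypothetical values)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_hit = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # has a perfect match for each query token
doc_miss = [[-1.0, 0.0], [0.0, -1.0]]            # nothing points the same way as the query

print(maxsim(query, doc_hit))   # 2.0
print(maxsim(query, doc_miss))  # 0.0
```

Note that the extra, unrelated token in `doc_hit` does not lower its score: the max picks the best spot per query token, so irrelevant parts of a long document don't dilute a precise hit the way averaging into a single vector would.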
    # Use MaxSim to illustrate ColBERT's scoring logic (simplified)
    import torch
    from sentence_transformers import SentenceTransformer

    enc = SentenceTransformer("all-MiniLM-L6-v2")

    def token_vecs(text):
        # per-token vectors, normalized (real ColBERT uses a purpose-trained model)
        feats = enc.tokenize([text])
        feats = {k: v.to(enc.device) for k, v in feats.items()}  # match the model's device
        with torch.no_grad():  # inference only, no gradients needed
            out = enc[0].auto_model(**feats)
        return torch.nn.functional.normalize(out.last_hidden_state[0], dim=-1)

    def maxsim(q, d):
        Q, D = token_vecs(q), token_vecs(d)
        sim = Q @ D.T  # [query_tok, doc_tok] all pairwise similarities
        return sim.max(dim=1).values.sum().item()  # max per q-token, then sum

    q = "how to keep data consistent"
    print(maxsim(q, "eventual consistency in distributed databases"))
    print(maxsim(q, "a recipe for chocolate cake"))  # expected to be lower