An "embedding" compresses a word, sentence, or image into a string of numbers — a high-dimensional vector. But how those vectors are arranged in space is itself what encodes the meaning. Today isn't about a specific encoder model (that was Day 24); it's about the geometric regularities that run through all of representation learning: why "king − man + woman ≈ queen" holds, what training objective modern embeddings rely on, why BERT's sentence vectors all "clump together," and how one vector can serve both fast lookup and precise reranking. Understand the geometry and you'll know why an embedding is good — or where it bites.
The difference between two word vectors is itself meaningful, much like the patch between two git commits: you can extract the "gender diff" from one pair of words and apply it to another. Equivalently, king − man is a displacement vector that roughly means "royalty with maleness removed"; add it to "woman" and you land near "queen." Semantic relations get encoded as reusable, consistently oriented translations.
Before dense embeddings took over, words were typically represented as one-hot vectors: each word gets its own isolated dimension, so "cat" and "dog" have a cosine similarity of exactly 0, and the machine has no idea both are animals. The breakthrough of Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) was to give words with similar meanings similar vectors, and to encode relations as directions.
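To make the contrast concrete, here is a minimal sketch comparing one-hot and dense vectors. The dense values are hand-picked, hypothetical numbers chosen only for illustration; they do not come from any trained model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab = ["cat", "dog", "democracy"]

# One-hot: each word gets its own axis, so distinct words are orthogonal.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["cat"], one_hot["dog"]))  # 0.0

# Dense vectors (hypothetical values): the first axis loosely stands for
# "animal-ness", the second for "politics".
dense = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "democracy": np.array([0.1, 0.9]),
}
related = cosine(dense["cat"], dense["dog"])
unrelated = cosine(dense["cat"], dense["democracy"])
print(related > unrelated)  # True
```

With one-hot vectors every pair of distinct words looks equally unrelated; with dense vectors, similarity can vary by degree.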
The mechanism rests on an old maxim: "a word is defined by its context." Word2Vec uses a shallow network to play fill-in-the-blank: given a center word, predict its neighbors (Skip-gram). To do this well, the model is forced to place words that appear in similar contexts close together. After training, "meaning" has crystallized into geometry: similar words cluster, and systematic relations such as singular→plural, country→capital, and verb tense show up as roughly parallel displacement vectors of similar length.
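The Skip-gram "fill-in-the-blank" task described above can be sketched by generating its training pairs. This is only the data-preparation step, not the full Word2Vec training loop; the function name `skipgram_pairs` and the toy sentence are made up for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=1)
print(pairs[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

Words that share many context words end up with many overlapping pairs, which is what pushes their vectors toward each other during training.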
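Note the "≈": this is not an exact equation. It only says that, after the addition and subtraction, the nearest neighbor happens to be "queen." Standard evaluations, including gensim's most_similar, also exclude the input words from the candidates; without that exclusion, "king" itself often comes out on top. This linear structure is stable only for frequent, regular relations; rare words and complex relations often break it, so don't treat it as a reliable reasoning engine.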
    import gensim.downloader as api

    # Load pretrained GloVe vectors (~128MB on first download)
    wv = api.load("glove-wiki-gigaword-100")  # 100-dim

    # Classic analogy: king - man + woman ≈ ?  Add/subtract, then find nearest neighbors
    result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
    print(result)  # 'queen' should rank first; exact scores depend on the vectors

    # Geometric check: related words score higher than unrelated ones
    print(wv.similarity("cat", "dog"))        # high: both are pets
    print(wv.similarity("cat", "democracy"))  # much lower: unrelated
Contrastive learning trains a "semantic fingerprint function": different surface forms of the same content (paraphrase, translation, noised) should hash to nearby spots (pull positives together), while unrelated content hashes far apart (push negatives away). Representation collapse is when this hash degenerates into "always return the same value" — every key crashes into one bucket, all similarities are 1, nothing was learned. Negatives are the constraint that prevents collapse.
Word2Vec gives word vectors, but we usually want a good vector for a whole sentence (for retrieval or clustering). The problem: there are no ready-made labels saying "these two sentences mean the same thing." Contrastive learning's trick is to manufacture your own positives: encode the same sentence twice under slightly different perturbations (e.g., two different dropout masks) and the two outputs form a natural positive pair, while the other sentences in the batch serve as negatives. This is what unsupervised SimCSE (Gao et al., 2021) does.
The objective is the InfoNCE loss (van den Oord et al., 2018). Intuition: it's an "N-way multiple choice" — given an anchor query, among "1 positive + many negatives," make the model assign the highest similarity to the positive. The formula:
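L = −log [ exp(sim(q,k⁺)/τ) / Σⱼ exp(sim(q,kⱼ)/τ) ]
Here sim is typically cosine similarity, k⁺ is the positive, the sum over j runs over the positive plus all negatives, and τ is a temperature (a smaller τ sharpens the softmax).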
Why are negatives needed? Pulling positives without pushing negatives lets the model find a lazy solution: map all sentences to the same point — the positive distance instantly goes to zero, yet nothing real was learned about distinguishing. That is representation collapse. Negatives in the denominator create a "mutual-repulsion tension" forcing representations to spread out (uniformity), so they don't collapse.
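The collapse argument can be checked numerically. The sketch below is a forward-only NumPy version of the loss, not a training implementation, and the two toy inputs are assumptions chosen to represent the extremes. If every sentence maps to the same point, the loss is stuck at log(B) for a batch of size B, which is exactly chance level for the "N-way multiple choice"; well-spread representations drive it toward 0.

```python
import numpy as np

def info_nce(z1, z2, tau=0.05):
    """Mean InfoNCE loss; z1[i] and z2[i] are positives, other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # [B, B] scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

B, d = 8, 16
collapsed = np.ones((B, d))   # every sentence mapped to the same point
print(info_nce(collapsed, collapsed), np.log(B))  # both ≈ 2.079

spread = np.eye(B, d)         # mutually orthogonal unit vectors
print(info_nce(spread, spread))                   # very close to 0
```

Because of the denominator, the collapsed solution is no longer a minimum of the objective: the only way to lower the loss is to make the positive stand out from the negatives.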
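    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, tau=0.05):
        # z1[i] and z2[i] are a positive pair; the rest of the batch are negatives
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        sim = z1 @ z2.T / tau  # [B, B] cosine-similarity matrix / temperature
        labels = torch.arange(z1.size(0), device=z1.device)
        # Diagonal entries are the positives, so this is a "pick the diagonal" classifier
        return F.cross_entropy(sim, labels)

    # Placeholder encoder and batch (assumptions, for illustration only);
    # in practice the encoder would be a Transformer with dropout enabled
    encoder = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.Dropout(0.1))
    encoder.train()  # dropout must be active to get two different views
    batch = torch.randn(8, 32)

    # Encode the same batch twice with dropout → two views (SimCSE idea)
    z1 = encoder(batch)  # dropout randomness → natural positive pairs
    z2 = encoder(batch)
    loss = info_nce(z1, z2)  # pull (z1[i], z2[i]) together, push the rest apart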
Anisotropy is the geometric version of data skew. Just as a bad hash function crams all keys into a handful of buckets, an anisotropic encoder crowds all vectors into a narrow cone, so cosine similarity gets systematically inflated and all sentence vectors look "kind of alike." Whitening "rebalances" this cone: a linear transform re-spreads the vectors evenly across the space so that similarity becomes trustworthy again.
Taking BERT's last layer directly as a sentence vector works surprisingly poorly. The diagnosed cause is anisotropy: the vectors aren't spread uniformly over a sphere but are squeezed into a narrow cone (all pointing in similar directions). Consequence: even two unrelated sentences can have a cosine similarity of 0.7 or more, because their directions were close to begin with, and similarity loses its discriminative power.
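One simple way to see the "narrow cone" effect is to measure the average cosine similarity over all pairs of vectors. The sketch below uses synthetic data rather than real BERT outputs: adding a shared offset to random vectors is an assumed stand-in for the common direction that anisotropic embeddings share.

```python
import numpy as np

def mean_pairwise_cosine(x):
    """Average cosine similarity over all distinct pairs of rows in x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    return float((sims.sum() - n) / (n * (n - 1)))  # drop the diagonal of 1s

rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 64))   # directions spread evenly
common = np.full(64, 2.0)                # hypothetical shared offset
anisotropic = isotropic + common         # everything tilted the same way

iso_score = mean_pairwise_cosine(isotropic)
aniso_score = mean_pairwise_cosine(anisotropic)
print(round(iso_score, 2))    # near 0
print(round(aniso_score, 2))  # far above 0
```

The content of the vectors (the random part) is identical in both cases; only the shared component changes, yet it dominates every pairwise similarity.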
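Whitening (Su et al., 2021, i.e. BERT-whitening) is a purely linear post-processing step; no retraining is needed. Two steps: (1) center the vectors by subtracting their mean μ; (2) multiply by a matrix W chosen so that the covariance of the result becomes the identity.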
In essence it's standard PCA whitening: given the covariance Σ of the centered vectors, find a transform W such that WᵀΣW = I. After the transform, cosine similarity is meaningful again, and scores on STS (semantic textual similarity) benchmarks often improve without any retraining; you also get dimensionality reduction as a bonus by keeping only the high-variance directions.
    import numpy as np

    def whitening_fit(embs):
        # embs: [N, d] a batch of sentence vectors
        mu = embs.mean(axis=0, keepdims=True)  # ① center
        cov = np.cov((embs - mu).T)            # covariance matrix [d, d]
        u, s, _ = np.linalg.svd(cov)           # SVD of a symmetric PSD matrix = eigen-decomposition
        # ② transform W: whiten the covariance into the identity
        W = u @ np.diag(1.0 / np.sqrt(s + 1e-8))
        return mu, W

    # Random stand-ins (assumption) for real sentence vectors
    rng = np.random.default_rng(0)
    train_embs = rng.normal(size=(1000, 64)) + 3.0
    query = train_embs[0]

    mu, W = whitening_fit(train_embs)

    # Apply: for any new vector, subtract the mean, multiply by W, then use cosine
    def transform(x):
        return (x - mu) @ W

    q = transform(query)            # shape [1, d] because mu is [1, d]
    q = q / np.linalg.norm(q)       # re-normalize after whitening, then retrieve
    print(np.linalg.norm(q))        # ≈ 1.0
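As a sanity check on the two claims above (identity covariance and optional dimensionality reduction), the following sketch fits whitening on synthetic correlated data and keeps only the top k directions. The `k` argument and the synthetic data are assumptions added for illustration; taking the first k columns of W works because `np.linalg.svd` returns singular values in descending order.

```python
import numpy as np

def fit_whitening(embs, k=None):
    """Return (mu, W) so that (x - mu) @ W has identity covariance; keep k dims if given."""
    mu = embs.mean(axis=0, keepdims=True)
    cov = np.cov((embs - mu).T)
    u, s, _ = np.linalg.svd(cov)                 # singular values sorted descending
    W = u @ np.diag(1.0 / np.sqrt(s + 1e-8))
    return mu, (W if k is None else W[:, :k])    # first k columns = top-variance directions

rng = np.random.default_rng(0)
# Synthetic stand-in for sentence vectors: correlated dims plus a shared offset.
embs = rng.normal(size=(2000, 32)) @ rng.normal(size=(32, 32)) + 5.0

mu, W = fit_whitening(embs, k=8)
white = (embs - mu) @ W
print(white.shape)                                          # (2000, 8)
print(np.allclose(np.cov(white.T), np.eye(8), atol=1e-6))   # True
```

In practice mu and W would be fit once on a representative corpus and then applied unchanged to every new vector, including queries.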
Like progressive JPEG: send the first chunk and you already see a low-res version of the whole image; more data means a sharper picture. In a vector trained with Matryoshka, the first k dimensions act as a usable low-res version of the whole. It is also like a composite index: reading just the first few columns already gives a coarse filter. Store the 1536-dim vector once, coarse-rank on the first 256 dims (fast), then fine-rank on all dims (accurate), with no need to train a separate model per precision.
Pain point: high-dim vectors (e.g., 1536-d) are accurate but expensive, since storage, memory, and retrieval cost all grow with dimension, while low-dim vectors are fast and cheap but not accurate enough. The traditional fix is to train a separate model per scenario, or to hard-truncate high-dim vectors after the fact. But an ordinary vector spreads its information across all dims, so a truncated prefix loses much of it and accuracy collapses.
Matryoshka Representation Learning (MRL, Kusupati et al., 2022) has a clever idea: at training time, force information to be packed into the prefix from coarse to fine, by importance. The mechanism just modifies the loss — instead of one loss on the full dimension, it computes one loss per nested prefix (e.g. [64,128,256,512,1024,1536]) and sums them:
L = Σ_{k∈{64,128,...,1536}} Loss( embedding[:k] )
Intuition: since every prefix is required to do the task on its own, the gradient pushes the most critical information to the very front (because the first 64 dims must stand alone), while later dims only add refinement. After training, one vector is naturally layered: truncate at any dimension and you get the "best possible" representation at that precision, not a random fragment. Overhead is near zero — just a few extra loss computations; one forward pass at inference.
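The summed loss can be sketched as follows. This is a forward-only NumPy illustration, assuming the per-prefix loss is a contrastive (InfoNCE-style) loss; MRL itself can wrap other task losses, and a real training loop would use an autograd framework. The prefix sizes and the random "encoder outputs" are hypothetical.

```python
import numpy as np

def info_nce(z1, z2, tau=0.05):
    """Mean InfoNCE loss; z1[i] and z2[i] are positives, other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def matryoshka_loss(z1, z2, dims=(64, 128, 256)):
    """Sum the contrastive loss over nested prefixes z[:, :k]."""
    return sum(info_nce(z1[:, :k], z2[:, :k]) for k in dims)

rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 256))               # hypothetical encoder outputs, view 1
z2 = z1 + 0.1 * rng.normal(size=(16, 256))    # view 2: a slightly perturbed copy

per_prefix = [info_nce(z1[:, :k], z2[:, :k]) for k in (64, 128, 256)]
total = matryoshka_loss(z1, z2)
print(per_prefix, total)
```

Each prefix is normalized and scored on its own, so the first 64 dims receive a gradient signal from every term in the sum, while the last dims only appear in the full-length term.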
    from openai import OpenAI
    import numpy as np

    client = OpenAI()  # needs OPENAI_API_KEY

    # text-embedding-3 models support Matryoshka-style truncation via `dimensions`
    def embed(text, dim):
        r = client.embeddings.create(model="text-embedding-3-large",
                                     input=text, dimensions=dim)
        v = np.array(r.data[0].embedding)
        # Re-normalizing is a harmless safeguard here; it is required if you
        # slice the prefix of a stored vector yourself
        return v / np.linalg.norm(v)

    text = "eventual consistency in distributed systems"
    full = embed(text, 3072)   # full, fine-rank
    small = embed(text, 256)   # low-dim, coarse
    print(full.shape, small.shape)  # (3072,) (256,): same model, shorter prefix
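The "store once, coarse-rank on the prefix, fine-rank on the full vector" pattern can be sketched without any API call. The vectors below are random stand-ins, not Matryoshka-trained embeddings, so the sketch only demonstrates the mechanics of the two stages; the function name and the parameters `coarse_dim`, `shortlist`, and `topk` are assumptions.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_search(query, docs, coarse_dim=256, shortlist=50, topk=5):
    """Coarse-rank on the first coarse_dim dims, then rerank the shortlist on all dims."""
    q_small = normalize(query[:coarse_dim])
    d_small = normalize(docs[:, :coarse_dim])       # re-normalize after truncation
    candidates = np.argsort(-(d_small @ q_small))[:shortlist]
    fine = normalize(docs[candidates]) @ normalize(query)
    return candidates[np.argsort(-fine)[:topk]]

rng = np.random.default_rng(0)
docs = rng.normal(size=(2_000, 1536))               # random stand-ins for stored vectors
query = docs[42] + 0.1 * rng.normal(size=1536)      # a near-duplicate of document 42

hits = two_stage_search(query, docs)
print(hits[0])  # 42
```

The expensive full-dimension comparison touches only the shortlist, so its cost no longer scales with the size of the whole collection.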