BM25 Lexical RetrievalBM25 / Lexical Retrieval
sparse retrievalstatistical scoring
One-line analogy
BM25 is that default full-text scoring function inside Elasticsearch / Lucene—you've been using it for years. Think of it as a relevance formula sitting on top of a database inverted index: given the words in a query, it scores each document by "how well it matches" and ranks them. It's the statistical upgrade of SQL LIKE—not just "does it contain the keyword" but "how well does it contain it".
What it solves + how it works
The most naive relevance score is just counting keyword occurrences (term frequency, TF). That has two flaws: (1) common words like "the"/"is" appear constantly but carry no signal; (2) long documents naturally have more words and get an unfair edge. BM25 fixes exactly these:
- IDF (inverse document frequency)—the rarer a word is across documents, the more discriminative it is, so it gets higher weight. "quantum entanglement" is worth more than "system";
- Term-frequency saturation—going from 1 to 10 occurrences matters a lot; 10 to 100 barely matters. BM25 lets the score saturate with frequency rather than grow linearly;
- Length normalization—discounts by relative document length so long docs can't win just by being wordy.
The core scoring formula (summed over each query term qi):
score(D,Q) = Σi IDF(qi) · f(qi,D)·(k₁+1) / [ f(qi,D) + k₁·(1 − b + b·|D|/avgdl) ]
In plain words: IDF(qi) = how rare the word is (rare words weigh more); f(qi,D) = how many times it appears in the doc; k₁ (typically 1.2–2.0) = how fast term frequency saturates, smaller saturates faster; b (typically 0.75) = strength of length penalty, 0 ignores length, 1 penalizes fully; |D|/avgdl = is this doc longer or shorter than average. The whole denominator is "term frequency + length penalty", keeping the numerator from inflating forever.
Code example
from rank_bm25 import BM25Okapi # pip install rank-bm25
corpus = [
"study of feline sleep cycles",
"how to configure a Postgres connection pool",
"cache consistency in distributed systems",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized) # builds inverted index + computes IDF
query = "cache consistency".split()
scores = bm25.get_scores(query) # one BM25 score per doc
print(scores) # doc 3 scores highest — exact term match
Pitfall + practical scenario
"Now that we have vector embeddings, BM25 is obsolete"—badly wrong. BM25 still reigns at exact matching: product SKUs (iPhone 15 Pro), names, error codes, code identifiers, rare proper nouns—exactly where embeddings tend to "drift" into near-synonyms. Most 2024–2025 retrieval systems keep BM25 as one channel rather than dropping it.
📌 BigCat scenario: when building personal knowledge-base search, for code snippets, API names, paper titles—where "the literal string IS the answer"—BM25 is often more accurate, cheaper, and needs zero GPU. Don't reach for vectors-only on day one.
Takeaway + question
💡 BM25 isn't "old-fashioned keyword search"—it's a statistical relevance function tuned over decades. On exact-match it still can't be beaten.
🤔 How much of your search need is really "find an exact word/ID" versus "find a semantically related concept"?
Dense RetrievalDense Retrieval
vector retrievalsemantics
One-line analogy
Encode each piece of text with a neural network into a single fixed-dimension vector (embedding), and retrieval becomes "find nearest neighbors in a high-dimensional space". Analogy: geohash—geographically close points get close codes. Dense retrieval swaps "geographically close" for "semantically close". "car", "automobile", "vehicle" cluster together even with zero overlapping characters.
What it solves + how it works
BM25's Achilles heel is vocabulary mismatch: the query says "how to reduce latency", the doc says "optimizing response time"—zero literal overlap → BM25 misses it. Dense retrieval encodes meaning into vectors, sidestepping the literal surface. The mechanism is a bi-encoder (dual encoder):
- query and document each pass through an encoder, yielding two vectors;
- score with dot product or cosine similarity: sim(q,d) = Eq(q) · Ed(d), larger = more relevant;
- document vectors can be precomputed offline into a vector store; at query time you only encode the query—this is what lets it scale to billions of docs.
How is the encoder trained? Contrastive learning: given a "question–correct passage" pair, pull their vectors together; push other passages in the batch away as negatives (in-batch negatives). Karpukhin et al.'s DPR paper (arXiv 2004.04906, 2020) showed that on open-domain QA, dense retrieval's top-20 accuracy beats Lucene-BM25 by 9–19 percentage points—the first time semantic retrieval systematically beat lexical at scale.
Bi-encoder: query and doc encoded separately, indexed offline
query→Encoder_q→ [0.2, -0.7, ...]
docs (millions)→Encoder_d→ vector store (precomputed)
at query time: only encode the query → nearest-neighbor search → top-k passages
Code example
from sentence_transformers import SentenceTransformer, util
# Sentence-BERT: encode sentences into comparable vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["several ways to optimize database response time",
"the sleeping habits of cats"]
doc_emb = model.encode(corpus, convert_to_tensor=True) # precompute
q_emb = model.encode("how to reduce system latency", convert_to_tensor=True)
hits = util.cos_sim(q_emb, doc_emb)[0] # cosine similarity
print(hits) # doc 1 scores high — zero literal overlap, semantic hit
Pitfall + practical scenario
"Dense retrieval always beats BM25"—wrong. It has three soft spots: (1) out-of-domain—domains the encoder never saw (medicine, law, your company jargon) collapse in recall; (2) exact match—SKUs and IDs get "understood" as near-synonyms; (3) long-tail rare entities. So the industry rarely uses dense retrieval alone—the next card shows how to fuse it with BM25.
📌 BigCat scenario: the killer feature for cross-disciplinary notes—search "Buddhist impermanence" and hit a note you wrote three months ago saying "distributed systems have no permanent state". That cross-domain semantic association is dense retrieval's home turf; BM25 simply can't do it.
Takeaway + question
💡 Dense retrieval turns "find the word" into "find the meaning"—its strength (semantic generalization) and weakness (exact match) are exactly complementary to BM25.
🤔 If the encoder never saw your specialty domain, is its "meaning" still trustworthy? What does that imply about when to fine-tune embeddings?
Hybrid SearchHybrid Search
fusion rankingRRF
One-line analogy
Since BM25 excels at exact lexical match and dense retrieval at semantics, don't pick one—retrieve with both and fuse the rankings. This is the ensemble learning idea, much like a database query planner combining multiple indexes, or a recommender's multi-channel recall + fusion. One query takes two paths, and their two rankings are merged into one.
What it solves + how it works
Every single-channel retriever has blind spots: BM25 misses rephrased queries, dense misses rare proper nouns. Hybrid runs both; the hard part is how to fuse. Two mainstream approaches:
- ① Weighted scores: final = α·BM25 + (1−α)·cosine. Problem: BM25 scores can range 0–30, cosine is 0–1—totally different scales, so naive addition lets BM25 crush cosine. You must normalize, but normalization is brittle to outliers.
- ② RRF (Reciprocal Rank Fusion): uses only ranks, not raw scores, sidestepping the scale problem at the root.
RRF's formula is dead simple—each document d's fused score:
RRF(d) = Σretriever i 1 / ( k + ranki(d) )
Intuition: the higher a doc ranks (smaller rank) in a channel, the larger 1/(k+rank), the more it contributes; docs ranked high in both channels have their scores add up and naturally surface. k (usually 60) is a smoothing constant—it stops the rank-1 doc from utterly crushing ranks 2 and 3, giving lower ranks a voice too. Cormack et al. 2009 (SIGIR) proved this almost embarrassingly simple formula consistently beats each single channel and even the more complex Condorcet fusion.
Hybrid search flow
query
├→ BM25 recall → ranking A
└→ dense recall → ranking B
↓ RRF fusion (ranks only)
final top-k
Code example
def rrf(rank_lists, k=60):
# rank_lists: each channel's doc_id list, sorted by relevance desc
scores = {}
for ranking in rank_lists:
for rank, doc_id in enumerate(ranking): # rank starts at 0
# ranks only, never touch raw scores -> scale-free
scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
return sorted(scores, key=scores.get, reverse=True)
bm25_top = ["d3", "d1", "d7"] # BM25 channel ranking
dense_top = ["d1", "d3", "d9"] # dense channel ranking
print(rrf([bm25_top, dense_top])) # d1/d3 high in both -> surface
Pitfall + practical scenario
"Just add the BM25 score and the cosine score together"—wrong. They differ by an order of magnitude; adding them means you effectively use only BM25. Either normalize rigorously (brittle), or use RRF on ranks only (robust, near zero tuning). Production default is RRF—no per-query α to fiddle with.
📌 BigCat scenario: the "have it all" config for a personal knowledge base—searching "PPO formula" lets BM25 nail the exact term, while "that note on the explore-vs-exploit tradeoff" lets the dense channel catch the meaning. RRF fusion makes one search box serve both intents.
Takeaway + question
💡 Hybrid search is retrieval's "ensemble learning"—RRF elegantly dodges the scale problem with the single trick of "compare ranks only", needing almost no tuning.
🤔 When the two channels give totally conflicting top-1 results, is RRF's "democratic vote" always right? When should you trust one channel more?
HNSW Vector IndexHierarchical Navigable Small World
approx. nearest neighborANN
One-line analogy
Dense retrieval says "find nearest neighbors"—but with millions or billions of vectors, brute-forcing every distance per query is a full table scan, O(n)—unusable. HNSW builds a skip list over the vector space: high layers are sparse and jump far; low layers are dense and step short. You know Redis sorted-set skip lists cold—HNSW just moves the skip list from a 1-D number line to a graph in high-dimensional space, cutting nearest-neighbor search from O(n) to roughly O(log n).
What it solves + how it works
Exact nearest neighbor has no shortcut in high dimensions, so you approximate. These algorithms are called ANN (Approximate Nearest Neighbor)—sacrifice a little recall for orders-of-magnitude speed. HNSW, from Malkov & Yashunin (arXiv 1603.09320), is the index of choice in today's vector stores (Faiss, Milvus, pgvector, Qdrant). Two core ideas:
- Small World graph—each node links to several near neighbors but also keeps a few "long edges", so any two points are a few hops apart (the six-degrees-of-separation intuition);
- Hierarchical layers—the top layer has very few nodes with very long edges, like highways; lower layers get denser with shorter edges, like city streets. A query greedily walks the top layer to a local nearest, then descends a layer to refine, closing in step by step.
HNSW layered navigation (skip-list analogy)
L2 (sparse·long) ●─────────●
↓ descend
L1 (medium) ●──●──●──●
↓ descend
L0 (full·short) ●─●─●─★─● ★=nearest
coarse locate at top → refine layer by layer → lock the nearest at the bottom, total hops ≈ log n
Three key parameters are the recall vs speed/memory knobs: M = edges per node (larger → higher recall, more memory); ef_construction = candidate breadth at build time (larger → better graph, slower build); ef_search = candidate breadth at query time (tunable live at query time, larger → higher recall, slower). In production, ef_search is the knob you turn most.
Code example
import hnswlib, numpy as np # pip install hnswlib
dim, n = 384, 10000
data = np.random.rand(n, dim).astype("float32")
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16) # build graph
index.add_items(data)
index.set_ef(50) # ef_search: recall/speed knob, tunable anytime
q = np.random.rand(dim).astype("float32")
labels, dists = index.knn_query(q, k=5) # approx top-5, ~O(log n)
print(labels) # approximate — may miss the true #1, that's ANN's nature
Pitfall + practical scenario
"The vector store returns the exact nearest neighbors"—wrong. HNSW is approximate; recall is almost never 100%—it can miss the truly nearest point. Turning ef_search up raises recall and latency together—a deliberate trade-off knob, not a bug. When you need 100% exactness (dedup, compliance), use brute-force exact search, or add a re-rank pass over HNSW results as a safety net.
📌 BigCat scenario: once a personal knowledge base hits tens of thousands of entries, pgvector / Qdrant use HNSW by default. Understand the ef_search knob and you can tune between "fast search" and "don't miss important notes" on demand—instead of letting a default value quietly tank your recall.
Takeaway + question
💡 HNSW = a skip list in high-dimensional space, using "layers + small-world graph" to compress nearest-neighbor search to O(log n), at the cost of giving up 100% exactness.
🤔 Would you accept missing 1% of true nearest neighbors for a 50× retrieval speedup? Who should set that recall/latency tradeoff, and by what standard?
Engineering counterpart → super-individual D11 (RAG retrieval in practice)