AI/ML Deep Dive: Semantic Search

Day 22 · 2026-06-08
For: engineers with coding experience, not specialized in AI
Engineering counterpart → super-individual D11: RAG in Practice (retrieval quality, chunking, re-ranking)

BM25 Lexical RetrievalBM25 / Lexical Retrieval

sparse retrievalstatistical scoring
One-line analogy

BM25 is that default full-text scoring function inside Elasticsearch / Lucene—you've been using it for years. Think of it as a relevance formula sitting on top of a database inverted index: given the words in a query, it scores each document by "how well it matches" and ranks them. It's the statistical upgrade of SQL LIKE—not just "does it contain the keyword" but "how well does it contain it".

What it solves + how it works

The most naive relevance score is just counting keyword occurrences (term frequency, TF). That has two flaws: (1) common words like "the"/"is" appear constantly but carry no signal; (2) long documents naturally have more words and get an unfair edge. BM25 fixes exactly these:

  • IDF (inverse document frequency)—the rarer a word is across documents, the more discriminative it is, so it gets higher weight. "quantum entanglement" is worth more than "system";
  • Term-frequency saturation—going from 1 to 10 occurrences matters a lot; 10 to 100 barely matters. BM25 lets the score saturate with frequency rather than grow linearly;
  • Length normalization—discounts by relative document length so long docs can't win just by being wordy.

The core scoring formula (summed over each query term qi):

score(D,Q) = Σi  IDF(qi) · f(qi,D)·(k₁+1) / [ f(qi,D) + k₁·(1 − b + b·|D|/avgdl) ]

In plain words: IDF(qi) = how rare the word is (rare words weigh more); f(qi,D) = how many times it appears in the doc; k₁ (typically 1.2–2.0) = how fast term frequency saturates, smaller saturates faster; b (typically 0.75) = strength of length penalty, 0 ignores length, 1 penalizes fully; |D|/avgdl = is this doc longer or shorter than average. The whole denominator is "term frequency + length penalty", keeping the numerator from inflating forever.

Code example
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "study of feline sleep cycles",
    "how to configure a Postgres connection pool",
    "cache consistency in distributed systems",
]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)  # builds inverted index + computes IDF

query = "cache consistency".split()
scores = bm25.get_scores(query)  # one BM25 score per doc
print(scores)  # doc 3 scores highest — exact term match
Pitfall + practical scenario
"Now that we have vector embeddings, BM25 is obsolete"—badly wrong. BM25 still reigns at exact matching: product SKUs (iPhone 15 Pro), names, error codes, code identifiers, rare proper nouns—exactly where embeddings tend to "drift" into near-synonyms. Most 2024–2025 retrieval systems keep BM25 as one channel rather than dropping it.
📌 BigCat scenario: when building personal knowledge-base search, for code snippets, API names, paper titles—where "the literal string IS the answer"—BM25 is often more accurate, cheaper, and needs zero GPU. Don't reach for vectors-only on day one.
Takeaway + question
💡 BM25 isn't "old-fashioned keyword search"—it's a statistical relevance function tuned over decades. On exact-match it still can't be beaten.
🤔 How much of your search need is really "find an exact word/ID" versus "find a semantically related concept"?

Dense RetrievalDense Retrieval

vector retrievalsemantics
One-line analogy

Encode each piece of text with a neural network into a single fixed-dimension vector (embedding), and retrieval becomes "find nearest neighbors in a high-dimensional space". Analogy: geohash—geographically close points get close codes. Dense retrieval swaps "geographically close" for "semantically close". "car", "automobile", "vehicle" cluster together even with zero overlapping characters.

What it solves + how it works

BM25's Achilles heel is vocabulary mismatch: the query says "how to reduce latency", the doc says "optimizing response time"—zero literal overlap → BM25 misses it. Dense retrieval encodes meaning into vectors, sidestepping the literal surface. The mechanism is a bi-encoder (dual encoder):

  • query and document each pass through an encoder, yielding two vectors;
  • score with dot product or cosine similarity: sim(q,d) = Eq(q) · Ed(d), larger = more relevant;
  • document vectors can be precomputed offline into a vector store; at query time you only encode the query—this is what lets it scale to billions of docs.

How is the encoder trained? Contrastive learning: given a "question–correct passage" pair, pull their vectors together; push other passages in the batch away as negatives (in-batch negatives). Karpukhin et al.'s DPR paper (arXiv 2004.04906, 2020) showed that on open-domain QA, dense retrieval's top-20 accuracy beats Lucene-BM25 by 9–19 percentage points—the first time semantic retrieval systematically beat lexical at scale.

Bi-encoder: query and doc encoded separately, indexed offline

queryEncoder_q [0.2, -0.7, ...]
docs (millions)Encoder_d vector store (precomputed)
at query time: only encode the query → nearest-neighbor search → top-k passages
Code example
from sentence_transformers import SentenceTransformer, util
# Sentence-BERT: encode sentences into comparable vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["several ways to optimize database response time",
          "the sleeping habits of cats"]
doc_emb = model.encode(corpus, convert_to_tensor=True)  # precompute

q_emb = model.encode("how to reduce system latency", convert_to_tensor=True)
hits = util.cos_sim(q_emb, doc_emb)[0]  # cosine similarity
print(hits)  # doc 1 scores high — zero literal overlap, semantic hit
Pitfall + practical scenario
"Dense retrieval always beats BM25"—wrong. It has three soft spots: (1) out-of-domain—domains the encoder never saw (medicine, law, your company jargon) collapse in recall; (2) exact match—SKUs and IDs get "understood" as near-synonyms; (3) long-tail rare entities. So the industry rarely uses dense retrieval alone—the next card shows how to fuse it with BM25.
📌 BigCat scenario: the killer feature for cross-disciplinary notes—search "Buddhist impermanence" and hit a note you wrote three months ago saying "distributed systems have no permanent state". That cross-domain semantic association is dense retrieval's home turf; BM25 simply can't do it.
Takeaway + question
💡 Dense retrieval turns "find the word" into "find the meaning"—its strength (semantic generalization) and weakness (exact match) are exactly complementary to BM25.
🤔 If the encoder never saw your specialty domain, is its "meaning" still trustworthy? What does that imply about when to fine-tune embeddings?

Hybrid SearchHybrid Search

fusion rankingRRF
One-line analogy

Since BM25 excels at exact lexical match and dense retrieval at semantics, don't pick one—retrieve with both and fuse the rankings. This is the ensemble learning idea, much like a database query planner combining multiple indexes, or a recommender's multi-channel recall + fusion. One query takes two paths, and their two rankings are merged into one.

What it solves + how it works

Every single-channel retriever has blind spots: BM25 misses rephrased queries, dense misses rare proper nouns. Hybrid runs both; the hard part is how to fuse. Two mainstream approaches:

  • ① Weighted scores: final = α·BM25 + (1−α)·cosine. Problem: BM25 scores can range 0–30, cosine is 0–1—totally different scales, so naive addition lets BM25 crush cosine. You must normalize, but normalization is brittle to outliers.
  • ② RRF (Reciprocal Rank Fusion): uses only ranks, not raw scores, sidestepping the scale problem at the root.

RRF's formula is dead simple—each document d's fused score:

RRF(d) = Σretriever i   1 / ( k + ranki(d) )

Intuition: the higher a doc ranks (smaller rank) in a channel, the larger 1/(k+rank), the more it contributes; docs ranked high in both channels have their scores add up and naturally surface. k (usually 60) is a smoothing constant—it stops the rank-1 doc from utterly crushing ranks 2 and 3, giving lower ranks a voice too. Cormack et al. 2009 (SIGIR) proved this almost embarrassingly simple formula consistently beats each single channel and even the more complex Condorcet fusion.

Hybrid search flow

query
├→ BM25 recall ranking A
└→ dense recall ranking B
        ↓ RRF fusion (ranks only)
        final top-k
Code example
def rrf(rank_lists, k=60):
    # rank_lists: each channel's doc_id list, sorted by relevance desc
    scores = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking):  # rank starts at 0
            # ranks only, never touch raw scores -> scale-free
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]   # BM25 channel ranking
dense_top = ["d1", "d3", "d9"]  # dense channel ranking
print(rrf([bm25_top, dense_top]))  # d1/d3 high in both -> surface
Pitfall + practical scenario
"Just add the BM25 score and the cosine score together"—wrong. They differ by an order of magnitude; adding them means you effectively use only BM25. Either normalize rigorously (brittle), or use RRF on ranks only (robust, near zero tuning). Production default is RRF—no per-query α to fiddle with.
📌 BigCat scenario: the "have it all" config for a personal knowledge base—searching "PPO formula" lets BM25 nail the exact term, while "that note on the explore-vs-exploit tradeoff" lets the dense channel catch the meaning. RRF fusion makes one search box serve both intents.
Takeaway + question
💡 Hybrid search is retrieval's "ensemble learning"—RRF elegantly dodges the scale problem with the single trick of "compare ranks only", needing almost no tuning.
🤔 When the two channels give totally conflicting top-1 results, is RRF's "democratic vote" always right? When should you trust one channel more?

HNSW Vector IndexHierarchical Navigable Small World

approx. nearest neighborANN
One-line analogy

Dense retrieval says "find nearest neighbors"—but with millions or billions of vectors, brute-forcing every distance per query is a full table scan, O(n)—unusable. HNSW builds a skip list over the vector space: high layers are sparse and jump far; low layers are dense and step short. You know Redis sorted-set skip lists cold—HNSW just moves the skip list from a 1-D number line to a graph in high-dimensional space, cutting nearest-neighbor search from O(n) to roughly O(log n).

What it solves + how it works

Exact nearest neighbor has no shortcut in high dimensions, so you approximate. These algorithms are called ANN (Approximate Nearest Neighbor)sacrifice a little recall for orders-of-magnitude speed. HNSW, from Malkov & Yashunin (arXiv 1603.09320), is the index of choice in today's vector stores (Faiss, Milvus, pgvector, Qdrant). Two core ideas:

  • Small World graph—each node links to several near neighbors but also keeps a few "long edges", so any two points are a few hops apart (the six-degrees-of-separation intuition);
  • Hierarchical layers—the top layer has very few nodes with very long edges, like highways; lower layers get denser with shorter edges, like city streets. A query greedily walks the top layer to a local nearest, then descends a layer to refine, closing in step by step.
HNSW layered navigation (skip-list analogy)

L2 (sparse·long)  ─────────
                ↓ descend
L1 (medium)     ──────
                  ↓ descend
L0 (full·short)  ★=nearest
coarse locate at top → refine layer by layer → lock the nearest at the bottom, total hops ≈ log n

Three key parameters are the recall vs speed/memory knobs: M = edges per node (larger → higher recall, more memory); ef_construction = candidate breadth at build time (larger → better graph, slower build); ef_search = candidate breadth at query time (tunable live at query time, larger → higher recall, slower). In production, ef_search is the knob you turn most.

Code example
import hnswlib, numpy as np  # pip install hnswlib

dim, n = 384, 10000
data = np.random.rand(n, dim).astype("float32")

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build graph
index.add_items(data)

index.set_ef(50)  # ef_search: recall/speed knob, tunable anytime
q = np.random.rand(dim).astype("float32")
labels, dists = index.knn_query(q, k=5)  # approx top-5, ~O(log n)
print(labels)  # approximate — may miss the true #1, that's ANN's nature
Pitfall + practical scenario
"The vector store returns the exact nearest neighbors"—wrong. HNSW is approximate; recall is almost never 100%—it can miss the truly nearest point. Turning ef_search up raises recall and latency together—a deliberate trade-off knob, not a bug. When you need 100% exactness (dedup, compliance), use brute-force exact search, or add a re-rank pass over HNSW results as a safety net.
📌 BigCat scenario: once a personal knowledge base hits tens of thousands of entries, pgvector / Qdrant use HNSW by default. Understand the ef_search knob and you can tune between "fast search" and "don't miss important notes" on demand—instead of letting a default value quietly tank your recall.
Takeaway + question
💡 HNSW = a skip list in high-dimensional space, using "layers + small-world graph" to compress nearest-neighbor search to O(log n), at the cost of giving up 100% exactness.
🤔 Would you accept missing 1% of true nearest neighbors for a 50× retrieval speedup? Who should set that recall/latency tradeoff, and by what standard?
Engineering counterpart → super-individual D11 (RAG retrieval in practice)

Further ReadingFurther Reading

Deep QuestionsDeep Questions

1. BM25 vs dense retrieval is essentially "symbolic matching" vs "distributed representation". How does this echo AI's historical symbolic vs connectionist debate?
It's a beautiful miniature of it. BM25 is the symbolic camp—words are discrete symbols, relevance comes from symbol statistics (term/document frequency), interpretable, tunable, zero training, but brittle on "same meaning, different word". Dense retrieval is the connectionist camp—meaning is smeared across the vector's dimensions (distributed representation), strong generalization but a black box; you can't say what dimension 137 means. These two camps fought for decades, and the retrieval field's answer after 2020 is precisely don't pick sides—fuse—RRF lets the symbolic camp's precision and the connectionist camp's generalization each contribute. This is a more general engineering wisdom: when two paradigms each have irreplaceable strengths, fusion often beats purity. BigCat, your distributed-systems background should resonate—under CAP there's no silver bullet; the final architecture is always a context-driven trade-off composition, not the pursuit of one theoretical optimum.
2. Dense retrieval compresses meaning into a fixed-dimension vector—that compression is necessarily lossy. What information gets thrown away here, and which failure modes does that explain?
Compressing a whole passage into a 384- or 1024-dim vector is radically lossy—it must drop things. Mainly lost: (1) exact literal tokens—"GPT-4o" and "GPT-4" may nearly coincide in vector space because the model captures "both are OpenAI models" rather than the literal difference; this directly explains exact-match failures; (2) compositional structure—a bi-encoder squashes a sentence to one point, so "cat chases dog" and "dog chases cat" can be close because the bag-of-words semantics match; (3) long-tail rare info—entities the encoder rarely saw in training get poor representations. This is exactly why late-interaction methods exist (e.g. ColBERT, keeping one vector per token and doing fine-grained interaction at query time)—they trade more storage to recover the token-level info that compression discarded. Once you internalize "vector = lossy compression", you can predict when it'll fail: any scenario where literal precision and fine-grained structure matter is dangerous for single-vector dense retrieval.
3. HNSW trades "approximate" for speed. In which scenarios is a 1% recall loss catastrophic and ANN is off the table? Conversely, where is exact search over-engineering?
It comes down to the cost of missing one result. Where ANN must not be used (or must have an exact re-rank net): (a) dedup / entity resolution—missing the true duplicate lets dirty data into the store and everything downstream breaks; (b) compliance / safety—e.g. checking "does this text match a known prohibited-content set", missing one is an incident; (c) financial reconciliation / uniqueness checks—any "must find exactly that one" scenario. Conversely, exact search is over-engineering when: (a) RAG candidate passages for an LLM—you recall top-20 and let the LLM filter anyway, so missing #8 is usually harmless; (b) recommendation / exploration—there's no single "correct" answer; diversity may beat exactness; (c) massive low-value data—paying 50× compute for 1% recall isn't worth it. The decision rule is plain: (miss probability × cost per miss) vs (extra cost of exact search). BigCat, your personal knowledge base is the "RAG candidate" class—HNSW defaults suffice; no need for exact search.
4. RRF uses k=60, a "magic constant". What is it actually tuning? If you set k=1 or k=1000, how does fusion behavior change?
k tunes "how much voice the top ranks get". Look at 1/(k+rank): when k is tiny (say 1), the rank-0 doc contributes 1/1=1, rank-1 contributes 1/2—the head gap is blown up enormously, and fusion is dominated by each channel's top-1, leaving lower ranks almost no chance. When k is large (say 1000), rank-0 contributes 1/1000 and rank-1 contributes 1/1001—nearly equal, so rank differences are flattened and fusion degrades to roughly "count how many channels a doc appears in" (approaching vote counting). k=60 is the compromise found empirically in Cormack's paper: it gives the head a reasonable edge without letting one channel's top-1 veto another. It's a classic smoothing hyperparameter—same family as Laplace smoothing or the epsilon in a learning rate: a constant to prevent "divide by a tiny number" blowups, balancing "trust the head" against "respect the tail". The meta-lesson worth keeping: many engineering "magic constants" are just the empirical default of some trade-off knob—knowing what it tunes matters more than memorizing the number.
5. Stitch today's four concepts together: in a production semantic-search system, how do data and queries flow? Where is each stage's bottleneck?
Stitched into one pipeline: ① Offline indexing—chunk documents, send one path through BM25 to build an inverted index, send another through the embedding model to encode into vectors and load them into the HNSW graph. Bottleneck: embedding compute (millions of docs need batched GPU) + HNSW build time. ② Online query—the query takes both paths: BM25 inverted-index lookup for top-N, and vector encoding + HNSW for approximate top-N. Bottleneck: the dense path must encode the query live (latency), and HNSW's ef_search governs recall/latency. ③ Fusion—RRF merges the two rankings into one. Almost no bottleneck; RRF is extremely light. ④ Re-ranking (optional)—run a cross-encoder or LLM re-ranker over the fused top-k (this is engineering practice; see super-individual D11). ⑤ Feed the LLM—top passages enter the context. The core trade-off across the chain: recall breadth (how many per channel) × per-stage latency × compute cost. Understand today's four puzzle pieces and you can read the architecture diagram of any vector store / RAG framework—they're all different packagings of these five steps. The next layer—"how to chunk, which re-ranker, how to evaluate retrieval quality"—is super-individual D11's engineering turf.