Day 31 Hard Search BM25 + Vector Reranking Multi-stage

Hybrid Search & Reranking — Stitching BM25 and Vectors into a Multi-stage FunnelBM25 + Vector, Multi-stage Retrieval, RRF, Cross-encoder, Snippets

Scenario + Requirements

Design search for a 50M-SKU e-commerce catalog: it must hit on iPhone 15 Pro 256GB (an exact model/SKU) AND on "lightweight warm jacket good for camping" (a natural-language intent). Pure keyword search returns zero for the second class (the user never typed "jacket"); pure vector search dilutes the first by mixing in "iPhone 14" and the base iPhone 15, losing precision.

Key numbers: peak 5k QPS, p99 < 150ms, top-24 per query, 50M docs, 768-dim vectors. Constraints: recall must not drop (a miss = the user can't buy = lost GMV directly), but the ranking budget is limited (cross-encoders are expensive). This forces a multi-stage design — cheap retrieval widens the net, expensive ranking narrows it.

High-level Architecture

graph LR
    Q["Query
spell/tokenize/rewrite"] BM["BM25 retrieval
inverted index · exact"] VEC["Vector retrieval
HNSW ANN · semantic"] FUSE["Fusion
RRF / weighted"] RANK["Rerank
cross-encoder / GBDT"] BIZ["Business rerank
price/stock/ads/personalize"] SNIP["Snippet + highlight
query-biased"] R["Top-24"] Q --> BM --> FUSE Q --> VEC --> FUSE FUSE -->|top-200| RANK -->|top-50| BIZ --> SNIP --> R classDef q fill:#1a2530,stroke:#64c8ff,color:#e8eef5 classDef rec fill:#0e2030,stroke:#5eead4,color:#e8eef5 classDef rank fill:#1a1a30,stroke:#ffb450,color:#e8eef5 classDef out fill:#2a1530,stroke:#ff7ab6,color:#e8eef5 class Q q class BM,VEC,FUSE rec class RANK,BIZ rank class SNIP,R out

A funnel: retrieval pulls hundreds (high recall, cheap) → reranking trims to dozens (high precision, costly) → business rerank sets the final order

The core idea: split "find everything" from "rank correctly" into different stages. The retrieval layer runs two cheap channels in parallel (keyword + semantic), then fuses and dedups. The ranking layer runs the expensive model only on a few hundred candidates. Finally, business signals (price, stock, ads) are layered on. Each stage's latency budget and candidate count shrink strictly.

Key Technical Points

1. Hybrid retrieval: BM25 vs dense vectors, why you need both

Core trade-off: lexical precision vs semantic generalization — their blind spots are exactly complementary.

Principle: BM25 (Okapi BM25, Robertson/Spärck Jones) is the probabilistic refinement of TF-IDF: it scores by term frequency + inverse document frequency, normalizes for document length, and saturates frequent terms (a word appearing 10× isn't worth 2× of 5×). It runs on an inverted index and is unbeatable on exact terms, rare tokens (SKUs, models, names), and boolean/negation logic. A dense vector encodes query and doc each into a 768-dim vector (bi-encoder, e.g. a BERT-style dual encoder) and takes cosine similarity, capturing semantics — "laptop" matches "notebook computer", "cheap" matches "affordable". But it is insensitive to exact tokens and will "confidently fabricate relevance".

Trade-off:
# Hybrid retrieval: two channels in parallel, then fuse
def retrieve(q):
    sparse = bm25_index.search(q.text, k=100)        # inverted, exact terms
    dense  = ann_index.search(embed(q.text), k=100)  # HNSW, semantic
    return fuse(sparse, dense)        # see point 3
Real-world:

2. Multi-stage retrieval: the recall / coarse / fine funnel

Core trade-off: each stage trades "candidate count" against "per-item evaluation cost" in opposite directions.

Principle: you can't run an expensive model over 50M docs. Multi-stage builds retrieval as a funnel: recall pulls a few hundred from the full corpus aiming for "don't miss", using the cheapest BM25/ANN; fine ranking (precision) runs the heavy model only on those hundreds, aiming for "rank correctly". The key is that per-stage cost × candidate count ≈ constant — recall is near-free per item but high in volume; fine ranking is costly per item but low in volume; their product must fit the latency budget. ANN (e.g. HNSW, hierarchical navigable small world graphs) is the canonical "trade accuracy for speed" of the recall stage: give up exact KNN for sub-ms logarithmic lookup.

Trade-off (recall-layer ANN algorithms):
# Funnel: candidate count shrinks, per-item cost grows
cand = hybrid_retrieve(q, k=200)        # ~sub-ms/item, want all
cand = rerank_cross_encoder(q, cand)[:50]  # ~ms/item, want precise
final = business_rerank(cand)[:24]      # price/stock/ads/personalize
Real-world:

3. Fusion & reranking: RRF vs weighted, bi-encoder vs cross-encoder

Core trade-off: fusion solves "two score scales aren't comparable", reranking solves "the dual-encoder isn't precise enough", but cross-encoders are two orders of magnitude slower.

Principle: BM25 scores (unbounded) and cosine similarity (0–1) have different scales — a naïve weighted average gets dominated by one side. RRF (Reciprocal Rank Fusion, Cormack 2009) looks at rank, not score: doc score = Σ 1/(k + rank), with k typically 60, naturally solving the incomparable-scale problem with no normalization. After fusion comes ranking: a bi-encoder (the dual tower used for recall) encodes query and doc independently and can build an offline index; a cross-encoder concatenates query+doc into one model so their tokens interact fully — far more accurate, but must be computed online and can't be pre-stored. That's exactly why it lives in fine ranking (candidates already trimmed to dozens).

Bi-encoder (dual tower)Cross-encoder
Encodingquery / doc encoded independentlyquery+doc encoded jointly
Pre-indexable?Yes (doc vectors offline)No (must compute online)
AccuracyMediumHigh (token-level interaction)
CostLow (one vector compare)High (one forward per pair)
Used inRecall (millions)Fine ranking (dozens–hundreds)
# RRF: uses rank only, no normalization needed
def rrf(lists, k=60):
    score = defaultdict(float)
    for ranked in lists:                 # each channel's ordered results
        for rank, doc in enumerate(ranked):
            score[doc] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)

# Fine ranking: cross-encoder computed online (only on fused top-N)
scores = cross_encoder.predict([(q, d.text) for d in candidates])
Fusion trade-off: RRF is robust and tuning-free, insensitive to outlier scores, but discards score magnitude (a strong and a weak match at the same rank are treated equally); weighted linear (α·norm(bm25)+(1-α)·cos) keeps magnitude and is tunable, but needs score normalization and a per-query-type α, and overfits easily. In production RRF is the common default; high-value verticals add weighting or learned fusion on top.
Real-world:

4. Snippet generation & highlighting: making results "look relevant"

Core trade-off: what you show is the explanation, not the retrieval — pick the wrong passage and users won't click, however good the ranking.

Principle: a snippet is query-biased summarization: not the first two sentences, but the window that best reflects "relevance to this query". The classic approach slides a fixed-length window, scores it by query-term hits inside, term rarity (high-IDF terms weigh more), and term proximity, takes the best window, and wraps hit terms in <em>. The hard part: highlighting must land on the rendered source offsets (positions shift after tokenization, stemming, Unicode normalization) — misalignment highlights the wrong characters.

# query-biased snippet: pick the densest, rarest-term window
def best_snippet(doc_tokens, q_terms, W=160):
    qset, best, lo = set(q_terms), (-1, 0), 0
    for i in range(0, len(doc_tokens)-W, 20):
        win = doc_tokens[i:i+W]
        # hit count + IDF weighting + coverage of distinct query terms
        s = sum(idf(t) for t in win if t in qset) \
            + 2*len({t for t in win if t in qset})
        if s > best[0]: best, lo = (s, i), i
    return highlight(doc_tokens[lo:lo+W], qset)  # wrap hits in <em>
Trade-off: lexical highlighting (bold the matched terms) is cheap and intuitive for exact queries, but a semantically-retrieved result may have no query term in the passage at all (it came from the vector channel), looking irrelevant; extractive LLM summarization can explain semantic relevance, but is costly, adds latency, and may hallucinate. A compromise: exact hits get highlighting, semantic hits get a short extractive summary.
Real-world: Google/Bing result snippets have long used query-biased extraction + highlighting; modern RAG systems feed the snippet straight to an LLM as "cited context", turning highlights into citation anchors.

Scaling & Optimization

Common Pitfalls + Interview Questions

1. Directly averaging BM25 score and cosine similarity. Incomparable scales, dominated by the unbounded BM25. Either RRF (rank only), or normalize per query before weighting.
2. Using a cross-encoder for recall. It can't be pre-indexed and must be computed online; running it over the full corpus = latency explosion. Use it only after candidates shrink to dozens–hundreds.
3. Watching only NDCG/precision, not recall. A recall miss is unrecoverable in fine ranking — a doc never retrieved can't be ranked, however strong the ranker. Recall@k at the recall layer is the ceiling.
4. Highlight offset misalignment. Positions drift after tokenization/stemming/Unicode normalization (CJK especially); always remap onto the raw byte/character string.

Frequent interview follow-ups:

  1. Why not vector search alone? Give a query that pure vector must miss but BM25 hits, and vice versa.
  2. What does RRF's k=60 mean? Larger vs smaller — does it favor the head or the tail?
  3. What's the essential difference between bi-encoder and cross-encoder? Why can one be indexed and the other not?
  4. 50M docs, 768-dim vectors — roughly how much memory does an HNSW index take? How do you estimate?
  5. recall@200 is only 85% — no ranker can recover the missing 15%. How do you improve recall?

Deep-dive Resources

Going Deeper (click to expand)

1. recall@200=85%, fine-ranking NDCG is high, but GMV doesn't move. Where's the bottleneck? How to locate it?

The bottleneck is almost certainly recall, not ranking. Fine ranking only reorders the 200 retrieved — the missing 15% never enter the candidate set, and no ranker can help. High NDCG only says "what was retrieved is ordered well", masking "what wasn't retrieved is invisible".

How to locate: ① offline, compute the recall@k ceiling using human labels/click logs — feed known-relevant docs into retrieval and see if they enter candidates; ② break down by query type — likely semantic/long-tail queries recall poorly (vectors undercover) or exact queries get crowded out by vector noise; ③ A/B raise k from 200 to 500 — if GMV rises, recall is the bottleneck. Fix: widen retrieval channels, raise ANN's efSearch, add SPLADE — not keep optimizing ranking. Core insight: in a multi-stage funnel, upstream recall is the ceiling for everything downstream.

2. Tune RRF's k to 1 or 1000 — how does ranking behavior change?

RRF score = Σ 1/(k+rank). k controls whether rank differences are amplified or flattened.

Small k (e.g. 1): 1/(1+0)=1, 1/(1+1)=0.5 — huge gaps between top ranks, extremely biased toward each channel's rank-1, almost "whoever's first wins"; sensitive to the head, the tail barely contributes. Large k (e.g. 1000): 1/1000 vs 1/1001 are nearly equal — all ranks flatten, rank info diluted, degenerating into "in how many channels did it appear" — favoring cross-channel consensus over a single channel's high rank.

k=60 is the empirical compromise: it preserves head discrimination while letting mid-ranked docs (decent in both channels) float up on consensus — exactly the "what both sides find relevant is more trustworthy" that hybrid search wants. Tuning: shrink k to emphasize single-channel strong signals, grow k to emphasize cross-channel agreement.

3. 50M docs, 768-dim float32 vectors, HNSW — roughly how much memory? How to estimate?

The vectors themselves: 768 dims × 4 bytes = 3072 B ≈ 3 KB/doc. 5e7 × 3 KB ≈ 150 GB.

The HNSW graph: each node stores M neighbor pointers per layer; with a common M=16, each neighbor ID ~4–8 bytes, and with multi-layer overhead the graph empirically adds 20%–60% on top of the vectors. At ~50% that's about 75 GB.

Total ~200–230 GB — too much/too costly for one box. That's why large scale either shards (split by shard across machines, scatter-gather queries), uses PQ/quantization to compress 3 KB to a few hundred bytes (IVF-PQ compresses 8–32× at the cost of recall), or uses disk-based ANN (DiskANN, graph on SSD). Interviewers want to see you do order-of-magnitude estimation and choose architecture from it, not recite the HNSW formula.

4. Why is a cross-encoder accurate yet unusable for recall? What computational structure fundamentally limits it?

It comes down to whether you can precompute and amortize. A bi-encoder encodes query and doc independently into vectors, so doc vectors can be computed offline and stored in an ANN index; online you only do vector compares — the doc-side cost is amortized once. A cross-encoder concatenates query+doc and runs the whole thing through the model, letting their tokens interact pairwise in attention (precisely the source of its accuracy) — but that means the result depends on the specific query and can't be pre-stored; every (query, doc) pair needs a full online forward pass.

For 50M docs, recall would mean "query × full corpus" = 50M forward passes, blowing past any second-scale budget. So it only fits the fine-ranking stage where candidates are already dozens–hundreds. Insight: more interaction means higher accuracy, but interaction makes the result un-precomputable — accuracy and amortizability are a fundamental tension, and multi-stage architecture is exactly how we divide the labor. ColBERT's "late interaction" aims for a middle ground.

5. Vector retrieval "confidently retrieves irrelevant results" — it always returns nearest neighbors even when nothing truly relevant exists. How to defend?

This is a structural flaw of dense retrieval: ANN always finds the k "nearest", but "nearest" ≠ "relevant". When a query lands in a sparse region of embedding space, the nearest neighbors may be far yet still returned — manifesting as "search for something the corpus doesn't have, and you still get a pile of unrelated results".

Defenses: ① a similarity threshold — drop anything below it, returning few or empty rather than noise (but the threshold drifts per query, hard to unify); ② BM25 as a backstop/check — in hybrid, if a vector-retrieved doc contains zero query terms and has a very low BM25 score, downweight or drop it; ③ cross-encoder as gatekeeper — it gives an absolute relevance score; filter low scores; ④ OOD/confidence detection — monitor the distribution of query-to-nearest-neighbor distances and treat abnormally far ones as "no good result", falling back to a message. Key realization: vector retrieval gives a relative ordering, not an absolute relevance judgment — absolute judgments need thresholds or a cross-encoder.