Day 31 Hard Search BM25 + Vector Reranking Multi-stage

Hybrid Search & Reranking — Stitching BM25 and Vectors into a Multi-stage FunnelBM25 + Vector, Multi-stage Retrieval, RRF, Cross-encoder, Snippets

Scenario + Requirements

Design search for a 50M-SKU e-commerce catalog: it must hit on iPhone 15 Pro 256GB (an exact model/SKU) AND on "lightweight warm jacket good for camping" (a natural-language intent). Pure keyword search returns zero for the second class (the user never typed "jacket"); pure vector search dilutes the first by mixing in "iPhone 14" and the base iPhone 15, losing precision.

Key numbers: peak 5k QPS, p99 < 150ms, top-24 per query, 50M docs, 768-dim vectors. Constraints: recall must not drop (a miss = the user can't buy = lost GMV directly), but the ranking budget is limited (cross-encoders are expensive). This forces a multi-stage design — cheap retrieval widens the net, expensive ranking narrows it.

High-level Architecture

graph LR
    Q["Query
spell/tokenize/rewrite"]
    BM["BM25 retrieval
inverted index · exact"]
    VEC["Vector retrieval
HNSW ANN · semantic"]
    FUSE["Fusion
RRF / weighted"]
    RANK["Rerank
cross-encoder / GBDT"]
    BIZ["Business rerank
price/stock/ads/personalize"]
    SNIP["Snippet + highlight
query-biased"]
    R["Top-24"]

    Q --> BM --> FUSE
    Q --> VEC --> FUSE
    FUSE -->|top-200| RANK -->|top-50| BIZ --> SNIP --> R

    classDef q fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef rec fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef rank fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef out fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class Q q
    class BM,VEC,FUSE rec
    class RANK,BIZ rank
    class SNIP,R out

A funnel: retrieval pulls hundreds (high recall, cheap) → reranking trims to dozens (high precision, costly) → business rerank sets the final order

The core idea: split "find everything" from "rank correctly" into different stages. The retrieval layer runs two cheap channels in parallel (keyword + semantic), then fuses and dedups. The ranking layer runs the expensive model only on a few hundred candidates. Finally, business signals (price, stock, ads) are layered on. Each stage's latency budget and candidate count shrink strictly.

Key Technical Points

1. Hybrid retrieval: BM25 vs dense vectors, why you need both

Core trade-off: lexical precision vs semantic generalization — their blind spots are exactly complementary.

Principle: BM25 (Okapi BM25, Robertson/Spärck Jones) is the probabilistic refinement of TF-IDF: it scores by term frequency + inverse document frequency, normalizes for document length, and saturates frequent terms (a word appearing 10× isn't worth 2× of 5×). It runs on an inverted index and is unbeatable on exact terms, rare tokens (SKUs, models, names), and boolean/negation logic. A dense vector encodes query and doc each into a 768-dim vector (bi-encoder, e.g. a BERT-style dual encoder) and takes cosine similarity, capturing semantics — "laptop" matches "notebook computer", "cheap" matches "affordable". But it is insensitive to exact tokens and will "confidently fabricate relevance".

Trade-off:

Pure BM25: ✅ works zero-shot, interpretable, unbeatable on long-tail rare terms; ❌ vocabulary mismatch — synonyms/paraphrases simply miss.
Pure dense vector: ✅ cross-lingual, robust to rewording, understands intent; ❌ poor recall on rare entities/numbers/models, needs training/tuning, weak interpretability, ANN has approximation error.
Hybrid: ✅ complementary coverage, widely the highest recall; ❌ two indexes (inverted + ANN), double the ops, fusion needs tuning.

# Hybrid retrieval: two channels in parallel, then fuse
def retrieve(q):
    sparse = bm25_index.search(q.text, k=100)        # inverted, exact terms
    dense  = ann_index.search(embed(q.text), k=100)  # HNSW, semantic
    return fuse(sparse, dense)        # see point 3

Real-world:

Elastic: officially recommends hybrid — BM25 + kNN retrievers in parallel fused by RRF, getting "exact-term SKU/rare-term matching + natural-language semantics" together (elastic.co/elasticsearch/hybrid-search).
DoorDash: explicitly chose a "flexible hybrid system" — keyword retrieval keeps rules controllable, with LLMs layered on for doc and query understanding (careersatdoordash.com search blog).

2. Multi-stage retrieval: the recall / coarse / fine funnel

Core trade-off: each stage trades "candidate count" against "per-item evaluation cost" in opposite directions.

Principle: you can't run an expensive model over 50M docs. Multi-stage builds retrieval as a funnel: recall pulls a few hundred from the full corpus aiming for "don't miss", using the cheapest BM25/ANN; fine ranking (precision) runs the heavy model only on those hundreds, aiming for "rank correctly". The key is that per-stage cost × candidate count ≈ constant — recall is near-free per item but high in volume; fine ranking is costly per item but low in volume; their product must fit the latency budget. ANN (e.g. HNSW, hierarchical navigable small world graphs) is the canonical "trade accuracy for speed" of the recall stage: give up exact KNN for sub-ms logarithmic lookup.

Trade-off (recall-layer ANN algorithms):

HNSW: ✅ fast queries, high recall, supports incremental insert; ❌ large memory (graph resident in RAM), deletes/updates are awkward.
IVF-PQ (inverted file + product quantization): ✅ memory-thrifty, fits huge scale; ❌ lower recall, needs a trained codebook, tune nprobe.
Brute-force exact KNN: ✅ 100% recall; ❌ O(N), unusable for 50M online queries.

# Funnel: candidate count shrinks, per-item cost grows
cand = hybrid_retrieve(q, k=200)        # ~sub-ms/item, want all
cand = rerank_cross_encoder(q, cand)[:50]  # ~ms/item, want precise
final = business_rerank(cand)[:24]      # price/stock/ads/personalize

Real-world:

Vespa: phased ranking — first-phase runs a cheap function over all candidates, second-phase runs the heavy model (e.g. ColBERT MaxSim) only on the top, keeping latency bounded (docs.vespa.ai phased ranking).
DoorDash: built an in-house search engine doing multi-stage retrieval and ranking, publicly reporting ~50% latency reduction (ByteByteGo has dissected its architecture).

3. Fusion & reranking: RRF vs weighted, bi-encoder vs cross-encoder

Core trade-off: fusion solves "two score scales aren't comparable", reranking solves "the dual-encoder isn't precise enough", but cross-encoders are two orders of magnitude slower.

Principle: BM25 scores (unbounded) and cosine similarity (0–1) have different scales — a naïve weighted average gets dominated by one side. RRF (Reciprocal Rank Fusion, Cormack 2009) looks at rank, not score: doc score = Σ 1/(k + rank), with k typically 60, naturally solving the incomparable-scale problem with no normalization. After fusion comes ranking: a bi-encoder (the dual tower used for recall) encodes query and doc independently and can build an offline index; a cross-encoder concatenates query+doc into one model so their tokens interact fully — far more accurate, but must be computed online and can't be pre-stored. That's exactly why it lives in fine ranking (candidates already trimmed to dozens).

	Bi-encoder (dual tower)	Cross-encoder
Encoding	query / doc encoded independently	query+doc encoded jointly
Pre-indexable?	Yes (doc vectors offline)	No (must compute online)
Accuracy	Medium	High (token-level interaction)
Cost	Low (one vector compare)	High (one forward per pair)
Used in	Recall (millions)	Fine ranking (dozens–hundreds)

# RRF: uses rank only, no normalization needed
def rrf(lists, k=60):
    score = defaultdict(float)
    for ranked in lists:                 # each channel's ordered results
        for rank, doc in enumerate(ranked):
            score[doc] += 1.0 / (k + rank)
    return sorted(score, key=score.get, reverse=True)

# Fine ranking: cross-encoder computed online (only on fused top-N)
scores = cross_encoder.predict([(q, d.text) for d in candidates])

Fusion trade-off: RRF is robust and tuning-free, insensitive to outlier scores, but discards score magnitude (a strong and a weak match at the same rank are treated equally); weighted linear (α·norm(bm25)+(1-α)·cos) keeps magnitude and is tunable, but needs score normalization and a per-query-type α, and overfits easily. In production RRF is the common default; high-value verticals add weighting or learned fusion on top.

Real-world:

Elasticsearch: RRF is a built-in retriever, fusing BM25 and kNN lists by 1/(k+rank), documented in detail (elastic.co RRF docs).
Cohere Rerank: a productized cross-encoder — concatenates query with candidate docs and reranks by semantic relevance, auto-chunking over-long docs across multiple inferences (docs.cohere.com/docs/rerank).

4. Snippet generation & highlighting: making results "look relevant"

Core trade-off: what you show is the explanation, not the retrieval — pick the wrong passage and users won't click, however good the ranking.

Principle: a snippet is query-biased summarization: not the first two sentences, but the window that best reflects "relevance to this query". The classic approach slides a fixed-length window, scores it by query-term hits inside, term rarity (high-IDF terms weigh more), and term proximity, takes the best window, and wraps hit terms in <em>. The hard part: highlighting must land on the rendered source offsets (positions shift after tokenization, stemming, Unicode normalization) — misalignment highlights the wrong characters.

# query-biased snippet: pick the densest, rarest-term window
def best_snippet(doc_tokens, q_terms, W=160):
    qset, best, lo = set(q_terms), (-1, 0), 0
    for i in range(0, len(doc_tokens)-W, 20):
        win = doc_tokens[i:i+W]
        # hit count + IDF weighting + coverage of distinct query terms
        s = sum(idf(t) for t in win if t in qset) \
            + 2*len({t for t in win if t in qset})
        if s > best[0]: best, lo = (s, i), i
    return highlight(doc_tokens[lo:lo+W], qset)  # wrap hits in <em>

Trade-off: lexical highlighting (bold the matched terms) is cheap and intuitive for exact queries, but a semantically-retrieved result may have no query term in the passage at all (it came from the vector channel), looking irrelevant; extractive LLM summarization can explain semantic relevance, but is costly, adds latency, and may hallucinate. A compromise: exact hits get highlighting, semantic hits get a short extractive summary.

Real-world: Google/Bing result snippets have long used query-biased extraction + highlighting; modern RAG systems feed the snippet straight to an LLM as "cited context", turning highlights into citation anchors.

Scaling & Optimization

Learned fusion / learning-to-rank (LTR): train a GBDT (LambdaMART) or neural ranker on click logs to replace hand-tuned α, scoring BM25 score, vector score, CTR, price etc. as unified features.
Sparse neural retrieval (SPLADE): let a model do "term expansion" on the inverted index, getting BM25's interpretability plus neural semantics while dropping the separate ANN.
Index freshness: vectors need re-embedding to update; an edited doc must be re-embedded and the HNSW rebuilt. Use near-real-time (NRT) incremental segments + background merge, with a fast lane for hot new items.
Personalization & multi-objective: after fine ranking, layer in user profile, stock, margin, ad bids — usually as multi-objective weighting or a multi-tower model; beware letting commercial goals crush relevance.
Multi-region / cost: cross-encoder GPUs are expensive, so tier by query value — head high-value queries get the heavy reranker, long-tail queries use the fused result only.

Common Pitfalls + Interview Questions

1. Directly averaging BM25 score and cosine similarity. Incomparable scales, dominated by the unbounded BM25. Either RRF (rank only), or normalize per query before weighting.

2. Using a cross-encoder for recall. It can't be pre-indexed and must be computed online; running it over the full corpus = latency explosion. Use it only after candidates shrink to dozens–hundreds.

3. Watching only NDCG/precision, not recall. A recall miss is unrecoverable in fine ranking — a doc never retrieved can't be ranked, however strong the ranker. Recall@k at the recall layer is the ceiling.

4. Highlight offset misalignment. Positions drift after tokenization/stemming/Unicode normalization (CJK especially); always remap onto the raw byte/character string.

Frequent interview follow-ups:

Why not vector search alone? Give a query that pure vector must miss but BM25 hits, and vice versa.
What does RRF's k=60 mean? Larger vs smaller — does it favor the head or the tail?
What's the essential difference between bi-encoder and cross-encoder? Why can one be indexed and the other not?
50M docs, 768-dim vectors — roughly how much memory does an HNSW index take? How do you estimate?
recall@200 is only 85% — no ranker can recover the missing 15%. How do you improve recall?

Deep-dive Resources

Elasticsearch — Reciprocal Rank Fusion / Hybrid search: the RRF formula and official hybrid implementation (RRF docs · hybrid search).
Vespa Blog & Docs — Phased Ranking / Hybrid Search: engineering practice for multi-stage ranking and BM25+vector fusion (Vespa hybrid search).
Cohere — Rerank: productization and usage of cross-encoder reranking (docs.cohere.com/docs/rerank).
DoorDash Engineering: a hybrid system design using LLMs to improve search retrieval (careersatdoordash.com).
DDIA Ch 3 (Kleppmann): inverted indexes and storage engines — the foundation under BM25 retrieval.

Going Deeper (click to expand)

1. recall@200=85%, fine-ranking NDCG is high, but GMV doesn't move. Where's the bottleneck? How to locate it?

The bottleneck is almost certainly recall, not ranking. Fine ranking only reorders the 200 retrieved — the missing 15% never enter the candidate set, and no ranker can help. High NDCG only says "what was retrieved is ordered well", masking "what wasn't retrieved is invisible".

How to locate: ① offline, compute the recall@k ceiling using human labels/click logs — feed known-relevant docs into retrieval and see if they enter candidates; ② break down by query type — likely semantic/long-tail queries recall poorly (vectors undercover) or exact queries get crowded out by vector noise; ③ A/B raise k from 200 to 500 — if GMV rises, recall is the bottleneck. Fix: widen retrieval channels, raise ANN's efSearch, add SPLADE — not keep optimizing ranking. Core insight: in a multi-stage funnel, upstream recall is the ceiling for everything downstream.

2. Tune RRF's k to 1 or 1000 — how does ranking behavior change?

RRF score = Σ 1/(k+rank). k controls whether rank differences are amplified or flattened.

Small k (e.g. 1): 1/(1+0)=1, 1/(1+1)=0.5 — huge gaps between top ranks, extremely biased toward each channel's rank-1, almost "whoever's first wins"; sensitive to the head, the tail barely contributes. Large k (e.g. 1000): 1/1000 vs 1/1001 are nearly equal — all ranks flatten, rank info diluted, degenerating into "in how many channels did it appear" — favoring cross-channel consensus over a single channel's high rank.

k=60 is the empirical compromise: it preserves head discrimination while letting mid-ranked docs (decent in both channels) float up on consensus — exactly the "what both sides find relevant is more trustworthy" that hybrid search wants. Tuning: shrink k to emphasize single-channel strong signals, grow k to emphasize cross-channel agreement.

3. 50M docs, 768-dim float32 vectors, HNSW — roughly how much memory? How to estimate?

The vectors themselves: 768 dims × 4 bytes = 3072 B ≈ 3 KB/doc. 5e7 × 3 KB ≈ 150 GB.

The HNSW graph: each node stores M neighbor pointers per layer; with a common M=16, each neighbor ID ~4–8 bytes, and with multi-layer overhead the graph empirically adds 20%–60% on top of the vectors. At ~50% that's about 75 GB.

Total ~200–230 GB — too much/too costly for one box. That's why large scale either shards (split by shard across machines, scatter-gather queries), uses PQ/quantization to compress 3 KB to a few hundred bytes (IVF-PQ compresses 8–32× at the cost of recall), or uses disk-based ANN (DiskANN, graph on SSD). Interviewers want to see you do order-of-magnitude estimation and choose architecture from it, not recite the HNSW formula.

4. Why is a cross-encoder accurate yet unusable for recall? What computational structure fundamentally limits it?

It comes down to whether you can precompute and amortize. A bi-encoder encodes query and doc independently into vectors, so doc vectors can be computed offline and stored in an ANN index; online you only do vector compares — the doc-side cost is amortized once. A cross-encoder concatenates query+doc and runs the whole thing through the model, letting their tokens interact pairwise in attention (precisely the source of its accuracy) — but that means the result depends on the specific query and can't be pre-stored; every (query, doc) pair needs a full online forward pass.

For 50M docs, recall would mean "query × full corpus" = 50M forward passes, blowing past any second-scale budget. So it only fits the fine-ranking stage where candidates are already dozens–hundreds. Insight: more interaction means higher accuracy, but interaction makes the result un-precomputable — accuracy and amortizability are a fundamental tension, and multi-stage architecture is exactly how we divide the labor. ColBERT's "late interaction" aims for a middle ground.

5. Vector retrieval "confidently retrieves irrelevant results" — it always returns nearest neighbors even when nothing truly relevant exists. How to defend?

This is a structural flaw of dense retrieval: ANN always finds the k "nearest", but "nearest" ≠ "relevant". When a query lands in a sparse region of embedding space, the nearest neighbors may be far yet still returned — manifesting as "search for something the corpus doesn't have, and you still get a pile of unrelated results".

Defenses: ① a similarity threshold — drop anything below it, returning few or empty rather than noise (but the threshold drifts per query, hard to unify); ② BM25 as a backstop/check — in hybrid, if a vector-retrieved doc contains zero query terms and has a very low BM25 score, downweight or drop it; ③ cross-encoder as gatekeeper — it gives an absolute relevance score; filter low scores; ④ OOD/confidence detection — monitor the distribution of query-to-nearest-neighbor distances and treat abnormally far ones as "no good result", falling back to a message. Key realization: vector retrieval gives a relative ordering, not an absolute relevance judgment — absolute judgments need thresholds or a cross-encoder.

← Back to index