Design search for a 50M-SKU e-commerce catalog: it must hit on iPhone 15 Pro 256GB (an exact model/SKU) AND on "lightweight warm jacket good for camping" (a natural-language intent). Pure keyword search returns zero for the second class (the user never typed "jacket"); pure vector search dilutes the first by mixing in "iPhone 14" and the base iPhone 15, losing precision.
Key numbers: peak 5k QPS, p99 < 150ms, top-24 per query, 50M docs, 768-dim vectors. Constraints: recall must not drop (a miss = the user can't buy = lost GMV directly), but the ranking budget is limited (cross-encoders are expensive). This forces a multi-stage design — cheap retrieval widens the net, expensive ranking narrows it.
graph LR
Q["Query
spell/tokenize/rewrite"]
BM["BM25 retrieval
inverted index · exact"]
VEC["Vector retrieval
HNSW ANN · semantic"]
FUSE["Fusion
RRF / weighted"]
RANK["Rerank
cross-encoder / GBDT"]
BIZ["Business rerank
price/stock/ads/personalize"]
SNIP["Snippet + highlight
query-biased"]
R["Top-24"]
Q --> BM --> FUSE
Q --> VEC --> FUSE
FUSE -->|top-200| RANK -->|top-50| BIZ --> SNIP --> R
classDef q fill:#1a2530,stroke:#64c8ff,color:#e8eef5
classDef rec fill:#0e2030,stroke:#5eead4,color:#e8eef5
classDef rank fill:#1a1a30,stroke:#ffb450,color:#e8eef5
classDef out fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
class Q q
class BM,VEC,FUSE rec
class RANK,BIZ rank
class SNIP,R out
A funnel: retrieval pulls hundreds (high recall, cheap) → reranking trims to dozens (high precision, costly) → business rerank sets the final order
The core idea: split "find everything" from "rank correctly" into different stages. The retrieval layer runs two cheap channels in parallel (keyword + semantic), then fuses and dedups. The ranking layer runs the expensive model only on a few hundred candidates. Finally, business signals (price, stock, ads) are layered on. Each stage's latency budget and candidate count shrink strictly.
Core trade-off: lexical precision vs semantic generalization — their blind spots are exactly complementary.
Principle: BM25 (Okapi BM25, Robertson/Spärck Jones) is the probabilistic refinement of TF-IDF: it scores by term frequency + inverse document frequency, normalizes for document length, and saturates frequent terms (a word appearing 10× isn't worth 2× of 5×). It runs on an inverted index and is unbeatable on exact terms, rare tokens (SKUs, models, names), and boolean/negation logic. A dense vector encodes query and doc each into a 768-dim vector (bi-encoder, e.g. a BERT-style dual encoder) and takes cosine similarity, capturing semantics — "laptop" matches "notebook computer", "cheap" matches "affordable". But it is insensitive to exact tokens and will "confidently fabricate relevance".
# Hybrid retrieval: two channels in parallel, then fuse
def retrieve(q):
sparse = bm25_index.search(q.text, k=100) # inverted, exact terms
dense = ann_index.search(embed(q.text), k=100) # HNSW, semantic
return fuse(sparse, dense) # see point 3
Core trade-off: each stage trades "candidate count" against "per-item evaluation cost" in opposite directions.
Principle: you can't run an expensive model over 50M docs. Multi-stage builds retrieval as a funnel: recall pulls a few hundred from the full corpus aiming for "don't miss", using the cheapest BM25/ANN; fine ranking (precision) runs the heavy model only on those hundreds, aiming for "rank correctly". The key is that per-stage cost × candidate count ≈ constant — recall is near-free per item but high in volume; fine ranking is costly per item but low in volume; their product must fit the latency budget. ANN (e.g. HNSW, hierarchical navigable small world graphs) is the canonical "trade accuracy for speed" of the recall stage: give up exact KNN for sub-ms logarithmic lookup.
# Funnel: candidate count shrinks, per-item cost grows
cand = hybrid_retrieve(q, k=200) # ~sub-ms/item, want all
cand = rerank_cross_encoder(q, cand)[:50] # ~ms/item, want precise
final = business_rerank(cand)[:24] # price/stock/ads/personalize
Core trade-off: fusion solves "two score scales aren't comparable", reranking solves "the dual-encoder isn't precise enough", but cross-encoders are two orders of magnitude slower.
Principle: BM25 scores (unbounded) and cosine similarity (0–1) have different scales — a naïve weighted average gets dominated by one side. RRF (Reciprocal Rank Fusion, Cormack 2009) looks at rank, not score: doc score = Σ 1/(k + rank), with k typically 60, naturally solving the incomparable-scale problem with no normalization. After fusion comes ranking: a bi-encoder (the dual tower used for recall) encodes query and doc independently and can build an offline index; a cross-encoder concatenates query+doc into one model so their tokens interact fully — far more accurate, but must be computed online and can't be pre-stored. That's exactly why it lives in fine ranking (candidates already trimmed to dozens).
| Bi-encoder (dual tower) | Cross-encoder | |
|---|---|---|
| Encoding | query / doc encoded independently | query+doc encoded jointly |
| Pre-indexable? | Yes (doc vectors offline) | No (must compute online) |
| Accuracy | Medium | High (token-level interaction) |
| Cost | Low (one vector compare) | High (one forward per pair) |
| Used in | Recall (millions) | Fine ranking (dozens–hundreds) |
# RRF: uses rank only, no normalization needed
def rrf(lists, k=60):
score = defaultdict(float)
for ranked in lists: # each channel's ordered results
for rank, doc in enumerate(ranked):
score[doc] += 1.0 / (k + rank)
return sorted(score, key=score.get, reverse=True)
# Fine ranking: cross-encoder computed online (only on fused top-N)
scores = cross_encoder.predict([(q, d.text) for d in candidates])
Core trade-off: what you show is the explanation, not the retrieval — pick the wrong passage and users won't click, however good the ranking.
Principle: a snippet is query-biased summarization: not the first two sentences, but the window that best reflects "relevance to this query". The classic approach slides a fixed-length window, scores it by query-term hits inside, term rarity (high-IDF terms weigh more), and term proximity, takes the best window, and wraps hit terms in <em>. The hard part: highlighting must land on the rendered source offsets (positions shift after tokenization, stemming, Unicode normalization) — misalignment highlights the wrong characters.
# query-biased snippet: pick the densest, rarest-term window
def best_snippet(doc_tokens, q_terms, W=160):
qset, best, lo = set(q_terms), (-1, 0), 0
for i in range(0, len(doc_tokens)-W, 20):
win = doc_tokens[i:i+W]
# hit count + IDF weighting + coverage of distinct query terms
s = sum(idf(t) for t in win if t in qset) \
+ 2*len({t for t in win if t in qset})
if s > best[0]: best, lo = (s, i), i
return highlight(doc_tokens[lo:lo+W], qset) # wrap hits in <em>
Frequent interview follow-ups:
The bottleneck is almost certainly recall, not ranking. Fine ranking only reorders the 200 retrieved — the missing 15% never enter the candidate set, and no ranker can help. High NDCG only says "what was retrieved is ordered well", masking "what wasn't retrieved is invisible".
How to locate: ① offline, compute the recall@k ceiling using human labels/click logs — feed known-relevant docs into retrieval and see if they enter candidates; ② break down by query type — likely semantic/long-tail queries recall poorly (vectors undercover) or exact queries get crowded out by vector noise; ③ A/B raise k from 200 to 500 — if GMV rises, recall is the bottleneck. Fix: widen retrieval channels, raise ANN's efSearch, add SPLADE — not keep optimizing ranking. Core insight: in a multi-stage funnel, upstream recall is the ceiling for everything downstream.
RRF score = Σ 1/(k+rank). k controls whether rank differences are amplified or flattened.
Small k (e.g. 1): 1/(1+0)=1, 1/(1+1)=0.5 — huge gaps between top ranks, extremely biased toward each channel's rank-1, almost "whoever's first wins"; sensitive to the head, the tail barely contributes. Large k (e.g. 1000): 1/1000 vs 1/1001 are nearly equal — all ranks flatten, rank info diluted, degenerating into "in how many channels did it appear" — favoring cross-channel consensus over a single channel's high rank.
k=60 is the empirical compromise: it preserves head discrimination while letting mid-ranked docs (decent in both channels) float up on consensus — exactly the "what both sides find relevant is more trustworthy" that hybrid search wants. Tuning: shrink k to emphasize single-channel strong signals, grow k to emphasize cross-channel agreement.
The vectors themselves: 768 dims × 4 bytes = 3072 B ≈ 3 KB/doc. 5e7 × 3 KB ≈ 150 GB.
The HNSW graph: each node stores M neighbor pointers per layer; with a common M=16, each neighbor ID ~4–8 bytes, and with multi-layer overhead the graph empirically adds 20%–60% on top of the vectors. At ~50% that's about 75 GB.
Total ~200–230 GB — too much/too costly for one box. That's why large scale either shards (split by shard across machines, scatter-gather queries), uses PQ/quantization to compress 3 KB to a few hundred bytes (IVF-PQ compresses 8–32× at the cost of recall), or uses disk-based ANN (DiskANN, graph on SSD). Interviewers want to see you do order-of-magnitude estimation and choose architecture from it, not recite the HNSW formula.
It comes down to whether you can precompute and amortize. A bi-encoder encodes query and doc independently into vectors, so doc vectors can be computed offline and stored in an ANN index; online you only do vector compares — the doc-side cost is amortized once. A cross-encoder concatenates query+doc and runs the whole thing through the model, letting their tokens interact pairwise in attention (precisely the source of its accuracy) — but that means the result depends on the specific query and can't be pre-stored; every (query, doc) pair needs a full online forward pass.
For 50M docs, recall would mean "query × full corpus" = 50M forward passes, blowing past any second-scale budget. So it only fits the fine-ranking stage where candidates are already dozens–hundreds. Insight: more interaction means higher accuracy, but interaction makes the result un-precomputable — accuracy and amortizability are a fundamental tension, and multi-stage architecture is exactly how we divide the labor. ColBERT's "late interaction" aims for a middle ground.
This is a structural flaw of dense retrieval: ANN always finds the k "nearest", but "nearest" ≠ "relevant". When a query lands in a sparse region of embedding space, the nearest neighbors may be far yet still returned — manifesting as "search for something the corpus doesn't have, and you still get a pile of unrelated results".
Defenses: ① a similarity threshold — drop anything below it, returning few or empty rather than noise (but the threshold drifts per query, hard to unify); ② BM25 as a backstop/check — in hybrid, if a vector-retrieved doc contains zero query terms and has a very low BM25 score, downweight or drop it; ③ cross-encoder as gatekeeper — it gives an absolute relevance score; filter low scores; ④ OOD/confidence detection — monitor the distribution of query-to-nearest-neighbor distances and treat abnormally far ones as "no good result", falling back to a message. Key realization: vector retrieval gives a relative ordering, not an absolute relevance judgment — absolute judgments need thresholds or a cross-encoder.