A RAG system's ceiling isn't the LLM — it's retrieval recall. The model can't answer from a document it never sees.
Day 10 / 11 covered how to chunk and which embedding to use. This issue takes a sharper angle: how to quantify and tune retrieval quality itself without fooling yourself. In practice, most RAG systems are bottlenecked at retrieval, not generation: an LLM handed the right document almost always answers correctly, and fabricates when the right document is missing. So the real battleground of RAG engineering is one question: did the right document make it into top-k? Yet almost nobody quantifies it. Teams ship with top_k=5 picked on a hunch, blame "the model" when results are bad, and never build a golden set that produces numbers.
This issue covers four things that immediately separate you from the pack: why recall must come before precision, why hybrid search should fuse with RRF rather than a weighted sum, the real value and cost boundary of rerankers, and how to split retrieval failures from generation failures, with Anthropic's measured Contextual Retrieval numbers and runnable eval code.
Retrieval quality is a tug-of-war between recall and precision, but the two carry very different weight at different stages of the pipeline. The key asymmetry: a recall miss is unrecoverable, while over-retrieval is fixable. If the correct document never enters the stage-one candidate set, no reranker, however strong, and no LLM, however smart, can recover it, because neither ever sees it. Conversely, an irrelevant document in the candidate set merely adds a little filtering load for the reranker and LLM, a far cheaper cost.
So industrial RAG is almost always two-stage: the recall stage uses a cheap, fast retriever (BM25 + vector) to over-retrieve to top-100~200 candidates, aiming to "fence in" the correct document; the rerank stage uses an expensive, precise cross-encoder to squeeze candidates down to top-5~8, aiming to "order them right." This structure directly dictates how you tune: in stage one push k larger, err on the side of over-inclusion; only in stage two chase precision.
First quantify your system's "recall ceiling" — the first number in RAG tuning, one almost nobody has ever measured:
    # Given (query, correct chunk_id) labels, measure recall@k vs k
    def recall_at_k(retriever, eval_set, ks=(5, 10, 20, 50, 100, 200)):
        hits = {k: 0 for k in ks}
        for q, gold_ids in eval_set:                     # gold_ids: correct chunks for this query
            ranked = retriever.search(q, top_k=max(ks))  # returns list of chunk_id
            for k in ks:
                if set(ranked[:k]) & gold_ids:           # top-k hits any correct doc
                    hits[k] += 1
        return {k: hits[k] / len(eval_set) for k in ks}

    # Typical output: recall@5=0.71 @20=0.88 @100=0.96 @200=0.97
    # Reading: @5 only 0.71 -> 29% of queries never get the right doc into the LLM.
    #          @100 reaches 0.96 -> the retriever itself is good; the bottleneck
    #          is ORDERING, not retrieval -> add a reranker, don't swap embeddings.
The shape of this curve tells you exactly what to do next: if recall@200 won't climb (say <0.85), the problem is the retriever / chunking — swap embeddings or go hybrid; if recall@100 is high but recall@5 is low, the problem is ordering — add a reranker. Tuning before drawing this curve is driving with your eyes closed.
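The reading rule above can be written down as a small helper. This is only a sketch: the function name `read_recall_curve` and the `max_gap` threshold are assumptions made for illustration, while the 0.85 floor on recall@200 is the figure from the text.

```python
def read_recall_curve(recall, ceiling_k=200, pool_k=100, final_k=5,
                      min_ceiling=0.85, max_gap=0.10):
    """Map a {k: recall@k} dict to the next tuning step.

    min_ceiling follows the text (recall@200 below ~0.85 means retrieval).
    max_gap is a hypothetical cutoff for "pool recall high, top-k recall low".
    """
    if recall[ceiling_k] < min_ceiling:
        return "retrieval"   # fix chunking / embeddings / go hybrid
    if recall[pool_k] - recall[final_k] > max_gap:
        return "ordering"    # right doc is already in the pool: add a reranker
    return "ok"

# The "typical output" curve from the recall_at_k example
curve = {5: 0.71, 20: 0.88, 100: 0.96, 200: 0.97}
verdict = read_recall_curve(curve)
print(verdict)
```

Pick the thresholds from your own golden set; the point is only that the decision is made from numbers, not from impressions.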
Vector (dense) retrieval excels at semantic matching but is blind to exact literals: product codes like X-42B, error codes like ERR_TIMEOUT, names, rare proper nouns. Embeddings blur these into nearby vectors. BM25 (lexical/sparse) is the opposite: brutal at exact-term hits, but with no grasp of synonymy or semantics. The two miss different queries, so hybrid search almost always beats either path alone. Anthropic's Contextual Retrieval (2024) gives hard numbers: Contextual Embeddings alone cut the top-20 retrieval failure rate by 35% (5.7%→3.7%); adding Contextual BM25 cut it by 49% (→2.9%); adding reranking on top cut it by 67% (→1.9%). Each of the three layers genuinely patches blind spots.
But fusion can't simply add the two retrievers' scores: BM25 scores are unbounded (could be 8.3 or 41.2), while cosine similarity lies in [-1, 1]. The scales are not comparable, so a weighted sum lets BM25's magnitude swamp the vector score. The right approach is RRF (Reciprocal Rank Fusion, Cormack et al. 2009): it uses only rank, never score, so it is immune to the scale problem. Each doc's fused score = Σ 1/(k + rank), summed over the retrievers that returned it, with rank starting at 1 and k typically 60.
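A quick arithmetic check helps when reading those percentages: each one is a reduction relative to the 5.7% baseline, not relative to the previous layer. The sketch below recomputes them from the failure rates quoted above; the dictionary labels are only shorthand chosen here.

```python
baseline = 5.7  # top-20 retrieval failure rate (%) before any contextual technique

# Failure rates (%) after each cumulative layer, as quoted in the text
layers = {
    "contextual embeddings": 3.7,
    "+ contextual BM25": 2.9,
    "+ reranking": 1.9,
}

# Relative reduction versus the baseline, rounded to whole percent
reductions = {name: round(100 * (baseline - rate) / baseline)
              for name, rate in layers.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct}% fewer retrieval failures than baseline")
```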
    # RRF: use each retriever's "place", not its raw score -> no normalization
    def rrf(rank_lists, k=60, top_n=100):
        # rank_lists: [[doc_id ordered by rank], ...] from BM25 / dense / ...
        scores = {}
        for ranking in rank_lists:
            for rank, doc_id in enumerate(ranking):  # rank starts at 0
                scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    bm25_hits = bm25.search(q, top_k=100)     # literal-exact
    dense_hits = vector.search(q, top_k=100)  # semantically near
    fused = rrf([bm25_hits, dense_hits])      # complementary ranks stack up
    # A doc ranked high by both paths -> scores stack to the very top;
    # hit by only one path -> still gets a shot at the candidate set.
RRF's engineering advantage: no training, essentially no tuning, and adding a new retriever is just passing one more list. k=60 is the paper's robust default and rarely needs touching. Consider learned fusion only if you need to go further; for 90% of cases RRF is the best-value endpoint.
Two common mistakes: (1) fusing with α·dense + (1-α)·bm25 and spending a week tuning α. The scales are incomparable, the optimal α drifts across queries, and you're essentially fitting noise; switching to RRF eliminates the problem outright. (2) believing "once we have vector search we don't need BM25". For exact-match-heavy domains (code, legal clause numbers, SKUs), pure-vector recall systematically misses key literal hits, and those are exactly the queries users least tolerate getting wrong.
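To make the fused score concrete, here is a toy run on two three-item rankings. The doc ids are invented for illustration, and `rrf_scores` is a hypothetical variant of the function above that returns the scores instead of only the order.

```python
from collections import defaultdict

def rrf_scores(rank_lists, k=60):
    """Return {doc_id: fused score}, with 1-based ranks as in the formula."""
    scores = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Made-up rankings: BM25 likes the literal SKU page, dense likes the FAQ
bm25_ranking = ["sku_x42b", "faq_returns", "manual_p3"]
dense_ranking = ["faq_returns", "manual_p7", "sku_x42b"]

scores = rrf_scores([bm25_ranking, dense_ranking])
fused = sorted(scores, key=scores.get, reverse=True)

for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

The two docs found by both retrievers land on top (`faq_returns` at 1/62 + 1/61, `sku_x42b` at 1/61 + 1/63), while the single-retriever hits stay in the list with roughly half that score.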
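The scale problem with a weighted sum is easy to see on toy numbers. In the sketch below every score is invented for illustration: the dense retriever clearly prefers `doc_b`, yet across a wide range of α the fused order simply repeats the BM25 order.

```python
# Hypothetical raw scores for three docs on one query
bm25 = {"doc_a": 41.2, "doc_b": 8.3, "doc_c": 12.0}    # unbounded scale
dense = {"doc_a": 0.31, "doc_b": 0.92, "doc_c": 0.55}  # cosine similarity

def weighted_sum_ranking(alpha):
    """Rank docs by alpha*dense + (1-alpha)*bm25 on the raw, unnormalized scores."""
    fused = {d: alpha * dense[d] + (1 - alpha) * bm25[d] for d in bm25}
    return sorted(fused, key=fused.get, reverse=True)

bm25_order = sorted(bm25, key=bm25.get, reverse=True)
alphas = [0.1, 0.3, 0.5, 0.7, 0.9]

# Values of alpha for which the dense retriever changed nothing at all
unchanged = [a for a in alphas if weighted_sum_ranking(a) == bm25_order]
print(unchanged)
```

Normalizing the scores first would soften this, but it adds another per-corpus choice to tune, which is the trap described above.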
Stage-one vector retrieval uses a bi-encoder: query and doc are encoded independently into vectors, then scored by similarity. Independent encoding means you can precompute all corpus vectors offline and only encode the query once at query time. That is fast, but there is no token-level interaction between query and doc, so the semantic match is coarse-grained. A reranker is a cross-encoder: it concatenates (query, doc) into one sequence and runs full attention over it, so every query token can "see" every doc token. The resulting relevance score is markedly more precise.
The cost is that a cross-encoder cannot be precomputed: each (query, doc) pair requires a fresh full forward pass at runtime. 100 candidates means 100 inferences, with latency and cost scaling linearly in candidate count. So its natural role is stage-two reranking: process only the dozens-to-~hundred over-retrieved candidates, order them right, and cut to top-5. Never use a reranker as a retriever scanning the whole corpus — a million docs × a forward pass each would blow up both latency and the bill.
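A back-of-envelope estimate shows why the candidate count matters. The 5 ms per pair below is a purely hypothetical figure, not a measurement of any particular reranker; substitute your own number.

```python
def rerank_seconds(n_docs, ms_per_pair=5.0):
    """One full forward pass per (query, doc) pair, so cost is linear in n_docs.

    ms_per_pair is an assumed latency used only for illustration.
    """
    return n_docs * ms_per_pair / 1000.0

stage_two = rerank_seconds(100)            # rerank the over-retrieved candidates only
whole_corpus = rerank_seconds(1_000_000)   # misuse: reranker as a retriever

print(f"100 candidates: {stage_two:.1f} s per query")
print(f"1M docs:        {whole_corpus / 60:.0f} min per query")
```

Batching and GPUs change the constant, not the linear shape.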
    # Two stages: bi-encoder recall (fast) -> cross-encoder rerank (precise, candidates only)
    import cohere

    co = cohere.Client()
    candidates = rrf([bm25_hits, dense_hits])[:100]  # Stage 1: over-recall 100
    docs = [chunk_text[c] for c in candidates]
    reranked = co.rerank(                            # Stage 2: cross-encoder rerank
        model="rerank-v3.5", query=q, documents=docs, top_n=5)
    final = [candidates[r.index] for r in reranked.results]
    # 100 -> 5: only these 5 reach the LLM. The reranker "translates"
    # recall@100 into precision@5 -- the best-value jump on the §1 curve.
When is a reranker worth it? Look at the §1 recall curve: the gap where recall@100 is high but recall@5 is low is exactly the reranker's payoff zone — the correct doc is already in the candidates, just not ranked up front. In Anthropic's measurements, the reranking layer pushed the failure rate from 2.9% further down to 1.9%, confirming it as an effective layer stacked on top of hybrid search. For open source pick self-hosted bge-reranker-v2; for managed pick Cohere Rerank / Pinecone Rerank.
RAG chains two stages, retrieval + generation, and a failure in either shows up as a "wrong answer," but the fixes point in opposite directions. A retrieval failure (correct doc never in top-k) calls for fixing chunking/embedding/hybrid/reranker; a generation failure (the doc is right there in context, yet the LLM misused it or answered wrong) calls for fixing prompt/model. Tune them mixed together and you'll waste all your time on the wrong half. So the first principle of RAG eval is layered attribution: judge retrieval first, and only then judge generation.
The key engineering move is to first build a golden set: 50~200 labels of (query, correct chunk_id, ideal answer). This is the highest-ROI one-time investment in RAG engineering — with it, every chunk-strategy change, embedding swap, or reranker tweak produces a recall number to compare, instead of "it feels a bit more accurate" before shipping. Pinecone puts it well: RAG evaluation — don't let customers tell you first.
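One possible shape for such a golden set is a JSONL file with one label per line. The field names (`query`, `gold_ids`, `gold_answer`), the `GoldenSample` class and the sample rows are all assumptions made for this sketch, not a prescribed format.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenSample:
    query: str
    gold_ids: frozenset   # chunk ids that contain the answer
    gold_answer: str

def load_golden_set(lines):
    """Parse JSONL lines into samples, rejecting rows with no gold chunk."""
    samples = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        if not row["gold_ids"]:
            raise ValueError(f"line {n}: no gold chunk ids")
        samples.append(GoldenSample(row["query"],
                                    frozenset(row["gold_ids"]),
                                    row["gold_answer"]))
    return samples

# Invented example rows; in practice these come from a labeled file
raw = [
    '{"query": "What does ERR_TIMEOUT mean?", "gold_ids": ["c17"], "gold_answer": "The upstream call exceeded its deadline."}',
    '{"query": "Is X-42B covered by warranty?", "gold_ids": ["c03", "c44"], "gold_answer": "Yes, for two years."}',
]
golden = load_golden_set(raw)

# (query, gold_ids) pairs in the shape a recall@k loop would iterate over
eval_set = [(s.query, s.gold_ids) for s in golden]
```

Storing gold ids as a set keeps the "any correct chunk in top-k" check a plain set intersection.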
    # Layered attribution: judge retrieval first, then generation -> same error, different bucket
    def diagnose(sample, retriever, llm):
        q, gold_ids, gold_answer = sample
        retrieved = retriever.search(q, top_k=5)
        retrieval_ok = bool(set(retrieved) & gold_ids)   # right doc in top-5?
        answer = llm.generate(q, context=[chunk_text[c] for c in retrieved])
        answer_ok = judge_faithful(answer, gold_answer)  # LLM-judge on faithfulness
        if not retrieval_ok:
            return "RETRIEVAL_FAIL"   # -> fix chunk/embedding/hybrid/rerank
        if not answer_ok:
            return "GENERATION_FAIL"  # -> fix prompt/model (doc was in context, still wrong)
        return "OK"

    # Run the golden set, tally the two shares:
    # RETRIEVAL_FAIL 60% / GENERATION_FAIL 40% -> battleground is retrieval, stop tuning prompts
Run this attribution once and you know exactly where your engineering hours should go. Most teams assume "the prompt is off," then find that more than half of the failures are RETRIEVAL_FAIL: the doc never reached context, and no amount of prompt tuning will ever help.
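The tally itself is a few lines. In this sketch the label list is fabricated stand-in data (in practice it would be one label per golden-set sample), and `attribution_shares` is a hypothetical helper name.

```python
from collections import Counter

BUCKETS = ("OK", "RETRIEVAL_FAIL", "GENERATION_FAIL")

def attribution_shares(labels):
    """Turn a list of per-sample labels into the share of each bucket."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {bucket: 0.0 for bucket in BUCKETS}
    return {bucket: counts[bucket] / total for bucket in BUCKETS}

# Stand-in labels for 20 samples; real ones come from running the diagnosis
labels = ["OK"] * 14 + ["RETRIEVAL_FAIL"] * 4 + ["GENERATION_FAIL"] * 2
shares = attribution_shares(labels)

# The larger failure bucket is where the next round of work goes
failures = {k: v for k, v in shares.items() if k != "OK"}
next_battleground = max(failures, key=failures.get)
print(shares, next_battleground)
```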
Chain the four points into a runnable diagnostic pipeline — spend two hours over a weekend and never "tune RAG by feel" again:
- Run recall_at_k and inspect @5/@20/@100/@200. Decide whether the bottleneck is "retrieval" (@200 low) or "ordering" (@100 high, @5 low).
- Run diagnose, tally RETRIEVAL_FAIL vs GENERATION_FAIL, and lock in the next battleground.
Once this is done, your understanding of your own RAG shifts from "feels okay" to "recall@5=0.84, bottleneck is exact-match recall, next step is BM25". That's the line between engineering and superstition.