DAY 43 / PHASE 4 · ENGINEERING

Retrieval Quality Engineering

Recall vs Precision · Hybrid Fusion · Reranker · Retrieval Eval

2026-06-22 · BigCat

A RAG system's ceiling isn't the LLM — it's retrieval recall. The model can't answer from a document it never sees.

// WHY THIS MATTERS

Day 10 / 11 covered how to chunk and which embedding to use. This issue takes a sharper angle: how to quantify and tune retrieval quality itself without fooling yourself. Most RAG systems are bottlenecked at retrieval, not generation: an LLM handed the right document almost always answers correctly, and fabricates when it is handed the wrong one. So the real battleground of RAG engineering is one question: did the right document make it into top-k? Yet almost nobody quantifies it. Teams ship with top_k=5 picked on a hunch, blame "the model" when answers are bad, and never build a golden set that produces numbers. This issue covers four things that immediately separate you from the pack: why recall must come before precision, why hybrid search should fuse with RRF rather than a weighted sum, the real value and cost boundary of rerankers, and how to split retrieval failures from generation failures, with Anthropic's measured Contextual Retrieval numbers and runnable eval code.

// 01

Recall Before Precision: The Core Trade-off of Two-Stage Retrieval

Claim: the only KPI of stage one is recall@k. It is better to over-retrieve noise than to ever drop the correct document. Leave precision to the reranker and the LLM.

Background & Principle

Retrieval quality is a tug-of-war between recall and precision, but the two have completely asymmetric status at different stages of the pipeline. The key asymmetry: a missed-recall error is unrecoverable; an over-recall error is fixable. If the correct document never enters the stage-one candidate set, no reranker — however strong — and no LLM — however smart — can recover it: it simply never sees it. Conversely, an irrelevant document in the candidate set merely adds a little filtering load to the reranker and LLM — a far cheaper cost.

So industrial RAG is almost always two-stage: the recall stage uses a cheap, fast retriever (BM25 + vector) to over-retrieve to top-100~200 candidates, aiming to "fence in" the correct document; the rerank stage uses an expensive, precise cross-encoder to squeeze candidates down to top-5~8, aiming to "order them right." This structure directly dictates how you tune: in stage one push k larger, err on the side of over-inclusion; only in stage two chase precision.

┌──── Two-stage retrieval: wide recall in, precise rank out ───────┐
│                                                                  │
│  query                                                           │
│    │                                                             │
│    ├─▶ STAGE 1 · RECALL (need high recall@k · fast · cheap)      │
│    │     BM25(lexical) ─┐                                        │
│    │                    ├─▶ RRF fuse ─▶ top-100~200              │
│    │     dense(vector) ─┘               candidates (over-recall) │
│    │                                         │                   │
│    └─▶ STAGE 2 · RERANK (need precision · slow · costly)         │
│          cross-encoder re-rank ◀─────────────┘                   │
│                 │                                                │
│                 ▼ top-5~8                                        │
│          ┌──────────────┐                                        │
│          │ LLM context  │ ← only what reaches here can be cited  │
│          └──────────────┘                                        │
│                                                                  │
│  missed recall = unrecoverable · over-recall = small, fixable    │
└──────────────────────────────────────────────────────────────────┘

Hands-on

First quantify your system's "recall ceiling" — the first number in RAG tuning, one almost nobody has ever measured:

# Given (query, correct chunk_ids) labels, measure recall@k across several k
def recall_at_k(retriever, eval_set, ks=(5, 10, 20, 50, 100, 200)):
    eval_set = list(eval_set)           # accept generators; len() is needed below
    hits = {k: 0 for k in ks}
    for q, gold_ids in eval_set:        # gold_ids: correct chunks for this query
        gold = set(gold_ids)            # accept a list or a set of labels
        ranked = retriever.search(q, top_k=max(ks))   # returns list of chunk_id
        for k in ks:
            if gold.intersection(ranked[:k]):   # top-k hits any correct doc
                hits[k] += 1
    return {k: hits[k] / len(eval_set) for k in ks}

# Typical output: recall@5=0.71  @20=0.88  @100=0.96  @200=0.97
# Reading: @5 only 0.71 -> 29% of queries never get the right doc into the LLM.
#          @100 reaches 0.96 -> the retriever itself is good; the bottleneck
#          is ORDERING, not retrieval -> add a reranker, don't swap embeddings.

The shape of this curve tells you exactly what to do next: if recall@200 won't climb (say <0.85), the problem is the retriever / chunking — swap embeddings or go hybrid; if recall@100 is high but recall@5 is low, the problem is ordering — add a reranker. Tuning before drawing this curve is driving with your eyes closed.
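The reading rule above can be sketched as a small helper. This is a minimal sketch, not a standard: the function name `diagnose_curve` and the `high` and `gap` cut-offs are assumptions for illustration, and only the 0.85 floor echoes the example figure in the text. It compares the smallest and largest k on the curve rather than @5 and @100 specifically.

```python
def diagnose_curve(curve, floor=0.85, high=0.90, gap=0.10):
    """Map a recall@k curve ({k: recall}) to a suggested next step.

    All thresholds are illustrative assumptions; tune them to your own data.
    """
    ks = sorted(curve)
    small_k, large_k = ks[0], ks[-1]
    if curve[large_k] < floor:
        # Even the widest cut misses too often: stage one is the problem.
        return "retriever"   # revisit chunking / embeddings / hybrid search
    if curve[large_k] >= high and curve[large_k] - curve[small_k] >= gap:
        # The right doc is among the candidates but ranked too low.
        return "ordering"    # add or improve a reranker
    return "ok"


curve = {5: 0.71, 20: 0.88, 100: 0.96, 200: 0.97}
print(diagnose_curve(curve))  # ordering
```

Feeding it the typical output shown earlier lands in the "ordering" bucket, which matches the reading given there.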

Failure mode: setting top_k too small (e.g. straight to 5) and then complaining RAG is inaccurate. A small k blocks the correct document at the door in stage one — and you'll never know, because without a labeled set you can't see recall. The stage-one k should be set by how many candidates the reranker can chew (typically 50~200), not by the final count fed to the LLM.

Going deeper · Pinecone Rerankers and Two-Stage Retrieval, pinecone.io/learn/series/rag/rerankers · BEIR benchmark (IR metrics), arXiv:2104.08663

// 02

Hybrid Search: Fuse BM25 + Vector with RRF, Not a Weighted Sum

Claim: lexical and semantic retrieval make different kinds of mistakes — fusion patches the blind spots — but the fusion method must be rank-based RRF, never a direct sum of scores.

Background & Principle

Vector (dense) retrieval excels at semantic matching but is blind to exact literals: product codes like X-42B, error codes like ERR_TIMEOUT, names, rare proper nouns. Embeddings blur these into nearby vectors. BM25 (lexical/sparse) is the opposite: brutal at exact-term hits, but with no grasp of synonymy or semantics. The two fail on complementary sets of queries, so hybrid search usually beats either path alone. Anthropic's Contextual Retrieval (2024) gives hard numbers, each measured against the same 5.7% baseline for the top-20 retrieval failure rate: Contextual Embeddings alone cut it by 35% (5.7%→3.7%); adding Contextual BM25 brought the cut to 49% (→2.9%); adding reranking on top brought it to 67% (→1.9%). Each of the three layers genuinely patches blind spots.

But fusion can't crudely add the two retrievers' similarity scores: BM25 scores are unbounded (could be 8.3 or 41.2), while cosine similarity lies in [-1,1]. The units are simply incomparable, and a weighted sum lets BM25's scale crush the vector score. The right approach is RRF (Reciprocal Rank Fusion, Cormack et al. 2009): it uses only rank, never score, so it is naturally immune to the units problem. Each doc's fused score = Σ 1/(k+rank), summed over the retrievers that returned the doc, with rank counted from 1 and k typically 60.

Hands-on

# RRF: use each retriever's "place", not its raw score -> no normalization
def rrf(rank_lists, k=60, top_n=100):
    # rank_lists: [[doc_id ordered by rank], ...] from BM25 / dense / ...
    scores = {}
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking):   # rank starts at 0
            scores[doc_id] = scores.get(doc_id, 0) + 1/(k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25_hits  = bm25.search(q,  top_k=100)     # literal-exact
dense_hits = vector.search(q, top_k=100)    # semantically near
fused = rrf([bm25_hits, dense_hits])        # complementary ranks stack up
# A doc ranked high by both paths -> scores stack to the very top;
# hit by only one path -> still gets a shot at the candidate set.

RRF's engineering advantage: zero tuning, zero training, and adding a new retriever is just passing one more list. k=60 is the paper's robust default and rarely needs touching. Consider learned fusion only if you need to go further; for 90% of cases RRF is the best-value endpoint.

Failure mode: (1) using a weighted sum α·dense + (1-α)·bm25 and spending a week tuning α — the units are incomparable, the optimal α drifts across queries, and you're essentially fitting noise; switching to RRF eliminates the problem outright. (2) believing "once we have vector search we don't need BM25" — for exact-match-heavy domains (code, legal clause numbers, SKUs), pure-vector recall systematically misses key literal hits, and those are exactly the queries users least tolerate getting wrong.
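Failure mode (1) can be reproduced with toy numbers. The sketch below is illustrative only: the four documents and all scores are invented, and `weighted_sum` and `rank_fusion` are hypothetical helper names. It shows a raw-score weighted sum reproducing the BM25 order at every α tried, while rank-based fusion lets the dense retriever's vote count.

```python
# Invented scores for illustration; not measurements.
bm25_scores  = {"a": 41.2, "b": 20.0, "c": 8.3, "d": 7.9}     # unbounded scale
dense_scores = {"a": 0.21, "b": 0.93, "c": 0.80, "d": 0.52}   # cosine, in [-1, 1]

def weighted_sum(alpha):
    """alpha * dense + (1 - alpha) * bm25 on raw, unnormalized scores."""
    fused = {d: alpha * dense_scores[d] + (1 - alpha) * bm25_scores[d]
             for d in bm25_scores}
    return sorted(fused, key=fused.get, reverse=True)

def rank_fusion(k=60):
    """Reciprocal rank fusion over the same two orderings (ranks start at 1)."""
    orderings = [sorted(s, key=s.get, reverse=True)
                 for s in (bm25_scores, dense_scores)]
    fused = {}
    for ordering in orderings:
        for rank, d in enumerate(ordering, start=1):
            fused[d] = fused.get(d, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

bm25_order = sorted(bm25_scores, key=bm25_scores.get, reverse=True)
for alpha in (0.1, 0.5, 0.9):
    # The dense scores never change the outcome: BM25's scale dominates.
    assert weighted_sum(alpha) == bm25_order

print(bm25_order)      # ['a', 'b', 'c', 'd']
print(rank_fusion())   # ['b', 'a', 'c', 'd']
```

Document "b" is second for BM25 and first for dense, so rank fusion promotes it. The raw weighted sum keeps "a" on top even at α=0.9 because 41.2 dwarfs anything a cosine score can contribute.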

Going deeper · Anthropic Introducing Contextual Retrieval (hybrid measured), anthropic.com/engineering/contextual-retrieval · Cormack et al. Reciprocal Rank Fusion (SIGIR 2009), dl.acm.org/.../1572114

// 03

Reranker: The Real Value and Cost Boundary of Cross-Encoders

Claim: a reranker is the highest-value single-point quality gain, but its cost is latency, so it should process only the tens to low hundreds of candidates that stage one over-retrieved and never be used as a retriever.

Background & Principle

Stage-one vector retrieval uses a bi-encoder: query and doc are encoded independently into vectors, then scored by similarity. Independent encoding means you can precompute all corpus vectors offline and encode only the query at query time. That is fast, but there is no token-level interaction between query and doc, so the semantic match is coarse-grained. A reranker is a cross-encoder: it concatenates (query, doc) into one sequence and runs full attention over the pair, so every query token can "see" every doc token. The relevance score is markedly more precise as a result.

The cost is that a cross-encoder cannot be precomputed: each (query, doc) pair requires a fresh full forward pass at runtime. 100 candidates means 100 inferences, with latency and cost scaling linearly in candidate count. So its natural role is stage-two reranking: process only the dozens-to-~hundred over-retrieved candidates, order them right, and cut to top-5. Never use a reranker as a retriever scanning the whole corpus — a million docs × a forward pass each would blow up both latency and the bill.
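The linear scaling can be made concrete with back-of-the-envelope arithmetic. This is a rough sketch under stated assumptions: `rerank_latency_ms` is a hypothetical helper, pairs are processed in fixed-size batches one after another, and the 40 ms per batch of 32 pairs is a made-up placeholder to be replaced with your own measurement.

```python
import math

def rerank_latency_ms(n_candidates, ms_per_batch, batch_size=32):
    """Rough stage-two latency: one forward pass per (query, doc) pair,
    run in sequential batches. Inputs are numbers you measure yourself."""
    return math.ceil(n_candidates / batch_size) * ms_per_batch

# Hypothetical measurement: 40 ms per batch of 32 pairs.
for n in (50, 100, 500, 1_000_000):
    print(n, rerank_latency_ms(n, ms_per_batch=40))
# 50 80
# 100 160
# 500 640
# 1000000 1250000
```

Under these placeholder numbers, 100 candidates stay well under a second while a million-document scan takes about twenty minutes per query, which is why the reranker only ever sees the stage-one candidates.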

Hands-on

# Two stages: bi-encoder recall (fast) -> cross-encoder rerank (precise, candidates only)
import cohere
co = cohere.Client()   # assumes the API key is set in the CO_API_KEY environment variable

candidates = rrf([bm25_hits, dense_hits])[:100]   # Stage 1: over-recall 100
docs = [chunk_text[c] for c in candidates]

reranked = co.rerank(                              # Stage 2: cross-encoder rerank
    model="rerank-v3.5", query=q, documents=docs, top_n=5)
final = [candidates[r.index] for r in reranked.results]
# 100 -> 5: only these 5 reach the LLM. The reranker "translates"
# recall@100 into precision@5 -- the best-value jump on the §1 curve.

When is a reranker worth it? Look at the §1 recall curve: the gap where recall@100 is high but recall@5 is low is exactly the reranker's payoff zone, because the correct doc is already in the candidates, just not ranked up front. In Anthropic's measurements, the reranking layer pushed the failure rate from 2.9% further down to 1.9%, confirming it as an effective layer stacked on top of hybrid search. For open source, self-host a model from the bge-reranker-v2 family (e.g. bge-reranker-v2-m3); for managed, pick Cohere Rerank / Pinecone Rerank.

Failure mode: (1) giving too many candidates (e.g. reranking 500) — latency climbs linearly while the marginal recall@500 vs @100 gain is tiny, a pure latency loss; set candidate count from the recall curve. (2) blindly adding a reranker without checking the recall curve first: if recall@100 is itself low (the retriever missed recall), the reranker is ordering a pile of wrong docs — no matter how well it ranks them, it's useless; fix stage one first. A reranker lifts precision; it cannot rescue recall.
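"Set candidate count from the recall curve" can be turned into a one-function rule. A minimal sketch, assuming a curve in the `{k: recall}` shape returned by a recall-at-k measurement: `pick_candidate_k` is a hypothetical helper, and the 0.01 tolerance is an arbitrary starting point rather than a recommendation.

```python
def pick_candidate_k(curve, tolerance=0.01):
    """Smallest k whose recall is within `tolerance` of the best recall seen.

    A tiny epsilon absorbs floating-point noise in the subtraction.
    """
    best = max(curve.values())
    for k in sorted(curve):
        if best - curve[k] <= tolerance + 1e-9:
            return k

curve = {5: 0.71, 20: 0.88, 100: 0.96, 200: 0.97}
print(pick_candidate_k(curve))                  # 100
print(pick_candidate_k(curve, tolerance=0.0))   # 200
```

With this curve, doubling the candidates from 100 to 200 buys one point of recall, so the rule stops at 100 unless the tolerance is tightened to zero.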

Going deeper · Pinecone Rerankers and Two-Stage Retrieval (bi- vs cross-encoder), pinecone.io/learn/series/rag/rerankers · Anthropic Contextual Retrieval (reranking layer measured), anthropic.com/engineering/contextual-retrieval

// 04

Retrieval Eval: Build a Golden Set, Split Retrieval Failures from Generation Failures

Claim: when RAG answers wrong, most people go tune the prompt. But at that point you have no idea whether retrieval failed to surface the right doc, or the LLM got the right doc and still answered wrong. Without splitting the two, you're tuning blind.

Background & Principle

RAG chains two stages — retrieval + generation — and a failure in either shows up as a "wrong answer," but the fixes point in opposite directions. A retrieval failure (correct doc never in top-k) calls for fixing chunking/embedding/hybrid/reranker; a generation failure (the doc is right there in context, yet the LLM misused it or answered wrong) calls for fixing prompt/model. Tune them mixed together and you'll waste all your time on the wrong half. So the first principle of RAG eval is layered attribution:

Retrieval-layer metrics (component eval): recall@k (did the right doc enter candidates), nDCG@k / MRR (ordering quality). Requires (query → correct chunk_id) labels.
Generation-layer metrics (end-to-end): on the "retrieval-correct" subset, measure faithfulness (does the answer stay true to the given docs, no fabrication) and answer relevance.
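The two ordering metrics named above are short enough to write by hand. A minimal sketch, assuming binary relevance (a chunk is either a gold chunk or not); the helper names and the chunk ids in the example are made up for illustration.

```python
import math

def reciprocal_rank(ranked, gold_ids):
    """1 / rank of the first correct chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold_ids, k):
    """nDCG@k with binary relevance: discounted gain over the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in gold_ids)
    ideal_hits = min(len(gold_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mean_metric(metric, runs):
    """Average a per-query metric over (ranked, gold_ids) pairs."""
    return sum(metric(ranked, gold) for ranked, gold in runs) / len(runs)

runs = [(["c7", "c2", "c9"], {"c2"}),   # gold chunk at rank 2
        (["c4", "c1", "c3"], {"c4"})]   # gold chunk at rank 1
print(mean_metric(reciprocal_rank, runs))   # 0.75  (MRR)
print(round(ndcg_at_k(["c7", "c2", "c9"], {"c2"}, k=3), 4))   # 0.6309
```

Both metrics reward moving the gold chunk up the list even when recall@k is unchanged, which is what makes them useful for judging a reranker.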

The key engineering move is to first build a golden set: 50~200 labels of (query, correct chunk_id, ideal answer). This is the highest-ROI one-time investment in RAG engineering. With it, every chunk-strategy change, embedding swap, or reranker tweak produces a recall number to compare before shipping, instead of "it feels a bit more accurate." As the title of Pinecone's RAG evaluation guide puts it: don't let customers tell you first.
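One possible on-disk shape for the golden set is a JSONL file with one labeled query per line. The sketch below is an assumption, not a prescribed format: the field names `query`, `gold_ids`, and `ideal_answer`, the `GoldenSample` class, and the sample record are all hypothetical.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class GoldenSample:
    query: str
    gold_ids: frozenset      # chunk ids that correctly answer the query
    ideal_answer: str

def load_golden_set(lines):
    """Parse JSONL records into GoldenSample objects, rejecting empty labels."""
    samples = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue                      # tolerate blank lines
        rec = json.loads(line)
        gold = frozenset(rec["gold_ids"])
        if not rec["query"].strip() or not gold:
            raise ValueError(f"line {n}: query and gold_ids must be non-empty")
        samples.append(GoldenSample(rec["query"], gold, rec.get("ideal_answer", "")))
    return samples

# Hypothetical record; in practice, pass an open file instead of a list.
example = ['{"query": "What does ERR_TIMEOUT mean?", "gold_ids": ["c12"], '
           '"ideal_answer": "The request exceeded its deadline."}']
golden = load_golden_set(example)
print(len(golden), sorted(golden[0].gold_ids))   # 1 ['c12']
```

Failing loudly on an unlabeled row matters more than it looks: a query with no gold ids silently counts as a miss and drags every recall number down.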

Hands-on

# Layered attribution: judge retrieval first, then generation -> same error, different bucket
def diagnose(sample, retriever, llm):
    q, gold_ids, gold_answer = sample
    retrieved = retriever.search(q, top_k=5)
    if not set(retrieved) & set(gold_ids):           # right doc in top-5?
        return "RETRIEVAL_FAIL"    # -> fix chunk/embedding/hybrid/rerank (no LLM call needed)

    answer = llm.generate(q, context=[chunk_text[c] for c in retrieved])
    # LLM judge anchored on the reference answer; ideally a different model than `llm`
    answer_ok = judge_faithful(answer, gold_answer)
    if not answer_ok:
        return "GENERATION_FAIL"   # -> fix prompt/model (doc was in context, still wrong)
    return "OK"

# Run the golden set, tally the two failure buckets:
# e.g. RETRIEVAL_FAIL 60% / GENERATION_FAIL 40% of failures -> battleground is retrieval, stop tuning prompts

Run this attribution table once and you know exactly where your engineering hours should go. Most teams assume "the prompt is off," then find that a good half of the failures is RETRIEVAL_FAIL: the doc never reached context, and no amount of prompt tuning will ever help.
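Tallying the labels into that table takes a few lines. A minimal sketch, assuming each golden-set sample has already been labeled "OK", "RETRIEVAL_FAIL", or "GENERATION_FAIL"; `attribution_table` is a hypothetical helper and the counts in the example are invented.

```python
from collections import Counter

def attribution_table(labels):
    """Turn per-sample diagnosis labels into an OK rate and failure shares."""
    counts = Counter(labels)
    total = sum(counts.values())
    failures = total - counts["OK"]
    return {
        "ok_rate": counts["OK"] / total if total else 0.0,
        # Shares are computed over failures only, so the two sum to 1.
        "retrieval_share": counts["RETRIEVAL_FAIL"] / failures if failures else 0.0,
        "generation_share": counts["GENERATION_FAIL"] / failures if failures else 0.0,
    }

# Invented counts: 100 samples, 20 failures split 12 / 8.
labels = ["OK"] * 80 + ["RETRIEVAL_FAIL"] * 12 + ["GENERATION_FAIL"] * 8
print(attribution_table(labels))
# {'ok_rate': 0.8, 'retrieval_share': 0.6, 'generation_share': 0.4}
```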

Failure mode: (1) only looking at end-to-end answer correctness without layering — when it's wrong you don't know what to fix; layered attribution is the core of this whole setup. (2) a golden set of only easy queries while the real distribution is a long tail of hard queries — offline recall looks great, production collapses; labels must cover the real query distribution (especially exact-match, multi-hop, and negation hard cases). (3) using the same LLM to both generate and judge faithfulness — self-eval has a backward-rationalization bias; the judge is best as a different model or anchored with a reference.
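One way to catch failure mode (2) is to tag each golden-set query with a type and report recall per slice instead of one average. A sketch under assumptions: the type names, the `recall_by_slice` helper, and the outcomes below are invented for illustration.

```python
from collections import defaultdict

def recall_by_slice(results):
    """results: iterable of (query_type, hit), where hit is True when a gold
    chunk landed in the top-k for that query. Returns recall per query type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for query_type, hit in results:
        totals[query_type] += 1
        hits[query_type] += bool(hit)
    return {t: hits[t] / totals[t] for t in totals}

# Invented outcomes: easy queries dominate and hide the weak slices.
results = (
    [("easy", True)] * 9 + [("easy", False)]
    + [("exact-match", True), ("exact-match", False)] * 2
    + [("negation", True)] + [("negation", False)] * 3
)
print(recall_by_slice(results))
# {'easy': 0.9, 'exact-match': 0.5, 'negation': 0.25}
```

The overall hit rate here is 12 out of 18, which looks tolerable; the per-slice view shows that negation queries fail three times out of four.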

Going deeper · Pinecone RAG Evaluation: Don't let customers tell you first, pinecone.io/.../rag-evaluation · BEIR (nDCG/MRR/recall standard metrics), arXiv:2104.08663

// Capstone · Quantify Your RAG Retrieval Quality in One Run

Chain the four points into a runnable diagnostic pipeline — spend two hours over a weekend and never "tune RAG by feel" again:

Label a golden set: pull 100 real queries from production logs, manually label (query, correct chunk_id). Cover the long tail: a few each of exact-match (codes/error codes), multi-hop, negation. This is the only one-time human cost.
Draw the recall curve (§1): run recall_at_k, inspect @5/@20/@100/@200. Decide whether the bottleneck is "retrieval" (@200 low) or "ordering" (@100 high, @5 low).
Add hybrid search (§2): stack BM25 on pure dense, fuse with RRF, redraw the curve. See how much recall@100 rises — exact-match queries should clearly improve.
Add a reranker (§3): cross-encoder rerank the top-100 candidates, see how much recall@5 / nDCG@5 jumps. This jump is usually the best value.
Layered attribution (§4): run diagnose, tally RETRIEVAL_FAIL vs GENERATION_FAIL, lock in the next battleground.
Freeze it into CI: hang this on the eval gate from Day 42 — from now on every chunk/embedding/prompt change flips red if recall@5 drops, ending "fix one, break three."
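The last step can be approximated with a small regression check that a CI job calls after running the eval. This is a standalone sketch, not the Day 42 gate itself: `check_recall_gate`, the metric names, the baseline values, and the 0.01 allowed drop are all assumptions.

```python
def check_recall_gate(current, baseline, max_drop=0.01):
    """Return the metrics that fell more than `max_drop` below the stored
    baseline. An empty list means the gate passes."""
    regressions = []
    for name, base in baseline.items():
        now = current.get(name, 0.0)      # a missing metric counts as 0
        if base - now > max_drop + 1e-9:  # epsilon absorbs float noise
            regressions.append(f"{name}: {base:.3f} -> {now:.3f}")
    return regressions

# Hypothetical numbers: the stored baseline vs. the run on this change.
baseline = {"recall@5": 0.84, "recall@100": 0.96}
current  = {"recall@5": 0.80, "recall@100": 0.96}
failures = check_recall_gate(current, baseline)
print(failures)   # ['recall@5: 0.840 -> 0.800']
```

In a CI job, a non-empty list would translate into a non-zero exit code, for example by raising `SystemExit` with the joined messages.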

Once this is done, your understanding of your own RAG shifts from "feels okay" to "recall@5=0.84, bottleneck is exact-match recall, next step is BM25" — that's the line between engineering and superstition.

// KEY TERMS

Recall@k: Share of queries whose correct document lands in the top-k candidates. The core KPI of stage one.
Precision: Share of returned results that are relevant. The metric owned by stage two (reranker / LLM).
Two-Stage Retrieval: Standard architecture: over-recall with a cheap retriever, then rerank precisely with an expensive one.
Bi-Encoder: Query and doc encoded independently into vectors; precomputable and fast, but no token-level interaction. Used for vector search.
Cross-Encoder: Concatenates (query, doc) into one sequence and attends across both jointly; high precision but cannot be precomputed. Used for reranking.
BM25: Classic lexical/sparse retrieval; strong exact-term hits, no semantics. The lexical half of hybrid search.
RRF: Reciprocal Rank Fusion. Fuses multiple retrievers using only rank, not score — normalization-free, zero tuning.
nDCG / MRR: Position-weighted ordering-quality metrics. Evaluate "ordered right," not just "recalled."
Contextual Retrieval: Anthropic 2024 method: use an LLM to write a short, chunk-specific context from the whole document and prepend it to each chunk before embedding and BM25 indexing, cutting retrieval failure rate.
Faithfulness: Whether the generated answer stays true to the retrieved docs (no hallucination). The core generation-layer metric.

// DEEPER THINKING

Is "recall before precision" universal, or are there counterexamples? When should you sacrifice recall for precision?

Not universal. It holds on the premise that "a downstream reranker + LLM can digest noise." Counterexamples: (1) no reranker and context is expensive: stuffing noise straight into the LLM's context crowds a small window and invites lost-in-the-middle, diluting the correct doc with irrelevant ones; here high precision matters more than high recall. (2) directly showing retrieval results to users (e.g. a search box): users won't scroll 100 results; top-3 precision is the experience itself. The essence is "error recoverability": as long as downstream can repair over-recall, prioritize recall; once over-recall's cost is unrecoverable (polluting context / direct display), the priority flips.

RRF discards raw similarity scores entirely, using only rank. Doesn't that "information loss" throw away useful signal?

It does, but what it throws away is incomparable signal. Raw scores are incomparable across retrievers (unbounded BM25 vs bounded cosine), and forcing their use introduces scale bias. RRF's philosophy: rather than a polluted strong signal, use a clean weak signal (rank). The cost is insensitivity to within-group score gaps like "rank 1 is far stronger than rank 2" — if both retrievers rank a doc first, RRF can't tell which is more confident. Only when you need that granularity do you reach for learned fusion (e.g. an LTR model with scores as features), but that needs labels, training, and risks overfitting. RRF is "good enough at zero cost," the rational endpoint for 90% of cases.

Anthropic Contextual Retrieval prepends an LLM-generated, document-aware context to each chunk before embedding. How is that fundamentally different from just chunking larger?

Fundamentally different. Larger chunks pack more raw adjacent text into one vector, diluting the topic, adding noise, and still losing global context (the chunk still only sees its neighbors). Contextual Retrieval uses an LLM to generate a passage specifically explaining "what this chunk is about in the full document, which entity/section it belongs to" and concatenates that before embedding — it patches cross-chunk global reference. Classic example: a chunk says "the company's Q3 revenue grew 3%"; embedded alone, "the company" has no referent and is semantically vague; prepend "this is from ACME's 2023 annual report" and the query "ACME third-quarter results" can finally recall it. The cost is running an LLM per chunk at preprocessing (amortized via prompt caching).
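The preprocessing step described above can be sketched as follows. This is an illustration under assumptions, not Anthropic's implementation: the prompt wording is my own, `contextualize` is a hypothetical helper, and `generate` stands for any callable that maps a prompt string to a completion string. A stub replaces the real model call so the sketch runs offline.

```python
def contextualize(document, chunk, generate):
    """Return the chunk with a short, LLM-written situating context prepended.

    `generate` is a hypothetical interface: prompt string in, text out.
    """
    prompt = (
        "Here is a document:\n" + document + "\n\n"
        "Here is one chunk from it:\n" + chunk + "\n\n"
        "In one or two sentences, say where this chunk sits in the document "
        "and which entity or section it refers to."
    )
    context = generate(prompt).strip()
    # The combined text is what gets embedded and indexed, not the bare chunk.
    return f"{context}\n\n{chunk}"

def fake_llm(prompt):
    """Stub standing in for a real model call."""
    return "This chunk is from ACME's 2023 annual report, Q3 results section."

text = contextualize("ACME 2023 annual report ...",
                     "The company's Q3 revenue grew 3%.", fake_llm)
print(text)
```

The original chunk text is kept intact after the generated context, so exact-match terms in the chunk remain searchable while the prefix supplies the missing referent.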

Why should the judge of generation quality be a "different model"? What concrete bias does the same LLM answering and judging introduce?

The core is backward rationalization: a model evaluating an answer it just generated tends to find reasons to favor its own output and give high marks, because the generation and evaluation paths share the same priors and blind spots — the "facts" it fabricated still "look right" on self-eval. Concrete biases: (1) self-preference, systematically overrating its own answers; (2) shared hallucination, content fabricated at generation is undetectable at evaluation. Mitigations: use a judge from a different model family (breaking shared priors), give the judge a reference answer to anchor on (turning open scoring into comparative scoring), or use pairwise comparison instead of absolute scores. Day 6's LLM-as-judge debiasing covered this; faithfulness judgment in retrieval eval applies the same playbook.

If context windows grow to 10M tokens and become cheap enough to stuff the whole corpus, does retrieval quality engineering still matter?

Still matters, but the emphasis shifts. Even if you can stuff the whole corpus, three problems don't vanish: (1) lost-in-the-middle — info utilization in long-context middles systematically drops, so the more you stuff, the greater the risk of "drowning" the key doc, making ordering/filtering more important; (2) cost and latency — processing 10M tokens per query has an unacceptable bill and TTFT, so retrieval is a necessary pre-filter; (3) attribution and control — retrieval gives explicit citation sources, while stuffing everything makes tracing impossible and breaks access control. So retrieval won't disappear, but it shifts from "we must select because it won't fit" to "we actively select for precision, cost, and traceability." The recall/precision trade-off still holds; the value space of k just grows larger.

// FURTHER READING

Anthropic · Introducing Contextual Retrieval — hybrid + context + rerank, three layers of measured failure-rate cuts (35%/49%/67%)
Pinecone · Rerankers and Two-Stage Retrieval — bi-encoder vs cross-encoder and the two-stage architecture
Pinecone · RAG Evaluation — the engineering method of layered retrieval/generation eval
Cormack et al. · Reciprocal Rank Fusion (SIGIR 2009) — the original RRF paper, the foundational fusion method
Thakur et al. · BEIR (NeurIPS 2021) — heterogeneous zero-shot retrieval benchmark and standard IR metrics