AI/ML Deep Dive: RAG Systems

Day 4 · 2026-05-21
For: engineers with coding experience but no AI background

Embedding

RAG · Basics
One-line intuition

Think of it as hashing a sentence into a 1,536-dimensional space, except this hash doesn't aim to minimize collisions; it aims to put semantically close sentences at close coordinates. The cosine similarity between two vectors (the cosine of the angle between them) serves as a proxy for their semantic similarity; cosine distance is simply 1 minus that value.

What problem it solves

Traditional full-text search (grep, Elasticsearch) matches on literal terms: search "car" and you miss "automobile"; search "how to save money" and you miss "personal finance tips." Once you encode meaning as a numeric vector, "find documents with similar meaning" becomes "find points with similar coordinates," which is a database problem. This is the foundation of every RAG (Retrieval-Augmented Generation) system.

How it works (intuition)

Training uses massive "sentence pair + similar/not-similar label" data to learn a mapping f(text) → vector where "semantically similar → vectors near each other." Common dimensions are 384 / 768 / 1536 / 3072. At runtime, one API call turns any string into a vector.

# Intuition: similarity ≈ cosine of the angle between vectors
"The cat loves fish"          → [0.12, -0.34, 0.88, ...]   ┐
"My cat loves canned tuna"    → [0.15, -0.30, 0.91, ...]   ├ small angle ⇒ similarity ≈ 0.92
"Rate hikes hit the housing"  → [0.71, 0.05, -0.42, ...]   ┘ large angle vs above ⇒ similarity ≈ 0.08

Analogy: the classic Word2Vec example — king - man + woman ≈ queen. Semantic relations get encoded as directions in vector space.
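
The analogy can be made concrete with a toy sketch. The 2-D vectors below are hand-picked for illustration (axis 0 stands for "royalty," axis 1 for "gender"); they are not learned embeddings, and the `analogy` helper is hypothetical.

```python
import numpy as np

# Toy 2-D "embeddings": axis 0 = royalty, axis 1 = gender (hand-picked, not learned)
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the word closest to vocab[a] - vocab[b] + vocab[c], excluding the inputs."""
    target = vocab[a] - vocab[b] + vocab[c]
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos_sim(candidates[w], target))

result = analogy("king", "man", "woman")
print(result)
```

In real embedding spaces the match is approximate rather than exact, which is why the nearest neighbor is searched for instead of testing equality.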

Code example
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text):
    # One-liner: string → 1536-dim vector
    r = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(r.data[0].embedding)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = embed("How to limit my child's screen time")
b = embed("Strategies for managing kids' device usage")
c = embed("How Fed rate hikes affect the stock market")

print(cos_sim(a, b))  # noticeably higher: same topic, different wording
print(cos_sim(a, c))  # noticeably lower: unrelated topic (exact values vary by model)
Common misconception
Embedding ≠ understanding. What it encodes is "statistical patterns of co-occurring meanings," so "praising X" and "ridiculing X" land close (both discuss X). When your RAG returns "right topic, wrong stance," that's the inherent limit of embeddings — fix it with re-ranking or a follow-up classifier, not by swapping embedding models.
Where you see it
Canonical: an internal knowledge base searched by meaning — an employee searches "how do I request PTO" and finds a document titled "Leave Management Policy."
Closer to home: embed three years of your WeChat saves and Notes into a vector store, and when you later want "that article about compounding I read once," a vague description recalls it.
English Summary
An embedding maps text into a high-dimensional vector such that semantically similar texts land near each other. It turns "find similar meaning" into a geometry problem — typically cosine similarity between vectors — and is the foundation layer of every modern semantic search and RAG system.
Think it through
1. Why isn't higher embedding dimension always better? In production, how do you choose 1536 vs 3072?
High dimensions carry three costs: (a) storage grows linearly with dimension, so 3072 dims take twice the space of 1536, which reaches terabytes at scale; (b) retrieval latency rises, because every distance computation costs more and ANN indexes (HNSW, IVF) become less effective as dimensionality grows (the "curse of dimensionality"); (c) diminishing returns: 3072 over 1536 usually gains only 2-3 points at most on typical tasks. Decision rule: prototype with small dimensions (384/768) and upgrade only when recall is genuinely insufficient. Matryoshka embeddings are a newer trend: the model is trained so that the first N dimensions work on their own, letting you truncate on demand to trade quality for cost.
2. Is the embedding here the same as Day 1's "token embedding" inside a Transformer?
Same family, different use. Inside a Transformer, every token has an embedding from a lookup table — that's "word-level," and it changes after every attention layer (contextual embedding). In RAG, "embedding" usually means "sentence/paragraph-level" — compressing a whole passage into a fixed vector, typically from the last layer's [CLS] token or mean-pooled across all tokens. Sentence Transformers and the OpenAI embedding API are Transformer architectures fine-tuned specifically for this — "make similar sentences land close."
3. Can you embed code as text and do "semantic code search"?
Yes, but not great. General-purpose embeddings are trained mostly on natural language and have weak recall for code semantics ("is this doing authentication?" "is this a deserialization vuln pattern?"). Production setups use three layers: (a) dedicated code embedding models like voyage-code or jina-code-embeddings; (b) generate a "function description" with an LLM first, then embed the description instead of the raw code; (c) hybrid retrieval — BM25/AST for literal code, embedding for descriptions, then fuse. GitHub Copilot's internal search uses a hybrid strategy.
4. If two passages are 99% identical (two versions of a legal clause), embeddings are nearly identical, but the user wants to distinguish them. What now?
That's an embedding blind spot — it's insensitive to fine differences. Three approaches: (a) inject a diff signal at retrieval — compute the character diff between versions and embed the "delta snippet" separately; (b) use finer chunking so the changed sentence is its own chunk; (c) store metadata (version number, effective date) and pre-filter on it before semantic match. Lesson: embeddings are for "find similar," not "find differences" — compliance and version-comparison tasks need a different approach.
5. A company wants to use embeddings to find "similar companies to a prospect." Same as search technically — but what business pitfalls should they watch for?
Technically the same: stitch each company's description, industry, and size into text, embed, run nearest neighbors. Business pitfalls: (a) description bias — Company A says "AI-powered data platform," Company B says "database vendor"; embeddings see them as different even if they're direct competitors. Augment with SEC filings, website scrapes. (b) similarity ≠ conversion likelihood — companies that look most similar may already be covered by the same vendor; companies that look less similar but carry specific buying signals often convert better. (c) privacy and compliance — B2B data must be legally sourced (GDPR, CCPA). Embeddings are a tool; business rules still need post-filtering for the final decision.
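
To make question 1 concrete, here is a minimal sketch of Matryoshka-style truncation plus the storage arithmetic. It assumes the model was trained so that a prefix of the vector is usable on its own; the random vector is only a stand-in for a real embedding, and both helper functions are hypothetical.

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and re-normalize to unit length."""
    head = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

def storage_gib(num_vectors, dims, bytes_per_value=4):
    """Raw float32 storage in GiB, ignoring index overhead."""
    return num_vectors * dims * bytes_per_value / 2**30

rng = np.random.default_rng(0)
full = rng.normal(size=3072)            # stand-in for a real 3072-dim embedding
short = truncate_embedding(full, 1536)  # keep only the first 1536 dims

print(short.shape, float(np.linalg.norm(short)))
print(storage_gib(10_000_000, 3072), storage_gib(10_000_000, 1536))
```

Re-normalizing after truncation keeps cosine similarity and dot product interchangeable. Truncating a model that was not trained this way will likely lose far more quality.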

Vector Database

RAG · Infra
One-line intuition

It's MySQL where the WHERE clause "= or LIKE" gets replaced by "ORDER BY cosine distance LIMIT 10" — a database whose indexes are optimized for "find the nearest few among millions of vectors."

What problem it solves

Once you have embeddings, "find the K most similar documents" becomes "compute cosine similarity against millions of 1,536-dim vectors and take top-K." Brute force is O(N) — seconds at scale. Unacceptable online. Vector databases (Pinecone, Weaviate, Milvus, pgvector) use Approximate Nearest Neighbor (ANN) indexes to push queries down to milliseconds.
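
For reference, the brute-force baseline that ANN indexes try to beat might look like the sketch below. The data is synthetic and `brute_force_top_k` is an illustrative helper, not part of any vector database API.

```python
import numpy as np

def brute_force_top_k(query, matrix, k):
    """Exact top-k by cosine similarity: one pass over all N rows, O(N * d)."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = m @ q
    top = np.argpartition(-sims, k - 1)[:k]   # the k best, in arbitrary order
    top = top[np.argsort(-sims[top])]         # sort just those k by similarity
    return top, sims[top]

rng = np.random.default_rng(42)
vectors = rng.normal(size=(10_000, 64))                # synthetic "document" vectors
query = vectors[123] + 0.01 * rng.normal(size=64)      # a slightly perturbed copy of row 123
ids, scores = brute_force_top_k(query, vectors, k=5)
print(ids, scores)
```

Every query touches every row, so cost grows linearly with corpus size. That is fine for thousands of vectors and is exactly what becomes too slow at millions.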

How it works (intuition)

The most common index is HNSW (Hierarchical Navigable Small World): organize vectors as a multi-layer graph — sparse top layer with hub nodes, dense bottom layer with all points. Queries hop quickly through the top layer to the rough region, then descend to refine. Like a skip list of pointers, in vector space.

# Traditional SQL                      # Vector DB
SELECT * FROM docs                      collection.query(
WHERE category = 'finance'      ←→        query_vector=embed(q),
ORDER BY title;                           filter={"category":"finance"},
                                          top_k=10)
# Key difference:
# Traditional DB index = B+ Tree, exact-field ordering
# Vector DB index = HNSW/IVF, approximate ordering by high-dimensional distance

Notice "approximate" — for speed, vector DBs do not guarantee 100% recall of the true top-K. 95%+ recall is usually enough for everyday use.
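
One way to see what "approximate" costs is to measure recall against exact search. The sketch below uses a toy IVF-style shortcut (scan only the cluster with the nearest centroid) rather than HNSW, with random centroids instead of k-means; all names and data are illustrative assumptions.

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    exact = set(int(i) for i in exact_ids)
    found = set(int(i) for i in approx_ids)
    return len(exact & found) / len(exact)

def exact_top_k(query, vectors, k):
    sims = vectors @ query                    # vectors are unit-normalized below
    return np.argsort(-sims)[:k]

def coarse_top_k(query, vectors, centroids, assignments, k):
    """Toy IVF-style search: only scan the cluster whose centroid is nearest."""
    nearest = int(np.argmax(centroids @ query))
    members = np.flatnonzero(assignments == nearest)
    sims = vectors[members] @ query
    return members[np.argsort(-sims)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5_000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
centroids = vectors[rng.choice(len(vectors), size=8, replace=False)]  # random picks, no k-means
assignments = np.argmax(vectors @ centroids.T, axis=1)

query = vectors[0]
exact = exact_top_k(query, vectors, k=10)
approx = coarse_top_k(query, vectors, centroids, assignments, k=10)
print(f"recall@10 = {recall_at_k(approx, exact):.2f}")
```

The shortcut scans roughly one eighth of the data and misses any true neighbor that landed in another cluster. Real indexes expose knobs (more clusters probed, larger search beams) to buy that recall back with latency.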

Code example
import chromadb
from openai import OpenAI

openai = OpenAI()
client = chromadb.Client()
col = client.create_collection("my_notes")

# 1) Write: text + auto-embedding + metadata
docs = ["Compounding is the eighth wonder of the world", "Screen time affects children's attention"]
embeds = [openai.embeddings.create(model="text-embedding-3-small",
                                   input=d).data[0].embedding for d in docs]
col.add(documents=docs, embeddings=embeds,
        metadatas=[{"topic": "finance"}, {"topic": "parenting"}],
        ids=["n1", "n2"])

# 2) Query: natural language → top-K relevant items
q_emb = openai.embeddings.create(model="text-embedding-3-small",
                                 input="how to make money work for me").data[0].embedding
hits = col.query(query_embeddings=[q_emb], n_results=3,
                 where={"topic": "finance"})  # metadata filter
print(hits["documents"])  # one inner list per query, e.g. [["Compounding is the eighth wonder of the world"]]
Common misconception
"You need Pinecone/Milvus from day one" is wrong. Under 100K documents, Postgres + pgvector, SQLite + sqlite-vec, or even an in-memory numpy array or FAISS index is plenty and close to zero-ops. Dedicated vector databases shine at billion-scale data, low latency, high QPS, and multi-tenant isolation. Pulling them in early bloats architectural complexity. For MVPs, start with embedded options (Chroma, LanceDB, sqlite-vec).
Where you see it
Canonical: customer-support bot knowledge bases — index all FAQs and ticket history; user question → similar items → fed to LLM for an answer.
Closer to home: embed all your research reports and earnings summaries; before an investment decision, search "past views on semiconductor cycles" in natural language to avoid contradicting your own forgotten opinions.
English Summary
A vector database stores high-dimensional embeddings and supports approximate nearest-neighbor (ANN) queries in milliseconds. It typically uses HNSW or IVF indexes, supports metadata filtering, and serves as the storage layer for RAG, semantic search and recommendation systems.
Think it through
1. Does "approximate" recall introduce safety risk? Are vector DBs usable for legal or medical scenarios?
Yes, but manageable. A well-tuned HNSW index typically delivers 95%+ recall, meaning roughly 5% of the true top-K neighbors can be missing from a result. For a query like "the latest 2024 regulation on X," that gap can drop a critical clause. Mitigations: (a) raise the index's ef_search parameter, trading speed for recall; (b) use exact (brute-force) search for critical paths, which is feasible at small scale; (c) hybrid retrieval, applying a precise keyword/metadata filter before vector search. Medical and compliance contexts demand "explainable + auditable," so a rule engine usually layers on top for final gating.
2. PostgreSQL's pgvector vs dedicated Pinecone — what really differs?
Three dimensions: (a) performance ceiling — pgvector handles up to ~10M well; beyond that needs sharding, while dedicated stores ship kernel-level distributed ANN optimizations; (b) consistency model — pgvector inherits Postgres transactions (vectors and business data strongly consistent in one DB); dedicated stores are usually eventually consistent; (c) ops cost — pgvector reuses your Postgres team and monitoring, while dedicated stores add a new dependency. Rule: 90% of projects do fine with pgvector; only the 10% that are truly "vector-first" (e.g. billion-scale image search) need dedicated infrastructure.
3. If a document is "updated from the user's perspective" (a wiki paragraph changes), how do you keep the vector store in sync?
This is where RAG productionization most often breaks. Common patterns: (a) push — when the source system updates a doc, fire an event; a consumer re-embeds and upserts to the vector store; (b) pull — periodically (e.g. hourly) scan the source's updated_at, compute diff, batch update; (c) versioned chunks — each chunk carries source_id + version, update by delete-by-source_id then insert new. All three must handle "embedding API throttling / retries / partial failure." Key metric: embedding lag (time from source update to vector store visibility).
4. Are vector DBs a good fit for "user behavior logs" for behavioral similarity matching?
Depends. Short-term real-time behavior ("the 5 items just browsed") is a poor fit — too short, weak semantic signal, and embedding-API cost gets amplified by high event throughput; collaborative filtering + inverted indexes are better. Long-term profiles (a year of purchases/searches) embed well — summarize the sequence then embed for a "user interest vector," useful for cross-category recommendations. Watch out: embeddings smooth user profiles, drowning niche interests and losing long-tail preferences — so embeddings are usually one recall channel in recommendation systems, with a ranking model making the final call.
5. Is treating a vector DB as LLM "long-term memory" an accurate metaphor? Where does it break?
Intuitive but imperfect. It's closer to "external disk + fuzzy search" than to actual memory. Two key differences: (a) memory has "automatic forgetting + associative reconstruction" — important things get reinforced, unimportant things fade. A vector DB doesn't forget or associate by default — you have to engineer TTLs, importance fields, and reflection jobs for summarization; (b) human memory is episodic (time, place, emotional context) + semantic (abstract concepts); a vector DB only encodes semantic similarity, losing temporal and causal relations. "Long-term memory for agents" typically needs three layers — graph + summarization + retrieval — with the vector DB as just one layer.
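
The "versioned chunks" pattern from question 3 can be sketched with an in-memory stand-in for the vector store. `InMemoryVectorStore`, `fake_embed`, and `sync_document` are all hypothetical names; a real system would call an embedding API and the store's own delete/upsert operations.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    source_id: str
    version: int
    text: str
    vector: list

@dataclass
class InMemoryVectorStore:
    chunks: dict = field(default_factory=dict)   # chunk_id -> Chunk

    def delete_by_source(self, source_id):
        stale = [cid for cid, c in self.chunks.items() if c.source_id == source_id]
        for cid in stale:
            del self.chunks[cid]
        return len(stale)

    def upsert(self, chunk_id, chunk):
        self.chunks[chunk_id] = chunk

def fake_embed(text):
    """Placeholder for a real embedding call; returns a tiny deterministic vector."""
    return [len(text), sum(map(ord, text)) % 97]

def sync_document(store, source_id, version, paragraphs):
    """Replace all chunks of one source document with its new version."""
    # Embed first, so a failed embedding call leaves the old chunks untouched.
    new_chunks = [Chunk(source_id, version, p, fake_embed(p)) for p in paragraphs]
    store.delete_by_source(source_id)
    for i, chunk in enumerate(new_chunks):
        store.upsert(f"{source_id}:v{version}:{i}", chunk)

store = InMemoryVectorStore()
sync_document(store, "wiki-42", 1, ["Old intro.", "Old details.", "Old appendix."])
sync_document(store, "wiki-42", 2, ["New intro.", "New details."])
print(sorted(store.chunks))
```

Deleting by source_id matters when the new version has fewer chunks than the old one: a plain upsert by chunk id would leave the old third chunk behind as a stale hit.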

Retrieval Strategies

RAG · Engineering
One-line intuition

Like running EXPLAIN before a SQL query: plain "embedding + top-K" works in toy scenarios, but production mixes keyword search, metadata filters, query rewriting, and chunking strategies — tuning a multi-layer "retrieve → rank → generate" pipeline like you'd tune a multi-level cache hit rate.

What problem it solves

The first lesson after shipping a naive RAG: 80% of wrong answers come from "the correct document was never retrieved," not from the LLM being weak. Retrieval strategies tackle exactly this: how to chunk documents, how to rewrite queries, how to mix vector + keyword + metadata (hybrid search), single-shot or multi-hop retrieval — all in service of aligning "what the user asks" with "the relevant slice of the knowledge base."

How it works (intuition)

Production RAG pipeline:

User question q
   ↓ Query rewriting
      LLM turns "where's that note about compounding from last week?" into
      "compounding investing long-term returns" — better for embedding recall
   ↓ Hybrid retrieval
      (1) Vector top 20    ─┐
      (2) BM25 top 20      ─┼→ Union (~30-40 items)
      (3) Metadata filter  ─┘
   ↓ Re-ranking (next section)
      A more precise model trims 30-40 down to top 5
   ↓ Paste into prompt → LLM generates answer
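
The merge step in the hybrid stage above can be a plain union, but a rank-based fusion is a common alternative. Below is a minimal sketch of Reciprocal Rank Fusion (RRF); the document ids are made up, and k=60 is the conventional default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7", "d2"]   # hypothetical vector top-4
bm25_hits = ["d1", "d9", "d3", "d4"]     # hypothetical BM25 top-4
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because only ranks are used, the cosine scores and BM25 scores never need to be put on a common scale. Documents found by both retrievers rise to the top.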

The other critical decision is chunking: split long documents into 200-800 token blocks before embedding. Too large → noisy recall; too small → loss of context. The common approach is "split by semantic boundaries with 50-100 token overlap between adjacent chunks."
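
A minimal sketch of fixed-size chunking with overlap follows. It splits on whitespace as a stand-in for a real tokenizer and ignores semantic boundaries, so treat it as the baseline that a boundary-aware splitter would refine; `chunk_tokens` is an illustrative helper.

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # the last window already reached the end of the document
    return chunks

text = " ".join(f"w{i}" for i in range(1000))
tokens = text.split()   # whitespace split as a stand-in for a real tokenizer
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
print([len(c) for c in chunks])
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of embedding some tokens twice.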

Code example
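# Import paths vary across LangChain versions; these match the split-package layout
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Placeholder corpus and model: swap in your own chunks and chat model
docs = [Document(page_content="Compounding rewards long-term investing"),
        Document(page_content="Screen time affects children's attention")]
llm = ChatOpenAI(model="gpt-4o-mini")

# 1) Set up vector and BM25 retrievers over the same documents
vec_retriever = Chroma.from_documents(docs, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 10})
bm25_retriever = BM25Retriever.from_documents(docs, k=10)

# 2) Hybrid: weighted rank fusion (0.6 vector + 0.4 keyword)
hybrid = EnsembleRetriever(
    retrievers=[vec_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)

# 3) Query rewriting: let the LLM rephrase user input into a more searchable form
def retrieve(user_question):
    # Chat models return a message object; .content holds the rewritten text
    rewritten = llm.invoke(
        f"Rewrite this question as keyword search: {user_question}").content
    return hybrid.invoke(rewritten)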
Common misconception
"A bigger embedding model = better recall." Truth: 80% of recall bottlenecks live in chunking and query rewriting, not in the embedding. Evaluate your retrieval hit rate first (label "correct documents" for 100 queries, see if they land in top-10), root-cause the misses, then decide whether to fix chunking, add hybrid, add a reranker, or actually swap embeddings. Blindly upgrading embeddings is "faster CPU to fix slow database queries."
Where you see it
Canonical: legal/medical QA bots — single-vector retrieval often misses clauses with specialized terminology, so hybrid + metadata filters are required.
Closer to home: your work notes mix Chinese prose with English jargon; a single-language embedding misses cross-language matches — hybrid + multilingual embedding models close the gap.
English Summary
Retrieval strategies in RAG go beyond simple vector top-K: chunking, query rewriting, hybrid (dense + sparse) retrieval, metadata filtering, and multi-hop search are all tools to align "the question asked" with "the knowledge actually relevant". Most RAG failures trace back to retrieval, not generation.
Think it through
1. Why does raising top-K (e.g. from 5 to 50) often degrade the final answer?
Three reasons: (a) context noise — the LLM passively absorbs everything; more irrelevant documents dilute attention and bury the right answer; (b) position bias — LLMs pay least attention to the middle of long contexts ("lost in the middle"); higher K pushes critical docs into the blind spot; (c) cost and latency — 50 chunks may total 20K tokens, multiplying cost and latency. Right move: keep K small (5-10) and improve precision in that small K via re-ranking, not by widening recall.
2. Is there a "universal best" chunk size? Why can't token count be one-size-fits-all?
No universal value — depends on task and corpus structure. Rules of thumb: QA (short queries, precise answers) → 200-400 token chunks; summarization (needing global view) → 800-1,500 token chunks; code → cut by function/class boundaries, not token count; tables and lists → preserve structure, don't cut mid-row. The advanced approach is "semantic chunking" — find natural boundaries via embedding similarity, but it's complex. Most practical workflow: start with a fixed size (e.g. 512 tokens + 50 overlap), build a retrieval eval set, then A/B different chunking on failure cases.
3. In hybrid retrieval, how much weight does "BM25 + vector" deserve? How does it depend on the corpus?
It depends on "keyword decisiveness." Legal, medical, academic — proper nouns dominate, so BM25 weight should be high (0.5-0.7). Conversational text and product reviews — synonyms and flexible phrasing rule, so vector weight should be high (0.7-0.8). Mixed code + documentation — function and API names need exact match (BM25), context explanations need meaning (vector), so 50:50 is common. A more robust approach is learnable weights — annotate "question → correct document" pairs and use grid search or Bayesian optimization. Reciprocal Rank Fusion (RRF) and similar rank-based fusions are more robust than weighted scores because they don't depend on score scale alignment.
4. Tie this back to Day 3's ReAct — how does multi-hop retrieval really differ from ReAct?
Multi-hop retrieval is "use the LLM to plan the next query during retrieval" — a specialization of ReAct on the "information-gathering subtask." For example, "What's the valuation of a rival founded around the same time as another company founded by OpenAI's founders?" requires first finding OpenAI's founders, then their other companies, then competitors, then valuations — each hop's query depends on the previous hop's answer. ReAct is the general framework (tools beyond retrieval); multi-hop retrieval narrows "tools" to "retrieval APIs," typically preset to 2-3 hops. Production often nests both: the Agent uses ReAct to decide "fetch docs vs compute now," and when it fetches docs, it internally runs a multi-hop loop.
5. From a product perspective, why aren't "retrieval accuracy" and "user satisfaction" always correlated?
Three divergence scenarios: (a) the user asks a complex philosophical question, where the "correct document" doesn't help; even 100% recall can't produce a satisfying answer because the bottleneck is generation; (b) the user asks "what's the weather today," for which no correct document exists; recall is 0 by definition, but the user expects the model to call a weather API or admit it doesn't know, so the query should route to a non-RAG path; (c) the user asks "any objections to X" and all retrieved docs are positive views: technically accurate but one-sided. That's a diversity-of-recall problem, commonly addressed with MMR (Maximal Marginal Relevance). Lesson: evaluating RAG with retrieval recall@K alone is a trap. End-to-end answer correctness and user satisfaction matter; classify queries and route them through different pipelines.
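
For the diversity problem in question 5, a greedy MMR selection can be sketched as follows. The 2-D vectors are hand-made to show the effect, and `lambda_` (the relevance-versus-diversity weight) is a tunable assumption, not a recommended value.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k, lambda_=0.5):
    """Greedy MMR: pick docs relevant to the query but dissimilar to those already picked."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    relevance = d @ q          # cosine similarity of each doc to the query
    doc_sims = d @ d.T         # pairwise doc-to-doc cosine similarity
    selected = []
    remaining = list(range(len(d)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((doc_sims[i, j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.30],    # 0: most relevant
    [1.0, 0.31],    # 1: near-duplicate of doc 0
    [1.0, -0.50],   # 2: relevant, but from a different angle
    [0.0, 1.00],    # 3: off-topic
])
diverse = mmr(query, docs, k=2, lambda_=0.5)
relevance_only = mmr(query, docs, k=2, lambda_=1.0)
print(diverse, relevance_only)
```

With `lambda_=1.0` the selection degenerates to plain top-K and returns the near-duplicate pair; lowering it swaps the duplicate for the document that covers a different angle.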

Re-ranking

RAG · Optimization
One-line intuition

Like a two-stage hiring funnel: HR screens 100 résumés with keywords (coarse), interviewers read 5 résumés closely (fine). The reranker is the "interviewer" in a RAG pipeline — a slower but more accurate model that trims 30-50 retrieved chunks down to the most relevant 5.

What problem it solves

Embedding retrieval is fast but "coarse" — it compresses a passage into a single vector, losing detail. Re-rankers reverse the trade: feed pairs of (query, candidate document) into a more precise model (typically a cross-encoder) that outputs a "relevance score." Higher compute cost, but only on the few dozen recalled candidates, so total cost stays sane. Adding a reranker often lifts end-to-end RAG answer correctness by 10-20 points.

How it works (intuition)

Bi-encoder (embedding) vs cross-encoder (re-ranker):

# Bi-encoder (embedding): query and doc encoded independently
emb_q   = encode(query)        ─┐
emb_doc = encode(doc)          ─┼→ cosine_similarity → score
# Pros: docs can be pre-embedded; only query is embedded at runtime; O(1) compare
# Cons: query and doc don't see each other — fine-grained matching is poor

# Cross-encoder (re-ranker): query and doc concatenated and fed to the model
score = model(f"[CLS] {query} [SEP] {doc}")
# Pros: every layer lets query and doc tokens attend to each other — much sharper matching
# Cons: every pair runs the model — N×M complexity, no full-corpus scans

Hence the two-stage approach: bi-encoder narrows millions to dozens; cross-encoder ranks the dozens finely.

Code example
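from sentence_transformers import CrossEncoder
import cohere  # only needed for Option B, Cohere's hosted reranker API

# Option A: local cross-encoder (open source)
reranker = CrossEncoder("BAAI/bge-reranker-base")
query = "How to help my kid focus on homework"
# Placeholder candidates; in practice, ~30 chunk texts from the vector store
candidates = ["Short homework sessions with breaks improve focus",
              "How Fed rate hikes affect the stock market",
              "Reducing screen time before study hours"]

# Score every (query, doc) pair jointly, then keep the best five
scores = reranker.predict([(query, d) for d in candidates])
top5 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:5]  # (text, score) pairs

# Option B: hosted API (less work)
co = cohere.Client()  # pass api_key=... or set CO_API_KEY (read by newer SDKs)
result = co.rerank(query=query, documents=candidates,
                   top_n=5, model="rerank-multilingual-v3.0")
top5 = [candidates[r.index] for r in result.results]  # map result indices back to texts

# Splice top5 into the prompt and let the LLM produce the final answer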
Common misconception
"Use an LLM to judge 30 candidates' relevance directly" instead of a dedicated reranker — works, but it's expensive and slow. Trained cross-encoder rerankers (BGE-reranker, Cohere Rerank) are usually only a few hundred million parameters, far smaller than an LLM, and run in tens of milliseconds. They are also trained specifically for scoring, so they're both faster and more accurate than an LLM moonlighting as a scorer. Don't replace them unless your reranker's error rate exceeds your tolerance.
Where you see it
Canonical: the "recall → precision" two-stage architecture of search and recommendation systems; GitHub's code search uses the same idea.
Closer to home: an investment-research assistant — recall the past six months of relevant reports (30), then re-rank for "most relevant to today's question" to avoid being drowned by stale takes or generic write-ups.
English Summary
Re-ranking is the second stage of a RAG pipeline: a cross-encoder model jointly scores each (query, document) pair to refine the top candidates from a fast first-stage retriever. It trades extra compute on a small set for substantially better precision at top-K — often the highest-ROI optimization in production RAG.
Think it through
1. If cross-encoders are more accurate, why not use them for retrieval directly and drop the two-stage design?
Complexity explodes. Cross-encoders concatenate query with every candidate doc and run the model — a 1M-document corpus needs 1M model passes per query, taking minutes and costing too much. Bi-encoders let you pre-embed docs and just embed the query at runtime + ANN search, pushing 1M-scale queries to 5 ms. Two-stage retrieval — "cheap coarse + expensive precise" — is what every large-scale search system (Google, Bing, Amazon) does. Traditional search calls it L1/L2 ranking; vector retrieval changed the implementation, not the idea.
2. Rerankers also make mistakes. How do you verify it actually "earned its keep" without degrading the retrieval stage?
Build a ground-truth eval set. Minimum viable: handpick 50-100 real queries and label each with its "ideal top-3 documents." Then run two pipelines and score both on the same metrics (hit rate, MRR, NDCG@5, Recall@5): (a) the retriever's top-K alone; (b) retriever + reranker. The reranker has earned its keep only if (b) beats (a) on those metrics. Common counterintuitive finding: rerankers can boost in-domain data by around 15% but hurt on out-of-domain data (different language, different domain) due to training-distribution shift. So always A/B on your own business data before shipping; don't trust marketing benchmarks alone.
3. Using an LLM as the reranker sounds reasonable (flexible and general). When is that a good call?
Four cases where LLM-as-reranker actually wins: (a) complex business rules — "rank by relevance + freshness + user preference"; describing the rules to an LLM is easier than training a cross-encoder; (b) low traffic (under 1K queries/day), where training and deploying a dedicated reranker isn't worth it; (c) multilingual or multi-domain content with no off-the-shelf cross-encoder; (d) when ranking explanations are required (interpretability). Cost: 50-100× per query, latency goes from 50 ms to 1-2 s. In production with heavy traffic, use a dedicated reranker; reserve the LLM for long-tail queries or fallbacks.
4. Tie this back to Day 2's RLHF — where does the reranker's training data come from, and is there a similar feedback loop?
Rerankers are usually trained with contrastive learning: each training sample is (query, positive_doc, negative_doc), pushing the positive's score up and the negative's down. Three data sources: (a) human-annotated query-doc pairs (expensive but high quality); (b) public datasets like MS-MARCO (general but domain-shifted); (c) click logs as weak supervision, with clicked items as positives and shown-but-not-clicked items as negatives. This creates a feedback loop similar to RLHF: user behavior becomes the next reranker's training signal. It also carries selection bias: users only click what's shown, so great unshown documents stay invisible. The fix is called counterfactual learning to rank, which combines some random exposure with inverse propensity scoring (IPS) weights to debias the click data.
5. If the query itself is vague ("what about that thing"), can the reranker still save the day?
Not really. Rerankers take (query, doc) pairs; when the query carries little information, no model can divine what the user actually meant. Vague queries need handling upstream: (a) multi-turn — have the LLM ask a clarifying question ("by 'that,' do you mean the compounding topic or the mortgage?"); (b) user profile + session history — append the last 5 turns' topic into the query before embedding; (c) query expansion — have the LLM rewrite the vague question into 3-5 concrete versions, retrieve each, then union. It's an information-theory limit: information absent from the query can't be recovered downstream. Place clarification/expansion before the reranker, not after.
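
The before/after comparison from question 2 needs only a few metric functions. The sketch below uses binary relevance labels; the document ids and both rankings are invented for illustration.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant docs that appear in the top k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc (0 if none); average over queries for MRR."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: each relevant doc contributes 1 / log2(rank + 1)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal else 0.0

# Hypothetical labels and pipeline outputs for a single query
relevant = {"d2", "d5"}
retriever_only = ["d9", "d2", "d7", "d1", "d8", "d5"]
with_reranker = ["d2", "d5", "d9", "d7", "d1", "d8"]

for name, ranked in [("retriever", retriever_only), ("retriever+reranker", with_reranker)]:
    print(name,
          recall_at_k(ranked, relevant, 5),
          reciprocal_rank(ranked, relevant),
          round(ndcg_at_k(ranked, relevant, 5), 3))
```

Averaging these per-query numbers over the whole eval set, for both pipelines, gives the A/B comparison described above.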