Day 33 · Hard · AI Backend · RAG · Agent Loop · Embedding Serving

AI Product Backend: RAG, Agentic Loops & Human-in-the-loop

RAG, Durable Agent Loops, Embedding Serving, Human-in-the-loop

Scenario + Requirements

Build a "document Q&A + agent assistant" backend for 500K enterprise users: users ask questions, the system retrieves from an internal knowledge base to answer; complex tasks ("compare these three contracts into a table and email it") go to an agent for multi-step execution. This is not "wrap an LLM API"—it is simultaneously a retrieval system, a long-running task orchestrator, and an exercise in fencing a non-deterministic model output inside controllable engineering boundaries.

High-level Architecture

graph TD
    U[User request] --> GW[API Gateway / auth·rate-limit]
    GW --> ORCH[Orchestrator<br/>Q&A or Agent route]
    ORCH -->|Q&A| RET[Retrieval]
    subgraph RET[Retrieval ≤300ms]
        EMB[Embedding service<br/>query vectorize] --> VDB[(Vector DB ANN)]
        BM25[(BM25 inverted)]
        VDB --> FUSE[RRF fusion]
        BM25 --> FUSE
        FUSE --> RR[Cross-encoder rerank]
    end
    RR --> LLM[LLM inference<br/>streaming first token]
    ORCH -->|complex task| AG[Agent Loop<br/>durable execution]
    AG --> TOOLS[Tool calls<br/>retrieval·DB·external API]
    AG -.high-risk action.-> HITL[Human-in-the-loop<br/>approval queue]
    HITL -.approve/deny.-> AG
    AG --> LLM
    LLM --> U
    INGEST[Offline ingest<br/>chunk·embed·index] --> VDB
    INGEST --> BM25

Three main flows: ① the Q&A hot path runs "retrieve → fuse → rerank → stream", on a tight budget; ② the agent path runs on a durable execution engine that can run long and resume after interruption; ③ offline ingest turns documents into searchable vectors+inverted index, decoupled from online. Four key points below.

Key Technical Points

① RAG retrieval quality — bad recall can't be saved by a smarter model

One-line trade-off: naive chunk-embedding is cheap but recalls poorly; contextual + hybrid + rerank recalls well but is costly and slower.

[Principle] The RAG bottleneck is almost never generation—it's retrieval recall. The model can only answer from the chunks you feed it; what isn't retrieved doesn't exist. The naive approach splits docs into 512-token chunks and embeds each in isolation, but chunking strips context: a passage says "this approach cut latency 40%", yet which approach, in which doc, is invisible from the isolated chunk—so semantic search misses. Anthropic's Contextual Retrieval prepends an LLM-generated "what this chunk is about" summary before embedding, and adds BM25 keyword retrieval for hybrid recall.
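
A minimal sketch of the contextual-chunk idea, assuming a hypothetical `describe` callable stands in for the LLM call that situates a chunk within its document. The function names and the placeholder description are illustrative only, not Anthropic's implementation.

```python
from typing import Callable, List


def contextualize_chunks(
    doc_title: str,
    chunks: List[str],
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prepend a short, chunk-specific context line to each chunk.

    `describe(doc_title, chunk)` is a hypothetical stand-in for the LLM call
    that explains where the chunk sits in its document.
    """
    return [f"{describe(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]


def fake_describe(doc_title: str, chunk: str) -> str:
    # Placeholder only: a real system would ask an LLM for this sentence.
    return f"From '{doc_title}':"


texts = contextualize_chunks(
    "Q3 caching redesign", ["This approach cut latency 40%."], fake_describe
)
```

The contextualized text, rather than the bare chunk, is what would be embedded and indexed for BM25, so "this approach" is now tied to a named document.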

Three recall strategies:
| Approach | Recall quality | Cost |
| --- | --- | --- |
| Naive chunk embedding | Baseline | Cheapest; misses long-tail / proper nouns |
| Contextual + Hybrid + Rerank | Failure rate ↓ sharply | +1 LLM call per chunk at ingest + 1 rerank online |
| Stuff full corpus into context | No retrieval error | Token cost explodes, blows the window; impossible at 100M chunks |
# Online retrieval: hybrid recall + rerank (pseudocode)
q_vec = embed(query)                      # vectorize query
dense = vdb.ann_search(q_vec, k=100)      # semantic recall top-100
sparse = bm25.search(query, k=100)        # keyword recall top-100
fused = rrf(dense, sparse)                # reciprocal rank fusion, dedup
top = cross_encoder.rerank(query, fused)[:8]  # keep 8 most relevant
context = "\n\n".join(c.text for c in top)
answer = llm.stream(prompt(query, context))   # stream w/ citations
Real case: Anthropic's Contextual Retrieval write-up reports that contextual embeddings combined with contextual BM25 cut the top-20-chunk retrieval failure rate by 49%, and by 67% once reranking is added. The original RAG paradigm comes from Lewis et al. 2020 (NeurIPS), which combines parametric knowledge with an external non-parametric retrieval store. More on retrieval and reranking in Day 31 "Hybrid Search & Reranking".
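
The `rrf` step in the retrieval pseudocode can be sketched as follows. This version assumes each ranking is a list of chunk ids, best first, and uses the smoothing constant k=60, a common default rather than anything the source specifies.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(*rankings: Sequence[str], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk ids; score(id) = sum over lists of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c3", "c1", "c7"]    # semantic recall, best first
sparse = ["c1", "c9", "c3"]   # keyword recall, best first
fused = rrf(dense, sparse)    # ids found by both retrievers rise to the top
```

Because RRF only looks at ranks, it needs no score normalization between the vector index and BM25, and an id that appears in both lists is deduplicated by construction.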

② Agent loop as a service — turn a long, crash-prone loop into a resumable workflow

One-line trade-off: an in-memory loop is simple but loses everything on crash; durable execution is resumable but must persist every step.

[Principle] An agent is a loop: call model → model requests a tool → run the tool → feed result back → call model again, until done. The problem: this loop may run minutes and dozens of steps, during which any step can hit a deploy restart, tool timeout, or model 503. If state lives only in memory, one crash restarts the whole task from scratch—after burning real tokens on every step. The fix is durable execution: persist each step's input/output as an event, and on crash replay from the last successful step without re-running completed tool calls (via idempotency keys). This is exactly the value of engines like Temporal / AWS Step Functions.

Three agent execution models:
# Durable agent loop (durable workflow pseudocode)
def agent_workflow(task):
    history = [user_msg(task)]
    for step in range(MAX_STEPS):             # hard cap against runaway
        resp = activity.call_llm(history)     # persisted; replay won't recompute
        if resp.is_final:
            return resp.text
        history.append(assistant_msg(resp))   # record the model's tool request before its results
        for call in resp.tool_calls:
            # idempotency key = (workflow_id, step, tool, args_hash)
            result = activity.run_tool(call, idem_key=key(step, call))
            history.append(tool_result(call, result))
    raise StepLimitExceeded   # trigger degrade / handoff to human
Real case: Anthropic Building Effective Agents distinguishes workflows (predefined orchestration) from agents (model dynamically directs its path), and stresses "don't build an agent if a simpler solution works"—agents trade latency and cost for flexibility. In production, agent orchestration commonly sits on durable execution engines like Temporal / Step Functions (same lineage as Day 39 "Workflow Engines").
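
The replay behavior described in this section can be illustrated with a toy event log. This is a sketch under stated assumptions: an in-memory dict stands in for durable storage, and `EventLog`, `workflow` and the key format are hypothetical names, not the API of Temporal or Step Functions.

```python
from typing import Any, Callable, Dict, List


class EventLog:
    """Toy durable log: maps an activity key to its recorded result."""

    def __init__(self) -> None:
        self.results: Dict[str, Any] = {}

    def run(self, key: str, fn: Callable[[], Any]) -> Any:
        if key in self.results:          # replay: reuse the recorded result
            return self.results[key]
        result = fn()                    # first execution: do the real work
        self.results[key] = result       # record it before moving on
        return result


calls: List[int] = []                    # which steps actually executed a tool


def workflow(log: EventLog, crash_at: int = -1) -> List[str]:
    outputs = []
    for step in range(3):
        if step == crash_at:
            raise RuntimeError("simulated crash")

        def tool(step: int = step) -> str:
            calls.append(step)
            return f"result-{step}"

        outputs.append(log.run(f"wf-1/step-{step}", tool))
    return outputs


log = EventLog()
try:
    workflow(log, crash_at=2)            # steps 0 and 1 complete, then the process dies
except RuntimeError:
    pass
outputs = workflow(log)                  # recovery: rerun the workflow against the same log
```

On recovery the workflow code runs again from the top, but steps 0 and 1 are answered from the log, so only step 2 executes its tool.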

③ Embedding service design — model version and index version must be locked together

One-line trade-off: online real-time embedding is low-latency but pricey, batch embedding is high-throughput but delayed; the biggest trap is changing the model = re-embedding the whole corpus.

[Principle] The embedding service has two traffic classes: online (query vectorization, must be fast, cacheable) and offline (batch-embedding 100M chunks, needs throughput). They should be deployed separately: online for small low-latency batches, offline for large GPU-saturating batches. The key constraint is the same vector space—query and doc must be embedded by the same model, or vectors can't be compared. So once you upgrade the embedding model, the entire index must be rebuilt with the new model, with old and new coexisting during a gradual cutover—never mixed mid-flight (mixing utterly distorts similarity).

Fatal trap: embedding the query with the new model while the docs are still in the old index. The computed similarity is noise and recall collapses, yet nothing crashes or errors out; answers just quietly get worse. Always tag the index with model_version and have the query side assert that its model matches.
# Offline ingest: batch embed + dual-index canary (pseudocode)
for batch in chunks.stream(size=256):         # large batch saturates GPU
    vecs = embed_offline(batch, model="v3")   # the exact model version this index is tagged with
    new_index.upsert(batch.ids, vecs, meta={"model":"v3"})
# On model change: build v3 in background → validate recall → atomic cutover → drop v2
# Online query side:
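def embed_query(q):
    return _embed_query_cached(q, current_index_model())   # resolve the model on every request

@cache(ttl="1h")                              # cache key is (q, model): a cutover never serves old-space vectors
def _embed_query_cached(q, model):
    return embed_online(q, model=model)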
Online vs offline embedding: online suits queries (high QPS, short text, high cache hit); offline suits ingest (large volume, tolerates minute-level latency, batching saves money). Don't run a full re-embed on the online service—it will wreck online query latency.
Real case: vector DBs (pgvector / Pinecone / Milvus, etc.) commonly use HNSW for ANN; the recall/latency trade-off is in Day 12 "Search Systems". OpenAI / Cohere embedding APIs explicitly version their models, and their docs stress rebuilding the index on a version change—an industry-wide constraint, not one vendor's quirk.
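
One way to make the "tag the index and assert on the query side" rule concrete is sketched below. `VersionedIndex` and `ModelVersionMismatch` are hypothetical names, and the brute-force dot product is a stand-in for a real ANN search.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class ModelVersionMismatch(Exception):
    """Raised when a query vector comes from a different model than the index."""


@dataclass
class VersionedIndex:
    model_version: str
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def search(self, query_vec: List[float], query_model: str, k: int = 3) -> List[str]:
        # Fail loudly instead of returning similarity scores that are noise.
        if query_model != self.model_version:
            raise ModelVersionMismatch(
                f"query embedded with {query_model}, index built with {self.model_version}"
            )

        def score(chunk_id: str) -> float:
            return sum(a * b for a, b in zip(query_vec, self.vectors[chunk_id]))

        return sorted(self.vectors, key=score, reverse=True)[:k]


index = VersionedIndex("v2", {"a": [1.0, 0.0], "b": [0.0, 1.0]})
hits = index.search([1.0, 0.0], query_model="v2")
```

Turning the mismatch into an exception converts a silent quality regression into an alert that shows up in error rates.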

④ Human-in-the-loop — fence high-risk actions behind an approval gate

One-line trade-off: synchronous blocking is simple but holds a connection/thread; async suspend scales but must persist the whole agent state.

[Principle] Models make mistakes, so irreversible high-risk actions (send email, transfer money, delete data, change prod config) can't be left to the agent's own call. The pattern is human-in-the-loop: when the agent reaches a high-risk tool, instead of executing it, it pauses, pushes "I intend to do X" to an approval queue, and continues only after approval. The hard part is that "wait for a human" may take hours or days—you can't have a thread/connection idle-waiting. So it must build on the durable execution of point ②: persist agent state, release all resources, and on the approval callback resume from the pause point.

Three ways to wait:
# High-risk tool → suspend for approval (durable pseudocode)
def run_tool(call):
    if call.tool in HIGH_RISK:
        approval_id = approvals.create(call)        # push to approval queue
        decision = workflow.wait_for_signal(         # durable suspend, can wait days
            approval_id, timeout="3d")
        if decision != "approved":
            return refused(call)                     # don't execute if denied
    return execute(call)                             # execute only if approved/low-risk
Real case: Anthropic Building Effective Agents lists "human checkpoints at key steps" as a core practice for agentic systems—especially when actions are irreversible or high-cost. This complements the Saga compensation idea from Day 7 "Distributed Transactions": compensable actions run automatically, non-compensable ones (an already-sent email) go through human approval.
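
The suspend-and-resume pattern can be sketched without a workflow engine. The assumptions here are that a dict stands in for a durable approval table and that `request_tool` and `on_decision` are hypothetical entry points: the first parks the call and returns immediately, the second is the approval callback.

```python
import uuid
from typing import Dict, List

PENDING: Dict[str, dict] = {}        # stand-in for a durable approval table
HIGH_RISK = {"send_email"}
executed: List[str] = []             # tools that actually ran


def execute(call: dict) -> str:
    executed.append(call["tool"])
    return f"done:{call['tool']}"


def request_tool(call: dict) -> dict:
    """Run low-risk tools immediately; park high-risk ones for approval."""
    if call["tool"] in HIGH_RISK:
        approval_id = str(uuid.uuid4())
        PENDING[approval_id] = call  # persist the intent, then return: nothing blocks
        return {"status": "suspended", "approval_id": approval_id}
    return {"status": "done", "result": execute(call)}


def on_decision(approval_id: str, decision: str) -> dict:
    """Approval callback: resume from the parked call."""
    call = PENDING.pop(approval_id)
    if decision != "approved":
        return {"status": "refused"}
    return {"status": "done", "result": execute(call)}


ticket = request_tool({"tool": "send_email", "to": "legal@example.com"})
outcome = on_decision(ticket["approval_id"], "approved")
```

Between the two calls no thread or connection is held; the only thing waiting is a row in the approval table.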

Scaling & Optimization

Common Pitfalls + Interview Follow-ups

Deep-dive Resources

Going Deeper

1. "Long-context models keep getting cheaper—stuff the whole knowledge base into context and RAG is obsolete." Where is this right and wrong?

Mostly wrong. Even with a huge, cheap context window, RAG remains irreplaceable for three reasons: ① cost and latency stay linear in tokens: stuffing 100M chunks into every request explodes both cost and first-token latency, whereas retrieval compresses the input to the top 8 chunks; ② lost in the middle: in very long contexts a model's use of mid-context information drops markedly, and putting relevant content up front (which is what retrieval does) is more accurate; ③ attributability and freshness: RAG naturally yields "this answer came from this chunk", and knowledge-base updates take effect as soon as the index is updated, without retraining or reloading the model.

The right part: retrieval granularity can get coarser—no need for 512-token chunks; you can retrieve whole sections/docs into a large window, reducing the "chunking strips context" problem. So the trend is RAG and long-context fusing (coarse recall + large-window reading), not one replacing the other.

2. Your agent succeeded at the "send email" tool on step 7, but the process crashed before step 8 persisted. How do you avoid sending the email twice on recovery?

The core is idempotency, not "remembering we already sent it". Durable execution recovers by replay: re-execute from the last successfully persisted step. If step 7's tool result was persisted, replay returns the cached result without re-running the tool—no duplicate email. The danger window is "tool succeeded but result not yet persisted before crash": replay will then call the tool again.

The defense is a tool-side idempotency key: every tool call carries idem_key=(workflow_id, step, tool, args_hash), and the email service dedups on it, so the same key arriving twice returns the first result without actually sending again. This is the same idea as the idempotent recovery points in Day 17 "Payment Systems": external side effects must be idempotent themselves; the orchestration engine's bookkeeping isn't enough, because a gap between side effect and persistence always exists.
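
A minimal sketch of tool-side deduplication, assuming a hypothetical in-process `EmailService`. A real service would make the "check key, send, record" sequence atomic in its own datastore.

```python
import hashlib
import json
from typing import Dict, List, Tuple


def idem_key(workflow_id: str, step: int, tool: str, args: dict) -> str:
    # Hash the arguments in a stable order so retries produce the same key.
    args_hash = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()[:16]
    return f"{workflow_id}:{step}:{tool}:{args_hash}"


class EmailService:
    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []   # emails that really went out
        self._seen: Dict[str, str] = {}         # idempotency key -> first receipt

    def send(self, key: str, to: str, body: str) -> str:
        if key in self._seen:                   # duplicate call: return the first result
            return self._seen[key]
        self.sent.append((to, body))
        receipt = f"msg-{len(self.sent)}"
        self._seen[key] = receipt
        return receipt


svc = EmailService()
args = {"to": "legal@example.com", "body": "Contract comparison attached"}
key = idem_key("wf-1", 7, "send_email", args)
first = svc.send(key, **args)
retry = svc.send(key, **args)                   # replay after the crash
```

The replayed call gets the original receipt back and no second email is sent, even though the orchestrator never recorded step 7.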

3. The knowledge base ingests 100K docs/day, but you've switched embedding models and must re-embed the whole corpus (100M chunks, ~8 hours). What happens to user queries during the rebuild?

You can't take an outage, and you can't query a mixed old/new index (vector spaces are incompatible). The standard approach is dual index + atomic cutover: the old index (v2) keeps serving all queries; the new index (v3) is rebuilt in the background with the new model, while queries still go to v2 and are embedded with the v2 model (keeping the space consistent). Once v3 is built, run the golden eval set to confirm recall hasn't regressed, then atomically cut all traffic (including the query embedding model) to v3, and finally drop v2.

The hard part is incremental updates during the rebuild: at 100K docs/day, roughly 33K docs arrive in those 8 hours, and they must be written to both indexes (v2 with the old model, v3 with the new), i.e. dual-write, or v3 will be missing that data at cutover. Dual-write must handle failure consistency (retry/compensate if one write fails). Cost-wise, the rebuild is a one-off large offline job: use spot/batch GPUs to cut cost, and don't take capacity from online resources.
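
The dual-index flow can be sketched with a small router. Everything here is illustrative: `IndexRouter` and `toy_embed` are hypothetical, the "vectors" are just (model version, text length) tuples, and the swap is only atomic in this single-threaded toy.

```python
from typing import Callable, Dict, Optional, Tuple

Vector = Tuple[str, int]                      # toy "vector": (model version, text length)
Embed = Callable[[str, str], Vector]          # hypothetical embed(text, model_version)


class IndexRouter:
    """Serve from one live index while a second one is rebuilt in the background."""

    def __init__(self, live: str) -> None:
        self.indexes: Dict[str, Dict[str, Vector]] = {live: {}}
        self.live = live
        self.building: Optional[str] = None

    def start_rebuild(self, version: str) -> None:
        self.indexes[version] = {}
        self.building = version

    def upsert(self, doc_id: str, text: str, embed: Embed) -> None:
        # Dual-write: each index gets a vector from its own model version.
        for version in (self.live, self.building):
            if version is not None:
                self.indexes[version][doc_id] = embed(text, version)

    def backfill(self, doc_id: str, text: str, embed: Embed) -> None:
        # Background rebuild job: writes existing documents into the new index only.
        assert self.building is not None, "no rebuild in progress"
        self.indexes[self.building][doc_id] = embed(text, self.building)

    def cutover(self) -> None:
        # Swap the live pointer in one step, then drop the old index.
        assert self.building is not None, "no rebuild in progress"
        old, self.live, self.building = self.live, self.building, None
        del self.indexes[old]


def toy_embed(text: str, version: str) -> Vector:
    return (version, len(text))


router = IndexRouter(live="v2")
router.upsert("d1", "old contract", toy_embed)     # before the rebuild: v2 only
router.start_rebuild("v3")
router.upsert("d2", "new contract", toy_embed)     # during the rebuild: v2 and v3
router.backfill("d1", "old contract", toy_embed)   # rebuild job catches up on d1
router.cutover()
```

Queries would read `router.live` to pick both the index and the query embedding model, which keeps the two locked together across the swap.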

4. Why is "agent observability harder than traditional microservices"? What should you instrument?

The difficulty is non-determinism + multi-step + high cost. A traditional service has one trace per request and reproducible behavior; an agent task is a dynamically unfolding tree of dozens of steps that may take a different path next time on the same input, burning tokens at every step—so when something breaks you must answer not just "which step was slow/wrong" but "why did the model decide to go this way" and "how much did this step cost".

Instrument: ① a full trace per step (model input/output, which tool was chosen, tool latency and result); ② token and cost attribution by task/user/tool; ③ replayable decisions—store each step's prompt+response to debug "why this decision"; ④ quality signals—retrieval recall, whether answers carry citations, human approval pass rate, runaway rate (fraction hitting the step cap). This is the metrics/logs/traces trio of Day 21 "Observability", plus two extra dimensions: cost and decision.
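
Token attribution by task or tool can be as simple as grouping per-step records. The `StepRecord` fields below are assumptions about what a step trace might carry, and the numbers are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class StepRecord:
    task_id: str
    step: int
    tool: str            # tool the model chose at this step
    input_tokens: int
    output_tokens: int


def tokens_by(records: Iterable[StepRecord], attr: str) -> Dict[str, int]:
    """Sum input + output tokens grouped by one attribute of the step record."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[getattr(record, attr)] += record.input_tokens + record.output_tokens
    return dict(totals)


records = [
    StepRecord("t1", 0, "search", 1000, 200),
    StepRecord("t1", 1, "send_email", 300, 50),
    StepRecord("t2", 0, "search", 500, 100),
]
per_task = tokens_by(records, "task_id")
per_tool = tokens_by(records, "tool")
```

Multiplying these totals by the current per-token price gives cost per task, user, or tool without changing the trace schema.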

5. After launching a semantic cache (reuse answers for similar queries), hit rate is 30% and saves a lot—but two weeks later users complain about "stale / off-topic answers". What likely went wrong?

Two classic traps. ① Missing invalidation: the knowledge base updated, but the old answers in the semantic cache didn't expire, so users get conclusions based on old docs. Semantic-cache invalidation is harder than plain KV-cache invalidation because by default you don't know which chunks a cached answer depended on. The options are short TTLs, keying the cache on a knowledge-base version and bumping it on updates, or recording each answer's source chunks and invalidating only the affected entries when those chunks change.

② Similarity threshold too loose: "off-topic" means queries that aren't similar enough were judged as hits. "What's our refund policy" and "what's our return policy" are close in vector space but have different answers. Too loose mismatches; too tight drops hit rate. Tune the threshold on real query pairs, and for high-risk/precise queries (money, policy, personal data) disable the semantic cache, enabling it only for broad FAQs. This is exactly Day 2 "Caching"—"how long can you tolerate stale" and "cache invalidation is hard"—replayed in an AI setting.
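
Both fixes, a tunable similarity threshold and invalidation by source chunk, fit in a small sketch. The `SemanticCache` class, the linear scan, and the 0.95 threshold are assumptions for illustration; a production cache would use an ANN index and a threshold tuned on real query pairs.

```python
import math
from typing import FrozenSet, Iterable, List, Optional, Tuple

Entry = Tuple[List[float], str, FrozenSet[str]]   # (query vector, answer, source chunk ids)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: List[Entry] = []

    def put(self, query_vec: List[float], answer: str, source_chunks: Iterable[str]) -> None:
        self.entries.append((query_vec, answer, frozenset(source_chunks)))

    def get(self, query_vec: List[float]) -> Optional[str]:
        # Return the closest cached answer only if it clears the threshold.
        best = max(self.entries, key=lambda e: cosine(query_vec, e[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def invalidate_chunk(self, chunk_id: str) -> None:
        # Drop every cached answer that was built from the updated chunk.
        self.entries = [e for e in self.entries if chunk_id not in e[2]]


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Refunds within 30 days.", {"policy-c1"})
hit = cache.get([1.0, 0.0])          # same direction: served from cache
miss = cache.get([0.8, 0.6])         # cosine 0.8 is below the threshold
cache.invalidate_chunk("policy-c1")  # the policy chunk was re-ingested
after_update = cache.get([1.0, 0.0])
```

Storing the source chunk ids with each answer is what makes targeted invalidation possible; without them the only safe options are a TTL or a full flush.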