Day 33 · Hard · AI Backend · RAG · Agent Loop · Embedding Serving

AI Product Backend: RAG, Agentic Loops & Human-in-the-loop
RAG, Durable Agent Loops, Embedding Serving, Human-in-the-loop

Scenario + Requirements

Build a "document Q&A + agent assistant" backend for 500K enterprise users: users ask questions, the system retrieves from an internal knowledge base to answer; complex tasks ("compare these three contracts into a table and email it") go to an agent for multi-step execution. This is not "wrap an LLM API"—it is simultaneously a retrieval system, a long-running task orchestrator, and an exercise in fencing a non-deterministic model output inside controllable engineering boundaries.

Scale: 100M chunks (~20M source docs), 1024-dim vectors; Q&A peak 2000 QPS, agent task concurrency 5000.
Latency: Q&A p95 first token < 1.5s (retrieval budget ≤ 300ms); agent tasks may run seconds to minutes but must be resumable.
Correctness: answers must be attributable (traceable to a chunk); high-risk actions (send email, mutate data) require human confirmation.
Cost: LLM tokens dominate—the longer the retrieved context, the more expensive. Recall quality converts directly into dollars.

High-level Architecture

graph TD
    U[User request] --> GW[API Gateway / auth·rate-limit]
    GW --> ORCH[Orchestrator<br/>Q&A or Agent route]
    ORCH -->|Q&A| RET[Retrieval]
    subgraph RET[Retrieval ≤300ms]
        EMB[Embedding service<br/>query vectorize] --> VDB[(Vector DB ANN)]
        BM25[(BM25 inverted)]
        VDB --> FUSE[RRF fusion]
        BM25 --> FUSE
        FUSE --> RR[Cross-encoder rerank]
    end
    RR --> LLM[LLM inference<br/>streaming first token]
    ORCH -->|complex task| AG[Agent Loop<br/>durable execution]
    AG --> TOOLS[Tool calls<br/>retrieval·DB·external API]
    AG -.high-risk action.-> HITL[Human-in-the-loop<br/>approval queue]
    HITL -.approve/deny.-> AG
    AG --> LLM
    LLM --> U
    INGEST[Offline ingest<br/>chunk·embed·index] --> VDB
    INGEST --> BM25

Three main flows: ① the Q&A hot path runs "retrieve → fuse → rerank → stream", on a tight budget; ② the agent path runs on a durable execution engine that can run long and resume after interruption; ③ offline ingest turns documents into searchable vectors+inverted index, decoupled from online. Four key points below.

Key Technical Points

① RAG retrieval quality — bad recall can't be saved by a smarter model

One-line trade-off: naive chunk-embedding is cheap but recalls poorly; contextual + hybrid + rerank recalls well but is costly and slower.

[Principle] The RAG bottleneck is almost never generation—it's retrieval recall. The model can only answer from the chunks you feed it; what isn't retrieved doesn't exist. The naive approach splits docs into 512-token chunks and embeds each in isolation, but chunking strips context: a passage says "this approach cut latency 40%", yet which approach, in which doc, is invisible from the isolated chunk—so semantic search misses. Anthropic's Contextual Retrieval prepends an LLM-generated "what this chunk is about" summary before embedding, and adds BM25 keyword retrieval for hybrid recall.

Three recall strategies:

Approach	Recall quality	Cost
Naive chunk embedding	Baseline	Cheapest; misses long-tail / proper nouns
Contextual + Hybrid + Rerank	Failure rate ↓ sharply	+1 LLM call per chunk at ingest + 1 rerank online
Stuff full corpus into context	No retrieval error	Token cost explodes, blows the window; impossible at 100M chunks

# Online retrieval: hybrid recall + rerank (pseudocode)
q_vec = embed(query)                      # vectorize query
dense = vdb.ann_search(q_vec, k=100)      # semantic recall top-100
sparse = bm25.search(query, k=100)        # keyword recall top-100
fused = rrf(dense, sparse)                # reciprocal rank fusion, dedup
top = cross_encoder.rerank(query, fused)[:8]  # keep 8 most relevant
context = "\n\n".join(c.text for c in top)
answer = llm.stream(prompt(query, context))   # stream w/ citations

Real case: Anthropic Contextual Retrieval reports this reduces retrieval failures by ~49%, and ~67% combined with reranking. The original RAG paradigm is from Lewis et al. 2020 (NeurIPS), combining parametric knowledge with an external non-parametric retrieval store. More on retrieval and reranking in Day 31 "Hybrid Search & Reranking".
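
The rrf(dense, sparse) step in the pseudocode above is not spelled out. A minimal sketch of reciprocal rank fusion follows, assuming each ranking is a list of chunk ids ordered best first (the pseudocode passes chunk objects; ids are used here to keep it short) and assuming the commonly used constant k=60. The function name and the sample ids are illustrative only.

```python
from collections import defaultdict

def rrf(*rankings, k=60):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not from the source.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    # Each id appears once in the result, which is the dedup step.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

dense = ["c7", "c2", "c9"]    # hypothetical ANN results, best first
sparse = ["c2", "c4", "c7"]   # hypothetical BM25 results, best first
fused = rrf(dense, sparse)
print(fused)
```

Because RRF only looks at ranks, it needs no score normalization between the cosine similarities of the dense leg and the BM25 scores of the sparse leg, which is one reason it is a convenient default for hybrid recall.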

② Agent loop as a service — turn a long, crash-prone loop into a resumable workflow

One-line trade-off: an in-memory loop is simple but loses everything on crash; durable execution is resumable but must persist every step.

[Principle] An agent is a loop: call model → model requests a tool → run the tool → feed result back → call model again, until done. The problem: this loop may run minutes and dozens of steps, during which any step can hit a deploy restart, tool timeout, or model 503. If state lives only in memory, one crash restarts the whole task from scratch—after burning real tokens on every step. The fix is durable execution: persist each step's input/output as an event, and on crash replay from the last successful step without re-running completed tool calls (via idempotency keys). This is exactly the value of engines like Temporal / AWS Step Functions.

Three agent execution models:

In-memory loop (a while loop on the request thread): 5 minutes to build, fine for <10s tasks; lost on restart/autoscale, unusable for long tasks.
Durable execution engine (Temporal-style): each step persisted, auto-retry, resume after crash, can pause for days awaiting approval; cost is writing agent logic as a deterministic workflow with all side effects in activities.
Stateless + full replay: store no intermediate state, on recovery re-feed history to "re-walk" it; saves storage but replay burns tokens and is dangerous with side-effecting tools (re-sending email).
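
To make the durable model concrete, here is a toy sketch of "persist each step, replay without re-running completed steps". The Journal class is a hypothetical in-memory stand-in for a durable event log; a real engine such as Temporal persists this history for you and has its own API, which this does not imitate.

```python
class Journal:
    """In-memory stand-in for a durable event log (assumption: a real
    system would write this to a database before acknowledging the step)."""

    def __init__(self):
        self.results = {}

    def run_step(self, step_id, fn, *args):
        # Completed step: return the recorded result, do not re-execute.
        if step_id in self.results:
            return self.results[step_id]
        result = fn(*args)
        self.results[step_id] = result   # record before moving on
        return result

calls = []

def expensive_tool(x):
    calls.append(x)                      # records every real execution
    return x * 2

def workflow(journal, crash_after=None):
    total = 0
    for step in range(3):
        total += journal.run_step(step, expensive_tool, step + 1)
        if crash_after == step:
            raise RuntimeError("simulated crash")
    return total

journal = Journal()
try:
    workflow(journal, crash_after=1)     # steps 0 and 1 complete, then crash
except RuntimeError:
    pass
result = workflow(journal)               # replay: only step 2 really runs
```

After the replay, calls holds one entry per step even though the workflow function ran twice. The sketch does not cover a crash between fn(*args) returning and the result being recorded; that window is what tool-side idempotency keys are for.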

# Durable agent loop (durable workflow pseudocode)
def agent_workflow(task):
    history = [user_msg(task)]
    for step in range(MAX_STEPS):             # hard cap against runaway
        resp = activity.call_llm(history)     # persisted; replay won't recompute
        if resp.is_final:
            return resp.text
        history.append(assistant_msg(resp))   # keep the model's own tool request in context
        for call in resp.tool_calls:
            # idempotency key = (workflow_id, step, tool, args_hash)
            result = activity.run_tool(call, idem_key=key(step, call))
            history.append(tool_result(call, result))
    raise StepLimitExceeded   # trigger degrade / handoff to human

Real case: Anthropic Building Effective Agents distinguishes workflows (predefined orchestration) from agents (model dynamically directs its path), and stresses "don't build an agent if a simpler solution works"—agents trade latency and cost for flexibility. In production, agent orchestration commonly sits on durable execution engines like Temporal / Step Functions (same lineage as Day 39 "Workflow Engines").
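
The key(step, call) helper in the loop above is left undefined. One plausible way to derive the (workflow_id, step, tool, args_hash) key is sketched below; the function name idem_key, the 16-character truncation, and the sample arguments are assumptions for illustration.

```python
import hashlib
import json

def idem_key(workflow_id, step, tool, args):
    """Derive a stable key from (workflow_id, step, tool, args_hash)."""
    # Canonical JSON so that dict ordering does not change the hash.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{workflow_id}:{step}:{tool}:{args_hash}"

k1 = idem_key("wf-42", 7, "send_email", {"to": "a@example.com", "subject": "Q3"})
k2 = idem_key("wf-42", 7, "send_email", {"subject": "Q3", "to": "a@example.com"})
k3 = idem_key("wf-42", 8, "send_email", {"to": "a@example.com", "subject": "Q3"})
print(k1 == k2, k1 == k3)
```

The key must be a pure function of persisted workflow state, so that a replay after a crash produces the same key as the original attempt. If a single step can legitimately issue the same tool with the same arguments twice, the call's index within the step would need to be part of the key as well.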

③ Embedding service design — model version and index version must be locked together

One-line trade-off: online real-time embedding is low-latency but pricey, batch embedding is high-throughput but delayed; the biggest trap is changing the model = re-embedding the whole corpus.

[Principle] The embedding service has two traffic classes: online (query vectorization, must be fast, cacheable) and offline (batch-embedding 100M chunks, needs throughput). They should be deployed separately: online for small low-latency batches, offline for large GPU-saturating batches. The key constraint is the same vector space—query and doc must be embedded by the same model, or vectors can't be compared. So once you upgrade the embedding model, the entire index must be rebuilt with the new model, with old and new coexisting during a gradual cutover—never mixed mid-flight (mixing utterly distorts similarity).

Fatal trap: embedding the query with the new model while the docs are still in the old index—the computed similarity is noise, recall is all wrong but nothing errors out (nothing crashes, answers just get worse). Always tag the index with model_version and have the query side assert consistency.
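
A minimal sketch of "tag the index with model_version and have the query side assert consistency". VectorIndex and ModelVersionMismatch are hypothetical names, and the search is a brute-force dot product rather than real ANN; the point is only the version check.

```python
from dataclasses import dataclass, field

class ModelVersionMismatch(RuntimeError):
    """Raised when query and index were embedded by different models."""

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

@dataclass
class VectorIndex:
    model_version: str                           # model that embedded every doc here
    vectors: dict = field(default_factory=dict)  # chunk_id -> vector

    def search(self, query_vec, query_model_version, k=10):
        # Fail loudly instead of returning silently meaningless neighbours.
        if query_model_version != self.model_version:
            raise ModelVersionMismatch(
                f"query embedded with {query_model_version!r}, "
                f"index built with {self.model_version!r}")
        ranked = sorted(self.vectors,
                        key=lambda cid: -dot(query_vec, self.vectors[cid]))
        return ranked[:k]

index = VectorIndex("v2", {"c1": [1.0, 0.0], "c2": [0.0, 1.0]})
hits = index.search([0.9, 0.1], query_model_version="v2", k=1)
```

Turning a silent quality regression into an exception means a bad rollout shows up in error dashboards instead of only in user complaints.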

# Offline ingest: batch embed + dual-index canary (pseudocode)
for batch in chunks.stream(size=256):         # large batch saturates GPU
    vecs = embed_offline(batch, model="v3")   # must be the exact model version queries will use
    new_index.upsert(batch.ids, vecs, meta={"model":"v3"})
# On model change: build v3 in background → validate recall → atomic cutover → drop v2
# Online query side:
@cache(ttl="1h")                              # keyed on (q, model): no stale v2 vectors after cutover
def _embed_cached(q, model): return embed_online(q, model=model)
def embed_query(q): return _embed_cached(q, current_index_model())

Online vs offline embedding: online suits queries (high QPS, short text, high cache hit); offline suits ingest (large volume, tolerates minute-level latency, batching saves money). Don't run a full re-embed on the online service—it will wreck online query latency.

Real case: vector DBs (pgvector / Pinecone / Milvus, etc.) commonly use HNSW for ANN; the recall/latency trade-off is in Day 12 "Search Systems". OpenAI / Cohere embedding APIs explicitly version their models, and their docs stress rebuilding the index on a version change—an industry-wide constraint, not one vendor's quirk.

④ Human-in-the-loop — fence high-risk actions behind an approval gate

One-line trade-off: synchronous blocking is simple but holds a connection/thread; async suspend scales but must persist the whole agent state.

[Principle] Models make mistakes, so irreversible high-risk actions (send email, transfer money, delete data, change prod config) can't be left to the agent's own call. The pattern is human-in-the-loop: when the agent reaches a high-risk tool, instead of executing it, it pauses, pushes "I intend to do X" to an approval queue, and continues only after approval. The hard part is that "wait for a human" may take hours or days—you can't have a thread/connection idle-waiting. So it must build on the durable execution of point ②: persist agent state, release all resources, and on the approval callback resume from the pause point.

Three ways to wait:

Synchronous blocking: the request thread holds while awaiting approval—simplest, but waiting hours exhausts the connection pool; never for long waits.
Polling / short-timeout retry: periodically check approval status—simple, but with latency and wasted poll overhead.
Durable suspend + event wake: the workflow persists and releases resources, woken on the approval event—can wait arbitrarily long with zero resource hold, at the cost of depending on a durable engine and correct resume semantics.
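
The third option can be sketched without any engine: persist the pending call, return immediately, and resume from a callback. STORE is a hypothetical stand-in for a durable table, and suspend_for_approval / on_approval_event are illustrative names, not any workflow engine's API.

```python
import json

STORE = {}   # stand-in for a durable table keyed by approval id (assumption)

def suspend_for_approval(approval_id, workflow_state):
    # Persist everything needed to resume, then return: no thread stays blocked.
    STORE[approval_id] = json.dumps({"status": "pending", "state": workflow_state})

def on_approval_event(approval_id, decision, execute):
    """Called by the approval UI or webhook, possibly days later."""
    record = json.loads(STORE[approval_id])
    if record["status"] != "pending":
        return record["result"]              # duplicate event: do not act twice
    call = record["state"]["pending_call"]
    if decision == "approved":
        result = execute(call)
    else:
        result = {"refused": call["tool"]}
    record.update(status="done", result=result)
    STORE[approval_id] = json.dumps(record)
    return result

executed = []

def execute(call):
    executed.append(call["tool"])
    return {"ok": call["tool"]}

suspend_for_approval("ap-1", {"pending_call": {"tool": "send_email", "args": {}}})
first = on_approval_event("ap-1", "approved", execute)
second = on_approval_event("ap-1", "approved", execute)   # redelivered event
```

State is serialized to JSON on purpose: whatever resumes the workflow may be a different process from the one that suspended it, so nothing can live in local variables across the wait.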

# High-risk tool → suspend for approval (durable pseudocode)
def run_tool(call):
    if call.tool in HIGH_RISK:
        approval_id = approvals.create(call)        # push to approval queue
        decision = workflow.wait_for_signal(         # durable suspend, can wait days
            approval_id, timeout="3d")
        if decision != "approved":
            return refused(call)                     # don't execute if denied
    return execute(call)                             # execute only if approved/low-risk

Real case: Anthropic Building Effective Agents lists "human checkpoints at key steps" as a core practice for agentic systems—especially when actions are irreversible or high-cost. This complements the Saga compensation idea from Day 7 "Distributed Transactions": compensable actions run automatically, non-compensable ones (an already-sent email) go through human approval.

Scaling & Optimization

Semantic cache: reuse answers for similar queries (return on vector-neighbor hit); high-frequency FAQs skip the whole retrieve+generate chain, saving money and latency. Mind invalidation—knowledge-base updates must expire the cache.
Retrieval sharding: 100M vectors won't fit one machine; shard by tenant/topic, route queries to relevant shards, avoid full-corpus scans (see Day 4 "Sharding").
Multi-tier model routing: simple questions to a small model, complex tasks to a large one, routing by difficulty to cut cost (see Day 32 "LLM Serving").
Incremental reindex: route doc updates through CDC, re-embed only changed chunks rather than rebuilding the whole corpus.
Eval loop: maintain a fixed golden set of "question → expected source", run regression on every retrieval/model change, holding quality with recall/answer-correctness instead of tuning by vibes.
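
For the eval loop, the simplest retrieval metric is recall@k over the golden set: the fraction of questions whose expected source chunk shows up in the top k results. A sketch, with a made-up golden set and a fake retriever standing in for the real pipeline:

```python
def recall_at_k(golden, retrieve, k=8):
    """Fraction of golden questions whose expected chunk is in the top k.

    golden:   list of (question, expected_chunk_id) pairs
    retrieve: callable returning a ranked list of chunk ids for a question
    """
    hits = sum(1 for question, expected in golden
               if expected in retrieve(question)[:k])
    return hits / len(golden)

# Hypothetical golden set and canned retrieval results.
golden = [("refund window?", "c12"), ("sso setup?", "c40"), ("data retention?", "c77")]
fake_results = {
    "refund window?": ["c12", "c3"],
    "sso setup?": ["c9", "c40"],
    "data retention?": ["c1", "c2"],
}
score = recall_at_k(golden, lambda q: fake_results[q], k=2)
print(f"recall@2 = {score:.2f}")
```

Running this in CI against every retrieval or embedding change gives a regression gate: a drop in recall@k blocks the rollout before users see worse answers.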

Common Pitfalls + Interview Follow-ups

"Blame model hallucination when recall is the problem": 80% of bad RAG answers are failures to retrieve the right chunk, not the model making things up. Measure retrieval recall first, then talk generation.
"Query and doc use different embedding models": similarity becomes noise, recall all wrong, and nothing alarms—the invisible killer noted above.
"Agent state only in memory": any deploy/autoscale event restarts every running task from scratch, tokens burned for nothing. Long tasks must be persisted.
"Agent infinite loop": no step cap, the model keeps calling tools without converging, blowing the budget. Need a hard cap + degrade to human.
"Irreversible high-risk action executed automatically": a wrong email or deletion can't be rolled back. Irreversible actions always go human-in-the-loop.
Follow-up: "Retrieval budget is only 300ms—what if hybrid+rerank can't finish?" → parallelize dense/sparse recall, rerank only top-N, use a distilled small cross-encoder, route hot queries to the semantic cache.
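
The "parallelize dense/sparse recall" answer can be sketched with asyncio: launch both legs at once, wait up to a budget, and continue with whatever finished. The two search coroutines are hypothetical stand-ins with artificial delays, and the 100 ms budget is an assumed slice of the 300 ms retrieval budget.

```python
import asyncio

async def dense_search(query):    # hypothetical stand-in for the ANN call
    await asyncio.sleep(0.01)
    return ["c7", "c2"]

async def sparse_search(query):   # hypothetical stand-in for BM25; slow here
    await asyncio.sleep(0.5)
    return ["c2", "c4"]

async def recall(query, budget_s=0.1):
    tasks = [asyncio.create_task(dense_search(query)),
             asyncio.create_task(sparse_search(query))]
    # Both legs run concurrently; stop waiting once the budget is spent.
    done, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()             # drop whichever leg blew the budget
    # Keep the original order of the legs among those that finished.
    return [task.result() for task in tasks if task in done]

results = asyncio.run(recall("refund policy"))
print(results)
```

Degrading to a single recall leg costs some quality on that request but keeps first-token latency inside the SLO, which is usually the better trade on the Q&A hot path.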

Deep-dive Resources

Anthropic — Building Effective Agents: workflow vs agent, when to use an agent, human checkpoints, and other practices.
Anthropic — Contextual Retrieval: contextual embedding + BM25 + rerank cuts retrieval failures by ~49%/67%.
Lewis et al. 2020, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS): the original RAG paper.
Designing Data-Intensive Applications (Kleppmann): the fundamentals of idempotency, state machines, and recoverable execution—the basis for durable agent execution.
This series Day 31 (Hybrid Search & Reranking), Day 32 (LLM Serving), Day 39 (Workflow Engines)—this issue is where those three converge for AI products.

Going Deeper

1. "Long-context models keep getting cheaper—stuff the whole knowledge base into context and RAG is obsolete." Where is this right and wrong?

Wrong. Even with a huge, cheap context window, RAG remains irreplaceable for three reasons: ① cost and latency stay linear in tokens—stuffing 100M chunks into every request explodes both cost and first-token latency; retrieval compresses it to top-8; ② lost in the middle—in very long contexts a model's use of mid-context information drops markedly, and putting relevant content up front (which is what retrieval does) is more accurate; ③ attributability and freshness—RAG naturally yields "this answer came from this chunk", and knowledge-base updates take effect instantly without retraining/reloading the model.
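
The scale gap in reason ① is easy to underestimate. A back-of-the-envelope calculation, using only the scenario's numbers (100M chunks, top-8 after rerank) and assuming the 512-token naive chunk size mentioned earlier applies to every chunk:

```python
CHUNKS = 100_000_000       # corpus size from the scenario
TOKENS_PER_CHUNK = 512     # assumed uniform chunk size
TOP_K = 8                  # chunks kept after rerank

full_corpus_tokens = CHUNKS * TOKENS_PER_CHUNK     # tokens if everything is stuffed in
rag_context_tokens = TOP_K * TOKENS_PER_CHUNK      # tokens with retrieval
ratio = full_corpus_tokens // rag_context_tokens

print(f"{full_corpus_tokens:,} vs {rag_context_tokens:,} tokens ({ratio:,}x)")
```

That is about 51 billion tokens per request against about 4 thousand, a factor of 12.5 million, so no plausible drop in per-token price closes the gap at this corpus size.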

The right part: retrieval granularity can get coarser—no need for 512-token chunks; you can retrieve whole sections/docs into a large window, reducing the "chunking strips context" problem. So the trend is RAG and long-context fusing (coarse recall + large-window reading), not one replacing the other.

2. Your agent succeeded at the "send email" tool on step 7, but the process crashed before step 8 persisted. How do you avoid sending the email twice on recovery?

The core is idempotency, not "remembering we already sent it". Durable execution recovers by replay: re-execute from the last successfully persisted step. If step 7's tool result was persisted, replay returns the cached result without re-running the tool—no duplicate email. The danger window is "tool succeeded but result not yet persisted before crash": replay will then call the tool again.

The defense is a tool-side idempotency key: every tool call carries idem_key=(workflow_id, step, tool, args_hash), the same tuple used in the agent loop of point ②, and the email service dedups on it, so the same key arriving twice returns the first result without actually sending. This is the same idea as the idempotent recovery points in Day 17 "Payment Systems": external side effects must be idempotent themselves; the orchestration engine's bookkeeping isn't enough, because a gap between side effect and persistence always exists.
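
On the receiving side, the dedup can be sketched as follows. IdempotentEmailService is a hypothetical in-memory model of the email tool; in a real service the seen-keys table would be durable and the check-and-insert would be atomic (for example via a unique constraint), which this sketch does not attempt.

```python
class IdempotentEmailService:
    """Toy email tool that sends at most once per idempotency key."""

    def __init__(self):
        self._seen = {}      # idem_key -> first result (durable store in practice)
        self.sent = []       # what actually left the system

    def send(self, idem_key, to, body):
        if idem_key in self._seen:
            return self._seen[idem_key]      # replayed call: return first result
        message_id = f"msg-{len(self.sent) + 1}"
        self.sent.append((to, body))
        self._seen[idem_key] = {"message_id": message_id}
        return self._seen[idem_key]

svc = IdempotentEmailService()
key = "wf-42:7:send_email:3f2a"              # hypothetical key from the workflow
r1 = svc.send(key, "a@example.com", "Contract comparison attached")
r2 = svc.send(key, "a@example.com", "Contract comparison attached")  # replay after crash
```

The replayed call gets the original message_id back, so the workflow can persist step 7's result and move on as if the crash never happened.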

3. The knowledge base ingests 100K docs/day, but you've switched embedding models and must re-embed the whole corpus (100M chunks, ~8 hours). What happens to user queries during the rebuild?

You can't take an outage, and you can't query a mixed old/new index (vector spaces are incompatible). The standard approach is dual index + atomic cutover: the old index (v2) keeps serving all queries; the new index (v3) is rebuilt in the background with the new model, while queries still go to v2 and are embedded with the v2 model (keeping the space consistent). Once v3 is built, run the golden eval set to confirm recall hasn't regressed, then atomically cut all traffic (including the query embedding model) to v3, and finally drop v2.

The hard part is incremental updates during the rebuild: at 100K docs/day, roughly 33K docs arrive in those 8 hours, and they must be written to both indexes (v2 with the old model, v3 with the new), i.e. dual-write, or v3 will be missing that data at cutover. Dual-write must handle failure consistency (retry/compensate if one write fails). Cost-wise, the rebuild is a one-off large offline job: use spot/batch GPUs to crush cost, and don't take from online resources.
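
A sketch of the dual-write with retry, under the assumption that each index is a plain dict and each embed function may raise ConnectionError on a transient failure. dual_write and the toy embedders are illustrative names; the returned list is what a repair or compensation job would consume.

```python
def dual_write(chunk_id, text, targets, max_attempts=3):
    """Write one chunk to every index, embedding with that index's own model.

    targets: list of (index_dict, embed_fn) pairs.
    Returns positions of targets that still failed after all retries.
    """
    failed = []
    for position, (index, embed) in enumerate(targets):
        for _ in range(max_attempts):
            try:
                index[chunk_id] = embed(text)
                break                        # this target is done
            except ConnectionError:
                continue                     # transient failure: retry
        else:
            failed.append(position)          # retries exhausted: needs repair
    return failed

v2_index, v3_index = {}, {}
attempts = {"v3": 0}

def embed_v2(text):                          # toy stand-in for the old model
    return [float(len(text))]

def embed_v3(text):                          # toy new model; fails once
    attempts["v3"] += 1
    if attempts["v3"] == 1:
        raise ConnectionError("transient")
    return [float(len(text)), 1.0]

failed = dual_write("c1", "hello", [(v2_index, embed_v2), (v3_index, embed_v3)])
```

Writes are upserts keyed by chunk_id, so retrying is safe. Anything left in failed has to be reconciled before cutover, otherwise v3 silently lacks that chunk.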

4. Why is "agent observability harder than traditional microservices"? What should you instrument?

The difficulty is non-determinism + multi-step + high cost. A traditional service has one trace per request and reproducible behavior; an agent task is a dynamically unfolding tree of dozens of steps that may take a different path next time on the same input, burning tokens at every step—so when something breaks you must answer not just "which step was slow/wrong" but "why did the model decide to go this way" and "how much did this step cost".

Instrument: ① a full trace per step (model input/output, which tool was chosen, tool latency and result); ② token and cost attribution by task/user/tool; ③ replayable decisions—store each step's prompt+response to debug "why this decision"; ④ quality signals—retrieval recall, whether answers carry citations, human approval pass rate, runaway rate (fraction hitting the step cap). This is the metrics/logs/traces trio of Day 21 "Observability", plus two extra dimensions: cost and decision.
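
Items ① and ② can share one record type. A sketch of a per-step trace and a token roll-up by any attribute follows; StepTrace, its fields, and the sample numbers are assumptions for illustration, and a production system would emit these as spans to a tracing backend rather than keep them in a list.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class StepTrace:
    task_id: str
    step: int
    tool: str                # "" when the step was a pure model call
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float

def tokens_by(traces, attr):
    """Sum prompt + completion tokens grouped by one StepTrace attribute."""
    totals = defaultdict(int)
    for trace in traces:
        totals[getattr(trace, attr)] += trace.prompt_tokens + trace.completion_tokens
    return dict(totals)

traces = [
    StepTrace("t1", 0, "", 1000, 200, 800.0),
    StepTrace("t1", 1, "search", 1500, 100, 950.0),
    StepTrace("t2", 0, "search", 500, 50, 400.0),
]
per_task = tokens_by(traces, "task_id")
per_tool = tokens_by(traces, "tool")
```

Multiplying these totals by the per-token price of the model used at each step gives cost attribution; grouping by task_id also makes runaway tasks stand out, because their token totals grow with every extra loop iteration.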

5. After launching a semantic cache (reuse answers for similar queries), hit rate is 30% and saves a lot—but two weeks later users complain about "stale / off-topic answers". What likely went wrong?

Two classic traps. ① Missing invalidation: the knowledge base updated, but the old answers in the semantic cache didn't expire, so users get conclusions based on old docs. Semantic-cache invalidation is harder than plain KV cache—you don't know which chunks a cached answer depended on, so either use short TTLs, bump by knowledge-base version, or record an answer's source chunks and invalidate targeted when those chunks update.

② Similarity threshold too loose: "off-topic" means queries that aren't similar enough were judged as hits. "What's our refund policy" and "what's our return policy" are close in vector space but have different answers. Too loose mismatches; too tight drops hit rate. Tune the threshold on real query pairs, and for high-risk/precise queries (money, policy, personal data) disable the semantic cache, enabling it only for broad FAQs. This is exactly Day 2 "Caching"—"how long can you tolerate stale" and "cache invalidation is hard"—replayed in an AI setting.
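
Both traps map onto two parameters of the cache. A minimal sketch, assuming a linear scan over entries (a real cache would use an ANN index), a cosine threshold of 0.95 chosen arbitrarily, and a knowledge-base version number as the coarse invalidation mechanism; SemanticCache and the sample vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold   # too low: off-topic hits; too high: few hits
        self.entries = []            # (query_vec, answer, kb_version)

    def put(self, query_vec, answer, kb_version):
        self.entries.append((query_vec, answer, kb_version))

    def get(self, query_vec, kb_version):
        best = None
        for vec, answer, version in self.entries:
            if version != kb_version:
                continue             # built on an older knowledge base: stale
            sim = cosine(query_vec, vec)
            if sim >= self.threshold and (best is None or sim > best[0]):
                best = (sim, answer)
        return best[1] if best else None

cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "30-day refund window", kb_version=7)
near = cache.get([0.9, 0.1], kb_version=7)    # close paraphrase
far = cache.get([0.6, 0.8], kb_version=7)     # different question
stale = cache.get([1.0, 0.0], kb_version=8)   # knowledge base has changed
```

Version bumping invalidates everything on any update, which is blunt but safe; recording each answer's source chunks and invalidating only the affected entries is the finer-grained alternative described above.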
