Build a "document Q&A + agent assistant" backend for 500K enterprise users: users ask questions, and the system answers by retrieving from an internal knowledge base; complex tasks ("compare these three contracts into a table and email it") go to an agent for multi-step execution. This is not "wrap an LLM API": it is simultaneously a retrieval system, a long-running task orchestrator, and an exercise in fencing non-deterministic model output inside controllable engineering boundaries.
Three main flows: ① the Q&A hot path runs "retrieve → fuse → rerank → stream" on a tight latency budget; ② the agent path runs on a durable execution engine, so tasks can run long and resume after interruption; ③ offline ingest turns documents into searchable vectors plus an inverted index, decoupled from online serving. Four key points below.
One-line trade-off: naive chunk-embedding is cheap but recalls poorly; contextual + hybrid + rerank recalls well but is costly and slower.
[Principle] The RAG bottleneck is almost never generation—it's retrieval recall. The model can only answer from the chunks you feed it; what isn't retrieved doesn't exist. The naive approach splits docs into 512-token chunks and embeds each in isolation, but chunking strips context: a passage says "this approach cut latency 40%", yet which approach, in which doc, is invisible from the isolated chunk—so semantic search misses. Anthropic's Contextual Retrieval prepends an LLM-generated "what this chunk is about" summary before embedding, and adds BM25 keyword retrieval for hybrid recall.
| Approach | Recall quality | Cost |
|---|---|---|
| Naive chunk embedding | Baseline | Cheapest; misses long-tail / proper nouns |
| Contextual + Hybrid + Rerank | Failure rate ↓ sharply | +1 LLM call per chunk at ingest + 1 rerank online |
| Stuff full corpus into context | No retrieval error | Token cost explodes, blows the window; impossible at 100M chunks |
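
The contextual step described above can be sketched as follows. This is a minimal illustration, not Anthropic's implementation: `Chunk`, `contextualize`, and `stub_describe` are hypothetical names, the document title is invented for the example, and `stub_describe` stands in for the one LLM call per chunk that the table prices in.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    doc_title: str
    text: str


def contextualize(chunks: List[Chunk], describe: Callable[[Chunk], str]) -> List[str]:
    """Return the strings that would be embedded and BM25-indexed.

    `describe` is a hypothetical hook for the per-chunk LLM call.
    """
    return [f"{describe(chunk)}\n\n{chunk.text}" for chunk in chunks]


def stub_describe(chunk: Chunk) -> str:
    # Placeholder for an LLM-written summary that situates the chunk (assumption).
    return f"From the document '{chunk.doc_title}'."


chunks = [Chunk("Cache redesign postmortem", "This approach cut latency 40%.")]
indexed_texts = contextualize(chunks, stub_describe)
```

The isolated sentence "This approach cut latency 40%." now carries its origin with it, so both the embedding and the keyword index can match a query that names the document or the approach.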
# Online retrieval: hybrid recall + rerank (pseudocode)
q_vec = embed(query) # vectorize query
dense = vdb.ann_search(q_vec, k=100) # semantic recall top-100
sparse = bm25.search(query, k=100) # keyword recall top-100
fused = rrf(dense, sparse) # reciprocal rank fusion, dedup
top = cross_encoder.rerank(query, fused)[:8] # keep 8 most relevant
context = "\n\n".join(c.text for c in top)
answer = llm.stream(prompt(query, context)) # stream w/ citations
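The `rrf` call in the pseudocode is left abstract. One possible sketch of reciprocal rank fusion is shown here, assuming each retriever returns chunk IDs ordered best-first. The signature (a list of rankings rather than two positional arguments) and the constant `k = 60`, a commonly used default, are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several best-first rankings into one, deduplicating chunk IDs."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)   # each list votes by rank only
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense = ["c3", "c1", "c7"]    # semantic recall, best first
sparse = ["c1", "c9", "c3"]   # keyword recall, best first
fused = rrf([dense, sparse])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale, which is the main reason this fusion is popular for hybrid recall.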
One-line trade-off: an in-memory loop is simple but loses everything on crash; durable execution is resumable but must persist every step.
[Principle] An agent is a loop: call model → model requests a tool → run the tool → feed result back → call model again, until done. The problem: this loop may run for minutes and dozens of steps, during which any step can hit a deploy restart, tool timeout, or model 503. If state lives only in memory, one crash restarts the whole task from scratch, after real tokens were already burned on every completed step. The fix is durable execution: persist each step's input/output as an event, and on crash replay from the last successful step. Completed steps return their persisted results instead of re-running, and idempotency keys protect the one step that may have been in flight when the crash hit. This is exactly the value of engines like Temporal / AWS Step Functions.
# Durable agent loop (durable workflow pseudocode)
def agent_workflow(task):
    history = [user_msg(task)]
    for step in range(MAX_STEPS):                  # hard cap against runaway
        resp = activity.call_llm(history)          # persisted; replay won't recompute
        if resp.is_final:
            return resp.text
        history.append(resp)                       # keep the model's tool request in context
        for call in resp.tool_calls:
            # idempotency key = (workflow_id, step, tool, args_hash)
            result = activity.run_tool(call, idem_key=key(step, call))
            history.append(tool_result(call, result))
    raise StepLimitExceeded                        # trigger degrade / handoff to human
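To make the replay behavior concrete, here is a toy in-memory simulation. It is a sketch of the idea only, not how Temporal or Step Functions are implemented: `Journal`, `workflow`, and `crash_before` are hypothetical names, and a real engine persists the journal to durable storage.

```python
from typing import Any, Callable, Dict, List, Optional


class Journal:
    """Stands in for the engine's persisted event history (in memory here)."""

    def __init__(self) -> None:
        self.events: Dict[str, Any] = {}

    def run(self, step_key: str, activity: Callable[[], Any]) -> Any:
        if step_key in self.events:
            return self.events[step_key]      # replay: reuse the recorded result
        result = activity()                   # first run: the side effect happens
        self.events[step_key] = result        # record it before moving on
        return result


tool_calls: List[str] = []                    # observable side effects


def workflow(journal: Journal, crash_before: Optional[int] = None) -> List[str]:
    results = []
    for step in range(3):
        if step == crash_before:
            raise RuntimeError("simulated crash")

        def activity(step: int = step) -> str:
            tool_calls.append(f"tool-{step}")
            return f"result-{step}"

        results.append(journal.run(f"step-{step}", activity))
    return results


journal = Journal()
try:
    workflow(journal, crash_before=2)         # steps 0 and 1 complete, then crash
except RuntimeError:
    pass
final = workflow(journal)                     # replay from the top with the same journal
```

After the second run, `tool_calls` holds each tool exactly once: steps 0 and 1 were served from the journal, and only step 2 actually executed.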
One-line trade-off: online real-time embedding is low-latency but pricey, batch embedding is high-throughput but delayed; the biggest trap is changing the model = re-embedding the whole corpus.
[Principle] The embedding service has two traffic classes: online (query vectorization, must be fast, cacheable) and offline (batch-embedding 100M chunks, needs throughput). They should be deployed separately: online for small low-latency batches, offline for large GPU-saturating batches. The key constraint is the same vector space—query and doc must be embedded by the same model, or vectors can't be compared. So once you upgrade the embedding model, the entire index must be rebuilt with the new model, with old and new coexisting during a gradual cutover—never mixed mid-flight (mixing utterly distorts similarity).
Tag every vector with its model_version and have the query side assert consistency.
# Offline ingest: batch embed + dual-index canary (pseudocode)
for batch in chunks.stream(size=256):           # large batch saturates GPU
    vecs = embed_offline(batch, model="v3")     # must be the same model the query side uses
    new_index.upsert(batch.ids, vecs, meta={"model": "v3"})
# On model change: build v3 in background → validate recall → atomic cutover → drop v2
# Online query side:
@cache(ttl="1h")                                # keyed on (q, model), so old-model vectors never survive a cutover
def _embed_cached(q, model):
    return embed_online(q, model=model)
def embed_query(q):
    return _embed_cached(q, current_index_model())
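The "same vector space" constraint can be enforced mechanically rather than by convention. The sketch below assumes a hypothetical `VersionedIndex` wrapper that refuses any vector whose model version differs from the index's own. A real vector database would store the version in metadata, and the brute-force dot-product search is only there to keep the example self-contained.

```python
from typing import Dict, List, Tuple


class ModelMismatch(Exception):
    """Raised when a vector comes from a different embedding model than the index."""


class VersionedIndex:
    def __init__(self, model_version: str) -> None:
        self.model_version = model_version
        self.vectors: Dict[str, List[float]] = {}

    def _check(self, model_version: str) -> None:
        if model_version != self.model_version:
            raise ModelMismatch(f"index is {self.model_version}, got {model_version}")

    def upsert(self, chunk_id: str, vec: List[float], model_version: str) -> None:
        self._check(model_version)
        self.vectors[chunk_id] = vec

    def search(self, query_vec: List[float], model_version: str, k: int = 3) -> List[str]:
        self._check(model_version)            # the query-side assertion

        def score(item: Tuple[str, List[float]]) -> float:
            return sum(q * d for q, d in zip(query_vec, item[1]))

        ranked = sorted(self.vectors.items(), key=score, reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]


index = VersionedIndex("v3")
index.upsert("c1", [1.0, 0.0], model_version="v3")
index.upsert("c2", [0.0, 1.0], model_version="v3")
hits = index.search([0.9, 0.1], model_version="v3", k=1)
```

A query embedded with "v2" against this index raises `ModelMismatch` instead of silently returning meaningless neighbors, which turns a quality regression into a loud error.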
One-line trade-off: synchronous blocking is simple but holds a connection/thread; async suspend scales but must persist the whole agent state.
[Principle] Models make mistakes, so irreversible high-risk actions (send email, transfer money, delete data, change prod config) can't be left to the agent's own call. The pattern is human-in-the-loop: when the agent reaches a high-risk tool, instead of executing it, it pauses, pushes "I intend to do X" to an approval queue, and continues only after approval. The hard part is that "wait for a human" may take hours or days—you can't have a thread/connection idle-waiting. So it must build on the durable execution of point ②: persist agent state, release all resources, and on the approval callback resume from the pause point.
# High-risk tool → suspend for approval (durable pseudocode)
def run_tool(call):
    if call.tool in HIGH_RISK:
        approval_id = approvals.create(call)       # push to approval queue
        decision = workflow.wait_for_signal(       # durable suspend, can wait days
            approval_id, timeout="3d")
        if decision != "approved":
            return refused(call)                   # don't execute if denied
    return execute(call)                           # execute only if approved/low-risk
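What "persist agent state, release all resources, resume on callback" might look like without a workflow engine is sketched below. Everything here is an assumption made for illustration: `DURABLE_STORE` is a dict standing in for a database table, and `suspend_for_approval` / `on_approval_callback` are hypothetical names. The resume step only records the outcome rather than running a real tool.

```python
import json
from typing import Any, Dict, List

DURABLE_STORE: Dict[str, str] = {}   # stands in for a database table (assumption)


def suspend_for_approval(workflow_id: str, pending_call: Dict[str, Any],
                         history: List[Any]) -> None:
    """Serialize everything needed to resume, then hold no thread or connection."""
    state = {"status": "awaiting_approval", "pending_call": pending_call, "history": history}
    DURABLE_STORE[workflow_id] = json.dumps(state)


def on_approval_callback(workflow_id: str, decision: str) -> Dict[str, Any]:
    """Reload the state hours or days later and continue from the pause point."""
    state = json.loads(DURABLE_STORE[workflow_id])
    if state["status"] != "awaiting_approval":
        raise ValueError(f"workflow {workflow_id} is not waiting for approval")
    call = state["pending_call"]
    # A real system would run the tool here when approved; this sketch only records it.
    outcome = "executed" if decision == "approved" else "refused"
    state["history"].append({"tool": call["tool"], "outcome": outcome})
    state["status"] = "running"
    state["pending_call"] = None
    DURABLE_STORE[workflow_id] = json.dumps(state)
    return state


suspend_for_approval("wf-1", {"tool": "send_email", "args": {"to": "legal@example.com"}}, history=[])
resumed = on_approval_callback("wf-1", decision="denied")
```

The status check also makes the callback safe against a double click on "approve": a second callback for the same workflow is rejected instead of executing the action twice.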
Wrong. Even with a huge, cheap context window, RAG remains irreplaceable for three reasons: ① cost and latency stay linear in tokens: stuffing 100M chunks into every request explodes both cost and first-token latency, while retrieval compresses the input to the top 8 chunks; ② lost in the middle: in very long contexts a model's use of mid-context information drops markedly, so putting relevant content up front (which is what retrieval does) is more accurate; ③ attributability and freshness: RAG naturally yields "this answer came from this chunk", and knowledge-base updates take effect as soon as the changed documents are re-indexed, without retraining or reloading the model.
The right part: retrieval granularity can get coarser—no need for 512-token chunks; you can retrieve whole sections/docs into a large window, reducing the "chunking strips context" problem. So the trend is RAG and long-context fusing (coarse recall + large-window reading), not one replacing the other.
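The "linear in tokens" argument is easy to put numbers on using only figures already given in this text: 512-token chunks, a 100M-chunk corpus, and a top-8 retrieval. The variable names are illustrative.

```python
CHUNK_TOKENS = 512            # chunk size used by the naive splitter
CORPUS_CHUNKS = 100_000_000   # corpus size from the table above
TOP_K = 8                     # chunks kept after rerank

full_corpus_tokens = CHUNK_TOKENS * CORPUS_CHUNKS   # tokens if everything is stuffed in
retrieved_tokens = CHUNK_TOKENS * TOP_K             # tokens after retrieval
ratio = full_corpus_tokens // retrieved_tokens

print(full_corpus_tokens)  # 51200000000
print(retrieved_tokens)    # 4096
print(ratio)               # 12500000
```

Whatever the per-token price and prefill speed are, retrieval cuts the per-request input by a factor of 12.5 million at this corpus size. Coarser retrieval (whole sections instead of chunks) gives some of that factor back in exchange for more context.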
The core is idempotency, not "remembering we already sent it". Durable execution recovers by replay: re-execute from the last successfully persisted step. If step 7's tool result was persisted, replay returns the cached result without re-running the tool—no duplicate email. The danger window is "tool succeeded but result not yet persisted before crash": replay will then call the tool again.
Defense is a tool-side idempotency key: every tool call carries idem_key=(workflow_id, step, tool, args_hash), and the email service dedups on it, so the same key arriving twice returns the first result without actually sending again. This is the same idea as the idempotent recovery points in Day 17 "Payment Systems": external side effects must be idempotent themselves; the orchestration engine's bookkeeping isn't enough, because a gap between side effect and persistence always exists.
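A minimal sketch of tool-side deduplication follows, assuming the key is built from the tuple in the agent-loop pseudocode (workflow_id, step, tool, args_hash). `EmailService` and `idem_key` are hypothetical. A real service would keep the key table in durable storage with a uniqueness constraint rather than in a dict.

```python
import hashlib
import json
from typing import Any, Dict, List


def idem_key(workflow_id: str, step: int, tool: str, args: Dict[str, Any]) -> str:
    # Canonical JSON so the same arguments always hash to the same value.
    args_hash = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()
    return f"{workflow_id}:{step}:{tool}:{args_hash}"


class EmailService:
    def __init__(self) -> None:
        self.sent: List[str] = []                 # recipients actually emailed
        self._receipts: Dict[str, str] = {}       # idempotency key -> first result

    def send(self, key: str, to: str, body: str) -> str:
        if key in self._receipts:
            return self._receipts[key]            # duplicate: return the first result
        self.sent.append(to)                      # the real side effect
        receipt = f"msg-{len(self.sent)}"
        self._receipts[key] = receipt
        return receipt


service = EmailService()
args = {"to": "cfo@example.com", "body": "contract comparison attached"}
key = idem_key("wf-42", 7, "send_email", args)

first = service.send(key, **args)
second = service.send(key, **args)   # replay after "succeeded but not persisted"
```

Both calls return the same receipt and `service.sent` contains one entry, so the replay in the danger window is harmless.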
You can't take an outage, and you can't query a mixed old/new index (vector spaces are incompatible). The standard approach is dual index + atomic cutover: the old index (v2) keeps serving all queries; the new index (v3) is rebuilt in the background with the new model, while queries still go to v2 and are embedded with the v2 model (keeping the space consistent). Once v3 is built, run the golden eval set to confirm recall hasn't regressed, then atomically cut all traffic (including the query embedding model) to v3, and finally drop v2.
The hard part is incremental updates during the rebuild: the 100K docs arriving in those 8 hours must be written to both indexes (v2 with the old model, v3 with the new), i.e. dual-write, or v3 will be missing that data at cutover. Dual-write must handle partial failure consistently (retry, then compensate if one of the two writes still fails). Cost-wise, the rebuild is a one-off large offline job: run it on spot/batch GPUs to keep cost down rather than taking capacity from online serving.
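One way to handle the "retry/compensate" requirement is a bounded retry per index plus a repair queue for whatever still failed. The sketch below makes several assumptions: `DualWriter` is a hypothetical name, each writer callable is assumed to embed with its own model before writing, and `ConnectionError` stands in for whatever transient error the index client raises.

```python
from typing import Callable, Dict, List, Tuple

Writer = Callable[[str, str], None]   # (doc_id, text) -> None


class DualWriter:
    def __init__(self, writers: Dict[str, Writer], max_attempts: int = 3) -> None:
        self.writers = writers
        self.max_attempts = max_attempts
        self.repair_queue: List[Tuple[str, str]] = []   # (index_name, doc_id) to backfill

    def write(self, doc_id: str, text: str) -> None:
        for index_name, writer in self.writers.items():
            if not self._write_with_retry(writer, doc_id, text):
                # Compensate later instead of failing the write to the healthy index.
                self.repair_queue.append((index_name, doc_id))

    def _write_with_retry(self, writer: Writer, doc_id: str, text: str) -> bool:
        for _ in range(self.max_attempts):
            try:
                writer(doc_id, text)
                return True
            except ConnectionError:
                continue
        return False


v2_store: Dict[str, str] = {}


def write_v2(doc_id: str, text: str) -> None:
    v2_store[doc_id] = text                       # old index accepts the write


def write_v3(doc_id: str, text: str) -> None:
    raise ConnectionError("v3 index unavailable")  # simulated outage


dual = DualWriter({"v2": write_v2, "v3": write_v3})
dual.write("doc-1", "new contract template")
```

The cutover gate would then include "repair queue for v3 is empty" alongside the golden eval set, so a document missed during the rebuild blocks the switch instead of disappearing.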
The difficulty is non-determinism + multi-step + high cost. A traditional service has one trace per request and reproducible behavior; an agent task is a dynamically unfolding tree of dozens of steps that may take a different path next time on the same input, burning tokens at every step—so when something breaks you must answer not just "which step was slow/wrong" but "why did the model decide to go this way" and "how much did this step cost".
Instrument: ① a full trace per step (model input/output, which tool was chosen, tool latency and result); ② token and cost attribution by task/user/tool; ③ replayable decisions—store each step's prompt+response to debug "why this decision"; ④ quality signals—retrieval recall, whether answers carry citations, human approval pass rate, runaway rate (fraction hitting the step cap). This is the metrics/logs/traces trio of Day 21 "Observability", plus two extra dimensions: cost and decision.
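Items ② and ④ above can be sketched as simple aggregations over per-step records. The record fields, the function names, and the prices in the usage lines are all placeholders chosen for the example, not real rates or a real tracing schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class StepRecord:
    task_id: str
    user: str
    step: int
    tool: Optional[str]
    input_tokens: int
    output_tokens: int
    hit_step_cap: bool = False


def cost_by(records: Iterable[StepRecord], dimension: str,
            price_in_per_1k: float, price_out_per_1k: float) -> Dict[str, float]:
    """Sum token cost grouped by any record field, e.g. 'user', 'tool', 'task_id'."""
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        cost = (r.input_tokens / 1000 * price_in_per_1k
                + r.output_tokens / 1000 * price_out_per_1k)
        totals[str(getattr(r, dimension))] += cost
    return dict(totals)


def runaway_rate(records: Iterable[StepRecord]) -> float:
    """Fraction of tasks that hit the step cap at least once."""
    records = list(records)
    tasks = {r.task_id for r in records}
    runaway = {r.task_id for r in records if r.hit_step_cap}
    return len(runaway) / len(tasks) if tasks else 0.0


records = [
    StepRecord("t1", "alice", 0, "search", 1000, 500),
    StepRecord("t1", "alice", 1, "send_email", 2000, 1000),
    StepRecord("t2", "bob", 0, "search", 500, 0, hit_step_cap=True),
]
by_user = cost_by(records, "user", price_in_per_1k=1.0, price_out_per_1k=2.0)
by_tool = cost_by(records, "tool", price_in_per_1k=1.0, price_out_per_1k=2.0)
rate = runaway_rate(records)
```

Keeping the raw per-step records and aggregating at query time means the same data answers "which user is expensive", "which tool is expensive", and "which tasks ran away" without separate counters.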
Two classic traps. ① Missing invalidation: the knowledge base was updated, but the old answers in the semantic cache didn't expire, so users get conclusions based on old docs. Semantic-cache invalidation is harder than plain KV-cache invalidation because, by default, you don't know which chunks a cached answer depended on. The options are short TTLs, keying the cache on a knowledge-base version that is bumped on every update, or recording each answer's source chunks and invalidating only the entries whose chunks changed.
② Similarity threshold too loose: "off-topic" means queries that weren't similar enough were still judged as hits. "What's our refund policy" and "what's our return policy" are close in vector space but have different answers. Too loose a threshold produces mismatches; too tight a threshold drops the hit rate. Tune the threshold on real query pairs, and for high-risk/precise queries (money, policy, personal data) disable the semantic cache, enabling it only for broad FAQs. These are the two lessons of Day 2 "Caching", "how long can you tolerate stale data" and "cache invalidation is hard", replayed in an AI setting.
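Both defenses fit in a small structure: a similarity threshold on lookup, and source-chunk tracking for targeted invalidation. This is a sketch under stated assumptions: `SemanticCache` is a hypothetical class, the linear scan would be an ANN lookup in practice, and the 0.95 threshold is a placeholder to be tuned on real query pairs.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Set


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


@dataclass
class Entry:
    query_vec: List[float]
    answer: str
    source_chunks: Set[str]       # chunks the answer was generated from


class SemanticCache:
    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.entries: List[Entry] = []

    def put(self, query_vec: List[float], answer: str, source_chunks: Set[str]) -> None:
        self.entries.append(Entry(query_vec, answer, source_chunks))

    def get(self, query_vec: List[float]) -> Optional[str]:
        best = max(self.entries, key=lambda e: cosine(query_vec, e.query_vec), default=None)
        if best is not None and cosine(query_vec, best.query_vec) >= self.threshold:
            return best.answer
        return None               # not similar enough: fall through to full RAG

    def invalidate_chunk(self, chunk_id: str) -> int:
        """Drop every cached answer that depended on an updated chunk."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if chunk_id not in e.source_chunks]
        return before - len(self.entries)


cache = SemanticCache(threshold=0.95)     # placeholder threshold (assumption)
cache.put([1.0, 0.0], "Refunds are issued within 14 days.", {"policy-c1"})

exact_hit = cache.get([1.0, 0.0])         # same query vector
near_miss = cache.get([0.8, 0.6])         # cosine is about 0.8, below the threshold
dropped = cache.invalidate_chunk("policy-c1")
after_update = cache.get([1.0, 0.0])
```

With the chunk dependency recorded, a knowledge-base update evicts exactly the affected answers and leaves the rest of the cache warm.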