DAY 37 / PHASE 4 · PRODUCTION

AI Observability

Trace / Span · Cost Attribution · Drift Detection · Online Eval

2026-06-16 · BigCat

Offline eval tells you whether the model passes a test set; observability tells you what it is doing on live traffic right now.

// WHY THIS MATTERS

When a traditional service breaks, logs + metrics + error codes pin it down. LLM agents are different: they can be wrong without erroring — 200 OK, normal tokens, normal latency, yet the answer is fabricated, the wrong tool was picked, or it took 8 turns to do a 2-turn job. None of that hits an error log. Worse, LLM systems have two failure sources traditional systems don't: non-determinism (same input, different output) and silent provider updates (you changed no code, the model's behavior changed). Observability turns that invisible internal state into data you can query, alert on, and attribute. This issue covers four things: modeling an agent run as a trace + span tree, attributing token cost to each step, detecting drift in both inputs and outputs, and running online evaluation on live traffic — instead of learning something broke only when users complain.

// 01

Trace / Span: An Agent's Debug Unit Isn't a Log Line

Claim: an agent run is a span tree, not a stream of logs. Without the tree shape, you can only guess what happened in the middle.

Background

An agent call chain is nested: a top-level user request (trace) wraps multiple LLM calls, each wrapping tool calls, each possibly wrapping sub-agents. Recording this with flat log lines flattens a tree into a column — you can never reconstruct "which tool's output caused the wrong decision next turn." Distributed tracing's span model fits natively: each operation is a span (with start/end/attributes), spans link by parent_id into a tree, and the whole tree is a trace.

Since 2024, OpenTelemetry's GenAI semantic conventions have standardized this: they specify which attributes an LLM call records (model, input/output tokens, prompt, tool calls, provider), so a LangChain agent's span carries the same attribute names as a raw OpenAI call and the two are comparable across frameworks and vendors. So don't invent your own field names; align with the gen_ai.* conventions, and pin the convention version you target, since the attribute set is still evolving. Langfuse and Arize Phoenix are both built on OTel.

trace: "user: refund my order from last week"   (2.4s, 11.2K tok, $0.04)
│
├─ span LLM:plan             0.6s   in 1.1K  out 180   [stop=tool_use]
│
├─ span tool:search_order    0.3s   → order_id=A123
│   └─ span db.query         0.2s   SELECT ... WHERE ...
│
├─ span LLM:decide           0.5s   in 2.4K  out 90    [stop=tool_use]
│
├─ span tool:refund          0.4s   ✗ ERROR: past refund window
│
└─ span LLM:respond          0.6s   in 3.0K  out 220   [stop=end_turn]
       ↑ model reads the tool error, pivots to tell user no refund ✓

← without this tree you only see "request OK, 220 tokens"; invisible: the refund failure AND the model's self-recovery
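The tree above can be rebuilt from flat span records because every span carries a parent_id. A minimal sketch of that linking step, standard library only; the Span class, the ids, and the helper names are hypothetical and stand in for whatever your tracing backend stores:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]          # None marks the root of the trace
    name: str
    duration_s: float
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link flat span records into a tree via parent_id and return the root."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("no root span found")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, depth-first."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_s:.1f}s)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Flat records as a backend might return them (hypothetical ids).
flat = [
    Span("t", None, "trace: refund request", 2.4),
    Span("a", "t", "LLM:plan", 0.6),
    Span("b", "t", "tool:search_order", 0.3),
    Span("c", "b", "db.query", 0.2),
    Span("d", "t", "LLM:decide", 0.5),
    Span("e", "t", "tool:refund", 0.4),
    Span("f", "t", "LLM:respond", 0.6),
]
tree_lines = render(build_tree(flat))
print("\n".join(tree_lines))
```

Drop the parent_id column and the same seven records become an ordered list in which db.query can no longer be tied to the tool that issued it.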

Example

Add spans to a homemade harness with the OTel SDK, aligning attribute names to the GenAI conventions:

from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

# client (assumed: an Anthropic-style SDK client), MODEL and PROMPT_VERSION are defined elsewhere
def llm_call(msgs, tools):
    with tracer.start_as_current_span("llm.generate") as sp:
        r = client.messages.create(model=MODEL, messages=msgs, tools=tools, max_tokens=2048)
        # align to OTel gen_ai.*, don't invent fields
        sp.set_attribute("gen_ai.request.model", MODEL)
        sp.set_attribute("gen_ai.usage.input_tokens",  r.usage.input_tokens)
        sp.set_attribute("gen_ai.usage.output_tokens", r.usage.output_tokens)
        # the convention's attribute is plural and takes a list of strings
        sp.set_attribute("gen_ai.response.finish_reasons", [r.stop_reason])
        sp.set_attribute("app.prompt_version", PROMPT_VERSION)  # custom: which prompt
        return r

The key is that span nesting must follow the agent's logical structure: is a tool call a sibling or a child of the llm span? Should a sub-agent's whole tree link back to the parent trace? Get these parent relationships right and the UI shows at a glance "7K of the 11.2K tokens went to a repeated search on turn three."

Failure mode: (1) Stuffing whole prompts/completions into span attributes — at high volume your trace storage explodes and you may leak user PII into the observability backend. In the OTel conventions, content capture is opt-in; by default record token counts and metadata only, and sample raw content. (2) Wrapping only the outermost layer in one span — that degenerates into a single log line, i.e. nothing.
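Failure mode (1) can be guarded in code: keep content capture off by default and redact before anything is attached to a span. A minimal sketch, assuming two crude regex patterns and hypothetical attribute names (app.prompt_redacted, app.completion_redacted); real PII detection needs far more than this:

```python
import re

# Crude illustrative patterns, not a complete PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{6,}\b")   # stand-in for phone / card / order numbers


def redact(text: str) -> str:
    """Replace e-mail addresses and long digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)


def content_attributes(prompt: str, completion: str, capture_content: bool = False) -> dict:
    """Extra span attributes for raw content; empty unless capture is opted in."""
    if not capture_content:
        return {}
    return {
        "app.prompt_redacted": redact(prompt),
        "app.completion_redacted": redact(completion),
    }


print(content_attributes("Refund order 12345678 for jo@example.com", "Done.", capture_content=True))
```

Token counts and metadata still go on every span; only the content attributes are gated.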

Going deeper · OpenTelemetry GenAI Observability, opentelemetry.io/blog · genai-observability · Langfuse Tracing data model, langfuse.com/docs · data-model

// 02

Cost & Token Attribution: Every Span Keeps a Ledger

Claim: "API cost was $4000 this month" is useless; "the refund agent's third-turn re-search ate 40% of cost" is actionable.

Background

LLM cost isn't uniform like CPU — it concentrates in a few expensive paths: a feature whose prompt missed cache, a query class that triggers an 8-turn agent loop, a long doc re-stuffed into context every time. Aggregate billing hides these; you need to attribute cost down the span tree onto dimensions: which feature, which user, which prompt version, which step. The span attributes from §1 feed this directly — each LLM span already records input/output tokens and model; add cache_read tokens plus business dimensions and you can slice cost any way you like.

Three derived metrics to watch most: cache hit rate (higher prefix-caching hits = more saved on repeated prefixes — see Day 15/16), turns per task (a spike usually means degraded tool descriptions or the model going in circles), and p95 of per-task cost (averages get dragged down by cheap requests; the long tail is what burns money).

Example

Record full cost fields on the span so the backend can aggregate by dimension:

# record enough to reconstruct cost (incl. cache hits)
u = r.usage
sp.set_attribute("gen_ai.usage.input_tokens",  u.input_tokens)
sp.set_attribute("gen_ai.usage.output_tokens", u.output_tokens)
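# cache fields can be missing or None depending on the SDK response; fall back to 0
sp.set_attribute("gen_ai.usage.cache_read_tokens",  getattr(u, "cache_read_input_tokens", None) or 0)
sp.set_attribute("gen_ai.usage.cache_write_tokens", getattr(u, "cache_creation_input_tokens", None) or 0)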
# business dims: the group-by keys for attribution
sp.set_attribute("app.feature", "refund_agent")
sp.set_attribute("app.user_tier", user.tier)

# backend query (pseudo-SQL): find the money sinks
# SELECT app.feature, SUM(cost), AVG(turns), PERCENTILE(cost, 0.95)
# FROM spans WHERE day = today GROUP BY app.feature ORDER BY 2 DESC

With that table, optimization has targets: low-cache-hit features get a fixed prompt prefix; high-turn features get their tool descriptions checked; p95-tail features get context trimming or model routing. Without attribution, all of this is guesswork.
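The pseudo-SQL above can be sketched in plain Python to show how the three derived metrics fall out of span records. Everything here is an assumption for illustration: the price table is hypothetical, the span dictionaries use shortened key names, and input_tokens is assumed to exclude cached tokens (check how your provider reports usage):

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical USD prices per million tokens; substitute your provider's rates.
PRICE = {"input": 3.00, "output": 15.00, "cache_read": 0.30, "cache_write": 3.75}


def span_cost(s: dict) -> float:
    """Cost of one LLM span in USD."""
    return (
        s["input_tokens"] * PRICE["input"]
        + s["output_tokens"] * PRICE["output"]
        + s["cache_read_tokens"] * PRICE["cache_read"]
        + s["cache_write_tokens"] * PRICE["cache_write"]
    ) / 1_000_000


def by_feature(spans: list[dict]) -> dict:
    """Group LLM spans by feature, then by task (trace), and summarise each feature."""
    tasks = defaultdict(lambda: defaultdict(list))      # feature -> trace_id -> spans
    for s in spans:
        tasks[s["feature"]][s["trace_id"]].append(s)
    report = {}
    for feature, traces in tasks.items():
        costs = [sum(span_cost(s) for s in ss) for ss in traces.values()]
        turns = [len(ss) for ss in traces.values()]     # one LLM span = one turn
        flat = [s for ss in traces.values() for s in ss]
        read = sum(s["cache_read_tokens"] for s in flat)
        prompt = read + sum(s["input_tokens"] + s["cache_write_tokens"] for s in flat)
        report[feature] = {
            "total_cost": sum(costs),
            "avg_turns": sum(turns) / len(turns),
            # quantiles() needs at least two data points
            "p95_cost": quantiles(costs, n=20, method="inclusive")[-1] if len(costs) > 1 else costs[0],
            "cache_hit_rate": read / prompt if prompt else 0.0,
        }
    return report


span = {"feature": "refund_agent", "input_tokens": 1000, "output_tokens": 100,
        "cache_read_tokens": 3000, "cache_write_tokens": 0}
spans = [dict(span, trace_id="t1"), dict(span, trace_id="t1"), dict(span, trace_id="t2")]
report = by_feature(spans)
print(report["refund_agent"])
```

Sorting the report by total_cost gives the same ranking as the ORDER BY in the query; the other three columns say which lever to pull.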

Failure mode: monitoring only total tokens / total spend, with no business dimension. When the bill doubles you know it's "expensive" but not where — and LLM cost anomalies are often a single feature's regression (someone removed a cache breakpoint, or turned a workflow into an agent), drowned out in the total.

Going deeper · Langfuse LLM Observability Overview (cost/usage tracking), langfuse.com/docs · observability · This series Day 16 Cost Engineering (token economics & caching)

// 03

Drift Detection: Did the Input Change, or the Model?

Claim: an LLM system can degrade without you changing a single line. Drift detection watches "something quietly changed" before users do.

Background

Offline eval is a one-time snapshot: test before release, ship if it passes. But production is alive and drifts in two directions:

Input / data drift: phrasing, topics, language distribution shift. Your prompt and few-shots were tuned for the old distribution; accuracy drops on the new one, but the inputs themselves throw no error. Detection borrows from classic ML: embed production inputs, compare them to a reference period, and compute PSI (Population Stability Index), a KS statistic, or Jensen-Shannon divergence. These measures work on scalar distributions, so apply them per embedding component or to a low-dimensional projection. Evidently's rule of thumb: PSI > 0.25 means significant drift.
Output / behavior drift: input distribution is stable but output changed — the most insidious kind, usually from a silent provider model update, or you changing the prompt/temperature. Monitor structural features of outputs (length, format-compliance rate, refusal rate, tool-selection distribution, JSON parse-failure rate); any sudden shift is a signal.

Keep the two axes separate: input drift tells you "time to add few-shots / re-test"; output drift while input is stable almost always means the model or config changed — the most dangerous failure mode of a hosted API you can't pin.

                  output stable        output drifted
                ┌──────────────────┬────────────────────────┐
 input stable   │ ✓ healthy        │ ⚠ model/config moved   │
                │                  │ (silent provider       │
                │                  │ update? prompt/temp    │
                │                  │ changed?)              │
                ├──────────────────┼────────────────────────┤
 input drifted  │ ⚠ new topics/    │ ⚠⚠ double drift        │
                │ usage flood,     │ attribute input        │
                │ re-test +        │ first, then residual   │
                │ add examples     │ output drift           │
                └──────────────────┴────────────────────────┘
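The matrix above maps directly onto a small triage function. A minimal sketch, assuming you already have one PSI score for inputs and one for outputs and that the same 0.25 threshold applies to both; the function name and return strings are hypothetical:

```python
def diagnose(input_psi: float, output_psi: float, threshold: float = 0.25) -> str:
    """Map an (input drift, output drift) pair onto the four quadrants."""
    input_drifted = input_psi > threshold
    output_drifted = output_psi > threshold
    if input_drifted and output_drifted:
        return "double drift: attribute input first, then the residual output drift"
    if input_drifted:
        return "input drift: re-test and add examples"
    if output_drifted:
        return "output drift: check for a model or config change"
    return "healthy"


print(diagnose(input_psi=0.05, output_psi=0.40))
```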

Example

import numpy as np
def psi(ref, cur, bins=10):
    # ref/cur: a scalar distribution for reference vs current period
    # (e.g. an embedding component, or answer length)
    q = np.quantile(ref, np.linspace(0, 1, bins+1))
    q[0], q[-1] = -np.inf, np.inf
    r = np.histogram(ref, q)[0] / len(ref) + 1e-6
    c = np.histogram(cur, q)[0] / len(cur) + 1e-6
    return float(np.sum((c - r) * np.log(c / r)))

# daily compare: embedding projection / answer length / refusal rate
score = psi(ref_lengths, today_lengths)
if score > 0.25:                       # Evidently rule of thumb
    alert(f"output length drift PSI={score:.2f}, check model version / prompt change")
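For reference, the quantity the psi function computes, written with the same names as the code (r_i and c_i are the reference and current proportions in bin i), followed by a two-bin worked example with made-up proportions:

```latex
\mathrm{PSI} = \sum_{i=1}^{B} (c_i - r_i)\,\ln\frac{c_i}{r_i}

% Worked example, B = 2, r = (0.5, 0.5), c = (0.8, 0.2):
\mathrm{PSI} = (0.8 - 0.5)\ln\frac{0.8}{0.5} + (0.2 - 0.5)\ln\frac{0.2}{0.5}
             = 0.3\ln 1.6 + 0.3\ln 2.5
             = 0.3\ln 4 \approx 0.416
```

Both terms are non-negative whichever way a bin moves, so PSI only grows with the shift; 0.416 is above the 0.25 threshold and would fire the alert.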

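The cheapest practical "canary": maintain a set of fixed probe prompts (20–50 whose inputs never change), run them on a daily schedule with temperature and the other sampling parameters pinned, and log the outputs. Sampling can still vary from run to run, so compare structural features (length, format compliance, refusal rate) rather than exact text. With inputs constant, a sustained shift in those features points straight at the model or config, which makes the probe set a sentinel for silent provider updates.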
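A probe run reduces to "compute a few structural features, compare with the baseline". A minimal sketch under stated assumptions: probe outputs are expected to be JSON, the two features and the 20% relative tolerance are arbitrary illustrative choices, and calling the model is left out entirely:

```python
import json
from statistics import mean


def parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:          # json.JSONDecodeError is a ValueError
        return False


def features(outputs: list[str]) -> dict:
    """Structural features of one day's probe outputs."""
    return {
        "mean_length": mean(len(o) for o in outputs),
        "json_ok_rate": mean(1.0 if parses_as_json(o) else 0.0 for o in outputs),
    }


def changed(baseline: dict, today: dict, tolerance: float = 0.2) -> list[str]:
    """Names of features whose relative change from baseline exceeds tolerance."""
    flagged = []
    for name, base in baseline.items():
        denom = abs(base) if base else 1.0
        if abs(today[name] - base) / denom > tolerance:
            flagged.append(name)
    return flagged


baseline = features(['{"a": 1}', '{"b": 2}'])
today = features(['{"a": 1}', 'Sure! Here is the JSON you asked for: {"b": 2}'])
print(changed(baseline, today))
```

With enough probes per day the same feature values can be fed into the psi function instead of a fixed tolerance.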

Failure mode: (1) Computing drift only in embedding space without tying it to business metrics — PSI fires but quality didn't drop, just noise; drift signals are only meaningful when coupled with downstream quality (online eval scores, user feedback). (2) Wrong reference period: using an anomalous distribution (a promo day) as baseline, then false-alarming forever. Pick a "known healthy," representative window.

Going deeper · Evidently 5 methods to detect embedding drift, evidentlyai.com/blog · embedding-drift · Evidently open-source library, github.com/evidentlyai/evidently

// 04

Online Eval: Scoring Real Traffic Continuously

Claim: an offline golden set measures "representative past samples"; online eval measures "live traffic right now" — complementary, neither sufficient alone.

Background

Day 6 / Day 29 covered offline eval: golden set + LLM-as-judge, run before release. The problem: a golden set always lags the real distribution and can't cover production's long-tail cases. Online evaluation fills that gap: sample production traces, score them with evaluators in (near-)real-time, and write the scores back as a metric you can alert on, dashboard, and slice. Arize Phoenix calls this online evals — attach an evaluator to incoming traces so each live execution gets scored as it runs.

Three evaluator types, stacked by increasing cost and decreasing coverage:

Rules / heuristics (cheapest, full traffic): is the JSON parseable, are required fields present, did it hit a refusal template, is the length anomalous. Sub-millisecond, run on 100% of traffic.
LLM-as-judge (expensive, sampled): relevance, faithfulness (is the answer supported by the retrieved evidence), is it off-topic. Run only on a small sample of traffic (5% in the example below) to control cost.
Human labels / user feedback (most expensive, sparsest): 👍/👎, manual spot review. Serves as the calibration baseline for the LLM judge — the judge itself drifts and must be re-aligned with humans periodically.

The key engineering constraint: online eval must be async and off the hot path — never in the user's response critical path. Scoring runs in a background worker after the trace lands; it can be slow, but it must not slow the product.
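The off-hot-path constraint can be illustrated with a bounded in-process queue and a background thread. This is only a sketch of the shape: in production the queue would more likely be a message broker or the trace backend itself, and handle_request, scorer, and the stand-in rule check are hypothetical names:

```python
import queue
import threading

trace_queue = queue.Queue(maxsize=1000)
scores: list[tuple[str, bool]] = []


def handle_request(trace_id: str, output: str) -> str:
    """Hot path: hand the finished trace to the scorer without ever blocking."""
    try:
        trace_queue.put_nowait((trace_id, output))
    except queue.Full:
        pass    # drop the eval sample rather than slow the user response
    return output


def scorer() -> None:
    """Background worker: scoring may be slow, but it never touches the response."""
    while True:
        item = trace_queue.get()
        if item is None:        # sentinel: shut down
            break
        trace_id, output = item
        scores.append((trace_id, bool(output.strip())))   # stand-in rule: non-empty answer


worker = threading.Thread(target=scorer, daemon=True)
worker.start()
handle_request("t1", "Refund is not possible: past the refund window.")
handle_request("t2", "   ")
trace_queue.put(None)
worker.join()
print(scores)
```

The one design choice that matters is put_nowait: when the scorer falls behind, evaluation coverage degrades and user latency does not.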

Example

# off-path worker: consume from trace queue, tier-score, write metrics
def score_trace(trace):
    out = trace.final_output
    # 1) rule tier: full traffic, sub-ms
    emit_metric("eval.json_valid", is_valid_json(out), trace.id)
    emit_metric("eval.refused",   hit_refusal(out),    trace.id)
    # 2) LLM-judge tier: sample 5% to control cost
    if sample(0.05):
        s = judge(question=trace.input, answer=out,
                  context=trace.retrieved_docs,
                  rubric="Is the answer fully supported by context? 1-5")
        emit_metric("eval.faithfulness", s, trace.id)
        if s <= 2:
            alert(f"low-faithfulness answer trace={trace.id}")

Put the p50 of eval.faithfulness on a time-series dashboard; a sustained drop is an early signal of a quality regression, often visible before users complain. Overlay the drift axes from §3: if faithfulness drops while the input distribution is stable, the finger points at a model or retrieval change.

Failure mode: (1) Treating the LLM judge as ground truth — judges are biased (prefer longer answers, their own model family, position bias, see Day 6), and must be calibrated against human labels periodically, or you're measuring drift with a ruler that drifts. (2) Running the LLM judge on full traffic — eval cost can exceed inference cost itself; sampling is mandatory.
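Calibrating the judge against human labels needs an agreement number to track. One common choice is Cohen's kappa, which corrects raw agreement for chance; the source does not prescribe a specific statistic, so treat this as one reasonable option. The sketch assumes both sides have been reduced to the same discrete labels (here 1 = pass, 0 = fail):

```python
from collections import Counter


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    j_counts, h_counts = Counter(judge), Counter(human)
    # Agreement expected if both raters labelled at random with their own base rates.
    expected = sum(j_counts[k] * h_counts[k] for k in j_counts) / (n * n)
    if expected == 1.0:
        return 1.0      # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


judge_labels = [1, 1, 1, 0]
human_labels = [1, 0, 1, 0]
print(cohens_kappa(judge_labels, human_labels))
```

Track the value per calibration batch; a falling kappa means the ruler has moved, which is the cue to revisit the judge prompt before trusting the faithfulness trend.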

Going deeper · Arize Phoenix Evaluation / Online Evals, arize.com/docs/phoenix · llm-evals · This series Day 6 Eval Engineering (LLM-as-judge debiasing) · Day 29 Eval Beyond Benchmark

// PUTTING IT TOGETHER · Bolt observability onto your agent (one weekend)

Wire the four points into a minimal but complete observability chain on your own agent:

Trace it (§1): wrap the agent loop with the OpenTelemetry SDK, one span per LLM call / tool call, attributes aligned to gen_ai.*. Self-host Langfuse or Phoenix as the backend (one Docker line to start).
Attribute cost (§2): record input/output/cache tokens + app.feature + app.user_tier on each LLM span. Build a "cost + avg turns + p95 by feature" dashboard.
Probe drift (§3): pick 30 fixed probe prompts, run them daily, log output length / format-compliance / refusal rate, compute PSI, alert at >0.25. This is your sentinel for silent provider updates.
Online eval (§4): an off-path worker runs the rule tier on full traffic and an LLM-judge faithfulness check on 5% sampled traffic; scores go back to the backend. Put faithfulness p50 on a time series.
One linked board: cost, turns, drift PSI, faithfulness on one chart. After a week you'll "see" your agent's real running state for the first time — before this it was a black box.

Doing this builds an instinct: LLM observability isn't adding monitoring to an LLM, it's turning the three previously-invisible quantities — what the model is thinking, how much it cost, and whether it got worse — into data. Without it, every judgment about a production agent is a guess.

// GLOSSARY

Span: A single operation unit in a trace (with start/end/attributes). One LLM call or tool call = one span.
Trace: A span tree linked by parent_id, representing one full request's execution path. The agent's debug unit.
OTel GenAI Semantic Conventions: OpenTelemetry's standard attribute set (gen_ai.*) for LLMs/agents, comparable across frameworks and vendors.
Cost Attribution: Attributing token cost down the span tree onto dimensions (feature / user / step) to locate money sinks.
Input / Data Drift: Production input distribution shifting from the reference period. Detected with PSI / KS / JS divergence.
Output / Behavior Drift: Input stable but output changed, often from a silent provider model update or prompt/config change.
PSI: Population Stability Index, measuring distribution shift between two periods. Rule of thumb: PSI > 0.25 = significant drift.
Probe Prompt: A fixed-input probe set run daily to isolate model/config-side change (a canary sentinel).
Online Eval: Sampling and scoring production traffic in real time, distinct from a pre-release offline golden set.
Faithfulness: Whether an answer is supported by the retrieved evidence — a core metric of RAG online eval.

// DEEP THINKING

Is the traditional APM (Datadog/Prometheus) metric-log-trace triad enough for LLM systems out of the box? What's missing?

The skeleton is enough — the trace model already fits agent nesting, which is why OTel could extend into GenAI conventions. But three LLM-specific things are missing: (1) semantic quality — APM only tracks "success/failure/latency," LLMs need "is the answer correct," which requires online eval APM doesn't have; (2) non-determinism — same input, different output; a single trace can't characterize it, you must look at distributions; (3) content dimension — prompt/completion are core signals but also PII risk, needing sampling + redaction. So it's "APM as foundation + an LLM-specific layer on top."

Output drift's likeliest root cause is a silent provider update. But even if you pin the model version it can drift — why?

Because pinning a version number doesn't pin many things: (1) under the same version, the provider may tweak system-level safety/format post-processing; (2) your own RAG retrieval corpus keeps updating, so the context fed to the model changes; (3) temperature, max_tokens etc. get changed; (4) the dynamic content embedded in your prompt template (dates, user profile) shifts distribution. This is exactly why probe prompts must fix the parameters and retrieval too, to collapse the variable down to "pure model side." The value of observability is helping you do this attribution bisection.

Online eval scores with an LLM-as-judge, but the judge is itself a drifting LLM. Isn't this turtles-all-the-way-down unreliability?

It's a real problem but not an infinite loop. The fix is layered anchoring: the judge isn't ground truth, it's an amplifier — it scales sparse human labels onto large traffic. So (1) periodically calibrate the judge against a small batch of human labels (compute judge-vs-human agreement; if it drops, redo the judge prompt); (2) use a stronger or at least different-family model for the judge to reduce same-source bias; (3) critical decisions (e.g. retiring a prompt version) can't rest on the judge alone — go back to humans / offline golden set. The judge is good for "trend alerts," not "final verdicts."

The finer cost attribution gets (per-user / per-request), the higher the observability backend's own storage/compute cost. How to weigh this trade-off?

Use tiered sampling + tiered retention. Metrics (token counts, turns, cost) are low-cardinality aggregates: keep them all, they're cheap. Full traces (prompt/completion) are high-cardinality and bulky: keep a sampled fraction (all error traces + 1% of normal) with a short retention. Tier the attribution dimensions too: low-cardinality dims like feature/tier go onto metrics in full; high-cardinality dims like user_id stay only on sampled traces. Core principle: aggregate metrics complete, raw content sampled. Most insight comes from aggregates; raw content only matters when debugging a single trace.
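The "all error traces + 1% of normal" rule is a few lines once the sampling decision is made deterministic. A minimal sketch, assuming trace ids are strings and that hashing the id is an acceptable way to pick the sample; the function name and the 1% default are illustrative:

```python
import hashlib


def keep_full_trace(trace_id: str, is_error: bool, normal_rate: float = 0.01) -> bool:
    """Retention decision for raw trace content: every error, plus a fixed share of the rest."""
    if is_error:
        return True
    # Hash the id so every service reaches the same decision for the same trace.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < normal_rate * 10_000


kept = sum(keep_full_trace(f"trace-{i}", is_error=False) for i in range(10_000))
print(f"kept {kept} of 10000 normal traces")
```

Hashing rather than calling a random generator keeps a trace either fully retained or fully dropped across all the spans and services it touches.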

If an agent's observability data shows it "took 8 turns to do a 2-turn job" but the final result was correct — is that a bug? Should it alert?

It's an "efficiency bug": it belongs on a dashboard but not necessarily in a real-time alert. A correct result means it isn't broken, but 8 turns means at least 4× the cost and latency (more when each turn re-sends a growing context), and it exposes fragility (a few more turns and it could hit max_iters and fail). Handling: make "turns per task" a distribution monitor and alert when p95 spikes (usually pointing at degraded tool descriptions, context pollution, or a model update). Don't alert per instance; alert on the trend. This is exactly the extra dimension observability has over error logs: it catches "didn't error but got dumber," a degradation traditional monitoring is blind to.
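Alerting on the trend rather than the instance can be as simple as comparing today's p95 of turns per task with a baseline window. A minimal sketch; the 1.5× ratio is an arbitrary illustrative threshold and the sample data is made up:

```python
from statistics import quantiles


def p95(values: list[float]) -> float:
    """95th percentile; needs at least two data points."""
    return quantiles(values, n=20, method="inclusive")[-1]


def turns_alert(baseline_turns: list[float], today_turns: list[float], ratio: float = 1.5) -> bool:
    """True when today's p95 turns-per-task exceeds the baseline p95 by the given ratio."""
    return p95(today_turns) > ratio * p95(baseline_turns)


baseline_turns = [2] * 95 + [3] * 5      # healthy week: almost every task takes 2 turns
today_turns = [2] * 80 + [8] * 20        # today: a fifth of tasks loop to 8 turns
print(turns_alert(baseline_turns, today_turns))
```

A single 8-turn task would not move the p95 of a day's traffic; a fifth of tasks doing so does, which is the distinction the answer above draws.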

// FURTHER READING

OpenTelemetry · GenAI Observability — standard span semantics for LLMs/agents
Langfuse · Tracing data model — open-source LLM observability built on OTel
Arize Phoenix · Evaluation — online eval and quality scoring on production traces
Evidently · Embedding Drift Detection — 5 methods for embedding drift, and PSI
evidentlyai/evidently — open-source ML / LLM monitoring framework, 100+ metrics