Day 21 Medium Observability Metrics/Logs/Traces SLO OpenTelemetry

Observability: When the System Breaks, Can You Ask "Why"?
Metrics, Logs, Traces, OpenTelemetry & SLO-driven On-call

Problem Scenario + Requirements

Design an observability platform serving 1000+ microservices and a million spans per second. One morning p99 latency jumps from 80ms to 2s and users start complaining — you must pinpoint which service, which dependency, which class of request broke, within 5 minutes. This isn't "add a few dashboards"; it's a system whose data volume rivals the business itself.

Monitoring answers "is the system healthy?" (known problems); observability answers "why isn't it?" (unknown problems). The former is predefined dashboards and alerts; the latter is the ability to slice and drill into any dimension during an incident.

High-Level Architecture

graph LR
    APP["Service + OTel SDK<br/>instrument / propagate ctx"]
    COL["OTel Collector<br/>batch · sample · route"]
    M["Metrics TSDB<br/>Prometheus / Mimir"]
    T["Trace store<br/>Tempo / Jaeger"]
    L["Log store<br/>Loki / ES"]
    Q["Query + correlate<br/>Grafana"]
    A["Alert engine<br/>Alertmanager"]
    OC["On-call<br/>PagerDuty"]
    APP -->|OTLP| COL
    COL --> M
    COL --> T
    COL --> L
    M --> Q
    T --> Q
    L --> Q
    M --> A -->|burn rate exceeded| OC
    classDef svc fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef pipe fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef store fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef sink fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class APP svc
    class COL,Q pipe
    class M,T,L store
    class A,OC sink

SDK instruments uniformly → Collector decouples app from backend via sampling/routing → three signals cross-link via trace_id / exemplar

Key design: apps depend only on the OTel SDK + Collector, never on a specific backend vendor (avoiding lock-in). The Collector is the control point — sampling, rate limiting, PII scrubbing, downsampling all happen here, so changing the sampling policy doesn't require redeploying hundreds of services.

Key Technical Points

1. The Three "Pillars" Metrics / Logs / Traces — not three tools, but three slices of one dataset

Principle: Metrics are pre-aggregated numeric time series (counter/gauge/histogram) — cheap, ideal for dashboards and alerts, but only the dimensions you predefined. Logs are discrete events — richest information but expensive to search and hard to correlate without structure. Traces stitch one request's cross-service causal chain into a span tree, answering "which hop is slow." Their fundamental difference is cardinality tolerance: adding a high-cardinality label (like user_id) to a metric explodes time series into the millions; traces/structured-logs are built for high cardinality.

Trade-off: each signal trades cost against detail, so the right move is to chain them: use a metric to spot the anomaly → use an exemplar to jump to a representative trace → use trace_id to pull all logs for that request. The three pillars are stitched together by trace_id rather than kept as three isolated silos.
Real-world:

2. Distributed Tracing & OpenTelemetry — context propagation + the sampling dilemma

Principle: for a request through A→B→C, attributing all spans to one tree relies on context propagation: the entry point generates a trace_id, each hop creates a span_id and records a parent_id, passed downstream via HTTP headers (W3C traceparent). OpenTelemetry (OTel) unifies the API/SDK/protocol (OTLP), decoupling instrumentation from backend. The core hard problem is sampling: storing every span at 1M/s is bankruptcy, but sampling may drop the very failing trace.
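The propagation step can be illustrated without any SDK. The following is a minimal sketch, assuming the W3C traceparent layout `version-trace_id-parent_id-flags`; the function names and the dict-based context are hypothetical and are not the OpenTelemetry API.

```python
import re
import secrets

# version (2 hex) - trace_id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)

def new_root_context(sampled=True):
    """Entry point: generate a fresh trace_id and the root span_id."""
    return {
        "trace_id": secrets.token_hex(16),  # 32 hex characters
        "span_id": secrets.token_hex(8),    # 16 hex characters
        "parent_id": None,
        "sampled": sampled,
    }

def inject(ctx):
    """Serialize the current span's context into a traceparent header value."""
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{ctx['span_id']}-{flags}"

def extract_and_start_child(header):
    """Parse an incoming traceparent and start a child span under it."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return new_root_context()  # malformed header: start a new trace
    _version, trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,               # unchanged across every hop
        "span_id": secrets.token_hex(8),    # new id for this hop
        "parent_id": parent_span_id,        # links this span to the caller
        "sampled": (int(flags, 16) & 1) == 1,
    }

# A -> B -> C: each hop reads the header, starts a span, and re-injects.
span_a = new_root_context()
span_b = extract_and_start_child(inject(span_a))
span_c = extract_and_start_child(inject(span_b))
```

All three spans share one trace_id, and each parent_id points at the caller's span_id, which is what lets a backend rebuild the span tree.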

Head vs Tail sampling:
# OTel Collector tail_sampling policy (sketch; policy names are illustrative)
processors:
  tail_sampling:
    decision_wait: 10s          # wait for a trace's spans to gather
    policies:                   # a trace is kept if any policy matches
      - name: keep-errors       # keep all errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow         # keep all slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-rest       # sample the rest at 1%
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
# ⚠️ With multiple Collectors, spans of the same trace_id must route
#    to the same instance (load-balancing exporter hashes by trace_id),
#    otherwise they can't be gathered.
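The decision logic behind that config can be sketched in a few lines. This is an illustrative approximation only: the `Span` type is hypothetical, trace latency is approximated by the longest span, and the probabilistic step uses a random draw, which may differ from how the Collector actually implements it.

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool

def keep_trace(spans, threshold_ms=1000, sampling_percentage=1.0,
               rng=random.random):
    """Decide for one fully gathered trace; any matching policy keeps it."""
    if any(span.is_error for span in spans):
        return True                                  # status_code policy
    if max(span.duration_ms for span in spans) >= threshold_ms:
        return True                                  # latency policy
    return rng() * 100 < sampling_percentage         # probabilistic policy

error_trace = [Span("t1", 12.0, False), Span("t1", 8.0, True)]
slow_trace = [Span("t2", 1500.0, False)]
normal_trace = [Span("t3", 20.0, False)]
```

The point of deciding after the spans are gathered is visible here: `error_trace` and `slow_trace` are always kept, while only `normal_trace` is subject to the 1% draw.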
Real-world:

3. SLI / SLO / SLA & Error Budget — turning "reliability" into a budget you can decide on

Principle: SLI (indicator) = good requests / total, e.g. "fraction of requests under 200ms." SLO (objective) = an internal threshold on the SLI, e.g. "99.9% monthly." SLA (agreement) = external promise + penalty, usually looser than the SLO (a buffer). The key invention is error budget = 1 − SLO: 99.9% means you're allowed to be "broken" for 43 minutes a month. This turns reliability from a "higher is always better" shouting match into a budget you can spend — under budget, ship boldly (innovation); over budget, freeze releases and fix stability. It aligns the incentives of product teams and SRE.
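The budget arithmetic is simple enough to check by hand. A small sketch, assuming a 30-day window; the function names are made up for illustration.

```python
def error_budget_minutes(slo, days=30):
    """Allowed 'broken' minutes in the window for a time-based SLO."""
    return (1 - slo) * days * 24 * 60

def budget_remaining(slo, good, total):
    """Fraction of a request-based error budget that is still unspent."""
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad

def can_ship(slo, good, total):
    """Release policy from the text: ship while budget remains, else freeze."""
    return budget_remaining(slo, good, total) > 0

print(round(error_budget_minutes(0.999), 1))        # minutes per 30 days
print(budget_remaining(0.999, 999_500, 1_000_000))  # half the budget spent
```

With a 99.9% SLO and 1,000,000 requests, 1,000 failures are allowed; 500 failures leave about half the budget, and more than 1,000 flips `can_ship` to false.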

Alerting with error budget: static thresholds vs burn rate
# Multi-window burn-rate alert (PromQL idea)
# Page only when both 1h and 5m windows exceed 14.4x (debounce)
(
  error_budget_burn_rate{window="1h"} > 14.4
  and
  error_budget_burn_rate{window="5m"} > 14.4
)
# burn_rate = (1 - SLI) / (1 - SLO)
#   = actual error rate / budgeted error rate;  >1 means overspending
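To see where a threshold like 14.4 comes from, the formula above can be evaluated directly. This is a sketch under the assumption of a 30-day SLO period; the helper names are hypothetical.

```python
def burn_rate(sli, slo):
    """Actual error rate divided by budgeted error rate."""
    return (1 - sli) / (1 - slo)

def should_page(sli_1h, sli_5m, slo, threshold=14.4):
    """Page only if both the long and the short window are burning fast."""
    return (burn_rate(sli_1h, slo) > threshold
            and burn_rate(sli_5m, slo) > threshold)

def budget_consumed(rate, window_hours, period_hours=30 * 24):
    """Share of the whole period's budget spent at this rate in the window."""
    return rate * window_hours / period_hours

print(budget_consumed(14.4, 1))           # share of a 30-day budget in 1 hour
print(should_page(0.98, 0.98, 0.999))     # still failing: page
print(should_page(0.98, 0.9995, 0.999))   # already recovered: no page
```

A burn rate of 14.4 sustained for one hour spends 2% of a 30-day budget. The 5-minute window acts as the reset: once the short window recovers, the alert stops firing even though the 1-hour average is still high.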
Real-world:

4. Metric Aggregation Pitfalls — why you can't average p99

Principle: percentiles are not re-aggregatable. Averaging the p99 of ten machines does NOT yield the global p99 — it's mathematically wrong. The correct way: clients emit a histogram (cumulative bucket counts), and you sum bucket counts across machines, then apply histogram_quantile() at query time. The summary type computes quantiles on the client, so it cannot be merged across instances. This is the core reason Prometheus recommends histogram > summary.
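A small numeric experiment makes the gap concrete. The sketch below assumes hypothetical bucket bounds and synthetic latencies (nine healthy machines, one degraded), and re-implements a simplified, linearly interpolated quantile estimate in the spirit of histogram_quantile(); it is not Prometheus's code.

```python
import math

# Hypothetical bucket upper bounds in ms; the last bucket is +Inf.
BOUNDS = [50, 100, 250, 500, 1000, 2500, math.inf]

def to_cumulative(latencies_ms):
    """Cumulative bucket counts, like a histogram's per-bucket series."""
    return [sum(1 for x in latencies_ms if x <= b) for b in BOUNDS]

def histogram_quantile(q, cumulative):
    """Estimate a quantile by linear interpolation inside the target bucket."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(BOUNDS[i]):
                return BOUNDS[i - 1]  # cannot interpolate into +Inf
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            return lower + (BOUNDS[i] - lower) * (rank - prev) / (count - prev)

fast = [40.0] * 1000                        # healthy machine
slow = [40.0] * 800 + [2000.0] * 200        # degraded machine
machines = [to_cumulative(fast)] * 9 + [to_cumulative(slow)]

# Wrong: average the ten per-machine p99 values.
avg_of_p99 = sum(histogram_quantile(0.99, m) for m in machines) / len(machines)

# Right: sum bucket counts across machines, then take the quantile once.
merged = [sum(column) for column in zip(*machines)]
global_p99 = histogram_quantile(0.99, merged)

print(f"average of per-machine p99: {avg_of_p99:.0f} ms")
print(f"p99 from merged buckets:    {global_p99:.0f} ms")
```

With this data the averaged figure comes out at roughly 287 ms while the merged-bucket estimate is 1750 ms. The true p99 of the raw samples is 2000 ms, so even the correct method is only as precise as the bucket edges allow.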

Histogram vs Summary:
Real-world:

Scaling & Optimization

Common Pitfalls + Interview Follow-ups

1. Watching the mean, not percentiles: an 80ms average looks lovely while p99 is 2s, meaning 1 in 100 requests takes 2s or longer. Heavy users, who are often the high-value ones, send the most requests and are the most likely to hit them. Always monitor percentiles.
2. High-cardinality labels blow up the TSDB: adding user_id / request_id / full URL to a metric explodes series from thousands to millions, OOMing Prometheus. High-cardinality dimensions belong in traces/logs.
3. Head sampling drops failing requests: with 1% head sampling and a 0.1% error rate, only about 1 request in 100,000 produces a sampled error trace, so during an incident there is often no failing trace to inspect. Errors and slow requests need a tail-sampling safety net.
4. Alert fatigue: hundreds of static-threshold alerts, 80% noise → on-call goes numb → real incidents get drowned. Switch to SLO burn-rate alerts and strictly separate pages (wake someone now) from tickets.
5. The observability system itself goes down: when monitoring shares infra with the monitored system, an outage blinds the monitoring too. The observability stack needs an independent deployment and failure domain, plus external black-box probes as a backstop.
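Pitfall 3 can be quantified with two lines of probability. A sketch assuming each request is head-sampled independently at a fixed rate; the traffic numbers are made up.

```python
def expected_error_traces(requests, error_rate, head_sample_rate):
    """Expected number of failing requests that also get sampled."""
    return requests * error_rate * head_sample_rate

def p_at_least_one(error_requests, head_sample_rate):
    """Chance that at least one of the failing requests was sampled."""
    return 1 - (1 - head_sample_rate) ** error_requests

# 1M requests at a 0.1% error rate with 1% head sampling
print(expected_error_traces(1_000_000, 0.001, 0.01))
# A short incident that produces only 50 failing requests
print(round(p_at_least_one(50, 0.01), 3))
```

Out of a million requests only about ten error traces survive, and for a burst of 50 failures there is roughly a 60% chance that none of them was captured.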

Frequent interview questions: ① What does each of metrics/logs/traces solve, and why can't you use just one? ② Head vs tail sampling trade-offs; how does tail guarantee a trace's spans are gathered? ③ How do you set SLO/error budget, and why is multi-window burn-rate alerting better than static thresholds? ④ Why can't you average p99 across machines? Histogram vs summary? ⑤ How do you control observability cost (the trio: sampling + cardinality + downsampling)?

Deep-Dive Resources

Food for Thought

1. Adding a user_id label to a metric looks convenient, so why is it a disaster? Do the math: 10M users × every combination of the other labels.

TSDB cost is proportional to active time series count, not request volume. Each unique label combination = one independent series.

  • Without user_id: say 5 endpoints × 4 statuses × 3 regions = 60 series, trivial.
  • With user_id: 60 × 10M = 600M series. Each costs several KB (labels + recent samples + index), so several TB of RAM → Prometheus OOMs.
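The multiplication above is easy to reproduce. In this sketch the per-series footprint of 3 KiB is an assumption chosen to match "several KB"; real figures depend on the TSDB and label sizes.

```python
from math import prod

def series_count(label_cardinalities):
    """Worst case: every label combination becomes its own time series."""
    return prod(label_cardinalities.values())

BYTES_PER_SERIES = 3 * 1024  # assumed footprint per active series

base = {"endpoint": 5, "status": 4, "region": 3}
with_user = {**base, "user_id": 10_000_000}

for name, labels in (("without user_id", base), ("with user_id", with_user)):
    count = series_count(labels)
    tib = count * BYTES_PER_SERIES / 1024 ** 4
    print(f"{name}: {count:,} series, about {tib:.6f} TiB")
```

One extra label turns 60 series into 600,000,000, which at the assumed footprint is well over a terabyte of memory.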

Essence: metrics suit "bounded, low-cardinality" dimensions (enums). user_id, request_id, full URL, error stack are "unbounded, high-cardinality" and belong in traces and structured logs — their cost grows with events, not the cartesian product of dimensions. This is exactly why Honeycomb advocates "wide events": high-cardinality queries are the heart of debugging, but you must store them in the right place. Governance: pre-launch label review, Collector-side relabel drops, and cardinality monitoring/alerts on suspected high-cardinality dimensions.

2. Tail sampling waits for all spans of a trace before deciding. In a cluster of many Collectors, what thorny distributed problem does this introduce?

Core tension: spans of one trace come from different services, arrive at different times, and may land on different Collector instances. To make a tail decision they must be gathered in one place.

  • Routing consistency: you need a load-balancing exporter layer that hashes by trace_id so the same trace always reaches the same backend Collector. During scale up/down, the hash ring changes and spans get misrouted during migration (like the resharding problem from Day 4).
  • State and memory: the Collector must buffer undecided spans (the decision_wait window, e.g. 10s). At high traffic this is huge memory pressure, and late/lost spans leave traces incomplete.
  • Decision latency: you must wait out the window before knowing whether to keep a trace, so persistence lags by at least decision_wait, whereas head sampling decides immediately.
  • Spans outside the window: tail spans of very long traces (> decision_wait) become orphans — either wrongly dropped or handled separately.

This is why many teams use a hybrid: "low head sampling as a floor + tail only as a backstop for error/slow," limiting tail's state pressure to the few traces worth keeping.
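The routing-consistency point can be demonstrated with a toy router. This sketch uses plain modulo hashing over hypothetical Collector names, which is a deliberate simplification: it keeps one trace on one instance, but it also shows how many traces change owner when the pool is resized. A consistent-hash ring would typically shrink that fraction.

```python
import hashlib

def pick_collector(trace_id, collectors):
    """Map a trace_id to one Collector so all its spans meet in one place."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]

def moved_fraction(trace_ids, before, after):
    """Share of traces whose owning Collector changes after a resize."""
    moved = sum(
        1 for t in trace_ids
        if pick_collector(t, before) != pick_collector(t, after)
    )
    return moved / len(trace_ids)

four = [f"collector-{i}" for i in range(4)]
five = four + ["collector-4"]
trace_ids = [f"{i:032x}" for i in range(1000)]

print(pick_collector(trace_ids[0], four))
print(moved_fraction(trace_ids, four, five))
```

Any trace that is in flight while the pool grows from four to five instances may have its early spans on one Collector and its late spans on another, which is exactly the misrouting described above.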

3. SLO at 99.9% or 99.99%? What's the real cost of one more nine? Why is "higher is always better" wrong?

Each extra nine cuts allowed monthly downtime tenfold: from 43 minutes (99.9%) to 4.3 minutes (99.99%) to 26 seconds (99.999%). The cost of getting there is not linear; it rises steeply with each nine:

  • Architecture cost: four nines needs multi-region redundancy and automated failover; five nines needs no single points and cross-region strong consistency — cost can multiply several times.
  • Velocity cost: a smaller error budget means less room to ship and experiment, which slows innovation. At 99.999% the budget is 26 seconds a month, so almost no deployment can be allowed to cause downtime.
  • Diminishing returns: users and downstream dependencies often can't perceive reliability beyond "their own reliability + network jitter." If the user's own network is only 99.9%, your 99.999% is wasted money.
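The diminishing-returns argument follows from multiplying availabilities along the path to the user. A sketch, assuming the user's network and the service fail independently:

```python
def serial_availability(*components):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

user_network = 0.999
four_nines = serial_availability(user_network, 0.9999)
five_nines = serial_availability(user_network, 0.99999)

print(f"service at 99.99%:  user sees {four_nines:.5%}")
print(f"service at 99.999%: user sees {five_nines:.5%}")
print(f"gain from the extra nine: {five_nines - four_nines:.5%}")
```

Behind a 99.9% network the user sees slightly less than 99.9% either way, and the extra nine moves perceived availability by less than 0.01 percentage points.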

So an SLO should be derived backward from business/user experience — "how reliable is reliable enough" — not engineers chasing perfection. 100% is the wrong target: it eliminates the error budget, effectively banning all change, and ironically becomes more fragile because you can't iterate. The right SLO is the line "just good enough that users don't complain," spending the rest of the budget on release velocity.

4. Linking to Day 6 (Consistency): metrics themselves are "eventually consistent + lossy." What risk does this pose for alerting?

Monitoring data is itself neither strongly consistent nor exact — often overlooked, yet deadly:

  • Collection delay: Prometheus scrapes on an interval (e.g. 15s), and alert evaluation has its own cycle, so end-to-end lag can be 30s-1min. When you say "alert within 1 min," count this in.
  • Sampling loss: after trace sampling, rare-event statistics are inherently distorted; estimating error rate from sampled data must multiply back the sampling rate, and small samples have high variance.
  • Aggregation loss: metrics are pre-aggregated; a p99 histogram's accuracy is bounded by bucket edges, and wrong edges systematically over/underestimate.
  • Scrape gaps: a briefly unreachable target = a data hole, so alert rules must distinguish "metric is 0" from "metric is missing" (use absent()), or when a service is fully down you fail to alert precisely because there's no data.

Design implication: alerts must tolerate the data's eventual consistency. Use multiple windows (short for acute problems, long to debounce), set a "for" duration on the rule so a single blip does not fire, and alert separately when the monitoring data itself goes missing. Never assume the number on the dashboard is the exact present truth.
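The sampling-loss point can be made numeric. The sketch below assumes uniform probabilistic sampling (every event kept with the same probability) and uses a binomial approximation for the uncertainty; it does not apply when errors are kept by a tail policy, since then nothing needs scaling back.

```python
import math

def estimate_errors(sampled_errors, sampling_rate):
    """Scale a sampled count back up and report its relative standard error."""
    estimate = sampled_errors / sampling_rate
    if sampled_errors == 0:
        return estimate, math.inf
    relative_se = math.sqrt((1 - sampling_rate) / sampled_errors)
    return estimate, relative_se

for seen in (4, 400):
    estimate, rel_se = estimate_errors(seen, 0.01)
    print(f"{seen} sampled errors -> about {estimate:.0f} real errors, "
          f"relative standard error {rel_se:.0%}")
```

Four sampled errors at 1% sampling suggest about 400 real ones, but with an uncertainty near 50%; it takes hundreds of sampled events before the estimate is stable enough to alert on.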

5. In an incident, what's the ideal drill-down path from "p99 alert" to "root-cause service"? What role does each pillar play on that chain?

This is the ultimate test of observability's value — whether you can push MTTR down to minutes. The ideal chain:

  1. Metric detects (What): a burn-rate alert fires, the RED dashboard shows an endpoint's p99 jump 80ms→2s and error rate rising. Confirms "there's a problem, how big, since when."
  2. Metric narrows (Where): slice the metric by region/version/dependency, find only v2.3 + us-east requests are affected → suspect a release or a regional dependency.
  3. Exemplar to Trace (Why-1): from the exemplar of that p99 bucket, open a representative slow trace in one click; the span tree immediately shows "1.8s spent on the hop where service-C calls the DB" — root-cause service located.
  4. Trace to Log (Why-2): filter logs by that trace's trace_id, see service-C's DB connection-pool-exhausted error stack, confirming the mechanism.

Roles: metric = radar (cheap, full coverage, spots anomalies), trace = GPS (causal pinpoint to the hop), log = microscope (single-point detail). They're stitched by trace_id/exemplar; break any link and the drill-down snaps, leaving you to guess by hand — exactly the difference between "three disjoint tools" and true observability.
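The chain from exemplar to trace to log is, at its core, two lookups joined on trace_id. The following toy uses in-memory dicts as stand-ins for the three stores; every name and value in it is hypothetical and only mirrors the example in the steps above.

```python
# Hypothetical in-memory stand-ins for the metric, trace and log stores.
EXEMPLARS = {("POST /checkout", "p99"): "4bf92f35"}   # metric -> trace_id
TRACES = {
    "4bf92f35": [
        {"service": "service-A", "self_ms": 60},
        {"service": "service-B", "self_ms": 140},
        {"service": "service-C", "self_ms": 1800},    # the DB call
    ]
}
LOGS = [
    {"trace_id": "4bf92f35", "service": "service-C",
     "msg": "DB connection pool exhausted"},
    {"trace_id": "aaaa0000", "service": "service-C", "msg": "ok"},
]

def drill_down(endpoint, series="p99"):
    """Metric exemplar -> slowest span in that trace -> that span's logs."""
    trace_id = EXEMPLARS[(endpoint, series)]
    slowest = max(TRACES[trace_id], key=lambda span: span["self_ms"])
    messages = [
        record["msg"] for record in LOGS
        if record["trace_id"] == trace_id
        and record["service"] == slowest["service"]
    ]
    return slowest["service"], messages

print(drill_down("POST /checkout"))
```

If the exemplar is missing or the logs carry no trace_id, one of the two joins has nothing to match on, which is the "broken link" the roles paragraph warns about.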