Day 21 Medium Observability Metrics/Logs/Traces SLO OpenTelemetry

Observability: When the System Breaks, Can You Ask "Why"?
Metrics, Logs, Traces, OpenTelemetry & SLO-driven On-call

Problem Scenario + Requirements

Design an observability platform serving 1000+ microservices and a million spans per second. One morning p99 latency jumps from 80ms to 2s and users start complaining — you must pinpoint which service, which dependency, which class of request broke, within 5 minutes. This isn't "add a few dashboards"; it's a system whose data volume rivals the business itself.

Monitoring answers "is the system healthy?" (known problems); observability answers "why isn't it?" (unknown problems). The former is predefined dashboards and alerts; the latter is the ability to slice and drill into any dimension during an incident.

Data volume: 1M spans/s, tens to hundreds of spans per trace; tens of millions of active metric time series; TBs of logs/day.
Latency SLO: alert end-to-end < 1 min; query p95 < 2s (engineers can't wait during an incident).
Cost: observability bills often hit 10-30% of infra spend, so storing every trace is infeasible → sampling is the core design lever.
Retention: high-res metrics for days, downsampled for a year; traces for days; logs per compliance.
Cardinality: the number of series is the cartesian product of a metric's label values, which makes it the cost dimension that blows up most easily.

High-Level Architecture

graph LR
    APP["Service + OTel SDK<br/>instrument / propagate ctx"]
    COL["OTel Collector<br/>batch · sample · route"]
    M["Metrics TSDB<br/>Prometheus / Mimir"]
    T["Trace store<br/>Tempo / Jaeger"]
    L["Log store<br/>Loki / ES"]
    Q["Query + correlate<br/>Grafana"]
    A["Alert engine<br/>Alertmanager"]
    OC["On-call<br/>PagerDuty"]

    APP -->|OTLP| COL
    COL --> M
    COL --> T
    COL --> L
    M --> Q
    T --> Q
    L --> Q
    M --> A -->|burn rate exceeded| OC

    classDef svc fill:#1a2530,stroke:#64c8ff,color:#e8eef5
    classDef pipe fill:#0e2030,stroke:#5eead4,color:#e8eef5
    classDef store fill:#1a1a30,stroke:#ffb450,color:#e8eef5
    classDef sink fill:#2a1530,stroke:#ff7ab6,color:#e8eef5
    class APP svc
    class COL,Q pipe
    class M,T,L store
    class A,OC sink

SDK instruments uniformly → Collector decouples app from backend via sampling/routing → three signals cross-link via trace_id / exemplar

Key design: apps depend only on the OTel SDK + Collector, never on a specific backend vendor (avoiding lock-in). The Collector is the control point — sampling, rate limiting, PII scrubbing, downsampling all happen here, so changing the sampling policy doesn't require redeploying hundreds of services.

Key Technical Points

1. The Three "Pillars" Metrics / Logs / Traces — not three tools, but three slices of one dataset

Principle: Metrics are pre-aggregated numeric time series (counter/gauge/histogram) — cheap, ideal for dashboards and alerts, but only the dimensions you predefined. Logs are discrete events — richest information but expensive to search and hard to correlate without structure. Traces stitch one request's cross-service causal chain into a span tree, answering "which hop is slow." Their fundamental difference is cardinality tolerance: adding a high-cardinality label (like user_id) to a metric explodes time series into the millions; traces/structured-logs are built for high cardinality.

Trade-off:

Metrics: ✅ storage is O(series count), independent of request volume; queries in ms; ❌ fixed dimensions, can't drill to a single request, high cardinality = bankruptcy.
Logs: ✅ richest, flexible; ❌ cost grows linearly with traffic, hard to correlate across services (unless they carry trace_id).
Traces: ✅ causal chain + cross-service latency attribution; ❌ must be sampled (full is too costly), instrumentation is intrusive, sampling hides rare requests.

The right move: use a metric to spot the anomaly → use an exemplar to jump to a representative trace → use trace_id to pull all logs for that request. The three pillars are stitched by trace_id, not three isolated silos.

Real-world:

Honeycomb / Charity Majors: publicly critique "three pillars" — storing data three times and manually correlating across tools is an anti-pattern; they advocate arbitrarily-wide structured events as a single source of truth, with high cardinality as the soul of observability (Observability 2.0).
Prometheus + Grafana: metrics carry exemplars (trace_id), dashboards jump to Tempo traces in one click — the de-facto open-source stitching solution.
Google: internally Monarch (metrics) + Dapper (tracing) divide the work; Borgmon is the spiritual ancestor of Prometheus.

2. Distributed Tracing & OpenTelemetry — context propagation + the sampling dilemma

Principle: for a request through A→B→C, attributing all spans to one tree relies on context propagation: the entry point generates a trace_id, each hop creates a span_id and records a parent_id, passed downstream via HTTP headers (W3C traceparent). OpenTelemetry (OTel) unifies the API/SDK/protocol (OTLP), decoupling instrumentation from backend. The core hard problem is sampling: storing every span at 1M/s is bankruptcy, but sampling may drop the very failing trace.
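
To make the propagation step concrete, here is a minimal sketch of how a W3C traceparent header could be minted at the entry point and rewritten at each hop. It uses only the standard library; the function names are illustrative assumptions, and in practice the OTel SDK's propagators do this for you.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Entry point: mint a fresh trace_id and a root span_id."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming: str) -> tuple[str, str]:
    """Downstream hop: keep the trace_id, create a new span_id.

    Returns (outgoing_header, parent_span_id). The parent_span_id is what
    this hop records as its parent_id when it reports its own span.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Malformed header: start a new trace instead of propagating garbage.
        return new_traceparent(), ""
    trace_id, parent_span_id, flags = match.groups()
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}", parent_span_id


# A -> B: B keeps A's trace_id and remembers A's span_id as its parent.
header_from_a = new_traceparent()
header_from_b, parent_of_b = child_traceparent(header_from_a)
```

Because the trace_id never changes along the path, the backend can rebuild the span tree from the (span_id, parent_id) pairs alone.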

Head vs Tail sampling:

Head sampling: decide at the entry point whether to keep a trace (e.g. 1%). ✅ simple, stateless, low overhead (Dapper sampled high-traffic services down to 0.01%); ❌ at decision time you don't yet know if it'll error/be slow, so failing requests are likely dropped.
Tail sampling: buffer all spans of a trace, then after it finishes decide based on "any error / > 1s." ✅ keeps 100% of slow/error traces; ❌ the Collector must gather all spans by trace_id (stateful, needs consistent-hash routing), high memory pressure, higher latency.
Compromise: a low fixed head sample as a floor + tail rules "keep 100% of errors and slow requests."

# OTel Collector tail_sampling policies (config sketch; policy names are illustrative)
# A trace is kept if ANY policy decides to sample it.
processors:
  tail_sampling:
    decision_wait: 10s          # how long to buffer a trace's spans before deciding
    policies:
      - name: keep-errors       # keep every trace that contains an error
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow         # keep every trace slower than 1s
        type: latency
        latency: {threshold_ms: 1000}
      - name: sample-rest       # keep 1% of everything else as a baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 1}
# ⚠️ With multiple Collectors, all spans of a trace_id must reach the same
#    instance (e.g. a front tier running the loadbalancing exporter, which
#    routes by trace ID), otherwise the trace cannot be assembled.
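
The decision logic behind such a policy set can be sketched in a few lines of Python. This is a simplified illustration, not the Collector's implementation: `Span` and `keep_trace` are hypothetical names, and the longest span's duration stands in for the trace's end-to-end latency.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans: list[Span], slow_ms: float = 1000.0, baseline_pct: float = 1.0) -> bool:
    """Tail decision over the buffered spans of ONE trace."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # rule 1: keep every trace containing an error
    if max(s.duration_ms for s in spans) > slow_ms:
        return True  # rule 2: keep every slow trace (longest span as a proxy)
    # Rule 3: deterministic baseline. Hash the trace_id into [0, 10000) so the
    # verdict for a given trace does not depend on which process evaluates it.
    bucket = int(hashlib.sha256(spans[0].trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < baseline_pct * 100
```

Hashing the trace_id rather than calling a random number generator keeps the baseline decision reproducible, which helps when a trace is re-evaluated or replayed.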

Real-world:

Google Dapper (2010 paper, Sigelman et al.): the seminal distributed tracing work; high-traffic services sampled as low as 0.01% without hurting analysis; later spawned OpenTracing/OpenTelemetry.
OpenTelemetry: a CNCF project merging OpenTracing + OpenCensus, now the de-facto cross-language instrumentation standard, fully supported by AWS / Azure / Datadog.
Uber Jaeger: large-scale production tracing, open-sourced into CNCF; adaptive sampling dynamically adjusts the per-second sampling quota per service by traffic.

3. SLI / SLO / SLA & Error Budget — turning "reliability" into a budget you can decide on

Principle: SLI (indicator) = good requests / total, e.g. "fraction of requests under 200ms." SLO (objective) = an internal threshold on the SLI, e.g. "99.9% monthly." SLA (agreement) = external promise + penalty, usually looser than the SLO (a buffer). The key invention is error budget = 1 − SLO: 99.9% means you're allowed to be "broken" for 43 minutes a month. This turns reliability from a "higher is always better" shouting match into a budget you can spend — under budget, ship boldly (innovation); over budget, freeze releases and fix stability. It aligns the incentives of product teams and SRE.
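
The budget arithmetic is simple enough to check directly. The sketch below assumes a time-based SLO over a 30-day window; the function name is illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of 'badness' the SLO allows within the window."""
    return (1.0 - slo) * window_days * 24 * 60


# Each extra nine divides the budget by ten.
for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {error_budget_minutes(slo):.2f} min per 30 days")
```

For 99.9% this works out to about 43 minutes, the figure quoted above.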

Alerting with error budget: static thresholds vs burn rate

Static threshold alerts (e.g. error rate > 1%): either too sensitive (paging all night, alert fatigue) or too dull.
Burn-rate alerts (Google SRE recommended): monitor how many times faster than allowed you are consuming the error budget. Multi-window, multi-burn-rate: a fast burn (2% of the 30-day budget gone in 1h, a burn rate of 14.4) pages immediately; a medium burn (5% in 6h, a burn rate of 6) also pages; a slow burn (10% over 3 days, a burn rate of 1) opens a ticket. This catches disasters yet ignores blips.

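# Multi-window burn-rate alert (PromQL sketch)
# error_budget_burn_rate is a hypothetical recording rule, one series per window.
# Page only when both the 1h and the 5m window exceed 14.4x: the long window
# shows the burn is sustained, the short window shows it is still happening.
(
  error_budget_burn_rate{window="1h"} > 14.4
  and
  error_budget_burn_rate{window="5m"} > 14.4
)
# burn_rate = (1 - SLI) / (1 - SLO)
#   = actual error rate / budgeted error rate;  >1 means overspending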
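
Where the 14.4 comes from is easy to lose, so here is a small sketch of the arithmetic under the assumption of a 30-day SLO period. The function names are illustrative and not part of any alerting library.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than budgeted the error budget is being spent."""
    return error_rate / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_period_hours: float = 30 * 24) -> float:
    """Burn rate that consumes budget_fraction of the budget in window_hours."""
    return budget_fraction * slo_period_hours / window_hours


def should_page(err_1h: float, err_5m: float, slo: float = 0.999) -> bool:
    """Fast-burn page: both the long and the short window must be over threshold."""
    threshold = burn_rate_threshold(0.02, 1)  # 2% of the budget in 1 hour
    return burn_rate(err_1h, slo) > threshold and burn_rate(err_5m, slo) > threshold
```

With a 99.9% SLO, a sustained 2% error rate is a burn rate of about 20 and pages; the same 1h figure with a recovered 5m window does not, which is what lets the alert reset quickly once the incident is over.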

Real-world:

Google SRE: error budget is the core idea of the SRE Book — "100% is the wrong reliability target"; the SRE Workbook's Alerting on SLOs details the multi-window multi-burn-rate formula.
Public status pages: external SLAs (e.g. "99.95% or refund") are universally a notch looser than internal SLOs, leaving operational buffer.

4. Metric Aggregation Pitfalls — why you can't average p99

Principle: percentiles are not re-aggregatable. Averaging the p99 of ten machines does NOT yield the global p99 — it's mathematically wrong. The correct way: clients emit a histogram (cumulative bucket counts), and you sum bucket counts across machines, then apply histogram_quantile() at query time. The summary type computes quantiles on the client, so it cannot be merged across instances. This is the core reason Prometheus recommends histogram > summary.
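
A small worked example shows how far off the average can be. The sketch below is a simplified stand-in for histogram_quantile() (linear interpolation inside the matching bucket); the bucket bounds and request counts are made-up illustration data.

```python
import math

BOUNDS = [0.1, 0.5, 1.0, 2.5, math.inf]  # bucket upper bounds ("le"), in seconds


def histogram_quantile(q: float, cumulative: list[int]) -> float:
    """Estimate quantile q from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            if math.isinf(upper):
                return lower  # cannot interpolate inside the +Inf bucket
            below = cumulative[i - 1] if i > 0 else 0
            return lower + (upper - lower) * (rank - below) / (count - below)
    return math.nan


# Machine A: 9900 requests, all under 100 ms. Machine B: 100 requests, all 1-2.5 s.
fast = [9900, 9900, 9900, 9900, 9900]
slow = [0, 0, 0, 100, 100]

avg_of_p99 = (histogram_quantile(0.99, fast) + histogram_quantile(0.99, slow)) / 2
merged = [a + b for a, b in zip(fast, slow)]  # bucket counts ARE summable
global_p99 = histogram_quantile(0.99, merged)

print(f"average of per-machine p99: {avg_of_p99:.3f}s")
print(f"p99 of merged histogram:    {global_p99:.3f}s")
```

Averaging weights both machines equally even though one served 99% of the traffic, so the averaged figure lands above one second while the merged histogram puts the true p99 near 100 ms.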

Histogram vs Summary:

Histogram: ✅ buckets are summable across instances, any quantile over any window; ❌ accuracy depends on bucket boundaries, many buckets = more series.
Summary: ✅ exact client-side quantiles, cheap queries; ❌ not aggregatable, quantiles and windows are hard-coded at instrumentation.
Key metric design: watch RED (Rate / Errors / Duration, request-facing) or USE (Utilization / Saturation / Errors, resource-facing). Always watch percentiles, never the mean — the mean hides outliers; p50 fine while p99 explodes is the norm.

Real-world:

Prometheus official docs: warn that averaging or otherwise aggregating precomputed summary quantiles is statistically meaningless; only histogram + histogram_quantile() gives correct cross-instance quantiles (Histograms and summaries).
Prometheus Native Histograms: a next-gen sparse histogram with exponentially auto-spaced buckets, resolving the accuracy/cost dilemma of fixed buckets.

Scaling & Optimization

Cardinality governance: review metric labels before launch; use the Collector to drop/relabel high-cardinality labels before ingest, pushing user_id-class data into traces/logs instead of metrics.
Downsampling: keep raw 15s resolution for days, 5m/1h aggregates for a year (Thanos / Mimir / VictoriaMetrics). Tier cold data to object storage to save cost.
Metric → trace jump: exemplars embed a representative trace_id into histogram buckets, so a p99 spike on a dashboard drills straight into that slow trace.
Multi-region / multi-tenant: per-region collection with a global aggregated view; tenants need query isolation and quotas so one team's bad query can't crush the shared TSDB.
eBPF auto-instrumentation: non-intrusive capture of syscalls/network latency (Pixie / Cilium), filling the blind spots of code-level instrumentation.
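
The downsampling item above can be illustrated with a short sketch. It assumes gauge-style samples as (unix_ts, value) pairs and keeps count/sum/min/max per window so that averages and extremes remain derivable; the function is hypothetical and does not mirror any particular TSDB's internals. Counters would need reset-aware handling that is left out here.

```python
from math import inf


def downsample(samples: list[tuple[int, float]], window_s: int = 300) -> dict[int, dict[str, float]]:
    """Roll raw samples up into fixed windows keyed by window start time."""
    out: dict[int, dict[str, float]] = {}
    for ts, value in samples:
        start = ts - ts % window_s  # align to the window boundary
        agg = out.setdefault(start, {"count": 0, "sum": 0.0, "min": inf, "max": -inf})
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    return out


# Ten minutes of 15s samples (40 points) collapse into two 5-minute windows.
raw = [(t, float(t % 60)) for t in range(0, 600, 15)]
rolled = downsample(raw)
```

Storing min and max alongside the sum matters: a plain average per window would erase exactly the spikes that on-call engineers look for in year-old data.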

Common Pitfalls + Interview Follow-ups

1. Watching the mean, not percentiles: an 80ms average looks lovely, but p99 is 2s — 1% of users suffer, and they're often the high-value heavy users. Always monitor percentiles.

2. High-cardinality labels blow up the TSDB: adding user_id / request_id / full URL to a metric explodes series from thousands to millions, OOMing Prometheus. High-cardinality dimensions belong in traces/logs.

3. Head sampling drops failing requests: at 1% head sampling each failing request has only a 1% chance of being kept, so with a 0.1% error rate roughly one request in 100,000 yields a retained error trace, often leaving nothing to inspect during an incident. Errors/slow requests need a tail-sampling safety net.

4. Alert fatigue: hundreds of static-threshold alerts, 80% noise → on-call goes numb → real incidents get drowned. Switch to SLO burn-rate alerts and strictly separate pages (wake someone now) from tickets.

5. The observability system itself goes down: when monitoring shares infra with the monitored system, an outage blinds the monitoring too. The observability stack needs an independent deployment and failure domain, plus external black-box probes as a backstop.

Frequent interview questions: ① What does each of metrics/logs/traces solve, and why can't you use just one? ② Head vs tail sampling trade-offs; how does tail guarantee a trace's spans are gathered? ③ How do you set SLO/error budget, and why is multi-window burn-rate alerting better than static thresholds? ④ Why can't you average p99 across machines? Histogram vs summary? ⑤ How do you control observability cost (the trio: sampling + cardinality + downsampling)?

Deep-Dive Resources

Site Reliability Engineering & The SRE Workbook (Google, free at sre.google): the authoritative source on SLI/SLO/error budget and burn-rate alerting.
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure (Google, 2010): the foundational paper on distributed tracing and sampling.
Prometheus docs — Histograms and summaries: the percentile-aggregation pitfall and histogram selection.
Honeycomb blog / Charity Majors (charity.wtf): observability 2.0, high cardinality, the critique of the three pillars.
Designing Data-Intensive Applications (Kleppmann), Ch. 1: discussion of monitoring/observability within reliability and maintainability.

Food for Thought

1. Adding a user_id label to a metric looks convenient — why is it a disaster? Do the math: 10M users × 10 other label dimensions?

TSDB cost is proportional to active time series count, not request volume. Each unique label combination = one independent series.

Without user_id: say 5 endpoints × 4 statuses × 3 regions = 60 series, trivial.
With user_id: 60 × 10M = 600M series. Each costs several KB (labels + recent samples + index), so several TB of RAM → Prometheus OOMs.
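
The same arithmetic as a sketch, with the worst case taken as the full cartesian product of label values. The 3 KB per series is an assumption standing in for "several KB"; real per-series overhead depends on the TSDB and label sizes.

```python
from math import prod

BYTES_PER_SERIES = 3 * 1024  # assumption: stand-in for "several KB" per active series


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on active series: the product of distinct values per label."""
    return prod(label_cardinalities.values())


base = {"endpoint": 5, "status": 4, "region": 3}
with_user = {**base, "user_id": 10_000_000}

for name, labels in (("without user_id", base), ("with user_id", with_user)):
    n = series_count(labels)
    print(f"{name}: {n:,} series, ~{n * BYTES_PER_SERIES / 1e12:.4f} TB")
```

In practice only observed label combinations become series, so the product is a ceiling, but a single unbounded label is enough to push that ceiling far past what one Prometheus instance can hold in memory.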

Essence: metrics suit "bounded, low-cardinality" dimensions (enums). user_id, request_id, full URL, error stack are "unbounded, high-cardinality" and belong in traces and structured logs — their cost grows with events, not the cartesian product of dimensions. This is exactly why Honeycomb advocates "wide events": high-cardinality queries are the heart of debugging, but you must store them in the right place. Governance: pre-launch label review, Collector-side relabel drops, and cardinality monitoring/alerts on suspected high-cardinality dimensions.

2. Tail sampling waits for all spans of a trace before deciding. In a cluster of many Collectors, what thorny distributed problem does this introduce?

Core tension: spans of one trace come from different services, arrive at different times, and may land on different Collector instances. To make a tail decision they must be gathered in one place.

Routing consistency: you need a load-balancing exporter layer that hashes by trace_id so the same trace always reaches the same backend Collector. During scale up/down, the hash ring changes and spans get misrouted during migration (like the resharding problem from Day 4).
State and memory: the Collector must buffer undecided spans (the decision_wait window, e.g. 10s). At high traffic this is huge memory pressure, and late/lost spans leave traces incomplete.
Decision latency: you must wait out the window before knowing whether to keep, so trace persistence has inherent latency, worse than head.
Spans outside the window: tail spans of very long traces (> decision_wait) become orphans — either wrongly dropped or handled separately.
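
The routing problem can be made tangible with a toy consistent-hash ring. This is a minimal sketch, not how any specific load-balancing exporter is implemented; the `HashRing` class, the collector names, and the choice of 100 virtual nodes are all assumptions for illustration.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes; maps a trace_id to a collector."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def node_for(self, trace_id: str) -> str:
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._keys, _h(trace_id)) % len(self._ring)
        return self._ring[idx][1]


trace_ids = [f"{i:032x}" for i in range(10_000)]
before = HashRing(["col-0", "col-1", "col-2", "col-3"])
after = HashRing(["col-0", "col-1", "col-2", "col-3", "col-4"])

moved = [t for t in trace_ids if before.node_for(t) != after.node_for(t)]
print(f"traces re-routed by adding one collector: {len(moved) / len(trace_ids):.1%}")
```

Consistent hashing limits the damage to roughly the new node's share of the ring, but any trace that was mid-flight when the ring changed now has spans buffered on two collectors, and neither sees the whole trace.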

This is why many teams use a hybrid: "low head sampling as a floor + tail only as a backstop for error/slow," limiting tail's state pressure to the few traces worth keeping.

3. SLO at 99.9% or 99.99%? What's the real cost of one more nine? Why is "higher is always better" wrong?

Each extra nine compresses allowed monthly downtime from 43 minutes (99.9%) to 4.3 minutes (99.99%) to 26 seconds (99.999%). The cost isn't linear but exponential:

Architecture cost: four nines needs multi-region redundancy and automated failover; five nines needs no single points and cross-region strong consistency — cost can multiply several times.
Velocity cost: a smaller error budget means less room to ship/experiment, locking down innovation. 99.999% means you can almost never deploy with downtime.
Diminishing returns: users and downstream dependencies cannot perceive reliability beyond that of their own path to you (device, ISP, network jitter). If the user's own network is only 99.9%, your 99.999% is wasted money.

So an SLO should be derived backward from business/user experience — "how reliable is reliable enough" — not engineers chasing perfection. 100% is the wrong target: it eliminates the error budget, effectively banning all change, and ironically becomes more fragile because you can't iterate. The right SLO is the line "just good enough that users don't complain," spending the rest of the budget on release velocity.

4. Linking to Day 6 (Consistency): metrics themselves are "eventually consistent + lossy." What risk does this pose for alerting?

Monitoring data is itself neither strongly consistent nor exact — often overlooked, yet deadly:

Collection delay: Prometheus scrapes on an interval (e.g. 15s), and alert evaluation has its own cycle, so end-to-end lag can be 30s-1min. When you say "alert within 1 min," count this in.
Sampling loss: after trace sampling, rare-event statistics are inherently distorted; estimating error rate from sampled data must multiply back the sampling rate, and small samples have high variance.
Aggregation loss: metrics are pre-aggregated; a p99 histogram's accuracy is bounded by bucket edges, and wrong edges systematically over/underestimate.
Scrape gaps: a briefly unreachable target = a data hole, so alert rules must distinguish "metric is 0" from "metric is missing" (use absent()), or when a service is fully down you fail to alert precisely because there's no data.

Design implication: alerts must tolerate the data's eventual consistency. Use multi-window rules (short for acute, long to debounce), set a "for" duration on the alert rule so a single blip does not fire, and alert separately on the monitoring data itself going missing. Never assume the number on the dashboard is the exact present truth.

5. In an incident, what's the ideal drill-down path from "p99 alert" to "root-cause service"? What role does each pillar play on that chain?

This is the ultimate test of observability's value — whether you can push MTTR down to minutes. The ideal chain:

Metric detects (What): a burn-rate alert fires, the RED dashboard shows an endpoint's p99 jump 80ms→2s and error rate rising. Confirms "there's a problem, how big, since when."
Metric narrows (Where): slice the metric by region/version/dependency, find only v2.3 + us-east requests are affected → suspect a release or a regional dependency.
Exemplar to Trace (Why-1): from the exemplar of that p99 bucket, open a representative slow trace in one click; the span tree immediately shows "1.8s spent on the hop where service-C calls the DB" — root-cause service located.
Trace to Log (Why-2): filter logs by that trace's trace_id, see service-C's DB connection-pool-exhausted error stack, confirming the mechanism.

Roles: metric = radar (cheap, full coverage, spots anomalies), trace = GPS (causal pinpoint to the hop), log = microscope (single-point detail). They're stitched by trace_id/exemplar; break any link and the drill-down snaps, leaving you to guess by hand — exactly the difference between "three disjoint tools" and true observability.
