On your monitoring dashboard, CPU and latency always spike together — but the curves alone can't tell you whether CPU drives latency, latency drives CPU, or a hidden common upstream (a traffic surge) is pushing both. That hidden upstream is a confounder — like a shared dependency you never drew into the dependency graph. Reading causation off correlation is like concluding "service A calls service B" just because the two jitter at the same time.
Correlation answers an observational question: "given that I see X high, Y is probably high too," formally the conditional probability P(Y | X). Causation answers an interventional one: "if I actively raise X, does Y change?" Judea Pearl writes this with the do-operator as P(Y | do(X)). Much of the gap between the two comes from confounding: when Z drives both X and Y, X and Y can correlate strongly even with zero causal link between them.
Mechanically, a confounder opens a backdoor path: X ← Z → Y. Correlation blends "the true X→Y effect" with "the spurious association via Z." The core move of causal identification is to close the backdoor — stratify, regress, or match to "control for Z," leaving only the direct X→Y channel. This is also the root of Simpson's paradox: without controlling for Z, the aggregate trend can flatly contradict the trend inside every subgroup.
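For a discrete, fully observed confounder, "controlling for Z" can be written out as the standard backdoor adjustment. The sketch below assumes that Z is the only variable on any backdoor path between X and Y and that every stratum of Z contains every level of X; both are assumptions added here for illustration, not claims from the text. The only difference between the two lines is the weight on each stratum: the interventional quantity uses the population share P(Z = z), while the observational one uses P(Z = z | X = x), which is where the confounder leaks in.

```latex
\begin{aligned}
P(Y \mid do(X = x)) &= \sum_{z} P(Y \mid X = x,\, Z = z)\; P(Z = z) \\
P(Y \mid X = x)     &= \sum_{z} P(Y \mid X = x,\, Z = z)\; P(Z = z \mid X = x)
\end{aligned}
```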
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
n = 2000
Z = np.random.normal(size=n)       # confounder: e.g. "user's baseline activity"
X = Z + np.random.normal(size=n)   # Z raises X (active users use the new feature more)
Y = Z + np.random.normal(size=n)   # Z raises Y too; true effect of X on Y = 0
df = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# Without controlling for Z: X appears to drive Y strongly (spurious)
print(smf.ols("Y ~ X", df).fit().params["X"])      # ≈ 0.5
# Controlling for Z (closing the backdoor): the X coefficient collapses to the truth
print(smf.ols("Y ~ X + Z", df).fit().params["X"])  # ≈ 0
```
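The Simpson's paradox reversal mentioned above can also be seen with stratification alone, without any regression. The sketch below uses hypothetical counts invented for illustration (a two-level confounder `Z` with strata "low" and "high", and a binary treatment); the helper `rate` is likewise just a name chosen here. Within each stratum the treated units do better, yet the aggregate comparison points the other way because treated units are concentrated in the harder stratum.

```python
# Hypothetical counts: (stratum of Z, treated?, units, successes).
rows = [
    ("low",  1, 100,  95),
    ("low",  0, 400, 360),
    ("high", 1, 400, 280),
    ("high", 0, 100,  60),
]

def rate(rows, treated, stratum=None):
    """Success rate in one treatment arm, optionally within one stratum of Z."""
    sel = [r for r in rows if r[1] == treated and (stratum is None or r[0] == stratum)]
    return sum(r[3] for r in sel) / sum(r[2] for r in sel)

# Aggregate comparison ignores Z.
naive_diff = rate(rows, 1) - rate(rows, 0)

# Per-stratum comparison holds Z fixed.
stratum_diffs = {z: rate(rows, 1, z) - rate(rows, 0, z) for z in ("low", "high")}

# Adjustment: weight each stratum by its share of all units, P(Z = z).
total = sum(r[2] for r in rows)
weights = {z: sum(r[2] for r in rows if r[0] == z) / total for z in ("low", "high")}
adjusted_diff = sum(weights[z] * stratum_diffs[z] for z in weights)

print(round(naive_diff, 3))                                 # -0.09
print({z: round(d, 3) for z, d in stratum_diffs.items()})   # {'low': 0.05, 'high': 0.1}
print(round(adjusted_diff, 3))                              # 0.075
```

The adjusted difference is only a causal estimate if `Z` really is the sole confounder, which is assumed here rather than checked.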
A counterfactual is "what would have happened if I hadn't shipped this deploy?" The catch: you cannot both deploy and not-deploy the same server at the same instant. This is exactly the A/B test mindset, but with an unavoidable flaw — for each unit you can only ever observe one parallel universe; the other is forever missing data.
Rubin's potential-outcomes framework assigns each unit i two values: Y_i(1) (outcome if treated) and Y_i(0) (if not). The individual causal effect is Y_i(1) − Y_i(0). The trouble: i is either treated or not, so you only ever observe one; the other is the counterfactual. This is the Fundamental Problem of Causal Inference — essentially a missing-data problem.
Since individual effects are unobservable, we settle for an average: ATE = E[Y(1) − Y(0)]. But naively taking "treated-group mean − untreated-group mean" carries selection bias: people who choose treatment differ from those who don't. Randomization is the key. Random assignment makes "treated or not" independent of the potential outcomes, so the two groups are comparable in expectation before treatment, and the between-group difference is an unbiased estimate of the ATE. That's why the RCT (randomized controlled trial) is the gold standard.
| Unit | Y(0) if untreated | Y(1) if treated | Individual effect |
|---|---|---|---|
| A (treated) | ? counterfactual | 8 | unknowable |
| B (untreated) | 5 | ? counterfactual | unknowable |
| … | … | … | … |
```python
import numpy as np

np.random.seed(1)
n = 5000
T = np.random.binomial(1, 0.5, n)   # random assignment: key! independent of the unit
Y0 = np.random.normal(50, 10, n)    # potential outcome if untreated
tau = 4.0                           # true causal effect +4
Y1 = Y0 + tau
Y = np.where(T == 1, Y1, Y0)        # observe only the assigned arm

# Thanks to randomization, group diff is an unbiased ATE estimate
ate = Y[T == 1].mean() - Y[T == 0].mean()
print(round(ate, 2))                # ≈ 4.0, close to true tau
```
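To see what randomization buys, it helps to contrast it with self-selection on a tiny "god's-eye" dataset where both potential outcomes are visible. The six `Y0` values and the opt-in rule below are hypothetical, chosen only so the arithmetic is easy to follow; real data never shows both columns. Under self-selection the naive difference splits into the effect on the treated plus a selection-bias term (the gap in untreated outcomes between the two groups). Averaged over every equally likely random assignment, that bias term cancels.

```python
from itertools import combinations
from statistics import mean

# Hypothetical data: both potential outcomes are known for every unit.
Y0 = [40, 45, 50, 55, 60, 65]
tau = 4                          # constant true effect, as in the simulation above
Y1 = [y + tau for y in Y0]
units = range(len(Y0))

def naive_diff(treated):
    """Observed treated mean minus observed untreated mean."""
    control = [i for i in units if i not in treated]
    return mean(Y1[i] for i in treated) - mean(Y0[i] for i in control)

# Self-selection: units with a high untreated outcome opt in.
chosen = [i for i in units if Y0[i] >= 55]
att = mean(Y1[i] - Y0[i] for i in chosen)
selection_bias = (mean(Y0[i] for i in chosen)
                  - mean(Y0[i] for i in units if i not in chosen))
print(naive_diff(chosen), att, selection_bias)   # 19 4 15

# Randomization: average the same estimator over every equally likely
# way of assigning 3 of the 6 units to treatment.
randomized = mean(naive_diff(list(c)) for c in combinations(units, 3))
print(randomized)   # equals tau = 4 up to floating-point error
```

Any single random assignment can still miss the true effect; unbiasedness is a statement about the average over assignments, which is why larger samples matter.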
When you can't randomize and have a confounder you can't shake off, hunt for a natural randomizer — an external nudge that only changes "treated or not" and is itself unrelated to the outcome or the confounder. Analogy: your canary system assigns a feature flag via a random seed; who gets the new feature is random and unrelated to user profile. So you treat "randomly assigned the flag" as a lever to pry out the feature's causal effect — even though whether users actually adopt the feature is self-selected.
With unobserved confounding, "controlling for the confounder" fails because you can't even measure it. Call that unmeasured confounder U. An instrument Z (from here on, Z denotes the instrument, not the confounder of the earlier example) sidesteps the problem if it meets three conditions: (1) relevance: Z genuinely affects treatment X (the Z→X link is strong enough); (2) exclusion: Z affects Y only through X, with no other path; (3) independence: Z is unrelated to U, i.e. "as good as randomly assigned."
The mechanism is two-stage least squares (2SLS): stage one uses Z to predict X, isolating the part of X's variation driven only by Z (this part is "clean," confounder-free); stage two uses that clean prediction to explain Y. Intuitively, you use only the small exogenous wiggle the instrument provides to estimate the effect, discarding all the confounder-polluted variation. The cost: larger variance, and conditions (2) and (3) can't be verified from the data alone when you have a single instrument; they can only be argued from domain knowledge. That's IV's most fragile point.
```python
import numpy as np
import pandas as pd
from linearmodels.iv import IV2SLS  # pip install linearmodels

np.random.seed(2)
n = 4000
Z = np.random.normal(size=n)                 # instrument: exogenous random nudge
U = np.random.normal(size=n)                 # unobserved confounder, unmeasurable
X = 0.8*Z + U + np.random.normal(size=n)     # X driven by both Z and confounder U
Y = 2.0*X + 3*U + np.random.normal(size=n)   # true effect=2, but U pollutes OLS
df = pd.DataFrame({"Y": Y, "X": X, "Z": Z})

# OLS is biased by the confounder (far from 2); IV pries out clean effect
iv = IV2SLS.from_formula("Y ~ 1 + [X ~ Z]", df).fit()
print(round(iv.params["X"], 2))              # ≈ 2.0, close to true causal effect
```
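The two stages can also be spelled out by hand, which makes it easier to see what the library call is doing. The sketch below is a standard-library re-implementation for illustration only: it reuses the same data-generating process as above but with different random draws, and the helper `slope` is a name introduced here. It also computes the plain OLS slope so the bias from the unobserved confounder is visible next to the IV estimate.

```python
import random

random.seed(2)
n = 4000

def slope(x, y):
    """OLS slope of y on x (with intercept): cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# Same process as above: true effect of X on Y is 2, U is never "observed".
Z = [random.gauss(0, 1) for _ in range(n)]                        # instrument
U = [random.gauss(0, 1) for _ in range(n)]                        # confounder
X = [0.8 * z + u + random.gauss(0, 1) for z, u in zip(Z, U)]
Y = [2.0 * x + 3 * u + random.gauss(0, 1) for x, u in zip(X, U)]

ols = slope(X, Y)          # polluted by U, so well above 2

# Stage 1: regress X on Z and keep only the part of X predicted by Z.
a1 = slope(Z, X)
mz, mx = sum(Z) / n, sum(X) / n
X_hat = [mx + a1 * (z - mz) for z in Z]

# Stage 2: regress Y on the stage-1 prediction.
iv = slope(X_hat, Y)

print(round(ols, 2), round(iv, 2))
```

With a single instrument, the stage-2 slope is algebraically the ratio `slope(Z, Y) / slope(Z, X)`. The hand-rolled version gives the right point estimate, but standard errors taken from a manual second stage would be wrong, which is one reason to prefer a dedicated routine in practice.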
You change the config on one shard (treatment) and leave another untouched (control). Just looking at the treated shard before/after won't do — the whole cluster may be trending up over that period due to traffic seasonality. DiD's move: (treatment before-after diff) minus (control before-after diff), which subtracts out the shared time trend both groups lived through, leaving only the net effect of the config change.
When treatment isn't randomly assigned, but you have two time points (before/after) and an untreated control group, DiD can identify the effect. It cancels two biases at once: the first difference (before vs after) removes each group's time-invariant fixed differences (e.g. the treated group's baseline was just higher); the second difference (treated vs control) removes the time trend both groups share (the market's seasonal swing). After both subtractions, what's left is the causal effect.
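Written out, the two subtractions look like this. The notation is introduced here for illustration: a bar denotes a group mean, "treat"/"ctrl" the group, and "pre"/"post" the period. The second line is the equivalent regression form, in which the coefficient on the interaction term equals the DiD estimate when there are exactly two groups and two periods.

```latex
\begin{aligned}
\hat{\tau}_{\text{DiD}}
  &= \left(\bar{Y}_{\text{treat},\,\text{post}} - \bar{Y}_{\text{treat},\,\text{pre}}\right)
   - \left(\bar{Y}_{\text{ctrl},\,\text{post}} - \bar{Y}_{\text{ctrl},\,\text{pre}}\right) \\
Y &= \beta_0 + \beta_1\,\text{treat} + \beta_2\,\text{post}
   + \beta_3\,(\text{treat} \times \text{post}) + \varepsilon,
  \qquad \hat{\beta}_3 = \hat{\tau}_{\text{DiD}}
\end{aligned}
```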
The core assumption is parallel trends: absent treatment, the treated and control groups would have moved along parallel paths. This is the entire source of DiD's credibility — and it can't be proven directly; you can only support it by checking that the two lines ran parallel across multiple pre-treatment periods. The classic case is Card & Krueger (1994) on minimum wage: New Jersey raised its minimum wage while neighboring Pennsylvania didn't; a difference-in-differences on fast-food employment found employment did not fall as standard theory predicted.
```python
import pandas as pd
import statsmodels.formula.api as smf

# treat: in treatment group?  post: in post-treatment period?
df = pd.DataFrame({
    "y":     [20, 22, 18, 25],   # control before/after + treated before/after
    "treat": [0, 0, 1, 1],
    "post":  [0, 1, 0, 1],
})

# coefficient on interaction treat:post = the DiD causal effect
m = smf.ols("y ~ treat + post + treat:post", df).fit()
print(m.params["treat:post"])
# = (25-18) - (22-20) = 7 - 2 = 5  ← subtracted the +2 market trend
```
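The parallel-trends check described earlier can be sketched with a few pre-treatment periods. The per-period group means below are hypothetical numbers made up for illustration (four pre-treatment periods, one post-treatment period). The idea is to run "placebo" DiDs between consecutive pre-treatment periods, where no treatment happened: if the groups were moving in parallel, these should sit near zero, while the real DiD stands out.

```python
# Hypothetical per-period group means; periods 0-3 are pre-treatment, 4 is post.
control = [20.0, 21.0, 22.5, 23.0, 25.0]
treated = [18.0, 19.1, 20.4, 21.0, 28.0]
n_pre = 4

# Gap between the groups in each pre-treatment period.
pre_gaps = [t - c for t, c in zip(treated[:n_pre], control[:n_pre])]

# Placebo DiD: change in the gap between consecutive pre-treatment periods.
placebo = [pre_gaps[k] - pre_gaps[k - 1] for k in range(1, n_pre)]

# Actual DiD: last pre-treatment period versus the post-treatment period.
did = (treated[-1] - treated[n_pre - 1]) - (control[-1] - control[n_pre - 1])

print([round(g, 2) for g in pre_gaps])   # [-2.0, -1.9, -2.1, -2.0]
print([round(p, 2) for p in placebo])    # [0.1, -0.2, 0.1]
print(round(did, 2))                     # 5.0
```

Small placebo estimates support the assumption but do not prove it: the groups could still have diverged in the post period for reasons unrelated to the treatment.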