AI/ML Explained: Causal Inference

Day 36 · 2026-06-22
For: engineers with coding experience, outside the AI field

Causation vs Correlation

Core Distinction · Confounding
One-line analogy

On your monitoring dashboard, CPU and latency always spike together — but the curves alone can't tell you whether CPU drives latency, latency drives CPU, or a hidden common upstream (a traffic surge) is pushing both. That hidden upstream is a confounder — like a shared dependency you never drew into the dependency graph. Reading causation off correlation is like concluding "service A calls service B" just because the two jitter at the same time.

What it solves + how it works

Correlation answers an observational question: "given that I see X high, Y is probably high too" — formally the conditional probability P(Y | X). Causation answers an interventional one: "if I actively raise X, does Y change?" — Judea Pearl writes this with the do-operator as P(Y | do(X)). The whole gap between them lives in confounding: when Z drives both X and Y, X and Y will correlate strongly even with zero causal link.
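The gap between P(Y | X) and P(Y | do(X)) can be made concrete with a small simulation. This is a minimal sketch under assumed structure: the `simulate` function, its `do_x` parameter, and the toy model (Z drives both X and Y, and X has no effect on Y) are illustrative choices, not part of the original text.

```python
import numpy as np

def simulate(n, do_x=None, seed=0):
    """Toy structural model: Z -> X and Z -> Y, with no X -> Y arrow."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)            # hidden common cause
    noise_x = rng.normal(size=n)
    noise_y = rng.normal(size=n)
    if do_x is None:
        x = z + noise_x               # observational regime: X listens to Z
    else:
        x = np.full(n, float(do_x))   # do(X = do_x): the Z -> X arrow is cut
    y = z + noise_y                   # Y never reads X in this toy model
    return x, y

n = 10_000
# Observational contrast: E[Y | X > 1] - E[Y | X <= 1]
x_obs, y_obs = simulate(n)
obs_gap = y_obs[x_obs > 1].mean() - y_obs[x_obs <= 1].mean()

# Interventional contrast: E[Y | do(X = 2)] - E[Y | do(X = 0)]
_, y_do_high = simulate(n, do_x=2)
_, y_do_low = simulate(n, do_x=0)
do_gap = y_do_high.mean() - y_do_low.mean()

print(f"observational gap:  {obs_gap:.2f}")
print(f"interventional gap: {do_gap:.2f}")
```

Seeing a high X tells you Z was probably high, so Y looks higher too. Forcing X high tells you nothing about Z, so Y does not move. Both interventional runs reuse the same seed, so their contrast isolates the effect of setting X and nothing else.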

Mechanically, a confounder opens a backdoor path: X ← Z → Y. Correlation blends "the true X→Y effect" with "the spurious association via Z." The core move of causal identification is to close the backdoor — stratify, regress, or match to "control for Z," leaving only the direct X→Y channel. This is also the root of Simpson's paradox: without controlling for Z, the aggregate trend can flatly contradict the trend inside every subgroup.
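"Closing the backdoor" by stratifying can be sketched with plain counts. The numbers below are made up for illustration (Z is task difficulty, and harder cases are treated more often); the adjustment used is the standard backdoor formula, the sum over z of P(Y | X, z) · P(z).

```python
# Illustrative counts only: (successes, total) per (arm, stratum of Z)
counts = {
    ("treated", "easy"): (90, 100),
    ("treated", "hard"): (300, 500),
    ("untreated", "easy"): (400, 500),
    ("untreated", "hard"): (50, 100),
}
arms = ("treated", "untreated")
strata = ("easy", "hard")

def rate(arm, stratum):
    successes, total = counts[(arm, stratum)]
    return successes / total

def pooled_rate(arm):
    """Success rate ignoring Z (what the aggregate dashboard shows)."""
    successes = sum(counts[(arm, s)][0] for s in strata)
    total = sum(counts[(arm, s)][1] for s in strata)
    return successes / total

def adjusted_rate(arm):
    """Backdoor adjustment: sum over z of P(Y | X, z) * P(z)."""
    grand_total = sum(total for _, total in counts.values())
    result = 0.0
    for s in strata:
        stratum_total = sum(counts[(a, s)][1] for a in arms)
        result += rate(arm, s) * stratum_total / grand_total
    return result

naive_diff = pooled_rate("treated") - pooled_rate("untreated")
adjusted_diff = adjusted_rate("treated") - adjusted_rate("untreated")
print(f"naive (pooled) difference: {naive_diff:+.2f}")
print(f"adjusted for Z:            {adjusted_diff:+.2f}")
```

Within each stratum the treated arm does better, yet the pooled comparison points the other way because the treated arm is loaded with hard cases. That sign flip is Simpson's paradox, and weighting each stratum by its overall share removes it.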

Confounding as a causal graph (DAG): Z is the backdoor

      Z confounder
        ↙     ↘
X treatment — ? → Y outcome

observed association(X,Y) = true effect(X→Y) + spurious backdoor association(X←Z→Y)
control for Z = close the backdoor, what remains is causal
Code example
import pandas as pd, numpy as np
import statsmodels.formula.api as smf

np.random.seed(0)
n = 2000
Z = np.random.normal(size=n)          # confounder: e.g. "user's baseline activity"
X = Z + np.random.normal(size=n)      # Z raises X (active users use new feature more)
Y = Z + np.random.normal(size=n)      # Z raises Y too; true X effect on Y = 0
df = pd.DataFrame({"X": X, "Y": Y, "Z": Z})

# Without controlling Z: X looks strongly to drive Y (spurious)
print(smf.ols("Y ~ X", df).fit().params["X"])      # ≈ 0.5
# Controlling Z (close backdoor): X coefficient collapses to truth
print(smf.ols("Y ~ X + Z", df).fit().params["X"])  # ≈ 0
Pitfall + use case
"With enough data, correlation can stand in for causation" — the bigger the dataset, the worse this gets. More samples narrow the confidence interval of the spurious association, making you more certain of a false conclusion. Big data doesn't fix confounding bias; it just estimates the bias more precisely. Only design fixes it: randomization, or explicitly modeling the confounder.
📌 Super-individual scenario: you notice "the weeks I used an AI assistant, my output was higher." Don't credit the tool yet. Ask: were those weeks ones where you were already in good shape (confounder Z = energy/task difficulty), making you both more willing to use AI and more productive? To test causation, make "use AI or not" independent of your state — e.g. randomly assign some tasks to use it and some not.
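The pitfall above ("more data just estimates the bias more precisely") can be checked numerically. A minimal sketch, assuming the same toy setup as the regression example (Z drives X and Y, true X effect is 0); the helper name `biased_slope_and_se` is hypothetical.

```python
import numpy as np

def biased_slope_and_se(n, seed=0):
    """OLS slope of Y on X with Z omitted, plus the slope's standard error."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                 # confounder, left out of the regression
    x = z + rng.normal(size=n)
    y = z + rng.normal(size=n)             # true effect of X on Y is 0
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc)
    residuals = yc - slope * xc
    se = np.sqrt((residuals @ residuals) / (n - 2) / (xc @ xc))
    return slope, se

slope_small, se_small = biased_slope_and_se(200)
slope_big, se_big = biased_slope_and_se(200_000)
print(f"n=200:     slope={slope_small:.3f}  se={se_small:.4f}")
print(f"n=200000:  slope={slope_big:.3f}  se={se_big:.4f}")
```

The slope settles near 0.5 rather than the true 0, while the standard error keeps shrinking: a tighter interval around the wrong number.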
Takeaway + question
💡 Correlation is P(Y|X), causation is P(Y|do(X)) — the entire gap hides in the common upstreams you never drew into the dependency graph.
🤔 Recall a recent decision you made from a "data correlation": is there a hidden common cause that could explain both ends at once?

Potential Outcomes Framework / Rubin Model

Theoretical Framework · Counterfactual
One-line analogy

A counterfactual is "what would have happened if I hadn't shipped this deploy?" The catch: you cannot both deploy and not-deploy the same server at the same instant. This is exactly the A/B test mindset, but with an unavoidable flaw — for each unit you can only ever observe one parallel universe; the other is forever missing data.

What it solves + how it works

Rubin's potential-outcomes framework assigns each unit i two values: Y_i(1) (outcome if treated) and Y_i(0) (if not). The individual causal effect is Y_i(1) − Y_i(0). The trouble: i is either treated or not, so you only ever observe one; the other is the counterfactual. This is the Fundamental Problem of Causal Inference — essentially a missing-data problem.

Since individual effects are unobservable, we settle for an average: ATE = E[Y(1) − Y(0)]. But naively taking "treated-group mean − untreated-group mean" carries selection bias: people who choose treatment differ from those who don't. Randomization is the key: random assignment makes "treated or not" independent of the potential outcomes, so the two groups are comparable in expectation before treatment, and the between-group difference in means is an unbiased estimate of the ATE. That's why the RCT (randomized controlled trial) is the gold standard.
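The selection-bias argument can be sketched by simulating both assignment schemes on the same units. The self-selection rule below (units with a higher baseline opt in more often) is an assumption made for illustration, as are the variable names.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
tau = 4.0                                   # true effect, constant across units
y0 = rng.normal(50, 10, n)                  # potential outcome if untreated
y1 = y0 + tau                               # potential outcome if treated

# Self-selection (assumed rule): units with a higher baseline opt in more often
t_self = (y0 + rng.normal(0, 10, n) > 50).astype(int)
# Randomization: a coin flip that ignores the unit entirely
t_rand = rng.binomial(1, 0.5, n)

def diff_in_means(t):
    y_obs = np.where(t == 1, y1, y0)        # only the assigned arm is observed
    return y_obs[t == 1].mean() - y_obs[t == 0].mean()

naive_self = diff_in_means(t_self)
naive_rand = diff_in_means(t_rand)
# Visible only inside a simulation: baseline gap between the self-selected groups
selection_bias = y0[t_self == 1].mean() - y0[t_self == 0].mean()

print(f"self-selected diff:   {naive_self:.2f}")
print(f"tau + selection bias: {tau + selection_bias:.2f}")
print(f"randomized diff:      {naive_rand:.2f}")
```

With a constant effect, the self-selected difference equals tau plus the baseline gap between the groups. In real data that gap is invisible, because it needs Y(0) for treated units; randomization drives it toward zero by design.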

Unit            Y(0) untreated      Y(1) treated        Indiv. effect
A (treated)     ? (counterfactual)  8                   unknowable
B (untreated)   5                   ? (counterfactual)  unknowable
Every row has a missing cell (marked ?) → individual effects are never fully seen; only group averages, via randomization
Code example
import numpy as np
np.random.seed(1)
n = 5000
T = np.random.binomial(1, 0.5, n)   # random assignment: key! independent of the unit
Y0 = np.random.normal(50, 10, n)        # potential outcome if untreated
tau = 4.0                               # true causal effect +4
Y1 = Y0 + tau
Y = np.where(T == 1, Y1, Y0)        # observe only the assigned arm

# Thanks to randomization, group diff is an unbiased ATE estimate
ate = Y[T == 1].mean() - Y[T == 0].mean()
print(round(ate, 2))   # ≈ 4.0, close to true tau
Pitfall + use case
"I have a treatment group and a control group, so comparing means gives the effect" — only true under random assignment. In observational data the two groups are inherently incomparable (selection bias), so a raw mean comparison gives correlation, not the ATE. The framework's value isn't the formula — it's that it forces you to spell out what the counterfactual is and why it's missing.
📌 Decision-support scenario: design a "micro-RCT" for a personal project. To learn "does the Pomodoro technique actually help?", don't only use it on bad days — flip a coin to randomly decide each day whether to use it, then compare the two groups' output after two weeks. Randomization spreads the confounder (your daily state) evenly across both groups, sparing you from modeling a pile of control variables.
Takeaway + question
💡 The fundamental problem of causal inference is missing data: you only ever see one parallel universe per unit, and randomization is the only clean way to "borrow back" the other.
🤔 Which "obviously effective" habit of yours has never been tested against the counterfactual of "not doing it"? How would you design a minimal randomized experiment for it?

Instrumental Variables (IV)

Quasi-experiment · Identification
One-line analogy

When you can't randomize and have a confounder you can't shake off, hunt for a natural randomizer: an external nudge that changes only "treated or not," reaches the outcome through no other route, and is unrelated to the confounder. Analogy: your canary system assigns a feature flag via a random seed; who gets the new feature is random and unrelated to user profile. So you treat "randomly assigned the flag" as a lever to pry out the feature's causal effect, even though whether users actually adopt the feature is self-selected.

What it solves + how it works

With unobserved confounding, "control for the confounder" fails: you can't even measure it. (From here on, the unobserved confounder is called U; Z now denotes the instrument, not the confounder of the first section.) An instrument Z sidesteps the problem if it meets three conditions: (1) relevance: Z genuinely affects treatment X (Z→X is strong enough); (2) exclusion: Z affects Y only through X, with no other path; (3) independence: Z is unrelated to U, i.e. "as good as randomly assigned."

The mechanism is two-stage least squares (2SLS): stage one uses Z to predict X, isolating the part of X's variation driven only by Z (this part is "clean," confounder-free); stage two uses that clean prediction to explain Y. Intuitively, you use only the small exogenous wiggle the instrument provides to estimate the effect, discarding all the confounder-polluted variation. The cost: larger variance, and conditions (2) and (3) can't be verified from the data alone; they can only be argued from domain knowledge. That's IV's most fragile point.
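The two stages can be written out by hand to see what a 2SLS routine does internally. This is a sketch under assumed data (one instrument, one treatment, a simulated confounder `u` that a real analyst would not have); `fit_line` is a hypothetical helper, and the standard errors from a manual second stage would be wrong, so a dedicated IV routine is still the right tool for inference.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
z = rng.normal(size=n)                         # instrument
u = rng.normal(size=n)                         # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)           # treatment, polluted by u
y = 2.0 * x + 3.0 * u + rng.normal(size=n)     # true effect of x on y is 2

def fit_line(a, b):
    """Least-squares intercept and slope of b regressed on a."""
    slope = np.cov(a, b)[0, 1] / np.var(a, ddof=1)
    return b.mean() - slope * a.mean(), slope

# Plain OLS: absorbs the u-driven association
_, beta_ols = fit_line(x, y)

# Stage one: keep only the part of x that z predicts
a1, b1 = fit_line(z, x)
x_hat = a1 + b1 * z
# Stage two: regress y on that clean prediction
_, beta_2sls = fit_line(x_hat, y)

# With one instrument, 2SLS equals the ratio cov(z, y) / cov(z, x)
beta_wald = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS:  {beta_ols:.2f}")
print(f"2SLS: {beta_2sls:.2f}")
```

The ratio form makes the fragility visible: the denominator is the instrument's pull on X, so a weak instrument means dividing by something close to zero.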

Code example
import numpy as np, pandas as pd
from linearmodels.iv import IV2SLS  # pip install linearmodels

np.random.seed(2)
n = 4000
Z = np.random.normal(size=n)              # instrument: exogenous random nudge
U = np.random.normal(size=n)              # unobserved confounder, unmeasurable
X = 0.8*Z + U + np.random.normal(size=n)  # X driven by both Z and confounder U
Y = 2.0*X + 3*U + np.random.normal(size=n)  # true effect=2, but U pollutes OLS
df = pd.DataFrame({"Y": Y, "X": X, "Z": Z})

# Plain OLS of Y on X (not run here) would be biased by U, landing far from 2; IV pries out the clean effect
iv = IV2SLS.from_formula("Y ~ 1 + [X ~ Z]", df).fit()
print(round(iv.params["X"], 2))   # ≈ 2.0, close to true causal effect
Pitfall + use case
"A weak instrument will do" — very dangerous. When Z→X is weak (a weak instrument), stage one carries almost no information, and 2SLS wildly amplifies bias and variance — the result can be worse than plain OLS. Always check the stage-one F-statistic (a common rule of thumb is > 10) before trusting it.
📌 Cross-disciplinary scenario: in economics, "the effect of years of schooling on earnings" has been estimated with quarter of birth as an instrument (Angrist & Krueger, 1991). School-entry-age rules combined with compulsory-attendance laws mean children born in different quarters end up with slightly more or less schooling, and quarter of birth is plausibly unrelated to individual ability: a natural randomizer. Learning to spot "quasi-random events" in life (policy switches, geographic borders, lotteries) is the core skill of treating observational data like an experiment.
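The stage-one F-statistic mentioned in the pitfall above is easy to compute for a single instrument. A minimal sketch, assuming one instrument and an intercept, where F equals the squared t-statistic of the first-stage slope; `first_stage_f` and the 0.02 "weak" coefficient are illustrative choices.

```python
import numpy as np

def first_stage_f(z, x):
    """F-statistic for regressing x on a single instrument z (plus intercept)."""
    n = len(z)
    r = np.corrcoef(z, x)[0, 1]
    return r**2 * (n - 2) / (1 - r**2)      # equals t^2 of the first-stage slope

rng = np.random.default_rng(3)
n = 4000
z = rng.normal(size=n)
u = rng.normal(size=n)
noise = rng.normal(size=n)

x_strong = 0.8 * z + u + noise              # instrument moves x a lot
x_weak = 0.02 * z + u + noise               # instrument barely moves x

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"strong instrument F: {f_strong:.1f}")
print(f"weak instrument F:   {f_weak:.1f}")
```

Comparing each value against the rule-of-thumb threshold of 10 is the quick screen; with several instruments the joint F from a proper first-stage regression would be needed instead.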
Takeaway + question
💡 An instrumental variable is "borrowed randomness": find an exogenous lever that moves the treatment and reaches the outcome only through it, then estimate causation from its small clean wiggle.
🤔 For some "does X cause Y" question you care about, is there an event in the world that near-randomly changed X yet has no direct link to Y?

Difference-in-Differences (DiD)

Quasi-experiment · Panel Data
One-line analogy

You change the config on one shard (treatment) and leave another untouched (control). Just looking at the treated shard before/after won't do — the whole cluster may be trending up over that period due to traffic seasonality. DiD's move: (treatment before-after diff) minus (control before-after diff), which subtracts out the shared time trend both groups lived through, leaving only the net effect of the config change.

What it solves + how it works

When treatment isn't randomly assigned, but you have two time points (before/after) and an untreated control group, DiD can identify the effect. It cancels two biases at once: the first difference (before vs after) removes each group's time-invariant fixed differences (e.g. the treated group's baseline was just higher); the second difference (treated vs control) removes the time trend both groups share (the market's seasonal swing). After both subtractions, what's left is the causal effect.

The core assumption is parallel trends: absent treatment, the treated and control groups would have moved along parallel paths. This is the entire source of DiD's credibility, and it can't be proven directly; you can only support it by checking that the two lines ran parallel across multiple pre-treatment periods. The classic case is Card & Krueger (1994) on minimum wage: New Jersey raised its minimum wage while neighboring Pennsylvania didn't, and a difference-in-differences on fast-food employment found no drop in New Jersey employment, contrary to what standard theory predicted.
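The parallel-trends logic amounts to building a counterfactual for the treated group out of the control group's change. A small sketch with illustrative group means (the variable names are assumptions for this example):

```python
# Illustrative group means from a 2x2 design
control_before, control_after = 20, 22
treated_before, treated_after = 18, 25

control_change = control_after - control_before          # shared time trend
# Parallel trends: absent treatment, the treated group would have moved the same amount
treated_counterfactual = treated_before + control_change
did_effect = treated_after - treated_counterfactual

# Same number, written as a difference of two differences
did_check = (treated_after - treated_before) - (control_after - control_before)
print(treated_counterfactual, did_effect)   # 20 5
```

Everything rests on the one assumed line: if the treated group would not have followed the control's change, `treated_counterfactual` is wrong and so is the effect.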

Parallel trends: counterfactual = control's trend shifted onto the treated group

[Figure: employment (vertical axis) over time (before | after), with three lines: treated (actual), treated (counterfactual, dashed), and control]
DiD effect = vertical gap between actual ● and counterfactual ○
counterfactual = control's change, shifted to the treated baseline
Code example
import pandas as pd
import statsmodels.formula.api as smf

# treat: in treatment group?  post: in post-treatment period?
df = pd.DataFrame({
    "y":     [20, 22, 18, 25],   # control before/after + treated before/after
    "treat": [0,  0,  1,  1],
    "post":  [0,  1,  0,  1],
})
# coefficient on interaction treat:post = the DiD causal effect
m = smf.ols("y ~ treat + post + treat:post", df).fit()
print(m.params["treat:post"])
# = (25-18) - (22-20) = 7 - 2 = 5  ← subtracted the +2 market trend
Pitfall + use case
"Just grab a control group and slap DiD on it" — pick the wrong control, parallel trends fails, and the conclusion is worthless. If the treated group was already accelerating upward before treatment (non-parallel trends to begin with), DiD will misattribute that natural difference to the treatment. Always plot multiple pre-treatment periods to confirm the lines stayed parallel (a "pre-trends test").
📌 Personal-project scenario: to assess "did my weekly-review quality improve after switching note systems," don't just look at your own before/after — find a similar dimension you didn't switch as control (e.g. another category of notes left untouched), and use DiD to subtract out "you were generally more engaged this period," so you don't credit all the gain to the tool.
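A pre-trends check like the one described in the pitfall above can be sketched as "is the treated-minus-control gap stable before treatment?". The per-period means below are hypothetical, and the drift measure is one simple choice among many; a plot or an event-study regression would serve the same purpose.

```python
# Hypothetical per-period means; treatment starts at index 4 (the last period)
treatment_start = 4
control = [10, 12, 14, 16, 18]
treated_parallel = [15, 17, 19, 21, 28]    # tracks control before treatment
treated_diverging = [15, 18, 21, 24, 32]   # already pulling away before treatment

def gaps(treated, control):
    """Treated-minus-control gap in each period."""
    return [t - c for t, c in zip(treated, control)]

def pre_trend_drift(treated, control):
    """Spread of the gap across pre-treatment periods (0 means perfectly parallel)."""
    pre = gaps(treated, control)[:treatment_start]
    return max(pre) - min(pre)

def did_last_pre_vs_post(treated, control):
    """DiD using the last pre-treatment period as the baseline."""
    g = gaps(treated, control)
    return g[treatment_start] - g[treatment_start - 1]

for name, series in [("parallel", treated_parallel), ("diverging", treated_diverging)]:
    print(name, "drift:", pre_trend_drift(series, control),
          "DiD:", did_last_pre_vs_post(series, control))
```

In the diverging case the gap was already growing by one unit per period, so part of the DiD number is just that pre-existing drift being credited to the treatment.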
Takeaway + question
💡 DiD "subtracts twice" to peel off fixed between-group differences and the shared time trend — but all its credibility rests on "parallel trends," an assumption you can never prove, only support.
🤔 For the change you want to attribute, is there a control that "didn't undergo the change but would have fluctuated in sync"? Without it, how do you know the post-change movement wasn't just the market trend?

Further Reading

Deep Questions

1. If the RCT is the gold standard for causation, why invent quasi-experimental methods like IV and DiD at all? Where do they sit on the ladder of causal credibility?
Because RCTs are often impossible: unethical (you can't randomly make people smoke), infeasible (you can't randomly raise the minimum wage in some states), too costly or too slow. Quasi-experiments are "settling for letting nature randomize for you" — IV borrows an exogenous event as a randomizer; DiD borrows a control group to subtract the trend. The credibility ladder roughly: RCT > natural experiment / IV / DiD / regression discontinuity > confounder-controlled regression / matching > raw correlation. The lower you go, the more you lean on assumptions that can't be checked from data (exclusion, parallel trends), and the more fragile the credibility. Key insight: causal strength depends not on how complex the model is, but on how credible the identifying assumptions are. A simple DiD with an airtight control beats a fancy regression stuffed with controls that still can't close unobserved confounding — isomorphic to "correctness comes from invariants, not lines of code" in distributed systems.
2. ML models predict superbly — so why do ML engineers still stumble on causal questions? Are prediction and causation even the same thing?
Not at all the same, and this is one of the deepest traps in ML practice. Prediction optimizes "given the X I observe, what is Y?" and happily exploits every correlation, including spurious ones from confounding, because that lowers loss too. Causation asks "if I go change X, what happens to Y?", and now spurious associations are poison. A classic crash: a model finds "among pneumonia patients, asthmatics have lower mortality" (correct as a prediction, because asthmatics get more intensive care), but concluding "asthma protects pneumonia patients" and triaging on it would kill people. The lower mortality comes from the care intensity that an asthma diagnosis triggers, not from asthma itself, and triaging asthmatics as low-risk would remove exactly that care. A 0.99-AUC model can still give catastrophic advice on "what happens after an intervention." So interpretability and feature importance are not causal attribution: a high SHAP value only says the feature is useful for prediction, not that changing it changes the outcome. To decide (not predict), you must switch to a causal framework.
3. "Parallel trends" and "the exclusion restriction" are both assumptions data can't prove, only argument can support. How does this "depending on untestable assumptions" situation compare to engineering domains you know? How should you treat it?
This is highly isomorphic to depending on invariants you can't exhaustively test in distributed systems: you can never test "linearizable under all network partitions," only argue it from protocol design plus heavy indirect evidence (Jepsen tests, formal verification). The handling transfers too: (1) state the assumption explicitly, don't bury it in code or the model; in causal analysis, clearly declare "I assume parallel trends / exclusion"; (2) shore it up with indirect evidence: DiD uses multi-period pre-trend plots, IV uses placebo tests (run the same analysis on an outcome that should show no effect; if it "shows" one, the assumption is suspect); (3) sensitivity analysis: ask "how badly must the assumption fail before the conclusion flips," which is the kind of check DoWhy's refutation step is built for; (4) honestly label the uncertainty. Core mindset: a causal conclusion is always a conditional claim, "true under assumption X"; engineering maturity shows in knowing which bets you placed and how heavily, not in pretending you placed none.
4. Mapping the causal ladder (association → intervention → counterfactual) onto today's LLMs, which rung are they on? What does that mean for "can AI truly understand the world"?
Pearl's ladder has three rungs: association (what would seeing X tell me), intervention (what happens if I do X), counterfactual (what if I hadn't done X). LLMs trained on massive text learn the association rung — they model the joint distribution of tokens, fundamentally a powerful correlation engine. They can recite human-written causal knowledge (because the corpus contains causal statements), but that isn't the same as doing intervention/counterfactual reasoning themselves; faced with novel out-of-distribution interventions, they often give plausible-sounding but causally wrong answers. This is the live debate: some argue pure associational learning can't reach the upper rungs and you must inject explicit causal structure; others argue enough scale plus tools (agents that can actually run experiments and change world state via APIs) can approximate the intervention rung. The lesson for the "super-individual": use LLMs for prediction/generation/retrieval — their home turf — but for causal decisions (what will this change cause), treat them as a hypothesis-proposing assistant, while real causal judgment still needs you to verify with this article's frameworks. The human-AI division of labor falls exactly along the causal ladder.
5. How do you choose among this article's four tools (confounder-controlled regression, potential outcomes, IV, DiD)? Given a real problem, what's the decision path?
Roughly a decision tree: Can you randomize? → if yes, run an RCT (compute the ATE via potential outcomes), don't overthink. Can't randomize, but are all confounders observable? → if yes, control for them via regression/matching (close the backdoor), but honestly ask "did I really observe them all?" (usually not). Unobserved confounding, but can you find an exogenous instrument? → if yes, and it's strong enough (stage-one F > 10) with a defensible exclusion restriction, use IV. Have before/after panel data and a credible control group? → if yes and pre-trends are parallel, use DiD. The methods often combine: take a DiD main estimate, then use IV or sensitivity analysis as a robustness check. For BigCat's real case, "did the new workflow boost my output," the first choice is to randomly assign tasks to use it or not (micro-RCT); failing that, find an unchanged control dimension and run DiD. Always start with "what is my counterfactual, and why is it credible?"; the tools are just means to answer that.
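The decision path in the answer to question 5 can be written down as ordered checks. This is only a sketch: `choose_method` and its four boolean parameters are hypothetical names, the final fallback branch is an assumption not stated in the text, and real problems rarely reduce to clean yes/no answers.

```python
def choose_method(can_randomize, all_confounders_observed,
                  has_strong_valid_instrument, has_panel_with_parallel_pretrends):
    """Hypothetical helper: walk the decision path in the order given in the text."""
    if can_randomize:
        return "RCT: difference in means estimates the ATE"
    if all_confounders_observed:
        return "regression or matching: control for the confounders"
    if has_strong_valid_instrument:
        return "IV / 2SLS"
    if has_panel_with_parallel_pretrends:
        return "DiD"
    # Assumed fallback, not from the text: no design identifies the effect
    return "no credible identification: collect better data or redesign"

# The workflow example: no randomization, but an unchanged control dimension exists
print(choose_method(False, False, False, True))   # DiD
```

The ordering mirrors the credibility ladder from question 1: each later branch leans on assumptions that are harder to check from the data.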