AI/ML Deep Dive: Interpretability

Day 27 · 2026-06-13 · Difficulty ★★★★☆

For: engineers with coding experience, non-AI background

Engineering counterpart → super-individual D10: Hallucination mitigation (turning interpretability insight into anti-hallucination engineering)

Mechanistic Interpretability

reverse engineering · residual stream

One-line analogy

A trained neural network is a binary with no source code—weights are the "compiled artifact," a forward pass is "running it," yet no one ever wrote its internal logic. Mechanistic interpretability is disassembly + single-step debugging: without touching the weights, you attach "breakpoints" and "distributed tracing" to the black box and watch, layer by layer, what each component reads, what it writes, and which hop's signal it passes downstream. The goal is not "does the model predict well" but "what algorithm is actually running inside."

Problem it solves + how it works

Classic ML interpretability (SHAP, feature importance) only tells you "which input feature influenced the output"—external attribution on a black box. Mechanistic interpretability is more radical: it wants to recover the internal computational circuit, reading the weights like you'd read assembly. The key handle is a structural fact about Transformers—the residual stream:

Residual stream = a shared message bus

token embedding → residual stream →→→→→→→→→→→→ unembed → logits
                     ↑read ↓write     ↑read ↓write
                       Attn head          MLP        ... every layer adds its contribution onto the bus

Each component reads the bus, computes, and adds its result back—linearly additive, so it's decomposable

The key insight comes from Anthropic's A Mathematical Framework for Transformer Circuits (2021): because every attention head and MLP adds its result back to the residual stream (rather than overwriting), the whole bus is a linear superposition of each component's contribution—you can "pull out" any single hop and inspect it. The most famous finding is the induction head: an attention circuit that only emerges in models with ≥2 layers, which does this—"find where the current token last appeared in context, and copy the token that came after it." This is precisely one of the core mechanisms of in-context learning (see a few examples, then mimic them).
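The "linear superposition" claim can be checked on a toy bus. The sketch below is not a real Transformer: the component names, vectors, and unembedding direction are all made up, and it ignores the final layer norm that real models apply before unembedding. It only shows that when every component adds onto a shared vector, a logit splits exactly into per-component parts.

```python
# Toy residual stream: every component ADDS its output onto a shared vector.
# All names and numbers below are hypothetical, chosen only for illustration.

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

embed = [1.0, 0.0, 0.0]                      # token embedding written to the bus
contributions = {                            # made-up per-component writes
    "attn_L0": [0.0, 2.0, 0.0],
    "mlp_L0":  [0.5, 0.0, 0.0],
    "attn_L1": [0.0, 0.0, 3.0],
}

# The final residual stream is the running sum of everything written to it.
resid = embed
for write in contributions.values():
    resid = add(resid, write)

# A linear unembedding direction for one output token (also made up).
unembed_dir = [1.0, 1.0, -1.0]
total_logit = dot(resid, unembed_dir)

# Because the map is linear, the logit decomposes into per-component parts.
parts = {"embed": dot(embed, unembed_dir)}
parts.update({name: dot(w, unembed_dir) for name, w in contributions.items()})

print(total_logit, parts)
```

Reading `parts` tells you which hop pushed the logit up and which pushed it down, which is the toy version of "pulling out" a single component.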

Code example

# pip install transformer_lens — Neel Nanda's standard mech-interp library
from transformer_lens import HookedTransformer
model = HookedTransformer.from_pretrained("gpt2-small")

# run_with_cache: one forward pass, "recording" every layer's activations
tokens = model.to_tokens("The cat sat on the mat. The cat sat on the")
logits, cache = model.run_with_cache(tokens)

# grab layer 5, head 1's attention-weight matrix
attn = cache["pattern", 5][0, 1]   # shape: [seq, seq]
# see where the last token attends: the last token is " the", so an induction head
# should attend to the token right after the earlier " the", i.e. " mat"
print(attn[-1].argmax().item())  # if this head acts as an induction head here: the position of " mat"
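The rule an induction head is described as implementing ("find the previous occurrence of the current token, copy what came next") can be written as a plain lookup. This is a sketch of the behaviour, not of the attention mechanism itself; the function name is hypothetical and the word-level list is a stand-in for real tokenization.

```python
def induction_guess(tokens):
    """Return the token that followed the most recent earlier occurrence
    of the last token, or None if the last token has not appeared before."""
    current = tokens[-1]
    # scan backwards over earlier positions (excluding the last one)
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# word-level stand-in for real tokenization (an assumption for readability)
prompt = ["The", "cat", "sat", "on", "the", "mat", ".",
          "The", "cat", "sat", "on", "the"]
print(induction_guess(prompt))
```

For this prompt the current token is "the", so the rule selects the word that followed the earlier "the".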

Common pitfall + use case

"Interpretability = attach a natural-language explanation to the output"—wrong. Having the model explain itself (chain-of-thought, "why did you answer that") is self-report, which can be a post-hoc rationalization that doesn't reflect the real computation (Anthropic's work repeatedly shows the two diverge). The value of mechanistic interpretability is exactly that it distrusts the model's self-narration and autopsies the internal circuit directly.

📌 Real-life scenario: when you do interdisciplinary thinking you often ask "why does this system behave this way." Mech interp offers a paradigm—for any black box (an organization, a market, even your own cognitive habits) first ask "is there a decomposable causal circuit inside," rather than stopping at "input→output" correlation.

Takeaway + question

💡 Mechanistic interpretability treats a network as "a program to be disassembled," using the linear additivity of the residual stream to split a black box into readable circuits.
🤔 If a model's self-explanation can diverge from its real circuit, would you still trust "just ask the AI to explain" as a reliable audit?

Sparse Autoencoders (SAE)

dictionary learning · superposition

One-line analogy

A layer has only a few thousand neurons (GPT-2 small's MLP layers have 3,072 each) yet must represent tens of thousands of concepts. Its trick is to pack multiple concepts into the same dimensions, like bit-packing several boolean flags into one int, or hashing an unbounded key space into a fixed number of buckets. The cost: a single neuron becomes polysemantic. One neuron fires for "DNA sequences," "Arabic script," and "French" alike, so you can't read it. An SAE is the unpacker: it "decompresses" the packed representation back into a set of sparse, monosemantic named fields, each ideally representing exactly one concept.

Problem it solves + how it works

This "packing" phenomenon has a name: superposition. Anthropic's Toy Models of Superposition (2022) showed in small toy networks that when features are numerous and sparse, the network learns to pack n features into a space of far fewer than n dimensions, tolerating collisions because the features rarely co-activate. It is a lossy compression, and a major reason individual neurons are unreadable.
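A minimal numeric sketch of the idea, assuming a hand-built layout rather than a trained network: three features share a 2-D space by pointing 120 degrees apart, and each is read back with a dot product plus ReLU. The layout and the function names are illustrative assumptions.

```python
import math

# Three feature directions squeezed into a 2-D space, 120 degrees apart.
directions = [(math.cos(a), math.sin(a))
              for a in (0.0, 2 * math.pi / 3, 4 * math.pi / 3)]

def pack(features):
    """Superpose feature activations into one 2-D activation vector."""
    x = sum(f * d[0] for f, d in zip(features, directions))
    y = sum(f * d[1] for f, d in zip(features, directions))
    return (x, y)

def read(vec):
    """Read each feature back with a dot product followed by ReLU."""
    return [max(0.0, vec[0] * d[0] + vec[1] * d[1]) for d in directions]

sparse = read(pack([1.0, 0.0, 0.0]))   # one feature active
dense = read(pack([1.0, 1.0, 0.0]))    # two features active at once

print([round(v, 3) for v in sparse])
print([round(v, 3) for v in dense])
```

With a single active feature the readout is clean; with two active at once, each is read back at roughly half strength. That is the collision cost the text describes, and it stays cheap only while features rarely co-activate.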

The SAE mechanism is dictionary learning: train a very wide autoencoder that encodes the d-dimensional activation vector x into a feature vector that is much wider than d but almost all zeros, then decodes it back to x. Mathematically, minimize:

L = ‖x − x̂‖² + λ · ‖f‖₁

· ‖x − x̂‖² (reconstruction error): the unpacked form must restore the original activation as closely as possible, so little information is lost
· λ‖f‖₁ (L1 sparsity penalty): forces most dimensions of f = 0, only a few fire each time
· larger λ → sparser, more monosemantic, but worse reconstruction; this is the core trade-off
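The loss above can be computed by hand on a tiny example. This is a sketch under simplifying assumptions: no bias terms, hand-picked weights instead of trained ones, and plain lists instead of tensors. The names `sae_loss`, `w_enc`, and `w_dec` are illustrative, not taken from any library.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(w * a for w, a in zip(row, v)) for row in m]

def sae_loss(x, w_enc, w_dec, lam):
    """Return (loss, f) for L = ||x - x_hat||^2 + lam * ||f||_1."""
    f = relu(matvec(w_enc, x))            # wide, mostly-zero feature vector
    x_hat = matvec(w_dec, f)              # reconstruction of the activation
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in f)
    return recon + lam * sparsity, f

# Hypothetical tiny dictionary: d = 2 inputs, 4 features (weights made up).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # 4 x 2
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]        # 2 x 4

x = [2.0, 0.0]
loss, f = sae_loss(x, w_enc, w_dec, lam=0.1)
print(f, loss)
```

Here only one of the four features fires and reconstruction is exact, so the whole loss comes from the λ‖f‖₁ term. Raising `lam` makes every active feature more expensive, which is the trade-off in the last bullet.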

Why does the L1 penalty push toward monosemanticity? Because when only a few dimensions may activate at once, the cheapest way to reconstruct well is for each dimension to claim its own reusable concept instead of sharing one with others. Anthropic's Towards Monosemanticity (2023) trained a 16×-wide SAE on a small model, yielding nearly 15,000 features of which human raters judged about 70% cleanly map to a single concept; the 2024 Scaling Monosemanticity scaled it to production-grade Claude 3 Sonnet and found the famous "Golden Gate Bridge feature": clamp its activation to 10× max and the model goes all-in on the Golden Gate Bridge, even identifying as the bridge. This shows features aren't just correlational but causally steerable.
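Clamping can be pictured as overwriting one entry of the feature vector before decoding. The sketch below is a toy with made-up weights, a made-up feature index, and an assumed maximum activation; in the real experiment the steered activation is fed back into the model's forward pass, which is not shown here.

```python
def decode(w_dec, f):
    """Map a feature vector back into activation space."""
    return [sum(w * a for w, a in zip(row, f)) for row in w_dec]

def clamp_feature(f, index, value):
    """Return a copy of f with one feature pinned to a fixed activation."""
    steered = list(f)
    steered[index] = value
    return steered

# Hypothetical 2-D activation space with 3 dictionary features (made-up weights).
w_dec = [[1.0, 0.0, 0.5],
         [0.0, 1.0, 0.5]]
BRIDGE = 2                      # pretend feature 2 is the "bridge" feature
max_seen = 0.8                  # assumed max activation observed on real data

f = [1.0, 0.2, 0.0]             # features for some unrelated prompt
baseline = decode(w_dec, f)
steered = decode(w_dec, clamp_feature(f, BRIDGE, 10 * max_seen))
print(baseline, steered)
```

The steered activation is dominated by the clamped feature's decoder direction, which is the toy analogue of the model "going all-in" on one concept.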

Code example

# pip install sae-lens — load a community-pretrained SAE, no need to train one
from sae_lens import SAE
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
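# note: the 3-tuple return matches older sae-lens releases; newer releases may return
# only the SAE object, so adjust the unpacking to your installed version
sae, cfg, _ = SAE.from_pretrained("gpt2-small-res-jb", "blocks.7.hook_resid_pre")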

_, cache = model.run_with_cache(model.to_tokens("The Golden Gate Bridge is"))
acts = cache["blocks.7.hook_resid_pre"]      # layer-7 residual-stream activation

features = sae.encode(acts)                    # unpack into a very wide sparse feature vector
top = features[0, -1].topk(5)               # top-5 features on the last token
print(top.indices, top.values)             # each index maps to an interpretable concept
# look up what these indices mean on Neuronpedia: neuronpedia.org

Common pitfall + use case

"Every feature an SAE finds is a clean human concept"—don't be that optimistic. About 30% of features remain vague or hard to interpret; SAEs are also subject to the reconstruction-vs-sparsity trade-off—different λ yields different dictionaries—plus known issues like "feature splitting" (one concept fractured into several near-synonym features). An SAE is the best unpacker we have, but not a perfect dictionary.

📌 Real-life scenario: superposition is a superb cross-disciplinary metaphor—how does a capacity-limited system carry information far beyond its capacity? Brains, language, even a day's worth of your attention all "superpose." It's the same family of ideas as hash collisions, lossy compression, and "dimensional collapse" in complexity science—each illuminates the others.

Takeaway + question

💡 Neurons are unreadable because concepts are "superposed" into a few dimensions; SAEs use dictionary learning + L1 sparsity to unpack them into monosemantic features.
🤔 If a feature can be clamped to change a model's behavior (Golden Gate Bridge), how far—in principle—is "editing a model's beliefs" from "editing a person's beliefs"?

Feature Circuits

call graph · causal

One-line analogy

A single feature is like a microservice; a feature circuit is the call chain / DAG they form: an early-layer feature "detects this is a person's name" → a mid-layer feature "this is a French place" → a late-layer feature "output French." Features are nodes, weights are edges, and the whole graph is an attribution graph—the same thing as the microservice dependency graphs and Spark DAGs you've drawn. The ultimate goal of mech interp is to reduce a model behavior to one readable, verifiable circuit.

Problem it solves + how it works

Knowing "which features exist" (what SAEs do) is just a parts list, not understanding the circuit. To understand behavior you need to know how features connect and who triggers whom. The core method for validating connections is activation patching (a.k.a. causal tracing)—a clean causal experiment:

Activation patching: an A/B transplant for causal testing

1. Run a "clean" input (clean) and cache a component's activation
2. Run a "corrupt" input (corrupt), where the answer is broken
3. Transplant one clean activation into the corrupt forward pass
4. See how much the output recovers → more recovery = more causally critical

= swapping components one-by-one in a distributed system to find "which hop actually decided the result"
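The four steps can be sketched on a toy "model" with two components, where only one of them depends on the part of the input that differs between clean and corrupt. Everything here (component names, input fields, weights) is hypothetical; the point is the shape of the experiment and the recovery metric.

```python
# Toy "model": two components write to the output; only comp_a depends on the
# part of the input that differs between clean and corrupt. All hypothetical.

def comp_a(x):
    return 3.0 * x["subject"]

def comp_b(x):
    return 1.0 * x["template"]

COMPONENTS = {"comp_a": comp_a, "comp_b": comp_b}

def run(x, patch=None):
    """Forward pass; `patch` maps component name -> activation to transplant."""
    patch = patch or {}
    acts = {name: patch.get(name, fn(x)) for name, fn in COMPONENTS.items()}
    return sum(acts.values()), acts

clean = {"subject": 1.0, "template": 1.0}
corrupt = {"subject": 0.0, "template": 1.0}

clean_out, clean_acts = run(clean)          # step 1: cache clean activations
corrupt_out, _ = run(corrupt)               # step 2: corrupted baseline

recovery = {}
for name in COMPONENTS:                     # steps 3-4: patch one hop at a time
    patched_out, _ = run(corrupt, patch={name: clean_acts[name]})
    recovery[name] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

print(recovery)
```

A recovery near 1 means the patched hop carries the information that decides the answer; a recovery near 0 means the hop is irrelevant to this particular clean/corrupt difference.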

This beats "which neuron activated most": high activation is merely correlation, whereas patching is active intervention—directly answering "if we remove/replace this hop, does the result change?" Validate every edge this way and you reconstruct an attribution graph. Anthropic's 2025 "circuit tracing / attribution graphs" work applied this to production-grade Claude, mapping the real circuits behind multi-step reasoning, poetry rhyming, and mental math—finding that the model sometimes plans ahead (picks the rhyme first when writing poetry, then works backward to the line), something self-report cannot see at all.

Code example

# use activation patching to find "at which layer the last position has the right answer"
from functools import partial
clean = model.to_tokens("The Eiffel Tower is in the city of")
corrupt = model.to_tokens("The Colosseum is in the city of")
ans = model.to_single_token(" Paris")

_, clean_cache = model.run_with_cache(clean)

def patch_last_pos(corrupt_act, hook, layer):
    # transplant the clean residual stream at the final position only:
    # overwriting every position would reproduce the clean run at any layer,
    # and the two prompts may not tokenize to the same length anyway
    corrupt_act[:, -1] = clean_cache["resid_post", layer][:, -1]
    return corrupt_act

for L in range(model.cfg.n_layers):
    logits = model.run_with_hooks(corrupt, fwd_hooks=[
        (f"blocks.{L}.hook_resid_post", partial(patch_last_pos, layer=L))])
    print(L, logits[0, -1, ans].item())   # the layer where the " Paris" logit jumps = key layer

Common pitfall + use case

"Drawing one circuit means I fully understand the model"—don't overclaim. Existing circuits mostly recover narrow tasks (indirect-object identification, specific fact recall) locally; we're orders of magnitude from "reading the whole model," and circuits also superpose and interfere. It's real progress, but far from "full decompilation." Honestly: what we can read is still only a few parts of this machine.

📌 Real-life scenario: the activation-patching mindset—"to know if a hop is truly causal, transplant/replace it and see if the result changes"—maps directly onto decision retros: instead of arguing "did X cause the outcome," design a small experiment that counterfactually swaps X. It's distributed-systems chaos engineering applied to cognition.

Takeaway + question

💡 Features are parts; circuits are the program. Only causal ablation via activation patching upgrades "correlation" into a "circuit."
🤔 If a model is proven to "plan the rhyme ahead" when writing poetry, how much implicit planning hides beneath its token-by-token appearance that we assume isn't there?

Probing

read-only probe · diagnostic

One-line analogy

A probe is a read-only APM probe attached to a running system: tap into a model layer's activations, train a very lightweight linear classifier, and ask "does this layer's internal state encode some information (part-of-speech? sentiment? truthfulness?)." The probe doesn't change the model or join training—it's pure post-hoc tapping of the bus, exactly like attaching a packet sniffer to an internal message bus to see "does this data flow carry field X."

Problem it solves + how it works

SAEs and circuits are heavy and research-grade; probing is the most lightweight, earliest, most practical interpretability tool, originating in Alain & Bengio's Understanding intermediate layers using linear classifier probes (2016). The mechanism is minimal: freeze the model, take layer-ℓ activations as features, train a linear classifier to predict the attribute you care about. The key constraint is "linear":

· If a simple linear probe can read "sentiment" out of some layer, that layer already encodes sentiment linearly separably: the information is "ready to use";
· If you need a deep nonlinear probe to read it out, it's likely the probe itself learned it, and you can't claim the model "represented" it.
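Why "linear" is the bar can be seen on two toy sets of 2-D "activations": one where the label is linearly encoded, and one where it is encoded as an XOR of the two dimensions. The sketch uses a perceptron as a stand-in linear probe; the data points are made up, and accuracy is measured on the training points only because the goal here is separability, not generalization.

```python
def train_linear_probe(data, epochs=20):
    """Perceptron: a minimal linear classifier trained on (vector, label) pairs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred                      # 0 when correct, +/-1 when wrong
            w = [w[0] + err * x[0], w[1] + err * x[1]]
            b += err
    return w, b

def accuracy(data, w, b):
    hits = sum((1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) == y
               for x, y in data)
    return hits / len(data)

# Label is linearly encoded (sign of the first dimension).
linear_data = [((1.0, 0.5), 1), ((2.0, -1.0), 1),
               ((-1.0, 0.3), 0), ((-2.0, -0.7), 0)]
# Label is the XOR of the two signs: present in the data, but not linearly.
xor_data = [((1.0, 1.0), 0), ((1.0, -1.0), 1),
            ((-1.0, 1.0), 1), ((-1.0, -1.0), 0)]

acc_linear = accuracy(linear_data, *train_linear_probe(linear_data))
acc_xor = accuracy(xor_data, *train_linear_probe(xor_data))
print(acc_linear, acc_xor)
```

The linear probe separates the first set perfectly but cannot do so on the XOR set. A deep probe could fit the XOR set, but then the probe, not the layer, would be doing the computation.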

Alain & Bengio swept linear probes across ResNet and found linear separability rises monotonically with depth: low layers hold edges and textures, higher layers grow more abstract. This gave quantifiable evidence that deep nets "refine representations layer by layer."

Code example

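from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# collect each sentence's layer-6, last-token activation, paired with a sentiment label
# `dataset` is a placeholder: any iterable of (text, label) pairs
X, y = [], []
for text, label in dataset:          # label: 1=positive 0=negative
    _, cache = model.run_with_cache(model.to_tokens(text))
    X.append(cache["resid_post", 6][0, -1].cpu().numpy())   # .cpu() so this also works on GPU
    y.append(label)

# hold out a test split: accuracy on the training data would overstate separability
X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0)

# linear probe: readable linearly = this layer already encodes sentiment "ready to use"
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear separability (held-out probe accuracy):", probe.score(X_test, y_test))
# repeat across layers → see at which layer sentiment "crystallizes"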

Common pitfall + use case

"If a probe can read X, the model uses X to decide"—the classic trap. A probe only proves the information exists in the activations (correlation), not that the model actually uses it in the forward pass (causation). "This layer can read out syntax" doesn't mean prediction depends on syntax—to test causation, go back to card 3's activation patching. Probing answers "is it there," patching answers "was it used"—don't conflate the two jobs.

📌 Real-life scenario: a probe is the most cost-effective diagnostic starting point. To know whether some open model "actually understands legal entities / medical concepts" internally, don't dissect circuits—first train a few linear probes across layers; 10 lines of code give a quick checkup ("this info barely exists / only crystallizes at layer N"), then decide whether to fine-tune or switch models.
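The "is it there" versus "was it used" distinction from the pitfall above can be made concrete with a toy activation that carries a signal the downstream computation never reads. The layout and function names are hypothetical; the probe is a hand-set threshold standing in for a trained linear classifier.

```python
# Toy layer activation: [used_signal, unused_signal]. The downstream "head"
# only reads index 0. Everything here is a made-up illustration.

def downstream(act):
    return 2.0 * act[0]            # index 1 is never read

def probe_reads_b(act):
    """A stand-in linear probe: thresholds the unused dimension."""
    return 1 if act[1] > 0 else 0

clean = [1.0, 1.0]      # attribute B present
corrupt = [1.0, -1.0]   # attribute B absent, everything else identical

# Probing: the information IS there (the probe separates the two inputs).
probe_separates = probe_reads_b(clean) != probe_reads_b(corrupt)

# Patching: transplant the clean value of index 1 into the corrupt activation.
patched = [corrupt[0], clean[1]]
effect = downstream(patched) - downstream(corrupt)

print(probe_separates, effect)
```

The probe succeeds while the patch changes nothing, so a high probe accuracy on its own would have been misleading about what the model relies on.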

Takeaway + question

💡 A probe is a read-only tap: using "linear separability" as the bar, it quantifies "does a layer already encode some information"—light and fast, but tests correlation only, not causation.
🤔 The gap between "information exists" and "information is used"—does it happen daily in systems you know (a field is in the logs ≠ the business logic reads it)?

Deep Questions

1. Probing, SAEs, and activation patching all aim to "understand the model." Where does each sit on the correlation-vs-causation axis, and how should you combine them?

This is the key thread of the day. The three form a pipeline from correlation to causation, coarse to fine. Probing is cheapest, answering "does a layer encode X"—pure correlation, proving information exists but not that it's used (a probe reading syntax ≠ the model using syntax). SAEs solve the superposition problem of unreadable neurons, unpacking activations into monosemantic features—a nameable parts list, finer than probing's "is it there," but feature activation alone is still correlational. Activation patching rises to causation: actively transplanting/ablating a hop to answer "remove it, does the result change." In practice: first use probes to quickly sweep layers and locate "roughly which layer the info crystallizes" (10-line checkup) → at the key layer, use an SAE to unpack candidate features → use patching to verify which features/components are truly causal (assembling a circuit). Remember it as: probing asks "where," SAEs ask "what," patching asks "did it do it." Your distributed intuition transfers directly—this is the cognitive version of tracing (where's the signal) → service teardown (what is it) → chaos injection (was it the cause).

2. Anthropic repeatedly finds that "a model's self-explanation (CoT / self-report) diverges from its real circuit." What does this mean for governance approaches like "make the AI explain itself"?

It means treating self-report as audit evidence is dangerous. When a model generates "I answered this because…," it's generating plausible-looking explanation text, not reading and reporting its real computation—architecturally these are not the same thing. Circuit tracing has direct cases: a model says it reasons by some steps while the circuit takes another path (in mental math it says "carry the digit" while the circuit uses a set of parallel approximate lookup features). Three implications: (1) alignment and safety can't rely on "make the model confess"—a model that can deceive can also fabricate its "confession"; mechanistic interpretability is one of the few tools that can bypass self-narration and autopsy directly, which is why it's critical AI-safety infrastructure; (2) CoT is still useful—it genuinely improves computation (more intermediate tokens = more compute), but "CoT text" ≠ "a faithful internal log," and the gap itself must be monitored; (3) a direct lesson for anti-hallucination engineering (see the D10 counterpart): rather than having the model "self-assess for hallucination," use externally verifiable means (retrieval grounding, consistency sampling, future feature-level monitoring)—don't hand the verdict to the defendant.

3. Superposition says "a capacity-limited system is forced to superpose excess information into few dimensions." Does this hold for brains, language, organizations? Is it a bug or a feature?

Most likely a universal principle, and a feature, not a bug. Superposition's preconditions are very general: concepts far outnumber available dimensions, and those concepts rarely co-occur (sparse)—satisfy both and any finite system will "rationally" choose to superpose, trading "minimal collisions" for "high capacity." Brains: neuron counts are far too few for "one concept per neuron," so distributed, superposed coding is nearly inevitable—explaining why single-electrode recordings struggle to read individual neurons (isomorphic to a model's polysemantic neurons). Language: a finite vocabulary expressing infinite meaning via polysemy + context disambiguation—that's superposition at the semantic layer, and SAE "feature splitting/merging" has a direct counterpart in semantics. Organizations: one person wearing many hats, one process carrying many goals—also superposition under capacity limits, with the same cost ("hard to read what a role is actually doing"). Why call it a feature: superposition is the optimal compromise between capacity and separability—as long as collisions are sparse, the gains far exceed occasional interference. It turns into a bug the moment "the sparsity assumption breaks" (concepts start co-occurring frequently) and collisions explode—corresponding to cognitive overload in brains, role conflict in organizations, feature interference in models. A deep reminder for the "AI super-individual": the more roles you superpose on yourself, the more you must keep the precondition "they rarely fire simultaneously," or you slide from high capacity into high interference.

4. If one day we could fully decompile every circuit of a frontier model, what good and bad things follow? Is "full interpretability" a pure blessing?

The upsides are obvious: auditability (spot deception, backdoors, dangerous-capability circuits in advance), precise editing (point-edit beliefs like Golden Gate Bridge, debias or fix errors without full retraining), alignment verification (directly see whether a model "wants" to do something, not just what it says). But "full interpretability" is no pure blessing—at least three tensions: (1) capability-risk symmetry—a tool that point-edits beliefs can also point-inject them; a method that reads out dangerous capabilities can also help extract them. Interpretability is dual-use. (2) decompilers get adversaries—once "readable circuits" becomes the audit standard, you get "models trained to resist interpretability" (hiding dangerous circuits in deeper superposition); it's an arms race, not an endgame. (3) philosophical discomfort—if "belief = an externally clampable feature" holds for models, the metaphor for human minds unsettles many: the boundaries of free will, persuasion, and brainwashing get re-litigated. My take: interpretability is net positive and a necessity (you can't safely deploy strong models without understanding them), but it's not a "switch that makes things safe"—it transforms safety from "black-box trust" into a governance question of who may read and write these circuits. Technology solves "can we look"; "who may change it, and to what" is the harder social choice afterward.

5. Port mech interp's methodology (distrust self-report, find decomposable causal circuits, validate with counterfactual intervention) onto "understanding your own cognition." What do you get?

You get a counterintuitive but powerful self-knowledge discipline. Mapping point by point: (1) distrust self-report → your post-hoc explanation of "why I decided this" likely diverges from the real "circuit" driving you—just like a model's CoT. Admitting this is the first step out of self-deception: the reason you give is a rationalization, not necessarily the cause. (2) find decomposable causal circuits → rather than describing yourself with a vague "I'm just anxious," ask "is there a nameable trigger chain" (a specific context feature → a thought → a behavior). This is isomorphic to splitting "model behavior" into a "feature circuit," and echoes the Buddhist notion of dependent origination—everything is a chain of conditions aggregating, not a single essence. (3) validate with counterfactual intervention → don't stop at introspection (correlation); run an activation-patching-style mini-experiment: remove/replace a condition and see whether the behavior changes—only then do you know the real cause. That's chaos engineering applied to the mind. Deeper insight: mech interp and introspection/mindfulness are strikingly aligned—both refuse to trust surface narratives, both observe "the process itself" directly, both treat "the self" as a decomposable aggregate of processes rather than an indivisible black box. The only difference is tooling: one uses patching and SAEs, the other sustained awareness. For someone pursuing the "AI super-individual," both paths point to the same thing—turning any black box (a model or yourself) into a readable, verifiable, intervenable system.
