A trained neural network is a binary with no source code—weights are the "compiled artifact," a forward pass is "running it," yet no one ever wrote its internal logic. Mechanistic interpretability is disassembly + single-step debugging: without touching the weights, you attach "breakpoints" and "distributed tracing" to the black box and watch, layer by layer, what each component reads, what it writes, and which hop's signal it passes downstream. The goal is not "does the model predict well" but "what algorithm is actually running inside."
Classic ML interpretability (SHAP, feature importance) only tells you "which input feature influenced the output"—external attribution on a black box. Mechanistic interpretability is more radical: it wants to recover the internal computational circuit, reading the weights like you'd read assembly. The key handle is a structural fact about Transformers—the residual stream:
The key insight comes from Anthropic's A Mathematical Framework for Transformer Circuits (2021): because every attention head and MLP adds its result back to the residual stream (rather than overwriting), the whole bus is a linear superposition of each component's contribution—you can "pull out" any single hop and inspect it. The most famous finding is the induction head: an attention circuit that only emerges in models with ≥2 layers, which does this—"find where the current token last appeared in context, and copy the token that came after it." This is precisely one of the core mechanisms of in-context learning (see a few examples, then mimic them).
# pip install transformer_lens  (Neel Nanda's standard mech-interp library)
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

# run_with_cache: one forward pass that records every layer's activations
tokens = model.to_tokens("The cat sat on the mat. The cat sat on the")
logits, cache = model.run_with_cache(tokens)

# grab layer 5, head 1's attention-weight matrix
attn = cache["pattern", 5][0, 1]  # shape: [query_pos, key_pos]

# see where the last token (" the") attends: an induction head should put weight on the
# token after the earlier " the", i.e. " mat". Many heads also rest on the leading BOS
# token, so inspect the whole row rather than trusting argmax alone.
top_pos = attn[-1].argmax().item()
print(top_pos, model.to_str_tokens(tokens)[top_pos])
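The induction rule quoted above ("find where the current token last appeared in context, and copy the token that came after it") can also be written as a few lines of plain Python. This is only a sketch of the rule itself, not of how an attention head implements it; the function name `induction_predict` and the hand-written token list are illustrative assumptions, not part of any library.

```python
def induction_predict(tokens):
    """Apply the induction rule: find the most recent earlier occurrence of the
    current (last) token and return the token that followed it."""
    current = tokens[-1]
    # scan earlier positions from right to left, skipping the current token itself
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the current token has not appeared earlier in the context


# hand-written tokens (leading spaces mimic GPT-2 style tokenization)
context = ["The", " cat", " sat", " on", " the", " mat", ".",
           " The", " cat", " sat", " on", " the"]
print(repr(induction_predict(context)))  # ' mat'
```

Note that the current token here is " the", so the rule copies " mat", the token that followed the earlier " the".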
A layer may have only ~3,000 neurons (GPT-2 small's MLP layers have 3,072) yet must represent tens of thousands of concepts. Its trick is to pack multiple concepts into the same dimension, like bit-packing several boolean flags into one int, or hashing an unbounded key space into finite buckets. The cost: a single neuron becomes polysemantic. One neuron fires for "DNA sequences," "Arabic script," and "French" at once, and you can't read it. A sparse autoencoder (SAE) is the unpacker: it "decompresses" the packed representation back into a set of sparse named fields, each of which ideally represents exactly one concept (is monosemantic).
This "packing" phenomenon has a name: superposition. Anthropic's Toy Models of Superposition (2022) showed in small toy networks that when features are numerous and sparse, the network learns to pack n features into a space with far fewer than n dimensions. It tolerates the resulting collisions because the features rarely co-activate. The result is a lossy compression, and a major reason individual neurons are unreadable.
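A tiny numerical sketch can make the trade-off concrete. It packs 5 features into 2 dimensions by giving each feature its own direction, then reads them back with dot products. The evenly spaced directions and the argmax readout are simplifications chosen for this illustration, not the trained setup from the paper.

```python
import numpy as np

n_features, n_dims = 5, 2
angles = 2 * np.pi * np.arange(n_features) / n_features
# each column is one feature's direction: 5 unit vectors spread evenly in 2-D
W = np.stack([np.cos(angles), np.sin(angles)])  # shape [n_dims, n_features]

def compress(x):
    """Pack an n_features-long vector into n_dims numbers."""
    return W @ x

def readout(h):
    """Score every feature by its dot product with the packed vector."""
    return W.T @ h

# sparse input: only feature 3 is active, and the readout recovers it
x_sparse = np.zeros(n_features)
x_sparse[3] = 1.0
recovered = int(np.argmax(readout(compress(x_sparse))))

# co-active input: features 0 and 2 fire together and collide
x_dense = np.zeros(n_features)
x_dense[[0, 2]] = 1.0
confused = int(np.argmax(readout(compress(x_dense))))

print(recovered, confused)  # 3 1
```

With one active feature the strongest readout is the right one. When features 0 and 2 fire together, their packed vectors sum to something that looks most like feature 1, which was never active. That is the collision cost the network accepts when co-activation is rare.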
The SAE mechanism is dictionary learning: train a very wide autoencoder that encodes the d-dimensional activation vector x into a feature vector that is much wider than d but almost all zeros, then decodes it back to x. Mathematically, minimize:
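The formula is not reproduced in this text. A standard form of the sparse-autoencoder objective, consistent with the encode/decode description above and the L1 penalty discussed next, is sketched here. The symbol names ($W_{\text{enc}}$, $W_{\text{dec}}$, $b_{\text{enc}}$, $b_{\text{dec}}$, $\lambda$, $m$) are notation assumed for this sketch, and published SAEs differ in details such as bias handling and decoder normalization.

```latex
% x \in \mathbb{R}^{d}: activation vector; f(x) \in \mathbb{R}^{m}: feature vector, with m \gg d
f(x) = \mathrm{ReLU}\left(W_{\text{enc}}\, x + b_{\text{enc}}\right), \qquad
\hat{x} = W_{\text{dec}}\, f(x) + b_{\text{dec}}

\mathcal{L}(x) = \underbrace{\lVert x - \hat{x} \rVert_2^2}_{\text{reconstruction}}
  \; + \; \lambda \underbrace{\lVert f(x) \rVert_1}_{\text{sparsity}}
```

The first term asks the decoder to rebuild the original activation; the second charges a price, scaled by $\lambda$, for every feature that is switched on.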
Why does the L1 penalty push toward monosemanticity? Because when only a few dimensions may activate on any given input, each dimension does best by claiming one independent, reusable concept instead of sharing a blend with others. Anthropic's Towards Monosemanticity (2023) trained a 16×-wide SAE on a small model, yielding nearly 15,000 features of which human raters judged about 70% cleanly map to a single concept. The 2024 Scaling Monosemanticity work scaled the approach to production-grade Claude 3 Sonnet and found the famous "Golden Gate Bridge feature": clamp its activation to 10× its maximum and the model goes all-in on the Golden Gate Bridge, even identifying as the bridge. This shows features aren't just correlational but causally steerable.
# pip install sae-lens  (load a community-pretrained SAE; no need to train one)
from sae_lens import SAE
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
# Older sae-lens releases return a (sae, cfg_dict, sparsity) tuple here; newer ones may
# return only the SAE, so adjust the unpacking to the installed version.
sae, cfg, _ = SAE.from_pretrained("gpt2-small-res-jb", "blocks.7.hook_resid_pre")

_, cache = model.run_with_cache(model.to_tokens("The Golden Gate Bridge is"))
acts = cache["blocks.7.hook_resid_pre"]  # layer-7 residual-stream activations
features = sae.encode(acts)              # unpack into a very wide, sparse feature vector
top = features[0, -1].topk(5)            # top-5 features on the last token
print(top.indices, top.values)           # each index is a candidate interpretable feature
# look up what these indices mean on Neuronpedia: neuronpedia.org
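To show what the encode/decode step computes under the hood, here is a from-scratch NumPy sketch of one SAE forward pass and its loss (reconstruction error plus an L1 penalty on the features). The function name `sae_forward`, the parameter names, and the hand-picked weights are assumptions for illustration, and real implementations such as sae-lens differ in details.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """One forward pass of a minimal sparse autoencoder.

    Returns (features, reconstruction, loss), where
    loss = squared reconstruction error + l1_coeff * L1 norm of the features.
    """
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps features non-negative
    x_hat = features @ W_dec + b_dec               # decode back to activation space
    recon_loss = np.sum((x - x_hat) ** 2)
    sparsity_loss = l1_coeff * np.sum(np.abs(features))
    return features, x_hat, recon_loss + sparsity_loss


# hand-checkable case: identity encoder and decoder on a 2-D activation
x = np.array([1.0, -2.0])
eye = np.eye(2)
zeros = np.zeros(2)
features, x_hat, loss = sae_forward(x, eye, zeros, eye, zeros, l1_coeff=0.5)
print(features, x_hat, loss)  # [1. 0.] [1. 0.] 4.5
```

In the hand-checkable case the ReLU zeroes out the negative coordinate, so the reconstruction misses it (error 4.0) and the single active feature costs 0.5 in L1 penalty. Training adjusts the weights to trade these two terms off across many activations.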
A single feature is like a microservice; a feature circuit is the call chain / DAG they form: an early-layer feature "detects this is a person's name" → a mid-layer feature "this is a French place" → a late-layer feature "output French." Features are nodes, weights are edges, and the whole graph is an attribution graph—the same thing as the microservice dependency graphs and Spark DAGs you've drawn. The ultimate goal of mech interp is to reduce a model behavior to one readable, verifiable circuit.
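The call-chain picture can be sketched as a small weighted DAG. In a purely linear circuit, the contribution of one route from an early feature to an output is the product of the edge weights along it, and the total effect is the sum over routes. The feature names and weights below are invented for illustration, and real attribution graphs are built from model internals rather than written by hand.

```python
# hypothetical features and edge weights, for illustration only
edges = {
    "name_detected": {"french_place": 0.8, "english_place": 0.1},
    "french_place": {"output_french": 0.9},
    "english_place": {"output_french": -0.5},
    "output_french": {},
}

def path_attributions(graph, source, target, path=None, weight=1.0):
    """Yield (path, weight) for every route from source to target.

    Assumes a linear circuit, so a route's weight is the product of its edges.
    """
    path = (path or []) + [source]
    if source == target:
        yield path, weight
        return
    for nxt, w in graph[source].items():
        yield from path_attributions(graph, nxt, target, path, weight * w)


paths = sorted(
    path_attributions(edges, "name_detected", "output_french"),
    key=lambda pw: -abs(pw[1]),  # strongest route first
)
for route, w in paths:
    print(" -> ".join(route), round(w, 3))
total_effect = sum(w for _, w in paths)
```

Sorting by absolute weight surfaces the dominant route, which is the readable "circuit" one would then try to verify by intervention.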
Knowing "which features exist" (what SAEs do) is just a parts list, not understanding the circuit. To understand behavior you need to know how features connect and who triggers whom. The core method for validating connections is activation patching (a.k.a. causal tracing)—a clean causal experiment:
This beats "which neuron activated most": high activation is merely correlation, whereas patching is active intervention, directly answering "if we remove/replace this hop, does the result change?" Validate every edge this way and you reconstruct an attribution graph. Anthropic's 2025 "circuit tracing / attribution graphs" work built such graphs for a production-grade Claude model (Claude 3.5 Haiku), mapping the real circuits behind multi-step reasoning, poetry rhyming, and mental math. It found that the model sometimes plans ahead (picking the rhyme word first when writing poetry, then composing the line to reach it), something self-report cannot see at all.
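# use activation patching to find which layer matters most for the right answer
# (reuses `model` from the snippets above)
from functools import partial

clean = model.to_tokens("The Eiffel Tower is in the city of")
corrupt = model.to_tokens("The Colosseum is in the city of")
ans = model.to_single_token(" Paris")
_, clean_cache = model.run_with_cache(clean)

def patch_last_pos(corrupt_act, hook, layer):
    # Transplant the clean residual stream at the final position only. Overwriting every
    # position would simply reproduce the clean run at any layer, and the two prompts
    # may tokenize to different lengths; the final " of" position is shared by both.
    corrupt_act[:, -1, :] = clean_cache["resid_post", layer][:, -1, :]
    return corrupt_act

for L in range(model.cfg.n_layers):
    # TransformerLens calls hooks as fn(activation, hook=hook_point), so the second
    # parameter must be named `hook`; partial binds the layer index.
    logits = model.run_with_hooks(
        corrupt,
        fwd_hooks=[(f"blocks.{L}.hook_resid_post", partial(patch_last_pos, layer=L))],
    )
    print(L, logits[0, -1, ans].item())  # the layer where the " Paris" logit jumps = key layer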
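Raw logits from a sweep like this are hard to compare across prompts. A common normalization rescales each patched result so that 0 means "no better than the corrupt run" and 1 means "fully restored to the clean run". The sketch below assumes scalar answer logits; the function name `patching_recovery` and the example numbers are hypothetical, not measured outputs.

```python
def patching_recovery(patched, clean, corrupt):
    """Fraction of the clean-vs-corrupt gap that a patch restores.

    0.0 = the patch changed nothing; 1.0 = the patch fully restored the clean answer logit.
    """
    gap = clean - corrupt
    if gap == 0:
        raise ValueError("clean and corrupt runs already agree; nothing to recover")
    return (patched - corrupt) / gap


# hypothetical " Paris" logits: clean run, corrupt run, and one patched run per layer
clean_logit, corrupt_logit = 12.0, 4.0
patched_by_layer = [4.0, 4.5, 6.0, 11.0, 12.0]
recovery = [patching_recovery(p, clean_logit, corrupt_logit) for p in patched_by_layer]
print(recovery)  # [0.0, 0.0625, 0.25, 0.875, 1.0]
```

In this made-up sweep the jump between the third and fourth entries would mark the layer worth inspecting more closely.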
A probe is a read-only APM probe attached to a running system: tap into a model layer's activations, train a very lightweight linear classifier, and ask "does this layer's internal state encode some information (part-of-speech? sentiment? truthfulness?)." The probe doesn't change the model or join training—it's pure post-hoc tapping of the bus, exactly like attaching a packet sniffer to an internal message bus to see "does this data flow carry field X."
SAEs and circuits are heavy and research-grade; probing is the most lightweight, earliest, most practical interpretability tool, originating in Alain & Bengio's Understanding intermediate layers using linear classifier probes (2016). The mechanism is minimal: freeze the model, take layer-ℓ activations as features, train a linear classifier to predict the attribute you care about. The key constraint is "linear":
Alain & Bengio swept linear probes across ResNet and found that linear separability rises monotonically with depth: low layers hold edges and textures, higher layers grow more abstract. This gave quantifiable evidence that deep nets "refine representations layer by layer."
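from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# collect each sentence's layer-6, last-token activation, paired with a sentiment label
# (`dataset` is a placeholder for any iterable of (text, label) pairs; `model` is reused from above)
X, y = [], []
for text, label in dataset:  # label: 1=positive, 0=negative
    _, cache = model.run_with_cache(model.to_tokens(text))
    X.append(cache["resid_post", 6][0, -1].cpu().numpy())
    y.append(label)

# score on held-out sentences: training accuracy would overstate what the layer encodes
X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.2, random_state=0
)

# linear probe: linearly readable = this layer already encodes sentiment "ready to use"
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("linear separability (held-out probe accuracy):", probe.score(X_test, y_test))
# repeat across layers → see at which layer sentiment "crystallizes"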
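The layer sweep suggested in that last comment can be rehearsed on synthetic data before touching a real model. The sketch below fabricates "activations" in which the label is written along one direction with increasing strength, standing in for deeper layers, and fits a least-squares linear probe at each level. The signal strengths, dimensions, and helper names are all assumptions for illustration; nothing here is measured from a real network.

```python
import numpy as np

def fit_linear_probe(X, y):
    """Least-squares linear probe: weights (plus bias) mapping activations to -1/+1 targets."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w

def probe_accuracy(w, X, y):
    """Fraction of examples whose predicted sign matches the 0/1 label."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean((A @ w > 0) == (y == 1)))


rng = np.random.default_rng(0)
d, n_train, n_test = 16, 500, 500
direction = np.ones(d) / np.sqrt(d)     # the direction along which the label is written
signal_by_layer = [0.0, 0.5, 1.0, 3.0]  # hypothetical: a stronger signal in deeper layers

accuracies = []
for strength in signal_by_layer:
    y = rng.integers(0, 2, size=n_train + n_test)
    noise = rng.normal(size=(len(y), d))
    X = noise + strength * (2 * y - 1)[:, None] * direction
    w = fit_linear_probe(X[:n_train], y[:n_train])
    accuracies.append(probe_accuracy(w, X[n_train:], y[n_train:]))  # held-out accuracy

print([round(a, 3) for a in accuracies])
```

The first "layer" carries no label information, so its held-out accuracy sits near chance. The last one carries a strong linear signal and is almost perfectly separable, which is the shape of curve a real layer sweep is looking for.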