AI/ML Explained: Graph Machine Learning

Day 37 · 2026-06-23
For: engineers with coding experience, outside the AI field

Almost every model in the past 36 days assumed the input was a sequence (text) or a grid (image). But much of the data you know best is actually a graph: service-call topologies, social relationships, knowledge graphs, database foreign-key networks. The core information in this data lives in the connections, not in any single node. Today is about how machine learning directly digests a "graph." Its core mechanism is just one thing: let nodes exchange information with their neighbors.

Message PassingMessage Passing / GNN

core mechanismGNN foundation
One-line analogy

Message passing is the gossip protocol from distributed systems: every round, each node sends its state to all neighbors, collects the states they send back, aggregates them, and updates itself. Run K rounds and each node "knows" the entire neighborhood within K hops — the exact same playbook as a gossip protocol converging the whole cluster's state through repeated rounds.

The problem it solves + how it works

Graph data has a brutal property: no fixed order, and neighbor counts vary. Text lines up as a sequence for a Transformer, images as a grid for a CNN — but a graph? Node A has 3 neighbors, node B has 500, and you can't pack that into a fixed-shape tensor. Early approaches hand-crafted graph features (degree, centrality…), which was laborious and lossy.

The Message Passing Neural Network (MPNN, Gilmer et al. 2017) unifies all graph neural networks into one framework. Each layer does three things —

Layer k: one update of node v

neighbor u₁neighbor u₂neighbor u₃
↓ ① Message: each neighbor builds a message m = f(h_u)
m₁m₂m₃
↓ ② Aggregate: sum/mean/max (order-insensitive!)
aggregate = Σ mᵢ
↓ ③ Update: combine with own old state → new state
h_v new ← run K layers = see K-hop neighborhood

The key design: the aggregation function must be permutation-invariant — sum, mean, max all give the same result no matter the order neighbors arrive in. That precisely solves the "graphs have no order" problem. This is the mother mechanism of every GNN: GCN and GAT merely swap in different formulas for the "message" and "aggregate" steps.

Code example
# Pure numpy demo of one message-passing layer (no framework)
import numpy as np
# A: adjacency matrix (who links whom)  H: per-node feature vectors
A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]])
H = np.random.randn(4, 8)   # 4 nodes, 8 dims each
W = np.random.randn(8, 16)  # learnable transform

# ① Message + ② Aggregate: A @ H = "sum each node's neighbor features"
agg = A @ H                  # (4,4)@(4,8) = (4,8) summed neighbors
# ③ Update: linear transform + nonlinearity
H_next = np.maximum(0, agg @ W)  # ReLU, gives (4,16) new reps
# Stack 2 layers → each node fuses its 2-hop neighborhood
Common misconception + practical scenario
"More layers see farther, so more is better" — wrong. Stacking too many GNN layers causes over-smoothing: after too many gossip rounds, every node's state converges and they all end up looking identical, impossible to tell apart. The intuition matches "over-synchronization destroys information entropy" in distributed systems. In practice GNNs usually have just 2–4 layers, far shallower than a Transformer's dozens.
📌 Super-individual scenario: model the microservice call graph you maintain — nodes are services, edges are calls, node features are QPS/latency/error-rate. A GNN can learn "how a service's health is shaped by its up- and downstream," predicting cascading failures better than looking at each service's metrics in isolation.
Takeaway + question
💡 A GNN's entire magic is one sentence: nodes repeatedly exchange and aggregate information with neighbors. Understand gossip, and you understand GNNs.
🤔 Gossip protocols want "the whole cluster to converge to one state," yet GNNs fear "all nodes converging (over-smoothing)" — same mechanism, so why does one treat convergence as the goal and the other as a disaster?

Graph Convolutional NetworkGCN

spectral, simplifiedclassic GNN
One-line analogy

GCN's aggregation is a normalized neighbor average — like an AVG() over a one-to-many relationship in a database, plus one extra step: weighting by both parties' degrees. The intuition: if a neighbor connects to 1000 nodes, its message to you should be discounted (it says the same thing to everyone); a neighbor connected only to you sends a more "exclusive" message that deserves more weight.

The problem it solves + how it works

GCN (Kipf & Welling 2016) simplifies the heavy theory of "spectral graph convolution" all the way down to a startlingly clean formula. In the previous card we used A @ H for a naive sum, but that has two flaws: (1) it ignores the node itself; (2) the aggregate explodes for high-degree nodes. GCN's fix is one line:

H⁽ˡ⁺¹⁾ = σ( D̃ Ã D̃ H⁽ˡ⁾ W )

Don't let it scare you — unpack it piece by piece. à = A + I: the adjacency matrix plus the identity, i.e. "treat yourself as your own neighbor" (self-loop), so updates don't drop self-information. is the degree matrix (how many edges each node has). The block à D̃ is symmetric normalization: it divides each edge i→j by √(deg_i · deg_j), suppressing high-degree nodes' influence. W is learnable weights, σ the activation. In plain words: each node = activation( normalized weighted average of its neighbors (including itself), then a linear transform ).

Code example
# PyTorch Geometric: the most-used graph-learning library
import torch
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root="/tmp/Cora", name="Cora")  # citation graph
data = dataset[0]   # 2708 papers, edges = citations

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.c1 = GCNConv(dataset.num_features, 16)
        self.c2 = GCNConv(16, dataset.num_classes)
    def forward(self, x, edge_index):
        x = torch.relu(self.c1(x, edge_index))  # hop 1
        return self.c2(x, edge_index)            # hop 2 → classify
# Just 2 layers reach ~81% node-classification accuracy on Cora
Common misconception + practical scenario
"GCN learns the weight for each neighbor" — wrong. GCN's edge weights are fixed entirely by graph structure (degree); they don't change during training and don't distinguish neighbor importance. After normalization, all neighbors are averaged "equally." This limitation is exactly what the next card, GAT, sets out to solve.
📌 Decision-support scenario: build your knowledge-base notes into a citation graph (notes = nodes, cross-references = edges), and use a GCN for semi-supervised classification — hand-label just 5% of the notes' topics and the GCN will "diffuse" labels along the citation structure to the other 95%, auto-categorizing your whole knowledge base.
Takeaway + question
💡 GCN = a degree-normalized neighbor average with a built-in self-loop + a linear transform. One formula hiding the extreme simplification of an entire spectral theory.
🤔 GCN uses "degree" to set neighbor weights, essentially assuming "less-connected neighbors are more important." Does that hold in your microservice graph? When would it badly misjudge?

Graph Attention NetworkGAT

attention on graphsadaptive weights
One-line analogy

If GCN "averages all neighbors by a fixed rule," GAT equips each neighbor with a smart, weighted load balancer: the weight is no longer hard-coded by degree but computed dynamically from how relevant that neighbor is to me. This is exactly the attention mechanism from Day 1 — except this time the "sequence" is your set of neighbors.

The problem it solves + how it works

GCN's Achilles' heel: all neighbor weights are fixed by structure. But real neighbor importance varies wildly — in your social network, your 1 closest friend matters far more than 200 acquaintances who clicked "like." GAT (Veličković et al. 2017) brings the Transformer's attention onto graphs: for each neighbor j of node i, it first computes an attention coefficient αᵢⱼ ("how important j is to i"), then aggregates weighted by it.

Three steps: (1) concatenate i's and j's features, pass through a small network to get a raw score eᵢⱼ; (2) softmax-normalize the scores over all of i's neighbors (weights sum to 1, just like attention); (3) sum neighbor features weighted by αᵢⱼ. The biggest difference from GCN:

center node i aggregates 3 neighbors

GCN (weights fixed by degree):
j₁ ×0.33j₂ ×0.33j₃ ×0.33 ← all equal

GAT (weights learned, content-dependent):
j₁ ×0.7j₂ ×0.2j₃ ×0.1 ← α set by relevance
↑ same graph, GAT can "focus" on the truly relevant neighbor

Like the Transformer, GAT uses multi-head attention: run several independent attention sets in parallel and concatenate, stabilizing training and capturing multiple kinds of neighbor relationships. The cost is more compute than GCN — there's no free lunch for adaptivity.

Code example
from torch_geometric.nn import GATConv

# Swap the previous card's GCNConv straight for GATConv
# heads=8: 8 attention heads in parallel (cf. Transformer multi-head)
self.c1 = GATConv(in_dim, 8, heads=8)       # outputs 8×8=64 dims
self.c2 = GATConv(64, num_classes, heads=1)  # single head at the end

# forward is identical to GCN; all the difference is internal:
# GATConv learns an attention coefficient α per edge, not fixed by degree
# After training you can export α to see which neighbors it "focused" on
Common misconception + practical scenario
"GAT always beats GCN, just pick GAT" — wrong. On strongly homophilous graphs (neighbors mostly same-class) where structure already says enough, GCN is fast and great, and GAT's extra attention can even overfit. Attention earns its keep only when neighbors are heterogeneous with large importance gaps. Choose by data properties, not by which is newer.
📌 Interdisciplinary scenario: build the papers/books you've read into a citation-and-concept graph, train a GAT, then export the attention coefficients α to see "which cross-domain connections the model deems most critical" — a way to let the machine help surface the high-value bridges in your own knowledge network.
Takeaway + question
💡 GAT = GCN + attention: replace "fixed degree weights" with "learned, content-dependent attention weights," paying compute for expressiveness + interpretability.
🤔 GAT's attention coefficients make the model "interpretable" — you can see whom it attended to. But does "attention" equal "causal importance"? Is a high-attention neighbor necessarily one that truly drives the outcome? (Recall Day 36, causation vs correlation.)

Graph EmbeddingGraph Embedding

representation learningrandom walktransductive vs inductive
One-line analogy

Graph embedding computes a dense vector for each node so that "nodes close in the graph are also close in vector space" — essentially the same as Day 22's semantic search: map objects into a vector space and use distance for similarity. The only difference is that "similar" here comes from graph structure rather than text semantics. The resulting vectors can feed any downstream model (classifier, recommender, nearest-neighbor search).

The problem it solves + how it works

Sometimes you don't want to train an end-to-end GNN; you just want ready-made node vectors. node2vec (Grover & Leskovec 2016) is wonderfully elegant — it reuses word2vec directly: perform random walks on the graph to generate many "node sequences," treat the sequences as "sentences" and nodes as "words," and run word2vec — so nodes that frequently co-occur on walk paths get nearby vectors. It also uses two parameters p, q to steer walks toward BFS (breadth, capturing local community structure) or DFS (depth, capturing global roles) — directly analogous to a crawler's two traversal strategies.

But node2vec has a fundamental limitation called transductive: it looks up a vector for each node seen during training, so a new node leaves it stuck — you must retrain the whole thing. On dynamic graphs (new users daily) that's fatal. GraphSAGE (Hamilton et al. 2017) fixes it: instead of storing a vector per node, it learns an aggregation function (sample neighbors + aggregate — still message passing at heart). When a new node arrives, compute its vector on the fly from its neighbors — this is inductive.

transductive vs inductive — the key difference

node2vec = lookup table: store each old node's vector
  new node arrives not found, must retrain whole graph

GraphSAGE = learn a function: store "how to compute," not "the computed"
  new node arrives compute from neighbors instantly
Code example
from torch_geometric.nn import Node2Vec
import torch

# Random walks on the graph to learn each node's embedding
model = Node2Vec(
    data.edge_index, embedding_dim=128,
    walk_length=20,      # 20 steps per walk
    context_size=10,     # word2vec-style window
    p=1.0, q=1.0,        # small p → BFS(local), small q → DFS(global)
)
loader = model.loader(batch_size=128, shuffle=True)
# ... standard training loop ...
emb = model()           # (N, 128) one vector per node
# With the vectors: cosine similarity finds "structurally similar" nodes
Common misconception + practical scenario
"Embeddings are just a watered-down GNN; use a GNN whenever you can" — not quite. Methods like node2vec need no node features, running on pure structure, which makes them more practical in cold-start cases where you have only connections, no node attributes (e.g. a pure social-relationship graph). GNNs usually need node features to shine.
📌 Personal-project scenario: build a graph from your browser bookmarks / read-it-later by "visited in the same session" and "hyperlinked to each other," compute node2vec vectors, then cluster — auto-discovering the "hidden communities" of your reading interests, revealing the real structure of your thinking better than manual tags.
Takeaway + question
💡 Graph embedding translates "structurally similar" into "close in vector space." node2vec relies on random walks (transductive, lookup); GraphSAGE learns an aggregation function (inductive, handles new nodes) — pick the latter for dynamic graphs.
🤔 node2vec's p, q control BFS/DFS preference, which is really asking "is a node's identity defined by its neighbor community or by its structural role?" Which of these two similarities is closer to what you mean when you say "two people are alike"?

Further ReadingFurther Reading

Deep QuestionsDeep Questions

1. A Transformer is actually a special kind of graph neural network — can you say why?
Treat each token in a sentence as a node, and the Transformer's self-attention is equivalent to running GAT on a fully connected graph (every token linked to every token): the attention coefficients are GAT's αᵢⱼ, and multi-head is multi-head GAT. In other words, Transformer = an attention GNN on a fully connected graph. It drops the "sparse structure" prior (assuming any two tokens may be related) and recovers order via positional encoding (Day 13). Seen the other way, a GNN adds the strong structural constraint that "you can only interact with neighbors you're actually connected to in the graph." This equivalence is an important recent insight — it explains why Graph Transformers fuse the two schools: attention on sparse graphs. For BigCat: the attention you already understand transfers to graphs at near-zero cost — just swap "all positions in a sequence" for "neighbors in a graph."
2. Over-smoothing limits GNNs to shallow depth (2–4 layers), yet Transformers stack dozens — both "repeatedly aggregate information," so why the different fates?
The difference is the scope and manner of aggregation. A GNN layer aggregates only real neighbors, and aggregation is "mean/sum" — run more rounds and information spreads evenly like ink in water, node representations converge (mathematically toward some stationary/low-frequency component of the graph). A Transformer layer is fully connected + attention-weighted; attention can selectively not average (weights can be extremely uneven), and residual connections plus LayerNorm fight convergence hard. The fixes for GNN over-smoothing borrow exactly these: residual connections (e.g. DeepGCN), skip connections, and giving attention (GAT) the power to preserve differences. The deeper meaning: this is another instance of the "synchronization vs consistency vs information preservation" triangle in distributed systems — you want information to propagate, but not until everyone becomes one identical voice. Gossip treats convergence as the goal; a GNN treats "just-enough non-convergence" as an art.
3. GAT's attention coefficients are advertised as "interpretable" — but is that interpretability trustworthy?
Be wary. A high attention weight only means the model routed more information flow to that neighbor during the forward pass — a correlational description, not causal importance (echoing Day 36). Research has noted that attention weights are not robust as explanations: for the same prediction there can be very different attention distributions (explanations aren't unique), and perturbing high-attention inputs sometimes doesn't change the prediction. So the right stance is: treat α as "an observable signal of model behavior" for generating hypotheses, debugging, and visualization — and do not treat it as evidence that "this neighbor causally determined the outcome." To verify causation you must intervene (delete the edge and see if the prediction changes), not just read the weights. For anyone pursuing the "AI super-individual," this is key literacy: use explanation tools, but know what conclusions each kind of explanation can and cannot support.
4. Among your distributed-systems intuitions, which transfer directly to graph learning, and which will mislead you?
Transferable: (a) gossip/epidemic protocols → the multi-round convergence intuition of message passing; (b) consistent hashing/partitioning → graph sampling and distributed GNN training on huge graphs (GraphSAGE's neighbor sampling exists precisely to fit a single machine); (c) caching and hot/cold tiering → the lookup (transductive) vs compute-on-the-fly (inductive) trade-off in embeddings; (d) cascading failures → using GNNs to predict failure propagation. Misleading: (i) distributed systems pursue eventual consistency (global agreement), which GNNs fear (over-smoothing) — opposite goals; (ii) you're used to nodes being "homogeneous servers," but graph-learning nodes are highly heterogeneous, with neighbor importance varying enormously (exactly GAT's motivation); (iii) distributed-systems edges are "communication links," roughly semantics-free, while graph-learning edges often carry their own semantics and weights (friend vs colleague vs stranger) — ignoring edge types loses huge information (heterogeneous/relational GNNs handle this). Transferring intuition is powerful, but check each premise — which is itself interdisciplinary-thinking practice.
5. If "graphs" are the mathematics of relationships, what would building your life/knowledge into a graph and analyzing it with a GNN reveal — and what would it lose?
It would reveal structural insights: which concepts in your knowledge network are "high-centrality" hubs (delete one and whole understandings break), which are islands (learned but not wired into the system), which cross-domain bridges are scarcest yet most valuable (the links among Buddhism ↔ distributed systems ↔ complexity science). These are invisible from a mere "node list" (your reading list) — the value is in the edges, not the points, which is the heart of graph thinking. What it loses: (a) time and causality — a static graph erases the order and causation of "learn A before you can grasp B" (needs temporal/causal graphs, Days 35/36); (b) semantic richness — an edge is just "related," but "inspired me" and "refuted me" are completely different relations, flattened into one edge; (c) the subject's inner experience — a graph can model the structure of your knowledge but not "that flash of insight when a concept clicked." The deeper lesson: a graph is a powerful abstraction of relationships, but abstraction is simplification. For anyone pursuing human–AI collaboration, the right stance is to use graphs (and GNNs) to extend your ability to see structure, while soberly guarding the parts that cannot be graphed — which are often the very core of "what makes you you."