Almost every model in the past 36 days assumed the input was a sequence (text) or a grid (image). But much of the data you know best is actually a graph: service-call topologies, social relationships, knowledge graphs, database foreign-key networks. The core information in this data lives in the connections, not in any single node. Today is about how machine learning directly digests a "graph." Its core mechanism is just one thing: let nodes exchange information with their neighbors.
Message passing is the gossip protocol from distributed systems: every round, each node sends its state to all neighbors, collects the states they send back, aggregates them, and updates itself. Run K rounds and each node "knows" the entire neighborhood within K hops — the exact same playbook as a gossip protocol converging the whole cluster's state through repeated rounds.
Graph data has a brutal property: no fixed order, and neighbor counts vary. Text lines up as a sequence for a Transformer, images as a grid for a CNN — but a graph? Node A has 3 neighbors, node B has 500, and you can't pack that into a fixed-shape tensor. Early approaches hand-crafted graph features (degree, centrality…), which was laborious and lossy.
The Message Passing Neural Network (MPNN, Gilmer et al. 2017) unifies all graph neural networks into one framework. Each layer does three things —
The key design: the aggregation function must be permutation-invariant — sum, mean, max all give the same result no matter the order neighbors arrive in. That precisely solves the "graphs have no order" problem. This is the mother mechanism of every GNN: GCN and GAT merely swap in different formulas for the "message" and "aggregate" steps.
# Pure numpy demo of one message-passing layer (no framework) import numpy as np # A: adjacency matrix (who links whom) H: per-node feature vectors A = np.array([[0,1,1,0],[1,0,1,0],[1,1,0,1],[0,0,1,0]]) H = np.random.randn(4, 8) # 4 nodes, 8 dims each W = np.random.randn(8, 16) # learnable transform # ① Message + ② Aggregate: A @ H = "sum each node's neighbor features" agg = A @ H # (4,4)@(4,8) = (4,8) summed neighbors # ③ Update: linear transform + nonlinearity H_next = np.maximum(0, agg @ W) # ReLU, gives (4,16) new reps # Stack 2 layers → each node fuses its 2-hop neighborhood
GCN's aggregation is a normalized neighbor average — like an AVG() over a one-to-many relationship in a database, plus one extra step: weighting by both parties' degrees. The intuition: if a neighbor connects to 1000 nodes, its message to you should be discounted (it says the same thing to everyone); a neighbor connected only to you sends a more "exclusive" message that deserves more weight.
GCN (Kipf & Welling 2016) simplifies the heavy theory of "spectral graph convolution" all the way down to a startlingly clean formula. In the previous card we used A @ H for a naive sum, but that has two flaws: (1) it ignores the node itself; (2) the aggregate explodes for high-degree nodes. GCN's fix is one line:
H⁽ˡ⁺¹⁾ = σ( D̃-½ Ã D̃-½ H⁽ˡ⁾ W )
Don't let it scare you — unpack it piece by piece. Ã = A + I: the adjacency matrix plus the identity, i.e. "treat yourself as your own neighbor" (self-loop), so updates don't drop self-information. D̃ is the degree matrix (how many edges each node has). The block D̃-½ Ã D̃-½ is symmetric normalization: it divides each edge i→j by √(deg_i · deg_j), suppressing high-degree nodes' influence. W is learnable weights, σ the activation. In plain words: each node = activation( normalized weighted average of its neighbors (including itself), then a linear transform ).
# PyTorch Geometric: the most-used graph-learning library import torch from torch_geometric.nn import GCNConv from torch_geometric.datasets import Planetoid dataset = Planetoid(root="/tmp/Cora", name="Cora") # citation graph data = dataset[0] # 2708 papers, edges = citations class GCN(torch.nn.Module): def __init__(self): super().__init__() self.c1 = GCNConv(dataset.num_features, 16) self.c2 = GCNConv(16, dataset.num_classes) def forward(self, x, edge_index): x = torch.relu(self.c1(x, edge_index)) # hop 1 return self.c2(x, edge_index) # hop 2 → classify # Just 2 layers reach ~81% node-classification accuracy on Cora
If GCN "averages all neighbors by a fixed rule," GAT equips each neighbor with a smart, weighted load balancer: the weight is no longer hard-coded by degree but computed dynamically from how relevant that neighbor is to me. This is exactly the attention mechanism from Day 1 — except this time the "sequence" is your set of neighbors.
GCN's Achilles' heel: all neighbor weights are fixed by structure. But real neighbor importance varies wildly — in your social network, your 1 closest friend matters far more than 200 acquaintances who clicked "like." GAT (Veličković et al. 2017) brings the Transformer's attention onto graphs: for each neighbor j of node i, it first computes an attention coefficient αᵢⱼ ("how important j is to i"), then aggregates weighted by it.
Three steps: (1) concatenate i's and j's features, pass through a small network to get a raw score eᵢⱼ; (2) softmax-normalize the scores over all of i's neighbors (weights sum to 1, just like attention); (3) sum neighbor features weighted by αᵢⱼ. The biggest difference from GCN:
Like the Transformer, GAT uses multi-head attention: run several independent attention sets in parallel and concatenate, stabilizing training and capturing multiple kinds of neighbor relationships. The cost is more compute than GCN — there's no free lunch for adaptivity.
from torch_geometric.nn import GATConv # Swap the previous card's GCNConv straight for GATConv # heads=8: 8 attention heads in parallel (cf. Transformer multi-head) self.c1 = GATConv(in_dim, 8, heads=8) # outputs 8×8=64 dims self.c2 = GATConv(64, num_classes, heads=1) # single head at the end # forward is identical to GCN; all the difference is internal: # GATConv learns an attention coefficient α per edge, not fixed by degree # After training you can export α to see which neighbors it "focused" on
Graph embedding computes a dense vector for each node so that "nodes close in the graph are also close in vector space" — essentially the same as Day 22's semantic search: map objects into a vector space and use distance for similarity. The only difference is that "similar" here comes from graph structure rather than text semantics. The resulting vectors can feed any downstream model (classifier, recommender, nearest-neighbor search).
Sometimes you don't want to train an end-to-end GNN; you just want ready-made node vectors. node2vec (Grover & Leskovec 2016) is wonderfully elegant — it reuses word2vec directly: perform random walks on the graph to generate many "node sequences," treat the sequences as "sentences" and nodes as "words," and run word2vec — so nodes that frequently co-occur on walk paths get nearby vectors. It also uses two parameters p, q to steer walks toward BFS (breadth, capturing local community structure) or DFS (depth, capturing global roles) — directly analogous to a crawler's two traversal strategies.
But node2vec has a fundamental limitation called transductive: it looks up a vector for each node seen during training, so a new node leaves it stuck — you must retrain the whole thing. On dynamic graphs (new users daily) that's fatal. GraphSAGE (Hamilton et al. 2017) fixes it: instead of storing a vector per node, it learns an aggregation function (sample neighbors + aggregate — still message passing at heart). When a new node arrives, compute its vector on the fly from its neighbors — this is inductive.
from torch_geometric.nn import Node2Vec import torch # Random walks on the graph to learn each node's embedding model = Node2Vec( data.edge_index, embedding_dim=128, walk_length=20, # 20 steps per walk context_size=10, # word2vec-style window p=1.0, q=1.0, # small p → BFS(local), small q → DFS(global) ) loader = model.loader(batch_size=128, shuffle=True) # ... standard training loop ... emb = model() # (N, 128) one vector per node # With the vectors: cosine similarity finds "structurally similar" nodes