"Few-shot" is a problem: given only a handful of examples, how do you learn a new task? Today's four concepts are four progressively deeper answers—from "reuse old weights" (transfer learning), to "learn a good starting point" (MAML), to "don't update weights at all, just look up by distance" (metric learning), and finally to the most magical one in LLMs: "learning happens right inside the prompt, without even leaving the forward pass" (in-context learning). The core tension is one sentence: when data is too scarce for gradient descent, at which layer should "learning" happen?
Transfer learning means you don't write a service from scratch; you fork a working base image and change only the top business-logic layer. The pretrained model is a base image with the connection pool pre-warmed and common dependencies installed; your small dataset is the bit of customization on top. The low-level general abilities (edge detection, grammar) already exist; you only fill in the "last mile".
Pain point: training a deep network from scratch needs a huge labeled dataset plus massive compute. But many tasks share low-level features: recognizing cats and recognizing dogs both start from edges, textures, and shapes. Re-learning these for every task is wasted effort.
Mechanism: take a network pretrained on a large dataset, freeze most of the early layers (they already learned general features), and retrain only the last layer or two to fit your new task. Neural nets are naturally hierarchical: shallow layers learn generic low-level features (edges, lexical patterns), deep layers learn task-specific high-level ones. So "freeze the bottom, fine-tune the top" saves both data and compute. This is the shared bedrock of the next three concepts—without a good pretrained representation, few-shot learning is a non-starter.
```python
import torch
import torch.nn as nn
from torchvision import models

# 1) Load a ResNet pretrained on ImageNet (the base image)
net = models.resnet50(weights="IMAGENET1K_V2")

# 2) Freeze all layers: generic features stop updating
for p in net.parameters():
    p.requires_grad = False

# 3) Swap the final head for your task (e.g. 5 classes); only this trains
net.fc = nn.Linear(net.fc.in_features, 5)

# 4) Optimizer collects only requires_grad=True params (the new head)
opt = torch.optim.Adam(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3)
# → a few hundred images train a decent classifier, not millions
```
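To watch "freeze the bottom, fine-tune the top" run without downloading any weights, here is a dependency-free toy sketch. Everything in it is an assumption made for illustration: `backbone` is a hypothetical stand-in for a frozen pretrained feature extractor, and the four data points, learning rate, and epoch count are made up. Only the head's two weights and bias are ever updated.

```python
import math

def backbone(x):
    # Hypothetical stand-in for a frozen pretrained feature extractor:
    # a fixed function whose "weights" are never updated.
    return [x[0] + x[1], x[0] - x[1]]

def predict(head, feats):
    # Logistic-regression head on top of the frozen features.
    z = head["w"][0] * feats[0] + head["w"][1] * feats[1] + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.1, epochs=500):
    # Only the head is trained; backbone() is called but never modified.
    head = {"w": [0.0, 0.0], "b": 0.0}
    for _ in range(epochs):
        for x, y in data:
            feats = backbone(x)
            err = predict(head, feats) - y  # gradient of log-loss w.r.t. z
            head["w"][0] -= lr * err * feats[0]
            head["w"][1] -= lr * err * feats[1]
            head["b"] -= lr * err
    return head

# Tiny made-up dataset: label 1 when x[0] > x[1], else 0.
data = [((2.0, 1.0), 1), ((3.0, 0.5), 1), ((1.0, 2.0), 0), ((0.5, 3.0), 0)]
head = train_head(data)
preds = [round(predict(head, backbone(x))) for x, _ in data]
print(preds)
```

Four examples are enough here only because the frozen features already make the classes linearly separable, which is the whole point of reusing a good representation.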
Ordinary training is "find the optimal solution for one specific task"; MAML is "find the best starting point, so that the model adapts to a new task from the same family in just a few steps." Analogy: instead of hand-building an environment for each new project, you carefully configure a golden base image; any new project clones it and runs after tweaking three to five config lines. MAML optimizes not the endpoint, but the starting point.
Pain point: transfer learning reuses "features," but the new task still has to be trained. Can we make the model learn "how to learn a new task" itself? That's meta-learning ("learning to learn"). MAML (Model-Agnostic Meta-Learning, Finn et al. 2017) gives an elegant answer.
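The mechanism is bi-level optimization, two nested loops: an inner loop that takes one (or a few) gradient steps on each task's support set to get task-adapted parameters θ', and an outer loop that updates the shared starting point θ using the loss of those adapted parameters on each task's query set.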
One intuition-first formula for the outer update (every symbol explained):
$$\theta \leftarrow \theta - \beta \, \nabla_\theta \sum_{\text{task } i} L_i\big(\theta - \alpha \, \nabla_\theta L_i(\theta)\big)$$
The inner bracket θ − α∇L_i(θ) is the θ' you get from "taking one step on task i" (α is the inner learning rate); the outer term differentiates the loss at θ' with respect to the original θ to update the starting point (β is the outer learning rate). In practice the inner loss is computed on the task's support set and the outer loss on its query set. The key counterintuitive point: a "gradient of a gradient" appears here, because you are differentiating "how good things will be after one gradient step." MAML isn't finding params that are "good now"; it's finding params that are "one nudge away from good."
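To make the "gradient of a gradient" concrete, the outer gradient for a single task can be expanded with the chain rule. Here θ'_i is shorthand introduced for the adapted parameters of task i, and ∇² denotes the Hessian:

$$
\begin{aligned}
\theta_i' &= \theta - \alpha \, \nabla_\theta L_i(\theta) \\
\nabla_\theta L_i(\theta_i') &= \big(I - \alpha \, \nabla_\theta^2 L_i(\theta)\big)\, \nabla_{\theta_i'} L_i(\theta_i')
\end{aligned}
$$

The Hessian factor is the second-order part. Replacing that factor with the identity gives the cheaper first-order approximation that Finn et al. also evaluate.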
```python
import torch

# Skeleton of one MAML outer step (pseudo-PyTorch, data loading omitted).
# Assumed helpers, not defined here: forward_loss(params, batch) returns a
# scalar loss; task.support() / task.query() return the two batches.
# theta is a list of tensors with requires_grad=True.
def maml_step(theta, tasks, alpha=0.01, beta=0.001):
    meta_loss = 0
    for task in tasks:
        sup, qry = task.support(), task.query()  # support / query set
        # Inner loop: one step on support → adapted θ'
        loss = forward_loss(theta, sup)
        # create_graph=True lets us differentiate "loss after one
        # gradient step" (gradient of a gradient)
        grad = torch.autograd.grad(loss, theta, create_graph=True)
        theta_prime = [w - alpha * g for w, g in zip(theta, grad)]
        # Add θ' 's loss on query to meta_loss (note: query, not support!)
        meta_loss += forward_loss(theta_prime, qry)
    # Outer loop: second-order gradient w.r.t. the starting point θ
    meta_grad = torch.autograd.grad(meta_loss, theta)
    # Detach so the returned params are fresh leaves for the next call
    return [(w - beta * g).detach().requires_grad_()
            for w, g in zip(theta, meta_grad)]
```
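For a version that runs end to end, here is a one-parameter toy sketch with no dependencies. It assumes each task is a made-up quadratic loss L_i(θ) = ½(θ − t_i)², so the second derivative is exactly 1 and the second-order meta-gradient can be written by hand; all function names and numbers are hypothetical. A finite-difference check confirms the hand-written meta-gradient.

```python
def loss(theta, target):
    # Toy task loss: L_i(theta) = 0.5 * (theta - t_i)^2
    return 0.5 * (theta - target) ** 2

def grad(theta, target):
    # dL_i/dtheta for the quadratic above
    return theta - target

def adapt(theta, target, alpha=0.1):
    # Inner loop: one gradient step on the task
    return theta - alpha * grad(theta, target)

def meta_loss(theta, targets, alpha=0.1):
    # Sum over tasks of the loss AFTER one inner step
    return sum(loss(adapt(theta, t, alpha), t) for t in targets)

def meta_grad(theta, targets, alpha=0.1):
    # Chain rule: d(theta')/d(theta) = 1 - alpha * L''(theta), and L'' = 1 here
    return sum((1 - alpha * 1.0) * grad(adapt(theta, t, alpha), t)
               for t in targets)

def maml(theta, targets, alpha=0.1, beta=0.05, steps=200):
    # Outer loop: move the starting point along the meta-gradient
    for _ in range(steps):
        theta -= beta * meta_grad(theta, targets, alpha)
    return theta

targets = [-1.0, 2.0, 5.0]  # three made-up tasks

# Finite-difference check of the second-order meta-gradient at theta = 0.5
eps = 1e-5
fd = (meta_loss(0.5 + eps, targets) - meta_loss(0.5 - eps, targets)) / (2 * eps)
mg = meta_grad(0.5, targets)

theta_star = maml(0.0, targets)
print(fd, mg, theta_star)
```

With symmetric quadratic tasks the learned starting point lands at the mean of the targets, the place from which every task is equally "one nudge away."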
The first two concepts still "train/update weights." Prototypical networks switch approach: when a new task comes, don't train at all. Compute one "centroid vector" per class and store it; for a new sample, find which centroid it's closest to, and that's the answer. This is essentially the vector search you already know, a nearest-centroid variant of nearest-neighbor lookup, only the embedding space is learned. "Classification" degenerates into "lookup."
Pain point: MAML still runs an inner-loop gradient descent per new task—too heavy at inference time. Can adapting to new classes involve no gradients at all? Prototypical Networks (Snell et al. 2017) answer: turn classification into a geometry problem.
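Mechanism (the standard N-way K-shot setup, with N new classes and K samples each): embed every support sample with the encoder f, average the K embeddings of each class into a prototype c_k, then embed the query and assign it to the class whose prototype is nearest.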
Turn distance into probability with softmax (every symbol explained):
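$$p(y = k \mid x) = \mathrm{softmax}_k\big(-d(f(x),\, c_k)\big) = \frac{\exp\big(-d(f(x),\, c_k)\big)}{\sum_{k'} \exp\big(-d(f(x),\, c_{k'})\big)}$$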
d is the distance (Snell et al. use squared Euclidean), so closer → larger −d → higher probability. The whole "adaptation" is "compute a few average vectors": zero gradient updates. What the model truly learns is the encoder f, trained so that same-class samples cluster and different classes spread apart in the embedding space. Once the space is good, classifying new classes is just querying the nearest centroid. This is the same vector geometry as "retrieve the most relevant docs" in RAG.
```python
import torch
import torch.nn.functional as F

# support: [N, K, D] encoded vectors; query: [Q, D]
def proto_classify(support, query):
    # 1) Each class prototype = mean of its K support vectors → [N, D]
    prototypes = support.mean(dim=1)
    # 2) Squared Euclidean distance (as in Snell et al.) from each
    #    query to each prototype → [Q, N]
    dist = torch.cdist(query, prototypes).pow(2)
    # 3) Negative distance through softmax = class probs (closer = higher)
    return F.softmax(-dist, dim=1)

# At inference: a brand-new 5-way task → just compute prototypes to classify.
# No backprop needed: "adaptation" is a mean() over the support set.
# Placeholder embeddings (random here; in practice they come from encoder f)
support_emb = torch.randn(5, 3, 64)  # 5-way, 3-shot, 64-dim
query_emb = torch.randn(10, 64)      # 10 query samples
probs = proto_classify(support_emb, query_emb)
pred = probs.argmax(dim=1)
```
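A hand-checkable instance of the same computation may help. The sketch below is dependency-free and assumes the vectors are already embedded (the encoder f is skipped entirely); the 2-way 2-shot data and the helper names are made up, and the distance is squared Euclidean as in Snell et al.

```python
import math

def prototype(vectors):
    # Class prototype = coordinate-wise mean of the support embeddings
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def sq_dist(a, b):
    # Squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(support, query):
    # support: {label: [embedding, ...]}, query: one embedding
    protos = {label: prototype(vs) for label, vs in support.items()}
    logits = {label: -sq_dist(query, c) for label, c in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(z - m) for label, z in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Made-up 2-way 2-shot task in a 2-D embedding space
support = {
    "cat": [[0.0, 0.0], [2.0, 0.0]],  # prototype (1, 0)
    "dog": [[4.0, 4.0], [6.0, 4.0]],  # prototype (5, 4)
}
probs = classify(support, [1.0, 1.0])
print(probs)
```

The query (1, 1) sits at squared distance 1 from the cat prototype and 25 from the dog prototype, so nearly all of the probability mass goes to "cat".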
The first three concepts at least "computed something" (trained weights, computed prototypes). ICL is the most magical: you stuff a few examples into the prompt, not a single byte of model weights changes, and it "learns" the new task. Analogy: a database prepared statement—the same engine, no recompile, instantly generates an execution plan from the few parameters you pass. "Learning" happens inside a single forward pass and evaporates when inference ends.
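Phenomenon: the GPT-3 paper (Brown et al. 2020, "Language Models are Few-Shot Learners") reported something strange: without any fine-tuning, a large model can do a new task when given just a few "input→output" examples in the prompt. This is in-context learning, the principle behind few-shot prompting. But "why does learning happen without updating weights" remains an open question, with two mainstream explanations: implicit Bayesian inference (the examples help the model infer which latent task, already acquired during pretraining, it is being asked to perform) and implicit gradient descent (the attention layers behave as if they ran an optimization step over the in-context examples inside the forward pass).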
A counterintuitive finding (Min et al. 2022): replacing the labels in the few-shot examples with random ones barely hurts performance on the classification and multiple-choice tasks they tested. This suggests the examples' core role isn't providing "correct answers," but conveying the task's "format, label space, input distribution"; that is, they help the model "locate" which ability to activate, echoing the Bayesian explanation.
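To see what such a comparison holds fixed, here is a small sketch of how the two prompts could be built. It calls no model, and the helper names `build_prompt` and `randomize_labels` are hypothetical; it only illustrates that the random-label prompt keeps the same inputs, format, and label space while breaking the input-label pairing.

```python
import random

def build_prompt(examples, query):
    # One "Review: <text> → <label>" line per example, then the open query
    lines = [f"Review: {text} → {label}" for text, label in examples]
    lines.append(f"Review: {query} → ")
    return "\n".join(lines)

def randomize_labels(examples, label_space, seed=0):
    # Keep every input, but draw each label at random from the label space
    rng = random.Random(seed)
    return [(text, rng.choice(label_space)) for text, _ in examples]

label_space = ["negative", "positive", "neutral"]
examples = [
    ("shipping was too slow", "negative"),
    ("quality exceeded expectations", "positive"),
    ("packaging was so-so", "neutral"),
]
query = "support was very patient"

gold_prompt = build_prompt(examples, query)
random_examples = randomize_labels(examples, label_space)
random_prompt = build_prompt(random_examples, query)
print(random_prompt)
```

Sending both prompts to the same model and comparing accuracy over many queries is the shape of the experiment; the sketch stops before that step.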
| Paradigm | Update weights? | "Adaptation" happens at |
|---|---|---|
| Transfer Learning | Yes (top layers) | fine-tuning stage |
| MAML | Yes (inner few steps) | explicit gradient descent |
| Prototypical Net | No | compute a few mean vectors |
| ICL | No (never changes) | inside one forward pass |
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY env var

# few-shot ICL: 3 examples in the prompt, weights untouched
shots = (
    "Review: shipping was too slow → negative\n"
    "Review: quality exceeded expectations → positive\n"
    "Review: packaging was so-so → neutral\n")

resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=16,
    messages=[{"role": "user",
               "content": shots + "Review: support was very patient → "}])
print(resp.content[0].text)  # expected: positive (model output may vary)

# From 3 examples the model "reads off" task format and label space,
# classifying with no training at all: that's in-context learning
```