AI/ML Deep Dive: Meta-Learning & Few-Shot

Day 39 · 2026-06-25 · Difficulty ★★★☆☆
For: engineers with coding experience, not from an AI background
Engineering counterpart → super-individual D1: Prompt Engineering (how to pick & order few-shot examples)
Today's Thread

"Few-shot" is a problem: given only a handful of examples, how do you learn a new task? Today's four concepts are four progressively deeper answers—from "reuse old weights" (transfer learning), to "learn a good starting point" (MAML), to "don't update weights at all, just look up by distance" (metric learning), and finally to the most magical one in LLMs: "learning happens right inside the prompt, without even leaving the forward pass" (in-context learning). The core tension is one sentence: when data is too scarce for gradient descent, at which layer should "learning" happen?

Transfer Learning

Base Paradigm · Feature Reuse
One-line Analogy

Transfer learning is not writing a service from scratch, but forking a working base image and only changing the top business-logic layer. The pretrained model = a base image with the connection pool pre-warmed and common dependencies installed; your small dataset = that bit of customization. The low-level general abilities (edge detection, grammar) already exist; you only fill in the "last mile".

Problem It Solves + How It Works

Pain point: training a deep network from scratch needs huge labeled data + massive compute. But many tasks share low-level features—recognizing cats and dogs both first require edges, textures, shapes. Re-learning these is waste.

Mechanism: take a network pretrained on a large dataset, freeze most of the early layers (they already learned general features), and retrain only the last layer or two to fit your new task. Neural nets are naturally hierarchical: shallow layers learn generic low-level features (edges, lexical patterns), deep layers learn task-specific high-level ones. So "freeze the bottom, fine-tune the top" saves both data and compute. This is the shared bedrock of the next three concepts—without a good pretrained representation, few-shot learning is a non-starter.

Code Example
import torch, torch.nn as nn
from torchvision import models

# 1) Load a ResNet pretrained on ImageNet (base image)
net = models.resnet50(weights="IMAGENET1K_V2")

# 2) Freeze all layers — generic features stop updating
for p in net.parameters():
    p.requires_grad = False

# 3) Swap the final head for your task (e.g. 5 classes) — only this trains
net.fc = nn.Linear(net.fc.in_features, 5)

# 4) Optimizer collects only requires_grad=True params (the new head)
opt = torch.optim.Adam(
    (p for p in net.parameters() if p.requires_grad), lr=1e-3)
# → a few hundred images train a decent classifier, not millions
Common Pitfall + Practical Scenario
Pitfall: "freezing more is always better; only ever train the last layer." Wrong. When your new task differs greatly from the pretraining data (medical scans vs natural photos), even the deep high-level features may no longer apply, and you should unfreeze more layers and fine-tune them with a smaller learning rate, up to full fine-tuning. Rule of thumb: tiny data → train only the head; moderate data, close domain → unfreeze a few layers; more data, distant domain → fine-tune the whole net.
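The "unfreeze a few layers + smaller learning rate" middle ground can be sketched with optimizer parameter groups. This is a minimal sketch, not a recipe: the toy `backbone` stands in for a pretrained network so it runs without downloading weights, and the layer split and both learning rates (1e-5 and 1e-3) are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pretrained network: a small backbone plus a new head.
backbone = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),   # "early" layers: generic features
    nn.Linear(64, 64), nn.ReLU(),   # "late" layers: more task-specific
)
head = nn.Linear(64, 5)             # fresh head for a 5-class task

# Freeze the whole backbone first
for p in backbone.parameters():
    p.requires_grad = False

# Then unfreeze only the last Linear layer of the backbone (index 2)
late_block = backbone[2]
for p in late_block.parameters():
    p.requires_grad = True

# Two parameter groups: small LR for unfrozen pretrained layers,
# larger LR for the freshly initialised head (both values are assumptions)
opt = torch.optim.Adam([
    {"params": late_block.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

# Count how many parameters will actually receive updates
trainable = sum(p.numel() for m in (backbone, head)
                for p in m.parameters() if p.requires_grad)
```

The small rate on the pretrained layers keeps them from drifting far from what they already learned, while the head, which starts from random weights, can move quickly.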
📌 Super-individual scenario: you want a little tool to "auto-classify screenshots of your personal projects" but only have 200 labeled images. Don't train from scratch—freeze a pretrained ViT/ResNet, train only the head, and you can ship in an afternoon. Transfer learning is the entry ticket to deep learning with small data.
Takeaway + Reflection
💡 The essence of transfer learning isn't "learn faster," it's "reuse expensive general representations as shared infrastructure"—the same engineering wisdom as reusing open-source libraries or base images.
🤔 If shallow features are "generic," why can pretrained models across modalities (image vs text) barely transfer to each other? How fine-grained must "generic" be to be truly generic?

Meta-Learning / MAML

Learning to Learn · Bi-level Optimization · 2017
One-line Analogy

Ordinary training is "find the optimal solution for one specific task"; MAML is "find the best 'starting point' so it adapts to any new task in just a few steps." Analogy: instead of hand-building an environment for each new project, you carefully configure a golden base image—any new project clones it and runs after tweaking three or five config lines. MAML optimizes not the endpoint, but the starting point.

Problem It Solves + How It Works

Pain point: transfer learning reuses "features," but the new task still has to be trained. Can we make the model learn "how to learn a new task" itself? That's meta-learning ("learning to learn"). MAML (Model-Agnostic Meta-Learning, Finn et al. 2017) gives an elegant answer.

The mechanism is bi-level optimization, two nested loops:

  • Inner loop: starting from the current init params θ, take one or a few gradient steps on a single task's handful of support samples to get task-specific params θ'. This simulates "fast adaptation."
  • Outer loop (meta-update): evaluate θ' on held-out query samples from the same task, and use that loss to update the original θ. Note that what gets updated is the starting point θ, not θ'.

One intuition-first formula for the outer update (every symbol explained):

$$\theta \leftarrow \theta - \beta \, \nabla_\theta \sum_{\text{task } i} L_i\big(\theta - \alpha \nabla_\theta L_i(\theta)\big)$$

The inner bracket θ − α∇Li(θ) is the θ' you get from "taking one step on task i" (α is the inner learning rate); the outer term differentiates this adapted loss again to update the starting point θ (β is the outer learning rate). The key counterintuitive point: a "gradient of a gradient" appears here—you're differentiating "how good things will be after one gradient step." MAML isn't finding params that are "good now," it's finding params that are "one nudge away from good."
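To make the "gradient of a gradient" concrete, here is a minimal pure-Python sketch on a toy family of 1-D quadratic tasks, L_i(θ) = ½(θ − c_i)². The task centres, the value of α, and all function names are illustrative assumptions, not taken from the MAML paper. For this loss the meta-gradient has a closed form, and the code checks it against a finite-difference estimate.

```python
# Toy tasks: L_i(theta) = 0.5 * (theta - c_i)^2, so grad L_i(theta) = theta - c_i.

def loss(theta, c):
    return 0.5 * (theta - c) ** 2

def grad(theta, c):
    return theta - c

def adapted_loss(theta, c, alpha):
    """Loss on task c after one inner gradient step from theta."""
    theta_prime = theta - alpha * grad(theta, c)
    return loss(theta_prime, c)

def maml_grad(theta, c, alpha):
    """Exact meta-gradient d/dtheta of L(theta - alpha * grad L(theta)).

    Chain rule: grad L(theta') * d theta'/d theta = (theta' - c) * (1 - alpha).
    The (1 - alpha) factor comes from differentiating through the inner step.
    """
    theta_prime = theta - alpha * grad(theta, c)
    return grad(theta_prime, c) * (1.0 - alpha)

def first_order_grad(theta, c, alpha):
    """Shortcut that ignores the inner step's dependence on theta."""
    theta_prime = theta - alpha * grad(theta, c)
    return grad(theta_prime, c)

theta, alpha, centres = 0.0, 0.1, [-1.0, 2.0, 4.0]
exact = sum(maml_grad(theta, c, alpha) for c in centres)
first_order = sum(first_order_grad(theta, c, alpha) for c in centres)

# Finite-difference check of the exact meta-gradient
eps = 1e-6
numeric = sum(
    (adapted_loss(theta + eps, c, alpha) - adapted_loss(theta - eps, c, alpha)) / (2 * eps)
    for c in centres
)
print(exact, first_order, numeric)
```

In this toy case the second-order term only rescales the gradient by (1 − α). In a real network it is a Hessian-vector product, which is where the compute and memory cost comes from.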

MAML bi-level structure (meta vs inner)
init θ ──inner α, few steps──▶ θ'₁ (task 1)
       ├──────────────────────▶ θ'₂ (task 2)
       └──────────────────────▶ θ'ₙ (task n)

outer β: use each θ' 's post-adaptation performance to revise θ → make θ a better starting point
Code Example
# Skeleton of one MAML outer step (pseudo-PyTorch, data loading omitted).
# Assumed: theta is a list of leaf tensors with requires_grad=True;
# forward_loss(params, batch), task.support() and task.query() are placeholders.
import torch

def maml_step(theta, tasks, alpha=0.01, beta=0.001):
    meta_loss = 0.0
    for task in tasks:
        sup, qry = task.support(), task.query()   # support / query set
        # Inner loop: one step on support → adapted θ'
        loss = forward_loss(theta, sup)
        grad = torch.autograd.grad(loss, theta, create_graph=True)
        theta_prime = [w - alpha * g for w, g in zip(theta, grad)]
        # Add θ' 's loss on query to meta_loss (note: query!)
        meta_loss = meta_loss + forward_loss(theta_prime, qry)
    # Outer loop: second-order gradient w.r.t. the starting point θ
    meta_grad = torch.autograd.grad(meta_loss, theta)
    # Detach so the new θ are fresh leaf tensors, not tied to this step's graph
    return [(w - beta * g).detach().requires_grad_(True)
            for w, g in zip(theta, meta_grad)]
# create_graph=True lets us differentiate "loss after one gradient step" (gradient of a gradient)
Common Pitfall + Practical Scenario
Pitfall: "MAML is just like transfer learning—both give a good init." Looks alike, fundamentally different. Transfer learning's init is a byproduct of the pretraining task (happens to be generic); MAML's init is explicitly optimized for 'fast adaptation'—during training it already simulates the whole "here's a new task, take a few steps" process. Cost: second-order gradients are compute- and memory-heavy, so practice often uses first-order approximations (FOMAML / Reptile).
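For a feel of what a first-order alternative looks like, here is a minimal pure-Python sketch of a Reptile-style update on toy 1-D quadratic tasks, L_i(θ) = ½(θ − c_i)². The task centres, step sizes, and function names are illustrative assumptions. The update just moves θ toward the adapted parameters, so no gradient is ever taken through the inner loop.

```python
def grad(theta, c):
    # Gradient of the toy task loss 0.5 * (theta - c)^2
    return theta - c

def inner_adapt(theta, c, alpha, k):
    """Run k plain gradient steps on one task, starting from theta."""
    phi = theta
    for _ in range(k):
        phi -= alpha * grad(phi, c)
    return phi

def reptile_step(theta, centres, alpha=0.1, k=3, eps=0.5):
    """Move theta toward the average adapted parameters (first-order only)."""
    deltas = [inner_adapt(theta, c, alpha, k) - theta for c in centres]
    return theta + eps * sum(deltas) / len(deltas)

centres = [-1.0, 2.0, 4.0]
theta = 0.0
for _ in range(200):
    theta = reptile_step(theta, centres)
print(theta)  # settles at the mean of the task centres for this toy family
```

For this symmetric toy family the fixed point is simply the mean of the task centres, which matches the intuition that much of the benefit can come from sitting "in the middle" of the tasks.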
📌 Cross-disciplinary scenario: MAML's "optimize the start, not the end" is a transferable metacognitive model. When raising a child, or learning a new field yourself, rather than cramming one specific skill (endpoint), polish "the meta-ability to quickly enter any new field" (starting point)—good questioning, analogy-transfer habits. This is exactly the underlying algorithm of a "super-individual."
Takeaway + Reflection
💡 MAML raises "learning" one abstraction level: ordinary training optimizes parameters, MAML optimizes "the process of learning parameters." This "differentiate through the process" idea was the watershed of post-2017 few-shot research.
🤔 If "learning to learn" can stack one more level ("learning to 'learn to learn'"), what does it converge to? Or does it recurse infinitely?

Metric-based Few-shot / Prototypical Networks

Gradient-free Adapt · Embedding + Distance
One-line Analogy

The first two concepts still "train/update weights." Prototypical networks switch approach: when a new task comes, don't train at all. Compute one "centroid vector" per class and store it; for a new sample, find which centroid it's closest to, and that's the answer. This is essentially the vector search and nearest-neighbor lookup you already know (strictly, nearest-centroid rather than KNN), only the embedding space is learned. "Classification" degenerates into "lookup."

Problem It Solves + How It Works

Pain point: MAML still runs an inner-loop gradient descent per new task—too heavy at inference time. Can adapting to new classes involve no gradients at all? Prototypical Networks (Snell et al. 2017) answer: turn classification into a geometry problem.

Mechanism (this is the standard N-way K-shot setup—N new classes, K samples each):

  • Use a learned encoder f to map each sample to a vector;
  • Average each class's K support vectors → that class's "prototype" c_k (centroid);
  • For a new sample x, compute the distance from f(x) to each prototype; the nearest prototype is the predicted class.

Turn distance into probability with softmax (every symbol explained):

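$$p(y = k \mid x) = \mathrm{softmax}_k\big(-d(f(x), c_k)\big) = \frac{\exp\big(-d(f(x), c_k)\big)}{\sum_{k'} \exp\big(-d(f(x), c_{k'})\big)}$$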

d is a distance function (Snell et al. use squared Euclidean; plain Euclidean picks the same nearest prototype), so closer → larger −d → higher probability. The whole "adaptation" is "compute a few average vectors": zero gradient updates. What the model truly learns is the encoder f: making same-class samples cluster and different classes spread apart in the embedding space. Once the space is good, classifying new classes is just querying the nearest centroid. This is the same vector geometry as "retrieve the most relevant docs" in RAG.

Code Example
import torch, torch.nn.functional as F

# support: [N, K, D] encoded vectors; query: [Q, D]
def proto_classify(support, query):
    # 1) Each class prototype = mean of its K support vectors → [N, D]
    prototypes = support.mean(dim=1)
    # 2) Euclidean distance from each query to each prototype → [Q, N]
    dist = torch.cdist(query, prototypes)
    # 3) Negative distance through softmax = class probs (closer = higher)
    return F.softmax(-dist, dim=1)

# At inference: a brand-new 5-way task → just compute prototypes to classify
# No backprop needed — "adaptation" is a few mean() calls
probs = proto_classify(support_emb, query_emb)
pred = probs.argmax(dim=1)
Common Pitfall + Practical Scenario
Pitfall: "prototypical networks are so simple, surely worse than MAML." Quite the opposite—on many few-shot benchmarks, the simple metric method matches or even beats complex MAML, while being far faster at inference and free of second-order-gradient headaches. Lesson: in few-shot regimes, a good embedding space + simple geometry often beats fancy optimization tricks. A simple inductive bias is actually an advantage when data is scarce.
📌 Personal-project scenario: for "face / voiceprint / product recognition" where classes keep getting added, don't use a fixed classification head (each new class needs retraining). Use a metric-based approach: the encoder stays fixed, and a new class just stores one prototype vector. This is essentially how embedding-based face recognition and vector-DB semantic search work: "register and it works," no retraining.
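The "register and it works" pattern can be sketched in pure Python. This is a minimal sketch under stated assumptions: `PrototypeRegistry` is a hypothetical helper, and the 2-D vectors are hand-written stand-ins for what a real, frozen encoder would output.

```python
import math

class PrototypeRegistry:
    """Nearest-prototype classifier where classes can be added at any time."""

    def __init__(self):
        self.prototypes = {}  # class name -> mean embedding (list of floats)

    def register(self, name, embeddings):
        """Store the mean of a class's support embeddings as its prototype."""
        dim = len(embeddings[0])
        self.prototypes[name] = [
            sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)
        ]

    def classify(self, embedding):
        """Return (best class, softmax over negative Euclidean distances)."""
        names = list(self.prototypes)
        dists = [math.dist(embedding, self.prototypes[n]) for n in names]
        m = min(dists)  # shift for numerical stability; softmax is unchanged
        weights = [math.exp(-(d - m)) for d in dists]
        total = sum(weights)
        probs = {n: w / total for n, w in zip(names, weights)}
        return max(probs, key=probs.get), probs

registry = PrototypeRegistry()
registry.register("cat", [[0.9, 0.1], [1.1, -0.1]])   # prototype ≈ [1.0, 0.0]
registry.register("dog", [[0.0, 1.0], [0.2, 0.8]])    # prototype ≈ [0.1, 0.9]
label_before, _ = registry.classify([4.8, 5.1])       # forced to pick cat or dog

# A new class arrives later: no retraining, just one more stored vector
registry.register("bird", [[5.0, 5.0], [5.2, 4.8]])   # prototype ≈ [5.1, 4.9]
label_after, probs = registry.classify([4.8, 5.1])
```

Note the failure mode before "bird" is registered: the query is far from both known prototypes but still gets assigned to one of them. A real system would likely add a distance threshold for "unknown."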
Takeaway + Reflection
💡 Metric-based few-shot fully moves "learning" from "tuning weights" to "designing a good representation space"—after which all new tasks measure with the same geometric ruler. This is the philosophy of "representation > model" taken to its extreme.
🤔 If a class's K samples are themselves scattered (high variance), "averaging into one prototype" distorts. How would you improve—weighting? Multiple prototypes? Or does this expose the limits of the very concept of a "class"?

In-context Learning (ICL)

No Weight Update · Learning Inside Forward Pass · LLM Emergence
One-line Analogy

The first three concepts at least "computed something" (trained weights, computed prototypes). ICL is the most magical: you stuff a few examples into the prompt, not a single byte of model weights changes, and it "learns" the new task. Analogy: a database prepared statement—the same engine, no recompile, instantly generates an execution plan from the few parameters you pass. "Learning" happens inside a single forward pass and evaporates when inference ends.

Problem It Solves + How It Works

Phenomenon: GPT-3 (Brown et al. 2020, "Language Models are Few-Shot Learners") found something strange—without fine-tuning, just giving a few "input→output" examples in the prompt, a large model can do a new task. This is in-context learning, the underlying principle of few-shot in prompt engineering. But "why does learning happen without updating weights" has been a puzzle, with two mainstream explanations:

  • Implicit Bayesian inference (Xie et al. 2021): pretraining exposed the model to vast "coherent text sharing a common latent topic." Given a few examples, the model is really inferring the "latent task concept" the examples share, then continuing per that concept. The examples don't "teach"—they "locate" which already-learned ability you want.
  • Implicit gradient descent (von Oswald et al. 2023): in a simplified linear-attention setting, one can construct Transformer weights whose forward computation over the examples is equivalent to running gradient descent steps on a small internal model. In other words, attention layers can "secretly" implement a learner inside the forward pass, which ties ICL mathematically to gradient-based meta-learning such as MAML.
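The flavour of the implicit-gradient-descent argument can be seen in a tiny identity. This is a simplified illustration in the spirit of that result, not the paper's actual construction: for linear regression started at w = 0, the prediction after one gradient step equals an attention-style sum over the in-context examples with no softmax. The data and the learning rate `eta` below are made-up assumptions.

```python
# In-context examples (x_i, y_i) for a linear regression task, plus one query.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]
eta = 0.1  # learning rate (assumed)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# (a) Explicit learner: start at w = 0, take one gradient step on
#     L(w) = 0.5 * sum_i (w·x_i - y_i)^2, then predict on the query.
w = [0.0, 0.0]
grad = [sum((dot(w, x) - y) * x[j] for x, y in zip(xs, ys)) for j in range(2)]
w_new = [wj - eta * gj for wj, gj in zip(w, grad)]
pred_gd = dot(w_new, x_query)

# (b) Attention-style readout without softmax: keys = x_i, values = y_i,
#     query = x_query. No weight vector is ever stored or updated.
pred_attn = eta * sum(y * dot(x, x_query) for x, y in zip(xs, ys))

print(pred_gd, pred_attn)  # the two predictions agree up to float rounding
```

Both routes compute η · Σ y_i (x_i · x_query). Route (a) calls it "learning"; route (b) calls it "a forward pass."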

A counterintuitive finding (Min et al. 2022): replacing the few-shot examples' labels with random ones, i.e. deliberately mislabeling them, barely drops performance on the classification and multiple-choice tasks they tested. This suggests the examples' core role isn't providing "correct answers," but conveying the task's format, label space, and input distribution, i.e. helping the model "locate" which ability to activate, echoing the Bayesian explanation.
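If you want to try this comparison on your own task, the two prompt conditions are easy to set up. The helper below is a hypothetical sketch (`build_prompt` and its parameters are assumptions, and it does not reproduce Min et al.'s experimental setup): it formats the same inputs either with their gold labels or with labels drawn at random from the label space.

```python
import random

def build_prompt(examples, query, label_space=None, seed=None):
    """Format few-shot examples as 'Review: <text> → <label>' lines.

    If label_space is given, each example's label is replaced by one drawn
    uniformly at random from it (the random-label condition).
    """
    rng = random.Random(seed)
    lines = []
    for text, gold in examples:
        label = rng.choice(label_space) if label_space else gold
        lines.append(f"Review: {text} → {label}")
    lines.append(f"Review: {query} → ")  # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("shipping was too slow", "negative"),
    ("quality exceeded expectations", "positive"),
    ("packaging was so-so", "neutral"),
]
labels = ["negative", "positive", "neutral"]

gold_prompt = build_prompt(examples, "support was very patient")
random_prompt = build_prompt(examples, "support was very patient",
                             label_space=labels, seed=0)
```

Both prompts keep the format, the label space, and the input distribution identical; only the input-label pairing differs, which is the variable the finding is about.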

Four paradigms: at which layer does "learning" happen?
Paradigm | Update weights? | "Adaptation" happens at
Transfer Learning | Yes (top layers) | fine-tuning stage
MAML | Yes (inner few steps) | explicit gradient descent
Prototypical Net | No | compute a few mean vectors
ICL | No (never changes) | inside one forward pass
Code Example
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY env var

# few-shot ICL: 3 examples in the prompt, weights untouched
shots = (
    "Review: shipping was too slow → negative\n"
    "Review: quality exceeded expectations → positive\n"
    "Review: packaging was so-so → neutral\n")
resp = client.messages.create(
    model="claude-opus-4-8", max_tokens=16,
    messages=[{"role": "user",
        "content": shots + "Review: support was very patient → "}])
print(resp.content[0].text)  # e.g. "positive" (model output; not guaranteed)
# From 3 examples the model "reads off" task format and label space,
# classifying with no training at all — that's in-context learning
Common Pitfall + Practical Scenario
Pitfall: "ICL means the model truly 'learns new knowledge' in the prompt." Be careful—examples mainly "locate the task, regularize the format," not inject new facts. Knowledge the model can't produce won't appear from a few examples (that's RAG/fine-tuning's job). What ICL excels at is "switching an existing ability into the mode you want": style, format, classification boundaries.
📌 Decision-support scenario: you want the AI to analyze problems by your fixed framework (e.g. "list assumptions → find counterexamples → give a confidence"). Rather than describing the requirement at length each time, give 1–2 complete worked examples, and the model instantly aligns in-context to your thinking format. This is more robust than a long block of instructions—demonstration beats explanation.
Takeaway + Reflection
💡 ICL is the "ultimate form" of few-shot learning: learning moves entirely from the "training stage" into "that single forward pass at inference"—zero weight updates, use-and-discard. From transfer learning to ICL, "learning" steadily migrates toward runtime—this thread itself is a microcosm of the past decade of ML.
🤔 Since wrong labels barely hurt, how much of few-shot "learning" is really "learning," and how much is "retrieval + locating"?
Engineering counterpart → super-individual D1: Prompt Engineering (how to pick, order, and size few-shot examples)

Deep Questions

1. Lay today's four concepts on a line (transfer learning → MAML → prototypical net → ICL)—what happens to "learning" along the way?
A clear "learning migrates toward runtime" thread. Transfer learning still learns at the training stage, just reusing someone's base. MAML rises a level, learning "how to learn fast" at training, but adapting a new task still runs a few real gradient steps. Prototypical net's adaptation has no gradient, degenerating to "compute means + compare distances." ICL skips even computing prototypes—adaptation happens entirely inside one forward pass, weights fixed. So the line is "push adaptation cost from training time all the way to inference time": further along, lower marginal cost, at the price of depending ever more on a strong pretrained base. This explains why ICL stars in the LLM era—when the base is strong enough, runtime adaptation is nearly free. BigCat's distributed intuition: like "pushing compute from batch to real-time streaming," trading latency for flexibility, provided the infrastructure is thick enough.
2. What does MAML's "gradient of a gradient" actually optimize? Why is it fundamentally different from pretraining that "happens to give a good init"?
Ordinary pretraining optimizes "do well on the pretraining task"; its init happens to be decent downstream—a byproduct, with no guarantee it's "easy to adapt." MAML's second-order gradient directly optimizes "how good after one gradient step": the inner θ−α∇L simulates adapting one step, the outer differentiates the adapted loss again. Geometrically it seeks a point "close to all tasks, with gradient-friendly terrain around it." The essential difference: transfer learning finds "a good position," MAML finds "a good position whose surrounding terrain favors fast descent"—the latter folds "the future optimization process" into the current objective. The cost is expensive second-order derivatives, hence FOMAML/Reptile first-order approximations; they drop the second-order term yet often still work, hinting MAML's gains may come mostly from "finding the task center" rather than second-order refinement.
3. "Shuffled labels barely drop accuracy" (Min et al. 2022)—what does this mean for whether in-context learning is real "learning"?
It forces redefining "learning." If learning means "extract rules from examples' input-label correspondence," then on classification ICL is largely not that—shuffling labels breaks the correspondence yet performance hardly changes. Min et al. argue examples really convey label space, input distribution, sequence format, plus "this is a classification task" itself—i.e. "evoking/locating" an ability the model already learned, not teaching a new mapping live, matching Xie's Bayesian view. But there's a boundary: this is mainly from classification/multiple-choice; for tasks needing genuine induction of a new mapping (e.g. an unseen synthetic function), von Oswald's "implicit gradient descent" fits better. So the answer is "it depends on the task"—ICL is a continuum between "retrieval-locating" and "implicit optimization," not black-and-white.
4. Prototypical nets say "a good embedding space matters more than a complex model"—where else does this "representation-first" philosophy recur?
A hidden line through modern ML. (a) RAG / semantic search: quality depends almost entirely on the embedding space, the retrieval algorithm is just engineering detail—isomorphic to "the encoder decides everything." (b) Contrastive learning: SimCLR, CLIP are exactly "don't train a classifier, just train a representation that clusters same-class and spreads different-class," after which a linear probe or even zero-shot solves downstream—the large-scale version of the prototype idea. (c) Word-vector linear structure (Day 30): king−man+woman≈queen, meaning encoded into geometry. Shared belief: the bottleneck is "how to represent the world," not "how to decide on the representation"—once the representation is right, decisions often degenerate to simple geometry. BigCat's transferable insight: rather than piling on complex business logic, first invest in "embedding data into a good space," and many downstream problems simplify themselves—same wisdom as a database's "right schema, queries write themselves."
5. If "learning" can be pushed all the way to runtime (ICL), is the line between "training" and "inference" still clear? What does this imply for future AI system design?
The line is systematically blurring. Traditionally training (change weights, slow, offline) and inference (use weights, fast, online) are distinct, but ICL moves "adapting a new task" into inference; o1/reasoning models let the model "spend more compute thinking" at inference (test-time compute); memory-equipped Agents accumulate experience across sessions, like slow online learning. The future looks more like a continuum: weights (slowest, most persistent) → long-term memory → context/prompt → a single forward pass's implicit compute (fastest, evaporating). Design implications: (1) Solidify stable general abilities into weights and leave variable personalization to context, which is Day 8 context engineering's "data lifecycle tiering" scaled up to a "learning lifecycle." (2) What a model that adapts in-context "knows" depends on what you feed it, so static benchmarks give a distorted picture. (3) Most concretely for super-individuals: collaboration quality depends less on "how well the model was trained" (you can't change it) and more on "what context you give at runtime" (you can), which hands initiative back to the user.