A convolutional kernel is a stored procedure that slides across an image—the same small function moves position by position, computing a dot product over whatever local pixels it covers. The point isn't the sliding, it's weight sharing: one piece of logic deployed everywhere, like one CDN edge function serving every region's requests, rather than a separate set of weights per pixel.
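To make the sliding dot product concrete, here is a minimal pure-Python sketch. The function name `correlate2d`, the toy image, and the hand-picked vertical-edge kernel are all illustrative assumptions; in a real CNN the kernel values are learned, not written by hand. Like deep-learning frameworks, it computes cross-correlation (no kernel flip), with no padding and stride 1.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1); return the map of dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position: weight sharing.
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Toy 4x5 grayscale image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-picked vertical-edge detector, for illustration only.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = correlate2d(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

The response is strong where the window straddles the dark-to-bright edge and zero where the window is uniformly bright, and the nine kernel weights are the only parameters no matter how large the image is.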
If you feed an image straight into a Day-17 fully-connected layer (every input wired to every neuron), parameters explode: a 1000×1000 image has a million inputs, and a thousand-neuron hidden layer means a billion parameters. Worse, the model can't tell "the cat shifted 10 pixels" from "a different cat." Convolution breaks this with two moves: local connectivity (each neuron sees only a small receptive field) + weight sharing (one kernel sweeps the whole image). Parameters drop from a billion to a few hundred, and you get translation equivariance for free: the same feature is caught by the same kernel wherever it appears, and the response simply moves with it. Pooling then turns that into approximate translation invariance.
Mechanically, each kernel learns one "local pattern detector": the first layer learns edges/color blobs; stacking layers grows the receptive field, so layer two composes textures and corners, higher layers compose eyes, wheels, eventually a whole cat. Interleaved pooling (taking local max/average) downsamples—shrinking size and adding robustness to tiny shifts. This "local → compose → abstract" hierarchy was LeCun's design intuition back in 1989, and AlexNet detonated it on ImageNet in 2012.
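Two claims in the paragraph above can be checked with a little arithmetic: stacking layers grows the receptive field, and pooling absorbs tiny shifts. The sketch below is a toy under stated assumptions: `receptive_field` and `max_pool_2x2` are hypothetical helper names, layers are described only by (kernel size, stride), and the pooling uses non-overlapping 2×2 windows.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first.

    Returns how many input pixels (per axis) one output unit can see.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


def max_pool_2x2(fmap):
    """Non-overlapping 2x2 max pooling over a 2D list (odd trailing row/column dropped)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


print(receptive_field([(3, 1)]))                  # 3: one 3x3 conv
print(receptive_field([(3, 1), (3, 1)]))          # 5: two stacked 3x3 convs
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: conv, 2x2 pool, conv

# The same activation, shifted by one pixel inside a pooling window.
shifted_a = [[0, 9, 0, 0], [0, 0, 0, 0]]
shifted_b = [[9, 0, 0, 0], [0, 0, 0, 0]]
print(max_pool_2x2(shifted_a), max_pool_2x2(shifted_b))  # [[9, 0]] [[9, 0]]
```

The robustness only holds while the shift stays inside one pooling window; a shift that crosses a window boundary does change the pooled output.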
import torch
import torch.nn as nn

# One conv layer: 3 input channels (RGB), 16 output feature maps
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)  # 1 RGB image, 32×32
y = conv(x)
print(y.shape)  # torch.Size([1, 16, 32, 32]); padding=1 keeps the size

# Key: params = 16×3×3×3 + 16 (bias) = 448, independent of image size
# Same output via a dense layer = 3·32·32 × 16·32·32 ≈ 50M weights
print(sum(p.numel() for p in conv.parameters()))  # 448
⚠️ Misconception: that kernels are hand-designed (like Photoshop sharpen/blur filters). In fact CNN kernel weights are all learned automatically by backprop—you only define kernel size and count; what detector each becomes is decided by the data.
🎯 Your scenario: for a personal project tagging/deduplicating thousands of family photos, your first instinct might be a cloud LLM API. But a small few-layer CNN (or an off-the-shelf pretrained backbone) runs locally—zero API cost, data never leaves home. Understanding convolution tells you when you don't need a big model.
A residual / skip connection is a bypass line on every layer—in code it's return x + f(x). Think of a middleware chain where each middleware has a default "pass through unchanged" bypass, and only modifies if it needs to. That bypass turns "add more layers" from "must learn something better" into "at worst, no worse."
Before 2015 people hit a paradox: past a few dozen layers, training error went up. Note—this isn't overfitting (good on train, bad on test); it's degradation: the network can't even fit the training set. Intuitively, extra layers should at least learn an identity mapping (output the input unchanged) as a fallback, but optimizers struggle to make a stack of nonlinear layers approximate identity, and gradients decay badly on the way back.
He et al. 2015 solved it minimally: instead of having layers learn the target H(x) directly, learn the residual F(x) = H(x) − x, with final output H(x) = F(x) + x. If the optimum is the identity, the optimizer just drives F(x) to 0—far easier than fitting an identity from scratch. Even better, that +x opens a gradient highway: in backprop the gradient flows straight back to shallow layers via the skip, bypassing chain-rule decay. With this, ResNet pushed depth to 152 layers, won ImageNet 2015, and the "residual block" became standard in nearly every deep network—including the Transformer.
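Both effects described above, the easy identity fallback and the gradient highway, show up even in a deliberately crude toy. The sketch below assumes scalar "layers" of the form f(h) = w·h with no nonlinearity or normalization; the function names are hypothetical, and this is an illustration of the arithmetic, not the ResNet paper's analysis.

```python
def plain_forward(x, slopes):
    """Plain stack: h <- w * h at every layer."""
    for w in slopes:
        x = w * x
    return x


def residual_forward(x, slopes):
    """Residual stack: h <- h + w * h at every layer."""
    for w in slopes:
        x = x + w * x
    return x


def plain_grad(slopes):
    """d(output)/d(input) for the plain stack: the product of the slopes."""
    grad = 1.0
    for w in slopes:
        grad *= w        # d(w*h)/dh = w
    return grad


def residual_grad(slopes):
    """d(output)/d(input) for the residual stack: the product of (1 + slope)."""
    grad = 1.0
    for w in slopes:
        grad *= 1.0 + w  # d(h + w*h)/dh = 1 + w; the 1 is the skip path
    return grad


depth = 20

# Identity fallback: drive every f to zero and see what survives.
print(plain_forward(3.0, [0.0] * depth))     # 0.0: the signal is gone
print(residual_forward(3.0, [0.0] * depth))  # 3.0: input passes through unchanged

# Gradient highway: small per-layer slopes.
small = [0.05] * depth
print(plain_grad(small))     # vanishingly small (0.05 ** 20)
print(residual_grad(small))  # stays on the order of 1 (1.05 ** 20)
```

The toy also hints at the flip side: with large slopes the residual product (1 + w)^depth can grow quickly, which is one reason real networks pair skip connections with normalization and careful initialization.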
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the +x skip is the whole magic
⚠️ Misconception: that residual connections exist "to fix vanishing gradients." Smooth gradients are a side effect; the paper's core point is the degradation problem: making it easy for deep nets to learn an identity mapping. Related, but not the same thing.
🎯 Your scenario: the residual idea transfers far beyond vision. When you design a multi-step AI pipeline (agent / data chain), keeping a "raw input passthrough" bypass on key stages lets the system degrade gracefully when a step fails, instead of collapsing wholesale—the engineering philosophy of output = step(x) + x.
+x made thousand-layer networks possible, and every Transformer block uses it today.
ViT does one thing: it "tokenizes" an image. It cuts the image into patches of 16×16 pixels, flattens each into a token, so an image becomes "a sentence made of image blocks," then runs the same text Transformer from Day 1. A vision problem gets translated into an NLP problem.
CNNs carry strong inductive bias (prior assumptions baked into the architecture): locality, translation invariance. That makes them fast and accurate on small/medium data, but limits modeling of global long-range relationships—a convolution needs many layers before the top-left corner can "see" the bottom-right. Transformer self-attention is global by birth: any two tokens connect in one step. Dosovitskiy et al. 2020 made the bet: can we drop the convolutional prior and let pure attention + massive data learn visual regularities itself?
The flow: cut image into patches → linearly project each into a vector (patch embedding) → add positional encoding (Day 13, else the model has no sense of spatial order) → prepend a [CLS] token to aggregate global info → feed a standard Transformer encoder → classify from the [CLS] output. The conclusion is key and honest: data scale is the dividing line. On "mid-sized" ImageNet, ViT loses to a comparable CNN (without the inductive-bias prior, it has to relearn "locality" from data); but with pretraining data in the hundreds of millions, ViT overtakes—fewer priors become an advantage, because it isn't shackled by human assumptions. This is the vision version of Day 14's scaling laws.
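The first step of that flow, cutting the image into patches and flattening each one, is simple enough to sketch in pure Python. `patchify` is a hypothetical helper, the all-zero image is a stand-in for real pixels, and the learned parts (the linear projection, the positional encoding, and the [CLS] embedding) are deliberately left out.

```python
def patchify(image, patch):
    """image: H x W x C nested lists. Returns flattened patches in row-major order."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for i in range(top, top + patch):
                for j in range(left, left + patch):
                    flat.extend(image[i][j])  # append this pixel's C channel values
            tokens.append(flat)
    return tokens


H = W = 224  # input resolution
C = 3        # RGB
P = 16       # patch side length in pixels

image = [[[0.0] * C for _ in range(W)] for _ in range(H)]  # placeholder pixels
tokens = patchify(image, P)

print(len(tokens))      # 196 patches (14 x 14)
print(len(tokens[0]))   # 768 raw values per patch (16 * 16 * 3)
print(len(tokens) + 1)  # 197 sequence positions once [CLS] is prepended
```

So a 224×224 image becomes a "sentence" of 196 patch tokens plus [CLS], which is a short sequence by text-model standards.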
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

# Load ViT pretrained on ImageNet (base, patch16)
proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

img = Image.open("cat.jpg")
inputs = proc(images=img, return_tensors="pt")  # resize + normalize; the model itself splits into patches
logits = model(**inputs).logits
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # predicted class, e.g. 'Egyptian cat'
⚠️ Misconception: that "ViT fully replaced CNNs." Reality is each has its turf: with limited data/compute or when speed matters, CNNs (and modern variants) remain the default; for huge-scale pretraining and text/multimodal integration, ViT wins. Treating it as a mandatory upgrade backfires on small data.
🎯 Your scenario: ViT's real significance is architectural unification—images, text, audio can all be cut into tokens and fed into one Transformer. Understand it and you understand why today's multimodal models (GPT/Gemini seeing images) can "run all modalities on one engine."
CLIP packs images and text into one shared vector space via two-tower retrieval—the multimodal version of the embedding semantic search you know from Day 4. One image encoder, one text encoder, trained so that paired image-text vectors pull together and mismatched ones push apart. After training, "a photo of a cat" and the cat image nearly coincide in space.
Traditional vision models have a hard limit: they only know classes fixed at training time. A model trained on 1000-class ImageNet will forever only output those 1000 labels—to add "latte art" you'd re-label data and retrain. CLIP wants zero-shot (recognizing arbitrary new concepts without any dedicated training).
Radford et al. 2021's approach: crawl 400 million (image, text) pairs from the internet and train with contrastive learning. Each batch takes N images and N texts; after encoding, compute an N×N similarity matrix, where the diagonal holds the true pairs (positives, pull up) and the other N²−N entries are mismatches (negatives, push down). That's the essence of contrastive learning: don't predict absolute labels, just learn "who should be close to whom." At inference, a clever move: wrap candidate class names into a template ("a photo of a {label}"), encode them, compare similarity to the image vector, highest wins. Classification becomes image-text retrieval, with classes added at will and no retraining. CLIP thus became a building block for Stable Diffusion (which conditions on a CLIP text encoder) and for the vision side of many multimodal LLMs.
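The N×N objective described above can be written out in a few lines. This is a minimal sketch under stated assumptions, not CLIP's training code: the embeddings are made-up 3-dimensional toy vectors, the function names are hypothetical, and the temperature is fixed at 0.07 here (the value CLIP uses to initialize a temperature that it then learns).

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]


def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss: row k of images should match row k of texts."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    n = len(imgs)
    # N x N similarity matrix; the diagonal holds the true pairs.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    text_to_image = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy embeddings (assumed values, not real encoder outputs).
image_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_texts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

aligned_loss = clip_style_loss(image_vecs, aligned_texts)
shuffled_loss = clip_style_loss(image_vecs, shuffled_texts)
print(aligned_loss < shuffled_loss)  # True
```

When every image sits next to its own caption the loss is near zero; when the captions are shuffled the loss is large, and that gap is the signal gradient descent uses to pull true pairs together and push mismatches apart.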
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

img = Image.open("photo.jpg")
# Candidate classes: write anything, the model need not have trained on them
labels = ["a photo of a cat", "a photo of a dog", "a coffee latte"]
inputs = proc(text=labels, images=img, return_tensors="pt", padding=True)
out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1)  # image-text similarity → probs
print(labels[probs.argmax().item()])  # zero-shot prediction
⚠️ Misconception: that CLIP "understands" image content. It learns image-text statistical association, not true visual reasoning—it stumbles on out-of-distribution concepts, counting, or spatial reasoning. Strong at "matching," weak at "reasoning."
🎯 Your scenario: CLIP lets you build a zero-training personal image semantic search—encode your whole family album into a vector store, then retrieve "photos of the kids at the beach" in natural language. It's Day 4 vector retrieval extended to images: runs locally, privacy intact.
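The retrieval half of that idea is ordinary cosine-similarity search. In the sketch below everything concrete is made up for illustration: the file names, the 3-dimensional vectors, and the `search` helper are assumptions, and in a real setup the stored vectors would come from CLIP's image encoder and the query vector from its text encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the names of the top_k stored vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Hypothetical "vector store": file name -> image embedding (toy values).
index = {
    "beach_2021.jpg": [0.9, 0.1, 0.0],
    "birthday_cake.jpg": [0.1, 0.9, 0.1],
    "kids_sandcastle.jpg": [0.8, 0.2, 0.1],
    "snow_day.jpg": [0.0, 0.1, 0.9],
}

# Stand-in for the text embedding of a beach-themed query.
query = [1.0, 0.1, 0.0]

results = search(query, index)
print(results)  # ['beach_2021.jpg', 'kids_sandcastle.jpg']
```

A linear scan like this is usually fine for a few thousand photos; an approximate nearest-neighbor index only starts to matter at much larger scale.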
(1) identity fallback: x + f(x) makes the worst case of "one more layer" degrade to "did nothing," so in principle adding depth need not hurt, which gives confidence to stack much deeper networks; (2) gradient highway: in backprop the gradient reaches shallow layers directly via +x, bypassing layer-by-layer multiplicative decay. Transformers run dozens to hundreds of layers; without residuals they are very hard to train at that depth, so every attention sub-layer and FFN sub-layer is wrapped as x + sublayer(x). Deeper still, this is a meta-principle of architecture design: make "doing nothing" the default, easy-to-reach state, and concentrate learning pressure on "what to change relative to now" (residual) rather than "constructing everything from zero" (absolute). This idea is everywhere in engineering: incremental updates, delta encoding, and git diff all "store the residual, not the full state."