AI/ML Explained: CNN & Vision

Day 18 · 2026-06-04
For: engineers with coding experience, outside the AI field

Convolution (Conv)

vision basics · weight sharing
One-line analogy

A convolutional kernel is a stored procedure that slides across an image—the same small function moves position by position, computing a dot product over whatever local pixels it covers. The point isn't the sliding, it's weight sharing: one piece of logic deployed everywhere, like one CDN edge function serving every region's requests, rather than a separate set of weights per pixel.

What problem it solves + how it works

If you feed an image straight into a Day-17 fully-connected layer (every input wired to every neuron), parameters explode: a 1000×1000 image has a million inputs, and a thousand-neuron hidden layer means a billion parameters. Worse, the model can't tell "the cat shifted 10 pixels" from "a different cat." Convolution breaks this with two moves: local connectivity (each neuron sees only a small receptive field) + weight sharing (one kernel sweeps the whole image). Parameters drop from a billion to a few hundred, and you get translation invariance almost for free: the same feature is caught by the same kernel wherever it appears. (Strictly, the convolution output shifts along with the input, which is translation equivariance; pooling and later layers turn that into approximate invariance.)
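
To make "one kernel sweeps the whole image" concrete, here is a minimal sketch in plain Python (no framework, so the loop is visible). The 3×3 vertical-edge kernel and the helper `image_with_edge` are illustrative assumptions; in a real CNN the nine numbers would be learned.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Dot product between the kernel and the patch it currently covers.
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical hand-written vertical-edge kernel (illustration only).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

def image_with_edge(col, size=6):
    """size x size image that is 0 left of column `col` and 1 from `col` on."""
    return [[1 if c >= col else 0 for c in range(size)] for _ in range(size)]

out_a = conv2d_valid(image_with_edge(2), kernel)
out_b = conv2d_valid(image_with_edge(3), kernel)

print(out_a[0])  # [3, 3, 0, 0]
print(out_b[0])  # [0, 3, 3, 0]  edge moved one pixel right, response moved with it
```

The same nine weights produce every output value, and shifting the edge by one pixel simply shifts the response by one position.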

Mechanically, each kernel learns one "local pattern detector": the first layer learns edges/color blobs; stacking layers grows the receptive field, so layer two composes textures and corners, higher layers compose eyes, wheels, eventually a whole cat. Interleaved pooling (taking local max/average) downsamples—shrinking size and adding robustness to tiny shifts. This "local → compose → abstract" hierarchy was LeCun's design intuition back in 1989, and AlexNet detonated it on ImageNet in 2012.
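
How fast does the receptive field grow when layers are stacked? The sketch below applies the standard bookkeeping rule (each layer adds `(kernel - 1) × current step size` input pixels, and a stride multiplies the step size). The layer stack itself is a made-up example, not a specific published network.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order.

    Returns the receptive field (in input pixels, along one axis)
    seen by a single output unit after each layer.
    """
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent units
    history = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        history.append(rf)
    return history

# Hypothetical stack: conv3, conv3, 2x2 pool (stride 2), conv3, conv3
stack = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
print(receptive_field(stack))  # [3, 5, 6, 10, 14]
```

Each convolution after the pooling step widens the view twice as fast as before, which is one reason pooling and depth combine well.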

One 3×3 kernel slides over the input (weight sharing)

input map     kernel     output map
▦ ▦ ▦ ▦ ▦ × ▣ 3×3 = ▨ ▨ ▨
▦ ▦ ▦ ▦ ▦ ↓slide        ▨ ▨ ▨
▦ ▦ ▦ ▦ ▦ dot product ▨ ▨ ▨
↑ only these 9 weights for the whole image → param count independent of image size
Code example
import torch
import torch.nn as nn

# One conv layer: 3 input channels (RGB), 16 output feature maps
conv = nn.Conv2d(in_channels=3, out_channels=16,
                 kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 32, 32)   # 1 RGB image, 32×32
y = conv(x)
print(y.shape)               # torch.Size([1, 16, 32, 32]); padding=1 keeps size

# Key: params = 16×3×3×3 + 16(bias) = 448, independent of image size
# Same output via a dense layer = 3·32·32 × 16·32·32 = 3072 × 16384 ≈ 50M params
print(sum(p.numel() for p in conv.parameters()))  # 448
Common misconception + your scenario

⚠️ Misconception: that kernels are hand-designed (like Photoshop sharpen/blur filters). In fact CNN kernel weights are all learned automatically by backprop—you only define kernel size and count; what detector each becomes is decided by the data.
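
A toy illustration of "learned, not hand-designed", shrunk to one dimension and plain Python so every step is visible. Assumptions: the data is generated by a hidden 3-tap filter `target_kernel` (hypothetical), the loss is mean squared error, and the gradient is written out by hand instead of using autograd.

```python
import random

def conv1d_valid(signal, kernel):
    """1-D sliding dot product, stride 1, no padding."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

rng = random.Random(0)
signal = [rng.uniform(-1, 1) for _ in range(64)]

# Hypothetical "true" detector that produced the training targets.
target_kernel = [-1.0, 0.0, 1.0]
targets = conv1d_valid(signal, target_kernel)

weights = [0.0, 0.0, 0.0]   # the kernel starts out knowing nothing
learning_rate = 0.1
for _ in range(500):
    preds = conv1d_valid(signal, weights)
    errors = [p - t for p, t in zip(preds, targets)]
    # Because the weights are shared, each weight's gradient sums the
    # contributions from every position the kernel visited.
    grads = [2 * sum(e * signal[i + j] for i, e in enumerate(errors)) / len(errors)
             for j in range(len(weights))]
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

print([round(w, 3) for w in weights])  # close to the hidden target kernel
```

Nobody typed the edge detector into `weights`; gradient descent recovered it from input/output examples alone.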

🎯 Your scenario: for a personal project tagging/deduplicating thousands of family photos, your first instinct might be a cloud LLM API. But a small few-layer CNN (or an off-the-shelf pretrained backbone) runs locally—zero API cost, data never leaves home. Understanding convolution tells you when you don't need a big model.

Takeaway + question
💡 Convolution's entire power comes from one constraint: local connectivity + weight sharing. It bakes the prior "images have translation invariance" directly into the architecture.
🤔 In the backend systems you know, where do you "apply one piece of logic to massive similar units"? What do those designs share with weight sharing?

Residual Networks (ResNet)

deep training · skip connection
One-line analogy

A residual / skip connection is a bypass line on every layer—in code it's return x + f(x). Think of a middleware chain where each middleware has a default "pass through unchanged" bypass, and only modifies if it needs to. That bypass turns "add more layers" from "must learn something better" into "at worst, no worse."

What problem it solves + how it works

Before 2015 people hit a paradox: past a few dozen layers, training error went up. Note—this isn't overfitting (good on train, bad on test); it's degradation: the network can't even fit the training set. Intuitively, extra layers should at least learn an identity mapping (output the input unchanged) as a fallback, but optimizers struggle to make a stack of nonlinear layers approximate identity, and gradients decay badly on the way back.

He et al. 2015 solved it minimally: instead of having layers learn the target H(x) directly, learn the residual F(x) = H(x) − x, with final output H(x) = F(x) + x. If the optimum is the identity, the optimizer just drives F(x) to 0—far easier than fitting an identity from scratch. Even better, that +x opens a gradient highway: in backprop the gradient flows straight back to shallow layers via the skip, bypassing chain-rule decay. With this, ResNet pushed depth to 152 layers, won ImageNet 2015, and the "residual block" became standard in nearly every deep network—including the Transformer.
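
Both claims in this paragraph (identity fallback and the gradient highway) can be checked on a deliberately tiny model. The sketch assumes each "layer" is just multiplication by one scalar weight, which is far simpler than a real conv block but keeps the arithmetic exact.

```python
def plain_stack(x, weights):
    """Each layer must produce the full output H(x) = w * x."""
    for w in weights:
        x = w * x
    return x

def residual_stack(x, weights):
    """Each layer only produces the residual: H(x) = F(x) + x, F(x) = w * x."""
    for w in weights:
        x = x + w * x
    return x

def plain_gradient(weights):
    """d(output)/d(input) for the plain stack: product of the weights."""
    grad = 1.0
    for w in weights:
        grad *= w
    return grad

def residual_gradient(weights):
    """d(output)/d(input) for the residual stack: product of (1 + w)."""
    grad = 1.0
    for w in weights:
        grad *= 1.0 + w
    return grad

depth = 50
zero_weights = [0.0] * depth     # layers that have learned "nothing"
small_weights = [0.01] * depth   # layers with tiny weights

print(plain_stack(3.0, zero_weights))      # 0.0  signal destroyed
print(residual_stack(3.0, zero_weights))   # 3.0  falls back to identity
print(plain_gradient(small_weights))       # vanishingly small
print(residual_gradient(small_weights))    # stays above 1
```

With F(x) driven to zero the residual stack is exactly the identity, and the `+ 1` term in every factor is what keeps the gradient from collapsing across 50 layers.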

Residual block: H(x) = F(x) + x

  input x
    │      ╲ skip (identity)
  weight → ReLU  ║
  weight          ║ gradient highway
    │ F(x)   ╱
  ⊕ add output F(x)+x
↑ F(x)→0 falls back to identity; "adding layers never hurts"
Code example
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.relu  = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # the +x skip is the whole magic
Common misconception + your scenario

⚠️ Misconception: that residual connections exist "to fix vanishing gradients." Smooth gradients are a side effect; the paper's core point is the degradation problem: making it easy for deep nets to learn an identity mapping. Related, but not the same thing.

🎯 Your scenario: the residual idea transfers far beyond vision. When you design a multi-step AI pipeline (agent / data chain), keeping a "raw input passthrough" bypass on key stages lets the system degrade gracefully when a step fails, instead of collapsing wholesale—the engineering philosophy of output = step(x) + x.
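
One way the "passthrough bypass" could look in an ordinary data pipeline is sketched below. Everything here is hypothetical (the `residual_stage` wrapper, the stage names, the record fields); the assumption is that each step returns only the fields it adds, which are then merged onto the untouched input.

```python
def residual_stage(step):
    """Wrap a step so its output is input + delta, and a failure means delta = {}."""
    def wrapped(record: dict) -> dict:
        try:
            delta = step(record)      # the "F(x)" part: only what this step adds
        except Exception:
            delta = {}                # real code would log the error here
        return {**record, **delta}    # the "+ x" part: input always passes through
    return wrapped

def add_caption(record):
    return {"caption": f"photo taken at {record['place']}"}

def flaky_tagger(record):
    raise RuntimeError("tagging service unavailable")

pipeline = [residual_stage(add_caption), residual_stage(flaky_tagger)]

record = {"file": "img_001.jpg", "place": "beach"}
for stage in pipeline:
    record = stage(record)

print(record)
# {'file': 'img_001.jpg', 'place': 'beach', 'caption': 'photo taken at beach'}
```

The failed tagger costs you its tags but nothing else: the record that reaches the end still carries the raw input and every successful step's contribution.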

Takeaway + question
💡 ResNet's insight: rewrite a hard-to-learn target as an easy-to-learn residual. One +x made thousand-layer networks possible, and every Transformer block uses it today.
🤔 "Make the default behavior 'no change, change only if needed'"—where else, in system design or even building personal habits, does this principle apply?

Vision Transformer (ViT)

architecture fusion · patch token
One-line analogy

ViT does one thing: it "tokenizes" an image. It cuts the image into patches of 16×16 pixels, flattens each into a token, so an image becomes "a sentence made of image blocks," then runs the same text Transformer from Day 1. A vision problem gets translated into an NLP problem.

What problem it solves + how it works

CNNs carry strong inductive bias (prior assumptions baked into the architecture): locality, translation invariance. That makes them fast and accurate on small/medium data, but limits modeling of global long-range relationships—a convolution needs many layers before the top-left corner can "see" the bottom-right. Transformer self-attention is global by birth: any two tokens connect in one step. Dosovitskiy et al. 2020 made the bet: can we drop the convolutional prior and let pure attention + massive data learn visual regularities itself?
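
"Any two tokens connect in one step" can be seen directly in the attention weight matrix. The sketch below is a stripped-down version of scaled dot-product attention: it assumes the tokens act as their own queries and keys (no learned projections), and the four 2-d patch embeddings are invented for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(tokens):
    """Scaled dot-product attention weights with identity projections
    (assumption: a real Transformer learns separate Q and K matrices)."""
    dim = len(tokens[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(dim) for k in tokens]
        for q in tokens
    ]
    return [softmax(row) for row in scores]

# Hypothetical embeddings for a 2x2 grid of patches, in reading order:
# top-left, top-right, bottom-left, bottom-right.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
weights = attention_weights(patch_tokens)

for row in weights:
    print([round(w, 3) for w in row])

# Row 0, column 3: how much the top-left patch reads from the bottom-right one.
print(weights[0][3] > 0)  # True
```

Every entry of the N×N matrix is strictly positive, so after a single layer each patch has already mixed in information from every other patch, corners included.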

The flow: cut image into patches → linearly project each into a vector (patch embedding) → prepend a [CLS] token to aggregate global info → add positional encoding (Day 13, else the model has no sense of spatial order) → feed a standard Transformer encoder → classify from the [CLS] output. The conclusion is key and honest: data scale is the dividing line. On "mid-sized" ImageNet, ViT loses to a comparable CNN (without the inductive-bias prior, it has to relearn "locality" from data); but with pretraining data in the hundreds of millions of images, ViT overtakes. Fewer priors become an advantage, because the model isn't shackled by human assumptions. This is the vision version of Day 14's scaling laws.
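
The first step of that flow, cutting and flattening, is sketched below on a tiny 4×4 single-channel "image" with 2×2 patches so the result fits on screen. The `patchify` helper is illustrative only; the shape arithmetic at the end uses the 224×224 RGB input and 16-pixel patches of the base model.

```python
def patchify(image, patch):
    """Cut a 2-D image (list of rows) into non-overlapping patch x patch
    squares, in reading order, and flatten each square into one token."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens

image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, 2)
print(tokens)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]

# Shape arithmetic for a 224x224 RGB image with 16x16 patches:
num_patches = (224 // 16) ** 2     # 196 patches
patch_dim = 16 * 16 * 3            # 768 raw values per patch before projection
sequence_length = num_patches + 1  # 197 tokens once [CLS] is prepended
print(num_patches, patch_dim, sequence_length)  # 196 768 197
```

From here on the model never sees a 2-D grid again, which is why the positional encoding step matters.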

Turning an image into a token sequence

🖼 image → cut into 16×16 patches
... (N patches)
     ↓ linear projection + positional encoding
[CLS] t₁ t₂ t₃ ... t_N
     ↓ Transformer Encoder (self-attention)
[CLS] output → classifier
↑ almost the same stack as text Transformers—tokens just come from patches
Code example
from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image

# Load ViT pretrained on ImageNet (base, patch16)
proc  = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

img = Image.open("cat.jpg")
inputs = proc(images=img, return_tensors="pt")  # auto patch + normalize
logits = model(**inputs).logits
pred = logits.argmax(-1).item()
print(model.config.id2label[pred])          # predicted class, e.g. 'Egyptian cat'
Common misconception + your scenario

⚠️ Misconception: that "ViT fully replaced CNNs." Reality is each has its turf: with limited data/compute or when speed matters, CNNs (and modern variants) remain the default; for huge-scale pretraining and text/multimodal integration, ViT wins. Treating it as a mandatory upgrade backfires on small data.

🎯 Your scenario: ViT's real significance is architectural unification—images, text, audio can all be cut into tokens and fed into one Transformer. Understand it and you understand why today's multimodal models (GPT/Gemini seeing images) can "run all modalities on one engine."

Takeaway + question
💡 ViT proved: given enough data, a general architecture beats a specialized prior. The step "an image is a token sequence" merged vision into the Transformer's unified empire.
🤔 "Fewer priors + more data > more priors + less data"—does this trade-off belong only to deep learning? What does it suggest for cross-disciplinary judgment: gathering evidence vs trusting intuition?

Contrastive Language-Image Pretraining (CLIP)

multimodal · contrastive · zero-shot
One-line analogy

CLIP packs images and text into one shared vector space via two-tower retrieval—the multimodal version of the embedding semantic search you know from Day 4. One image encoder, one text encoder, trained so that paired image-text vectors pull together and mismatched ones push apart. After training, "a photo of a cat" and the cat image nearly coincide in space.

What problem it solves + how it works

Traditional vision models have a hard limit: they only know classes fixed at training time. A model trained on 1000-class ImageNet will forever only output those 1000 labels—to add "latte art" you'd re-label data and retrain. CLIP wants zero-shot (recognizing arbitrary new concepts without any dedicated training).

Radford et al. 2021's approach: crawl 400 million (image, text) pairs from the internet and train with contrastive learning. Each batch takes N images and N texts; after encoding, compute an N×N similarity matrix. The diagonal holds the true pairs (positives, pull up); the other N²−N entries are mismatches (negatives, push down). That's the essence of contrastive learning: don't predict absolute labels, just learn "who should be close to whom." At inference, a clever move: wrap candidate class names into a template ("a photo of a {label}"), encode them, compare similarity to the image vector, highest wins. Classification becomes image-text retrieval, with classes added at will and no retraining. CLIP's encoders thus became building blocks for later systems: its text encoder conditions Stable Diffusion, and its image encoder serves as the vision front end of many multimodal LLMs.
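
The N×N training objective can be written in a few lines. This is a sketch of the symmetric cross-entropy form (one loss over rows, one over columns, averaged), in plain Python. Assumptions: embeddings are already unit-length, and `scale` is a fixed illustrative value, whereas CLIP learns its temperature during training.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, scale=10.0):
    """Symmetric loss over the N x N similarity matrix; row i pairs with column i."""
    n = len(image_vecs)
    sim = [[scale * dot(img, txt) for txt in text_vecs] for img in image_vecs]
    # Each image must pick its own caption out of the N texts (rows) ...
    image_to_text = sum(cross_entropy(sim[r], r) for r in range(n)) / n
    # ... and each caption must pick its own image out of the N images (columns).
    text_to_image = sum(
        cross_entropy([sim[r][c] for r in range(n)], c) for c in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

# Toy unit vectors standing in for encoder outputs (hypothetical).
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_texts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

print(contrastive_loss(images, matched_texts))   # near 0: diagonal dominates
print(contrastive_loss(images, shuffled_texts))  # large: pairs sit off the diagonal
```

Minimizing this loss is exactly "pull the diagonal up, push everything else down"; no class labels appear anywhere.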

Contrastive learning: N×N similarity matrix, diagonal = positives

      T₁  T₂  T₃  (text)
I₁    ✓   ✗   ✗
I₂    ✗   ✓   ✗
I₃    ✗   ✗   ✓
(image)   ✓ pull up   ✗ push down
↑ inference: class name→text, take highest similarity = zero-shot classification
Code example
from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc  = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

img = Image.open("photo.jpg")
# Candidate classes—write anything, model need not have trained on them
labels = ["a photo of a cat", "a photo of a dog", "a coffee latte"]

inputs = proc(text=labels, images=img, return_tensors="pt", padding=True)
out = model(**inputs)
probs = out.logits_per_image.softmax(dim=1)  # image-text similarity → probs
print(labels[probs.argmax().item()])      # zero-shot prediction
Common misconception + your scenario

⚠️ Misconception: that CLIP "understands" image content. It learns image-text statistical association, not true visual reasoning—it stumbles on out-of-distribution concepts, counting, or spatial reasoning. Strong at "matching," weak at "reasoning."

🎯 Your scenario: CLIP lets you build a zero-training personal image semantic search—encode your whole family album into a vector store, then retrieve "photos of the kids at the beach" in natural language. It's Day 4 vector retrieval extended to images: runs locally, privacy intact.
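
The retrieval half of that album search is plain cosine similarity over stored vectors. In the sketch below the file names and the 3-d vectors are invented stand-ins; in practice the album vectors would come from an image encoder and the query vector from the matching text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Return the top_k file names ranked by similarity to the query."""
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical precomputed image embeddings (the "vector store").
album = {
    "beach_kids.jpg":    [0.9, 0.1, 0.0],
    "birthday_cake.jpg": [0.1, 0.9, 0.1],
    "mountain_hike.jpg": [0.0, 0.2, 0.9],
}

# Stand-in for the text embedding of "photos of the kids at the beach".
query_vec = [0.8, 0.2, 0.1]

print(search(query_vec, album))  # ['beach_kids.jpg', 'birthday_cake.jpg']
```

For a few thousand photos a linear scan like this is usually fast enough; a dedicated vector index only starts to matter at much larger scale.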

Takeaway + question
💡 CLIP's lever is using natural language as the supervision signal: no more hand-labeling fixed classes—let massive "image + caption" pairs teach the model. The essence of zero-shot is quietly swapping classification for retrieval.
🤔 "Replace finely-labeled small data with weaker but massive supervision (web image-text)"—does this idea also hold for building your own knowledge and judgment?

Further Reading

Deep Questions

1. Convolution's inductive bias (locality, translation invariance) makes it efficient on small data; ViT throws away these priors and overtakes it with massive data. What general law about "priors vs data" does this trade-off reveal?
The essence: inductive bias is a double-edged sword. A prior is a "free assumption"—when data is scarce it fills in missing information and saves samples (a CNN never learns from scratch that "nearby pixels are correlated," because the architecture already assumes it). But a prior is also a ceiling: when it diverges from the real regularity, no amount of data can break past that wrong assumption. CNNs assume "local patterns are what matter," true for most natural images, but global long-range relations (the symmetry of a face, distant objects echoing each other) take many layers to reach. ViT has almost no spatial prior, so on small data it "wastes" samples rediscovering locality and loses to CNNs; yet once data reaches hundreds of millions, unbound by wrong priors, it learns regularities a CNN architecture can't express, and overtakes. General law: when data is scarce, strong priors win; when data is abundant, weak priors + big models win. This is the same story as Day 14 scaling laws, and even the shift of weight from "prior vs likelihood" in Bayesian inference—the more evidence, the more the prior should yield. For BigCat's cross-disciplinary judgment: relying on experiential intuition (strong prior) when information is thin is rational, but when you can gather abundant evidence, dare to let data overturn intuition.
2. ResNet's skip connection later became standard in every Transformer block. Why does "residual," a trick seemingly designed only for deep vision nets, generalize across architectures?
Because it solves not a vision-specific problem but a common ailment of all deeply stacked structures: as layers pile up, both the signal (forward) and gradient (backward) decay/distort over the long chain, and optimizers struggle to make deep layers approximate identity. The skip connection gives two architecture-agnostic benefits: (1) identity fallback: x + f(x) makes the worst case of "one more layer" degrade to "did nothing," so adding depth never hurts, giving confidence to stack arbitrary depth; (2) gradient highway: in backprop the gradient reaches shallow layers directly via +x, bypassing layer-by-layer multiplicative decay. Transformers run dozens to hundreds of layers; without residuals they simply won't train, so every attention sub-layer and FFN sub-layer is wrapped as x + sublayer(x). Deeper still, this is a meta-principle of architecture design: make "doing nothing" the default, easy-to-reach state, and concentrate learning pressure on "what to change relative to now" (residual) rather than "constructing everything from zero" (absolute). This idea is everywhere in engineering: incremental updates, delta encoding, and git diff all "store the residual, not the full state."
3. CLIP uses "image-text pairs"—cheap but massive weak supervision—to replace expensive hand labeling. What does this paradigm shift mean for "what counts as a high-quality training signal"?
It overturns the old intuition that "the more precise the supervision, the better." Traditional ImageNet has humans label each image into one of 1000 classes: precise but expensive, and capped at those 1000 classes. CLIP uses the internet's ready-made image-text pairs. The captions are noisy, irregular, and often wrong, but the scale is overwhelming (400M pairs) and the semantic space is open natural language. Key insight: a signal's "information content" = quality × scale × breadth of coverage; when scale and breadth are large enough, per-signal noise averages out, while the concept coverage open language brings is something no fixed label scheme can give. That's why CLIP can do zero-shot: its "classes" are not discrete labels but a continuous language space, naturally generalizing to unseen concepts. The cost: it inherits web data's biases and shallowness, learning statistical co-occurrence rather than deep understanding, so it struggles to count and is weak at spatial reasoning. The transfer for BigCat: when building personal knowledge, "extensive skimming + forming associations" (weak supervision, broad coverage) and "selective deep reading + mastery" (strong supervision, narrow depth) are two complementary signals; what a super-individual needs may not be either/or, but something like CLIP + downstream fine-tuning: first lay a broad foundation with massive weak signal, then dig deep wells with a little precise signal at key points.
4. From CNN (local convolution) to ViT (global attention), vision architecture walked a path "from specialized to general." Is this path comparable to the software evolution you know "from specialized optimization to general platform"?
Highly comparable, but with one key difference worth thinking through. The similar side: early designs are specialized for performance/resource constraints (CNN's local connectivity = "hard-coded optimization" for image structure, just as early systems wrote specialized C modules for single-machine speed); later, abundant compute/data shifts the balance toward general abstraction (ViT/Transformer = a general compute substrate, just as software shifted to general platforms, microservices, and cloud-native), trading for composability and cross-domain reuse. One Transformer engine eats image/text/audio, just as one general infrastructure runs all business. The different side: in software, "generalization" usually trades performance for development efficiency (general solutions are typically slower and more resource-hungry), whereas in deep learning the general architecture under big data even overtakes specialized solutions on performance (accuracy), which is rare in traditional engineering. The reason: software's specialized optimization is deterministic logic written by humans, and generalization just reorganizes it; but CNN's "specialization" hard-codes human priors into the architecture, while ViT's "generality" lets the model learn regularities from data better than human assumptions. Generalization here buys both flexibility and a higher ceiling. This suggests a counterintuitive judgment: in domains where the data/compute bottleneck has been broken, "general + scale" may be not just a convenient compromise but genuinely the better solution. For your architecture decisions, ask: is this domain's bottleneck "human design wisdom" or "data and compute"? The answer decides whether to bet specialized or general.
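
The Bayesian remark in question 1 ("the more evidence, the more the prior should yield") can be made numeric with a Beta-Binomial update, sketched below. The prior strength and the observed counts are invented numbers chosen only to show the trend.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior after observing `successes` out of `trials`,
    starting from a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Strong prior belief that the rate is about 0.8 (Beta(8, 2)),
# while the data consistently shows a rate of 0.3.
prior_a, prior_b = 8, 2

small_data = posterior_mean(prior_a, prior_b, successes=3, trials=10)
large_data = posterior_mean(prior_a, prior_b, successes=300, trials=1000)

print(round(small_data, 3))  # 0.55   prior still pulls the estimate upward
print(round(large_data, 3))  # 0.305  evidence has almost fully overridden the prior
```

With ten observations the prior does half the work; with a thousand it barely registers, which is the same shape as CNN versus ViT across data scales.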