CS PAPERS DEEP-READ · PAPER 5
Dosovitskiy et al. · Google Research · ICLR 2021
In 2020, a team at Google proposed ViT (the Vision Transformer) and did something that sounded a little reckless at the time: they took the Transformer, the kind of brain behind ChatGPT and built for text, and pointed it at images almost unchanged, with next to no image-specific parts. The result: as long as it first sees enough pictures, it beats CNNs, the reigning champions that had ruled computer vision for eight years. Today, when you ask an AI to "look at a picture and talk about it," the eyes it uses are most likely a ViT.
The old way of seeing — the CNN — reads a painting with a magnifying glass: get up close to make out a small patch of brushwork, step back to assemble patches into parts, step back again to assemble the whole. "Look at neighbors first, near before far" is a rule carved into its bones. ViT throws away the magnifying glass: it cuts the whole painting into a hundred-plus little squares, lays them all out on the table, and reads them in one go, like reading a sentence of a hundred-plus words. Any two squares can lock eyes directly: the furry one in the top-left and the tail-shaped one in the bottom-right can huddle at first glance — "put us together, are we a cat?"
In the old world, text had text models and images had image models, each with its own inherited craft, mutually incompatible. ViT said: a picture can be read as a "sentence" too — treat the little squares as words, and seeing and reading can share one brain. The weight of that sentence only became fully clear years later: once it's the same brain, images and words can genuinely "think together" — the picture-reading ChatGPT, and the AIs that paint from a one-line prompt, all stand on this step.
Two steps. Step one, turn the image into "words": slice the picture into small squares, compress each into a string of numbers — like making a flashcard out of a word — and stamp each card with a seat number saying "you originally sat in this row and column." Without the stamps, the model would just see a bag of shuffled puzzle pieces. Step two, run the flashcards through a Transformer: in every layer, each piece sizes up all the other pieces and trades notes — the furry piece gets confirmation from the pointy-ear piece — and layer by layer, scattered clues assemble into a verdict: this is a cat.
A CNN is born with a "how to look at pictures" manual: check your neighbors first, ignore the far side. When pictures are scarce, the manual is a crutch — the CNN learns fast and steady, while the blank-slate ViT stumbles and scores worse. But once pictures are plentiful (hundreds of millions), the manual becomes a cage: some clues can only be seen by spanning the whole image, and the blank slate, with no rules fencing it in, figures all of that out by itself — and overtakes. One honest note on the cost: this overtaking required feeding it hundreds of millions of images and a mountain of compute, far beyond an ordinary lab at the time — the data-frugal recipes came later, from others.
Cut the image into little squares and feed it to a Transformer as a sentence; with enough pictures, carrying no inherited rules about how to see turns out to be stronger. From then on, seeing and reading share one brain — this paper pushed open the door to multimodal AI.
ViT (the Vision Transformer) cuts an image into patches of 16×16 pixels, treats each patch as a "word," and feeds the resulting "visual sentence" to an almost completely unmodified standard Transformer. It then shows by experiment that once pretraining data reaches hundreds of millions of images, this model with no image-specific design edges past the strongest CNNs of the time (ViT reaches 88.55% ImageNet top-1 accuracy) while spending less pretraining compute. Vision and language have shared one architecture ever since, and the foundation of multimodal models was laid here.
The authors are Dosovitskiy, Beyer, Kolesnikov, and nine colleagues at Google Research (the Brain team); the paper appeared at ICLR 2021 (released October 2020). It inherits the Transformer (Paper 1) and BERT's pretrain–finetune paradigm on one side, and the AlexNet→ResNet deep-vision line on the other; downstream of it come CLIP, MAE, DINO, SAM, and DiT — the "eyes" of today's multimodal models are mostly members of the ViT family.
By 2020, NLP had converged on "Transformer + ever-larger pretraining": scale up models and data together and performance keeps climbing, with no saturation in sight. Vision, meanwhile, was still CNN territory. It's not that nobody had tried moving self-attention into vision — but earlier attempts either mixed it with convolutions or invented specialized local/sparse attention patterns to save compute, and those exotic structures ran poorly on existing hardware accelerators and never scaled.
The most naive plan, one token per pixel, is ruled out immediately by arithmetic: self-attention compares all pairs of tokens, so its cost grows with the square of the token count. A 224×224 image has 224 × 224 = 50,176 pixels; square that and each attention map needs about 2.5 billion pairwise scores.
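The arithmetic behind that verdict can be checked in a few lines. This is a rough sketch that counts only the entries of one attention score matrix (one head, one layer) and assumes 4-byte floats for the memory estimate; the function name is illustrative.

```python
def attention_pairs(num_tokens: int) -> int:
    """Entries in one self-attention score matrix: every token against every token."""
    return num_tokens * num_tokens

pixel_tokens = 224 * 224                      # one token per pixel
pairs = attention_pairs(pixel_tokens)

print(pixel_tokens)                           # 50176
print(pairs)                                  # 2517630976
# Memory for a single score matrix, assuming float32 (4 bytes per entry).
print(round(pairs * 4 / 2**30, 2), "GiB")     # 9.38 GiB
```

That is the footprint of a single score matrix, before multiplying by the number of heads and layers.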
So the question became: can a standard Transformer, changed as little as possible, do vision directly? And under what conditions can it beat CNNs, which had eight years of head start and built-in image priors?
ViT's first, and nearly only, image-specific operation: slice a 224×224 image into patches of 16×16 pixels, giving (224/16)² = 196 of them. Each patch is flattened into 16×16×3 = 768 numbers and passed through one learnable linear projection, producing a patch embedding, the exact counterpart of a word vector in NLP. One cut, and the sequence length drops from about 50,000 to 196; the quadratic cost becomes instantly affordable. That is the literal meaning of the title: an image is worth 16×16 words.
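The slicing and projection can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the random image, the random projection weights, and the embedding width `D = 768` are stand-ins chosen for the example (in a trained model the projection is learned).

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch * patch * C), patches in row-major order.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)      # stand-in for a real image

D = 768                                                  # assumed embedding width
W_proj = rng.normal(0, 0.02, (16 * 16 * 3, D)).astype(np.float32)  # learned in a real model
b_proj = np.zeros(D, dtype=np.float32)

patches = patchify(image)                 # one row of 768 raw numbers per patch
tokens = patches @ W_proj + b_proj        # one D-dimensional patch embedding per patch
print(patches.shape, tokens.shape)        # (196, 768) (196, 768)
```

The two 768s are a coincidence of this example: the first is the 16×16×3 raw values in a patch, the second is the chosen embedding width.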
Then add two old NLP parts: ① a learnable position embedding per slot, telling the model where each patch originally sat — without it, the model sees only a bag of shuffled puzzle pieces; ② following BERT, a [CLS] classification token at the front of the sentence, which participates in every layer's attention, pools whole-image information, and whose final output feeds a small classification head that says "what this is."
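A minimal sketch of those two parts, assuming 196 patch embeddings of an illustrative width `D = 768`. All arrays here are random or zero stand-ins; in a real model the [CLS] vector and the position table are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, D = 196, 768                         # D is an assumed embedding width

patch_tokens = rng.normal(size=(num_patches, D)).astype(np.float32)   # stand-in patch embeddings
cls_token = np.zeros((1, D), dtype=np.float32)                        # learned in a real model
pos_embed = rng.normal(0, 0.02, (num_patches + 1, D)).astype(np.float32)  # one row per slot

# Put [CLS] at the front of the "sentence", then add each slot's position embedding.
sequence = np.concatenate([cls_token, patch_tokens], axis=0) + pos_embed
print(sequence.shape)                             # (197, 768)

# After the encoder, only the [CLS] row (index 0) is passed to the classification head.
cls_slot = sequence[0]
print(cls_slot.shape)                             # (768,)
```

Note that the sequence the Transformer actually sees has 197 rows: 196 patches plus the [CLS] slot.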
And then — that's it. What follows is a standard Transformer encoder (multi-head self-attention + MLP, each sub-layer wrapped in a residual connection; see Papers 1/2), with not a line changed. The "laziness" is deliberate: keep the architecture untouched, and everything the NLP world knows about scaling Transformers — training tricks, efficient implementations, all of it — transfers to vision overnight.
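For readers who want to see what one such encoder layer computes, here is a compact NumPy sketch. It follows the paper's arrangement of layer normalization before each sub-layer and a residual connection after it, but it is a simplification: biases and the learnable normalization scale are omitted, the weights are random, and the sizes (`d = 64`, 4 heads) are small illustrative values rather than any real ViT configuration.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token's features; learnable scale and shift omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)        # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, p, heads):
    n, d = x.shape
    dh = d // heads

    def split(t):
        # (n, d) -> (heads, n, dh): each head gets its own slice of the width.
        return t.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))   # (heads, n, n): all pairs of tokens
    merged = (attn @ v).transpose(1, 0, 2).reshape(n, d)     # concatenate heads back to (n, d)
    return merged @ p["Wo"]

def encoder_block(x, p, heads):
    x = x + self_attention(layer_norm(x), p, heads)          # attention sub-layer + residual
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]          # MLP sub-layer + residual
    return x

rng = np.random.default_rng(0)
n, d, heads, hidden = 197, 64, 4, 256                        # illustrative sizes
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, hidden), "W2": (hidden, d)}
params = {name: rng.normal(0, 0.02, shape) for name, shape in shapes.items()}

x = rng.normal(size=(n, d))                                  # stand-in token sequence
y = encoder_block(x, params, heads)
print(y.shape)                                               # (197, 64)
```

Nothing in the block refers to rows, columns, or neighbors: it treats its input as a plain sequence of vectors, which is why the same code serves text and image tokens alike.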
A CNN's strength comes from three inductive biases welded into its structure: locality (look at neighbors first), translation equivariance (a cat is a cat wherever it moves), and hierarchy (small features assemble into big ones). These assumptions are basically correct for images — half the answers filled in for free — a huge advantage when data is scarce.
ViT strips nearly all of it out: self-attention can look globally from layer one, and "how near or far to look" is no longer dictated by structure — the data teaches it. The only remaining traces of hand-made 2D awareness in the whole model: the patch-slicing cut, and one 2D interpolation of position embeddings when fine-tuning at a different resolution.
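The second of those traces, resizing the position embeddings, could look roughly like this. It is a hand-rolled bilinear interpolation under stated assumptions: the move from 224 to 384 pixels (a 14×14 grid to a 24×24 grid), the corner-aligned sampling, and the function name are illustrative choices, and real implementations typically call a library resize routine instead.

```python
import numpy as np

def resize_pos_embed(pos: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Bilinearly resize patch position embeddings from old_grid^2 to new_grid^2 slots.

    pos has shape (old_grid * old_grid, D). A [CLS] row, if present, should be
    split off by the caller and re-attached unchanged afterwards.
    """
    d = pos.shape[1]
    grid = pos.reshape(old_grid, old_grid, d)          # restore the 2D layout

    # Where each new slot falls in old-grid coordinates (corner-aligned).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    frac = coords - lo                                 # blend weight toward the `hi` neighbor

    # Interpolate along rows first, then along columns.
    fr = frac[:, None, None]                           # varies along the new row axis
    rows = grid[lo] * (1 - fr) + grid[hi] * fr         # (new_grid, old_grid, D)
    fc = frac[None, :, None]                           # varies along the new column axis
    out = rows[:, lo] * (1 - fc) + rows[:, hi] * fc    # (new_grid, new_grid, D)
    return out.reshape(new_grid * new_grid, d)

rng = np.random.default_rng(0)
pos_224 = rng.normal(0, 0.02, (14 * 14, 8))            # stand-in embeddings, small width for the demo
pos_384 = resize_pos_embed(pos_224, old_grid=14, new_grid=384 // 16)
print(pos_224.shape, pos_384.shape)                    # (196, 8) (576, 8)
```

The reshape to `(old_grid, old_grid, D)` is the point where the model is told, by hand, that its slots form a 2D grid.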
The paper's most important experiment quantifies this "priors vs data" trade: pretrained only on ImageNet (1.3M images), ViT is clearly worse than a comparable ResNet — the blank slate loses to the manual; at ImageNet-21k (14M) they roughly tie; at JFT-300M (300M) ViT overtakes across the board, and the bigger the model the more it benefits. One-line conclusion: large-scale pretraining beats hand-crafted inductive bias.
Intriguingly, a trained ViT grows CNN-like habits on its own: some attention heads in the lowest layers look only at nearby patches (convolution-like), while others look across the whole image from layer one — something a CNN structurally cannot do; the position embeddings also spontaneously learn the row-and-column layout. The priors weren't forced into the structure, but the data taught them anyway.
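One way to put a number on "looks nearby" versus "looks across the whole image" is an attention-weighted average distance between patches. The sketch below is a simplified take on that idea, not the paper's exact procedure: the two attention matrices are synthetic (a sharply peaked one and a uniform one), not outputs of a trained model, and the helper names are hypothetical.

```python
import numpy as np

def patch_distances(grid: int, patch: int = 16) -> np.ndarray:
    """Pairwise distance in pixels between patch positions on a grid x grid layout."""
    ys, xs = np.divmod(np.arange(grid * grid), grid)       # row and column of each patch
    centers = np.stack([ys, xs], axis=1) * float(patch)
    return np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)

def mean_attention_distance(attn: np.ndarray, grid: int, patch: int = 16) -> float:
    """Average, over query patches, of the attention-weighted distance to key patches.

    attn: (grid * grid, grid * grid) matrix for one head, rows summing to 1, [CLS] excluded.
    """
    return float((attn * patch_distances(grid, patch)).sum(axis=1).mean())

grid = 14
n = grid * grid
dist = patch_distances(grid)

# Synthetic "local" head: weight falls off quickly beyond about one patch.
local_head = np.exp(-(dist / 16.0) ** 2)
local_head /= local_head.sum(axis=1, keepdims=True)

# Synthetic "global" head: every patch attends equally to every patch.
global_head = np.full((n, n), 1.0 / n)

print(mean_attention_distance(local_head, grid))    # small: stays near the query patch
print(mean_attention_distance(global_head, grid))   # large: spans the image
```

Applied to real attention maps, a low value would mark a convolution-like head and a high value a head that integrates information across the image.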
First, architectural unification. Images and text became the same kind of thing — token sequences — feeding the same kind of model. CLIP paired a ViT with a text Transformer for image–text contrastive learning and opened the multimodal era; the vision encoders inside today's picture-reading chat models are almost all ViT-family.
Second, NLP's scaling playbook transferred to vision wholesale. MAE brought BERT's "fill in the blanks" to images (mask most patches and reconstruct them), the DINO line built self-supervised visual representations, SAM ("segment anything") uses a ViT-H image encoder, and generative backbones switched from U-Net to Transformers (DiT — the foundation of the Sora / Stable Diffusion 3 lineage).
Third, methodology. It turned "inductive bias vs data scale" from a philosophical slogan into a measurable experiment, and "less hand design, more data and compute" became the field's default direction thereafter.
① In one line: cut the image into 16×16 patches as "words" and feed a standard Transformer, unchanged; with enough pretraining data, carrying no visual priors turns out stronger.
② The pain: vision was CNN-monopolized; earlier attention attempts either mixed in convolutions or ran slowly; per-pixel attention is ruled out by quadratic cost.
③ Mechanism one: patch embeddings — (224/16)² = 196 patches, each linearly projected to a vector, + position embeddings + a [CLS] token; sequence length drops from 50,000 to 196.
④ Mechanism two: strip out the CNN's locality/translation/hierarchy priors and buy them back with large-scale pretraining — lose at 1.3M images, tie at 14M, overtake at 300M.
⑤ Results: ViT-H/14 hits 88.55% on ImageNet, beating the strongest CNNs; ViT-L/16 overtakes BiT-L with roughly 1/15 the pretraining compute; the public-data recipe reaches 85.30%.
⑥ The model grows its own habits — low layers mix local-looking and global-looking heads — so priors need not be welded in; data can teach them.
⑦ Impact: one architecture for vision and language; CLIP / MAE / DINO / SAM / DiT all stand on it — the eyes of multimodal models.
⑧ Limits: fails on small data (DeiT later fixed this with training tricks); quadratic cost at high resolution (Swin reintroduced locality and hierarchy to tame it); ConvNeXt questions whether the architecture was even necessary.