An Image is Worth 16×16 Words (ViT) — CS Papers Deep-Read Paper 5

What did this paper do?

In 2020, a team at Google proposed ViT (the Vision Transformer) and did something that sounded a little reckless at the time: they took the Transformer (the kind of brain behind ChatGPT, built for text) and pointed it at images almost unchanged, with next to no image-specific parts. The result: as long as it first sees enough pictures, it beats CNNs, the reigning champions that had ruled computer vision for eight years. Today, when you ask an AI to "look at a picture and talk about it," the eyes it uses are most likely a ViT.

An analogy

The old way of seeing — the CNN — reads a painting with a magnifying glass: get up close to make out a small patch of brushwork, step back to assemble patches into parts, step back again to assemble the whole. "Look at neighbors first, near before far" is a rule carved into its bones. ViT throws away the magnifying glass: it cuts the whole painting into a hundred-plus little squares, lays them all out on the table, and reads them in one go, like reading a sentence of a hundred-plus words. Any two squares can lock eyes directly: the furry one in the top-left and the tail-shaped one in the bottom-right can huddle at first glance — "put us together, are we a cat?"

What's new

In the old world, text had text models and images had image models, each with its own inherited craft, mutually incompatible. ViT said: a picture can be read as a "sentence" too — treat the little squares as words, and seeing and reading can share one brain. The weight of that sentence only became fully clear years later: once it's the same brain, images and words can genuinely "think together" — the picture-reading ChatGPT, and the AIs that paint from a one-line prompt, all stand on this step.

How it works

Two steps. Step one, turn the image into "words": slice the picture into small squares, compress each into a string of numbers — like making a flashcard out of a word — and stamp each card with a seat number saying "you originally sat in this row and column." Without the stamps, the model would just see a bag of shuffled puzzle pieces. Step two, run the flashcards through a Transformer: in every layer, each piece sizes up all the other pieces and trades notes — the furry piece gets confirmation from the pointy-ear piece — and layer by layer, scattered clues assemble into a verdict: this is a cat.

Why the "blank slate" won

A CNN is born with a "how to look at pictures" manual: check your neighbors first, ignore the far side. When pictures are scarce, the manual is a crutch — the CNN learns fast and steady, while the blank-slate ViT stumbles and scores worse. But once pictures are plentiful (hundreds of millions), the manual becomes a cage: some clues can only be seen by spanning the whole image, and the blank slate, with no rules fencing it in, figures all of that out by itself — and overtakes. One honest note on the cost: this overtaking required feeding it hundreds of millions of images and a mountain of compute, far beyond an ordinary lab at the time — the data-frugal recipes came later, from others.

Remember one thing

Cut the image into little squares and feed it to a Transformer as a sentence; with enough pictures, carrying no inherited rules about how to see turns out to be stronger. From then on, seeing and reading share one brain — this paper pushed open the door to multimodal AI.

In one sentence

ViT (the Vision Transformer) cuts an image into patches of 16×16 pixels, treats each patch as a "word," and feeds the resulting "visual sentence" to an almost completely unmodified standard Transformer. It then shows by experiment that once pretraining data reaches hundreds of millions of images, this model with almost no image-specific design beats the strongest CNNs (88.55% ImageNet top-1 accuracy) while spending less pretraining compute. Vision and language have shared one architecture ever since, and the foundation of multimodal models was laid here.

Glossary

Neural network / deep learning: a model built by stacking many "layers" that learn patterns from data automatically; more layers, more complex patterns.
CNN (convolutional neural network): a network that slides small windows over an image, extracting features layer by layer from near to far; AlexNet and ResNet (Papers 2/3) are CNNs, and they ruled vision from 2012 to 2020.
Transformer / self-attention: the architecture from 2017 (Paper 1); self-attention lets every element in a sequence "look at" every other element, score relevance pairwise, and pool information accordingly.
Token and embedding: a token is the basic unit a Transformer processes; an embedding turns a token into a learnable string of numbers (a vector).
Pretrain–finetune: first learn general ability on massive data, then adapt with a small amount of target-task data; BERT made this the standard playbook in NLP.
Inductive bias: assumptions about the world that the designer welds into the model's structure — e.g., a CNN assumes "nearby pixels are more related" and "an object shifted is still the same object."
ImageNet / ImageNet-21k / JFT-300M: three image datasets from small to large — roughly 1.3M / 14M / 300M images; the first two are public, JFT is Google-internal and private.

Where it sits

The authors are Dosovitskiy, Beyer, Kolesnikov, and nine colleagues at Google Research (the Brain team); the paper appeared at ICLR 2021 (released October 2020). It inherits the Transformer (Paper 1) and BERT's pretrain–finetune paradigm on one side, and the AlexNet→ResNet deep-vision line on the other; downstream of it come CLIP, MAE, DINO, SAM, and DiT — the "eyes" of today's multimodal models are mostly members of the ViT family.

The problem and the motivation

By 2020, NLP had converged on "Transformer + ever-larger pretraining": scale up models and data together and performance keeps climbing, with no saturation in sight. Vision, meanwhile, was still CNN territory. It's not that nobody had tried moving self-attention into vision — but earlier attempts either mixed it with convolutions or invented specialized local/sparse attention patterns to save compute, and those exotic structures ran poorly on existing hardware accelerators and never scaled.

The most naive plan, one token per pixel, is killed on the spot by arithmetic: self-attention compares all pairs, so its cost grows with the square of the token count, and a 224×224 image has 50,176 pixels; square that and you get roughly 2.5 billion pairs per attention layer.
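
The arithmetic behind that verdict can be sketched in a few lines of Python. The helper names are made up for illustration, and counting query-key pairs is only a rough proxy for attention cost: it ignores heads, layers, and embedding width.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of tokens when a square image is cut into square patches.

    patch_size=1 means one token per pixel.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens: int) -> int:
    """Self-attention scores every token against every token: n * n pairs."""
    return num_tokens * num_tokens


pixel_tokens = token_count(224, 1)           # one token per pixel
pixel_pairs = attention_pairs(pixel_tokens)  # pairs scored in a single attention layer
print(pixel_tokens, pixel_pairs)             # 50176 2517630976
```

Roughly 2.5 billion pairwise scores for one layer and one image is what rules out the per-pixel plan.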

So the question became: can a standard Transformer, changed as little as possible, do vision directly? And under what conditions can it beat CNNs, which had eight years of head start and built-in image priors?

The core idea

Cut the image into words: patches are visual tokens

ViT's first, and nearly only, image-specific operation: slice a 224×224 image into patches of 16×16 pixels, giving (224/16)² = 196 of them. Each patch is flattened into 16×16×3 = 768 numbers and passed through one learnable linear projection, producing a patch embedding, the exact counterpart of a word vector in NLP. One cut, and the sequence length drops from about 50,000 to 196; the quadratic cost becomes instantly affordable. That is the idea behind the title: each 16×16-pixel patch plays the role of a word.
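
A minimal pure-Python sketch of the slicing and projection, under stated assumptions: the function names are hypothetical, the image is random noise, the projection weights are random rather than learned, and the embedding width is shrunk to 4 so the loops stay fast (the paper's models use 768 or more).

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches.

    Patches are read row by row, left to right; each one is flattened
    into a list of patch_size * patch_size * C numbers.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])
            patches.append(flat)
    return patches


def linear_projection(patches, weight):
    """Multiply each flattened patch (length P) by a P x D weight matrix."""
    dim = len(weight[0])
    return [
        [sum(x * w_row[d] for x, w_row in zip(patch, weight)) for d in range(dim)]
        for patch in patches
    ]


rng = random.Random(0)
# Stand-in for a 224 x 224 RGB image.
image = [[[rng.random() for _ in range(3)] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 16)

embed_dim = 4  # toy width, an assumption for speed
weight = [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)] for _ in range(len(patches[0]))]
embeddings = linear_projection(patches, weight)

print(len(patches), len(patches[0]), len(embeddings[0]))  # 196 768 4
```

Each of the 196 rows in `embeddings` is one "word" of the visual sentence.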

Then add two old NLP parts: ① a learnable position embedding per slot, telling the model where each patch originally sat — without it, the model sees only a bag of shuffled puzzle pieces; ② following BERT, a [CLS] classification token at the front of the sentence, which participates in every layer's attention, pools whole-image information, and whose final output feeds a small classification head that says "what this is."
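
How the two parts combine can be sketched as follows. This is an illustration, not the paper's code: `build_input_sequence` is a hypothetical name, every vector is random instead of learned, and the width of 4 is a toy value.

```python
import random


def build_input_sequence(patch_embeddings, cls_token, position_embeddings):
    """Prepend the [CLS] vector, then add one position vector per slot."""
    tokens = [cls_token] + patch_embeddings
    if len(tokens) != len(position_embeddings):
        raise ValueError("need exactly one position embedding per token")
    return [
        [t + p for t, p in zip(token, pos)]
        for token, pos in zip(tokens, position_embeddings)
    ]


rng = random.Random(0)
num_patches, dim = 196, 4  # dim is a toy width


def random_vector():
    """Stand-in for a learned parameter vector."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


patch_embeddings = [random_vector() for _ in range(num_patches)]
cls_token = random_vector()
# One position vector per slot, including the [CLS] slot at index 0.
position_embeddings = [random_vector() for _ in range(num_patches + 1)]

sequence = build_input_sequence(patch_embeddings, cls_token, position_embeddings)
print(len(sequence))  # 197
```

The encoder therefore sees 197 tokens: 196 patches plus the classification slot.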

And then — that's it. What follows is a standard Transformer encoder (multi-head self-attention + MLP, each sub-layer wrapped in a residual connection; see Papers 1/2), with not a line changed. The "laziness" is deliberate: keep the architecture untouched, and everything the NLP world knows about scaling Transformers — training tricks, efficient implementations, all of it — transfers to vision overnight.
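
To make "each token looks at every other token" concrete, here is a simplified sketch of one self-attention sub-layer with its residual connection. It is deliberately reduced: a single head instead of multiple heads, no layer normalization, no MLP sub-layer, random weights, and hypothetical function names throughout.

```python
import math
import random


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (nested lists)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns (outputs, weights); weights[i][j] is how much token i
    attends to token j, and each row of weights sums to 1.
    """
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def add_residual(inputs, outputs):
    """Residual connection: the sub-layer output is added to its input."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(inputs, outputs)]


rng = random.Random(0)
num_tokens, dim = 5, 4  # toy sizes; a real ViT input has 197 tokens


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


tokens = random_matrix(num_tokens, dim)
w_q, w_k, w_v = (random_matrix(dim, dim) for _ in range(3))

attended, weights = self_attention(tokens, w_q, w_k, w_v)
tokens_out = add_residual(tokens, attended)
print(len(tokens_out), len(tokens_out[0]))  # 5 4
```

The `weights` matrix is num_tokens × num_tokens, which is where the quadratic cost comes from.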

Fig 1 · ViT architecture: slice into patches → linear projection into "flashcards" → add position embeddings → standard Transformer encoder → classify from the CLS token's output. The only image-specific design is the slicing.

Remove the priors, buy them back with data

A CNN's strength comes from three inductive biases welded into its structure: locality (look at neighbors first), translation equivariance (a cat is a cat wherever it moves), and hierarchy (small features assemble into big ones). These assumptions are basically correct for images — half the answers filled in for free — a huge advantage when data is scarce.

ViT strips nearly all of it out: self-attention can look globally from layer one, and "how near or far to look" is no longer dictated by structure — the data teaches it. The only remaining traces of hand-made 2D awareness in the whole model: the patch-slicing cut, and one 2D interpolation of position embeddings when fine-tuning at a different resolution.
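
The resolution change can be sketched as a resize of the grid of position vectors. The following assumes bilinear interpolation with a corners-to-corners mapping, which is one plausible choice and not necessarily the paper's exact setting; it also handles only the patch grid and leaves the [CLS] slot aside. The function name is hypothetical.

```python
def interpolate_grid(grid, new_size):
    """Bilinearly resize a square grid of vectors to new_size x new_size.

    grid[r][c] is the position-embedding vector for row r, column c.
    Corners of the old grid map to corners of the new grid.
    """
    old_size = len(grid)
    dim = len(grid[0][0])
    scale = (old_size - 1) / (new_size - 1) if new_size > 1 else 0.0
    resized = []
    for r in range(new_size):
        y = r * scale
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        fy = y - y0
        row = []
        for c in range(new_size):
            x = c * scale
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            fx = x - x0
            vec = []
            for d in range(dim):
                top = grid[y0][x0][d] * (1 - fx) + grid[y0][x1][d] * fx
                bottom = grid[y1][x0][d] * (1 - fx) + grid[y1][x1][d] * fx
                vec.append(top * (1 - fy) + bottom * fy)
            row.append(vec)
        resized.append(row)
    return resized


# Toy "embeddings": each vector just stores its own (row, col) coordinates,
# so the result is easy to check by eye.
old = [[[float(r), float(c)] for c in range(14)] for r in range(14)]  # 224 / 16 = 14
new = interpolate_grid(old, 24)                                       # 384 / 16 = 24
print(len(new), len(new[0]), new[0][0])  # 24 24 [0.0, 0.0]
```

Going from 224 to 384 pixels per side at patch size 16 grows the grid from 14×14 to 24×24, so 196 learned vectors have to be stretched over 576 slots.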

The paper's most important experiment quantifies this "priors vs data" trade: pretrained only on ImageNet (1.3M images), ViT is clearly worse than a comparable ResNet — the blank slate loses to the manual; at ImageNet-21k (14M) they roughly tie; at JFT-300M (300M) ViT overtakes across the board, and the bigger the model the more it benefits. One-line conclusion: large-scale pretraining beats hand-crafted inductive bias.

Intriguingly, a trained ViT grows CNN-like habits on its own: some attention heads in the lowest layers look only at nearby patches (convolution-like), while others look across the whole image from layer one — something a CNN structurally cannot do; the position embeddings also spontaneously learn the row-and-column layout. The priors weren't forced into the structure, but the data taught them anyway.
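
One way to put a number on "looks nearby" versus "looks across the whole image" is an attention-weighted average distance. The sketch below is a simplified illustration with hand-made attention patterns, not measurements from a trained model; it ignores the [CLS] token, and the function name is hypothetical.

```python
import math


def mean_attention_distance(weights, query_index, grid_size, patch_size):
    """Attention-weighted average distance (in pixels) from one query patch.

    weights[j] is the attention the query patch pays to patch j; patches
    are numbered row by row on a grid_size x grid_size grid.
    """
    q_row, q_col = divmod(query_index, grid_size)
    total = 0.0
    for j, w in enumerate(weights):
        row, col = divmod(j, grid_size)
        total += w * patch_size * math.hypot(row - q_row, col - q_col)
    return total


grid_size, patch_size = 14, 16
num_patches = grid_size * grid_size
query = 7 * grid_size + 7  # a patch near the centre of the grid

# Hypothetical "local" head: attention split evenly over the 4 direct neighbours.
local = [0.0] * num_patches
for neighbour in (query - 1, query + 1, query - grid_size, query + grid_size):
    local[neighbour] = 0.25

# Hypothetical "global" head: attention spread evenly over every patch.
uniform = [1.0 / num_patches] * num_patches

local_distance = mean_attention_distance(local, query, grid_size, patch_size)
global_distance = mean_attention_distance(uniform, query, grid_size, patch_size)
print(local_distance)   # 16.0
print(global_distance)  # several times larger than the local head
```

A convolution-like head scores close to one patch width, while a head that spreads attention over the image scores far higher.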

Fig 2 · The "priors for data" crossover (trend sketch, not exact values): with small data the prior-laden ResNet leads; around tens of millions of pretraining images they tie, and past hundreds of millions ViT pulls ahead.

Key results

Pretrained on JFT-300M, ViT-H/14 (632M parameters; "/14" means patches of 14×14 pixels) reached 88.55% top-1 accuracy on ImageNet, beating the strongest CNN-line contenders of the day: BiT-L (a very large ResNet, 87.54%) and Noisy Student (EfficientNet, 88.5%); it also led on CIFAR-100 (94.55%) and VTAB, a suite of 19 low-data transfer tasks (77.63%).
The sharper point is cost-effectiveness: ViT-L/16 spent roughly 0.68k TPUv3 core-days on pretraining and its result (87.76%) already beat BiT-L's, which took 9.9k core-days — an order of magnitude less compute, and more accurate.
Pretrained on the public ImageNet-21k, ViT-L/16 still reached 85.30% — a reproducible recipe for anyone without JFT.
The honest flip side: pretrained only on "small" ImageNet-1k, ViT loses to comparable ResNets; the crossover sits around tens of millions of images — go big, or don't bother.
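
The cost-effectiveness claim is easy to check from the figures quoted above. In this short sketch the only assumption is the 224×224 input used to show how patch size changes the token count.

```python
# Pretraining compute quoted above, in TPUv3 core-days.
vit_l16_core_days = 0.68e3
bit_l_core_days = 9.9e3
compute_ratio = bit_l_core_days / vit_l16_core_days
print(f"BiT-L used about {compute_ratio:.1f}x the compute of ViT-L/16")  # about 14.6x

# ImageNet top-1 accuracy quoted above, in percent.
accuracy_gap = 87.76 - 87.54
print(f"ViT-L/16 leads BiT-L by {accuracy_gap:.2f} points")  # 0.22 points

# The "/16" and "/14" suffixes are patch sizes in pixels.
# Assumed input: 224 x 224. Smaller patches mean a longer sequence.
tokens_patch16 = (224 // 16) ** 2
tokens_patch14 = (224 // 14) ** 2
print(tokens_patch16, tokens_patch14)  # 196 256
```

A ratio of about 14.6 is where "an order of magnitude less compute" comes from.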

Why it mattered

First, architectural unification. Images and text became the same kind of thing — token sequences — feeding the same kind of model. CLIP paired a ViT with a text Transformer for image–text contrastive learning and opened the multimodal era; the vision encoders inside today's picture-reading chat models are almost all ViT-family.

Second, NLP's scaling playbook transferred to vision wholesale. MAE brought BERT's "fill in the blanks" to images (mask most patches and reconstruct them), the DINO line built self-supervised visual representations, SAM ("segment anything") uses a ViT-H image encoder, and generative backbones switched from U-Net to Transformers (DiT — the foundation of the Sora / Stable Diffusion 3 lineage).

Third, methodology. It turned "inductive bias vs data scale" from a philosophical slogan into a measurable experiment, and "less hand design, more data and compute" became the field's default direction thereafter.

Limitations and criticism

Data hunger, and JFT-300M is private: ordinary labs could neither afford nor reproduce the original recipe. But DeiT (December 2020, Facebook AI, now Meta) promptly showed that with strong data augmentation + knowledge distillation, ViT trains well on ImageNet-1k alone, suggesting "300M images required" was partly just missing training tricks.
The quadratic cost didn't disappear — patches only hide it: raise the resolution and the patch count, and attention costs blow up again, hurting dense high-resolution tasks like detection and segmentation. Swin Transformer (2021) brought CNN-style priors back — local windows plus a hierarchy — to win those tasks. Somewhat ironic.
Did the architecture win, or did scale and recipe? ConvNeXt (2022) modernized a ResNet's design details and training recipe, stayed purely convolutional, and matched Swin, suggesting the Transformer architecture itself may not be strictly necessary; the debate is still open.
Coverage and an engineering scar: the original paper only did classification, and the learned position embeddings are tied to the training resolution — changing resolution requires an interpolation patch-up, which is not elegant.
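
The quadratic-cost point above can be made concrete with a rough count of query-key pairs at several resolutions. This is a simplified sketch: the function names are hypothetical, pair counts ignore heads and embedding width, and the windowed variant only illustrates the local-window idea with a window of 7×7 tokens; it is not a faithful model of Swin, which also uses smaller patches and a hierarchy.

```python
def global_attention_pairs(image_size, patch_size=16):
    """Return (tokens, query-key pairs) when every token attends to every token."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens * tokens


def windowed_attention_pairs(image_size, patch_size=16, window=7):
    """Return (tokens, pairs) when each token attends only inside its window.

    Assumes the token grid side is divisible by the window side.
    """
    side = image_size // patch_size
    if side % window != 0:
        raise ValueError("token grid side must be divisible by window")
    tokens = side * side
    return tokens, tokens * window * window


for size in (224, 448, 896):
    tokens, full = global_attention_pairs(size)
    _, local = windowed_attention_pairs(size)
    print(f"{size}px: {tokens} tokens, {full} global pairs, {local} windowed pairs")
```

Doubling the side length quadruples the token count and multiplies global pairs by 16, while windowed pairs grow only in proportion to the token count.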

The essentials

① In one line: cut the image into 16×16-pixel patches as "words" and feed a standard Transformer, almost unchanged; with enough pretraining data, carrying no visual priors turns out stronger.

② The pain: vision was CNN-monopolized; earlier attention attempts either mixed in convolutions or ran slowly; per-pixel attention is ruled out by quadratic cost.

③ Mechanism one: patch embeddings. (224/16)² = 196 patches, each linearly projected to a vector, + position embeddings + a [CLS] token; sequence length drops from about 50,000 to 196.

④ Mechanism two: strip out the CNN's locality/translation/hierarchy priors and buy them back with large-scale pretraining — lose at 1.3M images, tie at 14M, overtake at 300M.

⑤ Results: ViT-H/14 hits 88.55% on ImageNet, beating the strongest CNNs; ViT-L/16 overtakes BiT-L with roughly 1/15 the pretraining compute; the public-data recipe reaches 85.30%.

⑥ The model grows its own habits — low layers mix local-looking and global-looking heads — so priors need not be welded in; data can teach them.

⑦ Impact: one architecture for vision and language; CLIP / MAE / DINO / SAM / DiT all stand on it — the eyes of multimodal models.

⑧ Limits: underperforms on small data (DeiT later fixed this with tricks); quadratic cost at high resolution (Swin brought the priors back); ConvNeXt questions whether the architecture was even necessary.