CS PAPERS DEEP-READ · PAPER 5

An Image is Worth 16×16 Words (ViT)

Dosovitskiy et al. · Google Research · ICLR 2021


What did this paper do?

In 2020, a team at Google proposed ViT (the Vision Transformer) and did something that sounded a little reckless at the time: they took the Transformer, the kind of brain behind ChatGPT and originally built for text, and pointed it at images with almost no changes. The only image-specific steps are cutting the picture into squares and tagging each square with its position. The result: as long as it first sees enough pictures, it matches or beats the best CNNs, the reigning champions that had ruled computer vision for eight years. Today, when you ask an AI to "look at a picture and talk about it," the eyes it uses are most likely a ViT or a close descendant.

An analogy

The old way of seeing, the CNN, reads a painting with a magnifying glass: get up close to make out a small patch of brushwork, step back to assemble patches into parts, step back again to assemble the whole. "Look at neighbors first, near before far" is a rule carved into its bones. ViT throws away the magnifying glass: it cuts the whole painting into a hundred-plus little squares (196 of them for a 224×224 image cut into 16×16 squares), lays them all out on the table, and reads them in one go, like reading a sentence of a hundred-plus words. Any two squares can lock eyes directly: the furry one in the top-left and the tail-shaped one in the bottom-right can huddle at first glance and ask, "put us together, are we a cat?"

What's new

In the old world, text had text models and images had image models, each with its own inherited craft, mutually incompatible. ViT said: a picture can be read as a "sentence" too. Treat the little squares as words, and seeing and reading can share one brain. The weight of that sentence only became fully clear years later: once it's the same brain, images and words can genuinely "think together." The picture-reading ChatGPT, and many of the AIs that paint from a one-line prompt, build on this step.

How it works

Two steps. Step one, turn the image into "words": slice the picture into small squares, compress each into a string of numbers — like making a flashcard out of a word — and stamp each card with a seat number saying "you originally sat in this row and column." Without the stamps, the model would just see a bag of shuffled puzzle pieces. Step two, run the flashcards through a Transformer: in every layer, each piece sizes up all the other pieces and trades notes — the furry piece gets confirmation from the pointy-ear piece — and layer by layer, scattered clues assemble into a verdict: this is a cat.
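The two steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the weights are random rather than trained, the embedding width `dim = 64` is an arbitrary choice, the "seat number" is one vector per patch index, and the class token, multiple heads, MLP blocks, layer norm, and residual connections of a real ViT are all left out. The function names `patchify` and `self_attention` are made up for this example.

```python
import numpy as np

def patchify(image, patch=16):
    """Cut an (H, W, C) image into flattened patch x patch squares."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly"
    rows, cols = h // patch, w // patch
    # Split both axes into (block index, offset inside block), then
    # bring the two block indices to the front.
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)

def self_attention(x, wq, wk, wv):
    """Single-head attention: every token scores every other token."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ v, weights

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real picture
dim = 64                            # assumed embedding width

# Step one: squares -> "flashcards" with a seat number added.
patches = patchify(image)                                  # (196, 768)
w_embed = rng.normal(0.0, 0.02, (patches.shape[1], dim))   # untrained projection
pos = rng.normal(0.0, 0.02, (patches.shape[0], dim))       # untrained position table
tokens = patches @ w_embed + pos                           # (196, 64)

# Step two: one round of "every piece trades notes with every other piece".
wq, wk, wv = (rng.normal(0.0, 0.02, (dim, dim)) for _ in range(3))
mixed, weights = self_attention(tokens, wq, wk, wv)

print(patches.shape, tokens.shape, weights.shape)
```

Each row of `weights` says how much one square listens to each of the 196 squares, including itself. A trained model stacks many such rounds, which is the "layer by layer" part of the description above.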

Why the "blank slate" won

A CNN is born with a "how to look at pictures" manual: check your neighbors first, ignore the far side. When pictures are scarce, the manual is a crutch — the CNN learns fast and steady, while the blank-slate ViT stumbles and scores worse. But once pictures are plentiful (hundreds of millions), the manual becomes a cage: some clues can only be seen by spanning the whole image, and the blank slate, with no rules fencing it in, figures all of that out by itself — and overtakes. One honest note on the cost: this overtaking required feeding it hundreds of millions of images and a mountain of compute, far beyond an ordinary lab at the time — the data-frugal recipes came later, from others.
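To put a rough number on "check your neighbors first," here is a small back-of-the-envelope sketch. It assumes the simplest possible CNN, a plain stack of 3×3 convolutions with stride 1 and no pooling. Real CNNs use strides and pooling to widen their view much faster, so treat this as an illustration of the locality rule, not a model of any actual network. The helper names are made up for this example.

```python
import math

def conv_receptive_field(layers, kernel=3):
    """Side length, in pixels, that one output unit can see after
    stacking `layers` stride-1 convolutions of the given kernel size."""
    return 1 + layers * (kernel - 1)

def layers_to_span(size, kernel=3):
    """Fewest stacked stride-1 conv layers whose view covers `size` pixels."""
    return math.ceil((size - 1) / (kernel - 1))

print(conv_receptive_field(1))   # 3
print(conv_receptive_field(5))   # 11
print(layers_to_span(224))       # 112
```

Under these assumptions, a unit needs 112 layers before it can see both edges of a 224-pixel-wide image. In a ViT, a 224×224 image cut into 16×16 squares gives 14×14 = 196 squares, and every square can already attend to all 196 in the very first layer. That freedom is the "blank slate": nothing tells the model which squares matter, so it has to learn that from data.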

Remember one thing

Cut the image into little squares and feed it to a Transformer as a sentence; with enough pictures, carrying no inherited rules about how to see turns out to be stronger. From then on, seeing and reading share one brain — this paper pushed open the door to multimodal AI.
