CS PAPERS DEEP-READ · PAPER 1

Attention Is All You Need (Transformer)

Vaswani et al. · Google · NeurIPS 2017

What did this paper do?

In 2017 a team at Google proposed a new neural network design (a new "brain layout") called the Transformer. Most of the AI you've heard of, including ChatGPT, other chatbots, and many AI image generators, is built on it. This paper is its birth certificate.

An analogy first

Read this: "The cat didn't cross the street because it was tired." You instantly know "it" = the cat, not the street — because when you hit "it," you glance back over the whole sentence and judge what's most related.

That's exactly what the Transformer does: it lets every word glance back over the whole sentence and decide for itself what to focus on. That "share attention by relevance" move is called attention — which is what the title means by "attention is all you need."

What's actually new?

Before it, machines (recurrent neural networks) read a sentence like passing a message down a line: one word at a time, information relayed step by step. In a long sentence, the early words get garbled or lost by the time the end is reached, and the reading can't go fast because it's strictly one-after-another.

The Transformer swaps "passing down a line" for a meeting: all the words on the table at once, each looking at every other word, reaching the farthest word in a single step. Two wins: ① fast (everyone looks in parallel, no queue) and ② good long-range memory (even distant words are one step away, so far less gets lost in relay).

So how does it decide "who to listen to"?

Every word carries two little tags: one says "what I'm looking for" (e.g., "it" is looking for a noun mentioned earlier), the other says "what I am" (e.g., "cat" is an animal noun). The paper calls the first tag the query and the second the key. A word takes its query and checks it against every other word's key: the better the match, the more it listens to that word. The query of "it" ("looking for a noun") matches the key of "cat" ("is a noun"), so "it" mostly listens to "cat." What it actually takes in from each word is a third tag, the value, blended in proportion to those match scores. That's the whole trick: attention shared by how well things match, nothing mystical.
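
To make the tag-matching concrete, here is a minimal sketch in plain Python of one word's attention, following the paper's scaled dot-product recipe (score = query · key / √d, then softmax, then a weighted mix of values). The word list, the 2-number tags, and the function names are made up for illustration; a real model learns its tags and uses hundreds of numbers per tag.

```python
import math

def softmax(scores):
    """Turn raw match scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score one query against every key, then blend the values by those scores."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, mixed

# Hypothetical hand-picked tags (not learned by any model).
words = ["cat", "street", "tired"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]]      # "what I am"
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # what each word passes along
it_query = [4.0, 0.0]                            # "what I'm looking for", for the word "it"

weights, mixed = attend(it_query, keys, values)
print(dict(zip(words, (round(w, 2) for w in weights))))
```

With these toy numbers the largest weight lands on "cat", because its key lines up best with the query of "it". The weights always sum to 1, so attention is a way of dividing a fixed budget of listening among the words.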

And it doesn't look just one way: it uses several sets of tags at once, from several angles (for example, one might end up tracking grammar, another who-refers-to-whom, another tone), then combines them. That's multi-head attention; the paper's base model uses 8 heads, and what each head tracks is learned, not assigned by hand. And since all the words sit "on the table at once" with no built-in order, the model also gives each one a "seat number" (a positional encoding) so it knows what comes before what.
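
The "seat number" can be sketched in a few lines. The paper's version builds it from sine and cosine waves of different wavelengths, so every position gets its own pattern of numbers, which is then added to the word's vector. The function name and the tiny 4-number word vector below are assumptions for illustration only.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal seat number for one position, per the paper's sin/cos formula."""
    encoding = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair uses a different wavelength.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]

# Hypothetical 4-number word vector; real models use far larger ones.
word_vector = [0.5, -0.1, 0.3, 0.8]

# The same word sitting in seat 0 and in seat 3 ends up with different inputs.
in_seat_0 = [w + p for w, p in zip(word_vector, positional_encoding(0, 4))]
in_seat_3 = [w + p for w, p in zip(word_vector, positional_encoding(3, 4))]
print(in_seat_0)
print(in_seat_3)
```

Because the seat number is simply added to the word's vector, the attention step needs no special handling of order: two copies of the same word in different seats just look slightly different to it.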

There's a cost too: every word has to look at every other word, so the compute grows with the square of the length. Twice as many words means roughly four times the work. Very long texts strain it, which is exactly what a lot of later work tries to fix.
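
A quick bit of arithmetic shows how fast that grows. This sketch only counts the match scores a single attention head has to compute (one per pair of words); the function name is made up for illustration, and real costs also depend on tag size, number of heads, and number of layers.

```python
def pairwise_scores(num_words):
    """Match scores one attention head computes: every word against every word."""
    return num_words * num_words

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} words -> {pairwise_scores(n):>12,} match scores")
```

Going from 1,000 words to 10,000 words multiplies the text by 10 but the match scores by 100, from one million to one hundred million.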

What did it lead to?

Because it's both fast and retentive, people found that just making it bigger and feeding it more data makes it smarter — giving us BERT, GPT, and today's models that chat, code, and draw. The heart of nearly every powerful AI today is this "meeting" mechanism from this paper.

Remember one thing

Let every word glance over the whole sentence and decide what to focus on — this "attention" move replaces the old "passing a message down a line," so AI reads fast and remembers distant relationships. This structure grew into almost every large model.
