CS PAPERS DEEP-READ · PAPER 4

Word2Vec (Word Embeddings)

Mikolov et al. · Google · ICLR Workshop 2013


What did this paper do?

In 2013, Tomas Mikolov's team at Google released Word2Vec: a method that lets a computer read mountains of text on its own and assign every word a "meaning coordinate." The magic is that these coordinates can do arithmetic: "king − man + woman ≈ queen." The idea of turning each word into a vector, which sits behind the search and recommendation systems you use today and is the very first step ChatGPT takes to understand your words, went mainstream with this paper.

First, a puzzle

Before this, words were just ID numbers to a computer: "cat" is #4102, "dog" is #9527, "fridge" is #233. To the machine, the relationship between "cat" and "dog" was exactly the same as between "cat" and "fridge" — none at all. To teach it that cats and dogs are both pets, humans had to write dictionary entries by hand: never finished, always behind on new words, and useless in another language.
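To make the puzzle concrete, here is a minimal sketch of "words as ID numbers." The three IDs are the illustrative ones from the paragraph above; the vocabulary size and the helper names are assumptions made up for this example. Each word becomes a vector that is 1 at its own ID and 0 elsewhere, and any two different words then have zero overlap.

```python
# Toy vocabulary: the IDs are the illustrative numbers from the text.
word_ids = {"cat": 4102, "dog": 9527, "fridge": 233}

VOCAB_SIZE = 10_000  # hypothetical vocabulary size, chosen to fit the IDs


def one_hot(word):
    """Return a vector that is 1 at the word's ID and 0 everywhere else."""
    vec = [0] * VOCAB_SIZE
    vec[word_ids[word]] = 1
    return vec


def dot(a, b):
    """Overlap between two vectors: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


cat_dog = dot(one_hot("cat"), one_hot("dog"))
cat_fridge = dot(one_hot("cat"), one_hot("fridge"))
cat_cat = dot(one_hot("cat"), one_hot("cat"))
print(cat_dog, cat_fridge, cat_cat)  # 0 0 1
```

"Cat" overlaps with itself and with nothing else, so the representation cannot say that a cat is any more like a dog than like a fridge.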

The idea

Word2Vec's idea could be called "birds of a feather": a word's meaning hides in which words it tends to hang out with. "Cat" and "dog" both show up next to "feed," "shedding," and "vet," so their meanings are close; "fridge" keeps company with "fresh" and "plug in." So give every word a home on a giant "map of meaning": words with similar neighbors live close together. And nobody has to draw this map by hand: the machine draws it itself, just by reading.
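The "similar neighbors" intuition can be checked by simply counting neighbors, before any learning happens. The sketch below is not Word2Vec itself, only a plain co-occurrence count followed by cosine similarity. The corpus, the window size, and the small stop-word list are all assumptions invented for the illustration.

```python
import math
from collections import Counter, defaultdict

# A tiny made-up corpus; real training text would be billions of words.
sentences = [
    "feed the cat", "feed the dog",
    "the cat is shedding", "the dog is shedding",
    "take the cat to the vet", "take the dog to the vet",
    "walk the dog",
    "plug in the fridge", "fresh food in the fridge",
]
STOP_WORDS = {"the", "is", "to", "in"}  # assumed filler words to ignore
WINDOW = 2  # assumed: how many words on each side count as neighbors


def neighbor_counts(sentences, window):
    """For every word, count which other words appear within the window."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = [w for w in sentence.split() if w not in STOP_WORDS]
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b)


counts = neighbor_counts(sentences, WINDOW)
sim_cat_dog = cosine(counts["cat"], counts["dog"])
sim_cat_fridge = cosine(counts["cat"], counts["fridge"])
print(f"cat~dog: {sim_cat_dog:.2f}  cat~fridge: {sim_cat_fridge:.2f}")
```

In this toy corpus "cat" and "dog" share almost all of their neighbors while "cat" and "fridge" share none, so the first similarity comes out much higher than the second.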

How is the map drawn?

Through a game of fill-in-the-blank played billions of times. Cover up one word in a sentence and have the machine guess it from the surrounding words. When it guesses wrong, nudge the relevant words to slightly different spots on the map so the next guess comes easier. Sentence after sentence, nudge after nudge: after billions of words, words used in similar ways have been pushed together. Even better, directions on the map take on meaning: the path from "man" to "woman" points roughly the same way, for roughly the same distance, as the path from "king" to "queen," which is exactly why that famous arithmetic works.
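The guess-and-nudge loop described above can be sketched in a few dozen lines. This is a toy version under stated assumptions, not the paper's implementation: it scores every word in a nine-word vocabulary with a plain softmax, which is a simplification chosen for brevity (practical implementations use cheaper approximations), and the corpus, vector size, window, learning rate, and number of passes are all invented for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

sentences = [s.split() for s in [
    "feed the cat", "feed the dog",
    "the cat sleeps", "the dog sleeps",
    "plug in the fridge", "open the fridge",
]]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
V, DIM, WINDOW, LR, EPOCHS = len(vocab), 8, 2, 0.1, 200  # all assumed values

# Two tables: map coordinates for context words, scoring vectors for guesses.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]
w_out = [[0.0] * DIM for _ in range(V)]


def training_pairs():
    """Yield (context word ids, covered word id) for every position."""
    for words in sentences:
        for i, target in enumerate(words):
            lo, hi = max(0, i - WINDOW), min(len(words), i + WINDOW + 1)
            ctx = [index[words[j]] for j in range(lo, hi) if j != i]
            yield ctx, index[target]


def train_step(ctx, target):
    """One guess plus one nudge; returns the loss for this blank."""
    # Average the context words' coordinates, then score every candidate word.
    hidden = [sum(w_in[c][d] for c in ctx) / len(ctx) for d in range(DIM)]
    scores = [sum(w_out[w][d] * hidden[d] for d in range(DIM)) for w in range(V)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Error signal: predicted probability minus 1 for the right word, 0 otherwise.
    errs = [p - (1.0 if w == target else 0.0) for w, p in enumerate(probs)]
    grad_hidden = [sum(errs[w] * w_out[w][d] for w in range(V)) for d in range(DIM)]
    for w in range(V):  # nudge the scoring vectors
        for d in range(DIM):
            w_out[w][d] -= LR * errs[w] * hidden[d]
    for c in ctx:  # nudge the context words' spots on the map
        for d in range(DIM):
            w_in[c][d] -= LR * grad_hidden[d] / len(ctx)
    return -math.log(probs[target])


def epoch_loss():
    """Play the game once over the whole corpus and total the losses."""
    return sum(train_step(ctx, t) for ctx, t in training_pairs())


first_loss = epoch_loss()
for _ in range(EPOCHS):
    last_loss = epoch_loss()
print(f"loss per pass: {first_loss:.2f} -> {last_loss:.2f}")
```

The total loss falls as the passes accumulate, which is the "next guess comes easier" effect. It cannot reach zero here, because "feed the ___" is genuinely ambiguous between "cat" and "dog" in this corpus.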

What it brought

From then on, "turn anything into coordinates" became AI's universal move: words, sentences, products, users, songs — each gets a home on its own map, and "find similar things" becomes "measure distances." Search, translation, and recommendation all jumped a level, and this is the starting point of the road that leads to BERT and ChatGPT. One honest caveat: each word gets only one fixed spot on the map, so "apple" the fruit and "Apple" the company are stuck in the same place — a problem left for later models to solve.
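Once words have coordinates, "find similar things" and the famous analogy both reduce to distance lookups. The sketch below uses hand-picked three-dimensional coordinates, not learned vectors. The axis meanings, the numbers, and the `nearest` helper are assumptions made up for this example, and leaving the query words out of the analogy search is a choice made here for the illustration.

```python
import math

# Hand-picked coordinates, NOT learned: axes are roughly
# (royalty, femaleness, appliance-ness).
coords = {
    "king":   (0.9, 0.2, 0.0),
    "queen":  (0.9, 0.8, 0.0),
    "man":    (0.1, 0.2, 0.0),
    "woman":  (0.1, 0.8, 0.0),
    "fridge": (0.0, 0.5, 1.0),
}


def nearest(point, exclude=()):
    """Return the word whose coordinates are closest to the given point."""
    candidates = (w for w in coords if w not in exclude)
    return min(candidates, key=lambda w: math.dist(coords[w], point))


# "Find similar things" becomes "measure distances."
closest_to_man = nearest(coords["man"], exclude={"man"})

# king - man + woman: walk the man-to-woman path starting from king.
target = tuple(
    k - m + w
    for k, m, w in zip(coords["king"], coords["man"], coords["woman"])
)
analogy = nearest(target, exclude={"king", "man", "woman"})

print(closest_to_man, analogy)  # woman queen
```

With learned vectors the arithmetic lands only near "queen," not exactly on it, which is why the result is read off with a nearest-neighbor search.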

Remember one thing

Have a machine read mountains of text playing fill-in-the-blank, and give every word a home on a map of meaning: similar words live close, relationships become directions, and even "king − man + woman ≈ queen" computes. The era of "everything can be coordinates" starts here.
