CS PAPERS DEEP-READ · PAPER 3

AlexNet — Seeing with Deep Convolutional Nets

Krizhevsky, Sutskever, Hinton · University of Toronto · NeurIPS 2012


What did this paper do?

In 2012, three researchers at the University of Toronto (Krizhevsky, Sutskever, and their advisor Hinton) built a neural network called AlexNet and entered it in the most prestigious image-recognition contest of the day, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The task: name what's in a photo (a cat, a dog, or some kind of mushroom) across 1000 classes. It won by a landslide: the correct label was missing from its five best guesses only 15.3% of the time, against 26.2% for the runner-up. That single result lit the fuse on more than a decade of the deep learning revolution: today's face unlock, auto photo tagging, and self-driving perception all trace back here.

How the old world recognized images

Before this, getting a computer to recognize images relied on experts hand-writing rules: people first racked their brains to design formulas for "what counts as an edge, a corner, a texture," squeezed the image into a string of numbers, and handed that to a classifier. The trouble: those rules were a human's best guess, they often broke on a new task, and they buckled at the thousand poses, lighting conditions, and occlusions a real cat comes in. For years, machine image recognition was stuck.

What was new

AlexNet flipped it around: instead of humans writing the rules, let the machine "see" the patterns itself from about 1.2 million labeled images. It's a deep network that looks layer by layer: the bottom layers learn to spot edges and color patches, the middle layers assemble parts like eyes and wheels, the top layers put together "this is a cat." The entire skill of "how to recognize" was trained out of the data by the network; none of the recognition rules were hand-written.
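To make "a bottom layer that spots edges" concrete, here is a minimal sketch in plain Python of the sliding-filter operation (convolution) that such layers are built from. The function name conv2d_valid, the tiny 4x4 image, and the filter values are all illustrative assumptions. In AlexNet the filter values are learned from data, not set by hand as they are here.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    At each position, multiply overlapping values and add them up.
    This is the cross-correlation that deep learning calls convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Tiny grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# Hand-picked filter for illustration only: it responds where
# brightness rises from left to right (a vertical edge).
vertical_edge = [[-1, 1],
                 [-1, 1]]

response = conv2d_valid(image, vertical_edge)
for row in response:
    print(row)
```

The response is largest exactly where the dark half meets the bright half and zero over the flat regions. Training amounts to adjusting filter values like these until the responses are useful for telling the classes apart.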

How it pulled this off

Three things made it actually run for the first time:

① Train on gaming graphics cards. A network this big takes a staggering amount of computation, far too much for an ordinary processor to finish in reasonable time. They switched to GPUs: chips built for game graphics and naturally good at "doing thousands of small calculations at once," which is exactly what a neural network craves. With the network split across two GTX 580 cards, training took five to six days.

② A snappier "switch." Every unit in the network must decide "let this signal through or not." The old switches were sluggish and got harder to train the deeper you went; they swapped in a dead-simple one, called ReLU: block anything negative, pass anything positive through untouched. In a small test reported in the paper, a network with the new switch reached the same training error about six times faster.
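The simple switch is the rectified linear unit, ReLU: max(0, x). A rough way to see why the older, S-shaped switches were sluggish is to compare slopes, since the slope is what carries the learning signal back through a unit. The sketch below uses tanh as the stand-in for an older switch; the function names are illustrative, not taken from the paper's code.

```python
import math

def relu(x: float) -> float:
    """Block negatives, pass positives through unchanged."""
    return max(0.0, x)

def relu_slope(x: float) -> float:
    """Slope of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_slope(x: float) -> float:
    """Slope of tanh: 1 - tanh(x)^2, which shrinks toward 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, 0.5, 3.0, 6.0):
    print(f"x={x:+.1f}  relu={relu(x):.1f}  "
          f"relu_slope={relu_slope(x):.1f}  tanh_slope={tanh_slope(x):.6f}")
```

For large positive inputs the tanh slope collapses toward zero while the ReLU slope stays at 1, so a strongly activated unit keeps passing a usable learning signal backward.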

③ Two tricks against rote memorization. A network this powerful tends to memorize the training images and then flounder on new ones. One trick: during training, randomly send some units "on leave," forcing the rest to learn real skills and not lean on each other. The other: flip, shift, and slightly recolor each image to conjure up many more training images.
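Both tricks fit in a few lines. The sketch below is a simplified illustration in plain Python, not the authors' code: the helper names are hypothetical and images are plain lists of rows. The drop probability of 0.5 matches what the paper reports for its fully connected layers, where outputs were scaled by 0.5 at test time to compensate. The paper's recoloring step is left out here for brevity.

```python
import random

def dropout(activations, p_drop=0.5, rng=random):
    """Training only: zero each unit independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def scale_for_test(activations, p_drop=0.5):
    """Test time: keep every unit, but scale outputs by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

def flip_horizontal(image):
    """Mirror an image (a list of rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng=random):
    """Cut a size x size patch at a random offset, which shifts the content."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

demo_rng = random.Random(0)  # fixed seed so the demo is repeatable
units = [0.8, 0.1, 0.5, 0.9, 0.3, 0.7]
print("training pass:", dropout(units, rng=demo_rng))
print("test pass:    ", scale_for_test(units))

picture = [[r * 4 + c for c in range(4)] for r in range(4)]
print("mirrored:", flip_horizontal(picture))
print("cropped: ", random_crop(picture, size=3, rng=demo_rng))
```

Because a different random subset of units is silenced on every training pass, no unit can rely on a particular neighbor being present. The flips and crops mean the network rarely sees exactly the same pixels twice.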

What it brought

It slashed image-recognition error in one stroke, and made it clear to everyone: rather than asking experts to write rules, feed data and compute to a deep-enough network and let it learn on its own. That idea became the shared belief of the whole AI field afterward. And one honest note: AlexNet rested on no brand-new theory. It fused a few existing ideas with massive data and GPU horsepower at just the right moment. It was a victory of scale, at the cost of being hungry for both data and compute.

Remember one thing

AlexNet let a deep network learn "how to recognize images" by itself from a million pictures — made it run for the first time via GPU training, a snappy switch, and two anti-memorization tricks — and won the 2012 image contest by a landslide. From there, AI took the "deep learning" road.
