CS PAPERS DEEP-READ · PAPER 2

Deep Residual Learning (ResNet)

He et al. · Microsoft Research · CVPR 2016


What did this paper do?

In 2015 a team at Microsoft Research Asia (Kaiming He and colleagues) proposed ResNet (residual networks). With one tiny change, it took neural networks from "about 20 layers at most" to a hundred layers and beyond (152 in the winning ImageNet model), with accuracy still improving as depth grew. It won that year's ImageNet image-recognition contest (ILSVRC 2015) with a top-5 error rate of 3.57%, achieved by an ensemble of ResNets. Today almost every deep network, including the previous paper's Transformer and the models behind ChatGPT, uses this little change.

First, a puzzle

Intuitively, a deeper network should be smarter (it can see more levels of pattern). Yet people found the opposite: stack too many layers and it gets worse. In the paper's own experiment, a plain 56-layer network did worse than a plain 20-layer one. The strange part is that this wasn't "memorizing the training questions" (overfitting): the deeper network couldn't even do the training questions well. That's doubly counterintuitive, because a deeper network should be no worse: the extra layers could just do nothing and pass information through. But it couldn't even learn to "do nothing."

The tiny change

ResNet's idea is wonderfully plain: don't ask each layer to "repaint the whole picture" — ask it only "what to change," then add that change back onto the original. Give each small stretch of the network a "bypass lane" (a skip connection): the original information flows through the bypass untouched, and the layers in between only learn a small tweak, which is added back onto the original at the end.
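
To make the "bypass lane" concrete, here is a minimal sketch in plain Python. It assumes a toy stretch made of two tiny linear layers with a ReLU between them; the names `residual_block`, `w1`, `w2` and the hand-picked weights are illustrative, not taken from the paper, and the paper's real blocks use convolutions and apply a further ReLU after the addition.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = x + F(x), where F is two small linear layers with a ReLU between."""
    tweak = linear(w2, relu(linear(w1, x)))   # the learned "edit" F(x)
    return [a + t for a, t in zip(x, tweak)]  # bypass lane: add the edit onto x

x = [1.0, 2.0]
w1 = [[0.5, 0.0], [0.0, 0.5]]    # hypothetical weights, chosen by hand
w2 = [[0.2, 0.0], [0.0, -0.2]]
y = residual_block(x, w1, w2)    # roughly [1.1, 1.8]: the input, nudged slightly
```

The output stays close to the input because the layers only contribute the small `tweak`; the bulk of `x` arrives through the addition.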

An analogy: editing a document — you don't recopy the whole page, you just mark "add a line here, delete a word there" on the original. Small edits, and you never lose the original text. ResNet has each layer learn only that set of edits (the edits are called the residual).

The payoff is immediate: if a stretch needs no change, it just learns "blank edits," and the original flows through the bypass intact, so in principle adding layers need not make things worse. And "learn to output nothing" is far easier than "learn to copy the input exactly from scratch," so the puzzle is solved.

What exactly do the "edits" learn?

First, why is "learning blank edits" easy while "copying exactly" is hard? Every layer is a kneading, twisting transformation — asking it to reproduce its input untouched is like asking five interpreters to relay a sentence through five languages and return the exact original wording. Nearly impossible. Whereas "change nothing" just means turning every knob to zero. The genius of the bypass is unloading the "keep it as is" burden off the layers and onto a plain wire.
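
The "every knob to zero" point can be checked with a small sketch. It assumes each stretch is a single linear layer followed by a ReLU; `plain_stretch`, `bypass_stretch` and the sample input are made-up names and values for illustration only.

```python
def layer(weights, v):
    """One plain layer: matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * a for w, a in zip(row, v))) for row in weights]

def plain_stretch(x, w):
    return layer(w, x)                               # must reproduce x by itself

def bypass_stretch(x, w):
    return [a + t for a, t in zip(x, layer(w, x))]   # x plus a learned tweak

x = [3.0, -1.0, 2.0]
zeros = [[0.0] * 3 for _ in range(3)]                # every knob turned to zero
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

plain_out = plain_stretch(x, zeros)        # [0.0, 0.0, 0.0]: the input is lost
bypass_out = bypass_stretch(x, zeros)      # [3.0, -1.0, 2.0]: the input survives
copy_attempt = plain_stretch(x, identity)  # [3.0, 0.0, 2.0]: ReLU clips the -1.0
```

Even with perfect identity weights, this plain layer cannot copy the negative entry, because its ReLU sets it to zero. With the bypass, the all-zero setting already passes everything through.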

So what's actually written in the edits? A deep network builds understanding in stages: early layers spot edges and lines, middle layers recognize "that's an ear, that's a wheel," and later layers finally assemble "that's a cat." With the bypass, each stretch doesn't rebuild the whole understanding — it only adds the bit it newly noticed on top of what's already there. The "edits" are exactly that layer's new insight. More layers just means more rounds of "look again, understand a little more."

And one honest note: the trick isn't limitless — past a few hundred layers the gains shrink; later research even found such networks behave more like "many shallow paths of different lengths voting together" than one genuinely deep path walked end to end.
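
The "many paths" picture can be counted with a rough sketch. It assumes the simplified view in which each block is either skipped (via the bypass) or walked through, so a network of n blocks has 2^n possible paths; the function name and the choice of 10 blocks are illustrative.

```python
from math import comb

def path_length_counts(num_blocks):
    """Number of paths that walk through exactly k blocks, for k = 0..num_blocks."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]

counts = path_length_counts(10)   # [1, 10, 45, 120, 210, 252, 210, 120, 45, 10, 1]
total = sum(counts)               # 1024 paths in all, i.e. 2 ** 10
short_paths = sum(counts[:6])     # 638 paths pass through 5 blocks or fewer
full_depth_paths = counts[-1]     # only 1 path walks through all 10 blocks
```

Under this assumption, most paths are of middling length and only one of the 1024 goes through every block.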

So how does it actually help?

Two wins. First, adding layers is far less risky: learn a tweak if needed, or let the bypass pass things through if not, so the network can stack to hundreds of layers and still improve. Second, the learning signal flows better: to train, a "correction signal" (the gradient) must travel from the last layer all the way back to the front to adjust the weights; in a very deep network that signal tends to fade the farther it goes (so the front layers can't learn). The bypass gives that signal a "highway" straight back to the early layers.
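
The second win can be illustrated with a toy calculation. It assumes every layer is a single number with the same small slope and ignores nonlinearities, so the backward signal is simply a product of per-layer factors; `backward_signal` and the values 0.01 and 50 are invented for this sketch.

```python
def backward_signal(slopes, bypass):
    """Multiply per-layer derivatives, as the chain rule does going backward."""
    signal = 1.0
    for s in slopes:
        # Plain layer y = s * x has derivative s.
        # Layer with bypass y = x + s * x has derivative 1 + s.
        signal *= (1.0 + s) if bypass else s
    return signal

slopes = [0.01] * 50                           # 50 layers, each a small tweak
plain = backward_signal(slopes, bypass=False)  # vanishingly small
resid = backward_signal(slopes, bypass=True)   # stays of order 1
```

The extra "1" contributed by each bypass is what keeps the product from collapsing toward zero in this toy setting.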

Remember one thing

Give each stretch of the network a "bypass" so the original flows through untouched, and have the layers learn only a "what to change" tweak added back on top. Extra layers then rarely hurt, you can stack hundreds of them, and the correction signal flows freely. This "skip connection" is now standard in almost every deep network (including the Transformer).
