CS PAPERS DEEP-READ · PAPER 2
He et al. · Microsoft Research · CVPR 2016
In 2015 a team at Microsoft Research Asia (Kaiming He and colleagues) proposed ResNet (residual networks). With one tiny change, it took neural networks from "about 20 layers at most" to hundreds of layers that keep getting better. It won that year's ImageNet image-recognition contest with an error rate as low as 3.57%. Today almost every deep network — including the previous paper's Transformer, and ChatGPT — uses this little change.
Intuitively, a deeper network should be smarter (it can see more levels of pattern). Yet people found the opposite: stack too many layers and it gets worse. The strange part is that this wasn't "memorizing the training questions" (overfitting) — it couldn't even do the training questions well. That's doubly counterintuitive: a deeper network should be no worse, since the extra layers could just do nothing and pass information through. But it couldn't even learn to "do nothing."
ResNet's idea is wonderfully plain: don't ask each layer to "repaint the whole picture" — ask it only "what to change," then add that change back onto the original. Give each small stretch of the network a "bypass lane" (a skip connection): the original information flows through the bypass untouched, and the layers in between only learn a small tweak, which is added back onto the original at the end.
An analogy: editing a document — you don't recopy the whole page, you just mark "add a line here, delete a word there" on the original. Small edits, and you never lose the original text. ResNet has each layer learn only that set of edits (the edits are called the residual).
The payoff is immediate: if a stretch needs no change, it just learns "blank edits," and the original flows through the bypass intact, so in principle extra layers should no longer make things worse. And "learn to output nothing" is far easier than "learn to copy the input exactly from scratch," which is what resolves the puzzle.
First, why is "learning blank edits" easy while "copying exactly" is hard? Every layer is a kneading, twisting transformation — asking it to reproduce its input untouched is like asking five interpreters to relay a sentence through five languages and return the exact original wording. Nearly impossible. Whereas "change nothing" just means turning every knob to zero. The genius of the bypass is unloading the "keep it as is" burden off the layers and onto a plain wire.
So what's actually written in the edits? A deep network builds understanding in stages: early layers spot edges and lines, middle layers recognize "that's an ear, that's a wheel," and later layers finally assemble "that's a cat." With the bypass, each stretch doesn't rebuild the whole understanding — it only adds the bit it newly noticed on top of what's already there. The "edits" are exactly that layer's new insight. More layers just means more rounds of "look again, understand a little more."
And one honest note: the trick isn't limitless — past a few hundred layers the gains shrink; later research even found such networks behave more like "many shallow paths of different lengths voting together" than one genuinely deep path walked end to end.
Two wins. First, adding layers is no longer risky: learn a tweak if needed, or let the bypass pass things through if not — so the network can safely stack to hundreds of layers and keep improving. Second, the learning signal flows better: to train, a "correction signal" must travel from the last layer all the way back to the front to adjust the weights; in a very deep network that signal fades the farther it goes (so the front layers can't learn). The bypass gives that signal a "highway" straight back to the early layers.
Give each stretch of the network a "bypass" so the original flows through untouched, and have the layers learn only a "what to change" tweak added back on top. Extra layers then stop hurting, you can stack hundreds of them, and the correction signal flows freely. This "skip connection" is now standard in almost every deep network (including the Transformer).
Want the block diagram, the formula, and the numbers? → switch to the deep read
ResNet uses a nearly parameter-free skip connection so that each stretch of the network learns "the difference between the target and the input" (the residual) instead of the target directly. This addresses the degradation problem, where deeper networks have higher training error, and it allowed convolutional nets to be trained stably at hundreds of layers for the first time, winning ImageNet 2015 (3.57% top-5 error). The skip connection later became a basic building block of nearly every deep network, including the Transformer.
The authors are Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, at Microsoft Research Asia; the paper appeared at CVPR 2016 (released late 2015). It inherits the "make networks ever deeper" line of AlexNet (2012) and VGG (2014), and squarely answers the roadblock of the day: why you can't just naively go deeper. It also set the pattern for nearly every deep network that followed: residual/skip connections became a universal block, and even the "residual connection" around each Transformer sub-layer is the very same idea.
From 2012 on, deeper seemed to always mean stronger (AlexNet's 8 layers → VGG's 19). But going deeper hits two walls. The first is vanishing/exploding gradients — largely eased by good initialization and normalization. The truly stubborn one is the second: degradation.
The authors observed something counterintuitive: going from 20 to 56 layers on CIFAR-10, the training error goes up, not down. Note that this is training error, so it isn't overfitting; the deeper network is simply harder to optimize. That's especially odd because a deeper network has a "no worse" solution by construction: let the extra layers learn an identity mapping (output the input unchanged), and it reduces to the shallower network. In practice, though, the optimizer struggles to make a stack of nonlinear layers learn "copy the input exactly." The problem isn't expressive power; the good solution exists, but optimization fails to find it.
If learning an identity mapping is hard, then hand the identity to the network for free. Say a stretch of layers is meant to learn a target mapping H(x); instead of learning H(x) directly, ResNet has those layers learn the residual F(x) = H(x) − x, then adds the input x back via a skip connection: output = F(x) + x.
In plain terms: the original x flows through the bypass unchanged, and the layers in between learn only "how much to change x." If the optimum is close to identity (no change needed), they just push F(x) toward 0, which is far easier than learning to "copy exactly" from scratch. Identity is provided as a "free fallback," so in principle adding layers should be no worse, and the degradation puzzle goes away.
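The formula output = F(x) + x can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the branch F is a hypothetical linear, ReLU, linear stretch on plain lists, the names matvec, residual_branch, and residual_block are made up for the example, and the ReLU that the paper's block applies after the addition is left out so the identity case is easy to see.

```python
# Toy residual block on plain Python lists (no framework assumed).
# F(x) is a hypothetical two-layer stretch: linear -> ReLU -> linear.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    """Element-wise max(0, a)."""
    return [max(0.0, a) for a in v]

def residual_branch(x, W1, W2):
    """F(x): the 'what to change' part learned by the layers."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """output = F(x) + x: the skip connection adds the input back."""
    fx = residual_branch(x, W1, W2)
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero, F(x) = 0 and the block is exactly the identity.
identity_out = residual_block(x, zeros, zeros)

# Without the skip connection, the same zero weights erase the input.
plain_out = residual_branch(x, zeros, zeros)

print(identity_out)  # [1.0, -2.0, 3.0]
print(plain_out)     # [0.0, 0.0, 0.0]
```

The point of the comparison: for the residual block, "do nothing" is the all-zero weight setting, whereas a plain stack would have to find weights that reproduce x through its nonlinearity.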
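The skip connection simply adds the input element-wise onto the output. When input and output have the same dimensions, this identity shortcut adds no parameters and almost no compute. When the two ends differ in dimension, a single 1×1 convolution projects x to line them up, at the cost of a small number of extra weights.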
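The two shortcut cases can be sketched at a single spatial position, where a 1×1 convolution reduces to a matrix multiply on that position's channel vector. The function names and the projection weights Ws below are assumptions chosen for illustration, not values from the paper.

```python
# Sketch of the identity and projection shortcuts at one spatial position.
# A 1x1 convolution acts on each position's channel vector as a matrix multiply.

def project_1x1(x, Ws):
    """Map channel vector x to a new channel count using weight rows Ws."""
    return [sum(w * v for w, v in zip(row, x)) for row in Ws]

def add_shortcut(fx, x, Ws=None):
    """Return F(x) + x, projecting x with Ws only when one is supplied."""
    shortcut = x if Ws is None else project_1x1(x, Ws)
    if len(shortcut) != len(fx):
        raise ValueError("shortcut and residual must have the same length")
    return [f + s for f, s in zip(fx, shortcut)]

# Same dimension: parameter-free identity shortcut.
same = add_shortcut([0.5, 0.5], [1.0, 2.0])

# 2 channels in, 3 channels out: hypothetical 3x2 projection weights.
Ws = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proj = add_shortcut([0.0, 0.0, 0.0], [1.0, 2.0], Ws)

print(same)  # [1.5, 2.5]
print(proj)  # [1.0, 2.0, 3.0]
```

Only the projection case carries weights (here 3 × 2 = 6 of them), which is why it is reserved for the places where the dimensions change.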
To stack to 50/101/152 layers without exploding the cost, deeper variants use bottleneck blocks: a 1×1 conv narrows the channels, a 3×3 conv does the work on that narrower representation, and a 1×1 conv restores the width, so the expensive 3×3 convolution runs on far fewer channels. Thanks to this, ResNet-152 is 8× deeper than VGG yet has lower computational complexity.
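A quick weight count shows why the bottleneck saves so much. The channel widths below (256 wide, 64 narrow) are assumed for illustration, and biases and normalization parameters are ignored; the comparison is against two full-width 3×3 convolutions on the same 256 channels.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution, ignoring biases and normalization."""
    return k * k * c_in * c_out

wide, narrow = 256, 64  # assumed channel widths for illustration

# Bottleneck: 1x1 narrows, 3x3 works on the narrow width, 1x1 restores.
bottleneck = (conv_params(1, wide, narrow)
              + conv_params(3, narrow, narrow)
              + conv_params(1, narrow, wide))

# Alternative: two 3x3 convolutions kept at full width.
two_wide_3x3 = 2 * conv_params(3, wide, wide)

print(bottleneck)                           # 69632
print(two_wide_3x3)                         # 1179648
print(round(two_wide_3x3 / bottleneck, 1))  # 16.9
```

At a fixed spatial resolution the multiply-add count scales with the weight count, so the same roughly 17× ratio applies to compute under these assumptions.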
There is also a widely held intuition: the skip connection gives gradients a "highway." During backprop the correction signal returns to the early layers along the bypass with almost no decay, which is key to training very deep nets. (This "gradient flows freely" story is the community's common explanation; the paper's own core argument is the "easier optimization / free identity fallback" above.)
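The highway intuition can be illustrated with a deliberately oversimplified toy: a chain of scalar linear layers that all share one small weight. The depth of 50 and the weight 0.01 are assumptions picked to make the effect visible; real networks have nonlinearities and many weights per layer, so this shows only the direction of the effect.

```python
def end_to_end_gradient(depth, w, skip):
    """d(output)/d(input) for a chain of identical scalar layers.

    Plain layer:    y = w * x      -> local derivative w
    Residual layer: y = x + w * x  -> local derivative 1 + w
    """
    local = (1.0 + w) if skip else w
    grad = 1.0
    for _ in range(depth):
        grad *= local  # chain rule: multiply the local derivatives
    return grad

plain = end_to_end_gradient(depth=50, w=0.01, skip=False)
residual = end_to_end_gradient(depth=50, w=0.01, skip=True)

print(plain)     # vanishingly small, on the order of 1e-100
print(residual)  # stays of order 1 (about 1.64)
```

The "+1" contributed by the bypass keeps each local derivative near 1, so the product does not collapse toward zero. The same term can also let the product grow when the weights are large, which is one reason this picture is an intuition and not a guarantee.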
On ImageNet, an ensemble of ResNets (up to 152 layers deep) reached 3.57% top-5 error and won ILSVRC 2015 classification; the same approach also took first place in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The controlled experiments are convincing: a 34-layer plain network has higher training error than an 18-layer one (degradation reproduced), whereas with skip connections the 34-layer ResNet clearly beats the 18-layer, and deeper keeps helping. The authors even trained past a thousand layers on CIFAR-10, although that model tested slightly worse than the 110-layer one, which they attribute to overfitting. "Deeper really does mean better," a conclusion that should have held but had temporarily broken, was restored.
It reopened the road to "deep," with a tool that's extremely general. Skip connections are now everywhere: later vision backbones (ResNeXt, DenseNet, etc.), nearly every modern deep net, and even the residual connection around each Transformer sub-layer use the same idea. You could say that today's ability to stack models hundreds or thousands of layers deep, with hundreds of billions of parameters, rests in part on residual connections. It also made He and colleagues among the most influential researchers in deep learning.
① In one line: add a skip connection to each stretch so it learns only the "residual F(x)=H(x)−x," then add x back — solving deep networks' degradation problem.
② The pain: deeper networks show rising training error (degradation, not overfitting); in theory "extra layers learning identity is no worse," but the optimizer can't learn identity.
③ Mechanism: output = F(x) + x hands over identity for free. If no change is needed, push F to 0, which is far easier than learning to copy from scratch; identity skip connections add no parameters and almost no compute.
④ Deeper variants use bottleneck blocks (1×1→3×3→1×1) to save compute; ResNet-152 is 8× deeper than VGG yet lower in complexity.
⑤ Bonus intuition: skip connections give gradients a "highway" so the correction signal flows freely back.
⑥ Results: 3.57% top-5 error on ImageNet, ILSVRC 2015 winner; controlled experiments reproduce degradation and show residuals fix it; trainable past a thousand layers on CIFAR.
⑦ Impact: skip connections became a standard foundation of nearly every deep network (including the Transformer's residual connections).
⑧ Limits: diminishing returns when very deep; may act more like a "multi-path ensemble"; why it works is still debated; pre-activation and others show the original wasn't optimal.