Deep Residual Learning (ResNet) — CS Papers Deep-Read Paper 2

What did this paper do?

In 2015 a team at Microsoft Research Asia (Kaiming He and colleagues) proposed ResNet (residual networks). With one tiny change, it took neural networks from "about 20 layers at most" to hundreds of layers that keep getting better. It won that year's ImageNet image-recognition contest with an error rate as low as 3.57%. Today almost every deep network — including the previous paper's Transformer, and ChatGPT — uses this little change.

First, a puzzle

Intuitively, a deeper network should be smarter (it can see more levels of pattern). Yet people found the opposite: stack too many layers and it gets worse. The strange part is that this wasn't "memorizing the training questions" (overfitting) — it couldn't even do the training questions well. That's doubly counterintuitive: a deeper network should be no worse, since the extra layers could just do nothing and pass information through. But it couldn't even learn to "do nothing."

The tiny change

ResNet's idea is wonderfully plain: don't ask each layer to "repaint the whole picture" — ask it only "what to change," then add that change back onto the original. Give each small stretch of the network a "bypass lane" (a skip connection): the original information flows through the bypass untouched, and the layers in between only learn a small tweak, which is added back onto the original at the end.

An analogy: editing a document — you don't recopy the whole page, you just mark "add a line here, delete a word there" on the original. Small edits, and you never lose the original text. ResNet has each layer learn only that set of edits (the edits are called the residual).

The payoff is immediate: if a stretch needs no change, it just learns "blank edits," and the original flows through the bypass intact, so in principle extra layers no longer have to make things worse. And "learn to output nothing" is far easier than "learn to copy the input exactly from scratch," which resolves the puzzle.

What exactly do the "edits" learn?

First, why is "learning blank edits" easy while "copying exactly" is hard? Every layer is a kneading, twisting transformation — asking it to reproduce its input untouched is like asking five interpreters to relay a sentence through five languages and return the exact original wording. Nearly impossible. Whereas "change nothing" just means turning every knob to zero. The genius of the bypass is unloading the "keep it as is" burden off the layers and onto a plain wire.

So what's actually written in the edits? A deep network builds understanding in stages: early layers spot edges and lines, middle layers recognize "that's an ear, that's a wheel," and later layers finally assemble "that's a cat." With the bypass, each stretch doesn't rebuild the whole understanding — it only adds the bit it newly noticed on top of what's already there. The "edits" are exactly that layer's new insight. More layers just means more rounds of "look again, understand a little more."

And one honest note: the trick isn't limitless — past a few hundred layers the gains shrink; later research even found such networks behave more like "many shallow paths of different lengths voting together" than one genuinely deep path walked end to end.

So how does it actually help?

Two wins. First, adding layers is no longer risky: learn a tweak if needed, or let the bypass pass things through if not — so the network can safely stack to hundreds of layers and keep improving. Second, the learning signal flows better: to train, a "correction signal" must travel from the last layer all the way back to the front to adjust the weights; in a very deep network that signal fades the farther it goes (so the front layers can't learn). The bypass gives that signal a "highway" straight back to the early layers.

Remember one thing

Give each stretch of the network a "bypass" so the original flows through untouched, and have the layers learn only a "what to change" tweak added back on top. Extra layers then have an easy way to do no harm, you can stack hundreds of them, and the correction signal flows back more easily. This "skip connection" is now standard in almost every deep network (including the Transformer).

In one sentence

ResNet uses a nearly parameter-free skip connection to have each stretch of the network learn "the difference between the target and the input" (the residual) instead of the target directly. This addresses the degradation problem, where deeper networks have higher training error, and made it possible to train convolutional nets stably at depths of a hundred layers and beyond, winning ImageNet 2015 (3.57% top-5 error). Its skip connection later became a basic building block of nearly every deep network, including the Transformer.

Glossary

Neural network / depth: a model built by stacking many "layers" that learn patterns from data; "depth" is the number of layers — more layers express more complex patterns, but are harder to train.
CNN (convolutional neural network): a network great at images, sliding small windows over a picture to extract features layer by layer, from edges to objects. ResNet is a very deep CNN.
Training vs test error: the error rate on "questions it has seen" vs "questions it hasn't." High training error = it never learned; test error far above training error = overfitting.
Overfitting: the model memorized the training data but can't generalize — low training error, high test error.
Vanishing gradient: the signal (gradient) used to adjust weights during training fades layer by layer in a very deep network until it's near zero, so early layers can't learn.
Identity mapping: a mapping that outputs the input unchanged. ResNet's key trick is making "change nothing" easy to learn.
ImageNet / ILSVRC: a benchmark and annual contest of a million-plus images across 1000 classes — the "Olympics" for vision models of that era.

Where it sits

The authors are Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, at Microsoft Research Asia; the paper appeared at CVPR 2016 (released late 2015). It continues the "make networks ever deeper" line of AlexNet (2012) and VGG (2014), and squarely answers the roadblock of the day: why you can't just naively go deeper. It underpins nearly every deep network that followed: residual/skip connections became a universal block, and the "residual connection" around each Transformer sub-layer is the very same idea.

The problem and the motivation

From 2012 on, deeper seemed to always mean stronger (AlexNet's 8 layers → VGG's 19). But going deeper hits two walls. The first is vanishing/exploding gradients — largely eased by good initialization and normalization. The truly stubborn one is the second: degradation.

The authors observed something counterintuitive: going from 20 to 56 layers, the training error goes up, not down. Note that this is training error, so it isn't overfitting; a deeper network is simply harder to optimize. That's especially odd because a deeper network has a "no worse" solution by construction: let the extra layers learn an identity mapping (output the input unchanged), and it reduces to the shallower network. In practice, though, the optimizer struggles to make a stack of nonlinear layers learn "copy the input exactly." The problem isn't expressive power; the solution exists, but optimization fails to find it.

Fig 1 · Degradation: a plain network's training error rises going from 20 to 56 layers — not overfitting, just harder to optimize when deeper.
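
The "no worse by construction" argument can be checked with a toy sketch. This is an illustration only, not the paper's setup: it uses tiny fully connected ReLU layers in plain Python, and the names forward, shallow, and deep and all weight values are made up for the example. Appending identity-weight layers to a shallow network leaves its output unchanged, so a deeper network that is at least as good always exists; degradation means the optimizer fails to find it.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def forward(weights, x):
    """Plain (non-residual) network: every layer is relu(W @ x)."""
    for W in weights:
        x = relu(linear(W, x))
    return x

def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical 2-layer network with hand-picked weights.
shallow = [
    [[0.5, -1.0], [2.0, 0.25]],
    [[1.0, 1.0], [-0.5, 0.75]],
]

# Deeper network: same two layers plus three identity layers.
# relu(I @ h) == h here because h is already non-negative after a ReLU.
deep = shallow + [identity_matrix(2) for _ in range(3)]

x = [1.0, 2.0]
shallow_out = forward(shallow, x)
deep_out = forward(deep, x)
print(shallow_out, deep_out)  # the two outputs are identical
```

The hand-built deep network matches the shallow one exactly; the trouble described above is that gradient-based training of a plain stack does not reliably arrive at such a solution.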

The core idea

Residual learning: learn the "difference," not the whole thing

If learning an identity mapping is hard, then hand the identity to the network for free. Say a stretch of layers is meant to learn a target mapping H(x); instead of learning H(x) directly, ResNet has those layers learn the residual F(x) = H(x) − x, then adds the input x back via a skip connection: output = F(x) + x.

In plain terms: the original x flows through the bypass unchanged, and the layers in between learn only "how much to change x." If the optimum is close to identity (no change needed), they just push F(x) toward 0, which is far easier than learning to "copy exactly" from scratch. Identity is provided as a "free fallback," so extra layers have an easy route to being no worse, and the degradation puzzle largely goes away.

Fig 2 · Residual block: the two layers on the main path learn only the residual F(x); the skip connection adds input x back, giving F(x)+x. Need no change? Just push F(x) to 0.

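The skip connection simply adds the input element-wise onto the output: an identity shortcut has no extra parameters and almost no extra compute. When the two ends differ in dimension, a single 1×1 convolution projects x to line them up; that projection is the one case where the shortcut carries weights of its own.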
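
Here is a minimal sketch of output = F(x) + x. The assumptions: a two-layer fully connected F with a ReLU in between stands in for the paper's convolutional layers, normalization and the final activation are left out, and the function names are invented for this example. It shows that zeroed weights turn a residual block into an exact pass-through, while the same weights in a plain block wipe the input out.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def branch(W1, W2, x):
    """F(x): two linear layers with a ReLU in between (toy stand-in)."""
    return linear(W2, relu(linear(W1, x)))

def plain_block(W1, W2, x):
    """No skip connection: the layers must produce the whole output H(x)."""
    return branch(W1, W2, x)

def residual_block(W1, W2, x):
    """Skip connection: the layers produce only F(x); x is added back."""
    return [f + a for f, a in zip(branch(W1, W2, x), x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.5, -2.0]

plain_out = plain_block(zeros, zeros, x)        # input is lost
residual_out = residual_block(zeros, zeros, x)  # input passes through
print(plain_out, residual_out)
```

With all weights at zero, the residual block returns x itself, which is the "free fallback" described above; the plain block would need carefully tuned weights to achieve the same thing.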

Bottleneck blocks: making "very deep" affordable

To stack to 50/101/152 layers without exploding the cost, deeper variants use bottleneck blocks: a 1×1 conv narrows the channels, a 3×3 conv does the work, and a 1×1 conv restores them, so the expensive 3×3 conv runs only on the narrowed channels. Thanks to this, ResNet-152 is 8× deeper than VGG (152 layers vs 19) yet has lower computational complexity.
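
The saving can be seen with back-of-the-envelope arithmetic. The sketch below counts convolution weights only (biases and normalization ignored); the channel widths 256 and 64 are assumptions chosen for illustration, not figures stated in this text, and conv_params is a helper invented here.

```python
def conv_params(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * c_in * c_out

WIDE, NARROW = 256, 64  # assumed channel widths

# Bottleneck on a 256-channel input: 1x1 narrows, 3x3 works, 1x1 restores.
bottleneck = (
    conv_params(1, WIDE, NARROW)
    + conv_params(3, NARROW, NARROW)
    + conv_params(1, NARROW, WIDE)
)

# Two plain 3x3 convs kept at the full 256 channels.
basic_at_256 = 2 * conv_params(3, WIDE, WIDE)

# Two plain 3x3 convs at only 64 channels.
basic_at_64 = 2 * conv_params(3, NARROW, NARROW)

print(bottleneck)    # 69632
print(basic_at_256)  # 1179648
print(basic_at_64)   # 73728
```

Under these assumed widths, the bottleneck handles 256-channel features for roughly the weight budget of a two-layer block at 64 channels, and about 17 times fewer weights than two 3×3 convs at the full width.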

A second, widely held intuition: the skip connection gives gradients a "highway." During backprop the correction signal returns to the early layers along the bypass with almost no decay, which is key to training very deep nets. (This "gradient flows freely" story is the community's common explanation; the paper's own core argument is the "easier optimization / free identity fallback" above.)
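
The "highway" intuition can be illustrated with a scalar toy, under loud assumptions: each layer is reduced to a single number (its local slope), the slope value 0.01 and the depth 50 are arbitrary, and real networks involve Jacobian matrices rather than scalars. For a plain layer y = F(x) the local derivative is F'(x); for a residual layer y = x + F(x) it is 1 + F'(x). By the chain rule, the gradient reaching the first layer is the product of these local derivatives.

```python
def gradient_at_input(local_slopes, residual):
    """Chain rule across a stack of scalar layers.

    local_slopes: F'(x) for each layer (toy values).
    residual: if True, each layer is x + F(x), so its slope is 1 + F'(x).
    """
    grad = 1.0
    for slope in local_slopes:
        grad *= (1.0 + slope) if residual else slope
    return grad

slopes = [0.01] * 50  # assumed: every layer's own slope is small

plain_grad = gradient_at_input(slopes, residual=False)
residual_grad = gradient_at_input(slopes, residual=True)

print(plain_grad)     # vanishingly small, on the order of 1e-100
print(residual_grad)  # stays on the order of 1
```

The "1 +" term contributed by the bypass is what keeps the product from collapsing in this toy. With larger slopes the residual product can also grow, so this sketch supports the intuition rather than proving anything about real training.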

Key results

On ImageNet, an ensemble of residual nets (up to 152 layers deep) reached 3.57% top-5 error and won ILSVRC 2015 classification; the same approach also took first place in ImageNet detection and localization and in COCO detection and segmentation. The controlled experiments are convincing: a 34-layer plain network has higher training error than an 18-layer one (degradation reproduced), whereas with skip connections the 34-layer ResNet clearly beats the 18-layer, and deeper keeps helping. The authors even trained past a thousand layers on CIFAR-10. "Deeper really does mean better," a conclusion that should have held but had temporarily broken, was restored.

Why it mattered

It reopened the road to "deep," with a tool that's extremely general. Skip connections are now everywhere: later vision backbones (ResNeXt, DenseNet, etc.), nearly every modern deep net, and even the residual connection around each Transformer sub-layer use the same idea. You could say that our ability today to train very deep models with hundreds of billions of parameters rests on residual connections as one of its foundations. It also made He and colleagues some of the most influential researchers in deep learning.

Limitations and criticism

Diminishing returns from depth: gains from dozens to ~150 layers are clear, but pushing to a thousand-plus layers (e.g., the 1202-layer CIFAR-10 model) helps little or hurts, which the authors attribute to overfitting. More depth isn't always better.
It may behave more like an "ensemble of shallow nets": later work (Veit et al. 2016) argued residual nets act like an ensemble of many paths of varying depth, and dropping some layers barely hurts — not quite the picture of "one genuinely very deep network."
Why it works is still debated: the exact cause of degradation, and whether skip connections help via "easier optimization," "free gradient flow," or "a smoother loss landscape" (a later visualization result), remain open questions; the explanations on offer are complementary rather than settled.
The original wasn't optimal: the first design applies the activation after the addition; later ResNet v2 (pre-activation) reshaped the block to let identity information flow more freely, performing better — so the original still had room to improve.
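
The difference between the two orderings in the last point can be sketched in a few lines. Assumptions: normalization layers are omitted, the residual branch F is passed in as an arbitrary function, and the block names are invented here. With the branch silenced (F returns zeros), the original ordering still clips negative values because of the activation after the addition, while the pre-activation ordering passes x through untouched.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def post_activation_block(F, x):
    """Original ordering: the activation comes after the addition."""
    return relu(add(F(x), x))

def pre_activation_block(F, x):
    """Pre-activation ordering: the activation sits inside the branch only."""
    return add(F(relu(x)), x)

def zero_branch(v):
    """A residual branch that has learned to 'change nothing'."""
    return [0.0 for _ in v]

x = [2.0, -3.0]
post_out = post_activation_block(zero_branch, x)
pre_out = pre_activation_block(zero_branch, x)

print(post_out)  # [2.0, 0.0]  (the negative entry is clipped)
print(pre_out)   # [2.0, -3.0] (exact identity)
```

In this toy, only the pre-activation block gives a clean identity path, which is one way to read "let identity information flow more freely."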

The essentials

① In one line: add a skip connection to each stretch so it learns only the "residual F(x)=H(x)−x," then add x back — solving deep networks' degradation problem.

② The pain: deeper networks show rising training error (degradation, not overfitting); in theory "extra layers learning identity is no worse," but in practice the optimizer struggles to learn identity.

③ Mechanism: output = F(x) + x hands identity for free — need no change, push F to 0, far easier than learning to copy from scratch; skip connections add no parameters and almost no compute.

④ Deeper variants use bottleneck blocks (1×1→3×3→1×1) to save compute; ResNet-152 is 8× deeper than VGG yet lower in complexity.

⑤ Bonus intuition: skip connections give gradients a "highway" so the correction signal flows freely back.

⑥ Results: 3.57% top-5 error on ImageNet, ILSVRC 2015 winner; controlled experiments reproduce degradation and show residuals fix it; trainable past a thousand layers on CIFAR.

⑦ Impact: skip connections became a standard foundation of nearly every deep network (including the Transformer's residual connections).

⑧ Limits: diminishing returns when very deep; may act more like a "multi-path ensemble"; why it works is still debated; pre-activation and others show the original wasn't optimal.