TL;DR
ResNet introduces skip connections, also called shortcut connections, that let a layer learn the residual F(x) = H(x) − x instead of the full mapping H(x). This simple idea lets neural networks scale to hundreds of layers without training accuracy degrading. An ensemble of ResNets won ILSVRC 2015 with a record 3.57% top-5 error, surpassing the commonly cited human-level estimate of ~5.1%.
1. The Degradation Problem
Before ResNet, a natural assumption was: deeper networks should be at least as good as shallower ones. If a 20-layer network is optimal, you can always construct a 56-layer network that matches it by making the extra 36 layers identity mappings. In theory, the deeper network's training error should be no worse.
Yet in practice, experiments showed the opposite. A plain 56-layer network had higher training error than a 20-layer network on CIFAR-10. This was not overfitting: the training error itself was higher, meaning the optimizer simply could not find a good solution in the deeper network's parameter space.
Key insight: The degradation problem shows that deep plain networks are hard to optimize, not that they have insufficient capacity. The solution is to change what the layers are asked to learn.
He et al. hypothesized that it is easier to optimize the residual mapping F(x) = H(x) − x than the original unreferenced mapping H(x). If the optimal solution is close to an identity mapping, it is easier to push F(x) toward zero than to have a stack of nonlinear layers approximate an identity from scratch.
2. The Residual Block
The core building block of ResNet is the residual block. Instead of learning H(x) directly, the block learns the residual function F(x), and the true output is recovered by adding the input x back:

y = F(x, {W_i}) + x
Here x is the input to the block, F(x, {W_i}) is the residual function computed by the stacked layers (e.g., two 3×3 convolutions with BN and ReLU), and the addition is element-wise. The shortcut connection adds x directly: no extra parameters, no extra computation.
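The residual computation above can be sketched in a few lines. This is a minimal toy version, assuming fully-connected layers (plain matrix multiplies) in place of the paper's 3×3 convolutions, with BN omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Toy residual block: y = F(x) + x, with F(x) = W2 @ relu(W1 @ x).
    Matrix multiplies stand in for the paper's stacked 3x3 convolutions."""
    f = W2 @ relu(W1 @ x)   # residual branch F(x, {W_i})
    return f + x            # identity shortcut: element-wise addition

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)

# With zero weights the residual branch vanishes and the whole block
# reduces to an identity mapping -- the "push F toward zero" intuition.
W_zero = np.zeros((d, d))
assert np.allclose(residual_block(x, W_zero, W_zero), x)
```

Note how cheaply the block expresses an identity: the optimizer only has to drive the branch weights toward zero, rather than coaxing two nonlinear layers into reproducing their input.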
Identity vs. Projection Shortcuts
When the input and output have the same dimensions (same number of channels and spatial size), the shortcut is a simple identity: x is added as-is. This costs nothing.
When dimensions differ, for example when the number of channels doubles at a downsampling step, the shortcut must be adapted. The paper uses a 1×1 convolution (projection shortcut) to match dimensions:

y = F(x, {W_i}) + W_s x
W_s is a 1×1 convolution applied with stride 2 (for spatial downsampling) that maps the input from C channels to 2C channels. The paper's experiments show that using projection shortcuts only where necessary (option B in the paper) performs well while keeping the parameter count low.
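Because a 1×1 convolution is just a per-pixel channel mix, a stride-2 projection shortcut can be sketched as spatial subsampling followed by a matrix multiply. A minimal numpy sketch, assuming a single image in (C, H, W) layout:

```python
import numpy as np

def projection_shortcut(x, W_s):
    """Stride-2 1x1 convolution (projection shortcut).
    x: (C, H, W) input; W_s: (2C, C) channel-mixing matrix.
    A 1x1 conv mixes channels independently at each pixel, so stride 2
    reduces to subsampling the spatial grid, then multiplying channels."""
    x_sub = x[:, ::2, ::2]                       # spatial stride 2
    return np.einsum('oc,chw->ohw', W_s, x_sub)  # C channels -> 2C channels

rng = np.random.default_rng(1)
C, H, W = 16, 8, 8
x = rng.standard_normal((C, H, W))
W_s = rng.standard_normal((2 * C, C))
y = projection_shortcut(x, W_s)
print(y.shape)  # (32, 4, 4): channels doubled, spatial size halved
```

The output has the same shape as the residual branch at a downsampling step, so the element-wise addition y = F(x) + W_s x goes through.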
3. Architecture Details
ResNet Family
The paper introduces a family of networks with different depths. ResNet-18 and ResNet-34 use basic blocks (two 3×3 convolutions). ResNet-50, ResNet-101, and ResNet-152 use bottleneck blocks to keep computation manageable at larger scales.
| Model | Layers | Block type | Params | Top-1 err (ImageNet) |
|---|---|---|---|---|
| ResNet-18 | 18 | Basic | 11.7M | 30.24% |
| ResNet-34 | 34 | Basic | 21.8M | 26.73% |
| ResNet-50 | 50 | Bottleneck | 25.6M | 24.01% |
| ResNet-101 | 101 | Bottleneck | 44.5M | 22.44% |
| ResNet-152 | 152 | Bottleneck | 60.2M | 22.16% |
Bottleneck Block
For deeper networks (ResNet-50 and beyond), using two 3×3 convolutions per block becomes computationally expensive. The bottleneck design replaces them with a three-layer structure: a 1×1 convolution that reduces the channel count (e.g., 256 → 64), a 3×3 convolution on the reduced channels, and a 1×1 convolution that restores the original count (64 → 256).
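The savings are easy to check with a little arithmetic, using the paper's example widths of 256 channels reduced to 64 inside the bottleneck (weights only, ignoring biases and BN):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

# Basic block at 256 channels: two 3x3 convolutions.
basic = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduce -> 3x3 on reduced width -> 1x1 restore.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic)       # 1179648
print(bottleneck)  # 69632 -- roughly 17x fewer weights per block
```

The expensive 3×3 convolution operates on 64 channels instead of 256, which is what makes 100+ layer networks affordable.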
Batch Normalization Placement
In the original ResNet, Batch Normalization (BN) is placed after each convolution and before the ReLU activation. The order within each residual branch is: Conv → BN → ReLU → Conv → BN → (add shortcut) → ReLU.
He et al. later explored "pre-activation" ResNets (in a 2016 follow-up) where BN and ReLU come before the convolution: BN → ReLU → Conv. This variant makes the shortcut path truly clean (just addition, with no BN or activation on the path) and slightly improves performance on very deep networks.
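The two orderings can be contrasted as function compositions. A toy sketch, assuming matrices in place of convolutions and a simple standardization in place of learned BN:

```python
import numpy as np

def bn(z):
    """Stand-in for batch norm: standardize to zero mean, unit variance."""
    return (z - z.mean()) / (z.std() + 1e-5)

def relu(z):
    return np.maximum(z, 0.0)

def original_block(x, conv1, conv2):
    """Original ordering: Conv -> BN -> ReLU -> Conv -> BN -> add -> ReLU."""
    f = bn(conv2 @ relu(bn(conv1 @ x)))
    return relu(f + x)   # a ReLU still sits on the path after the addition

def preact_block(x, conv1, conv2):
    """Pre-activation ordering: BN -> ReLU -> Conv, applied twice."""
    f = conv2 @ relu(bn(conv1 @ relu(bn(x))))
    return f + x         # shortcut path is pure addition, nothing after it
```

The structural difference is visible in the return statements: only the pre-activation block leaves the shortcut path completely untouched from block to block.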
4. Gradient Flow Analysis
The key mathematical benefit of skip connections is how they affect gradient flow during backpropagation. Consider the loss L and the gradient flowing back through the network.
For a residual block with output y = F(x) + x, the gradient of the loss with respect to the block's input x is:

∂L/∂x = ∂L/∂y · (I + ∂F/∂x)
The identity matrix I in the gradient expression is crucial. Even if the gradient flowing through the residual branch F(x) becomes very small (vanishing), the identity term ensures that ∂L/∂y is always passed directly to ∂L/∂x. Gradients cannot fully vanish as long as the shortcut connection is present.
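This can be checked numerically. A sketch, assuming a toy residual branch with deliberately tiny weights so its gradient nearly vanishes, and a finite-difference Jacobian as the measurement tool:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x."""
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(2)
d = 6
# Tiny weights: the branch's output and its gradient are nearly zero.
W1 = 1e-4 * rng.standard_normal((d, d))
W2 = 1e-4 * rng.standard_normal((d, d))
F = lambda v: W2 @ np.maximum(W1 @ v, 0.0)

x = rng.standard_normal(d)
J_plain = jacobian(F, x)                   # plain branch: gradient ~ 0
J_res = jacobian(lambda v: F(v) + v, x)    # residual block: I + dF/dx

print(np.abs(J_plain).max() < 1e-3)             # True: branch gradient vanishes
print(np.allclose(J_res, np.eye(d), atol=1e-3)) # True: the I term survives
```

Even when ∂F/∂x collapses to zero, the residual block's Jacobian stays close to the identity, so upstream gradients pass through undiminished.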
5. Results
ResNet's results on ILSVRC 2015 (ImageNet) were decisive. The ensemble of ResNets achieved 3.57% top-5 error on the test set, better than the reported human-level performance of ~5.1% by some estimates.
On CIFAR-10, the paper showed that a plain 56-layer network performs worse than a plain 20-layer network (the degradation problem), whereas ResNet-56 and ResNet-110 both outperform shallower residual baselines, showing that residual connections address the degradation problem rather than merely alleviating it.
ResNet also transferred strongly to object detection (COCO) and image segmentation, suggesting the learned features generalize well. ResNet-101/152 as backbone networks became the standard in computer vision for years after publication.
6. Key Takeaways
Depth alone isn't enough
Stacking more layers on a plain network hurts training accuracy. The problem is optimization difficulty, not model capacity.
Residual learning reframes the problem
Asking layers to learn F(x) = H(x) β x is easier than learning H(x) when the optimal H is close to identity. Pushing F toward zero is trivial; pushing H toward identity through multiple nonlinearities is not.
Skip connections are gradient highways
The identity shortcut guarantees a direct gradient path from any layer to any earlier layer, preventing vanishing gradients even in 150+ layer networks.
Bottlenecks make depth computationally feasible
The 1×1 → 3×3 → 1×1 bottleneck design dramatically reduces FLOPs by running the expensive 3×3 convolution on a reduced channel count, enabling 50- to 152-layer networks to train efficiently.
Zero extra parameters for identity shortcuts
When input and output dimensions match, skip connections are essentially free: no parameters, and only a cheap element-wise addition. Only dimension-changing shortcuts (1×1 convolutions) add a small number of parameters.
Further Reading
- He et al. 2016 – Identity Mappings in Deep Residual Networks (pre-activation ResNet)
- Huang et al. 2017 – Densely Connected Convolutional Networks (DenseNet) (extends skip connections to all layers)
- Xie et al. 2017 – Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) (grouped convolutions in the bottleneck)