TL;DR
ResNet introduces skip connections, also called shortcut connections, that let a layer learn the residual F(x) = H(x) − x instead of the full mapping H(x). This simple idea lets neural networks scale to hundreds of layers without training accuracy degrading. An ensemble of ResNets won ILSVRC 2015 with a record 3.57% top-5 error, surpassing the commonly cited human-level estimate of ~5.1%.
1. The Degradation Problem
Before ResNet, a natural assumption was: deeper networks should be at least as good as shallower ones. If a 20-layer network is optimal, you can always construct a 56-layer network that matches it by making the extra 36 layers identity mappings. In theory, the deeper network's training error should be no worse.
Yet in practice, experiments showed the opposite. A plain 56-layer network had higher training error than a 20-layer network on CIFAR-10. This was not overfitting: the training error itself was higher, meaning the optimizer simply could not find a good solution in the deeper network's parameter space.
Key insight: The degradation problem shows that deep plain networks are hard to optimize, not that they have insufficient capacity. The solution is to change what the layers are asked to learn.
He et al. hypothesized that it is easier to optimize the residual mapping F(x) = H(x) − x than the original unreferenced mapping H(x). If the optimal solution is close to an identity mapping, it is easier to push F(x) toward zero than to have a stack of nonlinear layers approximate an identity from scratch.
2. The Residual Block
The core building block of ResNet is the residual block. Instead of learning H(x) directly, the block learns the residual function F(x), and the true output is recovered by adding the input x back:

y = F(x, {W_i}) + x
Here x is the input to the block, F(x, {W_i}) is the residual function computed by the stacked layers (e.g., two 3×3 convolutions with BN and ReLU), and the addition is element-wise. The shortcut connection adds x directly: no extra parameters, no extra computation.
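The residual computation above can be sketched in a few lines. This is a minimal toy version, assuming fully-connected layers (plain matrix multiplies) in place of the paper's 3×3 convolutions, with BN omitted for brevity:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """Toy residual block: y = F(x) + x, with F(x) = W2 @ relu(W1 @ x).
    Matrix multiplies stand in for the paper's stacked 3x3 convolutions."""
    f = W2 @ relu(W1 @ x)   # residual branch F(x, {W_i})
    return f + x            # identity shortcut: element-wise addition

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)

# With zero weights the residual branch vanishes and the whole block
# reduces to an identity mapping -- the "push F toward zero" intuition.
W_zero = np.zeros((d, d))
assert np.allclose(residual_block(x, W_zero, W_zero), x)
```

Note how cheaply the block expresses an identity: the optimizer only has to drive the branch weights toward zero, rather than coaxing two nonlinear layers into reproducing their input.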
Identity vs. Projection Shortcuts
When the input and output have the same dimensions (same number of channels and spatial size), the shortcut is a simple identity: x is added as-is. This costs nothing.
When dimensions differ, for example when the number of channels doubles at a downsampling step, the shortcut must be adapted. The paper uses a 1×1 convolution (projection shortcut) to match dimensions:

y = F(x, {W_i}) + W_s x
W_s is a 1×1 convolution applied with stride 2 (for spatial downsampling) that maps the input from C channels to 2C channels. The paper's experiments show that using projection shortcuts only where necessary (option B in the paper) performs well while keeping the parameter count low.
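Because a 1×1 convolution is just a per-pixel channel mix, a stride-2 projection shortcut can be sketched as spatial subsampling followed by a matrix multiply. A minimal numpy sketch, assuming a single image in (C, H, W) layout:

```python
import numpy as np

def projection_shortcut(x, W_s):
    """Stride-2 1x1 convolution (projection shortcut).
    x: (C, H, W) input; W_s: (2C, C) channel-mixing matrix.
    A 1x1 conv mixes channels independently at each pixel, so stride 2
    reduces to subsampling the spatial grid, then multiplying channels."""
    x_sub = x[:, ::2, ::2]                       # spatial stride 2
    return np.einsum('oc,chw->ohw', W_s, x_sub)  # C channels -> 2C channels

rng = np.random.default_rng(1)
C, H, W = 16, 8, 8
x = rng.standard_normal((C, H, W))
W_s = rng.standard_normal((2 * C, C))
y = projection_shortcut(x, W_s)
print(y.shape)  # (32, 4, 4): channels doubled, spatial size halved
```

The output has the same shape as the residual branch at a downsampling step, so the element-wise addition y = F(x) + W_s x goes through.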
3. Architecture Details
ResNet Family
The paper introduces a family of networks with different depths. ResNet-18 and ResNet-34 use basic blocks (two 3×3 convolutions). ResNet-50, ResNet-101, and ResNet-152 use bottleneck blocks to keep computation manageable at larger scales.
| Model | Layers | Block type | Params | Top-1 err (ImageNet) |
|---|---|---|---|---|
| ResNet-18 | 18 | Basic | 11.7M | 30.24% |
| ResNet-34 | 34 | Basic | 21.8M | 26.73% |
| ResNet-50 | 50 | Bottleneck | 25.6M | 24.01% |
| ResNet-101 | 101 | Bottleneck | 44.5M | 22.44% |
| ResNet-152 | 152 | Bottleneck | 60.2M | 22.16% |
Bottleneck Block
For deeper networks (ResNet-50 and beyond), using two 3×3 convolutions per block becomes computationally expensive. The bottleneck design replaces them with a three-layer structure: a 1×1 convolution that reduces the channel count (e.g., 256 → 64), a 3×3 convolution on the reduced channels, and a 1×1 convolution that restores the original count (64 → 256).
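The savings are easy to check with a little arithmetic, using the paper's example widths of 256 channels reduced to 64 inside the bottleneck (weights only, ignoring biases and BN):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution from c_in to c_out channels."""
    return k * k * c_in * c_out

# Basic block at 256 channels: two 3x3 convolutions.
basic = 2 * conv_params(3, 256, 256)

# Bottleneck block: 1x1 reduce -> 3x3 on reduced width -> 1x1 restore.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic)       # 1179648
print(bottleneck)  # 69632 -- roughly 17x fewer weights per block
```

The expensive 3×3 convolution operates on 64 channels instead of 256, which is what makes 100+ layer networks affordable.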
Batch Normalization Placement
In the original ResNet, Batch Normalization (BN) is placed after each convolution and before the ReLU activation. The order within each residual branch is: Conv → BN → ReLU → Conv → BN → (add shortcut) → ReLU.
He et al. later explored "pre-activation" ResNets (in a 2016 follow-up) where BN and ReLU come before the convolution: BN → ReLU → Conv. This variant makes the shortcut path truly clean (just addition, with no BN or activation on the path) and slightly improves performance on very deep networks.
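The two orderings can be contrasted as function compositions. A toy sketch, assuming matrices in place of convolutions and a simple standardization in place of learned BN:

```python
import numpy as np

def bn(z):
    """Stand-in for batch norm: standardize to zero mean, unit variance."""
    return (z - z.mean()) / (z.std() + 1e-5)

def relu(z):
    return np.maximum(z, 0.0)

def original_block(x, conv1, conv2):
    """Original ordering: Conv -> BN -> ReLU -> Conv -> BN -> add -> ReLU."""
    f = bn(conv2 @ relu(bn(conv1 @ x)))
    return relu(f + x)   # a ReLU still sits on the path after the addition

def preact_block(x, conv1, conv2):
    """Pre-activation ordering: BN -> ReLU -> Conv, applied twice."""
    f = conv2 @ relu(bn(conv1 @ relu(bn(x))))
    return f + x         # shortcut path is pure addition, nothing after it
```

The structural difference is visible in the return statements: only the pre-activation block leaves the shortcut path completely untouched from block to block.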
4. Gradient Flow Analysis
The key mathematical benefit of skip connections is how they affect gradient flow during backpropagation. Consider the loss L and the gradient flowing back through the network.
For a residual block with output y = F(x) + x, the gradient of the loss with respect to the block's input x is:

∂L/∂x = ∂L/∂y · (I + ∂F/∂x)
The identity matrix I in the gradient expression is crucial. Even if the gradient flowing through the residual branch F(x) becomes very small (vanishing), the identity term ensures that ∂L/∂y is always passed directly to ∂L/∂x. Gradients cannot fully vanish as long as the shortcut connection is present.
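This can be checked numerically. A sketch, assuming a toy residual branch with deliberately tiny weights so its gradient nearly vanishes, and a finite-difference Jacobian as the measurement tool:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x."""
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(2)
d = 6
# Tiny weights: the branch's output and its gradient are nearly zero.
W1 = 1e-4 * rng.standard_normal((d, d))
W2 = 1e-4 * rng.standard_normal((d, d))
F = lambda v: W2 @ np.maximum(W1 @ v, 0.0)

x = rng.standard_normal(d)
J_plain = jacobian(F, x)                   # plain branch: gradient ~ 0
J_res = jacobian(lambda v: F(v) + v, x)    # residual block: I + dF/dx

print(np.abs(J_plain).max() < 1e-3)             # True: branch gradient vanishes
print(np.allclose(J_res, np.eye(d), atol=1e-3)) # True: the I term survives
```

Even when ∂F/∂x collapses to zero, the residual block's Jacobian stays close to the identity, so upstream gradients pass through undiminished.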
5. Results
ResNet's results on ILSVRC 2015 (ImageNet) were decisive. The ensemble of ResNets achieved 3.57% top-5 error on the test set, better than the reported human-level performance of ~5.1% by some estimates.
On CIFAR-10, the paper showed that a plain 56-layer network performs worse than a plain 20-layer network (the degradation problem), whereas ResNet-56 and ResNet-110 both outperform shallower residual baselines, showing that residual connections address the degradation problem rather than merely alleviating it.
ResNet also transferred strongly to object detection (COCO) and image segmentation, suggesting the learned features generalize well. ResNet-101/152 as backbone networks became the standard in computer vision for years after publication.
6. Key Takeaways
Depth alone isn't enough
Stacking more layers on a plain network hurts training accuracy. The problem is optimization difficulty, not model capacity.
Residual learning reframes the problem
Asking layers to learn F(x) = H(x) β x is easier than learning H(x) when the optimal H is close to identity. Pushing F toward zero is trivial; pushing H toward identity through multiple nonlinearities is not.
Skip connections are gradient highways
The identity shortcut guarantees a direct gradient path from any layer to any earlier layer, preventing vanishing gradients even in 150+ layer networks.
Bottlenecks make depth computationally feasible
The 1×1 → 3×3 → 1×1 bottleneck design dramatically reduces FLOPs by running the expensive 3×3 convolution on a reduced channel count, enabling 50- to 152-layer networks to train efficiently.
Zero extra parameters for identity shortcuts
When input and output dimensions match, skip connections are essentially free: no parameters, and only a cheap element-wise addition. Only dimension-changing shortcuts (1×1 convolutions) add a small number of parameters.
Further Reading
- He et al. 2016 – Identity Mappings in Deep Residual Networks (pre-activation ResNet)
- Huang et al. 2017 – Densely Connected Convolutional Networks (DenseNet) (extends skip connections to all layers)
- Xie et al. 2017 – Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) (grouped convolutions in the bottleneck)