TL;DR
Original ResNet places activation (BN + ReLU) after the residual addition, breaking the shortcut into a non-identity path. This paper proposes pre-activation ResNet (ResNet v2): move BN and ReLU before the convolutions, so the shortcut is a perfect identity. The result: a clean additive recurrence x_L = x_l + Σ F_i that lets signals and gradients flow freely through hundreds of layers, enabling ResNet-1001 to achieve 4.62% error on CIFAR-10 vs. 7.61% for the original design.
1. ResNet v1 Recap: Residual Blocks with Post-Activation
The original ResNet (He et al., CVPR 2016) introduced the residual block to allow very deep networks to be trained. The key idea was to learn a residual function rather than the full mapping. For each building block, the output is:

y = F(x, {W_i}) + x

where F is the residual branch and x is the identity shortcut.
The building block follows the order: Conv → BN → ReLU → Conv → BN → Add → ReLU. In this design, the residual branch F(x, W) computes a transformation, and the result is added to the shortcut before a final ReLU is applied.
ResNet v1 Block (Post-Activation)
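The block ordering above can be sketched in a few lines. This is a toy scalar model (not the paper's code): a single made-up weight `w` stands in for the whole Conv → BN → ReLU → Conv → BN branch, so only the *ordering* of ops is faithful:

```python
def relu(x):
    return max(0.0, x)

def v1_block(x, w):
    """Post-activation residual block (ResNet v1 ordering):
    residual branch, then Add, then a final ReLU on the sum."""
    f = relu(w * x)      # toy stand-in for Conv -> BN -> ReLU -> Conv -> BN
    return relu(x + f)   # ReLU applied AFTER the addition

# The final ReLU acts on the shortcut sum, so the shortcut is not identity:
print(v1_block(-2.0, 0.5))  # 0.0 -- the negative input is clipped away
print(v1_block(2.0, 0.5))   # 3.0 -- x + F(x) = 2 + 1
```

Note how the last `relu` wraps the addition itself; that single call is what breaks the identity path, as the next section shows.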
This design worked exceptionally well: ResNet-152 won ILSVRC 2015. But the authors noticed a subtle issue: that final ReLU prevents the shortcut from being a true identity mapping.
2. The Hidden Problem: Broken Identity Path
In the original design, the shortcut connection feeds directly into an addition, but then a ReLU is applied to the sum. This means the effective shortcut mapping is not h(x_l) = x_l but rather:

x_{l+1} = ReLU(x_l + F(x_l, W_l))
The ReLU zeroes out all negative values. So any negative component in (x_l + F_l) is silently discarded. This has two harmful consequences:
- Forward pass: information in negative activations cannot propagate; the shortcut is not truly "free"
- Backward pass: when the pre-ReLU sum is negative, the gradient is zero there, so gradient flow is blocked
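The backward-pass blockage is easy to see numerically. A minimal sketch, using a two-sided finite difference as the gradient and a toy v1 block whose residual branch happens to output zero (so the block reduces to ReLU over the shortcut):

```python
def relu(x):
    return max(0.0, x)

def v1_out(x):
    # toy v1 block with F(x) = 0: output is relu(x + 0),
    # i.e. the ReLU-after-Add acting directly on the shortcut
    return relu(x + 0.0)

def grad(f, x, eps=1e-6):
    """Two-sided finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# When the pre-ReLU sum is negative, the gradient through the block is zero:
print(grad(v1_out, -1.0))  # 0.0 -- gradient blocked
print(grad(v1_out, +1.0))  # ~1.0 -- gradient flows
```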
The paper poses the question directly: what if we could guarantee the shortcut mapping h(x) = x is a perfect identity? Would that help very deep networks train more easily?
3. Pre-Activation Design: Moving Activations Before the Convolutions
The solution is elegant: instead of placing BN and ReLU after the convolution (post-activation), move them before (pre-activation). The new block applies BN → ReLU → Conv → BN → ReLU → Conv, then adds the original input without any further nonlinearity:

x_{l+1} = x_l + F(x_l, W_l)
Now the shortcut is a pure identity: x_{l+1} = x_l + F(x_l, W_l). No nonlinearity is applied to x_l itself; it passes through unchanged.
ResNet v1 (Post-Activation)
ResNet v2 (Pre-Activation)
This seemingly small change, reordering normalization and activation, has a profound structural consequence: the shortcut now carries the input signal x_l with no modification whatsoever.
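The same toy scalar sketch as before, reordered to match v2 (again, a single made-up weight `w` stands in for the whole BN → ReLU → Conv → BN → ReLU → Conv branch):

```python
def relu(x):
    return max(0.0, x)

def v2_block(x, w):
    """Pre-activation residual block (ResNet v2 ordering):
    activations come BEFORE the weights; the Add is the last op."""
    f = w * relu(x)   # toy stand-in for BN -> ReLU -> Conv -> BN -> ReLU -> Conv
    return x + f      # pure identity shortcut: nothing applied after the Add

# Negative values on the shortcut now pass through unchanged:
print(v2_block(-2.0, 0.5))  # -2.0 -- relu(-2) = 0 inside the branch, x survives
print(v2_block(2.0, 0.5))   # 3.0  -- x + F(x) = 2 + 1
```

Compared with the v1 sketch, the nonlinearity moved inside the branch; the shortcut value `x` is returned untouched.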
There is also a regularization benefit: since BN is applied before each convolution, the inputs to every weight layer are always normalized. In the original design, the input to the first convolution in each block is the output of a ReLU β already non-negative but not normalized. Pre-activation gives each convolution a properly normalized input.
4. Clean Signal Propagation
With a perfect identity shortcut, we can write a simple closed-form expression for the output at any layer L in terms of any earlier layer l. Start from the block recurrence:

x_{l+1} = x_l + F(x_l, W_l)

Unrolling this recurrence from layer l to layer L gives:

x_L = x_l + Σ_{i=l}^{L−1} F(x_i, W_i)
This equation reveals something remarkable: the output at layer L is simply the output at layer l plus a sum of residuals. Layer L has a direct, unobstructed path to layer l. There is no chain of multiplicative transformations between them β just addition.
A further consequence is that any layer l has access to all previous outputs: x_l is the sum of x_0 and all residuals from layer 0 to l−1. The network behaves like an ensemble of shallower sub-networks of many different depths, all sharing weights.
5. Gradient Highway Analysis
The clean forward path has a direct analog in the backward pass. Using the chain rule on the clean recurrence x_L = x_l + Σ F_i, the gradient of the loss (call it E, to avoid clashing with the layer index L) with respect to layer l is:

∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i))

The critical observation is the '1' inside the parentheses. The gradient from layer L to layer l is:

∂x_L/∂x_l = 1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i)

The additive '1' carries ∂E/∂x_L directly to every earlier layer; for the total to vanish, the residual derivatives would have to sum to exactly −1 for every example in a mini-batch, which is unlikely in practice.
This gradient structure has a concrete practical meaning: even at layer 1 of a 1000-layer network, the gradient signal from the final loss arrives without being filtered through 999 multiplicative gates. Training very deep networks becomes almost as stable as training a shallow one.
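This claim can be demonstrated with finite differences on two toy 1000-layer scalar stacks (both made up for illustration): one with identity shortcuts and one plain multiplicative chain:

```python
def grad(f, x, eps=1e-6):
    """Two-sided finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def resnet_v2(x, n_blocks=1000, w=0.0):
    # 1000 pre-activation blocks; w = 0 makes every residual F vanish,
    # leaving only the identity shortcuts, so x_L = x_0 exactly.
    for _ in range(n_blocks):
        x = x + w * max(0.0, x)
    return x

def plain_net(x, n_layers=1000, w=0.99):
    # No shortcuts: the end-to-end gradient is w multiplied 1000 times.
    for _ in range(n_layers):
        x = w * x
    return x

print(grad(resnet_v2, -3.0))  # ~1.0: the gradient survives 1000 layers intact
print(grad(plain_net, 1.0))   # ~0.99**1000, about 4e-5: effectively vanished
```

Even a per-layer factor as benign as 0.99 decays the plain chain's gradient by four orders of magnitude, while the additive shortcut keeps it at 1.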
6. Ablation Study: Six Variants Tested
The paper systematically studies different ways to arrange the activation functions. Six variants of residual unit design are evaluated on ResNet-110 (CIFAR-10) to isolate which design choice matters:
| Variant | Design | Error (%) |
|---|---|---|
| (a) Original | Conv → BN → ReLU → Conv → BN → Add → ReLU | 6.61 |
| (b) BN after Add | BN moved after the Add, so it also transforms the shortcut signal | 8.17 |
| (c) ReLU before Add | ReLU inside residual branch before Add | 7.84 |
| (d) ReLU-only pre-act | ReLU before Conv (no BN before first Conv) | 6.71 |
| (e) Const scaling | λ·x_l + (1−λ)·F on shortcut | worse for all λ |
| (f) Full pre-act ✓ | BN → ReLU → Conv → BN → ReLU → Conv → Add | 6.37 |
The ablation confirms that neither moving just the ReLU nor just the BN is sufficient. Only the full pre-activation arrangement (variant f) achieves the clean identity shortcut and the best accuracy. Notably, variant (b), which places BN after the addition so that it also normalizes the signal flowing along the shortcut, hurts performance significantly, confirming that any modification to the shortcut path is harmful.
7. Results on CIFAR-10 and ImageNet
The most dramatic result is on CIFAR-10 with extremely deep networks. The original ResNet-1001 has a known training difficulty: at that depth, the post-activation design starts to degrade. The pre-activation design enables clean training:
| Model | Design | CIFAR-10 Error |
|---|---|---|
| ResNet-110 | Original (post-act) | 6.61% |
| ResNet-110 | Pre-activation (v2) | 6.37% |
| ResNet-164 | Original (post-act) | 5.93% |
| ResNet-164 | Pre-activation (v2) | 5.46% |
| ResNet-1001 | Original (training unstable) | 7.61% |
| ResNet-1001 | Pre-activation (trains stably) | 4.62% |
The 1001-layer result is the headline finding: nearly 3 percentage points of improvement, achieved by a design change (not more compute or data). The gap grows with depth: at 110 layers the improvement is modest (0.24 pp), but at 1001 layers it is dramatic (3.0 pp).
On ImageNet with ResNet-200 (200 layers), pre-activation achieves 21.1% top-1 error compared to 21.8% for the original ResNet-200, a meaningful improvement at large scale. The gains are smaller on shallower models, consistent with the theory that identity shortcuts matter most when the network is very deep.
8. Pre-Activation in Modern Architectures
Pre-activation ResNet (ResNet v2) has become a widely adopted design pattern beyond image classification. Its influence can be seen in several areas:
- Very deep ResNets for medical imaging: Models with 200–500 layers for 3D volumetric data routinely use pre-activation blocks because post-activation designs degrade at those depths.
- Wide ResNets: Zagoruyko and Komodakis (2016) adopt pre-activation in their wide residual networks, where width is scaled up instead of depth.
- Transformer residual design: The Pre-LN (pre-LayerNorm) variant of Transformers, placing LayerNorm before the attention/FFN block rather than after, follows exactly the same philosophy as pre-activation ResNet and is now standard in many large language models (GPT-2, GPT-3, LLaMA).
- Neural ODE and continuous-depth models: These models interpret the residual recurrence x_{l+1} = x_l + F(x_l) as an Euler discretization of an ODE. The clean identity interpretation from this paper is foundational to that line of work.
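The Euler-discretization reading of the last bullet fits in a few lines. A sketch under simple assumptions: a hypothetical vector field f(x) = −x (whose exact flow is e^{−t}), integrated with Euler steps that have exactly the residual-block form x_{l+1} = x_l + h·f(x_l):

```python
import math

def f(x):
    # hypothetical vector field dx/dt = -x; exact solution is x0 * exp(-t)
    return -x

def euler(x0, steps, h):
    # Each Euler step is literally a residual block: x_{l+1} = x_l + h * F(x_l)
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

# Integrate from t = 0 to t = 1 and compare with the exact solution exp(-1):
approx = euler(1.0, steps=1000, h=0.001)
print(approx, math.exp(-1.0))  # close for small step size h
```

Shrinking h while increasing the step count is the continuous-depth limit that Neural ODEs take seriously.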
The core contribution of this paper, that identity shortcuts enable unobstructed gradient flow and make very deep networks tractable, has proven to be a durable principle that generalizes far beyond its original convolutional setting.
Resources
- Original paper: Identity Mappings in Deep Residual Networks (arXiv 1603.05027)
- Original ResNet paper: Deep Residual Learning for Image Recognition (arXiv 1512.03385)
- Wide Residual Networks (Zagoruyko & Komodakis, 2016) β adopts pre-activation
- On Layer Normalization in the Transformer Architecture (Pre-LN analysis)