TL;DR
Original ResNet places activation (BN + ReLU) after the residual addition, breaking the shortcut into a non-identity path. This paper proposes pre-activation ResNet (ResNet v2): move BN and ReLU before the convolutions, so the shortcut is a perfect identity. The result: a clean additive recurrence x_L = x_l + Σ F_i that lets signals and gradients flow freely through hundreds of layers, enabling ResNet-1001 to achieve 4.62% error on CIFAR-10 vs. 7.61% for the original design.
1. ResNet v1 Recap: Residual Blocks with Post-Activation
The original ResNet (He et al., CVPR 2016) introduced the residual block to allow very deep networks to be trained. The key idea was to learn a residual function rather than the full mapping. For each building block, the output is:

y = F(x, {W_i}) + x

where F is the residual branch and x is the identity shortcut.
The building block follows the order: Conv → BN → ReLU → Conv → BN → Add → ReLU. In this design, the residual branch F(x, W) computes a transformation, and the result is added to the shortcut before a final ReLU is applied.
ResNet v1 Block (Post-Activation)
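The block ordering above can be sketched in a few lines. This is a toy scalar model (not the paper's code): a single made-up weight `w` stands in for the whole Conv → BN → ReLU → Conv → BN branch, so only the *ordering* of ops is faithful:

```python
def relu(x):
    return max(0.0, x)

def v1_block(x, w):
    """Post-activation residual block (ResNet v1 ordering):
    residual branch, then Add, then a final ReLU on the sum."""
    f = relu(w * x)      # toy stand-in for Conv -> BN -> ReLU -> Conv -> BN
    return relu(x + f)   # ReLU applied AFTER the addition

# The final ReLU acts on the shortcut sum, so the shortcut is not identity:
print(v1_block(-2.0, 0.5))  # 0.0 -- the negative input is clipped away
print(v1_block(2.0, 0.5))   # 3.0 -- x + F(x) = 2 + 1
```

Note how the last `relu` wraps the addition itself; that single call is what breaks the identity path, as the next section shows.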
This design worked exceptionally well: ResNet-152 won ILSVRC 2015. But the authors noticed a subtle issue: that final ReLU prevents the shortcut from being a true identity mapping.
2. The Hidden Problem: Broken Identity Path
In the original design, the shortcut connection feeds directly into an addition, but then a ReLU is applied to the sum. This means the effective shortcut mapping is not h(x_l) = x_l but rather:

x_{l+1} = ReLU(x_l + F(x_l, W_l))
The ReLU zeroes out all negative values. So any negative component in (x_l + F_l) is silently discarded. This has two harmful consequences:
- Forward pass: information in negative activations cannot propagate; the shortcut is not truly "free"
- Backward pass: when the pre-ReLU sum is negative, the gradient is zero there, so gradient flow is blocked
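The backward-pass blockage is easy to see numerically. A minimal sketch, using a two-sided finite difference as the gradient and a toy v1 block whose residual branch happens to output zero (so the block reduces to ReLU over the shortcut):

```python
def relu(x):
    return max(0.0, x)

def v1_out(x):
    # toy v1 block with F(x) = 0: output is relu(x + 0),
    # i.e. the ReLU-after-Add acting directly on the shortcut
    return relu(x + 0.0)

def grad(f, x, eps=1e-6):
    """Two-sided finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# When the pre-ReLU sum is negative, the gradient through the block is zero:
print(grad(v1_out, -1.0))  # 0.0 -- gradient blocked
print(grad(v1_out, +1.0))  # ~1.0 -- gradient flows
```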
The paper poses the question directly: what if we could guarantee the shortcut mapping h(x) = x is a perfect identity? Would that help very deep networks train more easily?
3. Pre-Activation Design: Moving Activations Before the Convolutions
The solution is elegant: instead of placing BN and ReLU after the convolution (post-activation), move them before (pre-activation). The new block applies BN → ReLU → Conv → BN → ReLU → Conv, then adds the original input without any further nonlinearity:

x_{l+1} = x_l + F(x_l, W_l)
Now the shortcut is a pure identity: x_{l+1} = x_l + F(x_l, W_l). No nonlinearity is applied to x_l itself; it passes through unchanged.
ResNet v1 (Post-Activation)
ResNet v2 (Pre-Activation)
This seemingly small change, reordering normalization and activation, has a profound structural consequence: the shortcut now carries the input signal x_l with no modification whatsoever.
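The same toy scalar sketch as before, reordered to match v2 (again, a single made-up weight `w` stands in for the whole BN → ReLU → Conv → BN → ReLU → Conv branch):

```python
def relu(x):
    return max(0.0, x)

def v2_block(x, w):
    """Pre-activation residual block (ResNet v2 ordering):
    activations come BEFORE the weights; the Add is the last op."""
    f = w * relu(x)   # toy stand-in for BN -> ReLU -> Conv -> BN -> ReLU -> Conv
    return x + f      # pure identity shortcut: nothing applied after the Add

# Negative values on the shortcut now pass through unchanged:
print(v2_block(-2.0, 0.5))  # -2.0 -- relu(-2) = 0 inside the branch, x survives
print(v2_block(2.0, 0.5))   # 3.0  -- x + F(x) = 2 + 1
```

Compared with the v1 sketch, the nonlinearity moved inside the branch; the shortcut value `x` is returned untouched.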
There is also a regularization benefit: since BN is applied before each convolution, the inputs to every weight layer are always normalized. In the original design, the input to the first convolution in each block is the output of a ReLU β already non-negative but not normalized. Pre-activation gives each convolution a properly normalized input.
4. Clean Signal Propagation
With a perfect identity shortcut, we can write a simple closed-form expression for the output at any layer L in terms of any earlier layer l. Start from the block recurrence:

x_{l+1} = x_l + F(x_l, W_l)

Unrolling this recurrence from layer l to layer L gives:

x_L = x_l + Σ_{i=l}^{L−1} F(x_i, W_i)
This equation reveals something remarkable: the output at layer L is simply the output at layer l plus a sum of residuals. Layer L has a direct, unobstructed path to layer l. There is no chain of multiplicative transformations between them β just addition.
A further consequence is that any layer l has access to all previous outputs: x_l is the sum of x_0 and all residuals from layer 0 to l−1. The network behaves like an ensemble of shallower sub-networks of many different depths, all sharing weights.
5. Gradient Highway Analysis
The clean forward path has a direct analog in the backward pass. Using the chain rule on the clean recurrence x_L = x_l + Σ F_i, the gradient of the loss (call it E, to avoid clashing with the layer index L) with respect to layer l is:

∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i))

The critical observation is the '1' inside the parentheses. The gradient from layer L to layer l is:

∂x_L/∂x_l = 1 + ∂/∂x_l Σ_{i=l}^{L−1} F(x_i, W_i)

The additive '1' carries ∂E/∂x_L directly to every earlier layer; for the total to vanish, the residual derivatives would have to sum to exactly −1 for every example in a mini-batch, which is unlikely in practice.
This gradient structure has a concrete practical meaning: even at layer 1 of a 1000-layer network, the gradient signal from the final loss arrives without being filtered through 999 multiplicative gates. Training very deep networks becomes almost as stable as training a shallow one.
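This claim can be demonstrated with finite differences on two toy 1000-layer scalar stacks (both made up for illustration): one with identity shortcuts and one plain multiplicative chain:

```python
def grad(f, x, eps=1e-6):
    """Two-sided finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def resnet_v2(x, n_blocks=1000, w=0.0):
    # 1000 pre-activation blocks; w = 0 makes every residual F vanish,
    # leaving only the identity shortcuts, so x_L = x_0 exactly.
    for _ in range(n_blocks):
        x = x + w * max(0.0, x)
    return x

def plain_net(x, n_layers=1000, w=0.99):
    # No shortcuts: the end-to-end gradient is w multiplied 1000 times.
    for _ in range(n_layers):
        x = w * x
    return x

print(grad(resnet_v2, -3.0))  # ~1.0: the gradient survives 1000 layers intact
print(grad(plain_net, 1.0))   # ~0.99**1000, about 4e-5: effectively vanished
```

Even a per-layer factor as benign as 0.99 decays the plain chain's gradient by four orders of magnitude, while the additive shortcut keeps it at 1.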
6. Ablation Study: Six Variants Tested
The paper systematically studies different ways to arrange the activation functions. Six variants of residual unit design are evaluated on ResNet-110 (CIFAR-10) to isolate which design choice matters:
| Variant | Design | Error (%) |
|---|---|---|
| (a) Original | Conv → BN → ReLU → Conv → BN → Add → ReLU | 6.61 |
| (b) BN after Add | BN moved after the Add, so it also transforms the shortcut signal | 8.17 |
| (c) ReLU before Add | ReLU inside residual branch before Add | 7.84 |
| (d) ReLU-only pre-act | ReLU before Conv (no BN before first Conv) | 6.71 |
| (e) Const scaling | λ·x_l + (1−λ)·F on shortcut | worse for all λ |
| (f) Full pre-act ✓ | BN → ReLU → Conv → BN → ReLU → Conv → Add | 6.37 |
The ablation confirms that neither moving just the ReLU nor just the BN is sufficient. Only the full pre-activation arrangement (variant f) achieves the clean identity shortcut and the best accuracy. Notably, variant (b), which places BN after the addition so that it also normalizes the signal flowing along the shortcut, hurts performance significantly, confirming that any modification to the shortcut path is harmful.
7. Results on CIFAR-10 and ImageNet
The most dramatic result is on CIFAR-10 with extremely deep networks. The original ResNet-1001 has a known training difficulty: at that depth, the post-activation design starts to degrade. The pre-activation design enables clean training:
| Model | Design | CIFAR-10 Error |
|---|---|---|
| ResNet-110 | Original (post-act) | 6.61% |
| ResNet-110 | Pre-activation (v2) | 6.37% |
| ResNet-164 | Original (post-act) | 5.93% |
| ResNet-164 | Pre-activation (v2) | 5.46% |
| ResNet-1001 | Original (training unstable) | 7.61% |
| ResNet-1001 | Pre-activation (trains stably) | 4.62% |
The 1001-layer result is the headline finding: nearly 3 percentage points of improvement, achieved by a design change (not more compute or data). The gap grows with depth: at 110 layers the improvement is modest (0.24 pp), but at 1001 layers it is dramatic (3.0 pp).
On ImageNet with ResNet-200 (200 layers), pre-activation achieves 21.1% top-1 error compared to 21.8% for the original ResNet-200, a meaningful improvement at large scale. The gains are smaller on shallower models, consistent with the theory that identity shortcuts matter most when the network is very deep.
8. Pre-Activation in Modern Architectures
Pre-activation ResNet (ResNet v2) has become a widely adopted design pattern beyond image classification. Its influence can be seen in several areas:
- Very deep ResNets for medical imaging: Models with 200–500 layers for 3D volumetric data routinely use pre-activation blocks because post-activation designs degrade at those depths.
- Wide ResNets: Zagoruyko and Komodakis (2016) adopt pre-activation in their wide residual networks, where width is scaled up instead of depth.
- Transformer residual design: The Pre-LN (pre-LayerNorm) variant of Transformers, placing LayerNorm before the attention/FFN block rather than after, follows exactly the same philosophy as pre-activation ResNet and is now standard in many large language models (GPT-2, GPT-3, LLaMA).
- Neural ODE and continuous-depth models: These models interpret the residual recurrence x_{l+1} = x_l + F(x_l) as an Euler discretization of an ODE. The clean identity interpretation from this paper is foundational to that line of work.
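The Euler-discretization reading of the last bullet fits in a few lines. A sketch under simple assumptions: a hypothetical vector field f(x) = −x (whose exact flow is e^{−t}), integrated with Euler steps that have exactly the residual-block form x_{l+1} = x_l + h·f(x_l):

```python
import math

def f(x):
    # hypothetical vector field dx/dt = -x; exact solution is x0 * exp(-t)
    return -x

def euler(x0, steps, h):
    # Each Euler step is literally a residual block: x_{l+1} = x_l + h * F(x_l)
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

# Integrate from t = 0 to t = 1 and compare with the exact solution exp(-1):
approx = euler(1.0, steps=1000, h=0.001)
print(approx, math.exp(-1.0))  # close for small step size h
```

Shrinking h while increasing the step count is the continuous-depth limit that Neural ODEs take seriously.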
The core contribution of this paper, that identity shortcuts enable unobstructed gradient flow and make very deep networks tractable, has proven to be a durable principle that generalizes far beyond its original convolutional setting.
Resources
- Original paper: Identity Mappings in Deep Residual Networks (arXiv 1603.05027)
- Original ResNet paper: Deep Residual Learning for Image Recognition (arXiv 1512.03385)
- Wide Residual Networks (Zagoruyko & Komodakis, 2016) β adopts pre-activation
- On Layer Normalization in the Transformer Architecture (Pre-LN analysis)