TL;DR
Semantic segmentation requires both large receptive fields (to understand context) and full-resolution outputs (to label every pixel). Standard pooling achieves large receptive fields but destroys spatial resolution. Dilated convolutions insert gaps between filter elements, exponentially growing the receptive field with stacked layers — without any downsampling. A 3×3 filter with dilation rate 2 sees a 5×5 region; stacking rates 1, 2, 4, 8, 16 spans a 63×63 region — all at full resolution.
1. Segmentation’s Resolution Dilemma
Semantic segmentation assigns a class label to every pixel in an image. Unlike image classification, which produces a single label for the whole image, segmentation demands a dense, full-resolution prediction map.
There are two conflicting requirements that make this hard:
- Large receptive fields — to understand what an object is, a model must see enough surrounding context. A pixel’s class often depends on objects tens or hundreds of pixels away.
- Full-resolution output — the final prediction must be the same size as the input image. Every pixel needs a label, so spatial information cannot be discarded.
Standard convolutional networks resolve the receptive-field problem through pooling and strided convolutions. After several pooling layers, a neuron’s receptive field spans a large image region — but the feature map itself has been downsampled, often to 1/32 of the original resolution. To recover spatial detail, these architectures must use upsampling (transposed convolutions, bilinear interpolation), which re-introduces blurriness and loses fine-grained boundaries.
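The 1/32 figure is just repeated halving. A minimal sketch, assuming a VGG-style backbone with five stride-2 stages (the input size is illustrative):

```python
# Spatial size after five stride-2 pooling stages: 1/32 of the input.
size = 224
for _ in range(5):
    size //= 2          # each pooling layer halves the spatial resolution
print(size)             # 7, i.e. a 224x224 input shrinks to a 7x7 feature map
```

Recovering pixel-accurate predictions from a 7×7 map is exactly the upsampling problem described above.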
2. Dilated Convolutions Explained
A dilated (or “atrous”) convolution is a convolution in which the filter is applied over an area larger than its footprint by inserting zeros between filter elements. The dilation rate controls how many zeros are inserted: with rate $r$, $r - 1$ zeros sit between neighboring taps, so each consecutive filter tap is spaced $r$ pixels apart.
The standard discrete convolution is:

$$(F * k)(p) = \sum_{s + t = p} F(s)\,k(t)$$

The dilated convolution with rate $r$ is:

$$(F *_r k)(p) = \sum_{s + r\,t = p} F(s)\,k(t)$$

The only difference is the factor $r$ in front of $t$. Instead of summing over adjacent filter positions, we sample the input at positions spaced $r$ apart. When $r = 1$, this reduces to standard convolution.
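The definition translates directly into code. A minimal 1D sketch (the function name and "valid"-mode boundary handling are choices made here, not the paper's):

```python
def dilated_conv1d(F, k, r=1):
    """1D dilated convolution: (F *_r k)(p) = sum over s + r*t = p of F(s)*k(t).

    k has odd length; only output positions fully covered by the dilated
    filter are kept ("valid" mode).
    """
    R = len(k) // 2                          # filter radius: taps t = -R..R
    out = []
    for p in range(r * R, len(F) - r * R):
        acc = 0
        for t in range(-R, R + 1):
            acc += F[p - r * t] * k[t + R]   # s = p - r*t solves s + r*t = p
        out.append(acc)
    return out

F = [1, 2, 3, 4, 5]
k = [1, 1, 1]
print(dilated_conv1d(F, k, r=1))  # [6, 9, 12]: ordinary 3-tap convolution
print(dilated_conv1d(F, k, r=2))  # [9]: taps 2 apart, F[0] + F[2] + F[4]
```

With `r=1` adjacent inputs are summed; with `r=2` the same 3 taps skip every other position, exactly as the formula prescribes.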
For a filter of size $k$ applied with dilation rate $r$, the effective receptive field size is:

$$k_{\text{eff}} = k + (k - 1)(r - 1)$$

For example, a 3×3 filter ($k = 3$) with dilation rate $r = 2$ has an effective size of:

$$3 + (3 - 1)(2 - 1) = 5$$
So a 3×3 filter with r=2 sees a 5×5 region, but only uses 9 parameters (not 25). With r=4, the same 9-parameter filter sees a 9×9 region. The key insight: receptive field grows with dilation rate, but parameter count stays the same.
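The effective-size formula is a one-liner to check (a minimal sketch; `effective_size` is a name chosen here):

```python
def effective_size(k, r):
    """Effective footprint of a k-tap filter with dilation rate r."""
    return k + (k - 1) * (r - 1)

print(effective_size(3, 1))  # 3: standard convolution
print(effective_size(3, 2))  # 5: the 5x5 example above
print(effective_size(3, 4))  # 9: same 9 parameters, 9x9 footprint
```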
3. Exponential Receptive Field Growth
The real power of dilated convolutions emerges when they are stacked with exponentially increasing dilation rates: 1, 2, 4, 8, 16, …
After stacking $n$ layers with dilation rates $1, 2, 4, \dots, 2^{n-1}$, each using a 3×3 filter, the total effective receptive field is:

$$(2^{n+1} - 1) \times (2^{n+1} - 1)$$
Tracing this layer by layer:
| Layer | Dilation rate | Effective filter size | Cumulative receptive field |
|---|---|---|---|
| 1 | 1 | 3×3 | 3×3 |
| 2 | 2 | 5×5 | 7×7 |
| 3 | 4 | 9×9 | 15×15 |
| 4 | 8 | 17×17 | 31×31 |
| 5 | 16 | 33×33 | 63×63 |
| 6 | 32 | 65×65 | 127×127 |
The receptive field grows as 3, 7, 15, 31, 63, 127, … Each doubling of the dilation rate roughly doubles the receptive field diameter — exponential growth with linear cost in layers. Compare this to standard convolutions stacked without dilation: you would need a 63-layer network of 3×3 filters to achieve the same 127×127 receptive field.
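The table can be reproduced by accumulating (k_eff − 1) per stacked layer, since each layer extends the receptive field by its effective size minus one (a sketch; the helper names are chosen here):

```python
def effective_size(k, r):
    """Effective footprint of a k-tap filter with dilation rate r."""
    return k + (k - 1) * (r - 1)

def cumulative_rf(rates, k=3):
    """Receptive field after stacking k-tap layers with the given rates."""
    rf = 1
    for r in rates:
        rf += effective_size(k, r) - 1   # each layer adds (k_eff - 1)
    return rf

rates = [1, 2, 4, 8, 16, 32]
for n in range(1, len(rates) + 1):
    print(n, cumulative_rf(rates[:n]))   # 3, 7, 15, 31, 63, 127
```

The printed values match the cumulative receptive-field column above, confirming the 2^(n+1) − 1 growth.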
4. Multi-Scale Context Module
The paper proposes a standalone context module that can be plugged on top of any existing segmentation network. It takes the network’s feature map as input and passes it through a sequence of dilated convolutions with progressively larger dilation rates to aggregate multi-scale context.
The module stacks 8 convolutional layers with the following dilation schedule:
| Layer | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Dilation | 1 | 1 | 2 | 4 | 8 | 16 | 1 | 1 |
| Filter | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 3×3 | 1×1 |

All layers except the last use 3×3 filters; the final 1×1 layer combines information across channels. This module is trained end-to-end and can be inserted between the body of a segmentation network (e.g., VGG, ResNet) and the final prediction layer.
5. Worked Example: Tracing a 3×3 Filter with r=2, then r=4
Consider a 1D input of width 9 and a filter of size 3. We trace two dilated convolution layers.
Layer 1: dilation r=2, filter size k=3
Input positions: [0, 1, 2, 3, 4, 5, 6, 7, 8]. Computing output at position p=4:
# Filter taps at offsets t=-1, 0, +1
s + r·t = p → s + 2·t = 4
t=-1: s = 4 + 2 = 6 → input[6] × k[-1]
t= 0: s = 4 + 0 = 4 → input[4] × k[0]
t=+1: s = 4 - 2 = 2 → input[2] × k[1]
Sampled positions: 2, 4, 6 — a span of 5 (effective 5×1 filter)
Layer 2: dilation r=4, filter size k=3
Now each input to layer 2 already integrates a 5-wide region. Computing output at position p=4:
# Filter taps at offsets t=-1, 0, +1
s + r·t = p → s + 4·t = 4
t=-1: s = 4 + 4 = 8 → L1_out[8] (covers original input 6..10)
t= 0: s = 4 + 0 = 4 → L1_out[4] (covers original input 2..6)
t=+1: s = 4 - 4 = 0 → L1_out[0] (covers original input -2..2)
Combined span in original input: positions -2..10 → 13-wide, clipped to 0..8 by the 9-wide input
Two 3-tap filters (6 parameters total) see a 13-wide region (here truncated by the input boundary). Equivalently in 2D: a 3×3 filter with r=2 covers a 5×5 footprint; stacking another 3×3 filter with r=4 on top covers a 13×13 footprint — all without any pooling or striding.
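The trace can be checked mechanically by propagating, through each layer, the set of original input positions every output depends on (a small sketch; `rf_support` and its zero-padded "same" convention are choices made here):

```python
def rf_support(n, rates, k=3):
    """For each output position of stacked dilated k-tap convolutions over an
    n-wide input, return the set of input indices it depends on."""
    deps = [{i} for i in range(n)]           # layer 0: each position sees itself
    R = k // 2
    for r in rates:
        new = []
        for p in range(n):
            s = set()
            for t in range(-R, R + 1):
                q = p - r * t                # tap position with s + r*t = p
                if 0 <= q < n:               # taps outside hit zero padding
                    s |= deps[q]
            new.append(s)
        deps = new
    return deps

deps = rf_support(9, rates=[2, 4])
print(sorted(deps[4]))   # [0, 2, 4, 6, 8]: span 0..8, taps on a stride-2 grid

wide = rf_support(15, rates=[2, 4])
print(max(wide[7]) - min(wide[7]) + 1)   # 13: the unclipped receptive field
```

Note that within the span only every other position actually contributes: stacked dilations with shared factors sample a sparse grid, which is why later work interleaves rates carefully.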
6. Results on Pascal VOC
The authors evaluated on Pascal VOC 2012, the standard benchmark for semantic segmentation with 20 object categories plus background. The metric is mean Intersection over Union (mIoU).
| Method | Backbone | val mIoU (%) |
|---|---|---|
| FCN-32s (baseline) | VGG-16 | 59.4 |
| FCN-8s (skip connections) | VGG-16 | 62.2 |
| DeepLab v1 (CRF post-proc.) | VGG-16 | 67.6 |
| Basic context module (ours) | VGG-16 | 66.2 |
| Large context module (ours) | VGG-16 | 67.1 |
| Large context + CRF (ours) | VGG-16 | 70.3 |
The context module achieves competitive mIoU with FCN and DeepLab baselines, without any CRF post-processing. When combined with a CRF, it surpasses prior methods. Crucially, the improvement comes from richer multi-scale features, not architectural tricks or special training procedures.
7. Influence: WaveNet, DeepLab, and Beyond
Dilated convolutions proved to be a general-purpose tool far beyond semantic segmentation. Two especially influential successors:
WaveNet (van den Oord et al., DeepMind 2016)
WaveNet uses causal dilated convolutions for raw audio generation. The 1D signal requires extremely long-range context (audio at 16 kHz needs thousands of samples of context for coherent speech). A single stack of 10 dilated layers with rates 1, 2, 4, ..., 512 gives a receptive field of 1024 samples, roughly 64 ms of audio; WaveNet repeats this stack several times, reaching thousands of samples of context with only ~30 layers. The exponential receptive field growth was directly inspired by this paper.
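Assuming WaveNet's 2-tap causal filters, where each layer with rate r extends the receptive field r samples into the past, the arithmetic for one stack is (a sketch):

```python
# One WaveNet-style stack: 2-tap causal filters, dilation doubling per layer.
rates = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512
rf = 1 + sum(rates)                   # each layer adds r samples of context
print(rf)                             # 1024 samples, about 64 ms at 16 kHz
print(rf / 16000 * 1000)              # milliseconds of audio covered
```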
DeepLab v2 / v3 / v3+ (Chen et al., Google, 2017–2018)
DeepLab extended dilated convolutions with Atrous Spatial Pyramid Pooling (ASPP): applying several dilated convolutions in parallel with different rates (e.g., r=6, 12, 18) and concatenating their outputs. This captures features at multiple scales simultaneously within a single layer, rather than sequentially. DeepLab v3+ achieved state-of-the-art on PASCAL VOC (89.0 mIoU) and Cityscapes, becoming the dominant segmentation paradigm for several years.
Dilated Residual Networks (Yu et al., 2017)
A follow-up by the same authors replaced the final pooling layers of ResNet with dilated convolutions to maintain spatial resolution at 1/8 instead of 1/32. This became the standard practice for dense prediction tasks: use a classification backbone, remove the last stride, add dilation to compensate, and plug in a lightweight prediction head. This pattern underlies most modern segmentation and detection architectures.