GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism

Huang et al. Β· Google Brain Β· NeurIPS 2019 Β· arXiv 1811.06965

TL;DR

GPipe splits a large model across K accelerators (model parallelism), then splits each input batch into M micro-batches and pipelines them through the devices. Re-materialization (gradient checkpointing) reduces activation memory from O(NΒ·M) to O(N+M). A pipeline bubble of fraction (Kβˆ’1)/(M+Kβˆ’1) is the only overhead β€” which vanishes as M grows large. Result: a 557M-parameter AmoebaNet-D on 4 TPUs, and an 83B-parameter language model on 8 TPUs.

1. The Memory Wall Problem

Modern neural networks β€” especially in computer vision and NLP β€” have grown dramatically. A ResNet-50 has ~25M parameters; an AmoebaNet-D variant has 557M; large language models push into the tens or hundreds of billions. A single accelerator (GPU or TPU) has a fixed amount of high-bandwidth memory (HBM). When the model alone cannot fit in that memory, training becomes impossible without a strategy for distributing it.

Two natural axes of parallelism exist: data parallelism (each worker gets a copy of the full model, but a shard of the data) and model parallelism (each worker gets a shard of the model, but sees the full batch). Data parallelism breaks down the moment the model itself is too large for one device. Model parallelism is the only option β€” but naive implementations waste most of the hardware.

2. Naive Model Parallelism

The straightforward approach to model parallelism is to assign consecutive groups of layers to consecutive devices. Device 1 holds layers 1–L/K, device 2 holds layers L/K+1–2L/K, and so on. The forward pass runs sequentially: device 1 computes its layers, passes the activation tensor to device 2, which computes its layers and passes forward, etc.
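As a concrete (if simplified) sketch, the contiguous split can be written as a small helper. The function name is illustrative, not part of any GPipe API, and it assumes L is divisible by K:

```python
# Split L layers into K contiguous partitions, one per device,
# as in the naive model-parallel scheme described above.
# Assumes len(layers) is divisible by K (remainder handling omitted).
def partition(layers, K):
    size = len(layers) // K
    return [layers[i * size:(i + 1) * size] for i in range(K)]

parts = partition(list(range(1, 33)), 4)      # 32 layers, 4 devices
assert [len(p) for p in parts] == [8, 8, 8, 8]
assert parts[0][0] == 1 and parts[-1][-1] == 32
```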

The critical problem: at any point in time, only one device is active. While device 2 processes its layers, devices 1, 3, 4, …, K are idle. The utilization of each device approaches 1/K β€” so with 4 devices, each is busy only 25% of the time. This is sometimes called the "model parallelism bubble" or just idle time, and it completely negates the purpose of using multiple accelerators.

Naive model parallelism β€” timeline (K=4 devices, 1 batch)

Device 1: [FWD ][idle][idle][idle][idle][idle][idle][BWD ]
Device 2: [idle][FWD ][idle][idle][idle][idle][BWD ][idle]
Device 3: [idle][idle][FWD ][idle][idle][BWD ][idle][idle]
Device 4: [idle][idle][idle][FWD ][BWD ][idle][idle][idle]

Each device active only 25% of the time (2 of 8 time slots).

3. Pipeline Parallelism with Micro-Batches

GPipe's key insight is to subdivide the input mini-batch into M equal micro-batches. Each device processes micro-batch m, then immediately starts on micro-batch m+1. Device k+1 can begin processing micro-batch m as soon as device k finishes it β€” so all devices can be active simultaneously (after the initial fill phase).

Critically, the gradient update happens only after all M micro-batches complete their forward and backward passes. The gradients from M micro-batches are accumulated and averaged before the optimizer step. This means the effective batch size equals the original mini-batch size β€” the micro-batch split is purely a scheduling optimization, invisible to the learning dynamics.
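A minimal NumPy sketch of this equivalence, using an illustrative linear model with a mean-squared-error loss: averaging the per-micro-batch gradients reproduces the full mini-batch gradient exactly when the micro-batches are equal-sized, so the optimizer sees the same update either way.

```python
# Gradient accumulation sketch: the mean of M per-micro-batch
# gradients equals the single full mini-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # gradient of the mean squared error over the (micro-)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                         # one pass over the mini-batch

M = 8                                        # number of micro-batches
micro = [grad(Xm, ym, w)
         for Xm, ym in zip(np.split(X, M), np.split(y, M))]
accumulated = np.mean(micro, axis=0)         # GPipe-style accumulation

assert np.allclose(full, accumulated)
```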

GPipe pipeline schedule β€” K=4 devices, M=4 micro-batches

         t=1   t=2   t=3   t=4   t=5   t=6   t=7   t=8   t=9   t=10  t=11  t=12  t=13  t=14
Device 1: [F1]  [F2]  [F3]  [F4]  [  ]  [  ]  [  ]  [  ]  [  ]  [  ]  [B4]  [B3]  [B2]  [B1]
Device 2: [  ]  [F1]  [F2]  [F3]  [F4]  [  ]  [  ]  [  ]  [  ]  [B4]  [B3]  [B2]  [B1]  [  ]
Device 3: [  ]  [  ]  [F1]  [F2]  [F3]  [F4]  [  ]  [  ]  [B4]  [B3]  [B2]  [B1]  [  ]  [  ]
Device 4: [  ]  [  ]  [  ]  [F1]  [F2]  [F3]  [F4]  [B4]  [B3]  [B2]  [B1]  [  ]  [  ]  [  ]

Fm = forward pass for micro-batch m, Bm = backward pass (the backward phase starts on the last device and processes micro-batches in reverse order). Empty slots = bubble (pipeline idle). Each device applies the weight update from the accumulated gradients once its final backward step (B1) finishes.

The key property: once the pipeline is full (after Kβˆ’1 time steps), all K devices are active on different micro-batches simultaneously. The only wasted time is the initial Kβˆ’1 steps to fill the pipeline and the final Kβˆ’1 steps to drain it β€” the "pipeline bubble."

4. Re-materialization (Gradient Checkpointing)

With K devices and M micro-batches, the naive approach stores all activations from every layer of every micro-batch simultaneously to support the backward pass. With N total layers and M micro-batches, this means storing O(NΒ·M) activation tensors β€” which can easily exceed device memory.

GPipe applies re-materialization: activations are not stored during the forward pass. Instead, only the boundary activations (the output tensors passed between devices) are retained. During the backward pass, each device re-runs its forward computation to recompute the needed activations on the fly, then immediately computes and discards the gradient.

Memory cost comparison

Without re-materialization (store all activations):

Memory = O(N Β· M)

N layers Γ— M micro-batches each held in memory simultaneously

With re-materialization (only store partition boundary outputs):

Memory = O(N + M)

O(N): activations recomputed within one partition at a time during the backward pass (plus that partition's parameters) + O(M): boundary activation tensors (one per micro-batch in-flight)

The trade-off is computation: each partition's forward pass is run twice, once during the forward phase and once during the backward phase to recompute activations. This doubles the forward computation on each device; since the backward pass typically costs about twice a forward pass, total training compute grows by roughly 25–33%, which is far preferable to running out of memory entirely.
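A toy sketch of the recompute-in-backward idea, using a chain of scalar multiply "layers" y = wΒ·x split into K contiguous partitions. All names are illustrative (real implementations checkpoint tensors, not scalars): the forward pass keeps only each partition's input, and the backward pass re-runs one partition's forward at a time to rebuild the activations it needs.

```python
# Re-materialization sketch: store only partition-boundary inputs in
# the forward pass; recompute per-layer activations during backward.
# Assumes len(ws) is divisible by k.

def forward(ws, x, k):
    """Run the chain of layers y = w * x, storing only boundary inputs."""
    size = len(ws) // k
    boundaries = []                    # one stored value per partition
    for p in range(k):
        boundaries.append(x)           # boundary activation for partition p
        for w in ws[p * size:(p + 1) * size]:
            x = w * x                  # layer forward; activation discarded
    return x, boundaries

def backward(ws, boundaries, k, grad_out=1.0):
    """Recompute each partition's activations, then backprop through it."""
    size = len(ws) // k
    grads = [0.0] * len(ws)
    for p in reversed(range(k)):
        # re-materialize: rerun this partition's forward from its boundary
        acts = [boundaries[p]]
        for w in ws[p * size:(p + 1) * size]:
            acts.append(w * acts[-1])
        # backprop through the partition with the recomputed activations
        for i in reversed(range(p * size, (p + 1) * size)):
            grads[i] = grad_out * acts[i - p * size]   # dL/dw_i = g * input_i
            grad_out = grad_out * ws[i]                # dL/dx   = g * w_i
    return grads
```

For ws = [2, 3, 4, 5] and x = 1, the output is the product 120, and each dL/dw_i is the product of the other three weights, which matches what storing all activations would give.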

5. Bubble Overhead Analysis

The pipeline is not perfectly efficient: there is always a startup bubble (filling the pipeline) and a drain bubble (emptying it). How large is this overhead exactly?

Pipeline efficiency formula

Total time steps in forward+backward schedule:

T_total = (M + K βˆ’ 1) Β· t_step

Useful (non-bubble) time steps:

T_useful = M Β· t_step

Bubble fraction (wasted time):

bubble = (K βˆ’ 1) / (M + K βˆ’ 1)

Pipeline efficiency:

Ξ· = 1 βˆ’ (K βˆ’ 1) / (M + K βˆ’ 1) = M / (M + K βˆ’ 1)

The bubble fraction decreases as M increases. With K=4 partitions and M=4 micro-batches, the bubble is (4βˆ’1)/(4+4βˆ’1) = 3/7 β‰ˆ 43%. With M=32 micro-batches, it drops to 3/35 β‰ˆ 9%. With M=128, it is 3/131 β‰ˆ 2.3%. In practice, GPipe recommends M β‰₯ 4K to keep bubble overhead below 20%.
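The formulas above can be checked directly against the worked numbers in this section:

```python
# Bubble fraction and pipeline efficiency from Section 5.
def bubble(K, M):
    return (K - 1) / (M + K - 1)

def efficiency(K, M):
    return M / (M + K - 1)

assert abs(bubble(4, 4) - 3 / 7) < 1e-12       # ~43%
assert abs(bubble(4, 32) - 3 / 35) < 1e-12     # ~9%
assert abs(bubble(4, 128) - 3 / 131) < 1e-12   # ~2.3%
assert abs(efficiency(4, 4) + bubble(4, 4) - 1.0) < 1e-12
```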

6. Worked Example: 4 Devices, 8 Micro-Batches

Suppose we have a model with 32 layers split across K=4 devices (8 layers each), and we split each mini-batch into M=8 micro-batches.

Bubble fraction:

(K βˆ’ 1) / (M + K βˆ’ 1) = (4 βˆ’ 1) / (8 + 4 βˆ’ 1) = 3/11 β‰ˆ 27%

Pipeline efficiency:

Ξ· = M / (M + K βˆ’ 1) = 8/11 β‰ˆ 73%

Memory per device (with re-mat):

Parameters for 8 layers + 8 boundary activation tensors (one per micro-batch). Compare to without re-mat: 8 layers Γ— 8 micro-batch activations = 64 activation tensors simultaneously.

Scaling to M=32 micro-batches:

bubble = 3 / (32 + 3) = 3/35 β‰ˆ 8.6%

At M=32 the overhead is acceptable; memory cost increases only linearly with M (O(M) boundary tensors).

K=4, M=8: full pipeline schedule (F=forward, B=backward, empty=bubble)

      t:  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
Dev 1: [F1][F2][F3][F4][F5][F6][F7][F8][  ][  ][  ][  ][  ][  ][B8][B7][B6][B5][B4][B3][B2][B1]
Dev 2: [  ][F1][F2][F3][F4][F5][F6][F7][F8][  ][  ][  ][  ][B8][B7][B6][B5][B4][B3][B2][B1][  ]
Dev 3: [  ][  ][F1][F2][F3][F4][F5][F6][F7][F8][  ][  ][B8][B7][B6][B5][B4][B3][B2][B1][  ][  ]
Dev 4: [  ][  ][  ][F1][F2][F3][F4][F5][F6][F7][F8][B8][B7][B6][B5][B4][B3][B2][B1][  ][  ][  ]

Each device is idle for 2(Kβˆ’1) = 6 of the 22 slots, i.e. 3/11, matching the bubble fraction computed above.
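A small simulator (illustrative, not GPipe's actual scheduler) can build this grid programmatically and confirm that the measured idle fraction equals the analytic bubble (Kβˆ’1)/(M+Kβˆ’1):

```python
# Build the GPipe forward+backward schedule grid and count idle slots.
def schedule(K, M):
    T = 2 * (M + K - 1)                      # total time steps
    grid = [["  "] * T for _ in range(K)]
    for k in range(K):                       # device index, 0-based
        for m in range(M):
            grid[k][k + m] = f"F{m + 1}"     # forward staircase
            # backward staircase: last device starts at t = M + K - 1,
            # micro-batches processed in reverse order
            grid[k][(M + K - 1) + (K - 1 - k) + m] = f"B{M - m}"
    return grid

g = schedule(4, 8)
idle = sum(row.count("  ") for row in g)
total = sum(len(row) for row in g)
assert idle / total == 3 / 11                # bubble fraction from Section 5
```

The same count for K=4, M=4 gives 3/7, reproducing the 43% bubble from Section 5.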

7. Results

GPipe was validated on two drastically different domains: image classification with AmoebaNet-D and language modeling with a Transformer-based LM.

Image Classification

  • AmoebaNet-D (557M parameters) trained on ImageNet at 480Γ—480 resolution across 4 accelerators β€” impossible to fit on one.
  • Top-1 accuracy: 84.3% on ImageNet β€” state-of-the-art at the time.
  • Scaled to 1.8B parameters with 8 accelerators.

Language Modeling

  • Trained a Transformer LM with 83 billion parameters across 8 TPUs β€” roughly 10Γ— the size of GPT-2 XL (1.5B), trained the same year.
  • Perplexity improved significantly with scale, validating that pipeline parallelism enables regimes otherwise inaccessible.

Throughput Scaling

  • Near-linear throughput scaling: doubling the number of devices approximately doubles throughput (for sufficiently large M).
  • Re-materialization overhead is approximately 25% extra compute β€” a small constant cost for dramatically reduced memory.

8. Modern Relevance: Tensor and Pipeline Parallelism in Megatron-LM

GPipe's pipeline parallelism is now one of three standard parallelism axes used in large-scale LLM training. Megatron-LM (NVIDIA, 2021+) combines all three simultaneously:

Tensor Parallelism (TP)

Splits individual weight matrices across devices within a single layer. For a linear layer W ∈ ℝ^{dΓ—d}, split into column-parallel and row-parallel shards. Requires all-reduce communication within each layer. Typically used within a node (fast NVLink).
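A NumPy sketch of the idea, ignoring the nonlinearity between the two matmuls and using two simulated "devices": shard the first matrix by columns and the second by rows, and summing the per-device partial outputs plays the role of the all-reduce. Names and shapes here are illustrative.

```python
# Column-/row-parallel matmul: sum of sharded partial products
# equals the unsharded two-layer result.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 8))                  # batch of activations
W1 = rng.normal(size=(8, 16))                # column-parallel shard target
W2 = rng.normal(size=(16, 8))                # row-parallel shard target

reference = (x @ W1) @ W2                    # single-device computation

# "device 0" holds W1[:, :8] and W2[:8, :]; "device 1" the other halves
partials = [(x @ W1[:, s]) @ W2[s, :]
            for s in (slice(0, 8), slice(8, 16))]
allreduced = sum(partials)                   # the all-reduce step

assert np.allclose(reference, allreduced)
```

This works because W1 Β· W2 = Ξ£_s W1[:, s] Β· W2[s, :]; in Megatron the elementwise nonlinearity is applied on each column shard independently before the row-parallel matmul, so no communication is needed between the two layers.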

Pipeline Parallelism (PP) β€” GPipe-style

Splits layers across nodes. Only boundary activation tensors are communicated (much less bandwidth than all-reduce). Used across nodes (slower InfiniBand). Megatron adds "interleaved scheduling" which reduces the bubble further by splitting each stage into non-contiguous layer chunks.

Data Parallelism (DP)

Outer loop: replicate the full (TP+PP) model across data-parallel groups, each processing different data. Gradients are all-reduced across DP groups. ZeRO (DeepSpeed) further shards optimizer states, gradients, and parameters across DP ranks.

A typical configuration for training a 175B parameter model might be: TP=8 (within node, 8 GPUs), PP=8 (across 8 nodes), DP=64 (64 copies of the TP+PP group). Total: 8 Γ— 8 Γ— 64 = 4096 GPUs. GPipe's micro-batch pipeline schedule is the direct ancestor of the PP dimension here.