PaperTrace — Interactive ML Paper Deep-Dives

TL;DR

Standard attention materializes the full N×N attention matrix in GPU HBM, requiring O(N²) memory. FlashAttention reorders the computation using tiling: it computes attention in blocks that fit in SRAM (fast cache), never writing the full N×N matrix to HBM. This achieves exact (not approximate) attention with 2-4× speedup and O(N) memory.

Concept map

Problem: Standard attention writes the N×N matrix to HBM; the bottleneck is memory bandwidth.

Key insight (motivated by the problem): GPU SRAM (fast) is tiny; GPU HBM (slow) is large. Minimize HBM reads/writes.

Tiling (solution strategy): Split Q, K, V into blocks of size B_r × B_c that fit in SRAM, and compute partial softmax in SRAM.

Online softmax (core mechanism): Rescale the running max/sum as new blocks arrive. This enables exact softmax.

Backward: Recompute attention from tiles (no N×N matrix is stored for the gradient).

Result (outcome): 3× faster attention, 5-20× memory reduction, longer context.

Key properties: exact attention (not approximate); O(N) memory instead of O(N²); IO-aware algorithm design.

1. Background: Why Standard Attention Is Slow

Modern GPUs have two levels of memory with very different bandwidths. SRAM (on-chip memory) is fast but tiny: about 20 MB in total on an A100, split into 192 KB per streaming multiprocessor across 108 of them. HBM (high-bandwidth memory, i.e., the main GPU RAM) is large (∼40 GB on A100) but comparatively slow.

Memory type	Bandwidth	Size (A100)
SRAM (on-chip)	~19 TB/s	~20 MB
HBM (off-chip)	~2 TB/s	~40 GB

Standard attention performs three round-trips through HBM for each sequence. Every intermediate result — the raw score matrix S, the softmax probability matrix P — is written to HBM and read back:

Step 1: Load Q, K from HBM → compute S = QKᵀ → write S to HBM

Step 2: Load S from HBM → compute P = softmax(S) → write P to HBM

Step 3: Load P, V from HBM → compute O = PV → write O to HBM

Total HBM reads/writes: O(N²) — this is the bottleneck, not compute!
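The three steps can be sketched in plain Python to make the two N×N intermediates visible. This is an illustrative sketch, not the paper's code: the function name `standard_attention` is made up here, lists of lists stand in for GPU tensors, and the 1/√d scaling is omitted to match S = QKᵀ as written in Step 1.

```python
import math

def standard_attention(Q, K, V):
    """Naive attention that holds the full score and probability matrices."""
    d_v = len(V[0])
    # Step 1: S = Q K^T, an N x N matrix kept in full.
    S = [[sum(a * b for a, b in zip(q, k)) for k in K] for q in Q]
    # Step 2: P = row-wise softmax(S), a second N x N matrix kept in full.
    P = []
    for row in S:
        m = max(row)                        # row max, for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        P.append([e / total for e in exps])
    # Step 3: O = P V, an N x d matrix.
    O = [[sum(p[j] * V[j][c] for j in range(len(V))) for c in range(d_v)]
         for p in P]
    return S, P, O
```

On a GPU, S and P in this sketch correspond to the tensors that make the round trip through HBM; both grow with N², while O grows only with N.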

For sequence length N = 4096, the attention matrix S has 4096² = 16,777,216 elements. At FP16 (2 bytes per element), that is 32 MiB for S alone, for a single attention head, which already exceeds the roughly 20 MB of on-chip SRAM. The bottleneck is not arithmetic throughput but memory bandwidth.

2. The Tiling Approach

FlashAttention tiles Q, K, V into blocks that each fit in SRAM, computes attention entirely within SRAM for each pair of blocks, and accumulates into the output without ever materializing the full N×N matrix in HBM.

Q = [Q_1, \ldots, Q_{T_r}], \quad K = [K_1, \ldots, K_{T_c}], \quad V = [V_1, \ldots, V_{T_c}]

# Load Q block into SRAM once
for i in 1..T_r:
    Load Q_i from HBM to SRAM
    for j in 1..T_c:
        Load K_j, V_j from HBM to SRAM
        Compute S_ij = Q_i K_jᵀ (in SRAM)
        Update running max m and sum ℓ (online softmax)
        Accumulate O_i += rescaled(softmax(S_ij) · V_j) (in SRAM)
    Write O_i to HBM (once per Q block)

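HBM reads/writes: O(N²d²/M) for head dimension d and SRAM size M, versus Θ(Nd + N²) for standard attention. The additional memory footprint is O(N), because S and P are never materialized in HBM. (The loop order shown here, with Q blocks in the outer loop, is the one FlashAttention-2 uses; Algorithm 1 of the original paper iterates over K/V blocks in the outer loop.)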
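The tiled loop can be sketched in plain Python and checked against a naive reference. This is a minimal sketch under stated assumptions, not the CUDA kernel: `tiled_attention`, `naive_attention`, `block_r`, and `block_c` are names chosen here, lists of lists stand in for tensors, local variables play the role of SRAM, and the accumulator is kept unnormalized and divided by ℓ once at the end, which gives the same result as normalizing at every step.

```python
import math

def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T) V with the full score rows held at once."""
    out = []
    for q in Q:
        s = [sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(s)
        p = [math.exp(x - m) for x in s]
        z = sum(p)
        out.append([sum(p[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block_r=2, block_c=2):
    """Attention computed block by block; only one score tile exists at a time."""
    d_v = len(V[0])
    O = []
    for i0 in range(0, len(Q), block_r):            # outer loop: query blocks
        Qi = Q[i0:i0 + block_r]
        m = [-math.inf] * len(Qi)                   # running row max
        ell = [0.0] * len(Qi)                       # running row sum
        acc = [[0.0] * d_v for _ in Qi]             # unnormalized output rows
        for j0 in range(0, len(K), block_c):        # inner loop: key/value blocks
            Kj, Vj = K[j0:j0 + block_c], V[j0:j0 + block_c]
            for r, q in enumerate(Qi):
                s = [sum(a * b for a, b in zip(q, k)) for k in Kj]  # row of S_ij
                m_new = max(m[r], max(s))
                alpha = math.exp(m[r] - m_new)      # 0.0 on the first block
                p = [math.exp(x - m_new) for x in s]
                ell[r] = alpha * ell[r] + sum(p)
                for c in range(d_v):
                    acc[r][c] = alpha * acc[r][c] + sum(
                        p[t] * Vj[t][c] for t in range(len(Vj)))
                m[r] = m_new
        # One write per query block, after normalizing by the final row sum.
        O.extend([[a / ell[r] for a in acc[r]] for r in range(len(Qi))])
    return O
```

Calling `tiled_attention(Q, K, V, 2, 2)` on a 4-token input follows the same schedule as the pseudocode: two query blocks, each sweeping two key/value blocks, with results matching `naive_attention` up to floating-point rounding.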

3. The Online Softmax Trick

Softmax normally requires a two-pass algorithm: first compute the maximum value across all scores (for numerical stability), then compute the exponentials and sum. This requires seeing all N values before producing any output — incompatible with tiling.

The online softmax trick maintains a running maximum m and a running sum ℓ as blocks arrive, and rescales the accumulated output O each time the maximum estimate is updated:

m_\text{new} = \max(m_\text{old},\; \max(s_\text{block}))

\ell_\text{new} = e^{m_\text{old} - m_\text{new}} \cdot \ell_\text{old} + \sum_j e^{s_{\text{block},j} - m_\text{new}}

O_\text{new} = \frac{e^{m_\text{old} - m_\text{new}} \cdot \ell_\text{old} \cdot O_\text{old} + e^{s_\text{block} - m_\text{new}} \cdot V_\text{block}}{\ell_\text{new}}

Why this is exact: Each update is a mathematically equivalent rescaling of the previous partial result. When all blocks have been processed, O contains exactly the same value as standard attention — not an approximation. The trick only reorganizes the order of arithmetic operations.
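The three update equations can be checked numerically for a single query row. The sketch below assumes scalar values (d = 1) to keep it short; the function names are illustrative and not taken from the paper. It applies the m, ℓ, O recurrences block by block and compares the result with an ordinary softmax that sees all scores at once.

```python
import math

def online_softmax_weighted_sum(score_blocks, value_blocks):
    """Apply the m / ell / O updates one block at a time for a single row."""
    m, ell, o = -math.inf, 0.0, 0.0
    for s_blk, v_blk in zip(score_blocks, value_blocks):
        m_new = max(m, max(s_blk))
        scale_old = math.exp(m - m_new)             # e^{m_old - m_new}
        w = [math.exp(s - m_new) for s in s_blk]    # e^{s_block - m_new}
        ell_new = scale_old * ell + sum(w)
        # O_new: rescale the old (normalized) output, add the new block, renormalize.
        o = (scale_old * ell * o + sum(wi * vi for wi, vi in zip(w, v_blk))) / ell_new
        m, ell = m_new, ell_new
    return o, m, ell

def full_softmax_weighted_sum(scores, values):
    """Reference that sees every score before producing output."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

blocks_s = [[1.0, 3.0], [2.0, 5.0]]
blocks_v = [[10.0, 20.0], [30.0, 40.0]]
streamed, final_max, final_sum = online_softmax_weighted_sum(blocks_s, blocks_v)
reference = full_softmax_weighted_sum([1.0, 3.0, 2.0, 5.0], [10.0, 20.0, 30.0, 40.0])
print(abs(streamed - reference) < 1e-12)
```

The running maximum changes from 3.0 to 5.0 when the second block arrives, so this input exercises the rescaling path rather than only the trivial case.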

4. Concrete Example: Tiling with N=4, Block Size=2

Query sequence: [q1, q2, q3, q4], Key sequence: [k1, k2, k3, k4]

Block 1 (q1,q2 attend to k1,k2):

Load Q[1:2], K[1:2], V[1:2] into SRAM

Compute S_11 = Q[1:2] · K[1:2]ᵀ

Set m1 = row-wise max(S_11), ℓ_1 = row-wise sum of exp(S_11 - m1)

O[1:2] = exp(S_11 - m1) · V[1:2] / ℓ_1

Block 2 (q1,q2 attend to k3,k4):

Load K[3:4], V[3:4] into SRAM (Q[1:2] stays)

Compute S_12 = Q[1:2] · K[3:4]ᵀ

m2 = max(m1, row-wise max(S_12)), rescale ℓ_1 by exp(m1 - m2), merge into ℓ_2

Rescale O[1:2] by exp(m1 - m2), add new contribution

Final O[1:2] = exact softmax attention over all 4 keys!

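S and P were never written to HBM. Each K, V block was read once for this query block; the same blocks are read again when the next query block, Q[3:4], is processed.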

5. Backward Pass: Recomputation

For backpropagation, standard attention needs to store the N×N attention matrix P to compute gradients. FlashAttention instead stores only the output O and the per-row softmax statistics (m, ℓ) — O(N) total — and recomputes the attention tiles on the fly during the backward pass.

Trade-off: Recomputation requires additional FLOPs in the backward pass (roughly 2× the FLOPs of the forward pass), but this is cheaper than the HBM bandwidth cost of storing and loading the N×N matrix. On modern hardware, attention is memory-bandwidth-bound, not compute-bound.
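The recomputation step can be illustrated with a short sketch: given a query block, a key block, and the saved per-row statistics (m, ℓ), any tile of P can be rebuilt on demand. The helper names are hypothetical, and `forward_stats` computes the statistics from full rows for brevity, whereas a real forward pass would obtain them from the online recurrence. The gradient computation itself is not shown.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def forward_stats(Q, K):
    """Per-row softmax statistics (m, ell): the O(N) data kept for backward."""
    stats = []
    for q in Q:
        s = [dot(q, k) for k in K]
        m = max(s)
        stats.append((m, sum(math.exp(x - m) for x in s)))
    return stats

def recompute_prob_tile(Qi, Kj, stats_i):
    """Rebuild one tile of P as exp(S_ij - m_i) / ell_i, without a stored P."""
    return [[math.exp(dot(q, k) - m) / ell for k in Kj]
            for q, (m, ell) in zip(Qi, stats_i)]

Q = [[0.1, 0.2], [0.3, -0.4]]
K = [[0.5, -0.1], [0.2, 0.8], [-0.3, 0.4], [0.9, 0.9]]
stats = forward_stats(Q, K)
left = recompute_prob_tile(Q, K[:2], stats)    # tile for keys 1..2
right = recompute_prob_tile(Q, K[2:], stats)   # tile for keys 3..4
```

Each row of `left` and `right` taken together sums to 1, which is a quick check that the rebuilt tiles are pieces of the same softmax rows the forward pass produced.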

6. Memory Complexity

What is stored	Standard attention	FlashAttention
Score matrix S	O(N²)	never stored
Probability matrix P	O(N²)	never stored
Output O	O(Nd)	O(Nd)
Softmax statistics (m, ℓ)	implicit in P	O(N)
Total	O(N²)	O(N)
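To put numbers on the table, the sketch below counts bytes for one attention head. The function name and parameters are illustrative, and it assumes FP16 for S, P, and O and FP32 for the statistics m and ℓ; those precisions are assumptions made for this example.

```python
def attention_stored_bytes(n, d, elem_bytes=2, stat_bytes=4):
    """Bytes held for one head: (standard attention, FlashAttention)."""
    output = n * d * elem_bytes                 # O is N x d in both cases
    standard = 2 * n * n * elem_bytes + output  # S and P are each N x N
    flash = output + 2 * n * stat_bytes         # only O plus m and ell per row
    return standard, flash

standard, flash = attention_stored_bytes(4096, 64)
print(standard, flash)  # 67633152 557056
```

For N = 4096 and d = 64 the two N×N matrices account for almost all of the standard total, while the FlashAttention total is dominated by the output O itself.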

7. Results

Method	Speedup at 2K	Speedup at 4K	Speedup at 8K	Memory scaling
Standard	1.0×	1.0×	OOM	O(N²)
FlashAttention	3.1×	3.8×	3.5×	O(N)

BERT: 15% end-to-end training speedup, 3× memory reduction
GPT-2: 3× faster attention operation with identical perplexity
Long context: Enables 64K context length training, compared to the ~2K practical limit of standard attention on the same hardware

8. Limitations

CUDA-specific: The original FlashAttention is a hand-written CUDA kernel. It is not automatically available in all deep learning frameworks without explicit integration.
Hardware-dependent block sizes: Optimal block sizes depend on the specific GPU's SRAM capacity, requiring tuning per hardware target.
GPU utilization: FlashAttention-1 did not fully utilize GPU compute units. FlashAttention-2 and FlashAttention-3 addressed this with further parallelism and warp-level optimizations.
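On the block-size point, Algorithm 1 of the original paper sets B_c = ⌈M/(4d)⌉ and B_r = min(⌈M/(4d)⌉, d), where M is the SRAM size and d the head dimension. The helper below is a small sketch of that rule only; its name is made up, M is treated as a count of elements, and the budget in the example is a hypothetical figure. Tuned kernels may pick block sizes differently.

```python
import math

def block_sizes(sram_elems, d):
    """Block sizes from the rule B_c = ceil(M / 4d), B_r = min(B_c, d)."""
    b_c = math.ceil(sram_elems / (4 * d))
    b_r = min(b_c, d)
    return b_r, b_c

# Hypothetical budget of 48 * 1024 elements with head dimension 64.
print(block_sizes(48 * 1024, 64))  # (64, 192)
```

A smaller budget shrinks both block sizes, which means more passes over the inputs; this is one reason the best setting differs from one GPU to another.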

9. Connections to Other Work

Attention Is All You Need

FlashAttention is a drop-in replacement for the O(N²) scaled dot-product attention introduced in this paper. It computes exactly the same mathematical operation, just with a dramatically better IO pattern.

LoRA

Frequently combined in practice: LoRA reduces the number of trainable parameters while FlashAttention reduces the memory and time cost of each forward pass. Together they enable fine-tuning large models on limited hardware.

LLaDA

Masked diffusion language models like LLaDA rely on FlashAttention for efficient training at scale, since they perform many forward passes per training step.

10. Additional Resources

FlashAttention (arXiv): Original paper
FlashAttention-2 (Dao 2023): Better parallelism and work partitioning for higher GPU utilization
FlashAttention-3 (2024): Hopper GPU optimizations: TMA, warp specialization, FP8 support