Attention Is All You Need

Vaswani et al. · NeurIPS 2017 · arXiv 1706.03762

TL;DR

The Transformer replaces recurrence entirely with self-attention. Each position can directly attend to every other position in a single step; no sequential computation is required. At the time, it achieved state-of-the-art on WMT translation tasks while training 3× faster than the best RNN models.

Transformer Pipeline

  • Problem (motivation): RNNs process tokens sequentially, giving O(n) sequential steps, vanishing gradients, and slow training.
  • Insight: attention can connect all positions in O(1) steps; no recurrence is needed, enabling pure parallelism.
  • Self-attention (core mechanism): each token queries all others through three learned projections per token: Q = XW^Q, K = XW^K, V = XW^V.
  • Scale + softmax (normalization): Attention(Q, K, V) = softmax(QKᵀ/√d_k) V; scaling prevents vanishing gradients in the softmax.
  • Multi-head (parallelism): run h = 8 attention heads in parallel, each with d_k = 64, then concatenate; this captures diverse relationships.
  • Encoder-decoder (full architecture): 6 encoder layers + 6 decoder layers; each layer combines multi-head attention, a feed-forward network, and LayerNorm.
  • Result: BLEU 28.4 on WMT EN-DE (new SOTA), +2.0 BLEU over the previous best, trained in 3.5 days on 8 P100 GPUs.

Takeaways: O(1) path length between any two tokens, full parallelism during training, and new SOTA on WMT EN-DE and EN-FR.

1. Background: Why Not RNNs?

Before the Transformer, sequence-to-sequence models relied on RNNs (LSTMs, GRUs). These models process tokens one at a time, left-to-right, making them inherently sequential. Three core problems motivated a fundamentally different approach:

  • Sequential computation prevents parallelization: To compute the hidden state at position t, you need the hidden state at t−1. This means no GPU parallelism across the sequence, so training is slow.
  • Gradient vanishing over long sequences: Gradients must flow backwards through many time steps. Even with LSTMs, signals from early tokens get diluted over long sequences.
  • O(n) path length between distant tokens: To connect token 1 and token 100, information must pass through 99 intermediate states. Long-range dependencies are hard to learn.

The table below compares the maximum path length between two tokens across different architectures; shorter paths mean easier learning of long-range dependencies:

Model                 Complexity / layer   Sequential ops   Max path length
Self-Attention        O(n²·d)              O(1)             O(1)
Recurrent (RNN)       O(n·d²)              O(n)             O(n)
Convolutional (CNN)   O(k·n·d²)            O(1)             O(log_k n)

n = sequence length, d = model dimension, k = kernel size. Self-attention achieves O(1) path length at the cost of O(n²) memory: the core trade-off.
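
To make the quadratic term concrete, here is a small back-of-envelope sketch (not from the paper) that estimates the size of one layer's attention-weight tensor at the base model's h = 8 heads, stored in float32:

```python
# Back-of-envelope size of one layer's attention weights (h heads, float32).
# Illustrative arithmetic only; activations, K/V tensors, and gradients are ignored.
def attention_matrix_bytes(n, heads=8, bytes_per_float=4):
    return heads * n * n * bytes_per_float

for n in (512, 4096, 32768):
    print(f"n={n:>5}: {attention_matrix_bytes(n) / 2**20:>8,.0f} MiB per layer")
```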

2. Scaled Dot-Product Attention

The core operation of the Transformer is Scaled Dot-Product Attention. Given queries Q, keys K, and values V, the output is a weighted sum of the values, where the weights come from the compatibility of queries and keys:

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

  • Q (queries): "What am I looking for?" Shape [n, d_k]; each row is one token's query vector.
  • K (keys): "What do I have to offer?" Shape [n, d_k]; each row is one token's key vector.
  • V (values): "What I will actually give you." Shape [n, d_v]; the actual content that gets aggregated.
  • d_k: dimension of the key (and query) vectors. In the base Transformer, d_k = 64 per head.
  • √d_k: scaling factor. Dot products grow in magnitude with d_k (their variance scales with d_k); dividing by √d_k keeps the softmax inputs in a reasonable range and prevents gradient vanishing.

We have 4 tokens: "The", "cat", "sat", "on". Let's trace how "cat" attends to all tokens with d_k = 4.

Step 1: Raw dot-product scores (Q_cat · K_token)
Q_cat · K_The = 0.8
Q_cat · K_cat = 2.1
Q_cat · K_sat = 0.3
Q_cat · K_on = 0.1
Step 2: Divide by √d_k = √4 = 2
[0.8, 2.1, 0.3, 0.1] / 2 = [0.40, 1.05, 0.15, 0.05]
Step 3: Softmax → attention weights (sum ≈ 1.0)
softmax([0.40, 1.05, 0.15, 0.05]) ≈ [0.23, 0.44, 0.18, 0.16]
"cat" attends most strongly to itself (0.44), then "The" (0.23)
Step 4: Weighted sum of value vectors
output = 0.23 × V_The + 0.44 × V_cat + 0.18 × V_sat + 0.16 × V_on
The output for "cat" is dominated by its own value (44%) and the preceding context "The" (23%).
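
The same computation, written out as a minimal NumPy sketch (an illustration, not the paper's code); the shapes follow the [n, d_k] convention from the definitions above, and the last lines sanity-check the worked example's softmax step.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # Steps 1-2: scores, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # Step 3: row-wise softmax
    return weights @ V, weights                         # Step 4: weighted sum of values

# Random toy inputs: 4 tokens, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))                             # each row sums to 1.0

# Sanity check against the worked example: softmax of the scaled scores for "cat".
scaled = np.array([0.8, 2.1, 0.3, 0.1]) / np.sqrt(4)
w = np.exp(scaled) / np.exp(scaled).sum()
print(np.round(w, 2))                                   # [0.23 0.44 0.18 0.16]
```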

3. Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional queries, keys, and values, the paper found it beneficial to linearly project them h times to d_k, d_k, and d_v dimensions respectively and run attention in parallel:

Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
Each head
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

  • h = 8: number of attention heads run in parallel.
  • d_model = 512: total model dimension, the size of each token's representation.
  • d_k = d_v = 64: dimension per head (512 / 8 = 64); each head operates in a 64-dimensional subspace.
  • W_i^Q, W_i^K ∈ ℝ^(d_model × d_k): learned projection matrices, one set per head.
  • W^O ∈ ℝ^(h·d_v × d_model): output projection that maps the concatenated heads back to d_model.

Why multiple heads? Each head can specialize in a different type of relationship. One head might learn syntactic dependencies (verb-subject agreement), another semantic similarity (synonyms), another coreference resolution (pronouns), and yet another local positional context. Single-head attention must average over all these signals at once.
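
A compact sketch of multi-head attention built on the scaled_dot_product_attention function above, using the base settings h = 8, d_model = 512, d_k = d_v = 64. It is an illustration under those assumptions, not the reference implementation; the weight matrices here are random placeholders rather than learned parameters.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Project X into h heads, attend in each head, concatenate, project back."""
    n, d_model = X.shape
    d_k = d_model // h                                   # 512 / 8 = 64 per head
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)               # this head's projection columns
        out, _ = scaled_dot_product_attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl])
        heads.append(out)                                # [n, d_k] per head
    return np.concatenate(heads, axis=-1) @ W_o          # [n, h*d_k] -> [n, d_model]

# Placeholder (untrained) weights with the base model's shapes.
rng = np.random.default_rng(0)
d_model = 512
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(10, d_model))                       # 10 tokens
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape) # (10, 512)
```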

4. Positional Encoding

Self-attention is permutation-equivariant: shuffling the tokens produces shuffled outputs with no other change. The model has no built-in notion of order. To inject position information, the paper adds a positional encoding to each token embedding before the first layer:

Positional encoding (even dimensions)
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
Positional encoding (odd dimensions)
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

With d_model = 4, we have 4 encoding dimensions (i = 0 and i = 1, each with sin and cos). The frequencies are 1/10000^0 = 1 and 1/10000^(2/4) = 1/100 = 0.01.

pos   sin(pos)   cos(pos)   sin(0.01·pos)   cos(0.01·pos)
0      0.000      1.000      0.000           1.000
1      0.841      0.540      0.010           1.000
2      0.909     -0.416      0.020           1.000

The high-frequency dimensions (sin/cos with frequency 1) change rapidly and distinguish nearby positions. The low-frequency dimensions (frequency 0.01) change slowly and encode coarser position information. Together, they form a unique vector for every position, like a binary clock but continuous.
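
A minimal NumPy sketch of the sinusoidal encoding (an illustration, not the reference code); with d_model = 4 and pos = 0, 1, 2 it reproduces the table above.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(num_positions)[:, None]              # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]                  # [1, d_model/2]
    angles = pos / (10000.0 ** (2 * i / d_model))         # one frequency per sin/cos pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

print(np.round(positional_encoding(3, 4), 3))
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.01   1.   ]
#  [ 0.909 -0.416  0.02   1.   ]]
```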

5. Full Architecture

The Transformer follows an encoder-decoder structure. Both encoder and decoder are stacks of N=6 identical layers.

Encoder (6 layers)

  • 1. Multi-head self-attention
  • 2. Add & LayerNorm
  • 3. Feed-forward network (d_ff = 2048)
  • 4. Add & LayerNorm

Decoder (6 layers)

  • 1. Masked multi-head self-attention
  • 2. Add & LayerNorm
  • 3. Multi-head cross-attention (over encoder output)
  • 4. Add & LayerNorm
  • 5. Feed-forward network (d_ff = 2048)
  • 6. Add & LayerNorm

The decoder uses masked self-attention to prevent positions from attending to future positions (causal masking), which is essential for autoregressive generation.
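
In practice the mask is applied to the raw scores before the softmax: future positions are set to −∞ so their attention weights come out exactly zero. A small sketch under the same assumptions as the earlier snippets, not the paper's code:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask that is True where column j > row i (the future)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_attention_weights(Q, K):
    """Attention weights with future positions blocked before the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), -np.inf, scores)   # -inf -> weight 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(np.round(masked_attention_weights(Q, K), 2))
# Row i has non-zero weights only for positions 0..i (lower triangle).
```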


6. Experiments & Results

The Transformer was evaluated on the WMT 2014 English-German and English-French translation tasks. The results were striking: not just a new SOTA, but one achieved at a fraction of the training cost:

WMT 2014 EN-DE: 28.4 BLEU

New state-of-the-art, +2.0 BLEU over the previous best ensemble model. First single model to surpass the best ensemble.

WMT 2014 EN-FR: 41.0 BLEU

New state-of-the-art, surpassing all previous models with a single model at lower training cost.

Training Cost: 3.5 days

8 NVIDIA P100 GPUs, 3.5 days (big model: 213M params). Previous best RNN models required weeks; roughly 3× faster training than comparable RNN models.

7. Limitations

  • Quadratic memory in sequence length: The attention matrix is n × n. For a sequence of 1000 tokens, that is 1,000,000 attention weights per head per layer. Long documents (books, code repositories) quickly exhaust GPU memory.
  • No inherent notion of order: Self-attention is permutation-equivariant. Position must be injected manually via positional encoding. If the encoding is removed or disrupted, the model becomes a bag-of-words.
  • Fixed context window: The model can only attend to n tokens at once. There is no mechanism for processing sequences longer than the context window.
  • Absolute positional encoding: The sinusoidal encoding uses absolute positions. While it can extrapolate to longer sequences, the model struggles when inference-time sequences are longer than those seen during training. Later work (RoPE, ALiBi) addressed this with relative encodings.

8. Connections to Other Work

BERT (coming soon)

Uses the Transformer encoder stack for bidirectional language modeling. Pre-trains with masked language modeling (MLM) on randomly masked tokens and with next-sentence prediction (NSP). Became the foundation for NLP fine-tuning.

GPT (coming soon)

Uses the Transformer decoder stack for autoregressive language modeling. Generates text left-to-right using causal (masked) self-attention. The architecture behind GPT-2, GPT-3, and ChatGPT.

LLaDA

Uses the Transformer as the backbone for masked diffusion language modeling. Replaces autoregressive decoding with iterative denoising over masked tokens, but the attention mechanism underneath is identical to the original Transformer.

FlashAttention (coming soon)

Optimizes the O(n²) attention computation using IO-aware tiling. Computes exactly the same attention as the original Transformer, but avoids materializing the full n × n attention matrix in GPU HBM, enabling much longer context windows.

9. Additional Resources