TL;DR
The Transformer replaces recurrence entirely with self-attention. Each position can directly attend to every other position in a single step, with no sequential computation required. At the time, it achieved state-of-the-art on WMT translation tasks while training 3× faster than the best RNN models.
1. Background: Why Not RNNs?
Before the Transformer, sequence-to-sequence models relied on RNNs (LSTMs, GRUs). These models process tokens one at a time, left-to-right, making them inherently sequential. Three core problems motivated a fundamentally different approach:
- Sequential computation prevents parallelization: To compute the hidden state at position t, you need the hidden state at position t−1. This means no GPU parallelism across the sequence, so training is slow.
- Gradient vanishing over long sequences: Gradients must flow backwards through many time steps. Even with LSTMs, signals from early tokens get diluted over long sequences.
- O(n) path length between distant tokens: To connect token 1 and token 100, information must pass through 99 intermediate states. Long-range dependencies are hard to learn.
The table below compares the maximum path length between two tokens across different architectures; shorter paths mean easier learning of long-range dependencies:
| Model | Complexity / layer | Sequential ops | Max path length |
|---|---|---|---|
| Self-Attention | O(n² · d) | O(1) | O(1) |
| Recurrent (RNN) | O(n · d²) | O(n) | O(n) |
| Convolutional (CNN) | O(k · n · d²) | O(1) | O(log_k n) |
n = sequence length, d = model dimension, k = kernel size. Self-attention achieves O(1) path length at the cost of O(n²) memory: the core trade-off.
2. Scaled Dot-Product Attention
The core operation of the Transformer is Scaled Dot-Product Attention. Given queries Q, keys K, and values V, the output is a weighted sum of the values, where the weights come from the compatibility of queries and keys:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The scaling factor 1/√d_k keeps the dot products from growing large in magnitude, which would push the softmax into regions with vanishingly small gradients.
We have 4 tokens: "The", "cat", "sat", "on". Let's trace how "cat" attends to all tokens with d_k = 4.
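The trace above can be sketched in a few lines of pure Python. The 4-dimensional embedding vectors below are made-up illustrative values (the original uses learned embeddings), and for simplicity we let Q = K = V = the raw embeddings, with no learned projections:

```python
import math

# Hypothetical 4-dimensional vectors for "The", "cat", "sat", "on" (made-up values).
tokens = ["The", "cat", "sat", "on"]
X = [
    [1.0, 0.0, 1.0, 0.0],  # The
    [0.0, 1.0, 0.0, 1.0],  # cat
    [1.0, 1.0, 0.0, 0.0],  # sat
    [0.0, 0.0, 1.0, 1.0],  # on
]
d_k = 4

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Output is the attention-weighted sum of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out

# How "cat" (position 1) attends to all four tokens.
weights, out = attend(X[1], X, X)
for tok, w in zip(tokens, weights):
    print(f"{tok}: {w:.3f}")
```

With these toy vectors, "cat" attends most strongly to itself (its dot product with itself is largest), and the weights sum to 1 as a softmax must.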
3. Multi-Head Attention
Instead of performing a single attention function with d_model-dimensional queries, keys, and values, the paper found it beneficial to linearly project them h times to d_k, d_k, and d_v dimensions respectively and run attention in parallel:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Why multiple heads? Each head can specialize in a different type of relationship. One head might learn syntactic dependencies (verb-subject agreement), another semantic similarity (synonyms), another coreference resolution (pronouns), and yet another local positional context. Single-head attention must average over all these signals at once.
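A minimal sketch of the project-attend-concatenate pattern, assuming a toy d_model = 8 split across h = 2 heads; the projection matrices here are random placeholders, whereas in the real model they are learned:

```python
import math, random

random.seed(0)
d_model, h = 8, 2
d_k = d_model // h  # each head works in a 4-dimensional subspace

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def attention(Q, K, V, d):
    """Scaled dot-product attention over lists of row vectors."""
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[i] for wi, v in zip(w, V)) for i in range(len(V[0]))])
    return out

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Per-head projections W_i^Q, W_i^K, W_i^V (random here, learned in practice).
heads = [{name: rand_matrix(d_k, d_model) for name in "QKV"} for _ in range(h)]
W_O = rand_matrix(d_model, d_model)  # output projection

def multi_head(X):
    outputs = []
    for head in heads:
        Q = [matvec(head["Q"], x) for x in X]
        K = [matvec(head["K"], x) for x in X]
        V = [matvec(head["V"], x) for x in X]
        outputs.append(attention(Q, K, V, d_k))
    # Concatenate the h head outputs per token, then apply W_O.
    concat = [sum((o[i] for o in outputs), []) for i in range(len(X))]
    return [matvec(W_O, c) for c in concat]

X = [[random.random() for _ in range(d_model)] for _ in range(3)]  # 3 tokens
Y = multi_head(X)
```

Note that the per-head cost shrinks with d_k, so the total cost of h heads is similar to one full-width head.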
4. Positional Encoding
Self-attention is permutation-equivariant: shuffling the tokens produces shuffled outputs with no other change. The model has no built-in notion of order. To inject position information, the paper adds a positional encoding to each token embedding before the first layer:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
With d_model = 4, we have 4 encoding dimensions (i = 0 and i = 1, each with sin and cos). The frequencies are 1/10000^0 = 1 and 1/10000^(2/4) = 1/100 = 0.01.
| pos | sin(pos) | cos(pos) | sin(0.01Β·pos) | cos(0.01Β·pos) |
|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.010 | 1.000 |
| 2 | 0.909 | -0.416 | 0.020 | 1.000 |
The high-frequency dimensions (sin/cos with frequency 1) change rapidly and distinguish nearby positions. The low-frequency dimensions (frequency 0.01) change slowly and encode coarser position information. Together, they form a unique vector for every position, like a binary clock but continuous.
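The table above can be reproduced directly from the sinusoidal formula; a short sketch with d_model = 4:

```python
import math

def positional_encoding(pos, d_model=4):
    """Sinusoidal encoding: pair i uses frequency 1/10000^(2i/d_model),
    with sin on even dimensions and cos on odd dimensions."""
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding(pos)])
# pos 1 gives [0.841, 0.540, 0.010, 1.000], matching the table row for pos = 1.
```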
5. Full Architecture
The Transformer follows an encoder-decoder structure. Both encoder and decoder are stacks of N=6 identical layers.
Encoder (6 layers)
1. Multi-head self-attention
2. Add & LayerNorm
3. Feed-forward network (d_ff = 2048)
4. Add & LayerNorm
Decoder (6 layers)
1. Masked multi-head self-attention
2. Add & LayerNorm
3. Multi-head cross-attention (over encoder output)
4. Add & LayerNorm
5. Feed-forward network (d_ff = 2048)
6. Add & LayerNorm
The decoder uses masked self-attention to prevent positions from attending to future positions (causal masking), which is essential for autoregressive generation.
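Causal masking is typically implemented by adding −∞ to the attention scores at future positions before the softmax, so their weights become exactly zero. A minimal sketch:

```python
import math

def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    masked = [s + m for s, m in zip(scores, mask_row)]
    mx = max(masked)
    exps = [math.exp(v - mx) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Row 1: the second token sees only positions 0 and 1; future positions get weight 0.
weights = masked_softmax([0.0, 1.0, 0.5, 0.5], mask[1])
print([round(w, 3) for w in weights])  # last two entries are exactly 0.0
```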
6. Experiments & Results
The Transformer was evaluated on WMT 2014 English-German and English-French translation tasks. Results were striking: not just a new SOTA, but one achieved at a fraction of the training cost:
- English→German: 28.4 BLEU, a new state of the art, +2.0 BLEU over the previous best ensemble. First single model to surpass the best ensemble.
- English→French: 41.8 BLEU, a new single-model state of the art at lower training cost than all previous models.
- Training cost: 8 NVIDIA P100 GPUs for 3.5 days (big model: 213M parameters). Previous best RNN models required weeks; roughly 3× faster training than comparable RNN models.
7. Limitations
- Quadratic memory in sequence length: The attention matrix is n × n, so for a sequence of 1000 tokens, that's 1,000,000 attention weights per head per layer. Long documents (books, code repositories) quickly exhaust GPU memory.
- No inherent notion of order: Self-attention is permutation-equivariant. Position must be injected manually via positional encoding. If the encoding is removed or disrupted, the model becomes a bag-of-words.
- Fixed context window: The model can only attend to n tokens at once. There is no mechanism for processing sequences longer than the context window.
- Absolute positional encoding: The sinusoidal encoding uses absolute positions. While it can extrapolate to longer sequences, the model struggles when inference-time sequences are longer than those seen during training. Later work (RoPE, ALiBi) addressed this with relative encodings.
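The quadratic-memory limitation is easy to quantify with a back-of-envelope calculation. The sketch below assumes the base-model configuration (8 heads, 6 layers) and fp32 storage of every attention matrix; these are illustrative assumptions, not figures from the paper:

```python
def attention_memory_bytes(n, n_heads=8, n_layers=6, bytes_per_float=4):
    """Bytes needed to materialize every n x n attention matrix at once (fp32)."""
    return n * n * n_heads * n_layers * bytes_per_float

for n in (1_000, 10_000, 100_000):
    gb = attention_memory_bytes(n) / 1e9
    print(f"n = {n:>7}: {gb:,.2f} GB")
# Growth is quadratic: a 10x longer sequence needs 100x the memory.
```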
8. Connections to Other Work
BERT: Uses the Transformer encoder stack for bidirectional language modeling. Pre-trains by masking random tokens (MLM) and predicting whether two sentences are adjacent (NSP). Became the foundation for NLP fine-tuning.
GPT: Uses the Transformer decoder stack for autoregressive language modeling. Generates text left-to-right using causal (masked) self-attention. The architecture behind GPT-2, GPT-3, and ChatGPT.
Masked diffusion LMs: Use the Transformer as backbone for masked diffusion language modeling. Replace autoregressive decoding with iterative denoising over masked tokens, but the attention mechanism underneath is identical to the original Transformer.
FlashAttention: Optimizes the O(n²) attention computation using IO-aware tiling. Computes the exact same attention as the original Transformer, but avoids materializing the full n × n attention matrix in GPU HBM, enabling much longer context windows.