Attention Is All You Need

Vaswani et al. · NeurIPS 2017 · arXiv 1706.03762

TL;DR

The Transformer replaces recurrence entirely with self-attention. Each position can directly attend to every other position in a single step; no sequential computation is required. At the time, it achieved state-of-the-art on WMT translation tasks while training 3× faster than the best RNN models.

Transformer Pipeline

  • Problem (motivation): RNNs process tokens sequentially, giving O(n) sequential steps, vanishing gradients, and slow training.
  • Insight: attention can connect all positions in O(1) steps; no recurrence is needed, enabling pure parallelism.
  • Self-attention (core mechanism): each token queries all others through three learned projections per token: Q = XW^Q, K = XW^K, V = XW^V.
  • Scale + softmax (normalization): Attention(Q, K, V) = softmax(QKᵀ/√d_k) V; scaling prevents vanishing gradients in the softmax.
  • Multi-head (parallelism): run h = 8 attention heads in parallel, each with d_k = 64, then concatenate; this captures diverse relationships.
  • Encoder-decoder (full architecture): 6 encoder layers + 6 decoder layers; each layer combines multi-head attention, a feed-forward network, and LayerNorm.
  • Result: BLEU 28.4 on WMT EN-DE (new SOTA), +2.0 BLEU over the previous best, trained in 3.5 days on 8 P100 GPUs.

Takeaways: O(1) path length between any two tokens, full parallelism during training, and new SOTA on WMT EN-DE and EN-FR.

1. Background: Why Not RNNs?

Before the Transformer, sequence-to-sequence models relied on RNNs (LSTMs, GRUs). These models process tokens one at a time, left-to-right, making them inherently sequential. Three core problems motivated a fundamentally different approach:

  • Sequential computation prevents parallelization: To compute the hidden state at position t, you need the hidden state at t−1. This means no GPU parallelism across the sequence, so training is slow.
  • Gradient vanishing over long sequences: Gradients must flow backwards through many time steps. Even with LSTMs, signals from early tokens get diluted over long sequences.
  • O(n) path length between distant tokens: To connect token 1 and token 100, information must pass through 99 intermediate states. Long-range dependencies are hard to learn.

The table below compares the maximum path length between two tokens across different architectures; shorter paths mean easier learning of long-range dependencies:

Model                 Complexity / layer   Sequential ops   Max path length
Self-Attention        O(n²·d)              O(1)             O(1)
Recurrent (RNN)       O(n·d²)              O(n)             O(n)
Convolutional (CNN)   O(k·n·d²)            O(1)             O(log_k n)

n = sequence length, d = model dimension, k = kernel size. Self-attention achieves O(1) path length at the cost of O(n²) memory: the core trade-off.
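
To make the quadratic term concrete, here is a small back-of-envelope sketch (not from the paper) that estimates the size of one layer's attention-weight tensor at the base model's h = 8 heads, stored in float32:

```python
# Back-of-envelope size of one layer's attention weights (h heads, float32).
# Illustrative arithmetic only; activations, K/V tensors, and gradients are ignored.
def attention_matrix_bytes(n, heads=8, bytes_per_float=4):
    return heads * n * n * bytes_per_float

for n in (512, 4096, 32768):
    print(f"n={n:>5}: {attention_matrix_bytes(n) / 2**20:>8,.0f} MiB per layer")
```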

2. Scaled Dot-Product Attention

The core operation of the Transformer is Scaled Dot-Product Attention. Given queries Q, keys K, and values V, the output is a weighted sum of the values, where the weights come from the compatibility of queries and keys:

Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

  • Q (queries): "What am I looking for?" Shape [n, d_k]; each row is one token's query vector.
  • K (keys): "What do I have to offer?" Shape [n, d_k]; each row is one token's key vector.
  • V (values): "What I will actually give you." Shape [n, d_v]; the actual content that gets aggregated.
  • d_k: dimension of the key (and query) vectors. In the base Transformer, d_k = 64 per head.
  • √d_k: scaling factor. Dot products grow in magnitude with d_k (their variance scales with d_k); dividing by √d_k keeps the softmax inputs in a reasonable range and prevents gradient vanishing.

We have 4 tokens: "The", "cat", "sat", "on". Let's trace how "cat" attends to all tokens with d_k = 4.

Step 1: Raw dot-product scores (Q_cat · K_token)
Q_cat · K_The = 0.8
Q_cat · K_cat = 2.1
Q_cat · K_sat = 0.3
Q_cat · K_on = 0.1
Step 2: Divide by √d_k = √4 = 2
[0.8, 2.1, 0.3, 0.1] / 2 = [0.40, 1.05, 0.15, 0.05]
Step 3: Softmax → attention weights (sum ≈ 1.0)
softmax([0.40, 1.05, 0.15, 0.05]) ≈ [0.23, 0.44, 0.18, 0.16]
"cat" attends most strongly to itself (0.44), then "The" (0.23)
Step 4: Weighted sum of value vectors
output = 0.23 × V_The + 0.44 × V_cat + 0.18 × V_sat + 0.16 × V_on
The output for "cat" is dominated by its own value (44%) and the preceding context "The" (23%).
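
The same computation, written out as a minimal NumPy sketch (an illustration, not the paper's code); the shapes follow the [n, d_k] convention from the definitions above, and the last lines sanity-check the worked example's softmax step.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # Steps 1-2: scores, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # Step 3: row-wise softmax
    return weights @ V, weights                         # Step 4: weighted sum of values

# Random toy inputs: 4 tokens, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))                             # each row sums to 1.0

# Sanity check against the worked example: softmax of the scaled scores for "cat".
scaled = np.array([0.8, 2.1, 0.3, 0.1]) / np.sqrt(4)
w = np.exp(scaled) / np.exp(scaled).sum()
print(np.round(w, 2))                                   # [0.23 0.44 0.18 0.16]
```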

3. Multi-Head Attention

Instead of performing a single attention function with d_model-dimensional queries, keys, and values, the paper found it beneficial to linearly project them h times to d_k, d_k, and d_v dimensions respectively and run attention in parallel:

Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
Each head
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

  • h = 8: number of attention heads run in parallel.
  • d_model = 512: total model dimension, the size of each token's representation.
  • d_k = d_v = 64: dimension per head (512 / 8 = 64); each head operates in a 64-dimensional subspace.
  • W_i^Q, W_i^K ∈ ℝ^(d_model × d_k): learned projection matrices, one set per head.
  • W^O ∈ ℝ^(h·d_v × d_model): output projection that maps the concatenated heads back to d_model.

Why multiple heads? Each head can specialize in a different type of relationship. One head might learn syntactic dependencies (verb-subject agreement), another semantic similarity (synonyms), another coreference resolution (pronouns), and yet another local positional context. Single-head attention must average over all these signals at once.
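
A compact sketch of multi-head attention built on the scaled_dot_product_attention function above, using the base settings h = 8, d_model = 512, d_k = d_v = 64. It is an illustration under those assumptions, not the reference implementation; the weight matrices here are random placeholders rather than learned parameters.

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """Project X into h heads, attend in each head, concatenate, project back."""
    n, d_model = X.shape
    d_k = d_model // h                                   # 512 / 8 = 64 per head
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)               # this head's projection columns
        out, _ = scaled_dot_product_attention(X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl])
        heads.append(out)                                # [n, d_k] per head
    return np.concatenate(heads, axis=-1) @ W_o          # [n, h*d_k] -> [n, d_model]

# Placeholder (untrained) weights with the base model's shapes.
rng = np.random.default_rng(0)
d_model = 512
W_q, W_k, W_v, W_o = (0.02 * rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(10, d_model))                       # 10 tokens
print(multi_head_attention(X, W_q, W_k, W_v, W_o).shape) # (10, 512)
```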

4. Positional Encoding

Self-attention is permutation-equivariant: shuffling the tokens produces shuffled outputs with no other change. The model has no built-in notion of order. To inject position information, the paper adds a positional encoding to each token embedding before the first layer:

Positional encoding (even dimensions)
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
Positional encoding (odd dimensions)
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

With d_model = 4, we have 4 encoding dimensions (i = 0 and i = 1, each with sin and cos). The frequencies are 1/10000^0 = 1 and 1/10000^(2/4) = 1/100 = 0.01.

pos   sin(pos)   cos(pos)   sin(0.01·pos)   cos(0.01·pos)
0      0.000      1.000      0.000           1.000
1      0.841      0.540      0.010           1.000
2      0.909     -0.416      0.020           1.000

The high-frequency dimensions (sin/cos with frequency 1) change rapidly and distinguish nearby positions. The low-frequency dimensions (frequency 0.01) change slowly and encode coarser position information. Together, they form a unique vector for every position, like a binary clock but continuous.
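
A minimal NumPy sketch of the sinusoidal encoding (an illustration, not the reference code); with d_model = 4 and pos = 0, 1, 2 it reproduces the table above.

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(num_positions)[:, None]              # [num_positions, 1]
    i = np.arange(d_model // 2)[None, :]                  # [1, d_model/2]
    angles = pos / (10000.0 ** (2 * i / d_model))         # one frequency per sin/cos pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions
    return pe

print(np.round(positional_encoding(3, 4), 3))
# [[ 0.     1.     0.     1.   ]
#  [ 0.841  0.54   0.01   1.   ]
#  [ 0.909 -0.416  0.02   1.   ]]
```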

5. Full Architecture

The Transformer follows an encoder-decoder structure. Both encoder and decoder are stacks of N=6 identical layers.

Encoder (6 layers)

  • 1. Multi-head self-attention
  • 2. Add & LayerNorm
  • 3. Feed-forward network (d_ff = 2048)
  • 4. Add & LayerNorm

Decoder (6 layers)

  • 1. Masked multi-head self-attention
  • 2. Add & LayerNorm
  • 3. Multi-head cross-attention (over encoder output)
  • 4. Add & LayerNorm
  • 5. Feed-forward network (d_ff = 2048)
  • 6. Add & LayerNorm

The decoder uses masked self-attention to prevent positions from attending to future positions (causal masking), which is essential for autoregressive generation.
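
In practice the mask is applied to the raw scores before the softmax: future positions are set to −∞ so their attention weights come out exactly zero. A small sketch under the same assumptions as the earlier snippets, not the paper's code:

```python
import numpy as np

def causal_mask(n):
    """Boolean mask that is True where column j > row i (the future)."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_attention_weights(Q, K):
    """Attention weights with future positions blocked before the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = np.where(causal_mask(len(Q)), -np.inf, scores)   # -inf -> weight 0
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(np.round(masked_attention_weights(Q, K), 2))
# Row i has non-zero weights only for positions 0..i (lower triangle).
```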


6. Experiments & Results

The Transformer was evaluated on the WMT 2014 English-German and English-French translation tasks. The results were striking: not just a new SOTA, but one achieved at a fraction of the training cost:

WMT 2014 EN-DE: 28.4 BLEU

New state-of-the-art, +2.0 BLEU over the previous best ensemble model. First single model to surpass the best ensemble.

WMT 2014 EN-FR: 41.0 BLEU

New state-of-the-art, surpassing all previous models with a single model at lower training cost.

Training Cost: 3.5 days

8 NVIDIA P100 GPUs, 3.5 days (big model: 213M params). Previous best RNN models required weeks; roughly 3× faster training than comparable RNN models.

7. Limitations

  • Quadratic memory in sequence length: The attention matrix is n × n. For a sequence of 1000 tokens, that is 1,000,000 attention weights per head per layer. Long documents (books, code repositories) quickly exhaust GPU memory.
  • No inherent notion of order: Self-attention is permutation-equivariant. Position must be injected manually via positional encoding. If the encoding is removed or disrupted, the model becomes a bag-of-words.
  • Fixed context window: The model can only attend to n tokens at once. There is no mechanism for processing sequences longer than the context window.
  • Absolute positional encoding: The sinusoidal encoding uses absolute positions. While it can extrapolate to longer sequences, the model struggles when inference-time sequences are longer than those seen during training. Later work (RoPE, ALiBi) addressed this with relative encodings.

8. Connections to Other Work

BERT (coming soon)

Uses the Transformer encoder stack for bidirectional language modeling. Pre-trains with masked language modeling (MLM) on randomly masked tokens and with next-sentence prediction (NSP). Became the foundation for NLP fine-tuning.

GPT (coming soon)

Uses the Transformer decoder stack for autoregressive language modeling. Generates text left-to-right using causal (masked) self-attention. The architecture behind GPT-2, GPT-3, and ChatGPT.

LLaDA

Uses the Transformer as the backbone for masked diffusion language modeling. Replaces autoregressive decoding with iterative denoising over masked tokens, but the attention mechanism underneath is identical to the original Transformer.

FlashAttention (coming soon)

Optimizes the O(n²) attention computation using IO-aware tiling. Computes exactly the same attention as the original Transformer, but avoids materializing the full n × n attention matrix in GPU HBM, enabling much longer context windows.

9. Additional Resources