TL;DR
Vanilla seq2seq forces the encoder to compress an entire source sentence into a single fixed-size vector, a catastrophic bottleneck for long sentences. Bahdanau et al. fix this by letting the decoder dynamically attend to all encoder hidden states at every output step. The decoder learns a soft alignment over source positions, forming a weighted context vector rather than relying on one frozen summary. This is the origin of attention in deep learning and the direct ancestor of the Transformer's cross-attention.
1. The Fixed-Length Bottleneck
The standard encoder-decoder architecture (Cho et al. 2014, Sutskever et al. 2014) works as follows: an RNN encoder reads the source sequence token by token, updating a hidden state at each step. When the last token is read, the final hidden state is passed to the decoder as the sole context. The decoder then generates the target sequence from this one vector.
The problem
Everything the decoder will ever use from the source sentence must be encoded in a single vector of fixed dimension (e.g., 1000 floats). For a 30-word sentence that is already tight; for a 50-word sentence performance degrades noticeably. The encoder must decide upfront which details to preserve, without knowing what the decoder will ask for.
2. The Attention Solution
Instead of compressing everything into one vector, Bahdanau et al. keep all encoder hidden states and let the decoder attend to them. Think of it as giving the decoder a dynamic spotlight: at each output step, it chooses which parts of the source to focus on.
The key insight: instead of a single context vector c shared across all decoder steps, compute a distinct context vector c_t at every decoder step t. This c_t is a weighted combination of all encoder hidden states, where the weights are learned automatically and reflect which source positions are most relevant for generating the current target word.
Bidirectional RNN encoder
The paper introduces a bidirectional RNN encoder. Each source position j gets a hidden state h_j that is the concatenation of a forward and a backward pass:

h_j = [→h_j ; ←h_j]

This means h_j contains context from both directions, covering what comes before and after position j. The annotation h_j is a richer representation than a unidirectional hidden state.
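A minimal sketch of how these annotations can be formed. The toy dimensions, random weights, and plain tanh cell are all assumptions for illustration; the paper uses a gated (GRU-like) unit:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                        # toy embedding / hidden sizes
W = rng.normal(size=(d_h, d_in))
U = rng.normal(size=(d_h, d_h))

def rnn_step(x, h):
    # Plain tanh RNN cell standing in for the paper's gated unit.
    return np.tanh(W @ x + U @ h)

x = rng.normal(size=(3, d_in))          # embeddings for ["how", "are", "you"]

h_fwd, h = [], np.zeros(d_h)            # forward pass: left to right
for t in range(3):
    h = rnn_step(x[t], h)
    h_fwd.append(h)

h_bwd, h = [None] * 3, np.zeros(d_h)    # backward pass: right to left
for t in reversed(range(3)):
    h = rnn_step(x[t], h)
    h_bwd[t] = h

# Annotation h_j concatenates both directions, so it has dimension 2 * d_h
annotations = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
print(annotations[0].shape)             # (6,)
```

Because the backward state at position j has already read everything to the right of j, each annotation summarizes the whole sentence centered on position j.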
3. Computing Alignment Scores
The alignment model assigns a scalar score e_{tj} to each (decoder step t, encoder position j) pair. This score measures how relevant encoder state h_j is when the decoder is at step t with hidden state s_{t-1}.
Additive (concat) alignment model
e_{tj} = v_a^T tanh(W_a s_{t-1} + U_a h_j)

where W_a and U_a are learned weight matrices and v_a is a learned weight vector. This is a small single-hidden-layer MLP that takes both the previous decoder state s_{t-1} and an encoder state h_j as inputs.
This is sometimes called "additive attention" or "concat attention", as opposed to the dot-product attention used in the Transformer. The MLP allows the model to learn complex nonlinear interactions between the decoder state and encoder state before collapsing to a scalar score.
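A minimal numpy sketch of this scoring MLP. The toy sizes and random weights are assumptions, not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_h, d_a = 5, 6, 4                 # toy decoder-state / annotation / alignment dims
W_a = rng.normal(size=(d_a, d_s))       # projects the decoder state s_{t-1}
U_a = rng.normal(size=(d_a, d_h))       # projects an encoder annotation h_j
v_a = rng.normal(size=d_a)              # collapses the hidden layer to a scalar

def align(s_prev, h_j):
    # e_{tj} = v_a^T tanh(W_a s_{t-1} + U_a h_j): a one-hidden-layer MLP
    return v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)

s_prev = rng.normal(size=d_s)           # decoder state s_{t-1}
H = rng.normal(size=(3, d_h))           # annotations h_1..h_3
scores = np.array([align(s_prev, h_j) for h_j in H])
print(scores.shape)                     # (3,) -- one scalar e_{tj} per source position
```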
4. Attention Weights and Context Vector
The raw alignment scores e_{tj} are turned into a probability distribution over source positions via softmax. These attention weights α_{tj} tell us how much the decoder should focus on encoder position j when generating output at step t.
Step 1: Softmax over alignment scores
α_{tj} = exp(e_{tj}) / Σ_{k=1}^{T_x} exp(e_{tk})

The sum in the denominator runs over all T_x source positions. The result satisfies α_{tj} ∈ (0, 1) and Σ_j α_{tj} = 1: it is a proper probability distribution.
Step 2: Weighted sum β context vector
c_t = Σ_j α_{tj} h_j

c_t is the context vector for decoder step t. It is a weighted average of all encoder hidden states, where the weights are the attention probabilities. If α_{t3} = 0.8, the context vector is dominated by encoder state h_3.
Step 3: Context-aware decoder update
s_t = f(s_{t-1}, y_{t-1}, c_t)

The decoder hidden state s_t depends on the previous hidden state s_{t-1}, the previously generated word y_{t-1}, and the step-specific context c_t (f is the decoder's gated RNN update). The output word y_t is then predicted from s_t and c_t jointly.
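The three steps fit in a few lines. This sketch assumes toy 2-d annotations and made-up scores, and leaves out the decoder's recurrent update, which would consume c_t:

```python
import numpy as np

def softmax(e):
    e = np.exp(e - e.max())             # subtract max for numerical stability
    return e / e.sum()

def attention_step(scores, H):
    """Scores e_{tj} -> weights alpha_{tj} -> context c_t."""
    alpha = softmax(scores)             # Step 1: distribution over source positions
    c_t = alpha @ H                     # Step 2: weighted sum of annotations
    return alpha, c_t                   # Step 3 (decoder update) would consume c_t

H = np.array([[1.0, 0.0],               # toy annotations h_1..h_3
              [0.0, 1.0],
              [1.0, 1.0]])
alpha, c_t = attention_step(np.array([1.0, 0.2, -0.5]), H)
# alpha ~ [0.60, 0.27, 0.13]; c_t ~ [0.73, 0.40]
```

Note that everything here is differentiable, so the alignment weights are trained end-to-end with the rest of the network.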
5. Worked Example: Translating "how are you"
Let's trace through computing attention for the first decoder step when translating "how are you" → "comment allez-vous". This is a toy example to build intuition; real models have higher dimensions and more nuanced weights.
Setup
- Source: ["how", "are", "you"] β positions j = 1, 2, 3
- Encoder produces hidden states h_1, h_2, h_3 (bidirectional)
- Decoder initial state: s_0 (from encoder final state)
- We want to generate y_1 = "comment"
Step A: Compute alignment scores
Feed (s_0, h_j) into the alignment MLP for each j:
e_{1,1} = 2.00 ("how"),  e_{1,2} = 0.75 ("are"),  e_{1,3} = 0.05 ("you")

(scores are illustrative; higher means more relevant)
Step B: Apply softmax
α_{1,1} ≈ 0.70,  α_{1,2} ≈ 0.20,  α_{1,3} ≈ 0.10

The model is placing 70% of its attention on "how" (j=1), which is sensible, since "comment" is the French translation of "how".
Step C: Compute context vector
c_1 = 0.70·h_1 + 0.20·h_2 + 0.10·h_3

c_1 is dominated by h_1 (the encoding of "how"). The decoder uses this context-rich vector to predict "comment".
Step D: Update decoder state
The new decoder state s_1 incorporates the focused context. Then for y_2 = "allez", we recompute alignment scores using s_1 and the same alignment MLP, yielding fresh attention weights that likely attend more to h_2 ("are") and h_3 ("you").
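The arithmetic in this example can be checked directly. The scores are illustrative values chosen so the softmax puts roughly 70% of the attention on "how", and the 2-d annotations are made up for the sketch:

```python
import numpy as np

# Illustrative alignment scores for (s_0, h_1..h_3)
e = np.array([2.0, 0.75, 0.05])

# Step B: softmax -> attention weights
alpha = np.exp(e) / np.exp(e).sum()
print(alpha.round(2))                   # [0.7 0.2 0.1]

# Step C: context vector as the alpha-weighted sum of made-up 2-d annotations
H = np.array([[0.9, 0.1],               # h_1, encoding of "how"
              [0.2, 0.8],               # h_2, encoding of "are"
              [0.4, 0.5]])              # h_3, encoding of "you"
c_1 = alpha @ H                         # dominated by h_1, as expected
```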
6. The Alignment Matrix Visualization
One of the most striking results in the paper is Figure 3: a heatmap of attention weights α_{tj} for an English-to-French sentence pair. Rows are target words, columns are source words. High α means the decoder attended strongly to that source word when generating that target word.
What the alignment matrix reveals
- Mostly monotonic alignment for English-French (similar word order), shown as a diagonal band
- Non-monotonic alignment visible at adjective-noun inversions (French often reverses English adj-noun order)
- The model learned this alignment structure entirely from parallel text, with no explicit alignment supervision
- "[EOSf]" (the French end-of-sentence token) attends broadly, consistent with the decoder knowing the sentence is done
The alignment matrix is not just an interpretability tool; it validates that the mechanism is doing what we intended. The model really is learning to soft-align source and target words, recovering structure that linguists have studied for decades.
7. Results on WMT English-French
The experiments compare three systems on WMT 2014 English-French translation: a vanilla RNNenc-dec without attention, an RNNsearch model with attention (this paper), and Moses, a mature phrase-based statistical MT system representing the state of the art at the time.
| Model | BLEU | Notes |
|---|---|---|
| RNNenc-dec (no attention) | 26.71 | Degrades on sentences > 20 words |
| RNNsearch-50 (with attention) | 28.45 | Robust on long sentences |
| Moses (phrase-based SMT) | 33.30 | Highly engineered, uses large n-gram LMs |
The key finding
RNNsearch-50 surpasses the vanilla RNNenc-dec by 1.7 BLEU on known-vocabulary test sets, and by a larger margin on sentences longer than 20 words. Critically, the performance gap between the with- and without-attention models grows with sentence length, directly confirming the bottleneck hypothesis. The model without attention degrades sharply; the model with attention maintains quality even for 50+ word sentences.
8. Connection to the Transformer and Self-Attention
Bahdanau attention is cross-attention: the query comes from the decoder, the keys and values come from the encoder. The Transformer (Vaswani et al. 2017) generalizes this in several ways:
- Self-attention: queries, keys, and values all come from the same sequence, allowing a sequence to attend to itself
- Dot-product scoring instead of additive MLP scoring, which is more computationally efficient
- Multi-head attention: run attention multiple times in parallel with different projections
- No recurrence at all: the Transformer replaces the RNN entirely with attention
Conceptual lineage
The Transformer's encoder-decoder cross-attention is exactly Bahdanau attention with dot-product scoring and linear projections (Q, K, V matrices). The decoder's cross-attention layer computes attention weights between the decoder's current state and all encoder outputs: the same computation, just generalized and scaled.
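A single-head sketch of that correspondence. The shapes are toy, the projection matrices are random, and `cross_attention` is a hypothetical helper for this sketch, not any library's API:

```python
import numpy as np

def cross_attention(s, H, W_q, W_k, W_v):
    # Query from the decoder state; keys/values from the encoder outputs.
    q = W_q @ s
    K, V = H @ W_k.T, H @ W_v.T
    scores = K @ q / np.sqrt(q.size)    # scaled dot product replaces the additive MLP
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                # same softmax as in Bahdanau attention
    return alpha @ V                    # same weighted sum -> context vector

rng = np.random.default_rng(2)
d_model, d_k = 4, 3
s = rng.normal(size=d_model)            # decoder state (query source)
H = rng.normal(size=(5, d_model))       # five encoder outputs (key/value source)
W_q, W_k, W_v = (rng.normal(size=(d_k, d_model)) for _ in range(3))
ctx = cross_attention(s, H, W_q, W_k, W_v)
print(ctx.shape)                        # (3,)
```

Only the scoring function changed; the score → softmax → weighted-sum recipe is identical.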
9. Why This Paper Matters
As of early 2025, this paper has over 20,000 citations. But citation counts undersell its impact. Essentially every modern large language model (GPT-4, Claude, Gemini, LLaMA) descends directly from the ideas in this 2014 paper. The attention mechanism is the core computational primitive of modern AI.
Conceptual leap
Broke the fixed-vector bottleneck. Showed that dynamic, input-conditioned context is learnable end-to-end.
Interpretability
Introduced alignment visualization as a window into model behavior, one of the first widely used neural interpretability tools.
Engineering template
Established score → softmax → weighted-sum as the standard attention recipe, still used in every Transformer today.
Scalability foundation
Attention is differentiable and parallelizable. These properties made the Transformer possible and enabled the scaling laws era.
Further Reading
- Original paper: Bahdanau et al. (2014), "Neural Machine Translation by Jointly Learning to Align and Translate"
- Luong et al. (2015): "Effective Approaches to Attention-based Neural Machine Translation" (dot-product & concat variants)
- Vaswani et al. (2017): "Attention Is All You Need" (the Transformer)
- Distill.pub: "Attention and Augmented Recurrent Neural Networks" (excellent visual explanation)
- Jay Alammar: Visualizing seq2seq with attention (interactive diagrams)