Neural Machine Translation by Jointly Learning to Align and Translate

Bahdanau, Cho, Bengio · ICLR 2015 · arXiv 1409.0473

TL;DR

Vanilla seq2seq forces the encoder to compress an entire source sentence into a single fixed-size vector, a catastrophic bottleneck for long sentences. Bahdanau et al. fix this by letting the decoder dynamically attend to all encoder hidden states at every output step. The decoder learns a soft alignment over source positions, forming a weighted context vector rather than relying on one frozen summary. This is the origin of attention in deep learning and the direct ancestor of the Transformer's cross-attention.

1. The Fixed-Length Bottleneck

The standard encoder-decoder architecture (Cho et al. 2014, Sutskever et al. 2014) works as follows: an RNN encoder reads the source sequence token by token, updating a hidden state at each step. When the last token is read, the final hidden state is passed to the decoder as the sole context. The decoder then generates the target sequence from this one vector.

The problem

Everything the decoder will ever use from the source sentence must be encoded in a single vector of fixed dimension (e.g., 1000 floats). For a 30-word sentence that is already tight; for a 50-word sentence performance degrades noticeably. The encoder must decide upfront which details to preserve, without knowing what the decoder will ask for.

2. The Attention Solution

Instead of compressing everything into one vector, Bahdanau et al. keep all encoder hidden states and let the decoder attend to them. Think of it as giving the decoder a dynamic spotlight: at each output step, it chooses which parts of the source to focus on.

The key insight: instead of a single context vector c shared across all decoder steps, compute a distinct context vector c_t at every decoder step t. This c_t is a weighted combination of all encoder hidden states, where the weights are learned automatically and reflect which source positions are most relevant for generating the current target word.

Bidirectional RNN encoder

The paper introduces a bidirectional RNN encoder. Each source position j gets a hidden state h_j that is the concatenation of a forward and backward pass:

h_j = \left[\, \overrightarrow{h}_j^\top \;;\; \overleftarrow{h}_j^\top \,\right]^\top

This means h_j contains context from both directions: what comes before and after position j. The annotation h_j is a richer representation than a unidirectional hidden state.
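As a concrete sketch, here is a minimal bidirectional encoder in NumPy. The paper uses gated (GRU) units; the plain tanh recurrence, random weights, and tiny dimensions below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T_x, d_in, d_h = 3, 4, 5          # source length, embedding size, hidden size

def rnn_pass(X, W, U, reverse=False):
    """Run a simple tanh RNN over X (T_x, d_in); return all hidden states."""
    h = np.zeros(d_h)
    states = []
    steps = reversed(range(T_x)) if reverse else range(T_x)
    for t in steps:
        h = np.tanh(W @ X[t] + U @ h)
        states.append(h)
    if reverse:
        states = states[::-1]      # re-order so states[j] matches position j
    return np.stack(states)        # (T_x, d_h)

X = rng.normal(size=(T_x, d_in))                  # toy source embeddings
Wf, Uf = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wb, Ub = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

h_fwd = rnn_pass(X, Wf, Uf)                       # left-to-right pass
h_bwd = rnn_pass(X, Wb, Ub, reverse=True)         # right-to-left pass
H = np.concatenate([h_fwd, h_bwd], axis=1)        # annotations h_j, (T_x, 2*d_h)
print(H.shape)                                    # (3, 10)
```

Every row of H is one annotation h_j, ready to be scored by the alignment model.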

3. Computing Alignment Scores

The alignment model assigns a scalar score e_{tj} to each (decoder step t, encoder position j) pair. This score measures how relevant encoder state h_j is when the decoder is at step t with hidden state s_{t-1}.

Additive (concat) alignment model

e_{tj} = v_a^\top \tanh\bigl(W_a s_{t-1} + U_a h_j\bigr)

Where W_a and U_a are learned weight matrices and v_a is a learned weight vector. This is a small single-hidden-layer MLP that takes both the previous decoder state s_{t-1} and an encoder state h_j as inputs.

This is sometimes called "additive attention" or "concat attention", as opposed to the dot-product attention used in the Transformer. The MLP allows the model to learn complex nonlinear interactions between the decoder state and encoder state before collapsing to a scalar score.
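The additive scoring MLP can be written directly from the formula. In this sketch the dimensions and the random parameter matrices are illustrative stand-ins for learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
d_s, d_h, d_a, T_x = 6, 10, 8, 3   # decoder, encoder, alignment dims; source length

# Parameters of the alignment MLP (random stand-ins for learned weights)
W_a = rng.normal(size=(d_a, d_s))
U_a = rng.normal(size=(d_a, d_h))
v_a = rng.normal(size=(d_a,))

def alignment_scores(s_prev, H):
    """e_{tj} = v_a^T tanh(W_a s_{t-1} + U_a h_j), vectorized over all j."""
    hidden = np.tanh(W_a @ s_prev + (U_a @ H.T).T)   # (T_x, d_a)
    return hidden @ v_a                              # (T_x,) scalar score per position

s_prev = rng.normal(size=(d_s,))   # previous decoder state
H = rng.normal(size=(T_x, d_h))    # encoder annotations
e = alignment_scores(s_prev, H)
print(e.shape)                     # (3,)
```

Note the decoder term W_a s_{t-1} is computed once and broadcast against every encoder position, which is how real implementations avoid a per-position loop.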

4. Attention Weights and Context Vector

The raw alignment scores e_{tj} are turned into a probability distribution over source positions via softmax. These attention weights α_{tj} tell us how much the decoder should focus on encoder position j when generating output at step t.

Step 1: Softmax over alignment scores

\alpha_{tj} = \frac{\exp(e_{tj})}{\displaystyle\sum_{k=1}^{T_x} \exp(e_{tk})}

The sum is over all T_x source positions. The result α_{tj} ∈ (0, 1) and Σ_j α_{tj} = 1: it is a proper probability distribution.

Step 2: Weighted sum β†’ context vector

c_t = \sum_{j=1}^{T_x} \alpha_{tj}\, h_j

c_t is the context vector for decoder step t. It is a weighted average of all encoder hidden states, where the weights are the attention probabilities. If α_{t3} = 0.8, the context vector is dominated by encoder state h_3.
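Steps 1 and 2 together are only a few lines of NumPy. The scores and the two-dimensional toy annotations below are made up for illustration.

```python
import numpy as np

def attention_context(e, H):
    """Softmax the scores e (T_x,) and take the weighted sum of annotations H."""
    a = np.exp(e - e.max())        # subtract max for numerical stability
    alpha = a / a.sum()            # attention weights; they sum to 1
    c = alpha @ H                  # context vector, shape (d_h,)
    return alpha, c

H = np.array([[1.0, 0.0],          # toy annotation h_1
              [0.0, 1.0],          # toy annotation h_2
              [1.0, 1.0]])         # toy annotation h_3
alpha, c = attention_context(np.array([1.0, -1.0, 0.5]), H)
print(alpha.round(2))              # approximately [0.57, 0.08, 0.35]
print(c)                           # weighted blend of the three annotations
```

Because softmax is differentiable, the whole score → softmax → weighted-sum pipeline trains end-to-end by backpropagation.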

Step 3: Context-aware decoder update

s_t = f(s_{t-1},\, y_{t-1},\, c_t)

The decoder hidden state s_t depends on: the previous hidden state s_{t-1}, the previously generated word y_{t-1}, and the step-specific context c_t. The output word y_t is then predicted from s_t and c_t jointly.
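A minimal sketch of this update, with a plain tanh recurrence standing in for the paper's gated (GRU-style) unit; the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_y, d_c = 6, 4, 10           # state, target-embedding, context dims

# Random stand-ins for learned projection matrices
W_s = rng.normal(size=(d_s, d_s))
W_y = rng.normal(size=(d_s, d_y))
W_c = rng.normal(size=(d_s, d_c))

def decoder_step(s_prev, y_prev, c_t):
    """s_t = f(s_{t-1}, y_{t-1}, c_t): mix state, last word, and fresh context."""
    return np.tanh(W_s @ s_prev + W_y @ y_prev + W_c @ c_t)

s1 = decoder_step(np.zeros(d_s),           # initial decoder state
                  rng.normal(size=d_y),    # embedding of previous word
                  rng.normal(size=d_c))    # step-specific context vector
print(s1.shape)                            # (6,)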

5. Worked Example: Translating "how are you"

Let's trace through computing attention for the first decoder step when translating "how are you" → "comment allez-vous". This is a toy example to build intuition; real models have higher dimensions and more nuanced weights.

Setup

  • Source: ["how", "are", "you"] → positions j = 1, 2, 3
  • Encoder produces hidden states h_1, h_2, h_3 (bidirectional)
  • Decoder initial state: s_0 (from encoder final state)
  • We want to generate y_1 = "comment"

Step A: Compute alignment scores

Feed (s_0, h_j) into the alignment MLP for each j:

e_{1,1} = v_a^\top \tanh(W_a s_0 + U_a h_1) \approx 2.1
e_{1,2} = v_a^\top \tanh(W_a s_0 + U_a h_2) \approx 0.3
e_{1,3} = v_a^\top \tanh(W_a s_0 + U_a h_3) \approx 0.8

(scores are illustrative; higher means more relevant)

Step B: Apply softmax

\alpha_{1,1} = \frac{e^{2.1}}{e^{2.1} + e^{0.3} + e^{0.8}} \approx \frac{8.17}{8.17 + 1.35 + 2.23} \approx 0.70
\alpha_{1,2} \approx \frac{1.35}{11.75} \approx 0.11
\alpha_{1,3} \approx \frac{2.23}{11.75} \approx 0.19

The model is placing 70% of its attention on "how" (j=1), which is sensible, since "comment" is the French translation of "how".
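The arithmetic in Step B is easy to verify directly (the scores are the illustrative values from Step A):

```python
import numpy as np

e = np.array([2.1, 0.3, 0.8])          # illustrative scores for "how", "are", "you"
alpha = np.exp(e) / np.exp(e).sum()    # softmax over the three source positions
for word, a in zip(["how", "are", "you"], alpha):
    print(f"{word}: {a:.2f}")          # how: 0.70, are: 0.11, you: 0.19
```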

Step C: Compute context vector

c_1 = 0.70 \cdot h_1 + 0.11 \cdot h_2 + 0.19 \cdot h_3

c_1 is dominated by h_1 (the encoding of "how"). The decoder uses this context-rich vector to predict "comment".

Step D: Update decoder state

s_1 = f(s_0,\, \langle\text{BOS}\rangle,\, c_1)

The new decoder state s_1 incorporates the focused context. Then for y_2 = "allez", we recompute alignment scores using s_1 (the same alignment MLP, applied to the new decoder state), producing a fresh set of attention weights, likely attending more to h_2 ("are") and h_3 ("you").

6. The Alignment Matrix Visualization

One of the most striking results in the paper is Figure 3: a heatmap of attention weights Ξ±_{tj} for an English-to-French sentence pair. Rows are target words, columns are source words. High Ξ± means the decoder attended strongly to that source word when generating that target word.

What the alignment matrix reveals

  • Mostly monotonic alignment for English-French (similar word order), shown as a diagonal band
  • Non-monotonic alignment visible at adjective-noun inversions (French often reverses English adj-noun order)
  • The model learned this alignment structure entirely from parallel text, with no explicit alignment supervision
  • "[EOSf]" (French end token) attends broadly, consistent with the decoder knowing the sentence is done

The alignment matrix is not just an interpretability tool; it validates that the mechanism is doing what we intended. The model really is learning to soft-align source and target words, recovering structure that linguists have studied for decades.

7. Results on WMT English-French

The experiments compare three systems on WMT 2014 English-French translation: a vanilla RNNenc-dec without attention, an RNNsearch model with attention (this paper), and Moses, a mature phrase-based statistical MT system representing the state of the art at the time.

Model                           BLEU    Notes
RNNenc-dec (no attention)       26.71   Degrades on sentences > 20 words
RNNsearch-50 (with attention)   28.45   Robust on long sentences
Moses (phrase-based SMT)        33.30   Highly engineered, uses large n-gram LMs

The key finding

RNNsearch-50 surpasses the vanilla RNNenc-dec by 1.7 BLEU on the known-vocabulary test set, and by a larger margin on sentences longer than 20 words. Critically, the performance gap between the models with and without attention grows with sentence length, directly confirming the bottleneck hypothesis. The model without attention degrades sharply; the model with attention maintains quality even for sentences of 50+ words.

8. Connection to the Transformer and Self-Attention

Bahdanau attention is cross-attention: the query comes from the decoder, the keys and values come from the encoder. The Transformer (Vaswani et al. 2017) generalizes this in several ways:

  • Self-attention: queries, keys, and values all come from the same sequence, allowing a sequence to attend to itself
  • Dot-product scoring instead of additive MLP scoring, which is more computationally efficient
  • Multi-head attention: run attention multiple times in parallel with different projections
  • No recurrence at all: the Transformer replaces the RNN entirely with attention

Conceptual lineage

RNN seq2seq (2014) → Bahdanau Attention (2015) → Luong Attention (2015) → Transformer (2017) → BERT, GPT, LLMs...

The Transformer's encoder-decoder cross-attention is exactly Bahdanau attention with dot-product scoring and linear projections (Q, K, V matrices). The decoder's cross-attention layer computes attention weights between the decoder's current state and all encoder outputs: the same computation, just generalized and scaled.
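For contrast with the additive formulation above, here is a minimal single-query sketch of scaled dot-product cross-attention in NumPy. The dimensions and random projection matrices are illustrative, not any library's API.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_k, T_x = 8, 4, 5

# Random stand-ins for the learned Q, K, V projection matrices
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

def dot_product_attention(query, enc_states):
    """Cross-attention: query from the decoder, keys/values from encoder outputs."""
    q = query @ W_q                          # (d_k,)
    K = enc_states @ W_k                     # (T_x, d_k)
    V = enc_states @ W_v                     # (T_x, d_k)
    scores = K @ q / np.sqrt(d_k)            # dot products replace the additive MLP
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # same softmax as Bahdanau attention
    return alpha @ V                         # same weighted sum, over values

s = rng.normal(size=d_model)                 # decoder-side query input
H = rng.normal(size=(T_x, d_model))          # encoder outputs
print(dot_product_attention(s, H).shape)     # (4,)
```

Only the scoring function changed: the score → softmax → weighted-sum recipe is identical to Sections 3-4.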

9. Why This Paper Matters

As of early 2025, this paper has over 20,000 citations. But citation counts undersell its impact. Essentially every modern large language model (GPT-4, Claude, Gemini, LLaMA) descends directly from the ideas in this 2014 paper. The attention mechanism is the core computational primitive of modern AI.

Conceptual leap

Broke the fixed-vector bottleneck. Showed that dynamic, input-conditioned context is learnable end-to-end.

Interpretability

Introduced alignment visualization as a window into model behavior, one of the first widely used neural interpretability tools.

Engineering template

Established score → softmax → weighted-sum as the standard attention recipe, still used in every Transformer today.

Scalability foundation

Attention is differentiable and parallelizable. These properties made the Transformer possible and enabled the scaling laws era.

Further Reading