TL;DR
Standard seq2seq models output tokens from a fixed vocabulary — they cannot refer to positions in the input sequence. Pointer Networks solve this by turning the attention distribution itself into the output: at each decoder step, a softmax over input positions directly selects which input element to "point to." This enables solving combinatorial problems (convex hull, TSP, sorting) where the output is a permutation of the input.
1. The Variable-Output Vocabulary Problem
Standard sequence-to-sequence models define a fixed output vocabulary V. At each decoding step, the model produces a distribution over V and samples the next token. This works well for machine translation (fixed target language vocabulary) but breaks down for a whole class of combinatorial problems.
Consider three tasks the paper studies:
- Sorting: Given numbers (3.1, 0.7, 2.4), output the sorted permutation — but "sorted order" is not a fixed vocabulary, it depends on the input.
- Convex hull: Given n 2D points, output the subset that forms the convex hull in order — the output is a variable-length subset of input points.
- Travelling Salesman Problem (TSP): Given n cities, output a tour (permutation) that minimizes total distance — the tour length equals the number of input cities.
In all three cases, the output is a permutation or subset of the input elements themselves. If you train a model on instances of size n = 10 with a fixed 10-class output, it cannot generalize to n = 15 — the output vocabulary would need to change. The key insight of Pointer Networks is to reframe the output as: "which input position do I select next?"
2. The Pointer Mechanism
A Pointer Network is an encoder-decoder LSTM where the decoder's output at each step is a probability distribution over input positions — not over a fixed vocabulary. Here is the full mechanism:
Step 1 — Encode the input:
The encoder LSTM processes the input sequence P = (p₁, …, pₙ) and produces one hidden state eⱼ per input element. Each eⱼ is a dense vector summarizing the context around position j.
Step 2 — Compute alignment scores:
At decoder step i, the decoder hidden state dᵢ is combined with each encoder hidden state eⱼ:

uⱼⁱ = vᵀ tanh(W₁eⱼ + W₂dᵢ),   j ∈ {1, …, n}

The matrices W₁, W₂ and the vector v are learned parameters. Each uⱼⁱ is a scalar alignment score: how relevant is input position j when generating the i-th output?
Step 3 — Output distribution over input positions:
The softmax is applied across all n input positions:

p(Cᵢ | C₁, …, Cᵢ₋₁, P) = softmax(uⁱ)

This gives a proper probability distribution over {1, …, n}. At inference the model either takes the argmax (greedy decoding) or samples from the distribution; the selected element becomes Cᵢ in the output sequence.
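The three steps above can be sketched in a few lines of NumPy. This is an illustrative shape-check with random parameters (in the real model, W₁, W₂, and v are learned jointly with the encoder and decoder LSTMs), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                       # 5 input positions, hidden size 8

e = rng.normal(size=(n, d))       # encoder hidden states e_1..e_n (Step 1)
d_i = rng.normal(size=d)          # decoder hidden state at step i

W1 = rng.normal(size=(d, d))      # learned parameters (random stand-ins here)
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

# Step 2: u_j^i = v^T tanh(W1 e_j + W2 d_i), one scalar score per position
u = np.tanh(e @ W1.T + d_i @ W2.T) @ v        # shape (n,)

# Step 3: the softmax over input positions IS the output distribution
p = np.exp(u - u.max())
p /= p.sum()

chosen = int(np.argmax(p))        # greedy decoding: point at the best position
```

Note that nothing here depends on a fixed output vocabulary: if the input had n = 50 elements instead of 5, the same parameters would produce a 50-way distribution.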
3. Attention vs. Pointer Networks
Pointer Networks look almost identical to sequence-to-sequence with Bahdanau attention. The critical difference is what happens after the softmax. This is the key conceptual distinction:
| | Standard Attention | Pointer Network |
|---|---|---|
| Alignment scores | uⱼⁱ = vᵀ tanh(W₁eⱼ + W₂dᵢ) | uⱼⁱ = vᵀ tanh(W₁eⱼ + W₂dᵢ) |
| After softmax | Weighted sum: cᵢ = Σⱼ αⱼⁱ · eⱼ | This IS the output: p(Cᵢ) = softmax(uⁱ) |
| Context vector cᵢ used for? | Predicting next token from fixed vocab | No context vector — pointer is the output |
| Output vocabulary | Fixed size \|V\| (e.g. 50,000 words) | Dynamic: n (size of current input) |
In formal terms, standard attention computes

uⱼⁱ = vᵀ tanh(W₁eⱼ + W₂dᵢ),   αⱼⁱ = softmax(uⁱ)ⱼ,   cᵢ = Σⱼ αⱼⁱ eⱼ

and feeds the context vector cᵢ into a fixed-vocabulary output layer. Pointer Networks skip the weighted sum entirely and use the softmax weights as the final answer:

p(Cᵢ | C₁, …, Cᵢ₋₁, P) = softmax(uⁱ)
This subtle shift — using attention weights as output rather than as intermediate aggregation weights — is what makes the output vocabulary dynamic. The model is literally "pointing" to one of the n input positions at each step.
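The fork can be made concrete: both models compute identical scores uⁱ and diverge only in what they do after the softmax. A schematic NumPy sketch with random stand-in values (the vocabulary size and output projection `W_out` are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, V = 4, 6, 50_000            # input length, hidden size, fixed vocab size

e = rng.normal(size=(n, d))       # encoder states
u = rng.normal(size=n)            # alignment scores u^i (same in both models)

alpha = np.exp(u - u.max())
alpha /= alpha.sum()              # softmax over the n input positions

# Standard attention: the weights aggregate encoder states into a context
# vector, which then feeds a projection to a FIXED |V|-way output softmax.
c = alpha @ e                               # context vector c_i, shape (d,)
W_out = rng.normal(size=(V, d)) * 0.01      # stand-in output projection
p_vocab = np.exp(W_out @ c)
p_vocab /= p_vocab.sum()                    # shape (V,): fixed vocabulary

# Pointer Network: the attention weights themselves are the output,
# a distribution over the n positions of THIS particular input.
p_pointer = alpha                           # shape (n,): dynamic "vocabulary"
```

The contrast in the final shapes — (50,000,) versus (4,) — is the entire architectural difference.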
4. Training on Combinatorial Tasks
Pointer Networks are trained with supervised learning. For each input instance, the training target is an optimal or near-optimal output sequence — a ground-truth ordering of input positions.
The loss is standard cross-entropy over the pointer distribution at each step:

L(θ) = − Σᵢ log p(Cᵢ* | C₁*, …, Cᵢ₋₁*, P; θ)

where C* is the ground-truth sequence of input indices. Getting these ground-truth labels requires running an exact (or near-exact) solver:
- Sorting: The ground-truth output is just the argsort of the input — trivially computed.
- Convex hull: Standard computational geometry algorithms produce the exact hull cheaply: Graham scan runs in O(n log n), gift wrapping in O(nh), where h is the number of hull vertices.
- TSP: For small n (≤10), exact TSP solvers (e.g. dynamic programming, Concorde) are feasible. For n > 10, the paper uses the Christofides heuristic to generate near-optimal tours.
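For the two cheap cases, label generation is only a few lines. A sketch using argsort for sorting targets and Andrew's monotone chain for hulls (one of the standard O(n log n) algorithms; the paper does not prescribe a specific one), plus a uniform stand-in model distribution to show the cross-entropy computation:

```python
import numpy as np

# Sorting: the target pointer sequence is the argsort of the input.
x = np.array([3.1, 0.7, 2.4])
sort_labels = np.argsort(x)            # positions to point at, in order

# Cross-entropy against a model's per-step pointer distributions
# (uniform stand-ins here; a real model would produce these).
probs = np.full((3, 3), 1 / 3)         # probs[i, j] = p(C_i = j)
loss = -np.log(probs[np.arange(3), sort_labels]).sum()

# Convex hull labels via Andrew's monotone chain, O(n log n).
def convex_hull(points):
    pts = sorted(map(tuple, points))
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2:
                (ax, ay), (bx, by) = out[-2], out[-1]
                # pop b if the turn a -> b -> p is not counterclockwise
                if (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax) <= 0:
                    out.pop()
                else:
                    break
            out.append(p)
        return out
    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]     # CCW hull, no repeated endpoint

# Four corners of the unit square plus an interior point:
hull = convex_hull([(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)])
```

Mapping the hull's coordinate tuples back to their input indices yields the target pointer sequence for training.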
5. Worked Example: Convex Hull Ordering
Let's trace through a small convex hull example to make the pointer mechanism concrete. Suppose we have 5 points in 2D:
P = {
  p₁ = (0.6, 0.5), ← interior point
  p₂ = (0.9, 0.1), ← hull vertex
  p₃ = (0.5, 0.6), ← interior point
  p₄ = (0.1, 0.9), ← hull vertex
  p₅ = (0.8, 0.8), ← hull vertex
}
The convex hull visits points p₂, p₅, p₄ in counterclockwise order (p₁ and p₃ are interior). The encoder processes all 5 points and produces hidden states e₁, …, e₅.
Now the decoder runs step by step, each time computing a softmax over all 5 input positions:
Decoder step i=1 (initial decoder state d₁)
u¹ = [u₁¹, u₂¹, u₃¹, u₄¹, u₅¹]
softmax(u¹) ≈ [0.03, 0.72, 0.05, 0.08, 0.12]
→ Points to p₂ (bottom-right hull vertex, starting point)
Decoder step i=2 (d₂ updated with selected p₂)
u² = [u₁², u₂², u₃², u₄², u₅²]
softmax(u²) ≈ [0.04, 0.06, 0.03, 0.08, 0.79]
→ Points to p₅ (top-right hull vertex)
Decoder step i=3 (d₃ updated with selected p₅)
u³ = [u₁³, u₂³, u₃³, u₄³, u₅³]
softmax(u³) ≈ [0.05, 0.07, 0.04, 0.81, 0.03]
→ Points to p₄ (top-left hull vertex) → output sequence complete
The model never explicitly computed "is this point on the hull?" — it learned from examples to assign high pointer probability to hull vertices in traversal order. Interior points p₁ and p₃ consistently receive low softmax weight because their encoder representations, combined with the decoder state, produce low alignment scores.
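The full decoding loop above can be sketched with untrained parameters. One addition here: masking already-chosen positions, a common practical constraint that guarantees each index is pointed at no more than once (the paper's basic model instead learns valid outputs from data, so the mask is a convenience of this sketch, not part of the original formulation):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 5, 8
e = rng.normal(size=(n, d))              # encoder states for the 5 points
W1 = rng.normal(size=(d, d))             # random stand-ins for learned params
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)

d_state = rng.normal(size=d)             # stand-in for the decoder LSTM state
selected = []
mask = np.zeros(n, dtype=bool)

for step in range(3):                    # emit 3 pointers (a triangular hull)
    u = np.tanh(e @ W1.T + d_state @ W2.T) @ v
    u[mask] = -np.inf                    # forbid re-selecting a position
    p = np.exp(u - u.max())
    p /= p.sum()                         # softmax over remaining positions
    j = int(np.argmax(p))                # greedy pointer choice
    selected.append(j)
    mask[j] = True
    d_state = e[j]                       # toy update: feed the chosen input back
```

With random weights the selections are arbitrary, but the loop structure — score, softmax, point, feed the selection back — is exactly the trace walked through above.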
6. Results
The paper evaluates Pointer Networks on three tasks. All results use greedy decoding (argmax at each step) unless otherwise noted.
Convex Hull
Pointer Networks predict convex hulls with high fidelity across all tested sizes (n = 5 to 50): the paper reports >99% area overlap between predicted and ground-truth hulls, with exact-sequence accuracy highest at small n. The task is comparatively tractable because it has a well-defined geometric structure that the model can learn.
Travelling Salesman Problem
For n ≤ 10 cities, Pointer Networks find near-optimal tours — within ~1% of the optimal tour length found by exact TSP solvers. For larger n (trained on n = 5–20, tested on n = 50), solution quality degrades: the model finds ~5–10% longer tours than optimal. The model also shows surprising generalization: trained on n = 5–20, it can handle n = 50 without retraining, though with reduced quality.
Sorting
Pointer Networks achieve 100% sequence accuracy on sorting tasks up to n = 20, and generalize to larger n with high accuracy. This is an easier task (unique correct answer, no combinatorial search), serving mainly to validate the mechanism.
7. Legacy: Copy Mechanisms in Modern LLMs
Pointer Networks introduced a key idea: models should be able to refer to input positions directly, not just generate from a fixed vocabulary. This concept has propagated through the field in several important ways:
CopyNet (Gu et al., 2016)
Extends Pointer Networks to a hybrid model: at each decoding step, the model can either generate a word from the vocabulary OR copy a word from the input sequence. This is critical for tasks like abstractive summarization (copy named entities, rare words) and dialogue (repeat user's phrases). The output probability mixes a generation distribution and a copy distribution.
Pointer-Generator Networks (See et al., 2017)
Applied to abstractive summarization with a learned generation probability p_gen ∈ [0,1] that interpolates between generating from the vocabulary and pointing into the source. This enabled models to handle out-of-vocabulary words by copying them from the source document — a critical capability for summarizing documents with proper nouns and technical terms.
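The generate-or-copy mixture can be sketched with made-up numbers (the tokens, distributions, and gate value below are all invented for illustration; the gate is written p_gen here, following See et al.'s notation):

```python
import numpy as np

vocab = ["the", "report", "said", "<unk>"]       # toy fixed vocabulary
source = ["the", "ceo", "of", "acme", "said"]    # source document tokens

p_vocab = np.array([0.4, 0.3, 0.2, 0.1])         # generator distribution
attn = np.array([0.05, 0.6, 0.05, 0.25, 0.05])   # pointer/attention over source
p_gen = 0.3                                      # learned gate in [0, 1]

# Mix over an extended vocabulary = fixed vocab + source-only words.
extended = vocab + [w for w in source if w not in vocab]
p_final = np.zeros(len(extended))
p_final[:len(vocab)] = p_gen * p_vocab           # generation share
for w, a in zip(source, attn):                   # copy share, summed per word
    p_final[extended.index(w)] += (1 - p_gen) * a

best = extended[int(np.argmax(p_final))]         # -> "ceo", an OOV copy
```

With these numbers the model's top choice is "ceo" — a word absent from the fixed vocabulary that becomes reachable only through the copy branch, which is precisely the failure mode this mechanism fixes.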
Transformer Copy Mechanisms
Modern LLMs like GPT-2/3 implicitly learn to copy from context via their attention mechanisms — the self-attention heads in later layers often show high attention to specific previous tokens, effectively "pointing" to them. Retrieval-augmented generation (RAG) makes this more explicit: the model attends over retrieved documents and incorporates their content. Extractive QA models point to answer spans in the passage.
Neural Combinatorial Optimization
The Pointer Network architecture is now a standard building block for learned combinatorial solvers. Key follow-ups: Bello et al. (2016) trained with RL instead of supervised labels; Kool et al. (2019) built an attention-based model (without RNNs) for vehicle routing problems, achieving competitive results with classical solvers on 100-city TSP.
The core limitation remains: Pointer Networks (and their descendants) only work when the output comes from the input. You cannot use a pointer to generate a word that does not appear in the source. This is why hybrid generation+copy approaches like CopyNet are so popular in practice — they get the best of both worlds.