Neural Turing Machines

Graves, Wayne, Danihelka · DeepMind · 2014 · arXiv 1410.5401

TL;DR

Neural Turing Machines couple a neural network controller with an external differentiable memory matrix. The controller reads from and writes to memory via soft, attention-like addressing, making every operation end-to-end trainable with gradient descent. NTMs learn to copy, sort, and recall sequences far better than LSTMs, and generalize to sequence lengths never seen during training.

◆ NTM Architecture Pipeline (at a glance)

  • Motivation: RNNs lack explicit external memory for algorithmic tasks (copy, sort, associative recall) and fail to generalize on them.
  • Architecture: a controller (LSTM or feedforward) plus an external memory matrix M of size N×M that is persistent and differentiable.
  • Core mechanism: addressing combines content-based weighting (cosine-similarity key lookup) with location-based weighting (shift convolution plus sharpening), in the full pipeline content → interpolation → shift → sharpen.
  • Read operation: r_t = Σ_i w_t(i) · M_t(i), a differentiable weighted sum in which every memory slot contributes proportionally.
  • Write operation: erase then add, M_t(i) ← M_{t-1}(i)[1 − w_t(i)·e_t] + w_t(i)·a_t; selective erase with e_t, selective add with a_t.
  • Results: fully differentiable and trained end-to-end with BPTT; demonstrated on copy, repeated copy, associative recall, and priority sort; external memory decouples storage from computation, and NTM generalizes to 2× unseen sequence lengths where LSTM fails.

1. Motivation: Why RNNs Aren't Enough

Standard RNNs and LSTMs compress all past information into a fixed-size hidden state vector. For algorithmic tasks (copying a sequence, sorting by priority, recalling a stored pattern) this bottleneck is catastrophic: the network must simultaneously remember what it has seen and figure out what to do next, all within a single hidden vector.

The key insight of NTMs is that computers solve these tasks easily because they have two separate resources: a processor (controller) and working memory (RAM). The controller reads from and writes to memory as needed. NTMs implement exactly this separation, but with differentiable, soft operations so the whole system can be trained end-to-end.

2. NTM Architecture

An NTM has two main components:

  • Controller: An LSTM or feedforward network that receives external input x_t and read vectors from memory. It outputs write parameters (erase vector e_t, add vector a_t) and addressing parameters (key k_t, strength β_t, gate g_t, shift kernel s_t, sharpening factor γ_t).
  • Memory matrix: M_t of size N×M (N memory locations, each a vector of dimension M). Initialized to small constants, updated by write operations each timestep.

At each timestep, the controller outputs attention weights w_t over the N memory locations. These weights are used for both reading and writing. The full output of the NTM is the concatenation of the controller's internal output and the read vector r_t.

3. Reading from Memory

Reading is a differentiable weighted sum over all memory locations. Given attention weights w_t(i) (a distribution summing to 1), the read vector is:

$$r_t = \sum_i w_t(i) \, M_t(i)$$

This is a soft read: every memory slot contributes to the result, weighted by how much attention it receives. When w_t is concentrated on a single location (sharp), it approximates a hard read. When w_t is diffuse, it produces a blended average. Because this is a linear combination, gradients flow back through w_t and M_t seamlessly.
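As a concrete illustration, here is a minimal NumPy sketch of the soft read; the helper name `read` and the toy 4×3 memory are mine, not from the paper:

```python
import numpy as np

# Toy memory: N=4 locations, each a vector of dimension M=3.
memory = np.arange(12, dtype=float).reshape(4, 3)

def read(memory, w):
    """Differentiable read: r = sum_i w(i) * M(i), a weighted sum of rows."""
    return w @ memory  # (N,) @ (N, M) -> (M,)

# Sharp attention on slot 2 approximates a hard read of that row.
w_sharp = np.array([0.0, 0.0, 1.0, 0.0])
print(read(memory, w_sharp))    # [6. 7. 8.] -- exactly memory[2]

# Diffuse attention blends all rows into an average.
w_diffuse = np.full(4, 0.25)
print(read(memory, w_diffuse))  # [4.5 5.5 6.5] -- the column-wise mean
```

Because `read` is a plain matrix product, any autodiff framework can propagate gradients through both `w` and `memory`.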

4. Writing to Memory

Writing is split into two sub-operations, applied in sequence: erase then add. The controller emits an erase vector e_t ∈ (0,1)^M and an add vector a_t ∈ R^M.

Erase step:

$$\tilde{M}_t(i) = M_{t-1}(i) \left[\mathbf{1} - w_t(i) \, e_t\right]$$

Add step:

$$M_t(i) = \tilde{M}_t(i) + w_t(i) \, a_t$$

The combined write operation is:

$$M_t(i) = M_{t-1}(i)\left[\mathbf{1} - w_t(i)\,e_t\right] + w_t(i)\,a_t$$

When e_t = 1 (all ones), the erase completely resets the attended location before adding a_t: a full overwrite. When e_t = 0, the erase does nothing and a_t is simply added to the existing content. Locations with low w_t(i) are barely touched, implementing selective writes.
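The erase-then-add update can be sketched in a few lines of NumPy; the `write` helper and the toy values below are illustrative, not from the paper:

```python
import numpy as np

def write(memory, w, e, a):
    """Erase-then-add: M(i) <- M(i) * (1 - w(i) * e) + w(i) * a, for every row i."""
    erased = memory * (1.0 - np.outer(w, e))  # erase step
    return erased + np.outer(w, a)            # add step

memory = np.ones((4, 3))
w = np.array([0.0, 1.0, 0.0, 0.0])  # attention fully on slot 1
e = np.ones(3)                       # e = 1: full erase at attended slots
a = np.array([5.0, 6.0, 7.0])

new_memory = write(memory, w, e, a)
print(new_memory[1])  # [5. 6. 7.] -- slot 1 fully overwritten with a
print(new_memory[0])  # [1. 1. 1.] -- w(0) = 0, so slot 0 is untouched
```

With a fractional `w` or `e`, the same code produces partial, blended updates, which is what keeps the operation differentiable.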

5. Addressing Mechanisms

Attention weights w_t are computed through a four-step pipeline that combines content-based and location-based addressing. This allows the controller to either look up a specific value by content (like a hash map) or iterate sequentially through memory locations (like a pointer).

Step 1: Content Addressing

The controller emits a query key k_t and a strength scalar β_t > 0. Content similarity between k_t and each row M_t(i) is measured by cosine similarity:

$$K(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$

The content-based weight distribution is:

$$w_t^c(i) = \frac{\exp\!\left(\beta_t \, K(\mathbf{k}_t, M_t(i))\right)}{\sum_j \exp\!\left(\beta_t \, K(\mathbf{k}_t, M_t(j))\right)}$$

β_t controls the sharpness of the lookup: large β_t → near-one-hot lookup (hard retrieval); small β_t → diffuse weighting across all locations (soft averaging). This is analogous to temperature in softmax.
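Content addressing is just a β-scaled softmax over cosine similarities. A minimal NumPy sketch (the helper name and toy memory are mine):

```python
import numpy as np

def content_addressing(memory, key, beta):
    """w_c(i) proportional to exp(beta * cosine(key, M(i)))."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    logits = beta * sims
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    return weights / weights.sum()

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.7, 0.7]])
key = np.array([1.0, 0.0])
print(content_addressing(memory, key, beta=1.0))   # diffuse over all rows
print(content_addressing(memory, key, beta=50.0))  # nearly one-hot on row 0
```

The `1e-8` guard against zero-norm rows is a standard implementation detail, not something the paper specifies.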

Step 2: Interpolation

A scalar gate g_t ∈ (0,1) interpolates between the content weight and the previous timestep's weight. This lets the controller decide whether to attend based on content (g_t ≈ 1) or maintain its previous location (g_t ≈ 0):

$$\mathbf{w}_t^g = g_t \, \mathbf{w}_t^c + (1 - g_t) \, \mathbf{w}_{t-1}$$

Step 3: Shift

A shift distribution s_t (over integer offsets, e.g., {−1, 0, +1}) is convolved with w_t^g to allow the controller to move the attention head forward or backward. This enables sequential iteration over memory:

$$\tilde{w}_t(i) = \sum_j w_t^g(j) \, s_t(i - j)$$

For the copy task, the network learns to set s_t to shift +1 every step, implementing a sequential scan.
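The circular shift convolution can be sketched with `np.roll`, assuming the three-offset kernel {−1, 0, +1} and wrap-around indexing (helper names are illustrative):

```python
import numpy as np

def shift(w, s):
    """Circular convolution of attention w with a shift distribution s over
    offsets {-1, 0, +1}: w_tilde(i) = sum_j w(j) * s(i - j), indices mod N."""
    return sum(p * np.roll(w, off) for off, p in zip((-1, 0, 1), s))

w = np.array([0.0, 1.0, 0.0, 0.0])     # head currently at slot 1
s_forward = np.array([0.0, 0.0, 1.0])  # all probability mass on offset +1
print(shift(w, s_forward))             # [0. 0. 1. 0.] -- head moves to slot 2

s_blur = np.array([0.1, 0.8, 0.1])     # mostly stay, slight blur to neighbours
print(shift(w, s_blur))                # [0.1 0.8 0.1 0. ]
```

The second example shows why Step 4 exists: any mass off the central offset smears the distribution a little every timestep.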

Step 4: Sharpening

The shift convolution can blur the weight distribution. A sharpening factor γ_t ≥ 1 re-concentrates the weights:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$$

When γ_t = 1, sharpening is a no-op. When γ_t >> 1, the distribution approaches a one-hot vector over the argmax location.
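Putting the four steps together, here is a compact NumPy sketch of the whole addressing pipeline; parameter names follow the text, and the toy example is mine, not from the paper:

```python
import numpy as np

def addressing(memory, w_prev, key, beta, g, s, gamma):
    # 1. Content addressing: softmax over beta-scaled cosine similarities.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = np.exp(beta * sims - (beta * sims).max())
    w_c /= w_c.sum()
    # 2. Interpolation with the previous weights via gate g.
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3. Circular shift convolution over offsets {-1, 0, +1}.
    w_tilde = sum(p * np.roll(w_g, off) for off, p in zip((-1, 0, 1), s))
    # 4. Sharpening: raise to gamma and renormalise.
    w = w_tilde ** gamma
    return w / w.sum()

memory = np.eye(4)
w_prev = np.array([1.0, 0.0, 0.0, 0.0])
# Ignore content entirely (g = 0), shift forward one slot, then sharpen:
w = addressing(memory, w_prev, key=np.ones(4), beta=1.0,
               g=0.0, s=np.array([0.0, 0.0, 1.0]), gamma=2.0)
print(w)  # attention has moved from slot 0 to slot 1
```

With g = 0 the content lookup is bypassed and the head simply steps forward, which is exactly the behaviour the copy task exploits.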

6. Tasks Demonstrated

The paper evaluates NTMs against LSTMs on five algorithmic tasks. In every case, NTMs learn faster, achieve lower error, and generalize far better to out-of-distribution sequence lengths.

Copy Task

Read a sequence of random binary vectors, then reproduce them. NTM learns to write each vector to memory sequentially, then read back in order. Trained on length 1–20; NTM generalizes to length 120 almost perfectly. LSTM degrades rapidly above training length.

Repeated Copy

Copy a sequence K times (K given as input). NTM learns an outer loop (repeat K times) and inner loop (scan each position), implemented via location-based addressing. Demonstrates composable looping behavior.

Associative Recall

Store a list of item–item pairs; then, given a query item, retrieve the associated item. NTM uses content-based addressing to look up the query key, then shifts +1 to fetch the paired item, implementing a primitive key-value store.

Dynamic N-Grams

Predict the next symbol in a stream drawn from an n-gram distribution that changes over time. NTM can update its memory to track the current distribution, acting as an adaptive statistical model. Performance approaches the Bayesian optimal predictor.

Priority Sort

Given a sequence of (value, priority) pairs, output the values sorted by descending priority. NTM learns to write each value to the memory location indexed by its priority, then read back sequentially: a differentiable bucket sort.

7. NTM vs. LSTM

The fundamental difference is not in expressivity (both are Turing-complete in theory) but in inductive bias and data efficiency.

| Property | LSTM | NTM |
|---|---|---|
| Memory capacity | Fixed (hidden state size) | Scalable (N×M matrix) |
| Memory access | Implicit (all gates) | Explicit (addressed reads/writes) |
| Length generalization | Poor (degrades sharply) | Strong (2× training length) |
| Addressing style | None | Content + location hybrid |
| Training speed on copy | ~100k steps to converge | ~30k steps to converge |

8. Legacy and Influence

NTMs were one of the first demonstrations that differentiable memory could be bolted onto a neural network and trained end-to-end, a template that influenced many later architectures.

  • Differentiable Neural Computer (DNC) (Graves et al., 2016): Extended NTM with dynamic memory allocation, temporal link matrices for tracking write order, and stronger generalization. Used to navigate London Underground maps.
  • Memory Networks (Weston et al., 2014): Concurrent work with similar external memory idea, applied to QA tasks.
  • Attention mechanisms: The content-based addressing in NTMs is conceptually identical to the attention mechanism later formalized in seq2seq models and Transformers. NTMs can be seen as an early attention-over-memory system.
  • Meta-learning: NTMs demonstrated that a neural network can implement an algorithm rather than just fit a function, which proved foundational to later work on learning to learn.

Resources