Neural Turing Machines

Graves, Wayne, Danihelka · DeepMind · 2014 · arXiv 1410.5401

TL;DR

Neural Turing Machines couple a neural network controller with an external differentiable memory matrix. The controller reads from and writes to memory via soft, attention-like addressing, making every operation end-to-end trainable with gradient descent. NTMs learn to copy, sort, and recall sequences far better than LSTMs, and generalize to sequence lengths never seen during training.

◆ NTM Architecture Pipeline (at a glance)

  • Motivation: RNNs lack explicit external memory for algorithmic tasks (copy, sort, associative recall) and fail to generalize on them.
  • Architecture: a controller (LSTM or feedforward) plus an external memory matrix M of size N×M that is persistent and differentiable.
  • Core mechanism: addressing combines content-based weighting (cosine-similarity key lookup) with location-based weighting (shift convolution plus sharpening), in the full pipeline content → interpolation → shift → sharpen.
  • Read operation: r_t = Σ_i w_t(i) · M_t(i), a differentiable weighted sum in which every memory slot contributes proportionally.
  • Write operation: erase then add, M_t(i) ← M_{t-1}(i)[1 − w_t(i)·e_t] + w_t(i)·a_t; selective erase with e_t, selective add with a_t.
  • Results: fully differentiable and trained end-to-end with BPTT; demonstrated on copy, repeated copy, associative recall, and priority sort; external memory decouples storage from computation, and NTM generalizes to 2× unseen sequence lengths where LSTM fails.

1. Motivation: Why RNNs Aren't Enough

Standard RNNs and LSTMs compress all past information into a fixed-size hidden state vector. For algorithmic tasks (copying a sequence, sorting by priority, recalling a stored pattern) this bottleneck is catastrophic: the network must simultaneously remember what it has seen and figure out what to do next, all within a single hidden vector.

The key insight of NTMs is that computers solve these tasks easily because they have two separate resources: a processor (controller) and working memory (RAM). The controller reads from and writes to memory as needed. NTMs implement exactly this separation, but with differentiable, soft operations so the whole system can be trained end-to-end.

2. NTM Architecture

An NTM has two main components:

  • Controller: An LSTM or feedforward network that receives external input x_t and read vectors from memory. It outputs write parameters (erase vector e_t, add vector a_t) and addressing parameters (key k_t, strength β_t, gate g_t, shift kernel s_t, sharpening factor γ_t).
  • Memory matrix: M_t of size N×M (N memory locations, each a vector of dimension M). Initialized to small constants, updated by write operations each timestep.

At each timestep, the controller outputs attention weights w_t over the N memory locations. These weights are used for both reading and writing. The full output of the NTM is the concatenation of the controller's internal output and the read vector r_t.

3. Reading from Memory

Reading is a differentiable weighted sum over all memory locations. Given attention weights w_t(i) (a distribution summing to 1), the read vector is:

$$r_t = \sum_i w_t(i) \, M_t(i)$$

This is a soft read: every memory slot contributes to the result, weighted by how much attention it receives. When w_t is concentrated on a single location (sharp), it approximates a hard read. When w_t is diffuse, it produces a blended average. Because this is a linear combination, gradients flow back through w_t and M_t seamlessly.
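As a concrete illustration, here is a minimal NumPy sketch of the soft read; the helper name `read` and the toy 4×3 memory are mine, not from the paper:

```python
import numpy as np

# Toy memory: N=4 locations, each a vector of dimension M=3.
memory = np.arange(12, dtype=float).reshape(4, 3)

def read(memory, w):
    """Differentiable read: r = sum_i w(i) * M(i), a weighted sum of rows."""
    return w @ memory  # (N,) @ (N, M) -> (M,)

# Sharp attention on slot 2 approximates a hard read of that row.
w_sharp = np.array([0.0, 0.0, 1.0, 0.0])
print(read(memory, w_sharp))    # [6. 7. 8.] -- exactly memory[2]

# Diffuse attention blends all rows into an average.
w_diffuse = np.full(4, 0.25)
print(read(memory, w_diffuse))  # [4.5 5.5 6.5] -- the column-wise mean
```

Because `read` is a plain matrix product, any autodiff framework can propagate gradients through both `w` and `memory`.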

4. Writing to Memory

Writing is split into two sub-operations, applied in sequence: erase then add. The controller emits an erase vector e_t ∈ (0,1)^M and an add vector a_t ∈ R^M.

Erase step:

$$\tilde{M}_t(i) = M_{t-1}(i) \left[\mathbf{1} - w_t(i) \, e_t\right]$$

Add step:

$$M_t(i) = \tilde{M}_t(i) + w_t(i) \, a_t$$

The combined write operation is:

$$M_t(i) = M_{t-1}(i)\left[\mathbf{1} - w_t(i)\,e_t\right] + w_t(i)\,a_t$$

When e_t = 1 (all ones), the erase completely resets the attended location before adding a_t: a full overwrite. When e_t = 0, the erase does nothing and a_t is simply added to the existing content. Locations with low w_t(i) are barely touched, implementing selective writes.
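The erase-then-add update can be sketched in a few lines of NumPy; the `write` helper and the toy values below are illustrative, not from the paper:

```python
import numpy as np

def write(memory, w, e, a):
    """Erase-then-add: M(i) <- M(i) * (1 - w(i) * e) + w(i) * a, for every row i."""
    erased = memory * (1.0 - np.outer(w, e))  # erase step
    return erased + np.outer(w, a)            # add step

memory = np.ones((4, 3))
w = np.array([0.0, 1.0, 0.0, 0.0])  # attention fully on slot 1
e = np.ones(3)                       # e = 1: full erase at attended slots
a = np.array([5.0, 6.0, 7.0])

new_memory = write(memory, w, e, a)
print(new_memory[1])  # [5. 6. 7.] -- slot 1 fully overwritten with a
print(new_memory[0])  # [1. 1. 1.] -- w(0) = 0, so slot 0 is untouched
```

With a fractional `w` or `e`, the same code produces partial, blended updates, which is what keeps the operation differentiable.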

5. Addressing Mechanisms

Attention weights w_t are computed through a four-step pipeline that combines content-based and location-based addressing. This allows the controller to either look up a specific value by content (like a hash map) or iterate sequentially through memory locations (like a pointer).

Step 1: Content Addressing

The controller emits a query key k_t and a strength scalar β_t > 0. Content similarity between k_t and each row M_t(i) is measured by cosine similarity:

$$K(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$$

The content-based weight distribution is:

$$w_t^c(i) = \frac{\exp\!\left(\beta_t \, K(\mathbf{k}_t, M_t(i))\right)}{\sum_j \exp\!\left(\beta_t \, K(\mathbf{k}_t, M_t(j))\right)}$$

β_t controls the sharpness of the lookup: large β_t → near-one-hot lookup (hard retrieval); small β_t → diffuse weighting across all locations (soft averaging). This is analogous to temperature in softmax.
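Content addressing is just a β-scaled softmax over cosine similarities. A minimal NumPy sketch (the helper name and toy memory are mine):

```python
import numpy as np

def content_addressing(memory, key, beta):
    """w_c(i) proportional to exp(beta * cosine(key, M(i)))."""
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    logits = beta * sims
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    return weights / weights.sum()

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.7, 0.7]])
key = np.array([1.0, 0.0])
print(content_addressing(memory, key, beta=1.0))   # diffuse over all rows
print(content_addressing(memory, key, beta=50.0))  # nearly one-hot on row 0
```

The `1e-8` guard against zero-norm rows is a standard implementation detail, not something the paper specifies.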

Step 2: Interpolation

A scalar gate g_t ∈ (0,1) interpolates between the content weight and the previous timestep's weight. This lets the controller decide whether to attend based on content (g_t ≈ 1) or maintain its previous location (g_t ≈ 0):

$$\mathbf{w}_t^g = g_t \, \mathbf{w}_t^c + (1 - g_t) \, \mathbf{w}_{t-1}$$

Step 3: Shift

A shift distribution s_t (over integer offsets, e.g., {−1, 0, +1}) is convolved with w_t^g to allow the controller to move the attention head forward or backward. This enables sequential iteration over memory:

$$\tilde{w}_t(i) = \sum_j w_t^g(j) \, s_t(i - j)$$

For the copy task, the network learns to set s_t to shift +1 every step, implementing a sequential scan.
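The circular shift convolution can be sketched with `np.roll`, assuming the three-offset kernel {−1, 0, +1} and wrap-around indexing (helper names are illustrative):

```python
import numpy as np

def shift(w, s):
    """Circular convolution of attention w with a shift distribution s over
    offsets {-1, 0, +1}: w_tilde(i) = sum_j w(j) * s(i - j), indices mod N."""
    return sum(p * np.roll(w, off) for off, p in zip((-1, 0, 1), s))

w = np.array([0.0, 1.0, 0.0, 0.0])     # head currently at slot 1
s_forward = np.array([0.0, 0.0, 1.0])  # all probability mass on offset +1
print(shift(w, s_forward))             # [0. 0. 1. 0.] -- head moves to slot 2

s_blur = np.array([0.1, 0.8, 0.1])     # mostly stay, slight blur to neighbours
print(shift(w, s_blur))                # [0.1 0.8 0.1 0. ]
```

The second example shows why Step 4 exists: any mass off the central offset smears the distribution a little every timestep.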

Step 4: Sharpening

The shift convolution can blur the weight distribution. A sharpening factor γ_t ≥ 1 re-concentrates the weights:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}$$

When γ_t = 1, sharpening is a no-op. When γ_t >> 1, the distribution approaches a one-hot vector over the argmax location.
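Putting the four steps together, here is a compact NumPy sketch of the whole addressing pipeline; parameter names follow the text, and the toy example is mine, not from the paper:

```python
import numpy as np

def addressing(memory, w_prev, key, beta, g, s, gamma):
    # 1. Content addressing: softmax over beta-scaled cosine similarities.
    sims = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w_c = np.exp(beta * sims - (beta * sims).max())
    w_c /= w_c.sum()
    # 2. Interpolation with the previous weights via gate g.
    w_g = g * w_c + (1.0 - g) * w_prev
    # 3. Circular shift convolution over offsets {-1, 0, +1}.
    w_tilde = sum(p * np.roll(w_g, off) for off, p in zip((-1, 0, 1), s))
    # 4. Sharpening: raise to gamma and renormalise.
    w = w_tilde ** gamma
    return w / w.sum()

memory = np.eye(4)
w_prev = np.array([1.0, 0.0, 0.0, 0.0])
# Ignore content entirely (g = 0), shift forward one slot, then sharpen:
w = addressing(memory, w_prev, key=np.ones(4), beta=1.0,
               g=0.0, s=np.array([0.0, 0.0, 1.0]), gamma=2.0)
print(w)  # attention has moved from slot 0 to slot 1
```

With g = 0 the content lookup is bypassed and the head simply steps forward, which is exactly the behaviour the copy task exploits.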

6. Tasks Demonstrated

The paper evaluates NTMs against LSTMs on five algorithmic tasks. In every case, NTMs learn faster, achieve lower error, and generalize far better to out-of-distribution sequence lengths.

Copy Task

Read a sequence of random binary vectors, then reproduce them. NTM learns to write each vector to memory sequentially, then read back in order. Trained on length 1–20; NTM generalizes to length 120 almost perfectly. LSTM degrades rapidly above training length.

Repeated Copy

Copy a sequence K times (K given as input). NTM learns an outer loop (repeat K times) and inner loop (scan each position), implemented via location-based addressing. Demonstrates composable looping behavior.

Associative Recall

Store a list of item–item pairs; then, given a query item, retrieve the associated item. NTM uses content-based addressing to look up the query key, then shifts +1 to fetch the paired item, implementing a primitive key-value store.

Dynamic N-Grams

Predict the next symbol in a stream drawn from an n-gram distribution that changes over time. NTM can update its memory to track the current distribution, acting as an adaptive statistical model. Performance approaches the Bayesian optimal predictor.

Priority Sort

Given a sequence of (value, priority) pairs, output the values sorted by descending priority. NTM learns to write each value to the memory location indexed by its priority, then read back sequentially: a differentiable bucket sort.

7. NTM vs. LSTM

The fundamental difference is not in expressivity (both are Turing-complete in theory) but in inductive bias and data efficiency.

| Property | LSTM | NTM |
|---|---|---|
| Memory capacity | Fixed (hidden state size) | Scalable (N×M matrix) |
| Memory access | Implicit (all gates) | Explicit (addressed reads/writes) |
| Length generalization | Poor (degrades sharply) | Strong (2× training length) |
| Addressing style | None | Content + location hybrid |
| Training speed on copy | ~100k steps to converge | ~30k steps to converge |

8. Legacy and Influence

NTMs were one of the first demonstrations that differentiable memory could be bolted onto a neural network and trained end-to-end, a template that influenced many later architectures.

  • Differentiable Neural Computer (DNC) (Graves et al., 2016): Extended NTM with dynamic memory allocation, temporal link matrices for tracking write order, and stronger generalization. Used to navigate London Underground maps.
  • Memory Networks (Weston et al., 2014): Concurrent work with similar external memory idea, applied to QA tasks.
  • Attention mechanisms: The content-based addressing in NTMs is conceptually identical to the attention mechanism later formalized in seq2seq models and Transformers. NTMs can be seen as an early attention-over-memory system.
  • Meta-learning: NTMs demonstrated that a neural network can implement an algorithm rather than just fit a function, which proved foundational to later work on learning to learn.

Resources