Recurrent Neural Network Regularization

Zaremba, Sutskever, Vinyals · 2014 · arXiv 1409.2329

TL;DR

Applying dropout naively to recurrent connections in LSTMs destroys long-range memory and hurts performance. The fix is simple: apply dropout only on the vertical (non-recurrent) connections, between layers and at input/output, and never on the horizontal recurrent connections across timesteps. This achieved 68.0 test perplexity on PTB, beating the previous SOTA of 78.4, and became the standard recipe for regularizing RNNs for years.

1. Dropout in Feedforward Networks

Dropout (Srivastava et al. 2014) is one of the most effective regularization techniques for deep neural networks. During training, it randomly zeros activations with probability p. In the original formulation, all units are kept at test time and the weights are scaled by (1-p) to match; the now-standard inverted variant instead rescales the surviving activations by 1/(1-p) during training, so the test-time network needs no change.

The standard dropout formulation applies a Bernoulli mask to a hidden layer h:

Standard dropout mask
\tilde{h} = \frac{\mathbf{m}}{1-p} \odot h, \quad \mathbf{m} \sim \text{Bernoulli}(1-p)

Each element of m is independently drawn: 1 (keep) with probability 1-p, and 0 (drop) with probability p. The factor 1/(1-p) is the inverted dropout scaling that keeps expected activations the same as at test time. In practice, modern frameworks implement this as inverted dropout so no rescaling is needed at test time.
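As a concrete sketch of the inverted formulation above (pure Python, function name hypothetical):

```python
import random

def inverted_dropout(h, p, training=True):
    """Zero each activation with probability p and scale survivors by
    1/(1-p), so the expected activation matches test time, where the
    layer is simply the identity."""
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [x / keep if random.random() < keep else 0.0 for x in h]

random.seed(0)
out = inverted_dropout([1.0] * 10000, p=0.3)
mean_activation = sum(out) / len(out)  # close to 1.0 in expectation
```

The rescaling by 1/(1-p) is what lets the same network run unchanged at test time.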

For feedforward networks, dropout between every pair of layers is straightforward and highly effective. The question is: can the same trick be applied to recurrent neural networks?

2. The Naive RNN Dropout Mistake

A naive approach would apply dropout uniformly to all connections in an RNN: both the vertical connections (between layers at the same timestep) and the horizontal recurrent connections (carrying hidden state from one timestep to the next). This turns out to be a serious mistake.

In a sequence model, information must flow across T timesteps. If dropout is applied to recurrent connections at each timestep, the hidden state signal must survive repeated random erasure across the entire sequence. For a sequence of length T with dropout probability p applied at each step, the probability that any given unit survives all T timestep connections is (1-p)^T, which goes to 0 exponentially fast as T grows.

Signal survival probability over T steps
P(\text{signal survives } T \text{ steps}) = (1-p)^T \xrightarrow{T \to \infty} 0

The result: long-range dependencies are effectively severed. The gradient signal for events far back in time is multiplied by the product of the recurrent Jacobians and T random binary masks, a product that vanishes rapidly. The network can no longer learn from context beyond a few steps, which is precisely what RNNs are designed to capture.
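A quick numerical check of how fast (1-p)^T collapses, for a typical dropout rate:

```python
# Decay of the survival probability (1 - p)^T when dropout is applied
# on the recurrent connection at every one of T timesteps.
p = 0.3  # dropout probability on the recurrent connection
survival = {T: (1 - p) ** T for T in (1, 5, 20, 100)}
# Already at T=20 the signal arrives intact less than 0.1% of the time.
```

At T=100 the survival probability is on the order of 1e-16, i.e. the long-range path is effectively never intact.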

3. The Fix: Only Drop Vertical Connections

The key contribution of Zaremba et al. is elegantly simple: apply dropout only on the non-recurrent (vertical) connections. Specifically:

  • Drop: Input to the first hidden layer (embedding → hidden layer 1)
  • Drop: Between stacked LSTM layers (hidden layer l → hidden layer l+1)
  • Drop: From the final hidden layer to the output (hidden layer L → softmax)
  • Never drop: The recurrent connections h_t^l → h_{t+1}^l within each layer across timesteps

With this scheme, information flowing from input to output passes through a dropout mask at most L+1 times (once per layer boundary), regardless of the sequence length T. The recurrent connections remain intact, preserving gradient flow and long-range memory.

Correct dropout: applied to vertical connections only
h_t^l = f\left(\text{dropout}(h_t^{l-1}) \cdot W_x + h_{t-1}^l \cdot W_h + b\right)

Notice that h_t^{l-1} (the vertical input from the layer below) passes through dropout, but h_{t-1}^l (the horizontal recurrent input from the previous timestep at the same layer) does not.
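A minimal sketch of this placement, using a plain tanh RNN layer for brevity (pure Python; the helper names are hypothetical):

```python
import math
import random

def dropout(v, p):
    """Inverted dropout on a vector (list of floats)."""
    keep = 1.0 - p
    return [x / keep if random.random() < keep else 0.0 for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_below, h_prev, Wx, Wh, b, p=0.5):
    """One timestep of layer l: dropout hits only the vertical input
    h_below (from layer l-1); the recurrent input h_prev is untouched."""
    x = dropout(h_below, p)  # vertical connection: dropped
    pre = [a + c + d for a, c, d in zip(matvec(Wx, x), matvec(Wh, h_prev), b)]
    return [math.tanh(z) for z in pre]

Wx = [[1.0, 0.0], [0.0, 1.0]]  # vertical weights (identity, for the demo)
Wh = [[0.0, 0.0], [0.0, 0.0]]  # recurrent weights (zero, for the demo)
h = rnn_step([0.5, -0.5], [9.9, 9.9], Wx, Wh, [0.0, 0.0], p=0.0)
# with p = 0 this reduces to tanh of the vertical input
```

The same wiring carries over unchanged to the LSTM case: the mask sits on h_below, never on h_prev.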

4. LSTM Architecture with Dropout

The paper uses a standard multi-layer LSTM. Let's write out the full LSTM equations to see precisely where dropout is and isn't applied. For layer l at timestep t, the LSTM computes four gates:

LSTM input gate
i_t = \sigma\left(W_{xi}\,\tilde{h}_t^{l-1} + W_{hi}\,h_{t-1}^l + b_i\right)
LSTM forget gate
f_t = \sigma\left(W_{xf}\,\tilde{h}_t^{l-1} + W_{hf}\,h_{t-1}^l + b_f\right)
LSTM output gate
o_t = \sigma\left(W_{xo}\,\tilde{h}_t^{l-1} + W_{ho}\,h_{t-1}^l + b_o\right)
LSTM cell candidate (g gate)
g_t = \tanh\left(W_{xg}\,\tilde{h}_t^{l-1} + W_{hg}\,h_{t-1}^l + b_g\right)
LSTM cell state update
c_t^l = f_t \odot c_{t-1}^l + i_t \odot g_t
LSTM hidden state output
h_t^l = o_t \odot \tanh(c_t^l)

The key notation is the tilde: \tilde{h}_t^{l-1} = \text{dropout}(h_t^{l-1}), the dropped-out version of the layer below's output. All four gate computations receive this dropped version as vertical input. By contrast, the recurrent term h_{t-1}^l carries no mask: it is always used intact.

Where the masks sit:

  • input → h^1: dropout applied
  • h^1 → h^2: dropout applied
  • h^L → output: dropout applied
  • h_t^l → h_{t+1}^l: NO dropout
  • c_t^l → c_{t+1}^l: NO dropout

The cell state c and the hidden state h both flow horizontally without any mask, preserving the LSTM's ability to carry long-range information.
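The gate equations above can be transcribed directly into a runnable sketch (pure Python, 1-dimensional demo; the helper names are hypothetical, not the paper's code):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dropout(v, p):
    """Inverted dropout on a vector (list of floats)."""
    keep = 1.0 - p
    return [x / keep if random.random() < keep else 0.0 for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    return [sum(xs) for xs in zip(*vs)]

def lstm_step(h_below, h_prev, c_prev, params, p=0.5):
    """One LSTM step for layer l at time t, with the paper's placement:
    the vertical input h_below is dropped (the h-tilde above), while
    h_prev and c_prev cross timesteps unmasked. `params` maps each gate
    name in {"i", "f", "o", "g"} to a (W_x, W_h, b) triple."""
    h_tilde = dropout(h_below, p)  # the only place dropout touches the cell
    def gate(name, fn):
        Wx, Wh, b = params[name]
        return [fn(z) for z in add(matvec(Wx, h_tilde), matvec(Wh, h_prev), b)]
    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    c = [ft * ct + it * gt for ft, ct, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

params = {k: ([[0.0]], [[0.0]], [0.0]) for k in "ifog"}
h, c = lstm_step([1.0], [1.0], [2.0], params, p=0.0)
# zero weights make every sigmoid gate 0.5 and g = 0, so c = 0.5 * c_prev
```

Note that the additive cell update c_t = f ⊙ c_{t-1} + i ⊙ g runs on unmasked state, which is exactly what keeps the gradient highway open.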

5. Why Recurrent Connections Are Special

The fundamental reason is the role of recurrent connections in gradient flow during backpropagation through time (BPTT). For a sequence of length T, the gradient of the loss with respect to an early hidden state requires multiplying through T Jacobians of the recurrent transition:

Gradient flow in BPTT
\frac{\partial \mathcal{L}}{\partial h_1} = \frac{\partial \mathcal{L}}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}

This long product of Jacobians is already fragile; it is the source of the vanishing and exploding gradient problems that motivated the LSTM's design. Adding dropout to recurrent connections multiplies each Jacobian by an additional sparse binary mask, making the product even more likely to vanish. The LSTM's cell state with its additive update was designed specifically to maintain a stable gradient highway; disrupting that highway with dropout defeats the purpose.

More intuitively: the recurrent connections implement the network's working memory. A language model must remember that it is inside a relative clause, or that a number was singular three words ago for subject-verb agreement. These are the dependencies that make language modeling hard. Randomly erasing recurrent activations is like randomly scrambling someone's working memory mid-sentence: they lose the thread of what they were computing.
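A Monte Carlo illustration of the Jacobian-product argument, using a toy scalar linear RNN (an assumption for illustration, not the paper's setup):

```python
import random

# Toy scalar linear RNN h_t = w * (m_t / (1 - p)) * h_{t-1}. By the
# chain rule, d h_T / d h_1 is the product of the per-step factors, so
# a dropout mask m_t on the recurrent connection zeroes the whole
# gradient whenever any single m_t is zero.
random.seed(1)
w, p, T, trials = 1.0, 0.3, 20, 100_000

def bptt_gradient():
    grad = 1.0
    for _ in range(T):
        m = 1.0 if random.random() < 1 - p else 0.0
        grad *= w * m / (1 - p)
    return grad

nonzero = sum(bptt_gradient() != 0.0 for _ in range(trials)) / trials
# nonzero ~ (1 - p)**T ~ 8e-4: almost every sampled path carries zero
# gradient back to h_1.
```

Without the masks the same product is just w^T, so the masks alone are responsible for killing the signal in this toy.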

6. Results on PTB, Machine Translation, and Speech

The paper evaluates on three tasks. The headline result is Penn Treebank (PTB) language modeling.

Penn Treebank Language Modeling

PTB was the standard benchmark for language model perplexity at the time. Lower perplexity is better: it measures how surprised the model is by the test set on average (perplexity = exp(average negative log-likelihood per token)). The paper trains a 2-layer LSTM with 1500 hidden units per layer, approximately 65M parameters total.
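The perplexity definition in parentheses is a one-liner; a model that assigns every token probability 1/k has perplexity exactly k, so the number reads as an effective branching factor (pure Python sketch, function name hypothetical):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

uniform = perplexity([1 / 100] * 50)  # a 1-in-100 guess every time -> 100
```

By this measure, dropping PTB test perplexity from 78.4 to 68.0 means the model's average effective uncertainty per token shrank by roughly ten words.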

| Model | Valid PPL | Test PPL | Params |
|---|---|---|---|
| KN5 (Mikolov 2012) | — | 141.2 | 2M |
| RNN-LDA (Mikolov & Zweig 2012) | — | 92.0 | 7M |
| Deep RNN (Pascanu et al. 2013) | — | 107.5 | 6M |
| Previous SOTA (Mikolov & Zweig 2012) | — | 78.4 | — |
| LSTM medium (no dropout) | 86.2 | 82.7 | 20M |
| LSTM medium (dropout 0.65) | 81.0 | 77.4 | 20M |
| LSTM large (dropout 0.65) | 73.4 | 68.0 | 65M |

The large LSTM with dropout achieves 68.0 test perplexity, an improvement of more than 10 points over the previous SOTA of 78.4. Comparing the medium model without and with dropout (82.7 vs 77.4 test perplexity) confirms that dropout itself, not just the larger model size, is responsible for much of the gain.

Machine Translation and Speech Recognition

The paper also validates on English-to-French translation (WMT'14) using an encoder-decoder LSTM. Adding dropout improves the BLEU score from 14.5 to 16.5, a meaningful gain for MT systems at the time. The same principle applies: dropout on the embedding → encoder layer and encoder → decoder layer, but not on the encoder's recurrent connections.

For speech recognition, tested on TIMIT, the paper reports reduced phone error rate (PER) from 18.0% without dropout to 17.7% with dropout on an LSTM acoustic model. While more modest than the PTB gains, it shows the technique generalizes across sequential domains.

7. Influence: Variational Dropout and Beyond

Zaremba et al.'s approach dominated practice for roughly two years. The next significant advance came from Gal & Ghahramani (2016), who placed dropout in a Bayesian framework and derived a theoretically principled variant for RNNs, commonly called Variational Dropout.

The key difference in Variational Dropout: the same dropout mask is reused at every timestep within a single forward pass. In Zaremba et al., a fresh mask is sampled at each timestep for vertical connections. Gal & Ghahramani showed that sharing the mask across timesteps corresponds to approximate variational inference in a Bayesian neural network, and this formulation also allows dropout to be applied safely to recurrent connections (with a fixed-per-sequence mask).

Variational dropout: same mask repeated across timesteps
\tilde{h}_t^l = \frac{\mathbf{m}}{1-p} \odot h_t^l, \quad \mathbf{m} \sim \text{Bernoulli}(1-p) \text{ fixed for all } t
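The mask-sharing idea is easy to express in code: sample one mask per sequence, then apply it at every timestep (pure Python sketch, function names hypothetical):

```python
import random

def make_mask(n, p):
    """Sample one inverted-dropout mask, reused for the whole sequence."""
    keep = 1.0 - p
    return [1.0 / keep if random.random() < keep else 0.0 for _ in range(n)]

def apply_mask(mask, h):
    return [m * x for m, x in zip(mask, h)]

def run_sequence(seq, p=0.5, seed=0):
    """Variational dropout over a sequence of hidden vectors: one mask,
    applied at every timestep (Zaremba et al. would instead sample a
    fresh mask per timestep, and only on vertical connections)."""
    random.seed(seed)
    mask = make_mask(len(seq[0]), p)  # sampled once per sequence
    return [apply_mask(mask, h_t) for h_t in seq]

seq = [[1.0] * 8 for _ in range(5)]
out = run_sequence(seq, p=0.5)
patterns = [tuple(x == 0.0 for x in step) for step in out]
# the zero pattern is identical at every timestep, since the mask is shared
```

Because each unit is either always on or always off for the whole sequence, the surviving units form an intact sub-network across time, which is why this variant can safely mask recurrent connections too.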

Subsequent work further refined RNN regularization:

  • Zoneout (Krueger et al. 2016): Instead of zeroing hidden units, randomly keeps the previous timestep's value, like a stochastic identity shortcut across time.
  • Merity et al. AWD-LSTM (2017): Combines Zaremba-style dropout with DropConnect on recurrent weights, embedding dropout, and AR/TAR regularization to achieve 57.3 test perplexity on PTB.
  • Transformer era (post-2017): With the rise of attention-based models, recurrent dropout became less central; Transformers handle long-range dependencies differently and apply dropout on attention weights and feedforward connections instead.

Despite being superseded by Transformers for many sequence modeling tasks, the conceptual insight of Zaremba et al., that regularization must respect the structure of information flow rather than blindly applying noise to all connections, remains a fundamental principle of neural network design.

8. Additional Resources