## TL;DR
LSTMs solve the vanishing gradient problem of vanilla RNNs by introducing a cell state: a "conveyor belt" that carries information across many timesteps with minimal modification. Three learned gates (forget, input, output) plus a candidate layer decide what to erase, write into, and read out of this state. The key trick is the additive cell state update, C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t, which creates a near-constant gradient highway through time.
## 1. The Vanishing Gradient Problem
Recurrent Neural Networks (RNNs) process sequences by maintaining a hidden state h_t that is updated at each timestep. The hidden state is supposed to carry information from the past that is relevant to predicting the next token or label. In principle, an RNN should be able to use context from arbitrarily far back in the sequence.
In practice, vanilla RNNs fail to learn long-range dependencies. The culprit is the vanishing gradient problem. During backpropagation through time (BPTT), the gradient of the loss with respect to an early hidden state involves a long chain of matrix multiplications, one per timestep. If the eigenvalues of the recurrent weight matrix have magnitude below 1, these repeated multiplications drive the gradient exponentially toward zero; if they are above 1, gradients explode.
Concrete failure: consider predicting the verb "are" in: "The cats that the dog chased are hungry." A vanilla RNN must bridge the four intervening tokens between "cats" (plural subject) and "are". By the time the gradient flows back from the prediction to the representation of "cats", it has been multiplied by the recurrent weight matrix at every intervening step; over longer gaps it effectively vanishes.
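The decay can be made concrete with a toy numpy experiment (not from the original post): repeatedly backpropagating a gradient vector through a recurrent matrix whose spectral norm is 0.9 shrinks its norm exponentially.

```python
import numpy as np

# Toy demonstration of gradient decay in BPTT. W is a hypothetical
# recurrent weight matrix, rescaled so its largest singular value is 0.9.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
W *= 0.9 / np.linalg.norm(W, ord=2)    # force spectral norm to 0.9

grad = np.ones(16)                     # gradient arriving at the final timestep
norms = []
for _ in range(30):
    grad = W.T @ grad                  # one BPTT step (tanh derivative omitted)
    norms.append(float(np.linalg.norm(grad)))

# The norm shrinks at least as fast as 0.9 per step.
print(f"after 1 step:   {norms[0]:.4f}")
print(f"after 30 steps: {norms[-1]:.6f}")
```

The tanh derivative (at most 1) is omitted here; including it would only make the decay faster.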
Hochreiter & Schmidhuber (1997) identified this problem formally and proposed Long Short-Term Memory as the solution. The key insight: instead of propagating information multiplicatively through a hidden state, use an additive cell state that can carry information forward unchanged over many steps.
## 2. LSTM's Cell State: The Conveyor Belt
The key innovation of the LSTM is the cell state C_t, a separate memory track that runs alongside the hidden state. Olah's blog post describes it as a "conveyor belt": information can ride this belt across many timesteps with only minor, deliberate modifications.
Unlike the hidden state h_t, which is passed through tanh at every step, the cell state is modified only through element-wise addition and multiplication. This additive structure is what prevents gradient vanishing: the gradient of C_t with respect to C_{t-1} is just the forget gate value f_t, which can stay close to 1.
The full LSTM state at each timestep:
- Cell state C_t: long-term memory, the conveyor belt
- Hidden state h_t: short-term output, passed to the next step and used for predictions
Three gates regulate what information flows into and out of the cell state. Each gate is a sigmoid layer (output in (0, 1)) combined with pointwise multiplication: 0 means "block everything", 1 means "let everything through". This gating mechanism gives the network fine-grained control over what to remember and forget.
## 3. Forget Gate
The first operation the LSTM performs is deciding what to erase from the cell state. This is done by the forget gate f_t, a sigmoid layer that looks at the previous hidden state h_{t-1} and the current input x_t and outputs a number between 0 and 1 for each element of C_{t-1}:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

Here [h_{t-1}, x_t] denotes concatenation of the previous hidden state and the current input, W_f is the learned weight matrix for the forget gate, and b_f is the bias. The sigmoid output f_t ∈ (0, 1)^n is then multiplied element-wise with C_{t-1} during the cell state update.
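This computation can be sketched in numpy. The sizes (hidden state 4, input 3) are hypothetical, and W_f and b_f are random stand-ins for learned parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3              # hypothetical dimensions
rng = np.random.default_rng(1)
W_f = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)                 # stand-in for the learned bias

h_prev = rng.standard_normal(hidden_size)   # h_{t-1}
x_t = rng.standard_normal(input_size)       # current input

concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
f_t = sigmoid(W_f @ concat + b_f)           # forget gate, each entry in (0, 1)

C_prev = rng.standard_normal(hidden_size)   # previous cell state
kept = f_t * C_prev                         # element-wise erase/keep
```

A common practical trick is to initialize b_f to a positive value so the gate starts open and the network remembers by default.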
Example (gender tracking): in a language model generating "She went to the store. Then he bought...", the forget gate should fire when the new subject pronoun "he" appears, erasing the gender stored in the cell state for the previous subject and making room for the new subject's gender.
## 4. Input Gate and Candidate Values
Next, the LSTM decides what new information to write into the cell state. This involves two parallel operations: the input gate i_t (how much to write) and the candidate layer C̃_t (what values to write):

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)

The input gate i_t is a sigmoid (values in (0, 1)) that decides which dimensions of the candidate are worth writing. The candidate C̃_t is a tanh layer (values in (−1, 1)) that proposes new values to add to the cell state. Together they determine the update term i_t ⊙ C̃_t.
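The two parallel layers look like this in numpy (same hypothetical sizes as before; W_i, b_i, W_C, b_C are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3              # hypothetical dimensions
rng = np.random.default_rng(2)
# Stand-ins for the learned parameters of the two parallel layers.
W_i = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
W_C = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
b_C = np.zeros(hidden_size)

h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]

i_t = sigmoid(W_i @ concat + b_i)       # how much to write, entries in (0, 1)
C_tilde = np.tanh(W_C @ concat + b_C)   # what to write, entries in (-1, 1)
update = i_t * C_tilde                  # the term later added to the cell state
```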
Example (subject tracking): when the model reads a new noun phrase "the cats" that could serve as the subject of an upcoming verb, the input gate opens to write the plurality of "cats" into the cell state. The candidate layer proposes the actual encoded values, and the input gate decides how strongly to write them.
## 5. Cell State Update
With the forget gate, input gate, and candidate computed, the cell state is updated by combining them:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
This is the heart of the LSTM. The old cell state C_{t-1} is selectively erased by multiplying with f_t, then new information i_t ⊙ C̃_t is added in. The operation is purely additive (after the forget scaling): no tanh squashing, no weight matrix multiplication.
### Why addition prevents vanishing gradients
The gradient ∂C_t/∂C_{t-1} is simply f_t (element-wise). As long as the forget gate stays open (f_t ≈ 1), gradients flow through the cell state nearly unchanged across many timesteps. This is analogous to ResNet's skip connections: the additive path creates a gradient highway through time.
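This claim is easy to verify numerically. A finite-difference check (toy gate values, not learned) confirms that each diagonal entry of the Jacobian of the cell update equals the corresponding forget gate value:

```python
import numpy as np

f_t = np.array([0.99, 0.50, 0.01])       # toy forget gate values
i_t = np.array([0.30, 0.70, 0.90])       # toy input gate values
C_tilde = np.array([0.10, -0.20, 0.80])  # toy candidate values

def cell_update(C_prev):
    # The LSTM cell state update: C_t = f_t * C_prev + i_t * C_tilde
    return f_t * C_prev + i_t * C_tilde

C_prev = np.array([1.0, -1.0, 0.5])
eps = 1e-6
for k in range(3):
    bump = np.zeros(3)
    bump[k] = eps
    num_grad = (cell_update(C_prev + bump)[k] - cell_update(C_prev)[k]) / eps
    # Each numerical derivative matches f_t[k] to within rounding error.
    assert abs(num_grad - f_t[k]) < 1e-5
```

Dimension 0 (f ≈ 1) passes gradients through almost untouched; dimension 2 (f ≈ 0) blocks them, exactly as the forget gate intends.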
## 6. Output Gate
Finally, the LSTM decides what to output from the cell state as the hidden state h_t. The output is a filtered, compressed version of the cell state; not everything stored is relevant to the current prediction:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

The output gate o_t is another sigmoid that decides which parts of the cell state to expose. The cell state C_t is first squashed through tanh (to bring values into (−1, 1)), then multiplied element-wise by o_t. The result h_t is the hidden state passed to the next timestep and used for predictions.
Example (verb output): if the cell state has stored that the current subject is plural, the output gate can expose this information when the model needs to decide the form of an upcoming verb ("are" vs. "is"). Other stored information (e.g., the topic of conversation) may remain in the cell state but be gated out of h_t if it is not relevant to the immediate prediction.
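The output computation can be sketched in the same style (hypothetical sizes; W_o and b_o are random stand-ins for learned parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3              # hypothetical dimensions
rng = np.random.default_rng(3)
W_o = 0.1 * rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)                 # stand-in learned parameters

h_prev = rng.standard_normal(hidden_size)   # h_{t-1}
x_t = rng.standard_normal(input_size)       # current input
C_t = rng.standard_normal(hidden_size)      # cell state after the update

o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)
h_t = o_t * np.tanh(C_t)    # filtered, squashed view of the cell state
```

Because tanh bounds each entry in (−1, 1) and o_t in (0, 1), every entry of h_t has magnitude below 1, while C_t itself is unbounded.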
## 7. Concrete Example: Tracking Subject in a Sentence
Let us walk through how an LSTM would handle the sentence: "The clouds in the sky are beautiful." A language model processing this sentence needs to predict "are" (not "is") because the subject "clouds" is plural.
### Step 1: Read "The"
An article is detected. The input gate may weakly activate some dimensions to indicate that a noun phrase is starting. The cell state barely changes.
### Step 2: Read "clouds"
A plural noun is detected. The input gate opens strongly for the "subject plurality" dimension. The candidate layer proposes a high value for "plural". This is written into the cell state: C_t[plurality] ≈ 1.
### Step 3: Read "in", "the", "sky"
Prepositional phrase. The forget gate keeps f_t[plurality] ≈ 1, so the plurality information is preserved unchanged across these three timesteps. The conveyor belt carries the information forward.
### Step 4: Predict "are"
The output gate opens for the "subject plurality" dimension. h_t reflects that the subject is plural, which drives the softmax to assign high probability to "are" over "is".
This example illustrates the power of the cell state: the plurality of "clouds" must survive 3 intervening tokens with no grammatical reinforcement. A vanilla RNN would have difficulty; an LSTM handles it naturally by keeping the forget gate open for the relevant memory slot.
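The four operations walked through above fit in a short from-scratch sketch. Assuming numpy, hypothetical sizes, and random stand-in weights (a real model would learn them), one full LSTM step and a loop over a six-token "sentence" look like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM timestep; params holds (W, b) for each of the four layers."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])   # candidate
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    C_t = f_t * C_prev + i_t * C_tilde                          # conveyor-belt update
    h_t = o_t * np.tanh(C_t)                                    # exposed hidden state
    return h_t, C_t

hidden_size, input_size = 8, 5              # hypothetical dimensions
rng = np.random.default_rng(4)
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = 0.1 * rng.standard_normal(
        (hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for t in range(6):                          # e.g. six tokens of a sentence
    x = rng.standard_normal(input_size)     # stand-in token embedding
    h, C = lstm_step(x, h, C, params)
```

With trained weights, a dimension of C playing the "plurality" role would be written at "clouds", preserved through "in the sky", and read out at the verb.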
## 8. GRU: The Simplified Version
The Gated Recurrent Unit (GRU), introduced by Cho et al. (2014), is a streamlined variant of the LSTM. It makes two simplifications: (1) it merges the cell state and hidden state into a single h_t, and (2) it uses only two gates, a reset gate r_t and an update gate z_t, instead of the LSTM's three.
The update gate z_t plays the combined role of the LSTM's forget and input gates. When z_t ≈ 1, the new candidate h̃_t mostly replaces the old state (like a high input gate and a near-zero forget gate). When z_t ≈ 0, the old state is mostly preserved (like a high forget gate and a near-zero input gate).
The reset gate r_t controls how much of the previous hidden state is used when computing the candidate. When r_t ≈ 0, the candidate is computed almost entirely from the input x_t, effectively resetting the memory. When r_t ≈ 1, the candidate uses the full previous state, allowing it to track long-term dependencies.
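A GRU step can be sketched the same way (hypothetical sizes, random stand-in weights). Note the single state and the interpolation by z_t; sign conventions for z_t vary between write-ups, and this sketch follows the description above, where z_t ≈ 1 favors the candidate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU timestep: two gates, one state."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])   # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])   # reset gate
    # The candidate is computed from the *reset* previous state.
    concat_r = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ concat_r + params["b_h"])
    # z_t decides how much of the candidate replaces the old state.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

hidden_size, input_size = 8, 5              # hypothetical dimensions
rng = np.random.default_rng(5)
params = {}
for name in ("z", "r", "h"):
    params[f"W_{name}"] = 0.1 * rng.standard_normal(
        (hidden_size, hidden_size + input_size))
    params[f"b_{name}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for t in range(6):
    h = gru_step(rng.standard_normal(input_size), h, params)
```

Comparing this with the LSTM step makes the parameter saving visible: three weight matrices instead of four, and no separate cell state to carry around.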
| Property | LSTM | GRU |
|---|---|---|
| States | Two: C_t (cell) + h_t (hidden) | One: h_t only |
| Gates | Three: forget, input, output | Two: update, reset |
| Parameters | More (4 × hidden × (hidden + input) weights) | Fewer (3 × hidden × (hidden + input) weights) |
| Training speed | Slower per step | Faster per step |
| Performance | Slightly better on very long sequences | Often competitive, sometimes better |
## 9. Why LSTMs Dominated 2015–2017
In the years following Olah's blog post, LSTMs became the default architecture for virtually every sequence modeling task: language modeling, machine translation, speech recognition, sentiment analysis, and time series forecasting. Several factors made them dominant.
### The additive update solves vanishing gradients
The key equation C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t creates an additive path through time. The gradient ∂C_t/∂C_{t-1} = f_t can remain close to 1 across hundreds of timesteps, enabling the model to learn dependencies over sequences of length 100–1000, something completely out of reach for vanilla RNNs.
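A back-of-the-envelope comparison with toy decay factors (illustrative, not measured) shows the scale of the difference over 500 timesteps:

```python
# Gradient magnitude surviving a 500-step backward pass, per the chain rule:
# - LSTM cell-state path: product of forget gates f_t (here all 0.999).
# - Vanilla RNN path: product of per-step Jacobian norms (here all 0.9).
T = 500
lstm_grad = 0.999 ** T   # roughly 0.61: most of the signal survives
rnn_grad = 0.9 ** T      # roughly 1e-23: numerically annihilated

print(f"LSTM path after {T} steps: {lstm_grad:.3f}")
print(f"RNN path after {T} steps:  {rnn_grad:.3e}")
```

The same exponential that kills the vanilla RNN barely dents the LSTM as long as the forget gate stays near 1.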
### Gating provides expressive memory management
The three gates give the network fine-grained control: forget specific information, write specific new information, and read selectively. This expressiveness allows a single LSTM layer to simultaneously track multiple pieces of state β subject, tense, topic, and more.
### Empirical success across many domains
LSTMs achieved state-of-the-art on machine translation (seq2seq with attention, Bahdanau et al. 2015), language modeling (Zaremba et al. 2014), and speech recognition (Graves et al. 2013). The consistency of these results across very different tasks gave practitioners confidence in the architecture.
### Practical trainability
LSTMs are stable to train with standard gradient descent + gradient clipping, unlike vanilla RNNs which require careful initialization and learning rate tuning to avoid gradient explosion. This practical reliability made them the default choice.
### Why Transformers eventually replaced LSTMs
LSTMs are inherently sequential: h_t depends on h_{t-1}, so you cannot parallelize across timesteps during training. Transformers, introduced in 2017, process all positions in parallel via self-attention, enabling much faster training on GPUs/TPUs and scaling to far larger datasets and models. For tasks with very long contexts (over 1000 tokens), Transformers also generally outperform LSTMs.
## Further Reading
- Olah (2015), "Understanding LSTM Networks" (the original blog post)
- Sutskever et al. (2014), "Sequence to Sequence Learning with Neural Networks" (LSTM-based seq2seq that launched neural MT)
- Cho et al. (2014), "Learning Phrase Representations using RNN Encoder-Decoder" (introduced the GRU)
- Bahdanau et al. (2015), "Neural Machine Translation by Jointly Learning to Align and Translate" (attention + LSTM, precursor to Transformers)
- Vaswani et al. (2017), "Attention Is All You Need" (the Transformer, which largely replaced LSTMs)