The Unreasonable Effectiveness of Recurrent Neural Networks

Andrej Karpathy · 2015 · karpathy.github.io

TL;DR

In 2015, Karpathy trained simple character-level RNNs on raw text and showed they spontaneously learn structure far beyond what anyone expected: line breaks, quotation tracking, indentation depth, even rudimentary syntax. Training on Shakespeare produces plausible Shakespeare. Training on Linux source produces compilable-looking C. This wasn't just impressive: it directly seeded the intuition behind GPT and all modern token-level language models.

1. RNN Architecture Recap

A Recurrent Neural Network maintains a hidden state vector that gets updated at every time step. At step t, the network takes the current input x_t and the previous hidden state h_{t-1}, mixes them through learned weight matrices, and squashes the result through a tanh to produce the new hidden state h_t:

RNN hidden state update
h_t = \tanh(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h)
RNN output
y_t = W_{hy}\, h_t + b_y

The final output y_t is passed through a softmax to get a probability distribution over the next character. Three weight matrices are learned end-to-end:

  • W_xh (input-to-hidden): maps the current input character (one-hot vector of size V) into the hidden space. Encodes "what the current character contributes to state."
  • W_hh (hidden-to-hidden): maps the previous hidden state forward. This is the recurrent connection that carries memory from the past. Crucially, the same W_hh is applied at every step.
  • W_hy (hidden-to-output): decodes the hidden state into logits over the vocabulary. One row per vocabulary character; softmax over these gives the next-character distribution.

The output distribution for the next token is:

Softmax output distribution
\hat{y}_t = \text{softmax}(W_{hy}\, h_t + b_y)
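The hidden-state update and softmax output can be sketched in a few lines of NumPy. This is a minimal illustration, not Karpathy's code; the shapes and variable names are assumptions:

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One RNN time step: update the hidden state, then produce
    a probability distribution over the next character."""
    h = np.tanh(Whh @ h_prev + Wxh @ x + bh)  # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y = Why @ h + by                          # logits over the vocabulary
    p = np.exp(y - y.max())                   # softmax, shifted for numerical stability
    return h, p / p.sum()

# toy shapes: vocabulary of 5 characters, hidden size 8
V, H = 5, 8
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V))
Whh = rng.standard_normal((H, H))
Why = rng.standard_normal((V, H))
bh, by = np.zeros((H, 1)), np.zeros((V, 1))

x = np.zeros((V, 1)); x[2] = 1.0             # one-hot input character
h, p = rnn_step(x, np.zeros((H, 1)), Wxh, Whh, Why, bh, by)
```

Note that the same three matrices are reused at every step; only x and h change as the sequence unrolls.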

Training minimizes the cross-entropy loss summed over all time steps, i.e. the negative log-probability of the correct next character at each position:

Cross-entropy loss over sequence
\mathcal{L} = -\sum_t \log\, p(x_{t+1} \mid x_1, x_2, \ldots, x_t)
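As a tiny worked example with made-up numbers: suppose that over a three-character sequence the model assigned the correct next character probabilities 0.5, 0.25, and 0.25.

```python
import numpy as np

# toy values: probability the model assigned to the *correct* next character at each step
p_correct = np.array([0.5, 0.25, 0.25])
loss = -np.log(p_correct).sum()  # cross-entropy summed over the sequence, about 3.47 nats
```

A perfect model (probability 1 at every step) would score 0; less confident predictions of the correct character push the loss up.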

Gradients flow backward through time via Backpropagation Through Time (BPTT). In practice, gradients are truncated after K steps to avoid memory blowup and vanishing/exploding gradients:

Truncated BPTT gradient accumulation
\frac{\partial \mathcal{L}}{\partial W} = \sum_{t} \frac{\partial \mathcal{L}_t}{\partial W} \quad \text{(truncated at } K \text{ steps)}
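A minimal NumPy sketch of one truncated-BPTT chunk, in the spirit of Karpathy's min-char-rnn gist (names and structure here are assumptions, not his code). The key point is the truncation: the hidden state carried in from the previous chunk is treated as a constant, so gradients stop flowing at the chunk boundary.

```python
import numpy as np

def rnn_chunk_grads(inputs, targets, h_prev, Wxh, Whh, Why, bh, by):
    """Forward then backward over one K-step chunk of character indices.
    Gradients are truncated at the chunk boundary: h_prev is a constant."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    V = Wxh.shape[1]
    loss = 0.0
    # forward pass over the chunk
    for t, (ix, iy) in enumerate(zip(inputs, targets)):
        xs[t] = np.zeros((V, 1)); xs[t][ix] = 1.0          # one-hot input
        hs[t] = np.tanh(Whh @ hs[t - 1] + Wxh @ xs[t] + bh)
        y = Why @ hs[t] + by
        ps[t] = np.exp(y - y.max()); ps[t] /= ps[t].sum()  # softmax
        loss -= np.log(ps[t][iy, 0])                       # cross-entropy term
    # backward pass: BPTT within the chunk only
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dh_next = np.zeros_like(h_prev)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1             # d loss / d logits
        dWhy += dy @ hs[t].T; dby += dy
        dh = Why.T @ dy + dh_next                          # gradient into h_t
        draw = (1 - hs[t] ** 2) * dh                       # back through tanh
        dbh += draw; dWxh += draw @ xs[t].T; dWhh += draw @ hs[t - 1].T
        dh_next = Whh.T @ draw                             # flows to h_{t-1}; dies at h_prev
    return loss, (dWxh, dWhh, dWhy, dbh, dby), hs[len(inputs) - 1]

# toy run: 3-step chunk over a 5-character vocabulary, hidden size 4
V, H = 5, 4
rng = np.random.default_rng(1)
Wxh = 0.1 * rng.standard_normal((H, V))
Whh = 0.1 * rng.standard_normal((H, H))
Why = 0.1 * rng.standard_normal((V, H))
bh, by = np.zeros((H, 1)), np.zeros((V, 1))
loss, grads, h_next = rnn_chunk_grads([0, 1, 2], [1, 2, 3],
                                      np.zeros((H, 1)), Wxh, Whh, Why, bh, by)
```

The returned h_next is fed in as h_prev for the next chunk, so the forward state is continuous across the document even though gradients are not.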

2. Character-Level Language Modeling

The key setup: feed the model one character at a time, and ask it to predict the next character. No tokenization, no subwords, just raw bytes of text. At test time, you sample from the predicted distribution at each step and feed the result back as input.

Word-level models need a fixed vocabulary; any out-of-vocabulary word becomes an <UNK> token. Character-level models have no such problem: the alphabet is small and fixed (letters, digits, punctuation), and the model can generate any string, including URLs, code, proper nouns, or words it never saw during training.

The tradeoff: sequences are much longer (one step per character vs. per word), and the model must learn to group characters into meaningful units entirely implicitly. The fact that it succeeds is the "unreasonable" part.

The training loop is simple: encode each character as a one-hot vector of size V (vocabulary size, typically ~100 for ASCII), run it through the RNN, compute a softmax over next characters, and backprop through K steps. Karpathy used a two-layer LSTM with 512 hidden units, which is tiny by today's standards.
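The sample-and-feed-back generation loop described above can be sketched as follows. The tiny random-weight model here is a stand-in for a trained RNN; all names are illustrative assumptions:

```python
import numpy as np

def sample_text(step_fn, h0, seed_ix, n, V, rng):
    """Generate n characters: sample from the predicted distribution,
    then feed the sampled character back in as the next input."""
    h, ix, out = h0, seed_ix, []
    for _ in range(n):
        x = np.zeros((V, 1)); x[ix] = 1.0  # one-hot encode the current character
        h, p = step_fn(x, h)               # next hidden state + next-char distribution
        ix = rng.choice(V, p=p.ravel())    # sample, then feed back as input
        out.append(int(ix))
    return out

# toy model: random weights stand in for a trained network
V, H = 5, 8
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, V))
Whh = rng.standard_normal((H, H))
Why = rng.standard_normal((V, H))

def step_fn(x, h):
    h = np.tanh(Whh @ h + Wxh @ x)
    y = Why @ h
    p = np.exp(y - y.max())
    return h, p / p.sum()

chars = sample_text(step_fn, np.zeros((H, 1)), 0, 20, V, rng)  # 20 character indices
```

With a trained model, the sampled indices would be mapped back to characters through the vocabulary; here they are just indices into the toy alphabet.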

3. What the Model Learns: Activation Analysis

The most striking part of the blog post: Karpathy visualized individual neuron activations as the RNN processed text, and found cells that had learned to track specific, interpretable properties of the sequence, with no supervision other than "predict the next character."

One neuron activates near zero at the start of a line, then smoothly ramps up as more characters are written, and fires strongly when the line is getting long, nudging the model toward predicting a newline character. This emergent "line length counter" wasn't programmed; the model discovered it because newlines are predictable from position within a line.

Another neuron switches states sharply when the model encounters an opening quotation mark, stays in that state throughout the quoted string, and switches back on the closing quote. The cell essentially implements a flip-flop: a binary memory of "am I inside quotes?". This is necessary for correct generation because characters inside quotes follow different statistical patterns than outside.

When trained on Linux kernel C code, one neuron tracks indentation depth: it activates proportionally to how many tabs or spaces the model is currently inside. As the model enters a nested block, the cell activates more strongly; as blocks close, it decreases. This lets the RNN generate matching closing braces at the correct depth.

Another cell fires when the model is inside an if-statement body, between the opening { after an if condition and its matching }. This allows the model to generate syntactically plausible code where if-statements contain consistent, indented bodies. Again: learned purely from character-level prediction, with no parse tree or grammar specification.

This analysis was groundbreaking: it showed that RNNs aren't just memorizing n-gram statistics. They're building internal representations of document structure. This foreshadowed what we now call "emergent capabilities" in large language models.

4. Generated Shakespeare

After training on the complete works of Shakespeare (~1MB of text), the model generates text that looks like this:

PANDARUS:
Alas, I think he shall be come approached and the day
When little sobs, shalt of best doth been here thou this the doth,
And the grace with which the others they that be well.

VIOLA:
How the time is far with him that you stand, my lord,
And we meet and that I should speak'd of you, mine own.

Generated by a character-level LSTM trained on Shakespeare. Note: character names in ALL CAPS, consistent stage-direction formatting, and rough iambic pentameter rhythm.

The model has learned, without any explicit supervision:

  • Character names in ALL CAPS followed by a colon: the model correctly formats speaker labels like PANDARUS: and VIOLA:.
  • Consistent indentation of dialogue: lines of speech are indented relative to the speaker label.
  • Approximate iambic pentameter: lines tend to have 10 syllables with alternating stress, matching Shakespeare's verse form.
  • Plausible Elizabethan vocabulary: words like "doth", "shalt", "mine own" appear in grammatically correct positions.

5. Generated Linux Kernel Code

Trained on the Linux kernel source code (~430MB of C code), the model generates text like this:

/*
 * Increment the size file of the new incorrect UI_FILTER group information
 * and target is the total.  So increment the request server tool
 * Temporary so that the start node depends to the file
 * all  units into the block device
 */
static int num_math(void) {
#ifdef CONFIG_BLOCK
        int block_read;
        mutex_lock(&dblock->mutex_io);
        if (num_error && !block_path) {
                device_trace(block, DEVICE_TYPE_DISK);
                return -EINVAL;
        }
#endif
        return 0;
}

Generated by a character-level LSTM trained on the Linux kernel. Note: syntactically valid C, correct #ifdef/#endif pairing, proper mutex patterns, realistic error-handling conventions.

The generated code exhibits remarkably realistic structure:

  • Conditional compilation (#ifdef CONFIG_BLOCK ... #endif): the model opens and closes #ifdef blocks correctly, matching real kernel conventions.
  • Function definitions with realistic names: function names follow kernel naming conventions (snake_case, verb_noun patterns).
  • Consistent indentation and brace style: the model learned the Linux kernel's specific code style (tabs, K&R-style bracing).
  • Mutex, error code, and return value patterns: mutex_lock/unlock, -EINVAL returns, and NULL checks appear in the right contexts.

6. Temperature and Diversity

When sampling from the model, a temperature parameter T controls how peaked or spread-out the distribution is. At each step, instead of sampling from the raw softmax probabilities p(x), the model rescales the log-probabilities by T before exponentiating:

Temperature-scaled sampling
p_T(x) \propto \exp\!\left(\frac{\log p(x)}{T}\right)

T → 0: most likely character always picked (greedy)

The distribution collapses to a point mass at the argmax. Output is deterministic and repetitive; the model gets stuck in loops. You get highly consistent but boring text.

T = 1: sample from the raw model distribution

Samples exactly from what the model learned. Good balance of coherence and variety. Karpathy used T ≈ 0.5–1.0 for his examples.

T → ∞: uniform random sampling over the vocabulary

All characters become equally likely: pure noise. The model's learned structure is completely washed out. You get random character soup.

Temperature is still used in essentially the same way in modern LLMs (ChatGPT, Claude, Gemini). The math is unchanged: just divide the logits by T before the softmax. The practical insight from 2015 still holds: low T for factual/code tasks, higher T for creative writing.
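The rescaling is one line in practice. A minimal sketch (function name and toy values are assumptions):

```python
import numpy as np

def apply_temperature(p, T):
    """Rescale a probability distribution by temperature T:
    divide the log-probabilities by T, then re-normalize."""
    logits = np.log(p) / T
    q = np.exp(logits - logits.max())  # shift for numerical stability
    return q / q.sum()

p = np.array([0.7, 0.2, 0.1])          # toy next-character distribution
cold = apply_temperature(p, 0.1)       # sharpened: nearly all mass on the argmax
hot = apply_temperature(p, 10.0)       # flattened: nearly uniform
```

At T = 1 the distribution is returned unchanged, matching the two limiting behaviors described above.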

7. Why This Was Surprising in 2015

To understand why this caused such a stir, consider the prevailing assumptions in 2015:

  • Character models were considered toy models: the consensus was that useful sequence models needed word-level tokens at minimum. Character models couldn't possibly learn long-range structure.
  • Linguistic structure required explicit encoding: most NLP systems used hand-crafted features (POS tags, parse trees, named entity labels). The idea that a model could discover quotation tracking from raw text was heretical.
  • Scale skepticism: the pre-deep-learning intuition was that simple optimization (SGD on a fixed-width RNN) couldn't possibly learn hierarchical structure. You'd need specially designed architectures for each type of structure.
  • The training signal seemed too weak: predicting the next character is a very weak, local supervision signal. It seemed implausible that this could encode document-level knowledge like "I am currently inside an if-statement."

The blog post's title was deliberate: Karpathy was echoing Wigner's "unreasonable effectiveness of mathematics" and Norvig's "unreasonable effectiveness of data", the pattern that simple, general principles keep working at scales and in domains where you'd expect them to fail.

8. Path to GPT

Karpathy's 2015 blog post established a template that GPT-1 (2018), GPT-2 (2019), and GPT-3 (2020) would follow almost exactly:

Concept              | Karpathy 2015 (RNN)       | GPT (2018–2020)
Prediction objective | Next character            | Next token (BPE)
Architecture         | 2-layer LSTM, 512 hidden  | Transformer decoder, 96 layers (GPT-3)
Training data        | ~1–430 MB raw text        | ~45 TB text (GPT-3)
Supervision          | Self-supervised only      | Self-supervised + RLHF
Emergent behavior    | Quotation/indent tracking | Reasoning, math, code, translation
Sampling             | Temperature T             | Temperature T + top-p/top-k

The core intellectual leap, that predicting the next token with a large enough model on enough data will spontaneously learn everything about the world embedded in that text, was already visible in Karpathy's 2015 Shakespeare experiment. GPT-3 is, in a direct sense, the same idea at 10,000× scale.

9. Additional Resources