Language Models are Unsupervised Multitask Learners (GPT-2)

Radford et al. · OpenAI 2019 · OpenAI Blog

TL;DR

GPT-2 shows that a large language model trained purely on next-token prediction can perform many NLP tasks zero-shot, without any fine-tuning. The key insight: natural language tasks can be framed as conditional text generation. At 1.5B parameters, GPT-2 was the first model OpenAI refused to release fully due to misuse concerns.

GPT-2: Zero-Shot Multitask Learning Pipeline

  • Contrast with BERT: BERT pre-trains and then fine-tunes, so the pipeline requires labeled human annotations for every new task.
  • Key insight: tasks are text conditioning. Every NLP task can be expressed as "given this text, generate this text"; translation, for example, becomes a prompt like "translate to French: <English text>".
  • Architecture: decoder-only Transformer with a causal attention mask; each token attends only to previous tokens, giving left-to-right generation.
  • Training: 40GB of WebText (outbound Reddit links with 3+ karma), high-quality web text covering diverse topics and styles.
  • Inference: zero-shot; the task description goes in the prompt and the model generates the answer. No gradient updates, just prompt engineering at inference time.
  • Evaluation: state of the art on 7 of 8 language modeling benchmarks, zero-shot; emergent multitask abilities without any task-specific training.

1. Background: From Fine-Tuning to Zero-Shot

GPT-1 (2018) demonstrated the pretrain-then-fine-tune paradigm: train a large Transformer LM on unlabeled text, then fine-tune on labeled task data. This worked well but still required labeled data for every task.

GPT-2's central claim: if a language model is trained on a sufficiently large and diverse corpus, it implicitly learns to perform many tasks as a natural consequence of learning to predict text well. The model need not be explicitly told it is doing NLP tasks.

The core hypothesis echoes the decaNLP framing of McCann et al. (2018), "The Natural Language Decathlon: Multitask Learning as Question Answering": any NLP task can be cast as producing the appropriate output text given some context text. Translation, summarization, QA all reduce to modeling p(output | input, task_description).

2. Autoregressive Language Modeling

GPT-2 is trained on the standard language model objective: predict each token given all previous tokens. The joint probability of a sequence factorizes as:

Autoregressive factorization
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})

The training loss is the negative log-likelihood summed over all token positions:

Language model training loss
\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i \mid x_{<i})

  • x_1, \ldots, x_n: the token sequence; each x_i is a token ID from the 50,257-token BPE vocabulary.
  • P(x_i \mid x_{<i}): the probability the model assigns to token x_i given all previous tokens x_1, …, x_{i-1}, computed by a softmax over the full vocabulary.
  • \prod: the chain rule of probability; the joint probability of the whole sequence equals the product of the conditional probabilities.
  • \mathcal{L}: the training loss; minimizing it maximizes the likelihood of the training data, which is equivalent to minimizing per-token cross-entropy.

Sequence: "The sky is blue"

Tokens: [The, sky, is, blue]. Training computes four conditional probabilities:

Step 1: P("The" | <BOS>), learning common sentence starters
Step 2: P("sky" | "The"), since "sky" commonly follows "The"
Step 3: P("is" | "The sky"), since "is" naturally continues "The sky"
Step 4: P("blue" | "The sky is"), since "blue" is a likely color for sky

The model is trained on billions of such sequences. After training, it has implicitly learned grammar, facts, reasoning patterns, and task formats, all from predicting the next token.
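The factorization and loss above can be made concrete with a toy next-token model. The probabilities below are invented purely for illustration (in GPT-2 they would come from a softmax over the 50,257-token vocabulary):

```python
import math

# Toy conditional distribution for the example sequence "The sky is blue".
# Each entry maps a context to the probability of the observed next token.
probs = {
    ("<BOS>",): {"The": 0.20},
    ("<BOS>", "The"): {"sky": 0.05},
    ("<BOS>", "The", "sky"): {"is": 0.60},
    ("<BOS>", "The", "sky", "is"): {"blue": 0.30},
}

def sequence_nll(tokens):
    """Training loss for one sequence: -sum_i log P(x_i | x_<i)."""
    context = ("<BOS>",)
    nll = 0.0
    for tok in tokens:
        nll -= math.log(probs[context][tok])
        context = context + (tok,)
    return nll

# The chain rule makes the sequence NLL a sum of per-token NLLs,
# i.e. -log(0.20 * 0.05 * 0.60 * 0.30).
print(round(sequence_nll(["The", "sky", "is", "blue"]), 2))
```

Minimizing this quantity over a corpus is exactly the objective in the equation above; everything else about GPT-2 is architecture and data.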

3. Zero-Shot Task Transfer

Since GPT-2 is trained to predict text, and task instructions are text, the model can perform tasks simply by conditioning on an appropriate prompt; no gradient updates are required.

Translation (English → French)
Prompt: example pairs in the form "english sentence = french sentence", ending with "cheese ="
GPT-2 generates: "fromage"
Question Answering
Prompt: "Q: Who was president of the United States in 1955? A:"
GPT-2 generates: "Dwight D. Eisenhower"
Summarization
Prompt: "[article text] TL;DR:"
GPT-2 generates: [a plausible summary of the article]

The key is that WebText contains many documents with these formats (news articles, Q&A pages, translations), so the model has seen these patterns and can continue them.
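Mechanically, zero-shot inference is nothing more than conditional generation: the task lives in the prompt, and decoding proceeds token by token. A minimal greedy-decoding sketch, where the lookup table is a hypothetical stand-in for GPT-2's learned next-token distribution:

```python
# Hypothetical stand-in for GPT-2's learned distribution: maps the end of
# the current text to the most likely next token (purely illustrative).
def next_token(text: str) -> str:
    table = {
        "A:": " Dwight",
        "A: Dwight": " D.",
        "A: Dwight D.": " Eisenhower",
    }
    for suffix, tok in table.items():
        if text.endswith(suffix):
            return tok
    return ""  # stand-in for end-of-text

def generate(prompt: str, max_new_tokens: int = 8) -> str:
    # Greedy decoding loop: repeatedly append the model's top token.
    text = prompt
    for _ in range(max_new_tokens):
        tok = next_token(text)
        if not tok:
            break
        text += tok
    return text

print(generate("Q: Who was president of the United States in 1955? A:"))
```

Swapping the prompt swaps the task; the decoding loop never changes, which is the whole point of framing tasks as text conditioning.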

4. Architecture: GPT-2 vs GPT-1

GPT-2 uses the same Transformer decoder-only architecture as GPT-1, but with several modifications:

  • Pre-norm: layer normalization is moved to the input of each sub-block (pre-norm) rather than after it (post-norm), which stabilizes training at large scale.
  • Extra layer norm: an additional layer normalization is added after the final self-attention block.
  • Modified weight initialization: the weights of residual layers are scaled by \frac{1}{\sqrt{N}} at initialization, where N is the number of residual layers in the model, preventing activations on the residual path from growing with depth.
  • Larger vocabulary: 50,257 BPE tokens (vs 40,478 in GPT-1), handling out-of-vocabulary words more gracefully.
  • Longer context: a 1,024-token context window (vs 512 in GPT-1), allowing reasoning over longer passages.
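The pre-norm wiring and the 1/√N residual scaling can be sketched in a few lines. This is a toy NumPy sketch, not OpenAI's code: the single linear map is a hypothetical stand-in for an attention or MLP sub-block, and the learned layer-norm gain/bias are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 24  # toy hidden size; N = number of residual layers in the model

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean / unit variance (no learned scale/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Stand-in for a sub-block: one linear map whose weights are scaled by
# 1/sqrt(N) at initialization, mirroring GPT-2's modified init.
W = rng.standard_normal((d, d)) * (1.0 / np.sqrt(N))

def pre_norm_block(x):
    # Pre-norm: normalize the sub-block *input*, then add back the residual.
    return x + layer_norm(x) @ W

x = rng.standard_normal((4, d))
out = pre_norm_block(x)
print(out.shape)
```

Because each sub-block's contribution is shrunk by 1/√N, the variance of the residual stream stays roughly constant as N such blocks are stacked.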

Four model sizes were trained, scaling from 117M to 1.5B parameters:

| Model | Layers | Hidden Dim | Attn Heads | Parameters |
| --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 768 | 12 | 117M |
| GPT-2 Medium | 24 | 1,024 | 16 | 345M |
| GPT-2 Large | 36 | 1,280 | 20 | 762M |
| GPT-2 XL | 48 | 1,600 | 25 | 1.5B |
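The parameter counts in the table can be sanity-checked with a back-of-envelope formula: each Transformer block holds roughly 12·d² weights (4·d² in the attention projections, 8·d² in the 4x-expansion MLP), plus token and position embeddings. A rough sketch, ignoring biases and layer-norm parameters:

```python
def approx_params(layers: int, d: int, vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Rough GPT-2 parameter count: embeddings + ~12*d^2 per block.
    Attention q/k/v/out projections: 4*d^2; MLP with 4x expansion: 8*d^2.
    Biases and layer-norm gains are ignored."""
    embeddings = vocab * d + n_ctx * d
    blocks = layers * 12 * d * d
    return embeddings + blocks

for name, layers, d in [("Small", 12, 768), ("Medium", 24, 1024),
                        ("Large", 36, 1280), ("XL", 48, 1600)]:
    print(name, f"{approx_params(layers, d) / 1e6:.0f}M")
```

The estimates land within roughly 10% of the table's reported figures (about 124M for the small model, for instance), which is as close as this approximation gets.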

5. Training Data: WebText

GPT-2 is trained on WebText: a new 40GB dataset of text from outbound links posted on Reddit. Specifically, the authors scraped every link whose Reddit post received at least 3 karma, a simple heuristic for human-judged quality and interest.

  • 45M links scraped, filtered to 8M documents after de-duplication and quality filtering
  • Wikipedia was explicitly excluded: it appears in many test sets, and including it would inflate performance
  • Diverse domains (news, fiction, Q&A, technical writing, forums), which is crucial for zero-shot generalization

6. Results

| Benchmark | Previous SOTA | GPT-2 (zero-shot) |
| --- | --- | --- |
| PTB (perplexity, lower is better) | 46.54 | 35.76 |
| WikiText-2 (perplexity, lower is better) | 39.14 | 18.34 |
| CBT-CN (accuracy) | 85.7% | 93.3% |
| CBT-NE (accuracy) | 82.3% | 89.1% |

The remarkable finding: GPT-2 achieves state-of-the-art results on 7 of 8 language modeling benchmarks zero-shot, with no fine-tuning. This suggests that the model has implicitly learned the structure of language and tasks from raw text alone. The authors note GPT-2 is "still underfitting" on WebText, suggesting further scaling would help.

7. Limitations

  • Inconsistent zero-shot: Zero-shot performance is strong on some tasks (language modeling, cloze) but poor on others (reading comprehension, summarization). Performance is highly sensitive to prompt wording.
  • No alignment: No RLHF or instruction tuning, so outputs can be harmful, biased, or repetitive. This was a key motivation for InstructGPT and later alignment research.
  • Decoder-only limitations: No encoder, so it is suboptimal for classification and understanding tasks where BERT excels; it cannot attend to future context.
  • Limited context: 1,024 tokens is insufficient for long documents. GPT-3 extended this to 2,048; modern models use 128K+.
  • Release withheld: OpenAI initially released only the 117M model due to misuse concerns about synthetic disinformation. The full 1.5B model was released 9 months later (Nov 2019) after limited misuse was observed.

8. Connections to Other Work

Attention Is All You Need

GPT-2 uses the Transformer decoder architecture defined in this paper, with causal (masked) self-attention to enforce left-to-right generation order.

PPO

InstructGPT (2022) fine-tunes GPT-3 (the successor to GPT-2) using PPO with human feedback, transforming the raw language model into a helpful assistant. GPT-2's decoder-only architecture is the backbone.

DPO

DPO is an alternative to PPO for aligning GPT-style autoregressive models. The same decoder-only architecture that GPT-2 pioneered is what DPO fine-tunes.

GRPO

DeepSeek-R1 uses GRPO on a GPT-style model (decoder-only, autoregressive) to achieve strong reasoning. The zero-shot scaling hypothesis of GPT-2 is vindicated: scale the model, scale the data, add alignment.

9. Additional Resources