Language Models are Unsupervised Multitask Learners (GPT-2)

Radford et al. · OpenAI 2019 · OpenAI Blog

TL;DR

GPT-2 shows that a large language model trained purely on next-token prediction can perform many NLP tasks zero-shot, without any fine-tuning. The key insight: natural language tasks can be framed as conditional text generation. At 1.5B parameters, GPT-2 was the first model OpenAI refused to release fully due to misuse concerns.

GPT-2: Zero-Shot Multitask Learning Pipeline

  • Contrast with BERT: BERT pre-trains and then fine-tunes, so the pipeline requires labeled human annotations for every new task.
  • Key insight: tasks are text conditioning. Every NLP task can be expressed as "given this text, generate this text"; translation, for example, becomes a prompt like "translate to French: <English text>".
  • Architecture: decoder-only Transformer with a causal attention mask; each token attends only to previous tokens, giving left-to-right generation.
  • Training: 40GB of WebText (outbound Reddit links with 3+ karma), high-quality web text covering diverse topics and styles.
  • Inference: zero-shot; the task description goes in the prompt and the model generates the answer. No gradient updates, just prompt engineering at inference time.
  • Evaluation: state of the art on 7 of 8 language modeling benchmarks, zero-shot; emergent multitask abilities without any task-specific training.

1. Background: From Fine-Tuning to Zero-Shot

GPT-1 (2018) demonstrated the pretrain-then-fine-tune paradigm: train a large Transformer LM on unlabeled text, then fine-tune on labeled task data. This worked well but still required labeled data for every task.

GPT-2's central claim: if a language model is trained on a sufficiently large and diverse corpus, it implicitly learns to perform many tasks as a natural consequence of learning to predict text well. The model need not be explicitly told it is doing NLP tasks.

The core hypothesis echoes the decaNLP framing of McCann et al. (2018), "The Natural Language Decathlon: Multitask Learning as Question Answering": any NLP task can be cast as producing the appropriate output text given some context text. Translation, summarization, QA all reduce to modeling p(output | input, task_description).

2. Autoregressive Language Modeling

GPT-2 is trained on the standard language model objective: predict each token given all previous tokens. The joint probability of a sequence factorizes as:

Autoregressive factorization
P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid x_1, \ldots, x_{i-1})

The training loss is the negative log-likelihood summed over all token positions:

Language model training loss
\mathcal{L} = -\sum_{i=1}^{n} \log P(x_i \mid x_{<i})

  • x_1, \ldots, x_n: the token sequence; each x_i is a token ID from the 50,257-token BPE vocabulary.
  • P(x_i \mid x_{<i}): the probability the model assigns to token x_i given all previous tokens x_1, …, x_{i-1}, computed by a softmax over the full vocabulary.
  • \prod: the chain rule of probability; the joint probability of the whole sequence equals the product of the conditional probabilities.
  • \mathcal{L}: the training loss; minimizing it maximizes the likelihood of the training data, which is equivalent to minimizing per-token cross-entropy.

Sequence: "The sky is blue"

Tokens: [The, sky, is, blue]. Training computes four conditional probabilities:

Step 1: P("The" | <BOS>), learning common sentence starters
Step 2: P("sky" | "The"), since "sky" commonly follows "The"
Step 3: P("is" | "The sky"), since "is" naturally continues "The sky"
Step 4: P("blue" | "The sky is"), since "blue" is a likely color for sky

The model is trained on billions of such sequences. After training, it has implicitly learned grammar, facts, reasoning patterns, and task formats, all from predicting the next token.
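The factorization and loss above can be made concrete with a toy next-token model. The probabilities below are invented purely for illustration (in GPT-2 they would come from a softmax over the 50,257-token vocabulary):

```python
import math

# Toy conditional distribution for the example sequence "The sky is blue".
# Each entry maps a context to the probability of the observed next token.
probs = {
    ("<BOS>",): {"The": 0.20},
    ("<BOS>", "The"): {"sky": 0.05},
    ("<BOS>", "The", "sky"): {"is": 0.60},
    ("<BOS>", "The", "sky", "is"): {"blue": 0.30},
}

def sequence_nll(tokens):
    """Training loss for one sequence: -sum_i log P(x_i | x_<i)."""
    context = ("<BOS>",)
    nll = 0.0
    for tok in tokens:
        nll -= math.log(probs[context][tok])
        context = context + (tok,)
    return nll

# The chain rule makes the sequence NLL a sum of per-token NLLs,
# i.e. -log(0.20 * 0.05 * 0.60 * 0.30).
print(round(sequence_nll(["The", "sky", "is", "blue"]), 2))
```

Minimizing this quantity over a corpus is exactly the objective in the equation above; everything else about GPT-2 is architecture and data.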

3. Zero-Shot Task Transfer

Since GPT-2 is trained to predict text, and task instructions are text, the model can perform tasks simply by conditioning on an appropriate prompt; no gradient updates are required.

Translation (English → French)
Prompt: example pairs in the form "english sentence = french sentence", ending with "cheese ="
GPT-2 generates: "fromage"
Question Answering
Prompt: "Q: Who was president of the United States in 1955? A:"
GPT-2 generates: "Dwight D. Eisenhower"
Summarization
Prompt: "[article text] TL;DR:"
GPT-2 generates: [a plausible summary of the article]

The key is that WebText contains many documents with these formats (news articles, Q&A pages, translations), so the model has seen these patterns and can continue them.
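Mechanically, zero-shot inference is nothing more than conditional generation: the task lives in the prompt, and decoding proceeds token by token. A minimal greedy-decoding sketch, where the lookup table is a hypothetical stand-in for GPT-2's learned next-token distribution:

```python
# Hypothetical stand-in for GPT-2's learned distribution: maps the end of
# the current text to the most likely next token (purely illustrative).
def next_token(text: str) -> str:
    table = {
        "A:": " Dwight",
        "A: Dwight": " D.",
        "A: Dwight D.": " Eisenhower",
    }
    for suffix, tok in table.items():
        if text.endswith(suffix):
            return tok
    return ""  # stand-in for end-of-text

def generate(prompt: str, max_new_tokens: int = 8) -> str:
    # Greedy decoding loop: repeatedly append the model's top token.
    text = prompt
    for _ in range(max_new_tokens):
        tok = next_token(text)
        if not tok:
            break
        text += tok
    return text

print(generate("Q: Who was president of the United States in 1955? A:"))
```

Swapping the prompt swaps the task; the decoding loop never changes, which is the whole point of framing tasks as text conditioning.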

4. Architecture: GPT-2 vs GPT-1

GPT-2 uses the same Transformer decoder-only architecture as GPT-1, but with several modifications:

  • Pre-norm: layer normalization is moved to the input of each sub-block (pre-norm) rather than after it (post-norm), which stabilizes training at large scale.
  • Extra layer norm: an additional layer normalization is added after the final self-attention block.
  • Modified weight initialization: the weights of residual layers are scaled by \frac{1}{\sqrt{N}} at initialization, where N is the number of residual layers in the model, preventing activations on the residual path from growing with depth.
  • Larger vocabulary: 50,257 BPE tokens (vs 40,478 in GPT-1), handling out-of-vocabulary words more gracefully.
  • Longer context: a 1,024-token context window (vs 512 in GPT-1), allowing reasoning over longer passages.
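The pre-norm wiring and the 1/√N residual scaling can be sketched in a few lines. This is a toy NumPy sketch, not OpenAI's code: the single linear map is a hypothetical stand-in for an attention or MLP sub-block, and the learned layer-norm gain/bias are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 8, 24  # toy hidden size; N = number of residual layers in the model

def layer_norm(x, eps=1e-5):
    # Normalize each vector to zero mean / unit variance (no learned scale/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Stand-in for a sub-block: one linear map whose weights are scaled by
# 1/sqrt(N) at initialization, mirroring GPT-2's modified init.
W = rng.standard_normal((d, d)) * (1.0 / np.sqrt(N))

def pre_norm_block(x):
    # Pre-norm: normalize the sub-block *input*, then add back the residual.
    return x + layer_norm(x) @ W

x = rng.standard_normal((4, d))
out = pre_norm_block(x)
print(out.shape)
```

Because each sub-block's contribution is shrunk by 1/√N, the variance of the residual stream stays roughly constant as N such blocks are stacked.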

Four model sizes were trained, scaling from 117M to 1.5B parameters:

| Model | Layers | Hidden Dim | Attn Heads | Parameters |
| --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 768 | 12 | 117M |
| GPT-2 Medium | 24 | 1,024 | 16 | 345M |
| GPT-2 Large | 36 | 1,280 | 20 | 762M |
| GPT-2 XL | 48 | 1,600 | 25 | 1.5B |
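The parameter counts in the table can be sanity-checked with a back-of-envelope formula: each Transformer block holds roughly 12·d² weights (4·d² in the attention projections, 8·d² in the 4x-expansion MLP), plus token and position embeddings. A rough sketch, ignoring biases and layer-norm parameters:

```python
def approx_params(layers: int, d: int, vocab: int = 50257, n_ctx: int = 1024) -> int:
    """Rough GPT-2 parameter count: embeddings + ~12*d^2 per block.
    Attention q/k/v/out projections: 4*d^2; MLP with 4x expansion: 8*d^2.
    Biases and layer-norm gains are ignored."""
    embeddings = vocab * d + n_ctx * d
    blocks = layers * 12 * d * d
    return embeddings + blocks

for name, layers, d in [("Small", 12, 768), ("Medium", 24, 1024),
                        ("Large", 36, 1280), ("XL", 48, 1600)]:
    print(name, f"{approx_params(layers, d) / 1e6:.0f}M")
```

The estimates land within roughly 10% of the table's reported figures (about 124M for the small model, for instance), which is as close as this approximation gets.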

5. Training Data: WebText

GPT-2 is trained on WebText: a new 40GB dataset of text from outbound links posted on Reddit. Specifically, the authors scraped every link whose Reddit post received at least 3 karma, a simple heuristic for human-judged quality and interest.

  • 45M links scraped, filtered to 8M documents after de-duplication and quality filtering
  • Wikipedia was explicitly excluded: it appears in many test sets, and including it would inflate performance
  • Diverse domains (news, fiction, Q&A, technical writing, forums), which is crucial for zero-shot generalization

6. Results

| Benchmark | Previous SOTA | GPT-2 (zero-shot) |
| --- | --- | --- |
| PTB (perplexity, lower is better) | 46.54 | 35.76 |
| WikiText-2 (perplexity, lower is better) | 39.14 | 18.34 |
| CBT-CN (accuracy) | 85.7% | 93.3% |
| CBT-NE (accuracy) | 82.3% | 89.1% |

The remarkable finding: GPT-2 achieves state-of-the-art results on 7 of 8 language modeling benchmarks zero-shot, with no fine-tuning. This suggests that the model has implicitly learned the structure of language and tasks from raw text alone. The authors note GPT-2 is "still underfitting" on WebText, suggesting further scaling would help.

7. Limitations

  • Inconsistent zero-shot: Zero-shot performance is strong on some tasks (language modeling, cloze) but poor on others (reading comprehension, summarization). Performance is highly sensitive to prompt wording.
  • No alignment: No RLHF or instruction tuning, so outputs can be harmful, biased, or repetitive. This was a key motivation for InstructGPT and later alignment research.
  • Decoder-only limitations: No encoder, so it is suboptimal for classification and understanding tasks where BERT excels; it cannot attend to future context.
  • Limited context: 1,024 tokens is insufficient for long documents. GPT-3 extended this to 2,048; modern models use 128K+.
  • Release withheld: OpenAI initially released only the 117M model due to misuse concerns about synthetic disinformation. The full 1.5B model was released 9 months later (Nov 2019) after limited misuse was observed.

8. Connections to Other Work

Attention Is All You Need

GPT-2 uses the Transformer decoder architecture defined in this paper, with causal (masked) self-attention to enforce left-to-right generation order.

PPO

InstructGPT (2022) fine-tunes GPT-3 (the successor to GPT-2) using PPO with human feedback, transforming the raw language model into a helpful assistant. GPT-2's decoder-only architecture is the backbone.

DPO

DPO is an alternative to PPO for aligning GPT-style autoregressive models. The same decoder-only architecture that GPT-2 pioneered is what DPO fine-tunes.

GRPO

DeepSeek-R1 uses GRPO on a GPT-style model (decoder-only, autoregressive) to achieve strong reasoning. The zero-shot scaling hypothesis of GPT-2 is vindicated: scale the model, scale the data, add alignment.

9. Additional Resources