BERT: Pre-training of Deep Bidirectional Transformers

Devlin et al. · NAACL 2019 · arXiv:1810.04805

TL;DR

BERT pre-trains a deep bidirectional Transformer by jointly conditioning on left and right context at every layer. It uses two self-supervised tasks: Masked Language Modeling (predict masked tokens) and Next Sentence Prediction (classify sentence pairs). Fine-tuning BERT on 11 NLP tasks set a new state of the art across the board in 2018.

◆ BERT Architecture and Training Pipeline

  • Problem: GPT is left-to-right only and ELMo is only shallowly bidirectional; prior models cannot see the full context simultaneously.
  • Key insight: mask tokens and predict them. This forces the model to use both left and right context to fill in the masked positions.
  • Pre-training Task 1 (MLM): mask 15% of tokens and predict the originals, using the 80/10/10 masking strategy with a special [MASK] token.
  • Pre-training Task 2 (NSP): predict whether sentence B follows sentence A; binary classification on sentence pairs marked with [CLS] and [SEP] tokens.
  • Architecture: 12 Transformer encoder layers of stacked self-attention + feed-forward blocks; 110M parameters (BERT-Base).
  • Fine-tuning: add one output layer (a task-specific head on top of [CLS] or the token representations) and fine-tune ALL parameters.
  • Results: GLUE +7.6%, SQuAD F1 +1.5%, state of the art on 11/11 tasks in 2018.
  • Takeaways: true bidirectionality, no task-specific architecture, pretrain once and fine-tune many times.

1. Background: The Bidirectionality Problem

Before BERT, the two dominant approaches to contextual language representations had a fundamental limitation:

  • GPT: Autoregressive left-to-right Transformer. At position i, it can only attend to positions 1…i-1: rich context, but only half of it.
  • ELMo: Runs a forward LSTM and a backward LSTM separately, then concatenates. Bidirectional in principle, but shallow: the two directions never interact within a layer.

BERT's solution: jointly condition on ALL tokens at every layer using a Masked Language Model (MLM) objective, enabling every token to attend to every other token in both directions simultaneously.
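The difference can be illustrated with attention masks: a causal mask (GPT-style) blocks attention to future positions, while the bidirectional setting leaves every position visible. A minimal NumPy sketch with toy dimensions and random vectors (not the actual model):

```python
import numpy as np

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights with an additive mask
    (0 = visible, -inf = blocked)."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 4, 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))

causal = np.triu(np.full((n, n), -np.inf), k=1)   # GPT-style: no attention to the future
bidirectional = np.zeros((n, n))                  # BERT-style: everything visible

p_causal = attention_weights(q, k, causal)
p_bidir = attention_weights(q, k, bidirectional)

print(p_causal[0])  # first token can only attend to itself
print(p_bidir[0])   # first token attends to all four positions
```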

2. Core Method 1: Masked Language Modeling (MLM)

Randomly select 15% of tokens in each sequence. For each selected token, apply the following strategy:

| Case | Probability | Action | Why |
|---|---|---|---|
| [MASK] | 80% | Replace with the [MASK] token | Main learning signal |
| random | 10% | Replace with a random token | Forces the model to maintain robust representations of every token |
| unchanged | 10% | Keep the original token | Reduces the train/test mismatch caused by [MASK] |

The training loss is standard cross-entropy, computed only at the masked positions:

MLM loss (sum over the masked positions $\mathcal{T}_{\text{mask}}$):

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{T}_{\text{mask}}} \log P(x_i \mid \tilde{x})$$

  • $\mathcal{T}_{\text{mask}}$: the set of token positions selected for masking (15% of all positions)
  • $x_i$: the original token at position $i$, the label the model must predict
  • $\tilde{x}$: the corrupted input sequence (with [MASK] tokens, random replacements, or unchanged tokens applied)
  • $P(x_i \mid \tilde{x})$: the probability the model assigns to the correct original token at position $i$, given the full corrupted context
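This loss can be sketched on toy data with NumPy; the logits below stand in for BERT's output head, and the vocabulary size, labels, and masked positions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 12, 8

# Toy per-position logits over the vocabulary (stand-in for BERT's output head)
logits = rng.normal(size=(seq_len, vocab_size))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax

labels = rng.integers(0, vocab_size, size=seq_len)  # the original tokens x_i
masked_positions = [2, 6]                           # T_mask: the corrupted positions

# L_MLM = -sum over masked positions of log P(x_i | x_tilde);
# unmasked positions contribute nothing to the loss
loss = -sum(log_probs[i, labels[i]] for i in masked_positions)
print(round(float(loss), 3))
```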

Original sentence: "The cat sat on the mat"

Tokens selected for masking (15%): position 2 ("cat") and position 6 ("mat")

  • "cat" (pos 2) → 80% rule → model input at pos 2: [MASK]; target: "cat"
  • "mat" (pos 6) → 10% rule → model input at pos 6: random token "dog"; target: "mat"

Final model input: "The [MASK] sat on the dog"
Predictions required: position 2 → "cat", position 6 → "mat" (loss only at these two positions)

To predict "cat", the model uses the context "The __ sat on the dog": it must attend to both the left context ("The") and the right context ("sat on the dog") simultaneously.
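The corruption procedure above can be sketched in Python; the function name `corrupt` and the toy vocabulary are hypothetical, not from the paper:

```python
import random

MASK = "[MASK]"

def corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Apply BERT's MLM corruption: select ~15% of positions, then 80/10/10."""
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = {}  # position -> original token; the loss is computed only here
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:                  # 80%: replace with [MASK]
                inputs[i] = MASK
            elif r < 0.9:                # 10%: replace with a random vocab token
                inputs[i] = rng.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return inputs, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
inputs, targets = corrupt("the cat sat on the mat".split(), vocab)
print(inputs)
print(targets)  # only these positions are predicted
```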

3. Core Method 2: Next Sentence Prediction (NSP)

Many downstream tasks (QA, natural language inference) require understanding relationships between sentence pairs. NSP pre-trains this capability directly.

Given a sentence pair (A, B), 50% of the time B is the actual next sentence (IsNext), 50% of the time B is a random sentence from the corpus (NotNext). The model sees:

[CLS] sentence A tokens [SEP] sentence B tokens [SEP]

The final hidden state of the [CLS] token (a special classification token prepended to every input) is fed into a linear layer for binary classification: IsNext or NotNext.
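A sketch of how such training pairs might be assembled; the helper `make_nsp_example` and the toy corpus are illustrative, not the paper's actual data pipeline:

```python
import random

CLS, SEP = "[CLS]", "[SEP]"

def make_nsp_example(doc_sentences, corpus, rng):
    """Build one NSP pair from a document given as a list of tokenized sentences."""
    i = rng.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if rng.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"    # the actual next sentence
    else:
        sent_b, label = rng.choice(corpus), "NotNext"     # a random corpus sentence
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    # Segment ids: 0 for [CLS], sentence A, and the first [SEP]; 1 for the rest
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segments, label

doc = [["the", "cat", "sat"], ["on", "the", "mat"], ["it", "purred"]]
corpus = [["stocks", "fell", "today"], ["rain", "is", "likely"]]
tokens, segments, label = make_nsp_example(doc, corpus, random.Random(0))
print(tokens)
print(segments, label)
```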

The input representation sums three embeddings per token:

  • Token embeddings: standard word embeddings from a 30,000-token WordPiece vocabulary
  • Segment embeddings: $E_A$ for all tokens in sentence A, $E_B$ for all tokens in sentence B; this tells the model which sentence each token belongs to
  • Position embeddings: learned embeddings for positions 0…511, encoding each token's position within the sequence

$$\text{Input}_i = \text{TokenEmb}(x_i) + \text{SegmentEmb}(s_i) + \text{PositionEmb}(i)$$

All three embeddings are simply summed element-wise before being fed into the first Transformer layer.
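A minimal NumPy illustration of this sum; the lookup tables are randomly initialized stand-ins for the learned embeddings, and the token ids are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, hidden = 30000, 512, 2, 768

# Randomly initialized stand-ins for the three learned lookup tables
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(n_segments, hidden))
position_emb = rng.normal(size=(max_len, hidden))

token_ids = np.array([101, 2054, 2003, 102])  # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 0])          # every token belongs to sentence A
positions = np.arange(len(token_ids))

# Element-wise sum of the three embeddings, one row per input token
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 768)
```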

4. Architecture Details

BERT uses a standard Transformer encoder (no decoder). Two model sizes were released:

| Model | Layers (L) | Hidden (H) | Attention Heads (A) | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |

Pre-training corpus: BooksCorpus (800M words) + English Wikipedia (2,500M words). Pre-training took 4 days on 16 TPU chips for BERT-Base and 4 days on 64 TPU chips for BERT-Large.
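As a sanity check, the 110M figure for BERT-Base can be roughly reproduced from the hyperparameters above. This is a back-of-the-envelope count; the exact bookkeeping of biases and LayerNorms may differ slightly from the released checkpoint:

```python
# Back-of-the-envelope parameter count for BERT-Base:
# V = WordPiece vocab, H = hidden size, L = layers, FF = feed-forward size,
# P = max positions, S = segment types.
V, H, L, FF, P, S = 30522, 768, 12, 3072, 512, 2

embeddings = V * H + P * H + S * H + 2 * H  # token + position + segment + LayerNorm
per_layer = (
    4 * (H * H + H)      # Q, K, V, and output projections (weights + biases)
    + (H * FF + FF)      # feed-forward up-projection
    + (FF * H + H)       # feed-forward down-projection
    + 2 * (2 * H)        # the two LayerNorms in each block
)
pooler = H * H + H       # the [CLS] pooler used for classification

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # lands close to the reported 110M
```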

5. Results

| Task | Previous SOTA | BERT-Base | BERT-Large |
|---|---|---|---|
| GLUE | 72.8 | 78.3 | 80.4 |
| SQuAD v1.1 EM | 84.1 | 84.1 | 85.1 |
| SQuAD v1.1 F1 | 90.9 | 90.9 | 91.8 |
| MNLI | 86.7 | 84.6 | 86.7 |

Impact: BERT achieved state-of-the-art results on all 11 NLP tasks it was evaluated on. The GLUE score improved from 72.8 to 80.4 (+7.6 points). Fine-tuning BERT-Base on most tasks takes at most an hour on a single Cloud TPU, or a few hours on a GPU.
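The "add one output layer" recipe can be sketched as a toy linear head over the [CLS] vector; shapes and initialization here are illustrative, not the released model:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_classes, seq_len = 768, 3, 128

# Stand-in for BERT's final hidden states on one fine-tuning example
hidden_states = rng.normal(size=(seq_len, hidden))

# The single added output layer: a linear classifier over the [CLS] vector
W = rng.normal(size=(hidden, n_classes)) * 0.02
b = np.zeros(n_classes)

cls_vector = hidden_states[0]   # [CLS] is always the first input token
logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (3,)
```

During fine-tuning, both the head and all of BERT's pre-trained parameters receive gradients.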

6. Limitations

  • Training inefficiency: MLM provides a learning signal on only 15% of tokens per forward pass, far fewer than standard LM objectives, which predict every token.
  • NSP task limitations: NSP was later shown to be less helpful than originally claimed; RoBERTa removes it and achieves better performance.
  • Encoder-only: BERT's architecture is not directly suited to text generation tasks, since it has no autoregressive decoder.
  • [MASK] token mismatch: [MASK] tokens appear during pre-training but never during fine-tuning, creating a distribution shift. The 10% unchanged + 10% random strategy partially mitigates this.

7. Connections to Other Work

Attention Is All You Need

BERT uses the Transformer encoder from this paper. Every self-attention layer in BERT is the scaled dot-product attention defined here.

DPO

DPO builds on RLHF, which fine-tunes pretrained language models. The "pretrain then align" paradigm that DPO operates in grew out of the "pretrain then fine-tune" framework that BERT popularized.

GPT-2 (coming)

The decoder-only alternative to BERT. Same Transformer backbone, but with causal (left-to-right) attention instead of bidirectional attention. It excels at generation, where BERT excels at understanding.

8. Additional Resources