BERT: Pre-training of Deep Bidirectional Transformers

Devlin et al. · NAACL 2019 · arXiv:1810.04805

TL;DR

BERT pre-trains a deep bidirectional Transformer by jointly conditioning on left and right context at every layer. It uses two self-supervised tasks: Masked Language Modeling (predict masked tokens) and Next Sentence Prediction (classify sentence pairs). Fine-tuning BERT on 11 NLP tasks set a new state of the art across the board in 2018.

◆ BERT Architecture and Training Pipeline

  • Problem: GPT is left-to-right only and ELMo is only shallowly bidirectional; prior models cannot see the full context simultaneously.
  • Key insight: mask tokens and predict them. This forces the model to use both left and right context to fill in the masked positions.
  • Pre-training Task 1 (MLM): mask 15% of tokens and predict the originals, using the 80/10/10 masking strategy with a special [MASK] token.
  • Pre-training Task 2 (NSP): predict whether sentence B follows sentence A; binary classification on sentence pairs marked with [CLS] and [SEP] tokens.
  • Architecture: 12 Transformer encoder layers of stacked self-attention + feed-forward blocks; 110M parameters (BERT-Base).
  • Fine-tuning: add one output layer (a task-specific head on top of [CLS] or the token representations) and fine-tune ALL parameters.
  • Results: GLUE +7.6%, SQuAD F1 +1.5%, state of the art on 11/11 tasks in 2018.
  • Takeaways: true bidirectionality, no task-specific architecture, pretrain once and fine-tune many times.

1. Background: The Bidirectionality Problem

Before BERT, the two dominant approaches to contextual language representations had a fundamental limitation:

  • GPT: Autoregressive left-to-right Transformer. At position i, it can only attend to positions 1…i-1: rich context, but only half of it.
  • ELMo: Runs a forward LSTM and a backward LSTM separately, then concatenates. Bidirectional in principle, but shallow: the two directions never interact within a layer.

BERT's solution: jointly condition on ALL tokens at every layer using a Masked Language Model (MLM) objective, enabling every token to attend to every other token in both directions simultaneously.
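The difference can be illustrated with attention masks: a causal mask (GPT-style) blocks attention to future positions, while the bidirectional setting leaves every position visible. A minimal NumPy sketch with toy dimensions and random vectors (not the actual model):

```python
import numpy as np

def attention_weights(q, k, mask):
    """Scaled dot-product attention weights with an additive mask
    (0 = visible, -inf = blocked)."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 4, 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=(n, d)), rng.normal(size=(n, d))

causal = np.triu(np.full((n, n), -np.inf), k=1)   # GPT-style: no attention to the future
bidirectional = np.zeros((n, n))                  # BERT-style: everything visible

p_causal = attention_weights(q, k, causal)
p_bidir = attention_weights(q, k, bidirectional)

print(p_causal[0])  # first token can only attend to itself
print(p_bidir[0])   # first token attends to all four positions
```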

2. Core Method 1: Masked Language Modeling (MLM)

Randomly select 15% of tokens in each sequence. For each selected token, apply the following strategy:

| Case | Probability | Action | Why |
|---|---|---|---|
| [MASK] | 80% | Replace with the [MASK] token | Main learning signal |
| random | 10% | Replace with a random token | Forces the model to maintain robust representations of every token |
| unchanged | 10% | Keep the original token | Reduces the train/test mismatch caused by [MASK] |

The training loss is standard cross-entropy, computed only at the masked positions:

MLM loss (sum over the masked positions $\mathcal{T}_{\text{mask}}$):

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{T}_{\text{mask}}} \log P(x_i \mid \tilde{x})$$

  • $\mathcal{T}_{\text{mask}}$: the set of token positions selected for masking (15% of all positions)
  • $x_i$: the original token at position $i$, the label the model must predict
  • $\tilde{x}$: the corrupted input sequence (with [MASK] tokens, random replacements, or unchanged tokens applied)
  • $P(x_i \mid \tilde{x})$: the probability the model assigns to the correct original token at position $i$, given the full corrupted context
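This loss can be sketched on toy data with NumPy; the logits below stand in for BERT's output head, and the vocabulary size, labels, and masked positions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 12, 8

# Toy per-position logits over the vocabulary (stand-in for BERT's output head)
logits = rng.normal(size=(seq_len, vocab_size))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))  # log-softmax

labels = rng.integers(0, vocab_size, size=seq_len)  # the original tokens x_i
masked_positions = [2, 6]                           # T_mask: the corrupted positions

# L_MLM = -sum over masked positions of log P(x_i | x_tilde);
# unmasked positions contribute nothing to the loss
loss = -sum(log_probs[i, labels[i]] for i in masked_positions)
print(round(float(loss), 3))
```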

Original sentence: "The cat sat on the mat"

Tokens selected for masking (15%): position 2 ("cat") and position 6 ("mat")

  • "cat" (pos 2) → 80% rule → model input at pos 2: [MASK]; target: "cat"
  • "mat" (pos 6) → 10% rule → model input at pos 6: random token "dog"; target: "mat"

Final model input: "The [MASK] sat on the dog"
Predictions required: position 2 → "cat", position 6 → "mat" (loss only at these two positions)

To predict "cat", the model uses the context "The __ sat on the dog": it must attend to both the left context ("The") and the right context ("sat on the dog") simultaneously.
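The corruption procedure above can be sketched in Python; the function name `corrupt` and the toy vocabulary are hypothetical, not from the paper:

```python
import random

MASK = "[MASK]"

def corrupt(tokens, vocab, mask_rate=0.15, seed=0):
    """Apply BERT's MLM corruption: select ~15% of positions, then 80/10/10."""
    rng = random.Random(seed)
    inputs = list(tokens)
    targets = {}  # position -> original token; the loss is computed only here
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:                  # 80%: replace with [MASK]
                inputs[i] = MASK
            elif r < 0.9:                # 10%: replace with a random vocab token
                inputs[i] = rng.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return inputs, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
inputs, targets = corrupt("the cat sat on the mat".split(), vocab)
print(inputs)
print(targets)  # only these positions are predicted
```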

3. Core Method 2: Next Sentence Prediction (NSP)

Many downstream tasks (QA, natural language inference) require understanding relationships between sentence pairs. NSP pre-trains this capability directly.

Given a sentence pair (A, B), 50% of the time B is the actual next sentence (IsNext), 50% of the time B is a random sentence from the corpus (NotNext). The model sees:

[CLS] sentence A tokens [SEP] sentence B tokens [SEP]

The final hidden state of the [CLS] token (a special classification token prepended to every input) is fed into a linear layer for binary classification: IsNext or NotNext.
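A sketch of how such training pairs might be assembled; the helper `make_nsp_example` and the toy corpus are illustrative, not the paper's actual data pipeline:

```python
import random

CLS, SEP = "[CLS]", "[SEP]"

def make_nsp_example(doc_sentences, corpus, rng):
    """Build one NSP pair from a document given as a list of tokenized sentences."""
    i = rng.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if rng.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"    # the actual next sentence
    else:
        sent_b, label = rng.choice(corpus), "NotNext"     # a random corpus sentence
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    # Segment ids: 0 for [CLS], sentence A, and the first [SEP]; 1 for the rest
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segments, label

doc = [["the", "cat", "sat"], ["on", "the", "mat"], ["it", "purred"]]
corpus = [["stocks", "fell", "today"], ["rain", "is", "likely"]]
tokens, segments, label = make_nsp_example(doc, corpus, random.Random(0))
print(tokens)
print(segments, label)
```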

The input representation sums three embeddings per token:

  • Token embeddings: standard word embeddings from a 30,000-token WordPiece vocabulary
  • Segment embeddings: $E_A$ for all tokens in sentence A, $E_B$ for all tokens in sentence B; this tells the model which sentence each token belongs to
  • Position embeddings: learned embeddings for positions 0…511, encoding each token's position within the sequence

$$\text{Input}_i = \text{TokenEmb}(x_i) + \text{SegmentEmb}(s_i) + \text{PositionEmb}(i)$$

All three embeddings are simply summed element-wise before being fed into the first Transformer layer.
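A minimal NumPy illustration of this sum; the lookup tables are randomly initialized stand-ins for the learned embeddings, and the token ids are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, hidden = 30000, 512, 2, 768

# Randomly initialized stand-ins for the three learned lookup tables
token_emb = rng.normal(size=(vocab_size, hidden))
segment_emb = rng.normal(size=(n_segments, hidden))
position_emb = rng.normal(size=(max_len, hidden))

token_ids = np.array([101, 2054, 2003, 102])  # hypothetical WordPiece ids
segment_ids = np.array([0, 0, 0, 0])          # every token belongs to sentence A
positions = np.arange(len(token_ids))

# Element-wise sum of the three embeddings, one row per input token
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 768)
```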

4. Architecture Details

BERT uses a standard Transformer encoder (no decoder). Two model sizes were released:

| Model | Layers (L) | Hidden (H) | Attention Heads (A) | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |

Pre-training corpus: BooksCorpus (800M words) + English Wikipedia (2,500M words). Pre-training took 4 days on 16 TPU chips for BERT-Base and 4 days on 64 TPU chips for BERT-Large.
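As a sanity check, the 110M figure for BERT-Base can be roughly reproduced from the hyperparameters above. This is a back-of-the-envelope count; the exact bookkeeping of biases and LayerNorms may differ slightly from the released checkpoint:

```python
# Back-of-the-envelope parameter count for BERT-Base:
# V = WordPiece vocab, H = hidden size, L = layers, FF = feed-forward size,
# P = max positions, S = segment types.
V, H, L, FF, P, S = 30522, 768, 12, 3072, 512, 2

embeddings = V * H + P * H + S * H + 2 * H  # token + position + segment + LayerNorm
per_layer = (
    4 * (H * H + H)      # Q, K, V, and output projections (weights + biases)
    + (H * FF + FF)      # feed-forward up-projection
    + (FF * H + H)       # feed-forward down-projection
    + 2 * (2 * H)        # the two LayerNorms in each block
)
pooler = H * H + H       # the [CLS] pooler used for classification

total = embeddings + L * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # lands close to the reported 110M
```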

5. Results

| Task | Previous SOTA | BERT-Base | BERT-Large |
|---|---|---|---|
| GLUE | 72.8 | 78.3 | 80.4 |
| SQuAD v1.1 EM | 84.1 | 84.1 | 85.1 |
| SQuAD v1.1 F1 | 90.9 | 90.9 | 91.8 |
| MNLI | 86.7 | 84.6 | 86.7 |

Impact: BERT achieved state-of-the-art results on all 11 NLP tasks it was evaluated on. The GLUE score improved from 72.8 to 80.4 (+7.6 points). Fine-tuning BERT-Base on most tasks takes at most an hour on a single Cloud TPU, or a few hours on a GPU.
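The "add one output layer" recipe can be sketched as a toy linear head over the [CLS] vector; shapes and initialization here are illustrative, not the released model:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_classes, seq_len = 768, 3, 128

# Stand-in for BERT's final hidden states on one fine-tuning example
hidden_states = rng.normal(size=(seq_len, hidden))

# The single added output layer: a linear classifier over the [CLS] vector
W = rng.normal(size=(hidden, n_classes)) * 0.02
b = np.zeros(n_classes)

cls_vector = hidden_states[0]   # [CLS] is always the first input token
logits = cls_vector @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # (3,)
```

During fine-tuning, both the head and all of BERT's pre-trained parameters receive gradients.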

6. Limitations

  • Training inefficiency: MLM provides a learning signal on only 15% of tokens per forward pass, far fewer than standard LM objectives, which predict every token.
  • NSP task limitations: NSP was later shown to be less helpful than originally claimed; RoBERTa removes it and achieves better performance.
  • Encoder-only: BERT's architecture is not directly suited to text generation tasks, since it has no autoregressive decoder.
  • [MASK] token mismatch: [MASK] tokens appear during pre-training but never during fine-tuning, creating a distribution shift. The 10% unchanged + 10% random strategy partially mitigates this.

7. Connections to Other Work

Attention Is All You Need

BERT uses the Transformer encoder from this paper. Every self-attention layer in BERT is the scaled dot-product attention defined here.

DPO

DPO builds on RLHF, which fine-tunes pretrained language models. The "pretrain then align" paradigm that DPO operates in grew out of the "pretrain then fine-tune" framework that BERT popularized.

GPT-2 (coming)

The decoder-only alternative to BERT. Same Transformer backbone, but with causal (left-to-right) attention instead of bidirectional attention. It excels at generation, where BERT excels at understanding.

8. Additional Resources