TL;DR
BERT pre-trains a deep bidirectional Transformer by jointly conditioning on left and right context. It uses two self-supervised tasks: Masked Language Modeling (predict masked tokens) and Next Sentence Prediction (classify sentence pairs). Fine-tuning BERT set a new state of the art on all 11 NLP tasks it was evaluated on in 2018.
1. Background: The Bidirectionality Problem
Before BERT, the two dominant approaches to contextual language representations had a fundamental limitation:
- GPT: Autoregressive left-to-right Transformer. At position i, it can only attend to positions 1…i-1. Rich context, but only half of it.
- ELMo: Runs a forward LSTM and a backward LSTM separately, then concatenates. Bidirectional in principle, but shallow: the two directions never interact within a layer.
BERT's solution: jointly condition on ALL tokens at every layer using a Masked Language Model (MLM) objective, enabling every token to attend to every other token in both directions simultaneously.
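The contrast between the two attention patterns can be sketched as a mask over attention scores (a toy NumPy illustration, not BERT's actual implementation):

```python
import numpy as np

seq_len = 4

# GPT-style causal mask: position i attends only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT-style bidirectional mask: every position attends to every position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Position 0 in the causal model sees only itself; in BERT it sees all 4 tokens.
print(causal_mask[0].sum())         # 1
print(bidirectional_mask[0].sum())  # 4
```

In a real Transformer, positions where the mask is False have their attention scores set to negative infinity before the softmax, so they receive zero attention weight.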
2. Core Method 1: Masked Language Modeling (MLM)
Randomly select 15% of tokens in each sequence. For each selected token, apply the following strategy:
| Case | Probability | Action | Why |
|---|---|---|---|
| [MASK] | 80% | Replace with [MASK] token | Main learning signal |
| random | 10% | Replace with random token | Forces model to maintain robust token representations |
| unchanged | 10% | Keep original token | Reduces train/test mismatch for [MASK] |
The training loss is standard cross-entropy, computed only at the masked positions:
Original sentence: "The cat sat on the mat"
Tokens selected for masking (15%): position 2 ("cat"), position 6 ("mat")
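The 80/10/10 strategy above can be sketched as follows. This is a minimal illustration, not the original implementation: `MASK_ID` is a hypothetical token id, and the -100 label convention (positions excluded from the loss, as in PyTorch's `ignore_index`) is an assumption of this sketch.

```python
import random

MASK_ID = 103          # hypothetical id for the [MASK] token
VOCAB_SIZE = 30522     # size of BERT's WordPiece vocabulary

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels are -100 everywhere except selected positions."""
    rng = rng or random.Random(0)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                       # ~85% of positions: untouched, no loss
        labels[i] = tok                    # cross-entropy is computed only here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_ID                    # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

Note that positions kept unchanged (the last 10%) still receive a label, so the model must predict the correct token even when the input was not corrupted.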
3. Core Method 2: Next Sentence Prediction (NSP)
Many downstream tasks (QA, natural language inference) require understanding relationships between sentence pairs. NSP pre-trains this capability directly.
Given a sentence pair (A, B), 50% of the time B is the actual next sentence (IsNext), and 50% of the time B is a random sentence from the corpus (NotNext). The model sees both sentences packed into a single sequence: [CLS] Sentence A [SEP] Sentence B [SEP].
The final hidden state of the [CLS] token (a special classification token prepended to every input) is fed into a linear layer for binary classification: IsNext or NotNext.
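Constructing one NSP training example can be sketched like this (a toy illustration: the corpus is a stand-in, `make_nsp_pair` is a hypothetical helper, and the packed format is [CLS] A [SEP] B [SEP]):

```python
import random

def make_nsp_pair(sentences, idx, rng=None):
    """Build one NSP example from a corpus (list of sentences) and an index."""
    rng = rng or random.Random(0)
    sent_a = sentences[idx]
    if rng.random() < 0.5:                           # 50%: actual next sentence
        sent_b, label = sentences[idx + 1], "IsNext"
    else:                                            # 50%: random sentence
        sent_b, label = rng.choice(sentences), "NotNext"
    text = f"[CLS] {sent_a} [SEP] {sent_b} [SEP]"
    return text, label
```

During pre-training, the final [CLS] hidden state for each such sequence is classified against the IsNext/NotNext label.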
BERT's input representation is built from three embeddings per token: a WordPiece token embedding, a segment embedding (sentence A vs. sentence B), and a learned position embedding. All three are simply summed element-wise before being fed into the first Transformer layer.
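The input representation can be sketched in NumPy as follows (dimensions are BERT-Base's; the lookup tables are random stand-ins for learned weights, and the token ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MAX_LEN, SEGMENTS, H = 30522, 512, 2, 768

token_emb = rng.normal(size=(VOCAB, H))       # one vector per WordPiece token
segment_emb = rng.normal(size=(SEGMENTS, H))  # sentence A (0) vs. sentence B (1)
position_emb = rng.normal(size=(MAX_LEN, H))  # learned absolute positions

def embed(token_ids, segment_ids):
    positions = np.arange(len(token_ids))
    # Element-wise sum of the three embeddings, one (H,) vector per token
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

x = embed([101, 2054, 102], [0, 0, 0])
print(x.shape)  # (3, 768)
```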
4. Architecture Details
BERT uses a standard Transformer encoder (no decoder). Two model sizes were released:
| Model | Layers (L) | Hidden (H) | Attention Heads (A) | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
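The parameter counts in the table can be sanity-checked with a back-of-the-envelope estimate (a sketch that counts only embedding, attention, and feed-forward weight matrices, omitting biases, LayerNorm, and the pooler; the FFN inner size of 4H and vocab size of 30522 are BERT's published values):

```python
def approx_params(layers, hidden, vocab=30522, max_len=512):
    """Rough parameter count: embeddings + per-layer attention and FFN weights."""
    embeddings = (vocab + max_len + 2) * hidden            # token + position + segment
    per_layer = 4 * hidden**2 + 2 * hidden * (4 * hidden)  # attention (Q,K,V,O) + FFN
    return embeddings + layers * per_layer

print(f"BERT-Base:  ~{approx_params(12, 768) / 1e6:.0f}M")
print(f"BERT-Large: ~{approx_params(24, 1024) / 1e6:.0f}M")
```

The estimate lands near 109M and 334M, close to the reported 110M and 340M; the small gap is the omitted biases and LayerNorm parameters.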
Pre-training corpus: BooksCorpus (800M words) + English Wikipedia (2,500M words). Pre-training took 4 days on 4 Cloud TPUs (16 TPU chips) for BERT-Base and 16 Cloud TPUs (64 chips) for BERT-Large.
5. Results
| Task | Previous SOTA | BERT-Base | BERT-Large |
|---|---|---|---|
| GLUE score | 72.8 | 78.3 | 80.4 |
| SQuAD v1.1 EM (dev) | 81.2 (R.M. Reader ensemble) | 80.8 | 84.1 |
| SQuAD v1.1 F1 (dev) | 87.9 (R.M. Reader ensemble) | 88.5 | 90.9 |
| MNLI accuracy | 82.1 (GPT) | 84.6 | 86.7 |
Impact: BERT achieved state-of-the-art results on all 11 NLP tasks it was evaluated on, improving the GLUE score from 72.8 to 80.4 (+7.6 points). Fine-tuning is cheap: at most 1 hour on a single Cloud TPU (a few hours on a GPU) per task.
6. Limitations
- Training inefficiency: MLM only provides signal on 15% of tokens per forward pass, much lower than standard LM objectives that predict every token.
- NSP task limitations: NSP was later shown to be less helpful than originally claimed; RoBERTa removes it and achieves better performance.
- Encoder-only: BERT's architecture is not directly suited for text generation tasks, since it has no autoregressive decoder.
- [MASK] token mismatch: [MASK] tokens appear during pre-training but never during fine-tuning, creating a distribution shift. The 10% unchanged + 10% random strategy partially mitigates this.
7. Connections to Other Work
- Attention Is All You Need (Vaswani et al., 2017): BERT uses the Transformer encoder from this paper; every self-attention layer in BERT is the scaled dot-product attention defined there.
- DPO: builds on RLHF, which fine-tunes pretrained models like BERT and GPT. The "pretrain then align" paradigm that DPO operates in was pioneered by BERT's "pretrain then fine-tune" framework.
- GPT: the decoder-only alternative to BERT. Same Transformer backbone, but with causal (left-to-right) attention instead of bidirectional. Excels at generation where BERT excels at understanding.