TL;DR
DPR replaces sparse TF-IDF/BM25 retrieval with two separate BERT encoders (one for questions, one for passages) trained so that matching question-passage pairs have high dot-product similarity in 768-dim space. Using in-batch negatives and hard BM25 negatives, DPR achieves 79.4% top-20 accuracy on Natural Questions vs. BM25's 59.1%, and scales to 21M Wikipedia passages via a FAISS index.
1. Sparse vs. Dense Retrieval
Open-domain question answering first retrieves relevant passages from a large corpus (e.g., Wikipedia), then feeds them to a reader model that extracts the answer. The retrieval step is the bottleneck.
TF-IDF and BM25 represent both query and document as sparse bag-of-words vectors, scoring by exact term overlap weighted by frequency and inverse document frequency. This works well when the question and answer passage share vocabulary, but fails at semantic mismatches:
The vocabulary mismatch problem
Query: "Where was Einstein born?"
Answer passage: "Ulm is Einstein's birthplace in the Kingdom of Württemberg."
BM25 sees: "born" ∉ passage, "birthplace" ∉ query → low overlap score despite the passage being the correct answer.
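A toy sketch (plain Python, not real BM25, no IDF or length normalization) makes the mismatch concrete: only the entity term overlaps, while "born" and "birthplace" never match each other.

```python
# Toy illustration of the vocabulary-mismatch problem: an exact-term
# scorer gets no credit for "born" (query) vs. "birthplace" (passage).

def content_terms(text):
    """Lowercased content words, with punctuation and possessives stripped."""
    stop = {"where", "was", "is", "in", "the", "of"}
    terms = {t.strip("?.,").removesuffix("'s").lower() for t in text.split()}
    return terms - stop

query = "Where was Einstein born?"
passage = "Ulm is Einstein's birthplace in the Kingdom of Württemberg."

overlap = content_terms(query) & content_terms(passage)
print(overlap)  # only "einstein" matches; the born/birthplace pair misses
```

A dense retriever is trained precisely so that such paraphrase pairs land near each other despite zero lexical overlap.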
Dense retrieval solves this by learning continuous vector representations where semantically similar texts are close in embedding space, regardless of surface-level word choice.
| Property | BM25 (Sparse) | DPR (Dense) |
|---|---|---|
| Representation | Sparse TF-IDF vector | Dense 768-dim BERT vector |
| Similarity | Exact term overlap | Dot product in embedding space |
| Training | Unsupervised (no learning) | Supervised with QA pairs |
| Semantic understanding | None (vocabulary mismatch fails) | Yes (synonyms, paraphrases) |
| NQ top-20 accuracy | 59.1% | 79.4% |
2. DPR Architecture: The Dual Encoder
DPR uses two independent BERT-base encoders: one for questions (E_Q) and one for passages (E_P). Each encoder maps its input to a 768-dimensional vector by taking the [CLS] token representation from the final layer.
Similarity is the inner product, sim(q, p) = E_Q(q)ᵀ E_P(p), where E_Q(q) and E_P(p) are 768-dimensional [CLS] representations from their respective BERT encoders. Both encoders are initialized from the same pre-trained BERT-base checkpoint but have separate weights after fine-tuning.
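A minimal sketch of this scoring function, with invented 4-dim toy vectors standing in for the 768-dim [CLS] embeddings the two encoders would produce:

```python
# sim(q, p) = E_Q(q) · E_P(p): the question embedding is compared to
# each passage embedding by a plain dot product. Vectors are toy values.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = [0.9, 0.1, 0.4, 0.2]  # stands in for E_Q("Where was Einstein born?")
passages = {
    "Ulm is Einstein's birthplace": [0.8, 0.2, 0.5, 0.1],
    "Einstein developed relativity": [0.3, 0.9, 0.1, 0.4],
    "Ulm lies on the Danube":        [0.2, 0.1, 0.9, 0.0],
}

scores = {text: dot(q_vec, vec) for text, vec in passages.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Because the score is a single inner product, ranking a corpus reduces to one matrix-vector multiply, which is what makes offline passage indexing possible.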
The key architectural choice, two separate encoders rather than one cross-encoder, is essential for scalability. A cross-encoder would concatenate the question and passage and run joint attention over both, making it impossible to pre-compute passage representations. With dual encoders, all passage embeddings can be computed offline once and stored in a FAISS index.
Cross-encoder (not used)
BERT([CLS] q [SEP] p [SEP]) → score. Powerful, but must run on every (q, p) pair at query time: O(N) BERT calls for N passages. Infeasible at scale.
Dual encoder (DPR)
E_Q(q)ᵀ · E_P(p). Passage embeddings are pre-computed offline. Query time: 1 BERT call + FAISS lookup. Scales to 21M passages in milliseconds.
3. Training with Hard Negatives
The training objective is to make the positive passage (one containing the answer) score higher than all negative passages for a given question. The loss is negative log-likelihood over a softmax:
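In the paper's notation, with sim(q, p) = E_Q(q)ᵀ E_P(p), the per-question loss for one positive passage and n negatives is:

```latex
L\big(q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-}\big)
  = -\log \frac{\exp\!\big(\mathrm{sim}(q_i, p_i^{+})\big)}
               {\exp\!\big(\mathrm{sim}(q_i, p_i^{+})\big)
                + \sum_{j=1}^{n} \exp\!\big(\mathrm{sim}(q_i, p_{i,j}^{-})\big)}
```

This is ordinary softmax cross-entropy over one positive and n negatives.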
The loss is minimized when the positive passage scores much higher than every negative, pushing the positive embedding close to the question embedding and the negative embeddings far away.
The quality of negatives is critical. DPR uses three types:
1. Random negatives
Randomly sampled passages from Wikipedia. Easy to distinguish, so they carry little training signal. Used mainly to fill the batch.
2. In-batch negatives (key efficiency trick)
The positive passages of other questions in the same mini-batch serve as negatives. If batch size B = 32, each question gets B-1 = 31 free negatives. This dramatically reduces cost: passage embeddings already computed for the batch are reused.
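A toy numeric sketch of the in-batch softmax loss (B = 3 and 2-dim embeddings instead of 768; all vectors are invented): row i of the B×B score matrix S = QPᵀ has its positive on the diagonal, and the other B-1 entries are the free negatives.

```python
import math

# Toy batch: question embeddings and their matching positive-passage
# embeddings. In real DPR these are 768-dim BERT outputs.
Q = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
P = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]

def nll_row(i):
    # Row i of S = Q Pᵀ: score question i against every passage in the batch.
    scores = [sum(a * b for a, b in zip(Q[i], p)) for p in P]
    denom = sum(math.exp(s) for s in scores)
    # Softmax NLL with the diagonal entry (the true pair) as the target.
    return -math.log(math.exp(scores[i]) / denom)

loss = sum(nll_row(i) for i in range(len(Q))) / len(Q)
print(round(loss, 3))
```

One B×B matrix multiply yields B(B-1) negative comparisons at no extra encoding cost, which is the entire efficiency trick.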
3. Hard negatives (BM25 top-k): the most important
BM25 retrieves the top passages for the question that do NOT contain the answer. These are lexically similar but semantically wrong: exactly the hard cases the model needs to learn to reject. Adding hard negatives was the single biggest accuracy improvement in ablations.
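Mining these negatives is a simple filter over a BM25 ranking. A sketch under the assumption that a pre-computed BM25 ranked list is available (in DPR it comes from a Lucene/BM25 index; the passages below are invented):

```python
# Hard-negative mining: keep BM25's top-ranked passages that do NOT
# contain the answer string. They share vocabulary with the question
# but are wrong, so they are maximally informative negatives.

def mine_hard_negatives(bm25_ranked, answer, k=2):
    negatives = [p for p in bm25_ranked if answer.lower() not in p.lower()]
    return negatives[:k]

bm25_top = [
    "Einstein was born in 1879 to a secular family.",      # lexical match, no answer
    "Ulm is Einstein's birthplace.",                        # contains the answer
    "Einstein was born into a family of Jewish descent.",   # lexical match, no answer
]
hard_negs = mine_hard_negatives(bm25_top, answer="Ulm")
print(hard_negs)
```

Note the usual caveat with this string-match heuristic: a passage can discuss the answer without containing the literal answer string, so some "negatives" are noisy.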
4. FAISS for Billion-Scale Retrieval
At inference time, DPR must find the top-k most similar passages from a corpus of 21 million Wikipedia passages. Brute-force dot product over 21M 768-dim vectors would be too slow. DPR uses Facebook AI Similarity Search (FAISS).
FAISS is a library for Maximum Inner Product Search (MIPS) and approximate nearest neighbor search. The DPR pipeline works as follows:
1. Offline: encode all 21M Wikipedia passages with E_P and add the vectors to a FAISS index.
2. Query time: encode the question with E_Q (a single BERT forward pass).
3. Search the index for the top-k passages by inner product.
4. Hand the retrieved passages to the reader.
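The exact search that FAISS's flat inner-product index (`IndexFlatIP`) performs can be sketched in pure Python; FAISS does the same computation with SIMD/GPU kernels, and its approximate indexes trade exactness for speed at 21M-vector scale. Vectors here are toy values.

```python
import heapq

# Brute-force MIPS: score every stored passage vector against the query
# and return the k largest inner products with their passage ids.
def mips_topk(index_vectors, query, k):
    scored = ((sum(a * b for a, b in zip(vec, query)), i)
              for i, vec in enumerate(index_vectors))
    return heapq.nlargest(k, scored)  # [(score, passage_id), ...]

index = [[0.1, 0.9], [0.8, 0.3], [0.7, 0.7], [0.0, 0.2]]
top2 = mips_topk(index, query=[1.0, 0.5], k=2)
print(top2)
```

Because passage embeddings are fixed after training, this index is built once and amortized over every subsequent query.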
5. Worked Example
Let's trace through DPR end-to-end for the Einstein birthplace question:
Step 1: Encode the question
"Where was Einstein born?" → E_Q → [0.12, -0.34, 0.87, ..., 0.03] ∈ ℝ⁷⁶⁸
768-dim dense vector representing the question's meaning
Step 2: Retrieve top-k passages via FAISS
Compute dot product with all 21M pre-encoded passage vectors. Return top 20.
Top passage: "Ulm is Einstein's birthplace, in the Kingdom of Württemberg." (score: 0.94)
Step 3: Reader extracts answer
A separate reader model (e.g., BERT fine-tuned on SQuAD) extracts the answer span from the top passage.
Answer: "Ulm"