TL;DR
DPR replaces sparse TF-IDF/BM25 retrieval with two separate BERT encoders (one for questions, one for passages) trained so that matching question-passage pairs have high dot-product similarity in 768-dim space. Using in-batch negatives and hard BM25 negatives, DPR achieves 79.4% top-20 accuracy on Natural Questions vs. BM25's 59.1%, and scales to 21M Wikipedia passages via a FAISS index.
1. Sparse vs. Dense Retrieval
Open-domain question answering first retrieves relevant passages from a large corpus (e.g., Wikipedia), then feeds them to a reader model that extracts the answer. The retrieval step is the bottleneck.
TF-IDF and BM25 represent both query and document as sparse bag-of-words vectors, scoring by exact term overlap weighted by frequency and inverse document frequency. This works well when the question and answer passage share vocabulary, but fails at semantic mismatches:
The vocabulary mismatch problem
Query: "Where was Einstein born?"
Answer passage: "Ulm is Einstein's birthplace in the Kingdom of Württemberg."
BM25 sees: "born" ∉ passage, "birthplace" ∉ query → low overlap score despite the passage being the correct answer.
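A toy sketch (plain Python, not real BM25, no IDF or length normalization) makes the mismatch concrete: only the entity term overlaps, while "born" and "birthplace" never match each other.

```python
# Toy illustration of the vocabulary-mismatch problem: an exact-term
# scorer gets no credit for "born" (query) vs. "birthplace" (passage).

def content_terms(text):
    """Lowercased content words, with punctuation and possessives stripped."""
    stop = {"where", "was", "is", "in", "the", "of"}
    terms = {t.strip("?.,").removesuffix("'s").lower() for t in text.split()}
    return terms - stop

query = "Where was Einstein born?"
passage = "Ulm is Einstein's birthplace in the Kingdom of Württemberg."

overlap = content_terms(query) & content_terms(passage)
print(overlap)  # only "einstein" matches; the born/birthplace pair misses
```

A dense retriever is trained precisely so that such paraphrase pairs land near each other despite zero lexical overlap.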
Dense retrieval solves this by learning continuous vector representations where semantically similar texts are close in embedding space, regardless of surface-level word choice.
| Property | BM25 (Sparse) | DPR (Dense) |
|---|---|---|
| Representation | Sparse TF-IDF vector | Dense 768-dim BERT vector |
| Similarity | Exact term overlap | Dot product in embedding space |
| Training | Unsupervised (no learning) | Supervised with QA pairs |
| Semantic understanding | None (vocabulary mismatch fails) | Yes (synonyms, paraphrases) |
| NQ top-20 accuracy | 59.1% | 79.4% |
2. DPR Architecture: The Dual Encoder
DPR uses two independent BERT-base encoders: one for questions (E_Q) and one for passages (E_P). Each encoder maps its input to a 768-dimensional vector by taking the [CLS] token representation from the final layer.
Similarity is the inner product, sim(q, p) = E_Q(q)ᵀ E_P(p), where E_Q(q) and E_P(p) are 768-dimensional [CLS] representations from their respective BERT encoders. Both encoders are initialized from the same pre-trained BERT-base checkpoint but have separate weights after fine-tuning.
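A minimal sketch of this scoring function, with invented 4-dim toy vectors standing in for the 768-dim [CLS] embeddings the two encoders would produce:

```python
# sim(q, p) = E_Q(q) · E_P(p): the question embedding is compared to
# each passage embedding by a plain dot product. Vectors are toy values.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q_vec = [0.9, 0.1, 0.4, 0.2]  # stands in for E_Q("Where was Einstein born?")
passages = {
    "Ulm is Einstein's birthplace": [0.8, 0.2, 0.5, 0.1],
    "Einstein developed relativity": [0.3, 0.9, 0.1, 0.4],
    "Ulm lies on the Danube":        [0.2, 0.1, 0.9, 0.0],
}

scores = {text: dot(q_vec, vec) for text, vec in passages.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Because the score is a single inner product, ranking a corpus reduces to one matrix-vector multiply, which is what makes offline passage indexing possible.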
The key architectural choice, two separate encoders rather than one cross-encoder, is essential for scalability. A cross-encoder would concatenate the question and passage and run joint attention over both, making it impossible to pre-compute passage representations. With dual encoders, all passage embeddings can be computed offline once and stored in a FAISS index.
Cross-encoder (not used)
BERT([CLS] q [SEP] p [SEP]) → score. Powerful, but must run on every (q, p) pair at query time: O(N) BERT calls for N passages. Infeasible at scale.
Dual encoder (DPR)
E_Q(q)ᵀ · E_P(p). Passage embeddings are pre-computed offline. Query time: 1 BERT call + FAISS lookup. Scales to 21M passages in milliseconds.
3. Training with Hard Negatives
The training objective is to make the positive passage (one containing the answer) score higher than all negative passages for a given question. The loss is negative log-likelihood over a softmax:
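In the paper's notation, with sim(q, p) = E_Q(q)ᵀ E_P(p), the per-question loss for one positive passage and n negatives is:

```latex
L\big(q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-}\big)
  = -\log \frac{\exp\!\big(\mathrm{sim}(q_i, p_i^{+})\big)}
               {\exp\!\big(\mathrm{sim}(q_i, p_i^{+})\big)
                + \sum_{j=1}^{n} \exp\!\big(\mathrm{sim}(q_i, p_{i,j}^{-})\big)}
```

This is ordinary softmax cross-entropy over one positive and n negatives.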
The loss is minimized when the positive passage scores much higher than every negative, pushing the positive embedding close to the question embedding and the negative embeddings far away.
The quality of negatives is critical. DPR uses three types:
1. Random negatives
Randomly sampled passages from Wikipedia. Easy to distinguish, so they carry little training signal. Used mainly to fill the batch.
2. In-batch negatives (key efficiency trick)
The positive passages of other questions in the same mini-batch serve as negatives. If batch size B = 32, each question gets B-1 = 31 free negatives. This dramatically reduces cost: passage embeddings already computed for the batch are reused.
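A toy numeric sketch of the in-batch softmax loss (B = 3 and 2-dim embeddings instead of 768; all vectors are invented): row i of the B×B score matrix S = QPᵀ has its positive on the diagonal, and the other B-1 entries are the free negatives.

```python
import math

# Toy batch: question embeddings and their matching positive-passage
# embeddings. In real DPR these are 768-dim BERT outputs.
Q = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
P = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]

def nll_row(i):
    # Row i of S = Q Pᵀ: score question i against every passage in the batch.
    scores = [sum(a * b for a, b in zip(Q[i], p)) for p in P]
    denom = sum(math.exp(s) for s in scores)
    # Softmax NLL with the diagonal entry (the true pair) as the target.
    return -math.log(math.exp(scores[i]) / denom)

loss = sum(nll_row(i) for i in range(len(Q))) / len(Q)
print(round(loss, 3))
```

One B×B matrix multiply yields B(B-1) negative comparisons at no extra encoding cost, which is the entire efficiency trick.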
3. Hard negatives (BM25 top-k): the most important
BM25 retrieves the top passages for the question that do NOT contain the answer. These are lexically similar but semantically wrong: exactly the hard cases the model needs to learn to reject. Adding hard negatives was the single biggest accuracy improvement in ablations.
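Mining these negatives is a simple filter over a BM25 ranking. A sketch under the assumption that a pre-computed BM25 ranked list is available (in DPR it comes from a Lucene/BM25 index; the passages below are invented):

```python
# Hard-negative mining: keep BM25's top-ranked passages that do NOT
# contain the answer string. They share vocabulary with the question
# but are wrong, so they are maximally informative negatives.

def mine_hard_negatives(bm25_ranked, answer, k=2):
    negatives = [p for p in bm25_ranked if answer.lower() not in p.lower()]
    return negatives[:k]

bm25_top = [
    "Einstein was born in 1879 to a secular family.",      # lexical match, no answer
    "Ulm is Einstein's birthplace.",                        # contains the answer
    "Einstein was born into a family of Jewish descent.",   # lexical match, no answer
]
hard_negs = mine_hard_negatives(bm25_top, answer="Ulm")
print(hard_negs)
```

Note the usual caveat with this string-match heuristic: a passage can discuss the answer without containing the literal answer string, so some "negatives" are noisy.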
4. FAISS for Billion-Scale Retrieval
At inference time, DPR must find the top-k most similar passages from a corpus of 21 million Wikipedia passages. Brute-force dot product over 21M 768-dim vectors would be too slow. DPR uses Facebook AI Similarity Search (FAISS).
FAISS is a library for Maximum Inner Product Search (MIPS) and approximate nearest neighbor search. The DPR pipeline works as follows:
1. Offline: encode all 21M Wikipedia passages with E_P and add the vectors to a FAISS index.
2. Query time: encode the question with E_Q (a single BERT forward pass).
3. Search the index for the top-k passages by inner product.
4. Hand the retrieved passages to the reader.
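The exact search that FAISS's flat inner-product index (`IndexFlatIP`) performs can be sketched in pure Python; FAISS does the same computation with SIMD/GPU kernels, and its approximate indexes trade exactness for speed at 21M-vector scale. Vectors here are toy values.

```python
import heapq

# Brute-force MIPS: score every stored passage vector against the query
# and return the k largest inner products with their passage ids.
def mips_topk(index_vectors, query, k):
    scored = ((sum(a * b for a, b in zip(vec, query)), i)
              for i, vec in enumerate(index_vectors))
    return heapq.nlargest(k, scored)  # [(score, passage_id), ...]

index = [[0.1, 0.9], [0.8, 0.3], [0.7, 0.7], [0.0, 0.2]]
top2 = mips_topk(index, query=[1.0, 0.5], k=2)
print(top2)
```

Because passage embeddings are fixed after training, this index is built once and amortized over every subsequent query.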
5. Worked Example
Let's trace through DPR end-to-end for the Einstein birthplace question:
Step 1: Encode the question
"Where was Einstein born?" → E_Q → [0.12, -0.34, 0.87, ..., 0.03] ∈ ℝ⁷⁶⁸
768-dim dense vector representing the question's meaning
Step 2: Retrieve top-k passages via FAISS
Compute dot product with all 21M pre-encoded passage vectors. Return top 20.
Top passage: "Ulm is Einstein's birthplace, in the Kingdom of Württemberg." (score: 0.94)
Step 3: Reader extracts answer
A separate reader model (e.g., BERT fine-tuned on SQuAD) extracts the answer span from the top passage.
Answer: "Ulm"