TL;DR
RAG combines a parametric memory (a BART seq2seq generator) with a non-parametric memory (dense retrieval over 21M Wikipedia passages). Given a query, a DPR retriever fetches the top-k relevant passages; BART then generates the answer conditioned on the query and each passage. The model marginalizes over the retrieved documents during training and achieved state-of-the-art results on open-domain QA benchmarks; the passage index itself stays fixed, so only the query encoder and the generator are fine-tuned.
1. The Hallucination Problem
Large language models store knowledge in their parameters during pretraining. This creates two deep problems: they can confidently generate plausible-sounding but factually wrong text (hallucination), and their knowledge is frozen at training time: updating facts requires expensive full retraining.
RAG's core insight: separate what the model knows how to do (language, reasoning, generation) from what it needs to look up (facts, entities, recent events). Keep facts in a searchable external store that can be swapped or updated without retraining.
2. RAG Architecture
RAG is a two-component system: a retriever that fetches relevant passages from a fixed corpus, and a generator that produces the answer conditioned on the query and retrieved passages. The retriever and generator are trained end-to-end by marginalizing over retrieved documents.
RAG Pipeline
Retriever provides p_η(z|x) · Generator provides p_θ(y|x,z) · Marginalize: p(y|x) = Σ_z p_η(z|x) · p_θ(y|x,z)
The joint probability is defined by marginalizing over the latent document variable z. During training, the retriever index is fixed (not updated by gradient), while both the DPR query encoder and the BART parameters receive gradients through the generator loss.
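The marginalization can be checked on a toy example. A minimal numpy sketch with invented retriever and generator probabilities for k=3 documents (none of these numbers come from the real model):

```python
import numpy as np

# Hypothetical retriever distribution over k=3 retrieved documents: p_eta(z|x)
p_doc = np.array([0.6, 0.3, 0.1])            # sums to 1

# Hypothetical generator probability of the answer y under each document: p_theta(y|x,z)
p_ans_given_doc = np.array([0.9, 0.5, 0.2])

# Marginal probability: p(y|x) = sum_z p_eta(z|x) * p_theta(y|x,z)
p_ans = np.dot(p_doc, p_ans_given_doc)
print(p_ans)  # 0.6*0.9 + 0.3*0.5 + 0.1*0.2 = 0.71
```

The answer probability is a retrieval-weighted average of the per-document generator probabilities, which is exactly what the gradients flow through during training.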
3. Retriever: Dense Passage Retrieval (DPR)
The retriever uses DPR, a bi-encoder architecture that maps both queries and passages to dense vectors in the same embedding space. Relevance is measured by dot-product similarity, enabling efficient Maximum Inner Product Search (MIPS) over a pre-built index of 21 million Wikipedia passages.
DPR Similarity Score
sim(x, z) = E_Q(x)ᵀ E_P(z)
E_Q: BERT-based query encoder · E_P: BERT-based passage encoder · Both produce 768-dim vectors
All 21M passage embeddings are pre-computed and stored in a FAISS index. At inference, only the query is encoded on the fly; MIPS retrieves the top-k passages in sub-linear time.
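Exact MIPS over a small corpus can be sketched without FAISS: it is just a top-k over dot products. A numpy stand-in with random vectors in place of real DPR embeddings (a production system would use a FAISS index such as IndexFlatIP, or an approximate variant, rather than this brute-force scan):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_passages, k = 768, 10_000, 5            # DPR uses 768-dim BERT vectors

# Pre-computed passage embeddings E_P(z); stands in for the FAISS index
passage_embs = rng.standard_normal((n_passages, d)).astype(np.float32)

def retrieve_topk(query_emb: np.ndarray, k: int) -> np.ndarray:
    """Exact maximum inner product search: top-k passages by dot product."""
    scores = passage_embs @ query_emb        # (n_passages,) similarity scores
    topk = np.argpartition(-scores, k)[:k]   # unordered top-k in O(n)
    return topk[np.argsort(-scores[topk])]   # sorted by score, descending

query_emb = rng.standard_normal(d).astype(np.float32)   # E_Q(x)
top_ids = retrieve_topk(query_emb, k)
print(top_ids)  # indices of the 5 highest-scoring passages
```

Only the query is embedded at inference time, which is why the passage side can be frozen and pre-indexed.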
The retriever probability over document z given query x is defined as:
p_η(z|x) ∝ exp(E_Q(x)ᵀ E_P(z)), normalized in practice over the top-k retrieved passages.
4. Generator: BART
The generator is BART-large, a denoising autoencoder pretrained as an encoder-decoder Transformer. BART's encoder receives a concatenation of the query x and retrieved passage z; its decoder auto-regressively generates the output sequence y token by token.
Generator input format
[BOS] question: {x} context: {z} [EOS] → {y}
Query x and passage z are concatenated as the encoder input. The decoder generates y auto-regressively.
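Building these encoder inputs is plain string formatting, one input per retrieved passage. A sketch (the template below is illustrative, not the paper's verbatim preprocessing; the tokenizer adds the actual [BOS]/[EOS] tokens):

```python
def build_generator_inputs(question: str, passages: list[str]) -> list[str]:
    """One encoder input per retrieved passage: query and passage concatenated."""
    return [f"question: {question} context: {p}" for p in passages]

inputs = build_generator_inputs(
    "Who invented the transformer architecture?",
    ["Vaswani et al. introduced the Transformer in 2017.",
     "Attention Is All You Need is a 2017 paper."],
)
print(inputs[0])
# question: Who invented the transformer architecture? context: Vaswani et al. introduced the Transformer in 2017.
```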
The generator defines the conditional probability of each token given all previous tokens, the query, and the retrieved passage. This factorizes as:
p_θ(y|x,z) = Π_i p_θ(y_i | x, z, y_{1:i-1})
5. RAG-Sequence vs RAG-Token
RAG proposes two ways to marginalize over the k retrieved documents. They differ in where the marginalization happens: once for the whole output sequence, or at every output token.
RAG-Sequence: one document per generation
Treat the retrieved document as fixed for the entire output sequence. Marginalize over documents by summing the full sequence probability for each document:
p(y|x) ≈ Σ_z p_η(z|x) · p_θ(y|x,z) = Σ_z p_η(z|x) Π_i p_θ(y_i | x, z, y_{1:i-1})
Intuition: the model picks a document, fully generates a candidate answer conditioned on it, then averages across documents. Each candidate answer is coherent with one source.
RAG-Token: different document per token
Allow each output token to be generated from a different document. Marginalize at every decoding step:
p(y|x) ≈ Π_i Σ_z p_η(z|x) · p_θ(y_i | x, z, y_{1:i-1})
Intuition: at each token, the model effectively aggregates evidence from all k documents. This is more flexible: it can synthesize facts from multiple passages within a single answer.
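The two marginalizations give different sequence probabilities in general, because sum-of-products and product-of-sums do not commute. A toy numpy comparison with k=2 documents and a 3-token output (all probabilities invented for illustration):

```python
import numpy as np

p_doc = np.array([0.7, 0.3])                 # p_eta(z|x), k=2 documents

# p_theta(y_i | x, z, y_<i): rows = documents, cols = output tokens
p_tok = np.array([[0.9, 0.8, 0.7],           # token probs under document 0
                  [0.2, 0.6, 0.9]])          # token probs under document 1

# RAG-Sequence: sum_z p(z) * prod_i p(y_i | z)
p_seq = np.sum(p_doc * np.prod(p_tok, axis=1))

# RAG-Token: prod_i sum_z p(z) * p(y_i | z)
p_token = np.prod(p_doc @ p_tok)

print(p_seq, p_token)   # the two objectives generally differ
```

RAG-Sequence commits to one document per candidate answer; RAG-Token mixes documents at every decoding step, which is why its per-token distribution is a weighted average across passages.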
6. Worked Example: How RAG Answers a Question
Suppose the query is: "Who invented the transformer architecture?"
- Encode query: E_Q encodes the question into a 768-dim vector.
- MIPS retrieval: FAISS finds the top-k=5 Wikipedia passages with highest dot-product similarity. Passages about Vaswani et al., the 'Attention Is All You Need' paper, and the Transformer's reception are retrieved.
- Compute document weights: The retriever assigns a probability p_η(z|x) to each passage proportional to exp(E_Q(x)ᵀ E_P(z)).
- BART generates: For each passage z, BART encodes [question; passage] and decodes a candidate answer. In RAG-Token, at each decoding step the next token is sampled from a mixture over all passages weighted by p_η(z|x).
- Output: "The transformer architecture was introduced by Vaswani et al. in the 2017 paper 'Attention Is All You Need.'" The answer is grounded in retrieved evidence, not memorized.
Key insight: The document probabilities p_η(z|x) act as soft attention weights over the corpus; the model attends to the entire Wikipedia index, not just its parameters.
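The document-weight step in the walkthrough is just a softmax over the retrieval scores of the top-k passages. A small sketch with invented dot-product scores for the five retrieved passages:

```python
import numpy as np

def doc_weights(scores: np.ndarray) -> np.ndarray:
    """p_eta(z|x) proportional to exp(score), normalized over the top-k passages."""
    shifted = scores - scores.max()          # subtract max for numerical stability
    w = np.exp(shifted)
    return w / w.sum()

# Hypothetical scores E_Q(x)^T E_P(z) for the k=5 retrieved passages
scores = np.array([12.1, 11.4, 10.9, 9.8, 9.5])
p_eta = doc_weights(scores)
print(p_eta.round(3))   # highest-scoring passage gets the largest weight
```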
7. Training: Joint End-to-End Learning
The full marginal probability is optimized directly with stochastic gradient descent. The loss is the negative log-likelihood of the correct output y* given query x:
L(x, y*) = -log p(y*|x) = -log Σ_z p_η(z|x) · p_θ(y*|x,z)
Gradients flow through both the BART generator (p_θ) and the DPR query encoder E_Q (inside p_η). The passage encoder E_P and the FAISS index are frozen: updating them would require reindexing 21M passages after every gradient step.
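The training objective can be sketched numerically for a single (x, y*) pair. A numpy sketch with invented retrieval scores and generator likelihoods (in the real system, autograd carries this loss's gradient into E_Q and the BART parameters):

```python
import numpy as np

# Hypothetical retrieval scores E_Q(x)^T E_P(z) for k=3 passages
scores = np.array([10.0, 9.2, 8.5])
p_eta = np.exp(scores - scores.max())
p_eta /= p_eta.sum()                     # p_eta(z|x), softmax over top-k

# Hypothetical generator likelihoods p_theta(y*|x,z) of the gold answer
p_gen = np.array([0.50, 0.30, 0.05])

# Negative log marginal likelihood: -log sum_z p_eta(z|x) * p_theta(y*|x,z)
loss = -np.log(np.sum(p_eta * p_gen))
print(loss)
```

Because the loss depends on p_η only through the query-side scores, lowering it can both sharpen the generator and teach the query encoder to rank helpful passages higher, all without touching the frozen index.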
8. Results
At publication time, RAG achieved state-of-the-art results on multiple open-domain QA benchmarks, outperforming purely parametric models of much larger size.
| Benchmark | Metric | Prior SOTA | RAG |
|---|---|---|---|
| NaturalQuestions | EM | 41.5 | 44.5 |
| TriviaQA | EM | 67.8 | 68.0 |
| WebQuestions | EM | 41.5 | 45.5 |
| CuratedTrec | EM | 50.6 | 72.1 |
EM = Exact Match. RAG-Sequence scores reported for NQ and TriviaQA; RAG-Token for WebQ and CuratedTrec.
9. Why RAG Matters Today
RAG has become one of the most practically impactful ideas from NLP research. The retrieval-augmented paradigm is now the default architecture for production LLM applications that require factual accuracy and up-to-date knowledge.
Knowledge stays fresh
Updating the retrieval corpus (e.g., adding new documents) immediately updates the model's effective knowledge without any gradient update.
Hallucination reduction
Grounding generation in retrieved text reduces confabulation: the model's output is anchored to actual documents.
Interpretability
Unlike purely parametric models, RAG's retrieved passages give a direct audit trail: you can see exactly what the model read.
Domain specialization
Swapping the corpus enables instant domain adaptation (medical RAG, legal RAG, code RAG) using the same trained model.
Modern RAG systems extend the original paper in many ways: better retrievers (e.g., ColBERT, E5, BGE), hybrid search (dense + BM25 sparse), re-ranking, multi-hop retrieval, and long-context generation. But the core idea, separating retrieval from generation and marginalizing over retrieved documents, comes directly from this 2020 paper.