Retrieval-Augmented Generation (RAG)

Although language models are becoming more capable, providing provenance for their outputs and updating their knowledge remain problematic. To address this, researchers introduced the retrieval-augmented generation (RAG) approach in the paper “Retrieval-augmented generation for knowledge-intensive NLP tasks” (Lewis et al., 2020), as a means to supplement and update the parametric knowledge of pre-trained language models with non-parametric, retrieval-based knowledge.

Advantages of RAG:

  • Reduce hallucinations: By leveraging non-parametric knowledge (i.e. a corpus of passages, such as those from Wikipedia) when generating an answer to a given question, the RAG model grounds the generation more strongly in real factual knowledge (per the corpus), thereby reducing hallucinations.
  • Introduce knowledge via the retrieval corpus: Using different retrieval corpora allows one to focus the generated answers on different domains, e.g. medical or financial, or on knowledge that typically would not be found in the parametric pre-trained models, e.g. company documents.

Here are the main points of the RAG approach proposed in the paper:

  • Non-parametric memory: One of the two main components is a dense neural retriever, which provides latent Wikipedia documents conditioned on the input. Here, the authors used their previous work, dense passage retrieval (DPR).
  • Parametric memory: The other main component is a pre-trained seq2seq transformer. Here, the authors used the BART seq2seq model. In RAG, this seq2seq model conditions on the latent documents retrieved by DPR together with the input to generate the output.
  • Like T5 or BART, the RAG model (i.e. both the retriever and the seq2seq) can be jointly fine-tuned on any seq2seq task.

Models

Retriever: DPR

DPR uses a bi-encoder architecture: \(p_{\text{DPR}}(z|x) \propto \exp(d(z)^{\top} q(x))\)

  • where $d(z) = \text{BERT}_{d}(z)$ is a dense document representation by the BERT-base document encoder.
  • where $q(x) = \text{BERT}_q(x)$ is a dense query representation by the BERT-base query encoder.
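To make the scoring concrete, here is a small toy sketch of the bi-encoder idea in PyTorch. The encoders below are random stand-ins for the two fine-tuned BERT-base models (an assumption made purely for illustration); the point is only the inner-product scoring and the softmax over the retrieved top-k documents.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim = 768  # BERT-base hidden size

def encode_query(query: str) -> torch.Tensor:
    # Stand-in for BERT_q(x): in DPR this is a fine-tuned BERT-base query
    # encoder; here a random vector is returned just for illustration.
    return torch.randn(hidden_dim)

def encode_document(doc: str) -> torch.Tensor:
    # Stand-in for BERT_d(z): the document encoder, same caveat as above.
    return torch.randn(hidden_dim)

documents = [
    "Passage about the Eiffel Tower.",
    "Passage about photosynthesis.",
    "Passage about the French Revolution.",
]

q = encode_query("When was the Eiffel Tower built?")
D = torch.stack([encode_document(d) for d in documents])  # (num_docs, 768)

# Bi-encoder score: inner product d(z)^T q(x); p_DPR(z|x) ∝ exp(score)
scores = D @ q                                            # (num_docs,)
top_scores, top_idx = scores.topk(2)

# Normalize only over the retrieved top-k documents, as RAG does downstream
p_dpr = F.softmax(top_scores, dim=0)
for p, i in zip(p_dpr.tolist(), top_idx.tolist()):
    print(f"{p:.3f}  {documents[i]}")
```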

Generator: BART

The authors used BART-large as the generator component: $p_{\theta}(y_i | x, z, y_{1: i-1})$. To combine the input $x$ with the retrieved document $z$ when generating from BART, the authors simply concatenate them.
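As a rough illustration of that concatenation, the sketch below feeds one retrieved passage plus a question into an off-the-shelf BART-large via the Hugging Face transformers interface. The separator string and the use of a non-fine-tuned BART are my assumptions for illustration only; the paper's exact input template and fine-tuned weights are not reproduced here.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

retrieved_doc = (
    "The Eiffel Tower was constructed from 1887 to 1889 as the centerpiece "
    "of the 1889 World's Fair in Paris."
)
question = "When was the Eiffel Tower built?"

# Simple concatenation of retrieved passage z and input x; the " // "
# separator is a placeholder, not the paper's verbatim format.
inputs = tokenizer(retrieved_doc + " // " + question, return_tensors="pt")
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```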

RAG-Token Model

In the paper, the authors explored a RAG-Sequence model and a RAG-Token model. Here, we will describe just the RAG-Token model: \(p_{\text{RAG-Token}}(y|x) \approx \prod\limits_{i=1}^{N} \sum\limits_{z \in \text{top-}k(p(\cdot | x))} p_{\text{DPR}}(z|x) \, p_{\theta}(y_i | x, z, y_{1:i-1})\)

  • Given an input $x$, the top-$k$ documents are retrieved using the DPR retriever, each with a score $p_{\text{DPR}}(z|x)$.
  • The generator produces a distribution $p_{\theta}$ over the next output token for each retrieved document; these per-document distributions are then marginalized using the retrieval scores, as in the equation above (see the toy sketch below).
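The following toy example spells out this per-token marginalization with made-up retriever and generator probabilities (no real models are involved); it simply mirrors the RAG-Token equation above.

```python
import torch

torch.manual_seed(0)
k, seq_len, vocab = 3, 4, 10          # top-k docs, output length, toy vocab size

# p_DPR(z|x): retriever scores over the top-k documents (sums to 1)
doc_probs = torch.softmax(torch.randn(k), dim=0)                 # (k,)

# p_theta(y_i | x, z, y_{1:i-1}): generator's per-token distributions,
# one distribution per (document, position) pair
token_probs = torch.softmax(torch.randn(k, seq_len, vocab), dim=-1)

# RAG-Token: marginalize over documents at every output position
# p(y_i | x, y_{1:i-1}) = sum_z p_DPR(z|x) * p_theta(y_i | x, z, y_{1:i-1})
marginal = (doc_probs[:, None, None] * token_probs).sum(dim=0)   # (seq_len, vocab)

# Probability of one particular output sequence y = (y_1, ..., y_N)
y = torch.tensor([2, 7, 1, 4])
p_y = marginal[torch.arange(seq_len), y].prod()
print(f"p_RAG-Token(y|x) = {p_y.item():.6f}")
```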

Training

Given a corpus of input-output example pairs $(x_j, y_j)$, the authors minimize the negative log-likelihood $\sum_{j} - \log p(y_j | x_j)$.

  • The authors note that updating the document encoder $\text{BERT}_d$ during training is costly as it requires the document index to be periodically updated. The authors did not find this step to be necessary for strong performance, and thus kept the document encoder (and index) fixed, only fine-tuning the query encoder $\text{BERT}_q$ and the BART generator.
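A schematic training step under these choices might look like the sketch below. The Linear layers are tiny stand-ins for $\text{BERT}_q$, $\text{BERT}_d$, and BART (an assumption made purely to keep the example runnable); the point is that the document encoder stays frozen while only the query encoder and the generator receive gradients, and that the loss is the negative log-likelihood.

```python
import torch

torch.manual_seed(0)

# Tiny stand-ins for the real components (illustration only)
query_encoder = torch.nn.Linear(16, 8)   # BERT_q: trainable
doc_encoder   = torch.nn.Linear(16, 8)   # BERT_d: used offline to build the index
generator     = torch.nn.Linear(8, 32)   # BART stand-in: trainable

# Freeze the document encoder so the document index never needs re-embedding
for p in doc_encoder.parameters():
    p.requires_grad = False

# Optimize only the query encoder and the generator, as in the paper
optimizer = torch.optim.Adam(
    list(query_encoder.parameters()) + list(generator.parameters()), lr=1e-4
)

# One schematic step on a fake (x_j, y_j) pair
x = torch.randn(1, 16)                   # placeholder input features
y = torch.tensor([3])                    # placeholder target token

logits = generator(query_encoder(x))     # placeholder for p(y_j | x_j)
loss = torch.nn.functional.cross_entropy(logits, y)   # = -log p(y_j | x_j)

loss.backward()
optimizer.step()
print(f"NLL for this example: {loss.item():.3f}")
```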

Experiments

Abstractive QA: One of the experiments conducted in the paper is abstractive QA on the MSMARCO NLG dataset, which consists of questions, ten gold passages retrieved from a search engine for each question, and a full-sentence answer annotated from the retrieved passages. The RAG authors used only the questions and answers (skipping the supplied passages), thereby treating MSMARCO as an open-domain abstractive QA task, with Wikipedia implicitly serving as the open-domain knowledge source.

Fact verification: The FEVER dataset requires classifying whether a natural language claim is supported or refuted by Wikipedia. Given a claim, the task requires retrieving evidence from Wikipedia and then reasoning over this evidence to classify the claim as true, false, or unverifiable from Wikipedia alone.

Written on August 27, 2023