Dense Passage Retrieval (DPR)

To retrieve relevant passages for answering questions, traditional methods rely on sparse vector space representations based on TF-IDF or BM25. In the paper “Dense Passage Retrieval for Open-Domain Question Answering”, published in 2020, the authors show that using a BERT transformer to encode both the question and the passages, then fine-tuning the encoder weights to maximize the dot-product similarity between positive question-passage pairs, results in a “dense passage retrieval” model that significantly outperforms BM25.

DPR

Specifically, the role of a retriever is to take as input a question $q$ and a corpus $C$ of passages, and return a much smaller filtered set of passages $C_{F}$, where $|C_{F}| = k \ll |C|$. The paper builds the retriever as follows:

  • Let a training instance be defined as $\langle q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-} \rangle$, which consists of a question $q_i$, one positive passage $p_i^{+}$, and $n$ irrelevant (negative) passages $p_{i,j}^{-}$.
  • The loss function is defined as the negative log-likelihood of the positive passage: \(L(q_i, p_i^{+}, p_{i,1}^{-}, \ldots, p_{i,n}^{-}) = - \log \frac{e^{\text{sim}(q_i, p_i^{+})}}{e^{\text{sim}(q_i, p_i^{+})} + \sum\limits_{j=1}^{n} e^{\text{sim}(q_i, p_{i,j}^{-})}}\)
    • $\text{sim}(q, p) = E_{Q}(q)^{T} E_{P}(p)$, where $E_Q$ and $E_P$ are two independent BERT models used to encode each question $q$ and passage $p$ to a $d$ dimensional embedding.
    • In lieu of the above negative log-likelihood (NLL) loss based on softmax, the authors also tried a triplet loss and Euclidean distance (L2), and concluded that NLL performs competitively.
  • In-batch negatives: Each training batch consists of a set of $(q_i, p_i^{+})$ pairs. For each $q_i$, the positive passages $p_j^{+}$ ($j \neq i$) of the other questions in the same batch serve as its negative examples (a sketch of this loss appears right after this list).
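
To make the loss concrete, below is a minimal PyTorch sketch (not the authors' code) of the in-batch negative NLL loss, assuming the question and passage embeddings have already been produced by the two BERT encoders $E_Q$ and $E_P$:

```python
import torch
import torch.nn.functional as F

def in_batch_nll_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """NLL loss with in-batch negatives.

    q_emb: (B, d) question embeddings from E_Q.
    p_emb: (B, d) positive-passage embeddings from E_P; p_emb[i] is the
           positive for q_emb[i], and every other row acts as a negative.
    """
    # sim(q_i, p_j) = E_Q(q_i)^T E_P(p_j) for all pairs in the batch -> (B, B)
    scores = q_emb @ p_emb.T
    # The gold (positive) passage for question i sits on the diagonal.
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    # Softmax over each row, then negative log-likelihood of the diagonal entry.
    return F.cross_entropy(scores, labels)

# Toy usage with random embeddings standing in for the two BERT encoders.
q = torch.randn(16, 768)
p = torch.randn(16, 768)
loss = in_batch_nll_loss(q, p)
```

With a batch of size $B$, each question gets $B-1$ negatives essentially for free, since the passage embeddings are already computed for the batch.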

Experiments

To evaluate the performance of the proposed retrieval model, the authors require two components:

  • Corpus: The authors used the Wikipedia dump from Dec. 20, 2018 as the corpus of passages. Each Wikipedia article was cleaned and split into blocks of 100 words as passages, resulting in 21 million passages. Each passage is also prepended with the title of the Wikipedia article from which it was drawn (a preprocessing sketch appears after the dataset list below).
  • Evaluation datasets: The authors evaluated on five QA datasets:
    • Natural Questions (NQ): questions were mined from real Google search queries and the answers are spans in Wikipedia articles.
    • TriviaQA: a set of trivia questions with answers scraped from the Web.
    • WebQuestions (WQ): questions selected using Google Suggest API, where the answers are entities in Freebase.
    • CuratedTREC (TREC): questions come from TREC QA tracks as well as various Web sources, and the set is intended for open-domain QA.
    • SQuAD v1.1: a popular dataset for reading comprehension.
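
As an illustration of the corpus preprocessing described above, here is a small sketch of how one might split a cleaned article into 100-word passages with the title attached; the helper name and the way the title is joined to the text are assumptions for illustration, not the authors' exact code:

```python
def make_passages(title: str, article_text: str, block_size: int = 100):
    """Split a cleaned Wikipedia article into fixed-size word blocks,
    each carrying the article title (illustrative helper only)."""
    words = article_text.split()
    for start in range(0, len(words), block_size):
        block = " ".join(words[start:start + block_size])
        # Prepend the article title so the passage encoder sees it as context.
        yield f"{title}: {block}"

passages = list(make_passages("Dense passage retrieval", "Some cleaned article text ..."))
```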

Index Building and Run-time Efficiency

The authors profiled their experiments on a server with Intel Xeon CPU E5-2698 v4 @ 2.20GHz and 512GB memory.

Computing dense embeddings for the 21 million passages took 8.8 hours on 8 GPUs. Building the FAISS index on 21 million vectors on a single server took 8.5 hours. In comparison, if one were to take the traditional approach of building an inverted index using Lucene, it would take just 30 minutes.

For passage retrieval, using an in-memory FAISS index over real-valued vectors, DPR processed 995 questions per second, returning the top-100 passages per question. In contrast, BM25/Lucene (implemented in Java, using a file-based index) processed 23.7 questions per second per CPU thread.

  • FAISS configuration: HNSW index type on CPU, neighbors to store per node = 512, construction-time search depth = 200, search depth = 128 (a configuration sketch follows).
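
For reference, below is a minimal sketch of building such an index with the FAISS Python API. The mapping of “neighbors to store per node” onto the constructor's second argument and the use of small random vectors in place of the 21 million passage embeddings are assumptions for illustration; the conversion of dot-product similarity into HNSW's default L2 metric is omitted here.

```python
import faiss
import numpy as np

d = 768                                # embedding dimensionality from the BERT encoders
index = faiss.IndexHNSWFlat(d, 512)    # 512 ~ "neighbors to store per node" (assumed mapping)
index.hnsw.efConstruction = 200        # construction-time search depth
index.hnsw.efSearch = 128              # query-time search depth

# Toy stand-in for the passage embeddings, then retrieve top-100 per query.
passage_vecs = np.random.rand(10_000, d).astype("float32")
index.add(passage_vecs)
query_vecs = np.random.rand(4, d).astype("float32")
scores, ids = index.search(query_vecs, 100)
```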
Written on August 20, 2023