ColBERTv2 - Efficient Passage Retrieval
ColBERTv2 is a neural retrieval model that follows up on the 2020 ColBERT work. The architecture remains similar, but ColBERTv2 adds compression techniques that reduce storage by 6-10x compared to ColBERT. It was introduced in the paper “ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction”, published in 2022.
Queries and passages are encoded independently with BERT, and each BERT token embedding is projected to a lower dimension. Passage encoding is done offline, and every passage is indexed as a set of these lower-dimensional vectors.
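At search time, given a query $q$, its similarity to a passage $d$ is computed as follows, where $Q$ is the matrix encoding the query with $N$ vectors and $D$ is the matrix encoding the passage with $M$ vectors: \(S_{q,d} = \sum_{i=1}^{N} \max_{j=1}^{M} Q_i \cdot D_j^T\) In other words, each query vector is matched to its most similar passage vector, and these maxima are summed over the query vectors.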
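The scoring rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `maxsim_score` and the toy 2-dimensional vectors are made up for the example, and real token embeddings would come from the BERT encoder and projection described earlier.

```python
import numpy as np

def maxsim_score(Q: np.ndarray, D: np.ndarray) -> float:
    """Sum, over query vectors, of the max dot product with any passage vector.

    Q has shape (N, dim) and D has shape (M, dim).
    """
    sim = Q @ D.T                 # (N, M) matrix of all pairwise dot products
    return float(sim.max(axis=1).sum())

# Toy inputs (hypothetical values, not real embeddings).
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])        # N = 2 query vectors
D = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, -1.0]])       # M = 3 passage vectors

score = maxsim_score(Q, D)
print(score)                      # 1.0 (first query vector) + 0.5 (second) = 1.5
```

Because the passage matrix $D$ does not depend on the query, it can be computed and stored ahead of time, which is what makes the offline indexing step possible.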
Model Details:
- For each training query, retrieve the top-$k$ passages using ColBERT. This provides hard-negative training examples.
- Each of the above $k$ query-passage pairs is fed into a cross-encoder to re-rank the passages (w.r.t. the query).
- The cross-encoder is a MiniLM model (a 22M-parameter distilled model from MSRA, which MSRA reports as performing better than DistilBERT) trained on the MS-MARCO dataset.
- Using the re-ranking scores, the authors aggregate 64 passages per query example. These 64 passages consist of (i) a highly ranked passage (or a positive passage), (ii) one or more lower-ranked passages. Each query and its 64 passages are trained together as one unit; a single batch may contain multiple queries, each with its own 64 passages.
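- The authors train jointly by summing two losses: (i) a KL-divergence loss that distills the cross-encoder's scores into ColBERT, (ii) a cross-entropy loss on the ColBERT architecture's own scores.
- The cross-entropy loss is applied only to the positive passage of each query, scored against in-batch negatives (the passages belonging to the other queries in the same batch).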
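The two-part training objective described in the list above can be sketched as follows. This is a simplified illustration, not the authors' training code: the helper names are hypothetical, column 0 is assumed to hold each query's positive passage, and the cross-entropy term is computed over each query's own candidate list rather than over passages from other queries in the batch.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def kl_distillation_loss(teacher: np.ndarray, student: np.ndarray) -> float:
    """KL(teacher || student) over each query's passages, averaged over queries.

    Both arrays have shape (num_queries, num_passages): cross-encoder scores
    (teacher) and ColBERT scores (student) for the same query-passage pairs.
    """
    log_p = log_softmax(teacher)
    log_q = log_softmax(student)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def positive_cross_entropy(student: np.ndarray, positive_index: int = 0) -> float:
    """Cross-entropy that pushes up the positive passage's score (assumed column 0)."""
    return float(-log_softmax(student)[:, positive_index].mean())

def joint_loss(teacher: np.ndarray, student: np.ndarray) -> float:
    """Sum of the two losses; equal weighting is an assumption of this sketch."""
    return kl_distillation_loss(teacher, student) + positive_cross_entropy(student)

# Hypothetical scores for 2 queries with 4 passages each (not 64, for brevity).
teacher_scores = np.array([[5.0, 1.0, 0.0, -2.0],
                           [4.0, 3.0, -1.0, -3.0]])
student_scores = np.array([[2.0, 1.5, 0.5, 0.0],
                           [1.0, 2.0, 0.0, -1.0]])
print(joint_loss(teacher_scores, student_scores))
```

The KL term is zero when ColBERT's score distribution over a query's passages matches the cross-encoder's, so it transfers the full ranking rather than only a binary positive/negative label.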
Efficient vector representation:
- Define a residual vector $r$ as $r = v - C_t$, where $v$ is the original vector and $C_t$ is the centroid closest to $v$.
- To reduce space requirements, the authors quantize every dimension of $r$ from 16 or 32 bits down to 2 bits, producing a quantized vector $\tilde{r}$. Each vector is then stored as the centroid index $t$ together with $\tilde{r}$, and can be approximated at search time as $\tilde{v} = C_t + \tilde{r}$.
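A minimal sketch of the residual compression described above is shown below. Several details are assumptions made for illustration: the function names, the toy centroids, and in particular the uniform bucket boundaries over a fixed range `[-r_max, r_max]`, which may differ from how the authors choose their quantization buckets.

```python
import numpy as np

def compress(v, centroids, bits=2, r_max=0.5):
    """Return (centroid index t, per-dimension bucket codes for the residual)."""
    t = int(np.argmin(((centroids - v) ** 2).sum(axis=1)))  # closest centroid
    r = v - centroids[t]                                    # residual r = v - C_t
    edges = np.linspace(-r_max, r_max, 2 ** bits + 1)       # uniform buckets (assumption)
    codes = np.digitize(r, edges[1:-1])                     # integers in [0, 2**bits - 1]
    return t, codes

def decompress(t, codes, centroids, bits=2, r_max=0.5):
    """Approximate the original vector as C_t plus the bucket midpoints."""
    edges = np.linspace(-r_max, r_max, 2 ** bits + 1)
    mids = (edges[:-1] + edges[1:]) / 2                     # one value per bucket
    return centroids[t] + mids[codes]

# Toy data (hypothetical): two centroids in 2 dimensions.
centroids = np.array([[0.0, 0.0],
                      [1.0, 1.0]])
v = np.array([0.9, 1.2])

t, codes = compress(v, centroids)
v_hat = decompress(t, codes, centroids)
print(t, codes, v_hat)   # 1 [1 2] [0.875 1.125]
```

As a rough size check, assuming 128-dimensional vectors, 2 bits per dimension take 32 bytes for the residual plus a few bytes for the centroid index, compared with 256 bytes at 16 bits per dimension.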
Comparison to ColBERT:
- ColBERTv2's results on BEIR are almost always better than ColBERT's.
- In terms of storage, ColBERTv2 needs 6-10x less space than ColBERT.