Sentence Embeddings Using Siamese Networks and all-mpnet-base-v2

A sentence embedding is a single vector that captures the semantic meaning of a piece of text, usually a single sentence or a paragraph. To derive an effective sentence embedding, the EMNLP-2019 paper “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks” proposed leveraging a Transformer-based Siamese network.

When BERT was introduced, it included a [CLS] token, which was used in the BERT paper as a representation of the entire input sentence. In the Sentence-BERT paper, the authors show that [CLS] is not an effective sentence representation.

Sentence Embeddings Using a Siamese Network

To derive an effective sentence embedding, the authors of the Sentence-BERT paper proposed training a Siamese network as follows:

  • The input is a pair of text pieces, e.g. a pair of sentences.
  • Take the first sentence and apply a pretrained Transformer such as BERT or RoBERTa. Using the output representation of each token from the last Transformer layer, apply mean pooling (compute the mean of all output vectors) to derive a fixed-size vector $u$ representing the input sentence. Likewise, compute a representation $v$ for the other input sentence (see the mean-pooling sketch after this list).
  • Note that the same Transformer network is used to derive representations for both sentences in the input example, thus a “Siamese” network.
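
A minimal sketch of this mean-pooling step, assuming a generic BERT checkpoint from Hugging Face transformers (the checkpoint name and the example sentences are placeholder assumptions, not from the paper):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; the paper uses BERT/RoBERTa-style encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool(sentences):
    """Encode sentences and mean-pool the last-layer token vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)
    # Zero out padding positions before averaging.
    mask = batch["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)               # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                    # (batch, 1)
    return summed / counts                                      # fixed-size sentence vectors

# The same encoder produces both u and v, hence the "Siamese" setup.
u, v = mean_pool(["The cat sits on the mat.", "A feline rests on a rug."])
```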

Using the computed sentence representations $u$ and $v$, the authors then tried a few different objective functions to train the Siamese network, e.g.:

  • Cross-entropy classification: $\text{softmax}(W_t (u, v, |u - v|))$, where $(u, v, |u - v|)$ is the concatenation of $u$, $v$, and their element-wise absolute difference, $W_t$ is a trainable weight matrix with dimension $3n \times k$, $n$ is the dimension of $u$ and $v$, and $k$ is the number of labels.
  • Regression mean-squared-error: let $\hat{d} = \text{cosine-sim}(u, v)$. Then apply mean-squared-error loss as the objective function: $\frac{1}{N} \sum_{i=1}^{N}(d_i - \hat{d}_i)^2$, where the gold label $d$ could be either $1$ (positive pair) or $-1$ (negative pair). A PyTorch sketch of both objectives follows this list.
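
As a concrete illustration, here is a minimal PyTorch sketch of the two objectives; the embedding size, label count, and random inputs are assumptions for the example, not the authors' actual training setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, k = 768, 3                            # assumed embedding size and number of labels
W_t = nn.Linear(3 * n, k, bias=False)    # plays the role of the 3n x k matrix W_t

def classification_loss(u, v, labels):
    """Softmax classification over the concatenation (u, v, |u - v|)."""
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)  # (batch, 3n)
    logits = W_t(features)                                   # (batch, k)
    return F.cross_entropy(logits, labels)                   # softmax + cross-entropy

def regression_loss(u, v, gold):
    """Mean-squared error between cosine-sim(u, v) and the gold score d."""
    d_hat = F.cosine_similarity(u, v, dim=-1)                # (batch,)
    return F.mse_loss(d_hat, gold)

# Toy usage: random vectors stand in for the pooled sentence embeddings.
u, v = torch.randn(4, n), torch.randn(4, n)
loss_cls = classification_loss(u, v, torch.tensor([0, 1, 2, 1]))
loss_reg = regression_loss(u, v, torch.tensor([1.0, -1.0, 1.0, -1.0]))
```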

Updated Sentence Embeddings Model: all-mpnet-base-v2

In a follow-on work in 2021, Hugging Face organized a project “Train the Best Sentence Embedding Model Ever with 1B Training Pairs”, and the resulting best English sentence embedding model is “all-mpnet-base-v2”, available at https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (a minimal usage sketch follows the list below):

  • It uses the pretrained MPNet base model, whose pretraining objective had earlier been shown to outperform both masked language modeling (MLM) as used in BERT and the permuted language modeling (PLM) pretraining objective.
  • The model was fine-tuned on a combination of multiple datasets, which aggregate to more than 1B sentence pairs.
  • The above URL from Hugging Face mentions that the model was fine-tuned using a contrastive objective, which might mean (see the sketch after this list): \(y d^2 + (1 - y) \max(\epsilon - d, 0)^2\)
    • where $y=1$ if given a positive pair, and $0$ if given a negative pair.
    • $d$ is the distance between the sentence pair representations, e.g. the cosine distance $1 - \text{cosine-sim}(u, v)$, or the Euclidean distance $\sqrt{\sum_{i} (u_i - v_i)^2}$, where $i$ ranges over the components/dimensions of $u$ and $v$.
    • $\epsilon$ is a margin term that tightens the constraint: for a negative pair of sentences, their distance should be at least $\epsilon$, or a loss is incurred.
  • However, the page also mentions that the cosine similarity between sentence pairs is calculated and then cross-entropy (CE) loss is applied. CE loss is: $- \sum_{c \in C} y_c \log(p_c)$
    • $C$ is the set of class labels.
    • $y_c$ is an indicator variable which equals $1$ if the example has true class label $c$.
    • $p_c$ is the predicted probability of the example being of class $c$.
  • However, combining the two descriptions, i.e. a contrastive objective where cosine similarity is calculated and CE loss is applied, it most likely refers to:
    • TBD
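
For reference, the contrastive formula quoted above can be written out as a short function. This only illustrates that equation, using Euclidean distance and an arbitrary margin value; it is not claimed to be the exact loss used to train all-mpnet-base-v2:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(u, v, y, epsilon=0.5):
    """y * d^2 + (1 - y) * max(epsilon - d, 0)^2 with Euclidean distance d."""
    d = torch.norm(u - v, dim=-1)                                # pairwise Euclidean distance
    return (y * d.pow(2) + (1.0 - y) * F.relu(epsilon - d).pow(2)).mean()

# Toy usage: random vectors stand in for sentence embeddings u and v.
u, v = torch.randn(4, 768), torch.randn(4, 768)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])                           # 1 = positive pair, 0 = negative pair
print(contrastive_loss(u, v, y))
```

Regardless of the exact training objective, the released model itself can be used directly through the sentence-transformers library; a minimal usage sketch (the example sentences are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

# Load the released model from the Hugging Face Hub.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

sentences = ["The cat sits on the mat.", "A feline rests on a rug."]
embeddings = model.encode(sentences)                 # shape: (2, 768)

# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]))
```
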
Written on February 23, 2023