DeBERTa-v3

In March 2023, researchers from Microsoft proposed combining DeBERTa and ELECTRA to produce DeBERTa-v3, describing their approach in the paper “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing”. The paper adopts ELECTRA's replaced token detection (RTD) pre-training objective, modifies how the token embeddings are shared between ELECTRA's two networks, and combines this with DeBERTa's disentangled attention. The authors showed that DeBERTa-v3 outperforms BERT, DeBERTa, RoBERTa, and XLM-R.

Replaced Token Detection (RTD)

BERT uses a single Transformer encoder trained with masked language modeling (MLM). In contrast, ELECTRA jointly trains two Transformer encoders:

  • Generator $\theta_{G}$: trained with MLM to generate ambiguous tokens to replace masked tokens in the input sequence
  • Discriminator $\theta_{D}$: trained as a token-level binary classifier. Given the corrupted input sequence produced by the generator, the discriminator must decide whether each token is (i) an original token or (ii) a token replaced by the generator. The replaced token detection (RTD) training objective is:

\(L_{RTD} = \mathbb{E} \left( - \sum_{i} \log p_{\theta_{D}} \left( \mathbb{1}(\tilde{x}_{i, D} = x_{i}) \mid \tilde{X}_{D}, i \right) \right)\)

  • $\tilde{X}_{D}$ is the input sequence to the discriminator, constructed by replacing the masked tokens with tokens sampled from the generator; the sum runs over all token positions $i$ in this sequence. A minimal sketch of this loss is given below.
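The following is a minimal PyTorch sketch of the RTD loss (hypothetical function and argument names, not the DeBERTa-v3 or ELECTRA implementation): the discriminator emits one logit per token, and the label records whether that token still matches the original sequence.

```python
import torch
import torch.nn.functional as F

def rtd_loss(disc_logits, corrupted_ids, original_ids, attention_mask):
    """Replaced token detection loss -- a sketch, not the authors' code.

    disc_logits:    (batch, seq_len) per-token logits from the discriminator
    corrupted_ids:  (batch, seq_len) sequence after the generator's replacements
    original_ids:   (batch, seq_len) the original, uncorrupted sequence
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # 1(x~_{i,D} = x_i): label is 1 where the token still matches the original,
    # 0 where the generator replaced it.
    labels = (corrupted_ids == original_ids).float()

    per_token = F.binary_cross_entropy_with_logits(disc_logits, labels, reduction="none")
    mask = attention_mask.float()
    # Average over non-padding positions.
    return (per_token * mask).sum() / mask.sum()
```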

Gradient-Disentangled Embedding Sharing (GDES)

In ELECTRA, the generator and discriminator share token embeddings, which reduces the number of parameters to learn. Let $E$ and $g_{E}$ denote the token embeddings and their gradients, respectively. In ELECTRA, $g_{E}$ is calculated as: \(g_{E} = \frac{\partial L_{MLM}}{\partial E} + \lambda \frac{\partial L_{RTD}}{\partial E}\)

  • The above equation means that the token embeddings are updated by balancing the gradients from the two tasks: the generator’s MLM and the discriminator’s RTD. This creates a tug-of-war that can be inefficient, because the two tasks pull the embeddings in different directions (a toy illustration of this shared-gradient setup follows this list):
    • MLM encourages the embeddings of semantically similar tokens to be close to each other.
    • RTD tries to separate them to make the binary classification of each token (predict whether the token is original, or was replaced by the generator) easier.
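To make the shared-gradient issue concrete, here is a toy PyTorch sketch (an assumed, simplified setup where both heads read the same hidden states, not ELECTRA's actual two-encoder architecture): a single embedding table feeds both an MLM head and an RTD head, so after backpropagation its gradient is the weighted sum of the gradients from the two losses, as in the equation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, lam = 100, 16, 50.0          # lam plays the role of lambda above
shared_E = nn.Embedding(vocab, dim)      # E, shared by both heads
gen_head = nn.Linear(dim, vocab)         # stands in for the MLM generator
disc_head = nn.Linear(dim, 1)            # stands in for the RTD discriminator

tokens = torch.randint(0, vocab, (4, 8))
h = shared_E(tokens)

mlm_loss = F.cross_entropy(gen_head(h).reshape(-1, vocab), tokens.reshape(-1))
rtd_labels = torch.randint(0, 2, (4, 8)).float()   # dummy original-vs-replaced labels
rtd_loss = F.binary_cross_entropy_with_logits(disc_head(h).squeeze(-1), rtd_labels)

(mlm_loss + lam * rtd_loss).backward()
# shared_E.weight.grad now mixes dL_MLM/dE and lam * dL_RTD/dE:
# the two objectives pull the same embedding table in different directions.
```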

The authors proposed the GDES approach:

  • The token embeddings between the generator and discriminator are still shared.
  • However, the RTD loss is not allowed to flow back into the shared embeddings: they are updated only by the generator’s MLM loss, which keeps the generator’s output consistent and coherent. The discriminator instead reads a stop-gradient copy of the shared embeddings plus a separate residual (“delta”) embedding, and this residual is the only embedding the RTD loss updates (see the sketch below).
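A minimal PyTorch sketch of this embedding setup, assuming the discriminator embedding is formed as stop_gradient(E_G) + E_delta (my reading of the paper's description, not the released code):

```python
import torch
import torch.nn as nn

class GDESEmbedding(nn.Module):
    """Gradient-disentangled embedding sharing -- a sketch under stated assumptions."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.E_G = nn.Embedding(vocab_size, dim)      # shared table, updated by MLM only
        self.E_delta = nn.Embedding(vocab_size, dim)  # residual table, updated by RTD only
        nn.init.zeros_(self.E_delta.weight)           # discriminator starts from E_G exactly

    def generator_embed(self, ids):
        # Generator path: gradients from the MLM loss flow into E_G.
        return self.E_G(ids)

    def discriminator_embed(self, ids):
        # Discriminator path: detach() acts as the stop-gradient, so the RTD loss
        # cannot reach E_G and only updates the residual E_delta.
        return self.E_G(ids).detach() + self.E_delta(ids)
```

During pre-training the discriminator still benefits from the generator’s embeddings (they are added into its input), but its RTD gradients land only in the residual table, so the tug-of-war described above is avoided.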
Written on September 10, 2023