MPNet - Masked and Permuted Language Modeling
A new pretraining method, Masked and Permuted Language Modeling (MPNet), was introduced in the paper “MPNet: Masked and Permuted Pre-training for Language Understanding”, published in April 2020. It is meant to fuse masked language modeling (MLM) and permuted language modeling (PLM) while addressing the deficiencies of each. Experiments show that it outperforms both MLM and PLM and achieves SOTA performance on various NLP benchmarks such as GLUE and SQuAD.
Advantage of MPNet over MLM and PLM
The main advantage of MPNet over MLM and PLM is that it conditions on more information when predicting a masked token, as illustrated in the following table. Given the sentence “the task is sentence classification” and masking the two words [sentence, classification], these pretraining objectives perform the following factorizations:
| Model | Factorization |
|---|---|
| MLM | log P(sentence \| the task is [M] [M]) + log P(classification \| the task is [M] [M]) |
| PLM | log P(sentence \| the task is) + log P(classification \| the task is sentence) |
| MPNet | log P(sentence \| the task is [M] [M]) + log P(classification \| the task is sentence [M]) |
As shown above, MPNet has the following advantages:
- It conditions on the position information of every token in the sentence, e.g. the model knows that there are two missing tokens left to predict. Contrast this with PLM, which does not know how many tokens the full sentence contains.
- When predicting a token, MPNet conditions on all of its preceding tokens in the permuted sequence, whether or not those tokens were selected for masking. Contrast this with MLM, which does not condition on the other masked tokens. (A sketch of the resulting objective follows this list.)
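Putting the two points together: MPNet permutes the sequence, predicts the tokens at the end of the permutation autoregressively, and appends mask symbols for the not-yet-predicted positions so that full position information is always visible. A rough sketch of the training objective, with notation paraphrased from the paper (z is a permutation of 1..n, c is the number of non-predicted tokens), rather than quoted verbatim:

```latex
% Sketch of the MPNet objective (notation paraphrased from the paper):
% x_{z_{<t}} are the tokens preceding position t in the permuted order,
% and M_{z_{>c}} are mask symbols placed at the predicted positions,
% which is what gives the model full position information.
\[
  \mathbb{E}_{z \sim \mathcal{Z}_n}
  \sum_{t = c+1}^{n}
  \log P\!\left( x_{z_t} \,\middle|\, x_{z_{<t}},\, M_{z_{>c}};\, \theta \right)
\]
```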
The following table shows how much information each pretraining objective uses, on average, to predict a masked token, assuming 15% of the tokens are masked:
| Objective | #Tokens | #Positions |
|---|---|---|
| MLM | 85% | 100% |
| PLM | 92.5% | 92.5% |
| MPNet | 92.5% | 100% |
- MLM conditions on 85% of the tokens and 100% of the positions, since the [M] tokens still carry position information.
- PLM conditions on the 85% of unmasked tokens and, on average, each masked token also conditions on half of the masked tokens (those that precede it in the permuted order). This gives 85% + 15%/2 = 92.5% of the tokens, and the same 92.5% of the positions, on average.
- MPNet likewise conditions on the 85% of unmasked tokens plus, on average, half of the masked tokens, but in addition it sees the position information of every token in the sequence (see the short calculation after this list).
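The arithmetic behind these numbers is easy to verify; a minimal Python sketch, assuming the 15% masking ratio stated above:

```python
# Back-of-the-envelope check of the table above. Assumption: 15% of tokens
# are masked, and a masked token is, on average, preceded by half of the
# masked tokens in the permuted order.
mask_ratio = 0.15
unmasked = 1.0 - mask_ratio             # 85% of tokens are visible to every objective

# MLM: sees all unmasked tokens; [M] placeholders keep full position info.
mlm_tokens, mlm_positions = unmasked, 1.0

# PLM: additionally sees, on average, half of the masked tokens,
# but has no placeholders for the not-yet-predicted positions.
plm_tokens = unmasked + mask_ratio / 2   # 0.85 + 0.075 = 0.925
plm_positions = plm_tokens

# MPNet: same token visibility as PLM, plus mask symbols supply all positions.
mpnet_tokens, mpnet_positions = plm_tokens, 1.0

print(f"MLM:   {mlm_tokens:.1%} tokens, {mlm_positions:.1%} positions")
print(f"PLM:   {plm_tokens:.1%} tokens, {plm_positions:.1%} positions")
print(f"MPNet: {mpnet_tokens:.1%} tokens, {mpnet_positions:.1%} positions")
```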
Evaluation
The authors used models at the scale of BERT base: 12 Transformer layers, a hidden dimension of 768, 12 attention heads, and about 110M parameters. For MPNet, they used relative position embeddings. The batch size was 8192 sentences and the sequence length was 512.
For the pretraining corpus, they followed the data used for RoBERTa: Wikipedia, BooksCorpus, OpenWebText, CC-News, and Stories, for a total of 160GB. For tokenization, they used BPE with a 30K vocabulary.
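For reference, a model at roughly this scale can be instantiated with the Hugging Face transformers implementation of MPNet. This is only a sketch: the parameter values mirror the setup described above, and the vocab_size shown is the library's default rather than the paper's exact 30K figure.

```python
# Sketch only: an MPNet model at roughly BERT-base scale, using the
# Hugging Face `transformers` implementation (the official pretrained
# checkpoint is `microsoft/mpnet-base`).
from transformers import MPNetConfig, MPNetModel

config = MPNetConfig(
    vocab_size=30527,                   # library default; the paper reports a ~30K BPE vocabulary
    hidden_size=768,                    # hidden dimension
    num_hidden_layers=12,               # Transformer layers
    num_attention_heads=12,             # attention heads
    intermediate_size=3072,             # feed-forward dimension (BERT-base default)
    max_position_embeddings=514,        # allows 512-token inputs (RoBERTa-style position offset)
    relative_attention_num_buckets=32,  # MPNet uses relative position embeddings
)

model = MPNetModel(config)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```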
Results on GLUE: The authors showed that MPNet outperforms BERT (trained with MLM), XLNet (trained with PLM), RoBERTa, and ELECTRA.
Results on SQuAD: To find the answer span in this QA dataset, the authors added a classification layer on top of MPNet's outputs to predict whether each token is the start or end position of the answer span. They also added a binary classification layer to predict whether an answer exists. They showed that MPNet outperforms BERT, XLNet, and RoBERTa.
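The paper does not spell out this head in code, so the following is only a minimal PyTorch sketch of such a span-prediction head, with hypothetical class and attribute names, assuming encoder outputs of shape (batch, seq_len, hidden):

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """Minimal sketch of a SQuAD-style head: per-token start/end logits plus a
    sequence-level answerability logit (hypothetical names, not the authors' code)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.span_classifier = nn.Linear(hidden_size, 2)  # start / end logits per token
        self.has_answer = nn.Linear(hidden_size, 1)       # binary "answer exists" logit

    def forward(self, encoder_outputs: torch.Tensor):
        # encoder_outputs: (batch, seq_len, hidden) from the MPNet encoder
        logits = self.span_classifier(encoder_outputs)    # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)  # each (batch, seq_len)
        # Use the first token's representation ([CLS]-like) for answerability.
        answerable_logit = self.has_answer(encoder_outputs[:, 0]).squeeze(-1)
        return start_logits, end_logits, answerable_logit

# Example with random tensors standing in for MPNet's hidden states.
head = SpanPredictionHead()
hidden = torch.randn(2, 128, 768)
start, end, answerable = head(hidden)
print(start.shape, end.shape, answerable.shape)  # (2, 128) (2, 128) (2,)
```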
Results on RACE: RACE (ReAding Comprehension from Examinations) is a dataset collected from English examinations for middle and high school students. Each passage has multiple questions, each with four answer choices. Here, MPNet performs better than BERT and XLNet.
Results on IMDB: The IMDB dataset is a binary sentiment classification task on movie reviews. Here, MPNet performs better than BERT and PLM.