RoBERTa
The RoBERTa encoder model was introduced in the paper “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, published in July 2019 as a collaboration between the University of Washington and Facebook AI. RoBERTa is essentially a replication study of BERT that (i) trains longer, with bigger batches, over more data, (ii) removes the Next Sentence Prediction (NSP) objective, (iii) trains on longer sequences, and (iv) uses dynamic masking. The authors show that RoBERTa significantly outperforms BERT on the GLUE, SQuAD, and RACE benchmarks.
Dynamic Masking
BERT used static masking: the training data was masked once during preprocessing. Although BERT duplicated the data 10 times and applied a different mask to each copy, the model was trained for 40 epochs, so each masked version was still seen four times during pretraining. RoBERTa instead uses dynamic masking, where the masking pattern is generated anew every time a sequence is fed to the model.
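As a minimal sketch of the idea (not the exact RoBERTa implementation, which also replaces some selected tokens with random tokens or leaves them unchanged), the following Python function re-samples the masked positions every time it is called, so every pass over the data sees a different masking pattern:

```python
import random

MASK_TOKEN = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15):
    """Apply a fresh random mask to a token sequence.

    Unlike static masking, where masked positions are fixed during
    preprocessing, this runs every time a sequence is fed to the model,
    so each epoch sees a different masking pattern.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # model must predict the original token
        else:
            masked.append(tok)
            labels.append(None)   # position does not contribute to the MLM loss
    return masked, labels

sentence = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):
    masked, _ = dynamic_mask(sentence)
    print(f"epoch {epoch}: {' '.join(masked)}")   # a different mask each epoch
```

In practice, libraries such as Hugging Face transformers apply the same idea in their masked-language-modeling data collators, which re-sample the mask for every batch rather than fixing it during preprocessing.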
RoBERTa Training Details
Differences between BERT training and RoBERTa training:
- BERT: 13GB data, 256 batch size, 1M steps, 256 sequence length
- RoBERTa: 160GB data, 8K batch size, 500K steps, 512 sequence length
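These numbers imply a much larger compute budget for RoBERTa. Here is a rough back-of-the-envelope comparison, a sketch based only on the figures above (it ignores BERT's variable sequence lengths and treats every step as a full-length batch):

```python
# Rough comparison of how many sequences each model processes during
# pretraining (batch size x training steps). Token counts are upper bounds,
# since not every sequence reaches the maximum length.
bert = {"batch_size": 256, "steps": 1_000_000, "seq_length": 256}
roberta = {"batch_size": 8_000, "steps": 500_000, "seq_length": 512}  # "8K" batch as listed above

for name, cfg in [("BERT", bert), ("RoBERTa", roberta)]:
    sequences = cfg["batch_size"] * cfg["steps"]
    tokens = sequences * cfg["seq_length"]
    print(f"{name}: ~{sequences:,} sequences, ~{tokens / 1e9:.0f}B tokens (upper bound)")
```

By this estimate RoBERTa processes roughly 16x as many sequences as BERT, drawn from roughly 12x as much raw text.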
The authors trained for two RoBERTa model sizes:
- RoBERTa-base: 125M parameters, 12-layer, 768-hidden, 12-heads
- RoBERTa-large: 355M parameters, 24-layer, 1024-hidden, 16-heads
For comparison, BERT-base has 110M parameters, while BERT-large has 340M parameters.
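One way to sanity-check the RoBERTa sizes is to load the published checkpoints with the Hugging Face transformers library and count parameters directly. This is a sketch assuming transformers is installed; "roberta-base" and "roberta-large" are the checkpoint names on the Hugging Face Hub:

```python
from transformers import AutoConfig, AutoModel

for name in ["roberta-base", "roberta-large"]:
    config = AutoConfig.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(
        f"{name}: {config.num_hidden_layers} layers, "
        f"hidden size {config.hidden_size}, "
        f"{config.num_attention_heads} heads, "
        f"~{n_params / 1e6:.0f}M parameters"
    )
```

The printed counts land close to the figures above; minor differences typically come down to whether the language-modeling head is included in the count.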