RoBERTa

The RoBERTa encoder model was described in the paper “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, published in July 2019 as a collaboration between the University of Washington and Facebook AI. RoBERTa is essentially a replication study of BERT with (i) longer training, a bigger batch size, and more pretraining data, (ii) the Next Sentence Prediction (NSP) objective removed, (iii) training on longer sequences, and (iv) dynamic masking. The authors show that RoBERTa significantly outperforms BERT on the GLUE, SQuAD, and RACE benchmarks.

Dynamic Masking

BERT used static masking: the training data was masked once during preprocessing. Although BERT duplicated the data 10 times and applied a different mask to each copy, the data was used for 40 training epochs, so each masked copy was still seen four times during pretraining. RoBERTa instead uses dynamic masking, where a new masking pattern is generated every time a sequence is fed to the model, as sketched below.
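A minimal PyTorch sketch of the idea (the function name `dynamic_mask` is illustrative; the 15% masking rate and 80/10/10 replacement split follow BERT's MLM recipe, which RoBERTa keeps; handling of special tokens and padding is omitted):

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Draw a fresh random mask for a batch of token ids.

    Called every time a batch is sampled, so each pass over the data sees
    a different masking pattern (dynamic masking), unlike masking the
    corpus once during preprocessing (static masking).
    """
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose ~15% of positions as prediction targets.
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100  # ignore unmasked positions in the MLM loss

    # Of the chosen positions: 80% -> mask token, 10% -> random token,
    # 10% -> left unchanged.
    use_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[use_mask] = mask_token_id

    use_rand = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~use_mask
    random_tokens = torch.randint(vocab_size, input_ids.shape)
    input_ids[use_rand] = random_tokens[use_rand]

    return input_ids, labels
```

In practice this kind of masking is applied on the fly by the data collator as each batch is drawn, so the same sequence rarely receives the same mask twice.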

RoBERTa Training Details

Differences between BERT training and RoBERTa training:

  • BERT: 13GB data, 256 batch size, 1M steps, 256 sequence length
  • RoBERTa: 160GB data, 8K batch size, 500K steps, 512 sequence length
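As a rough back-of-the-envelope comparison (taking the numbers above at face value and reading 8K as 8,192), the total number of tokens processed per run differs by roughly 30x:

```python
# Rough scale comparison: total tokens ≈ batch size × steps × sequence length.
bert_tokens = 256 * 1_000_000 * 256        # ≈ 6.6e10 tokens
roberta_tokens = 8_192 * 500_000 * 512     # ≈ 2.1e12 tokens
print(f"BERT:    {bert_tokens:.2e} tokens")
print(f"RoBERTa: {roberta_tokens:.2e} tokens")
print(f"ratio:   {roberta_tokens / bert_tokens:.0f}x")  # roughly 32x
```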

The authors trained two RoBERTa model sizes:

  • RoBERTa-base: 125M parameters, 12-layer, 768-hidden, 12-heads
  • RoBERTa-large: 355M parameters, 24-layer, 1024-hidden, 16-heads

For comparison, BERT-base has 110M parameters, while BERT-large has 340M parameters.
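One way to double-check these sizes is to load the published checkpoints with the Hugging Face transformers library and count parameters. This is only a sanity-check sketch; the exact count can differ slightly from the quoted figures depending on which heads (pooler, LM head) are included:

```python
from transformers import RobertaModel

for name in ("roberta-base", "roberta-large"):
    model = RobertaModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    cfg = model.config
    print(f"{name}: {n_params / 1e6:.0f}M params, "
          f"{cfg.num_hidden_layers} layers, "
          f"{cfg.hidden_size} hidden, "
          f"{cfg.num_attention_heads} heads")
```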

Written on December 19, 2022