XLM-R

XLM-R (XLM-RoBERTa) is a multilingual encoder-based language model described in the paper “Unsupervised Cross-lingual Representation Learning at Scale”, published by Facebook AI in November 2019. The model builds on the multilingual XLM but borrows training techniques from RoBERTa. XLM-R covers 100 languages, was pretrained on 2.5TB of filtered CommonCrawl data, and significantly outperforms multilingual BERT (mBERT) on a range of multilingual evaluation benchmarks.

Model Details

  • The model was trained using the masked language modeling (MLM) objective on monolingual data: the authors sampled streams of text from each language and trained the model to predict masked tokens (see the fill-mask sketch after this list).
  • XLM-R did not use translation language modeling (TLM) as part of its pretraining objective. The original XLM model combines MLM and TLM, but the authors of XLM-R found that using validation perplexity as a stopping criterion for pretraining had left multilingual MLM models under-tuned; when they pretrained for longer, the TLM objective was no longer needed.
  • Subword tokenization was done with SentencePiece using a unigram language model and a shared vocabulary of 250K tokens.
  • The authors trained two model sizes, both with a maximum sequence length of 512 (see the configuration sketch after this list):
    • XLM-R base: 270M parameters, 12 layers, hidden dimension 768, 12 attention heads
    • XLM-R large: 550M parameters, 24 layers, hidden dimension 1024, 16 attention heads
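
To make the MLM and tokenization bullets concrete, here is a minimal sketch using the Hugging Face transformers library (not from the paper; it assumes the xlm-roberta-base checkpoint on the Hugging Face Hub and that sentencepiece is installed):

```python
# Minimal sketch: XLM-R's shared SentencePiece tokenizer and MLM head at inference time.
# Assumes the "xlm-roberta-base" Hub checkpoint; outputs in comments are illustrative.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# One unigram SentencePiece vocabulary shared across all 100 languages.
print(tokenizer.vocab_size)                          # roughly 250K entries
print(tokenizer.tokenize("Bonjour tout le monde"))   # subword pieces, e.g. ['▁Bonjour', '▁tout', '▁le', '▁monde']

# The pretraining objective at work: predict a masked token in any language.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")
print(fill_mask("La capitale de la France est <mask>.")[0]["token_str"])  # top-scoring completion
```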
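
The two configurations can also be read straight from the published checkpoints; the snippet below is a sketch that assumes the transformers XLMRobertaConfig field names rather than anything stated in the paper:

```python
# Sketch: inspect the published base/large configurations.
from transformers import AutoConfig

for name in ("xlm-roberta-base", "xlm-roberta-large"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.num_hidden_layers, cfg.hidden_size,
          cfg.num_attention_heads, cfg.vocab_size)
# Expected (per the paper): base = 12 layers / 768 hidden / 12 heads,
# large = 24 layers / 1024 hidden / 16 heads, shared vocab of ~250K.
```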

Evaluation

Evaluation was done on: (i) the Cross-lingual Natural Language Inference (XNLI) dataset in 15 languages, (ii) NER on the CoNLL-2002 and CoNLL-2003 datasets (English, Dutch, Spanish, and German), (iii) cross-lingual QA on MLQA, which extends the English SQuAD to Spanish, German, Arabic, Hindi, Vietnamese, and Chinese, and (iv) the English GLUE benchmark, to check that the multilingual model remains competitive with monolingual models on English tasks.
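
On XNLI, the paper's headline setup is zero-shot cross-lingual transfer: fine-tune on English NLI data and evaluate on all 15 languages. As a rough sketch of pulling the evaluation data, the snippet below assumes the xnli dataset id and its per-language configs on the Hugging Face Hub (an assumption about the datasets catalogue, not something from the paper):

```python
# Sketch: load the French XNLI test split for zero-shot evaluation.
from datasets import load_dataset

xnli_fr = load_dataset("xnli", "fr", split="test")
print(xnli_fr[0])  # expected schema: {'premise': ..., 'hypothesis': ..., 'label': 0|1|2}
```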

Model Capacity

  • Low-resource language performance can be improved by adding similar higher-resource languages during pretraining, but for a fixed model size the per-language capacity decreases as the number of languages increases. Positive transfer and capacity dilution therefore have to be balanced (the paper refers to this as the curse of multilinguality).
  • Scaling the shared vocabulary of XLM-R can improve its performance on downstream tasks. Allocating a higher proportion of the parameters to the embedding layer is beneficial, even though it reduces the model's capacity elsewhere (see the rough calculation after this list).
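
A back-of-the-envelope calculation (not from the paper) makes the embedding trade-off concrete: with a 250K shared vocabulary, the token-embedding matrix alone takes a large share of the total parameter budget.

```python
# Rough arithmetic: share of parameters sitting in the token-embedding matrix,
# using the published sizes (vocab ~250K; totals are the paper's rounded figures).
vocab = 250_000

for name, hidden, total in [("base", 768, 270e6), ("large", 1024, 550e6)]:
    emb = vocab * hidden
    print(f"XLM-R {name}: embeddings ≈ {emb / 1e6:.0f}M of {total / 1e6:.0f}M "
          f"({100 * emb / total:.0f}%)")
# XLM-R base:  embeddings ≈ 192M of 270M (71%)
# XLM-R large: embeddings ≈ 256M of 550M (47%)
```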
Written on December 23, 2022