mT5 Multilingual Encoder-Decoder Language Model

The mT5 language model was introduced in the paper “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer”, published in October 2020. It is a multilingual version of the T5 model. The largest model (13B XXL) exceeds SOTA on all classification and QA tasks and is near SOTA for NER. In general, mT5 is relatively weak on NER: the mT5-XL (3.7B) model is needed to exceed XLM-R (550M parameters) on NER.

Model Details

Dataset. The T5 model was trained on the C4 corpus. For mT5, the authors introduced a multilingual version of C4 (mC4), which consists of Common Crawl text in 101 languages. They used cld3 (https://github.com/google/cld3) to identify the language of each webpage. They sampled data from each language with probability proportional to $|L|^\alpha$, where $|L|$ is the number of examples in language $L$ and $\alpha=0.3$ (following XLM-R).
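
To make the sampling scheme concrete, here is a small sketch that computes per-language sampling probabilities with $\alpha=0.3$. The example counts are made up for illustration; only the exponent comes from the paper.

```python
# Exponential language sampling: p(L) proportional to |L|^alpha, with alpha = 0.3.
# The example counts below are hypothetical, for illustration only.
example_counts = {"en": 3_000_000_000, "ru": 700_000_000, "sw": 10_000_000, "yo": 500_000}

alpha = 0.3
weights = {lang: count ** alpha for lang, count in example_counts.items()}
total_weight = sum(weights.values())
sampling_probs = {lang: w / total_weight for lang, w in weights.items()}

total_count = sum(example_counts.values())
for lang, p in sampling_probs.items():
    print(f"{lang}: raw share {example_counts[lang] / total_count:.4f}, sampled share {p:.4f}")
```

The effect is that low-resource languages are sampled far more often than their raw share of the corpus, while high-resource languages are downsampled.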

Vocabulary. mT5 uses a vocabulary of 250K wordpieces (following XLM-R) with SentencePiece tokenization (following T5).
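
The released mT5 checkpoints on the Hugging Face Hub (e.g. google/mt5-small) expose this SentencePiece vocabulary; a minimal sketch, assuming the transformers and sentencepiece packages are installed and the Hub is reachable:

```python
# Inspect the mT5 SentencePiece vocabulary via Hugging Face transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
print(tokenizer.vocab_size)               # ~250K entries
print(tokenizer.tokenize("मशीन अनुवाद"))   # Hindi for "machine translation"; one shared vocabulary covers all 101 languages
```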

Models. They trained various model sizes: Small (300M), Base (580M), Large (1.2B), XL (3.7B), and XXL (13B). The models are pretrained for 1 million steps on batches of 1024 length-1024 input sequences, corresponding to roughly 1 trillion input tokens.
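
As a sanity check on the 1-trillion-token figure, the budget follows directly from the training configuration:

```python
# Back-of-the-envelope check of the pretraining token budget.
steps = 1_000_000    # pretraining steps
batch_size = 1024    # sequences per batch
seq_len = 1024       # input tokens per sequence

total_tokens = steps * batch_size * seq_len
print(f"{total_tokens:.3e}")  # ~1.07e12, i.e. roughly 1 trillion input tokens
```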

Evaluation

The authors evaluated on 6 tasks from the XTREME multilingual benchmark:

  • the XNLI entailment task covering 14 languages
  • the XQuAD, MLQA, and TyDi QA reading comprehension benchmarks
  • the NER dataset of WikiAnn, restricted to the 40 languages from XTREME
  • the PAWS-X paraphrase identification dataset with 7 languages

Evaluation results:

  • The largest model (13B XXL) exceeds SOTA on all classification and QA tasks and is near SOTA for NER. mT5-Large (1.2B) exceeds XLM-R on everything except NER.
  • In general, mT5 is weak on NER. This might imply that the text-to-text (encoder-decoder) framework of T5 is relatively weak on per-token prediction tasks (see the sketch below). They need the mT5-XL (3.7B) model to exceed XLM-R (550M parameters) on NER.
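
To see why per-token tasks are awkward for a text-to-text model, here is one hypothetical way to cast a BIO-tagged NER example into a (source, target) pair for generation. The target format below is an illustrative assumption, not necessarily the one used in the paper; the point is that the decoder must generate the labeled spans rather than classify each input token.

```python
# Illustrative only: cast a BIO-tagged NER example into text-to-text form.
tokens = ["Barack", "Obama", "visited", "Nairobi", "."]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O"]

source = " ".join(tokens)

# Collapse BIO tags into "entity: type" pairs for the generation target.
target_spans = []
current_tokens, current_type = [], None
for token, tag in zip(tokens, tags):
    if tag.startswith("B-"):
        if current_tokens:
            target_spans.append(f"{' '.join(current_tokens)}: {current_type}")
        current_tokens, current_type = [token], tag[2:]
    elif tag.startswith("I-") and current_tokens:
        current_tokens.append(token)
    else:
        if current_tokens:
            target_spans.append(f"{' '.join(current_tokens)}: {current_type}")
        current_tokens, current_type = [], None
if current_tokens:
    target_spans.append(f"{' '.join(current_tokens)}: {current_type}")

target = " $$ ".join(target_spans)
print(source)  # Barack Obama visited Nairobi .
print(target)  # Barack Obama: PER $$ Nairobi: LOC
```
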
Written on January 30, 2023