U-PaLM - UL2 Mixture of Denoisers for PaLM Decoder
The authors continued training the decoder-only PaLM model with the UL2 mixture-of-denoisers pretraining objective, yielding the U-PaLM models. U-PaLM was described in the paper “Transcending Scaling Laws with 0.1% Extra Compute” from Google, published in October 2022.
UL2 for Continued Pretraining
UL2 is a mixture-of-denoisers objective that mixes prefix language modeling (Prefix-LM) and infilling (span corruption) as pretraining objectives. Briefly, the three types of denoisers are (a toy sketch follows the list):
- Regular denoising: this is the span corruption task from T5. Spans typically have a mean length of 3, with a corruption rate of 15% of the tokens.
- Sequential denoising: this is the Prefix-LM objective.
- Extreme denoising: a large percentage of the original text is masked out. Spans typically have a mean length of 32, or the corruption rate goes up to 50% of the tokens.
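To make the mixture concrete, here is a minimal, hypothetical Python sketch of the three denoiser families operating on a token list. It is not the actual UL2/T5X implementation (which operates on tokenized tensors and has its own sentinel and mode-token conventions and mixing rates); the sentinel names, uniform mixing, and exponential span lengths are illustrative assumptions.

```python
import random

def span_corrupt(tokens, mean_span_len, corruption_rate, rng):
    """Mask random spans covering ~corruption_rate of the tokens.

    Returns (inputs, targets): inputs keep the uncorrupted tokens with
    sentinel placeholders; targets hold the masked-out spans.
    Sentinel names are illustrative, not the real UL2 vocabulary.
    """
    n = len(tokens)
    num_to_mask = max(1, int(n * corruption_rate))
    masked = [False] * n
    masked_count = 0
    while masked_count < num_to_mask:
        span_len = max(1, round(rng.expovariate(1.0 / mean_span_len)))
        start = rng.randrange(n)
        for i in range(start, min(start + span_len, n)):
            if not masked[i]:
                masked[i] = True
                masked_count += 1
    inputs, targets, sentinel, i = [], [], 0, 0
    while i < n:
        if masked[i]:
            inputs.append(f"<extra_id_{sentinel}>")
            targets.append(f"<extra_id_{sentinel}>")
            while i < n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def prefix_lm(tokens, rng):
    """S-denoising: split the sequence and predict the suffix from the prefix."""
    split = rng.randrange(1, len(tokens))
    return tokens[:split], tokens[split:]

def ul2_example(tokens, rng):
    """Sample one denoiser from a UL2-style mixture (uniform here for simplicity)."""
    mode = rng.choice(["R", "S", "X"])
    if mode == "R":    # regular denoising: mean span 3, 15% corruption
        inp, tgt = span_corrupt(tokens, mean_span_len=3, corruption_rate=0.15, rng=rng)
    elif mode == "X":  # extreme denoising: long spans / high corruption
        inp, tgt = span_corrupt(tokens, mean_span_len=32, corruption_rate=0.5, rng=rng)
    else:              # sequential denoising: Prefix-LM
        inp, tgt = prefix_lm(tokens, rng)
    return [f"[{mode}]"] + inp, tgt  # prepend the sampled mode token

rng = random.Random(0)
toks = [f"t{i}" for i in range(64)]
inp, tgt = ul2_example(toks, rng)
print(inp[:12], "...", "->", tgt[:12], "...")
```

Prepending a mode token (here a string like `[R]`) mirrors how UL2 tells the model which denoising paradigm a given example belongs to, so the same network can switch between infilling and Prefix-LM behavior.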
Starting from the trained PaLM models, the authors applied the UL2 denoising objectives and performed additional pretraining steps. They ran experiments with the 8B, 62B, and 540B-parameter versions of PaLM to generate the U-PaLM models. In particular, U-PaLM 540B is trained for 20K additional pretraining steps with a batch size of 32. The extra training uses about 1.3B tokens, roughly 0.16% extra compute compared to the original PaLM pretraining run.
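As a back-of-the-envelope check, the extra-token figure follows directly from the step count, batch size, and sequence length; the 2048-token sequence length and the 780B-token original budget below are assumptions drawn from the PaLM setup rather than stated in this section.

```python
# Sanity check of the extra-compute figure (seq_len and palm_tokens are
# assumptions from the PaLM setup, not given in this section).
steps = 20_000
batch_size = 32
seq_len = 2048

extra_tokens = steps * batch_size * seq_len   # ~1.31e9 tokens
palm_tokens = 780e9                           # original PaLM pretraining tokens
print(f"extra tokens: {extra_tokens / 1e9:.2f}B")                    # ~1.31B
print(f"fraction of pretraining: {extra_tokens / palm_tokens:.2%}")  # ~0.17%
```

Under these assumptions the continued pretraining amounts to roughly 0.16-0.17% of the original token budget, consistent with the paper's headline "0.1% extra compute."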
Evaluation
They first evaluated on 26 NLU and NLG tasks (TriviaQA, Natural Questions, SuperGLUE, PIQA, OpenBookQA, ANLI, etc.). On zero-shot and one-shot evaluations, U-PaLM 540B outperforms PaLM 540B on 21 out of 26 tasks.
Next, using 5-shot prompting, they evaluated on 21 “emergent” tasks from BIG-Bench. These challenging tasks were selected on the criterion that PaLM's performance remains relatively flat at the 8B and 62B scales but improves significantly with the 540B model. Their experiments show that U-PaLM outperforms PaLM on 19 of the 21 tasks at the 540B scale.
Finally, they performed experiments on various other settings: zero-shot commonsense reasoning, zero-shot and few-shot closed-book QA, reasoning with chain-of-thought prompting, etc. U-PaLM consistently outperforms PaLM at both the 62B and 540B scales.