PaLM - Decoder Language Model

PaLM (Pathways Language Model) was introduced in the paper “PaLM: Scaling Language Modeling with Pathways”, published in April 2022. It is a 540B-parameter autoregressive decoder-only language model trained on 780B text tokens.

PaLM outperforms GPT-3, GLaM, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA on 28 out of 29 English benchmarks in the few-shot setting, and on 24 out of 29 tasks in the 1-shot setting.

Among decoder-only language models, this is probably the best-performing (and largest dense) pretrained model to date. However, the authors acknowledge that decoder-only language models are sub-optimal (compared to encoder-decoder models) for fine-tuning.

Model Details:

  • Decoder-only (each timestep attends to itself and to past timesteps). Use SwiGLU activations (instead of ReLU, GeLU, or Swish).
  • Use “parallel” formulation in each Transformer block, rather than the standard “serialized” formulation. This enables 15% faster training with no quality degradation at the 62B scale (a code sketch of this block follows the list).
    • Standard: $y = x + \text{MLP}(\text{LayerNorm}(x + \text{Attention}(\text{LayerNorm}(x))))$
    • Parallel: $y = x + \text{MLP}(\text{LayerNorm}(x)) + \text{Attention}(\text{LayerNorm}(x))$
  • Use RoPE (rotary position) embeddings, instead of absolute or relative position embeddings (a sketch of RoPE also follows the list).
  • Shared input-output embedding matrices (which is done frequently, but not universally, in past work).
  • No biases in dense layers or layer norms (the authors mention this increases training stability for large models).
  • Use SentencePiece with 256K vocab.
  • Trained various model sizes: 8B, 62B, and 540B parameters. The 540B model uses:
    • 118 layers, 48 heads, 18432 hidden dimension.
    • Batch size starts at 512, then increases to 1024, and finally to 2048.
    • Why increase the batch size as training progresses? Smaller batch sizes are more sample-efficient (i.e. they improve training loss faster) early in training, while larger batch sizes provide better gradient estimates later in training.
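
To make the parallel formulation concrete, below is a minimal PyTorch sketch of a single block combining the shared-LayerNorm parallel residual with a SwiGLU MLP. This is not the authors' implementation: the module names, toy dimensions, and the use of nn.MultiheadAttention are my own simplifications, and PaLM details such as RoPE and the bias-free layer norm are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: swish(x @ W_gate) * (x @ W_in), projected back to d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # PaLM uses no biases in its dense layers.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_in = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_out(F.silu(self.w_gate(x)) * self.w_in(x))

class ParallelBlock(nn.Module):
    """Parallel formulation: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)).
    Attention and MLP share one LayerNorm instead of being applied in series."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # PaLM also drops the layer-norm bias; kept default here
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False, batch_first=True)
        self.mlp = SwiGLUMLP(d_model, d_ff)

    def forward(self, x):                      # x: (batch, seq, d_model)
        h = self.norm(x)                       # single shared LayerNorm for both branches
        seq_len = x.size(1)
        causal = torch.triu(                   # mask future positions (decoder-only)
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        return x + self.mlp(h) + attn_out      # both branches added to the residual

# Toy-sized example (PaLM-540B stacks 118 such layers with hidden dimension 18432).
block = ParallelBlock(d_model=256, n_heads=8, d_ff=1024)
y = block(torch.randn(2, 16, 256))             # -> (2, 16, 256)
```

The speedup reported in the paper comes from being able to fuse the MLP and attention input matrix multiplications over the shared LayerNorm output; the sketch keeps them separate for readability.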
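
For the rotary position embeddings, here is a minimal rotate-half sketch of how RoPE would be applied to queries and keys before attention. The function name, tensor layout, and base frequency of 10000 are assumptions for illustration, not PaLM's exact code.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x (batch, seq, heads, head_dim) by position-dependent angles."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-pair inverse frequencies, as in the RoPE paper.
    inv_freq = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    # angles[t, i] = t * inv_freq[i]
    angles = torch.arange(seq_len, dtype=x.dtype, device=x.device)[:, None] * inv_freq[None, :]
    cos = angles.cos()[None, :, None, :]       # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # 2-D rotation applied to each (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Queries and keys are rotated the same way; relative offsets then fall out of the q·k dot products.
q = apply_rope(torch.randn(2, 16, 8, 64))
k = apply_rope(torch.randn(2, 16, 8, 64))
```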

As its pretraining dataset, PaLM uses 780 billion tokens from: multilingual social media conversations (50% of the data), multilingual filtered webpages (27%), English books (13%), GitHub code (5%), multilingual Wikipedia (4%), and English news (1%).
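
As a small illustration, the mixture above can be written down as sampling weights; the source names and the helper below are my own shorthand, not the paper's actual data pipeline.

```python
import random

# Pretraining mixture proportions reported for PaLM (780B tokens in total).
PALM_MIXTURE = {
    "social_media_conversations": 0.50,  # multilingual
    "filtered_webpages": 0.27,           # multilingual
    "books": 0.13,                       # English
    "github_code": 0.05,
    "wikipedia": 0.04,                   # multilingual
    "news": 0.01,                        # English
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example according to the mixture weights."""
    sources, weights = zip(*PALM_MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in PALM_MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
# counts now roughly tracks the 50 / 27 / 13 / 5 / 4 / 1 split.
```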

Evaluation

One-Shot and Few-Shot Experiments

Evaluated (one-shot, few-shot) on the same set of 29 English datasets as GLaM and GPT-3:

  • QA, cloze and completion tasks, Winograd-style tasks, common sense reasoning, reading comprehension, SuperGLUE, NLI.
  • Few-shot: Outperforms GPT-3, GLaM, Megatron-Turing NLG, Gopher, Chinchilla, LaMDA on 28 out of 29 English benchmarks.
  • One-shot: Outperforms the above models on 24 out of 29 tasks.
  • The paper did not compare to SOTA results achieved by prior fine-tuned models.

On MMLU (Massive Multitask Language Understanding), PaLM outperforms the SOTA Chinchilla (a 70B decoder-only model) by about 2 points on average: 69.3 vs. 67.5.

On the BIG-bench benchmark, PaLM in the 5-shot setting beats the prior SOTA (GPT-3, Gopher, Chinchilla) on 44 out of 58 tasks.

Fine-Tuning Experiments

Fine-tuning results on SuperGLUE show that PaLM is comparable to the SOTA models, which are encoder-decoder models of much smaller size:

  • Using peak validation scores, PaLM (92.6) outperforms the T5-11B encoder-decoder model (89.9), but lags behind ST-MoE-32B (93.3). ST-MoE-32B is a 269B-parameter sparse Mixture-of-Experts encoder-decoder model with computation costs comparable to a 32B dense model.

Discussion

  • In Section 6.7 on multilingual QA, the PaLM authors state “We conjecture that, similar to SuperGLUE experiments, the causal language model loss objective and the decoder-only architecture might be sub-optimal for discriminative finetuning.”
  • The T5 paper also showed that encoder-decoder models generally outperform decoder-only models when fine-tuned on classification tasks.
Written on February 8, 2023