FLAN - Fine-Tuned Decoder Language Model
The FLAN model (Finetuned Language Net) from Google was introduced in the paper “Finetuned Language Models Are Zero-Shot Learners”, published in September 2021. The paper shows that multitask finetuning improves zero-shot generalization to new tasks (i.e. tasks not included in finetuning).
The setup of FLAN is very similar to that of T0, a paper from Huggingface published a month later in October 2021 (both FLAN and T0 explore multitask finetuning). The main difference is that FLAN is based on LaMDA-PT (Google’s decoder-only language model), whereas T0 is based on T5 (Google’s encoder-decoder language model).
Important takeaways from the paper:
- FLAN is based on LaMDA-PT (137B parameters), a decoder-only language model pre-trained on web documents, dialog data, and Wikipedia, tokenized into 2.49T BPE tokens with a 32K vocabulary using SentencePiece. Note that LaMDA-PT only has language-model pretraining, in contrast to LaMDA, which was additionally finetuned for dialog.
- FLAN improves on the zero-shot performance of the base LaMDA-PT 137B model, and also outperforms zero-shot GPT-3 on 20 out of 25 datasets. Note, however, that GPT-3 is only pre-trained and not finetuned.
- Interestingly, the benefits of finetuning only appear with larger models. The authors finetuned pre-trained models of several sizes: 422M, 2B, 8B, 68B, and 137B parameters. For models of 8B and smaller, finetuning actually hurts performance on held-out tasks.
- Given a finetuned FLAN model, few-shot in-context learning (i.e. a few exemplars provided in the prompt, as sketched below) further improves over zero-shot performance. The standard deviation of performance across different prompt wordings at inference time is also lower, so few-shot prompting reduces sensitivity to prompt engineering.
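A minimal sketch of what zero-shot vs. few-shot prompting looks like in practice; the task, instruction wording, and exemplars below are invented for illustration and are not taken from the paper:

```python
# Illustrative sketch: building zero-shot vs. few-shot prompts for an
# instruction-following model. The NLI task, instruction, and exemplars
# below are made up for illustration; they are not from the FLAN paper.

def build_prompt(instruction, query, exemplars=None):
    """Optionally prepend (input, answer) exemplars before the final query."""
    parts = []
    for ex_input, ex_answer in (exemplars or []):
        parts.append(f"{instruction}\n{ex_input}\nAnswer: {ex_answer}")
    parts.append(f"{instruction}\n{query}\nAnswer:")  # model completes the answer
    return "\n\n".join(parts)

instruction = "Does the premise entail the hypothesis? Answer yes or no."
exemplars = [
    ("Premise: The dog is sleeping. Hypothesis: An animal is resting.", "yes"),
    ("Premise: It is raining heavily. Hypothesis: The streets are dry.", "no"),
]
query = "Premise: Two men are playing chess. Hypothesis: People are playing a game."

zero_shot = build_prompt(instruction, query)            # zero-shot: no exemplars
few_shot = build_prompt(instruction, query, exemplars)  # few-shot: exemplars prepended
print(few_shot)
```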
Datasets, training and evaluation
- The paper groups more than 60 NLP datasets into 12 task clusters. Each cluster is held out in turn for evaluation while finetuning on all other clusters (a sketch of this leave-one-cluster-out setup follows this list). The grouping of tasks into clusters is shown below:
- For each dataset, the authors manually compose 10 unique templates that use natural language to describe the task. The following figure illustrates this:
- During finetuning, each example in each dataset is formatted via a randomly selected template for that dataset. Some examples of FLAN prompts:
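As a rough sketch of how this template-based formatting might be implemented; the dataset names and template strings below are illustrative stand-ins, not the paper’s actual templates:

```python
import random

# Illustrative sketch of the formatting step described above: each dataset
# has ~10 natural-language templates, and every training example is rendered
# with a template chosen at random for its dataset. Dataset names and
# template strings here are invented for illustration.

TEMPLATES = {
    "nli": [
        "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
        "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?",
        "Read the premise and decide whether the hypothesis follows from it.\nPremise: {premise}\nHypothesis: {hypothesis}",
    ],
    "translation_en_fr": [
        "Translate the following sentence to French: {source}",
        "How would you say \"{source}\" in French?",
    ],
}

def format_example(dataset_name, example):
    """Render one training example with a randomly chosen template for its dataset."""
    template = random.choice(TEMPLATES[dataset_name])
    return template.format(**example)

print(format_example("nli", {
    "premise": "The cat sat on the mat.",
    "hypothesis": "An animal is on the mat.",
}))
```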
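And a minimal sketch of the leave-one-cluster-out evaluation setup mentioned above, assuming a toy subset of clusters; the cluster and dataset names below are only an illustrative sample, not the paper’s full list of 12 clusters and 60+ datasets:

```python
# Illustrative sketch of the leave-one-cluster-out protocol: finetune on every
# task cluster except the held-out one, then evaluate zero-shot on the datasets
# in the held-out cluster. Cluster/dataset names are a small illustrative subset.

CLUSTERS = {
    "natural_language_inference": ["ANLI", "RTE", "CB"],
    "sentiment": ["IMDB", "SST-2"],
    "translation": ["WMT-16 En/De", "WMT-14 En/Fr"],
    # ... the paper uses 12 clusters covering 60+ datasets
}

def leave_one_cluster_out(held_out):
    """Return (finetuning datasets, evaluation datasets) for one held-out cluster."""
    train = [d for cluster, datasets in CLUSTERS.items()
             if cluster != held_out for d in datasets]
    evaluate = list(CLUSTERS[held_out])
    return train, evaluate

train_datasets, eval_datasets = leave_one_cluster_out("natural_language_inference")
print(train_datasets, eval_datasets)
```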
Written on February 1, 2023