GPT-2 Decoder Language Model

The GPT-2 language model was published in the paper “Language Models are Unsupervised Multitask Learners” in February 2019. The GPT-2 paper differs from the GPT-1 paper in two main ways. First, GPT-2 experimented with various model sizes, ranging from 117M parameters (the same size as GPT-1) to 1.5B parameters. Second, instead of fine-tuning on downstream NLP tasks as was done with GPT-1, the GPT-2 paper focused on zero-shot evaluation.

Model pre-training details

  • Training dataset: outbound webpages linked from Reddit were mined and cleaned, resulting in 8 million documents totaling 40GB of text. All Wikipedia documents were omitted since Wikipedia is a common data source for NLP evaluation datasets.
  • A modified, byte-level version of Byte-Pair Encoding (BPE) was used for subword tokenization, with a vocabulary of about 50K tokens.
  • The language model is essentially a Transformer decoder like GPT-1, with a few modifications. For instance, the context size was increased from 512 to 1024 tokens, and the batch size was increased from 64 to 512.
  • The smallest GPT-2 model has 117M parameters (12 layers, hidden dimension 768), similar in size to GPT-1 and BERT-Base. The largest GPT-2 model has 1.5B parameters (48 layers, hidden dimension 1600); a rough parameter count for these two configurations is sketched below.
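
To make the size claims concrete, here is a back-of-the-envelope parameter count for a GPT-2-style decoder. The vocabulary size (50,257) and context length (1,024) are the paper's values; the per-layer breakdown assumes the standard GPT-2 block layout (multi-head attention plus a 4x-wide MLP, with biases and layer norms), and the totals land slightly above the reported 117M/1.5B figures because of counting conventions.

```python
# Rough parameter count for a GPT-2-style Transformer decoder.
def gpt2_param_count(n_layers: int, d_model: int,
                     vocab_size: int = 50_257, n_ctx: int = 1_024) -> int:
    d_mlp = 4 * d_model                                   # MLP hidden width
    embeddings = vocab_size * d_model + n_ctx * d_model   # token + position embeddings
    attn = (3 * d_model * d_model + 3 * d_model           # QKV projection (+ bias)
            + d_model * d_model + d_model)                # output projection (+ bias)
    mlp = (d_model * d_mlp + d_mlp                        # first linear layer
           + d_mlp * d_model + d_model)                   # second linear layer
    layer_norms = 2 * 2 * d_model                         # two LayerNorms per block
    per_layer = attn + mlp + layer_norms
    final_ln = 2 * d_model                                # final LayerNorm
    return embeddings + n_layers * per_layer + final_ln

print(f"GPT-2 small (12 layers, 768):  {gpt2_param_count(12, 768) / 1e6:.0f}M")
print(f"GPT-2 XL   (48 layers, 1600): {gpt2_param_count(48, 1600) / 1e9:.2f}B")
# Prints ~124M and ~1.56B, in the same ballpark as the reported 117M and 1.5B.
```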

Zero-shot evaluation results (no fine-tuning)

  • Improves SOTA on the LAMBADA dataset (the task is to predict the final word of a text paragraph) and on the Winograd Schema Challenge (given a sentence and a question, choose between two answers).
  • Performs poorly on the Conversational Question Answering dataset (CoQA), where the task is: given a text paragraph, answer a series of questions about it. Zero-shot GPT-2 achieves 55 F1, while a SOTA BERT-based model achieves 89 F1.
  • Performs poorly on CNN/Daily Mail summarization: GPT-2 achieves 21 ROUGE F1 vs. 32 ROUGE F1 for SOTA. To get GPT-2 to perform summarization, the authors appended the token TL;DR: to the input prompt, after the article text (see the prompting sketch after this list).
  • On WMT-14 French-English translation, zero-shot GPT-2 obtained BLEU scores that significantly lagged behind SOTA results.
  • On the “Natural Questions” QA dataset, zero-shot GPT-2 answered 4.1% of questions correctly vs. 30-50% for SOTA.
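
All of these zero-shot setups reduce to prompt construction plus sampling from the language model. As an illustration only (the paper did not use this library), here is a sketch of the summarization setup using the public Hugging Face transformers GPT-2 checkpoint: the TL;DR: cue is appended after the article and a continuation is sampled with top-k sampling, as described in the paper (k=2, 100 generated tokens).

```python
# Illustrative zero-shot summarization prompt in the style of the GPT-2 paper.
# Assumes the Hugging Face "gpt2" checkpoint (pip install transformers torch);
# the paper itself used the original TensorFlow models.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."                  # placeholder: CNN/Daily Mail article text goes here
prompt = article + "\nTL;DR:"    # the cue the paper appends to induce summarization

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=100,          # the paper generates 100 tokens
    do_sample=True,
    top_k=2,                     # top-k random sampling with k=2, as in the paper
    pad_token_id=tokenizer.eos_token_id,
)
summary = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(summary)
```

The translation and QA settings work the same way, only with different conditioning text; for WMT-14, the paper conditions the model on example pairs in the format “english sentence = french sentence” before the sentence to translate.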
Written on January 11, 2023