BART Encoder-Decoder Language Model

BART is a language model from Meta, described in the paper “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”, published in October 2019. It is most similar to the T5 model, which is also an encoder-decoder Transformer.

BART is a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right auto-regressive decoder. The base model has 6 layers each in the encoder and the decoder; the large model has 12 layers each, for a total of about 406M parameters. Each layer of the decoder additionally performs cross-attention over the final layer of the encoder.
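As a quick sanity check of these sizes, the published checkpoints expose them through their configuration. A minimal sketch, assuming the Hugging Face `transformers` library and the `facebook/bart-base` / `facebook/bart-large` checkpoint names:

```python
from transformers import BartConfig

# Load the published configs (only the config is downloaded, not the weights).
base = BartConfig.from_pretrained("facebook/bart-base")
large = BartConfig.from_pretrained("facebook/bart-large")

# bart-base: 6 encoder + 6 decoder layers, hidden size 768
print(base.encoder_layers, base.decoder_layers, base.d_model)
# bart-large: 12 encoder + 12 decoder layers, hidden size 1024
print(large.encoder_layers, large.decoder_layers, large.d_model)
```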

Differences from BERT, GPT, and T5

  • BERT: An encoder-only Transformer in which documents are encoded bidirectionally; pre-training uses an MLM (masked language modeling) objective where random tokens are replaced with masks and the masked tokens are predicted independently.
  • GPT: A decoder-only Transformer in which tokens are predicted auto-regressively. Words can only be conditioned on leftward context, so the model cannot learn bidirectional interactions.
  • T5: Like BART, T5 is an encoder-decoder Transformer, but T5 predicts only the masked text spans, whereas BART reconstructs the complete output. Another difference appears when fine-tuning for classification: T5 predicts the output auto-regressively in the decoder, whereas BART feeds the final hidden representation of its decoder into a classification layer. More details below.

Pre-Training and Fine-Tuning

(Figure: BART pre-training (left) and fine-tuning (right).)
  • Pre-training. First, text is corrupted with a noising function; then a sequence-to-sequence model is trained to reconstruct the original text. That is, the corrupted document is encoded with a bidirectional encoder, and the likelihood of the original document is computed with an auto-regressive decoder (the training objective is the negative log-likelihood of the original document). This is similar to T5, but BART explores the following text corruption options (a toy sketch of these options follows the list):
    • Token masking: Following BERT, random tokens are sampled and replaced with [MASK].
    • Token deletion: Random tokens are deleted from the input text.
    • Text infilling: A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3). Each span is replaced with a single [MASK] token; a zero-length span corresponds to inserting a [MASK]. Text infilling is inspired by SpanBERT, but SpanBERT replaces each span with a sequence of [MASK] tokens of exactly the same length.
    • Sentence permutation: Sentences are shuffled in random order.
    • Document rotation: A token is chosen uniformly at random, and the document is rotated so that it begins with that token.
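The sketch below is a toy, self-contained illustration of these five corruption schemes on a list of tokens. It is not the actual implementation used to train BART; the sampling rates and helper names are made up for the example.

```python
import random
import numpy as np

MASK = "[MASK]"

def token_masking(tokens, p=0.15):
    # Replace a random subset of tokens with [MASK] (BERT-style).
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.15):
    # Delete a random subset of tokens; the model must figure out which positions are missing.
    return [t for t in tokens if random.random() >= p]

def text_infilling(tokens, n_spans=2, mean_span=3):
    # Replace sampled spans with a single [MASK] each; span lengths are drawn
    # from Poisson(3), and a zero-length span simply inserts a [MASK].
    tokens = list(tokens)
    for _ in range(n_spans):
        length = np.random.poisson(mean_span)
        start = random.randrange(0, len(tokens) + 1)
        tokens[start:start + length] = [MASK]
    return tokens

def sentence_permutation(sentences):
    # Shuffle full sentences into a random order.
    sentences = list(sentences)
    random.shuffle(sentences)
    return sentences

def document_rotation(tokens):
    # Choose a token uniformly at random and rotate the document to start there.
    k = random.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

doc = "the cat sat on the mat and stared at the dog".split()
print(text_infilling(doc))
print(document_rotation(doc))
```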
  • Fine-tuning: The uncorrupted text is input to both the encoder and the decoder, and representations are taken from the final layer of the decoder (a classification sketch follows this list).
    • Sequence classification tasks: the same input is fed into the encoder and decoder, and the final hidden state (from the last layer) of the final decoder token is fed into a new multi-class linear classifier.
    • Token classification tasks: the text is fed into both the encoder and decoder, and the last-layer hidden states of the decoder serve as a representation of each token; this representation is used for token classification.
    • Sequence generation tasks: for instance, abstractive QA and summarization. The input text is fed to the encoder, and the decoder generates the output auto-regressively.
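For sequence classification, this setup maps fairly directly onto Hugging Face's `BartForSequenceClassification`, which feeds the same tokens to the encoder and decoder and classifies from the final decoder token's hidden state. A minimal sketch, where the checkpoint name and `num_labels=3` are placeholders for whatever task you fine-tune on:

```python
import torch
from transformers import BartTokenizer, BartForSequenceClassification

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
# The classification head on top of the decoder is freshly initialized,
# so it needs fine-tuning before its predictions mean anything.
model = BartForSequenceClassification.from_pretrained("facebook/bart-large", num_labels=3)

inputs = tokenizer("BART handles classification tasks too.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)
print(logits.argmax(dim=-1))
```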

Final BART model

  • The BART-large model has 406M parameters (12 layers in each of the encoder and decoder) and hidden size 1024, and was trained with batch size 8000 for 500K steps.
  • Tokenization uses the same byte-pair encoding as GPT-2.
  • For text corruption, the final model uses a combination of text infilling (masking 30% of tokens, with spans averaging 3 tokens in length) and permutation of all sentences.
  • Pre-training uses the same data as the RoBERTa model (160GB of news, books, stories, and web text).
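As a usage sketch, the released checkpoint can be loaded through Hugging Face's `transformers` and used for denoising-style generation, e.g. filling a masked span (`<mask>` is the tokenizer's mask token; the example sentence is arbitrary):

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

text = "The BART model was pre-trained on 160GB of <mask> and web text."
inputs = tokenizer(text, return_tensors="pt")

# Beam-search decoding with the auto-regressive decoder.
output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```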
Written on February 6, 2023