LaMDA Decoder Dialog Model

Google’s LaMDA (Language Models for Dialog Applications) is a decoder-only Transformer dialog model designed to produce dialog responses that are high quality, safe, and grounded. It was introduced in the paper “LaMDA: Language Models for Dialog Applications”, published in January 2022.

The authors note that the automatic quality metrics commonly used to evaluate language models, such as perplexity and BLEU/ROUGE, may not correlate well with human judgements. Hence, they proposed fine-tuning a Transformer decoder model to improve three metrics that are important for dialog:

  • Quality, focusing on Sensibleness, Specificity, and Interestingness (SSI):
    • Sensibleness: Whether a response makes sense in context and does not contradict what was said earlier. But sensibleness alone is not enough, as it would encourage models to “play it safe” by producing short, generic, boring responses, e.g. “I don’t know”, “OK”.
    • Specificity: Whether a response is specific to the given context. E.g. if a user says “I love Eurovision”, the response “Me too” is too general; a better response is “Me too. I love Eurovision songs”.
    • Interestingness: Whether a response is witty and likely to catch someone’s attention.
  • Safety: Avoid producing responses with unintended risks of harm (e.g. financial advice, how-to information on harmful activities), responses that might cause discrimination or marginalization, and responses that propagate or reinforce misinformation.
  • Groundedness: Encourage responses that are grounded in known sources whenever they contain verifiable external-world information.

The LaMDA paper shows that taking a pre-trained language model, fine-tuning it on thousands of curated dialogs annotated for quality and safety, and then applying safety thresholding and quality ranking to select the “best” response can significantly improve users’ perception of quality and safety.

Pretraining of LaMDA

  • The authors pretrained a decoder-only Transformer model, where the training objective is to predict the next token in a text corpus. The model has 64 layers and 128 attention heads, with up to 137B parameters. The pretraining dataset consists of 2.97B documents and 1.12B dialogs, for a total of 1.56T words. Tokenization is done with SentencePiece with a vocab size of 32K tokens.
  • To generate responses, the model uses sample-and-rank decoding: it first samples 16 candidate responses with top-k sampling (k = 40), then calculates each candidate’s log-likelihood and returns the highest-scoring candidate as the response (a sketch follows below).
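
A minimal sketch of this sample-and-rank decoding in Python, assuming a hypothetical `model` object that exposes `sample(context, top_k)` for top-k sampling and `log_likelihood(context, response)` for scoring (neither interface is from the paper):

```python
# Sketch of sample-and-rank decoding. `model` is a hypothetical object with
# `sample(context, top_k)` returning one top-k-sampled response and
# `log_likelihood(context, response)` returning the summed token log-probs.

def sample_and_rank(model, context, num_candidates=16, top_k=40):
    # Draw independent candidate responses with top-k sampling (k = 40).
    candidates = [model.sample(context, top_k=top_k) for _ in range(num_candidates)]
    # Score each candidate by its log-likelihood and return the best one.
    return max(candidates, key=lambda r: model.log_likelihood(context, r))
```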

Fine-Tuning for SSI and Safety

The pretrained models are first fine-tuned for quality and safety. For fine-tuning, examples are formatted as text sequences of the form “<context> <separator> <response>”, with the training loss applied only to the “<response>” portion.
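
As a rough illustration (not the paper’s code), one way to build such an example with the loss masked to the response portion; the `tokenizer` interface and the `IGNORE_INDEX` convention are assumptions:

```python
# Sketch of building one fine-tuning example. `tokenizer` is a hypothetical
# SentencePiece-style tokenizer; IGNORE_INDEX marks label positions that are
# excluded from the loss (a common convention, not from the paper).
IGNORE_INDEX = -100

def build_example(tokenizer, context, response, separator=" <sep> "):
    context_ids = tokenizer.encode(context + separator)
    response_ids = tokenizer.encode(response)
    input_ids = context_ids + response_ids
    # The training loss applies only to the "<response>" portion, so the
    # context (and separator) positions are masked out of the labels.
    labels = [IGNORE_INDEX] * len(context_ids) + response_ids
    return input_ids, labels
```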

For the SSI fine-tuning dataset:

  • The authors collected 6400 dialogs in which crowdworkers interact with LaMDA about any topic. The workers rate each model response on whether it is sensible, specific, and interesting. Each rating is a binary label.
  • For instance, a sensibleness-rated example is “What’s up? RESPONSE not much. SENSIBLE 1”. An interestingness-rated example is “What’s up? RESPONSE not much. INTERESTING 0”.

For the safety fine-tuning dataset:

  • The authors collected 8K dialogs in which crowdworkers interact with LaMDA about any topic. Each model response is rated as safe or unsafe (a binary rating).
  • An example is: “What’s up? RESPONSE not much. UNSAFE 0” (see the formatting sketch below).
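
The rated turns from both datasets can be serialized into training strings of the form shown in the examples above; a small sketch (the exact serialization used in the paper may differ):

```python
# Sketch of serializing a rated dialog turn into a discriminator training
# string like "What's up? RESPONSE not much. SENSIBLE 1". The attribute names
# follow the examples above; the exact format in the paper may differ.

def format_rating_example(context, response, attribute, label):
    # attribute is one of "SENSIBLE", "SPECIFIC", "INTERESTING", "UNSAFE";
    # label is the crowdworker's binary rating (0 or 1).
    return f"{context} RESPONSE {response} {attribute} {int(label)}"

print(format_rating_example("What's up?", "not much.", "SENSIBLE", 1))
# -> What's up? RESPONSE not much. SENSIBLE 1
```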

The models are first fine-tuned to predict the SSI and safety ratings, i.e. to estimate $P(\text{<rating>} \mid \text{<context> <separator> <response> <attribute-name>})$. Candidate responses whose predicted safety score falls below a certain threshold are filtered out. The remaining candidates are then ranked for quality, and the top-ranked candidate is selected as the next response. In experiments, the authors observed that safety ratings do not benefit much from increasing model scale alone, without fine-tuning.
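
A sketch of this filter-then-rank step, assuming hypothetical `p_safe`, `p_sensible`, `p_specific`, and `p_interesting` functions that return the fine-tuned model’s predicted probabilities; the threshold value is illustrative, and the 3:1:1 quality weighting follows the candidate-ranking score described in the paper:

```python
# Sketch of safety thresholding followed by quality ranking. The scoring
# functions are hypothetical wrappers around the fine-tuned discriminators;
# the threshold value is illustrative, not from the paper.
SAFETY_THRESHOLD = 0.9

def select_response(context, candidates, p_safe, p_sensible, p_specific, p_interesting):
    # Filter away candidates whose safety prediction falls below the threshold.
    safe = [c for c in candidates if p_safe(context, c) >= SAFETY_THRESHOLD]
    if not safe:
        return None  # in practice some fallback response would be used

    # Rank the remaining candidates for quality (3:1:1 weighting of
    # sensibleness, specificity, and interestingness) and pick the top one.
    def quality(c):
        return (3 * p_sensible(context, c)
                + p_specific(context, c)
                + p_interesting(context, c))

    return max(safe, key=quality)
```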

Fine-Tuning for Groundedness

The above models, which have been fine-tuned for quality and safety, are subsequently fine-tuned for groundedness.

As training data, the authors collected 4K dialogs of crowdworkers interacting with the model, where the conversations focused on information-seeking interactions. If a model response contains claims that need to be checked, crowdworkers record the search queries they used to investigate the claims. There are two fine-tuning tasks:

  • First fine-tuning task: Take the context so far, and decide whether to generate a query to an in-house fact-checking toolset (which consists of an Internet search engine, a translator, and a calculator).
  • Second fine-tuning task: Given the original response (e.g. “He is 31 years old right now”) and the snippet returned by the toolset (e.g. “Rafael Nadal / Age / 35”), produce the grounded response to the user (e.g. “He is 35 years old right now”). Alternatively, this task can invoke another query to the toolset (see the sketch below).
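
A sketch of how the two tasks might be chained at inference time, assuming a hypothetical `generate(prompt)` function whose output is tagged as either a toolset query (`"TS:"`) or a user-facing response (`"USER:"`); the tags and loop structure here are illustrative, not the paper’s exact format:

```python
# Sketch of the groundedness loop. `generate` and `toolset` are hypothetical:
# `generate(prompt)` returns text tagged "TS: <query>" or "USER: <response>",
# and `toolset.query(q)` returns a snippet from search/translator/calculator.

def grounded_response(generate, toolset, context, base_response, max_steps=4):
    prompt = context + "\n" + base_response
    for _ in range(max_steps):
        output = generate(prompt)
        if output.startswith("TS:"):
            # First task: the model chose to query the fact-checking toolset;
            # append the query and the returned snippet to the working context.
            snippet = toolset.query(output[len("TS:"):].strip())
            prompt += "\n" + output + "\n" + snippet
        else:
            # Second task: the model produced the grounded user-facing response.
            return output[len("USER:"):].strip() if output.startswith("USER:") else output
    return base_response  # fall back to the original response if no final answer
```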
Written on January 25, 2023