Zephyr

Shortly after the release of Mistral-7B, researchers at HuggingFace released “Zephyr: Direct Distillation of LM Alignment” on 10/25/2023. At the time of its release, Zephyr was the most capable 7B language model. The Zephyr authors started with the Mistral-7B model, performed instruction fine-tuning on the UltraChat dataset, and then performed preference optimization using DPO on the UltraFeedback dataset.

Approach

Starting with Mistral-7B, the authors first performed two fine-tuning steps to arrive at Zephyr-7B:

  • Instruction fine-tuning on a dataset generated automatically from other language models.
  • Direct preference optimization (DPO) fine-tuning on a preference dataset whose responses are automatically generated by a set of LMs and then scored by GPT-4 to identify the preferred vs. non-preferred response.

Distilled supervised fine-tuning (dSFT)

Starting from a raw pretrained LLM, the usual first step is to perform instruction fine-tuning by training on a dataset of instructions and responses. The Zephyr authors leveraged the distilled SFT (dSFT) approach:

  • Teacher models are used to generate instructions and responses, i.e. the self-instruct protocol described in “Self-instruct: aligning language models with self-generated instructions” (Wang et al., 2023).
  • Starting with a seed set of prompts $x_{1}^{0}, \ldots, x_{J}^{0}$, the teacher model is used to iteratively generate a response and then refine the existing instruction based on the newly generated response, culminating in a final dataset $C = \{ (x_{1}, y_{1}), \ldots, (x_{J}, y_{J}) \}$.
  • The pretrained model is then fine-tuned on the above dataset $C$ to generate $y$ from input $x$, resulting in $\pi_{\text{dSFT}}$ (a minimal sketch of this fine-tuning loss follows below).
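Below is a minimal sketch of the dSFT objective, assuming the Hugging Face transformers API; the checkpoint name is the Mistral-7B base used by Zephyr, but this is an illustration of the loss, not the authors' actual training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Mistral-7B base checkpoint (Zephyr's starting point); loaded in bfloat16.
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def dsft_loss(instruction: str, response: str) -> torch.Tensor:
    """Standard SFT objective: cross-entropy on the response tokens only."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask prompt positions out of the loss
    # The model shifts labels internally, so this is next-token prediction of y given x.
    return model(input_ids=full_ids, labels=labels).loss
```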

Generate Preference Dataset using Teacher Models

  • Starting with a set of prompts $x_{1}, \ldots, x_{J}$, each prompt $x$ is fed to four teacher LLMs $\pi_{1}, \ldots, \pi_{4}$ (e.g. Llama, Falcon, etc.), where each teacher model generates a response $y^{1} \sim \pi_{1}(\cdot|x), \ldots, y^{4} \sim \pi_{4}(\cdot|x)$.
  • These responses are fed to GPT-4, which gives a score for each response: $s^{1} \sim \pi_{\text{GPT-4}}(\cdot | x, y^{1}), \ldots, s^{4} \sim \pi_{\text{GPT-4}}(\cdot | x, y^{4})$.
  • The highest-scoring response is denoted $y_{w}$, and a random lower-scoring response is chosen as $y_{l}$. The final dataset consists of triples $\{ (x, y_{w}, y_{l}) \}$ (see the sketch below).
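A short sketch of how the $(x, y_{w}, y_{l})$ triples can be assembled from the GPT-4 scores; the function and variable names below are illustrative, not from the paper.

```python
import random

def build_preference_triple(prompt, responses, scores, rng=random.Random(0)):
    """Pick the highest-scoring response as y_w and a random lower-scoring one as y_l."""
    ranked = sorted(zip(responses, scores), key=lambda rs: rs[1], reverse=True)
    y_w = ranked[0][0]
    y_l = rng.choice([resp for resp, _ in ranked[1:]])
    return prompt, y_w, y_l

# Example with dummy responses and scores from four teacher models:
x = "Explain what DPO is in one sentence."
ys = ["response_1", "response_2", "response_3", "response_4"]
s = [7.5, 9.0, 6.0, 8.0]
print(build_preference_triple(x, ys, s))  # -> (x, "response_2", one of the others)
```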

DPO

Taking the above preference dataset $\{(x, y_{w}, y_{l})\}$, the authors performed DPO fine-tuning over the $\pi_{\text{dSFT}}$ model, using the following DPO objective: \(\pi_{\theta} = \max_{\pi} \mathbb{E}_{(x, y_{w}, y_{l}) \sim D} \log \sigma \left( \beta \log \frac{\pi (y_{w}|x)}{\pi_{\text{dSFT}}(y_{w}|x)} - \beta \log \frac{\pi (y_{l}|x)}{\pi_{\text{dSFT}}(y_{l}|x)} \right)\)

Starting from the dSFT model $\pi_{\text{dSFT}}$, for each triple $(x, y_{w}, y_{l})$:

  1. Compute the probability for $(x, y_{w})$ and $(x, y_{l})$ from the dSFT model $\pi_{\text{dSFT}}$ (forward pass only).
  2. Compute the probability for $(x, y_{w})$ and $(x, y_{l})$ from the DPO model $\pi$.
  3. Compute the above DPO objective and update the weights of the DPO model $\pi$ (a minimal loss sketch follows this list).
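A minimal PyTorch sketch of steps 1-3, assuming the summed per-sequence log-probabilities $\log \pi(y|x)$ have already been computed for both the policy and the frozen dSFT reference model; the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (x, y_w, y_l) triples.

    Each argument is a tensor of log pi(y|x) values (sum of token log-probs) for the
    preferred (w) or dispreferred (l) response under the policy or the dSFT reference.
    """
    # Beta-scaled difference of log-ratios, i.e. the argument of sigma in the objective.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # Maximizing E[log sigma(.)] is the same as minimizing -log sigma(.).
    return -F.logsigmoid(logits).mean()

# Step 1 (reference log-probs) is a forward pass with no gradient:
# with torch.no_grad():
#     ref_logp_w, ref_logp_l = ...  # from the frozen pi_dSFT model
```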

Experiments

Model settings

  • The base LLM is Mistral 7B.
  • They used DeepSpeed ZeRO-3 and FlashAttention-2 to optimize memory and training speed.
  • All models are trained with the AdamW optimizer and no weight decay (see the sketch after this list).
  • Parameter-efficient techniques such as LoRA were not used.
  • All experiments were run on 16 A100s using bfloat16, taking 2-4 hours to complete.
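A hedged sketch of the optimizer and precision settings above (bfloat16 weights, AdamW with no weight decay, full fine-tuning); the DeepSpeed ZeRO-3 / FlashAttention-2 stack is not reproduced here, and the learning rate shown is the dSFT peak value from the next subsection.

```python
import torch
from transformers import AutoModelForCausalLM

# Full fine-tuning (no LoRA), with weights loaded in bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

# AdamW with no weight decay; 2e-5 is the dSFT peak learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.0)
```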

Instruction fine-tuning (dSFT) training data

  • The authors used the UltraChat corpus. This consists of 1.47M multi-turn dialogues generated by GPT-3.5-turbo over 30 topics. The Zephyr authors fixed some grammatical errors and applied helpfulness-focused filters to remove undesired model responses, resulting in a dataset of 200K examples.
  • Train for 1 epoch. Use a cosine learning rate scheduler, with a peak learning rate of 2e-5 and 10% warmup steps (see the scheduler sketch after this list).
  • Batch size of 512, use packing with a sequence length of 2048 tokens.
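A small sketch of the dSFT schedule (cosine with 10% warmup) using the `get_cosine_schedule_with_warmup` helper from transformers; the stand-in optimizer and the total step count are placeholders, not values from the paper.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Stand-in parameters/optimizer so the snippet runs on its own; in practice this is
# the AdamW optimizer over the full model from the previous sketch.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=2e-5, weight_decay=0.0)

total_steps = 400                      # placeholder: roughly 200K examples / batch size 512, 1 epoch
warmup_steps = int(0.1 * total_steps)  # 10% warmup, as listed above

scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```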

DPO preference fine-tuning training data

  • The authors used the UltraFeedback corpus. This consists of 64K prompts, each of which has four LLM responses that are rated by GPT-4 on four criteria: instruction-following, helpfulness, honesty, and truthfulness. The response with the highest mean criteria score is designated as the preferred response, and one of the remaining three responses is chosen at random as the dispreferred response.
  • From the SFT model, train for 3 epochs. Use a linear learning rate scheduler, with a peak learning rate of 5e-7 and 10% warmup steps (see the sketch after this list).
  • Batch size of 32. Use $\beta=0.1$ in the DPO optimization equation.
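The DPO stage mirrors the dSFT setup but with a linear schedule and a much lower peak learning rate; below is a small sketch using transformers' `get_linear_schedule_with_warmup`, where the stand-in optimizer and step counts are placeholders.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in optimizer; in practice AdamW over the dSFT model's parameters, peak lr 5e-7.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=5e-7, weight_decay=0.0)

total_steps = 6_000                    # placeholder: ~64K triples / batch size 32, 3 epochs
warmup_steps = int(0.1 * total_steps)  # 10% warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

# beta = 0.1 here corresponds to the `beta` argument of the dpo_loss sketch above.
```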

Evaluation Datasets

  • MT-Bench: a multi-turn benchmark:
    • 160 questions across 8 different areas of knowledge
    • Model must answer an initial question, and then provide a second response to a predefined followup question.
    • Each model response is rated by GPT-4 on a 1-10 scale. The final score is the mean over the two turns.
  • AlpacaEval: a single-turn benchmark:
    • Model must generate a response to 805 questions on different topics, mostly focused on helpfulness.
    • Responses are scored by GPT-4. The metric is the pairwise win rate against text-davinci-003.
  • Open LLM Leaderboard:
    • Performance over 4 multiclass classification tasks: ARC, HellaSwag, MMLU, and TruthfulQA.
    • Although this does not directly measure the conversational quality of models, it does validate whether fine-tuning has introduced regressions in the base model’s reasoning and truthfulness capabilities.

Evaluation Results

  • Zephyr-7B-SFT-DPO outperforms other 7B models.
  • Worse than GPT-3.5-turbo on MT-Bench, but better on AlpacaEval.
  • Worse than Claude-2 and GPT-4.
  • Is competitive with 40B scale models on the LLM Leaderboard classification tasks.
  • They also performed ablations: DPO directly on the base model without dSFT, and dSFT fine-tuning alone without DPO. The conclusion is that it is necessary to first do instruction fine-tuning, then preference optimization.
Written on December 10, 2023