Direct Preference Optimization
Existing methods typically steer LMs to match human preferences using reinforcement learning from human feedback (RLHF). RLHF (i) fits a reward model $r$ to a dataset of human preferences, and (ii) uses RL to optimize a language model policy to produce responses that are assigned high reward by $r$, while not drifting excessively far from the original model. The paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (May 29, 2023) eliminates the need to fit a reward model and instead fine-tunes the LM directly to align with human preferences.
Summary of Approach
In current RLHF, the commonly used objective combines maximizing a reward $r(x,y)$ for prompts $x$ drawn from a dataset $D_{p}$ with a KL-divergence penalty between the language model policy $\pi_{\theta}$ and its initialization $\pi_{\text{ref}}$ (an instruction-tuned model): \(\tag{eq:RL} \text{max}_{\pi_{\theta}} \mathbb{E}_{x \sim D_{p}, y \sim \pi_{\theta}(y|x)} \left[ r(x,y) - \beta \text{ log} \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)} \right]\)
The KL-divergence penalty term is included because past work has observed that unconstrained reward maximization can lead to overoptimization, yielding a policy that achieves high reward but is not aligned with the intended behavior.
The paper proposes an approach for directly fine-tuning LLMs on a dataset of preference pairs. For each response pair $(y_{w}, y_{l})$ where $y_{w}$ is preferred over $y_{l}$ (denoted $y_{w} \succ y_{l}$), the probability of observing that preference is assumed to follow a Bradley-Terry model: \(p(y_{w} \succ y_{l}) = \sigma(r(x, y_{w}) - r(x, y_{l}))\)
The paper shows that the optimal policy $\pi^{*}$ for the RL objective in Equation $(\text{eq:RL})$ can be found by optimizing a classification loss computed directly on the preference data: \(L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = - \mathbb{E}_{x, y_{w}, y_{l} \sim D} \left[ \text{log} \sigma \left( \beta \text{ log} \frac{\pi_{\theta}(y_{w} | x)}{\pi_{\text{ref}}(y_{w}|x)} - \beta \text{ log} \frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)} \right) \right]\)
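To make the loss concrete, here is a minimal PyTorch-style sketch, assuming the per-sequence log-probabilities $\text{log}\,\pi_{\theta}(y|x)$ and $\text{log}\,\pi_{\text{ref}}(y|x)$ have already been computed (e.g., by summing token log-probabilities over the response). The function name and argument names are illustrative, not the paper’s reference implementation.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch.

    Each argument is a tensor of per-sequence log-probabilities,
    i.e. log pi(y|x) summed over the tokens of the response.
    """
    # beta * log(pi_theta / pi_ref) for the preferred and dispreferred responses
    chosen_logratio = beta * (policy_logp_w - ref_logp_w)
    rejected_logratio = beta * (policy_logp_l - ref_logp_l)
    # -log sigma(chosen - rejected), averaged over the batch
    return -F.logsigmoid(chosen_logratio - rejected_logratio).mean()
```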
Existing RLHF Pipeline
Supervised Instruction Fine-Tuning
Starting from a pre-trained LM, the first step is to perform supervised instruction fine-tuning to obtain a model $\pi^{SFT}$.
Reward Modeling
- The $\pi^{SFT}$ model is prompted with prompts $x$ to produce pairs of answers $(y_1, y_2) \sim \pi^{SFT}(y|x)$. For each answer pair, let $y_{w}$ and $y_{l}$ denote the preferred and dispreferred answer respectively.
- Initialize a reward model $r_{\phi}(x,y)$ from $\pi^{SFT}(y|x)$.
- Then, given a static dataset of comparisons \(D = \{ x^{(i)}, y_{w}^{(i)}, y_{l}^{(i)} \}_{i=1}^{N}\), we train $r_{\phi}(x,y)$ as a binary classifier to minimize the following loss (a minimal sketch follows this list): \(L_{R}(r_{\phi}, D) = - \mathbb{E}_{x, y_{w}, y_{l} \sim D} \left[ \text{log } \sigma(r_{\phi}(x, y_{w}) - r_{\phi}(x, y_{l})) \right]\)
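A minimal sketch of this pairwise loss, assuming the reward model has already produced scalar scores $r_{\phi}(x, y_{w})$ and $r_{\phi}(x, y_{l})$ for a batch of comparisons (names are illustrative):

```python
import torch.nn.functional as F

def reward_model_loss(reward_w, reward_l):
    """Bradley-Terry pairwise ranking loss for the reward model.

    reward_w, reward_l: tensors of rewards r_phi(x, y_w) and r_phi(x, y_l)
    for a batch of comparisons.
    """
    # -log sigma(r_w - r_l), averaged over the batch
    return -F.logsigmoid(reward_w - reward_l).mean()
```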
RL Fine-Tuning
- We initialize a language model policy $\pi_{\theta}$ from the instruction fine-tuned model $\pi^{SFT}$.
- We use the reward function $r_{\phi}$ to provide feedback to the language model. In particular, we optimize:
\(\text{max}_{\pi_{\theta}} \mathbb{E}_{x \sim D, y \sim \pi_{\theta}(y|x)} [r_{\phi}(x,y)] - \beta \mathbb{D}_{KL}[\pi_{\theta}(y|x) \Vert \pi_{\text{ref}}(y|x)]\)
- where $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{\text{ref}}$ (which is the SFT model $\pi^{SFT}$); see the sketch below for the per-sample training signal.
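In practice this objective is typically optimized with an RL algorithm such as PPO, using a per-sample estimate of the bracketed quantity as the training signal. A minimal sketch of that estimate, assuming sequence-level log-probabilities are available (names are illustrative):

```python
def kl_penalized_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Per-sample estimate of the RL fine-tuning objective.

    reward:      r_phi(x, y) for a sampled response y ~ pi_theta(.|x)
    policy_logp: log pi_theta(y|x), summed over response tokens
    ref_logp:    log pi_ref(y|x), summed over response tokens
    """
    # r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x));
    # averaging this over samples estimates the KL-penalized objective
    return reward - beta * (policy_logp - ref_logp)
```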
Direct Preference Optimization
The authors propose to skip the reward-modeling step and directly optimize a copy of $\pi^{SFT}$ on preference data.
The overall goal is to optimize the following objective: \(\begin{equation} \tag{1} \text{max}_{\pi} \mathbb{E}_{x \sim D, y \sim \pi} [ r(x,y) ] - \beta \mathbb{D}_{KL} [ \pi (y|x) \Vert \pi_{\text{ref}}(y|x) ] \end{equation}\)
Through the derivation in the next section, we can show that the optimal solution to the KL-constrained reward-maximization objective takes the form: \(\begin{equation} \tag{2} \pi(y|x) = \pi^{*}(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \text{ exp}\left( \frac{1}{\beta} r(x,y) \right) \end{equation}\) where the normalizer $Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \text{ exp} \left( \frac{1}{\beta} r(x,y) \right)$ ensures that $\pi^{*}(\cdot|x)$ is a valid probability distribution.
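To build intuition for Equation $(2)$, here is a tiny numeric sketch over a hypothetical three-response candidate set; the reference probabilities and rewards are made up purely for illustration:

```python
import math

beta = 1.0
pi_ref = [0.5, 0.3, 0.2]   # hypothetical pi_ref(y|x) over three candidate responses
reward = [1.0, 0.0, -1.0]  # hypothetical rewards r(x, y)

# unnormalized weights pi_ref(y|x) * exp(r(x, y) / beta)
unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
Z = sum(unnorm)                        # Z(x): normalizer so that pi* sums to 1
pi_star = [u / Z for u in unnorm]
print([round(p, 3) for p in pi_star])  # higher-reward responses are upweighted relative to pi_ref
```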
Rearranging Equation $(2)$ and taking $\text{log}$ on both sides, we can express the reward in terms of the optimal policy: \(\begin{aligned} \frac{\pi^{*}(y|x) Z(x)}{\pi_{\text{ref}}(y|x)} & = \text{exp}\left( \frac{1}{\beta} r(x,y) \right) \\ r(x,y) & = \beta \text{ log}\left( \frac{\pi^{*}(y|x) Z(x)}{\pi_{\text{ref}}(y|x)} \right) \\ & = \beta \text{ log}\left( \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} \right) + \beta \text{ log} Z(x) \end{aligned}\)
Under the Bradley-Terry preference model, the probability that a completion $y_{w}$ is preferred to a completion $y_{l}$ is formulated as: \(\begin{aligned} p(y_{w} \succ y_{l} | x) & = \frac{\text{exp}(r(x, y_{w}))}{\text{exp}(r(x, y_{w})) + \text{exp}(r(x, y_{l}))}\\ & = \frac{1}{1 + \frac{\text{exp}(r(x, y_{l}))}{\text{exp}(r(x, y_{w}))}}\\ & = \frac{1}{1 + \text{exp}(r(x, y_{l}) - r(x, y_{w}))} \\ & = \sigma(r(x, y_{w}) - r(x, y_{l})) \end{aligned}\)
Earlier, we had derived: $r(x, y) = \beta \text{ log} \left( \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} \right) + \beta \text{ log} Z(x)$. Substituting this into the above equation: \(\begin{aligned} p(y_{w} \succ y_{l} | x) & = \sigma \left( \left[ \beta \text{ log} \frac{\pi^{*}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)} + \beta \text{ log} Z(x) \right] - \left[ \beta \text{ log} \frac{\pi^{*}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)} + \beta \text{ log} Z(x) \right] \right) \\ & = \sigma \left( \beta \text{ log} \frac{\pi^{*}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)} - \beta \text{ log} \frac{\pi^{*}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)} \right) \end{aligned}\)
The final policy objective becomes: \(L_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = - \mathbb{E}_{(x, y_{w}, y_{l}) \sim D} \left[ \text{log} \sigma \left( \beta \text{ log} \frac{\pi_{\theta}(y_{w} | x)}{\pi_{\text{ref}}(y_{w} | x)} - \beta \text{ log} \frac{\pi_{\theta}(y_{l} | x)}{\pi_{\text{ref}}(y_{l} | x)} \right) \right]\)
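This derivation is also where the paper’s title comes from: the fine-tuned policy implicitly defines a reward $\hat{r}(x,y) = \beta \text{ log} \frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}$, which matches the true reward up to the prompt-only term $\beta \text{ log} Z(x)$ that cancels when comparing responses to the same prompt. A minimal sketch of scoring two candidate responses with this implicit reward (values and names are illustrative):

```python
import torch

def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """Implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)).

    The beta * log Z(x) term is omitted: it depends only on the prompt x,
    so it cancels when comparing two responses to the same prompt.
    """
    return beta * (policy_logp - ref_logp)

# ranking two hypothetical candidate responses for the same prompt
r_a = implicit_reward(torch.tensor(-12.3), torch.tensor(-14.0))
r_b = implicit_reward(torch.tensor(-15.1), torch.tensor(-14.9))
print(bool(r_a > r_b))  # True: the policy implicitly prefers response a
```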
Derivation
- Let’s focus on rewriting $\mathbb{D}_{KL} [ \pi (y|x) \Vert \pi_{\text{ref}}(y|x) ]$
- Recall that KL divergence measures how one probability distribution diverges from a second, reference distribution. For two probability distributions $P$ and $Q$ over the same event space: \(\mathbb{D}_{KL}(P \Vert Q) = \sum_{i} P(i) \text{ log} \left( \frac{P(i)}{Q(i)} \right)\)
- In our context, $P = \pi(y|x)$ and $Q = \pi_{\text{ref}}(y|x)$, therefore: \(\mathbb{D}_{KL} [ \pi (y|x) \Vert \pi_{\text{ref}}(y|x) ] = \sum_{y} \pi(y|x) \text{ log} \left( \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right)\)
- Now recall that for a discrete random variable $Y$ with probability mass function $f$, $\mathbb{E}[g(Y)] = \sum_{y} g(y) f(y)$. In our context:
- $g(y) = \text{log} \left( \frac{\pi (y|x)}{\pi_{\text{ref}}(y|x)} \right)$
- $f(y) = \pi(y|x)$
- Hence: \(\sum_{y} \pi(y|x) \text{ log} \left( \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right) = \mathbb{E}_{y \sim \pi(y|x)} \left[ \text{log} \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]\)
- which denotes the expected value of the logarithm of the ratio of $\pi / \pi_{\text{ref}}$, where $y$ is drawn from the distribution $\pi(y|x)$.
- Now, we have: \(\begin{aligned} (1) & = \text{max}_{\pi} \mathbb{E}_{x \sim D, y \sim \pi} [ r(x,y) ] - \beta \mathbb{E}_{y \sim \pi(y|x)} \left[ \text{log} \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right] \\ & = \text{max}_{\pi} \mathbb{E}_{x \sim D} \mathbb{E}_{y \sim \pi(y|x)} \left[ r(x,y) - \beta \text{ log} \frac{\pi (y|x)}{\pi_{\text{ref}}(y|x)} \right] \end{aligned}\)
- Dividing by $-\beta$ (which turns the maximization into a minimization), we have: \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \mathbb{E}_{y \sim \pi (y|x)} \left[ -\frac{1}{\beta}r(x,y) + \text{log} \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]\)
- Using the identity $a = \text{log exp}(a)$: \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \mathbb{E}_{y \sim \pi (y|x)} \left[ \text{log} \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \text{log exp} \left( \frac{1}{\beta}r(x,y) \right) \right]\)
- Adding and subtracting $\text{log}Z(x)$: \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \mathbb{E}_{y \sim \pi (y|x)} \left[ \text{log} \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} - \text{log exp} \left( \frac{1}{\beta}r(x,y) \right) + \text{log}Z(x) - \text{log}Z(x) \right]\)
- Note: $\text{log}Z(x) = \text{log}\left(\frac{1}{Z(x)}\right)^{-1} = -\text{log}\frac{1}{Z(x)}$, so the $+\text{log}Z(x)$ term can be absorbed into the first log ratio as a factor of $\frac{1}{Z(x)}$ in the denominator.
- Let’s now define $\pi^{*}(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \text{ exp}\left( \frac{1}{\beta} r(x,y) \right)$:
- We would like $\pi^{*}$ to be a valid probability distribution, i.e. $\sum_{y} \pi^{*}(y|x)=1.0$.
- Hence, $Z(x)$ is a normalization factor: \(Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \text{ exp} \left( \frac{1}{\beta} r(x,y) \right)\)
- We now have: \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \mathbb{E}_{y \sim \pi (y|x)} \left[ \text{log} \frac{\pi (y|x)}{\pi^{*}(y|x)} - \text{log} Z(x)\right]\)
- Since $Z(x)$ does not depend on $y$, we can pull $\text{log}Z(x)$ outside the inner expectation $\mathbb{E}_{y \sim \pi(y|x)}$: \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \left[ \mathbb{E}_{y \sim \pi (y|x)} \left[ \text{log} \frac{\pi (y|x)}{\pi^{*}(y|x)} \right] - \text{log} Z(x)\right]\)
- Using the KL-divergence definition, and noting that $\text{log}Z(x)$ does not depend on $\pi$ (so it does not affect the optimization and can be treated as a constant): \(\text{min}_{\pi} \mathbb{E}_{x \sim D} \left[ \mathbb{D}_{KL} (\pi(y|x) \Vert \pi^{*}(y|x)) \right] - \mathbb{E}_{x \sim D} \left[ \text{log} Z(x) \right]\)
- Since KL divergence is non-negative and equals zero if and only if the two distributions are identical, the minimum is attained at $\pi(y|x) = \pi^{*}(y|x)$, which is exactly the optimal policy stated in Equation $(2)$.
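As a sanity check on this conclusion, here is a small numeric sketch over a toy discrete case: the analytically constructed $\pi^{*}$ should attain a value of the objective in Equation $(1)$ at least as high as any other distribution (the reference probabilities and rewards are made up for illustration):

```python
import math
import random

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # hypothetical pi_ref(y|x)
reward = [1.0, 0.0, -1.0]  # hypothetical r(x, y)

def objective(pi):
    # E_{y ~ pi}[ r(x, y) ] - beta * KL(pi || pi_ref), for a single prompt x
    return sum(p * (r - beta * math.log(p / q))
               for p, r, q in zip(pi, reward, pi_ref) if p > 0)

# closed-form optimum: pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta)
unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
pi_star = [u / sum(unnorm) for u in unnorm]

# no randomly drawn distribution should beat pi*
best = objective(pi_star)
for _ in range(1000):
    w = [random.random() for _ in range(3)]
    pi = [v / sum(w) for v in w]
    assert objective(pi) <= best + 1e-9
print("pi* maximizes the KL-regularized objective:", [round(p, 3) for p in pi_star])
```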