# Stanford CS234: From RLHF to Direct Preference Optimization (DPO)

**Video Category:** Machine Learning Lecture

## 📋 0. Video Metadata

**Video Title:** Lecture 9: RLHF and Guest Lecture on DPO
**YouTube Channel:** Stanford Engineering
**Publication Date:** Spring 2024
**Video Duration:** ~1 hour 18 minutes

## 📝 1. Core Summary (TL;DR)

This lecture breaks down the mechanisms used to align Large Language Models (LLMs) with human intent. It details the standard three-stage Reinforcement Learning from Human Feedback (RLHF) pipeline, highlighting the complexities and instabilities introduced by using Proximal Policy Optimization (PPO) alongside a separate reward model. The core of the lecture introduces Direct Preference Optimization (DPO), a mathematical reformulation that eliminates the need for a separate reward model and RL phase by directly optimizing the language model's policy using a simple classification loss on preference pairs.

## 2. Core Concepts & Frameworks

* **Bradley-Terry Model**
  -> **Meaning:** A probabilistic model used to predict the outcome of pairwise comparisons. It assumes a human makes noisy comparisons, where the probability that option $y_w$ (winner) is preferred over $y_l$ (loser) given prompt $x$ is defined as: $P(y_w \succ y_l | x) = \frac{\exp(r(x, y_w))}{\exp(r(x, y_w)) + \exp(r(x, y_l))}$, where $r$ is a latent reward function.
  -> **Application:** Used in RLHF to train a reward model; the model learns to assign scalar scores to outputs such that the log-likelihood of the human preference dataset is maximized (see the loss sketch after this list).
* **The RLHF Pipeline (e.g., InstructGPT)**
  -> **Meaning:** A 3-step process to align LLMs. Step 1: Supervised Fine-Tuning (SFT) on high-quality demonstrations to create a reference model. Step 2: Train a Reward Model (RM) on pairwise human preferences using the Bradley-Terry model. Step 3: Optimize the SFT policy using RL (usually PPO) to maximize the RM score, constrained by a Kullback-Leibler (KL) divergence penalty to prevent the model from drifting too far from the original SFT policy.
  -> **Application:** This is the foundational architecture used to train models like ChatGPT to follow instructions safely and helpfully.
* **Direct Preference Optimization (DPO)**
  -> **Meaning:** An algorithm that bypasses the RM and RL steps of RLHF. It relies on the mathematical proof that the optimal policy $\pi^*$ under the RLHF objective can be written in closed form. By rearranging this formula, the unknown reward function can be expressed entirely in terms of the optimal policy and the reference policy: $r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$. Plugging this implicit reward back into the Bradley-Terry loss causes the intractable partition function $Z(x)$ to cancel out, leaving a stable binary cross-entropy loss directly on the policy parameters.
  -> **Application:** Used to fine-tune models (like Llama 3 and Mistral) directly on preference data (e.g., "Choose response A over response B") using standard supervised learning techniques, resulting in a simpler and more stable training loop than PPO.
* **Reward Hacking (Overoptimization)**
  -> **Meaning:** A failure mode in RL where a policy learns to exploit inaccuracies or "loopholes" in an imperfect proxy reward model to achieve artificially high scores, while the true utility (quality) of the output actually degrades.
  -> **Application:** In RLHF, if PPO optimizes an LLM too heavily against a trained reward model (indicated by a high KL divergence from the reference model), the LLM might learn to output endlessly long, repetitive, or sycophantic text that the proxy RM mistakenly scores highly.
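As a concrete companion to the Bradley-Terry entry above, here is a minimal sketch of the pairwise negative log-likelihood used to train the reward model in step 2 of the pipeline. It assumes PyTorch, and the tensor values are toy placeholder scores rather than outputs of a real reward model.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preference pairs under Bradley-Terry.

    P(y_w > y_l | x) = exp(r_w) / (exp(r_w) + exp(r_l)) = sigmoid(r_w - r_l),
    so maximizing the likelihood means minimizing -log sigmoid(r_w - r_l).
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar scores the reward model assigned to the preferred (y_w)
# and dispreferred (y_l) completions for a batch of three prompts.
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.4, 0.9, -0.5])
print(bradley_terry_loss(r_w, r_l))  # shrinks as r_w is pushed above r_l
```

Note that adding the same constant to both rewards leaves the loss unchanged, which is exactly the shift-invariance property tested in the class poll described below.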
## 3. Evidence & Examples (Hyper-Specific Details)

* **Bradley-Terry Model Properties (Class Poll):** Professor Emma Brunskill tests the class on the properties of the Bradley-Terry model.
  * *True:* It expresses the probability of selecting option $b_i$ over $b_j$.
  * *True:* It can learn a model of the reward function from preference tuples.
  * *True:* The resulting reward function can be shifted by any constant ($+C$) without changing preferences because the exponentials cancel out in the fraction.
  * *False:* Multiplying the reward function by a negative constant does not preserve preferences; it explicitly flips them.
  * *False:* In RLHF, the reward model is *not* updated after each PPO rollout; it is trained once in step 2 and frozen during step 3.
* **Reward Model Scaling Laws (Stiennon et al., 2020):** A chart demonstrates that to capture human preferences accurately, reward models must be scaled up. Validation accuracy on predicting human judgments improves steadily as model size increases from $10^6$ to $10^{10}$ parameters, and as dataset size increases from 8k to 64k examples. The 64k data line on the largest model approaches the theoretical human baseline agreement rate of ~0.73.
* **RLHF vs. Baselines (InstructGPT, Ouyang et al., 2022):** A graph shows the fraction of times a model's output is preferred by humans compared to a reference summary. The PPO-trained model ($p^{RL}$) consistently scores around 0.6 to 0.7, drastically outperforming the Supervised Fine-Tuning model ($p^{SFT}$), which hovers around 0.45, across parameter scales from 1.3B to 175B.
* **Best-of-N Baseline Performance:** A table of controlled comparisons shows that "Best-of-N" (generating N samples from the policy and selecting the one with the highest Reward Model score) is a formidable non-RL baseline. While PPO achieves a 61.4% simulated win rate, Best-of-N achieves 45.0%, significantly beating basic Supervised Fine-Tuning (36.6%).
* **DPO Derivation Math:** Eric Mitchell explicitly demonstrates the math on-screen. The loss function for DPO is derived as: $\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l)} [\log \sigma (\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)})]$. The $\beta$ term controls the strength of the KL constraint (see the loss sketch after this list).
* **Reward Hacking Evidence (Gao et al., 2023):** A synthetic experiment using the IMDB sentiment generation dataset. A pre-trained sentiment classifier is used as the "Gold" (true) reward, and a separate "Proxy" reward model is trained on synthetic data generated from the Gold model. As the KL divergence from the reference policy increases (x-axis), the Proxy reward goes up continuously. However, the Gold reward peaks at a KL of ~5 and then crashes. Histograms reveal the mechanism: the model learns that simply making the text longer increases the proxy reward score, leading to massive, artificially verbose outputs.
* **RewardBench (Lambert et al., 2024):** A leaderboard evaluating how well LLMs function as reward models. DPO-trained models (like `mistral-instruct` and `zephyr`) dominate the top of the "Chat" capability section. Because DPO models contain an implicit reward function ($r = \beta \log(\pi_\theta/\pi_{ref})$), they can be evaluated directly on classification tasks without explicitly training a scalar reward head.
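To make the derivation concrete, here is a minimal sketch of the DPO loss shown above, assuming PyTorch. The log-probabilities are toy placeholders rather than values computed from real models; the implicit reward for each response is $\beta \log(\pi_\theta/\pi_{ref})$, and the partition function $Z(x)$ cancels in their difference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO binary cross-entropy loss over a batch of preference pairs.

    Each argument is the summed log-probability of the chosen (w) or
    rejected (l) completion under the trainable policy or the frozen
    reference (SFT) model.
    """
    chosen_reward = beta * (policy_logp_w - ref_logp_w)    # beta*log(pi/pi_ref) for y_w
    rejected_reward = beta * (policy_logp_l - ref_logp_l)  # beta*log(pi/pi_ref) for y_l
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with placeholder sequence log-probabilities (in practice these
# are sums of per-token log-probs from the policy and reference models).
pi_w, pi_l = torch.tensor([-12.0, -30.0]), torch.tensor([-15.0, -28.0])
ref_w, ref_l = torch.tensor([-13.0, -29.0]), torch.tensor([-14.0, -29.5])
print(dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1))
```

A larger $\beta$ penalizes deviation from the reference model more strongly, mirroring the KL constraint in the original RLHF objective.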
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Always implement a Best-of-N baseline before utilizing RLHF.** Before attempting the complex infrastructure of PPO, build a system that generates multiple responses from an SFT model and uses your trained reward model to simply select the highest-scoring output. This is a highly stable, inference-time scaling technique.
* **Rule 2: Monitor KL divergence to detect reward hacking.** When running PPO or DPO, continuously track the KL divergence between your training policy and the reference SFT policy. If KL grows too large without corresponding increases in human-evaluated quality, your model is exploiting the proxy objective. Increase the $\beta$ parameter (the KL penalty coefficient) to enforce a stricter constraint.
* **Rule 3: Use Direct Preference Optimization (DPO) to eliminate the reward modeling stage.** If you have a dataset of preference pairs, skip the separate RM and PPO stages. Fine-tune your SFT model directly using the DPO binary cross-entropy loss. This requires only running forward passes on the reference model and the active model, drastically simplifying the training loop and memory requirements.
* **Rule 4: Extract implicit rewards directly from DPO policies.** You do not need to train a separate reward model if you have a DPO-trained policy. You can calculate the reward score of any text completion $y$ given prompt $x$ by calculating the log-probability of $y$ under the DPO model, subtracting the log-probability under the base reference model, and multiplying by $\beta$ (see the sketch after this list).
* **Rule 5: Ensure sufficient scale for reward modeling.** Do not expect a small, simple network (e.g., a linear model or small MLP) to capture nuanced human preferences accurately. Capturing human intent requires models with billions of parameters trained on tens of thousands of preference pairs.
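As a concrete illustration of Rules 1 and 4, the sketch below (assuming PyTorch; the log-probability values are toy placeholders) computes the implicit DPO reward for several sampled completions and selects the best of N. Note that Rule 1 as stated in the lecture ranks samples with a separately trained reward model; using the implicit reward here is simply a convenient stand-in when a DPO policy is already available.

```python
import torch

def implicit_reward(policy_logp: torch.Tensor, ref_logp: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Rule 4: r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    up to a prompt-dependent constant that does not affect rankings."""
    return beta * (policy_logp - ref_logp)

def best_of_n(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
              beta: float = 0.1) -> int:
    """Rule 1 flavor: score N sampled completions for one prompt and
    return the index of the highest-scoring one."""
    scores = implicit_reward(policy_logps, ref_logps, beta)
    return int(torch.argmax(scores))

# Toy usage: sequence log-probs of four sampled completions under the
# DPO policy and the frozen reference model.
logp_policy = torch.tensor([-20.0, -18.5, -25.0, -19.0])
logp_ref = torch.tensor([-21.0, -22.0, -24.0, -19.5])
print(best_of_n(logp_policy, logp_ref))  # index of the preferred sample
```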
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Asking humans for absolute scalar ratings (e.g., "rate this 1 to 10").
  -> **Why it fails:** Human judgments are noisy, uncalibrated, and inconsistent across different raters, leading to high variance in the training signal.
  -> **Warning sign:** The reward model fails to converge or exhibits low validation accuracy against held-out human data. *Solution: Use pairwise comparisons (A is better than B).*
* **Pitfall:** Assuming DPO can discover entirely new, high-quality behaviors.
  -> **Why it fails:** DPO is inherently an offline algorithm operating on a static dataset. It does not actively explore the state-action space during training; it only shifts probability mass between the specific preferred and dispreferred responses present in the training data.
  -> **Warning sign:** The model learns to rank the provided data perfectly but fails to generate novel, high-quality responses when given out-of-distribution prompts in production.
* **Pitfall:** Using DPO on datasets where preferences are contradictory or multi-objective without conditioning.
  -> **Why it fails:** DPO assumes a single, transitive latent reward function. If the dataset mixes preferences based on different criteria (e.g., sometimes favoring conciseness, sometimes favoring extreme detail), the model will average them out into a suboptimal middle ground.
  -> **Warning sign:** The DPO loss plateaus at a high value, and the resulting model's behavior is erratic or perfectly average across conflicting styles.
* **Pitfall:** Averaging the weights of multiple DPO models to create a stronger model (Model Merging).
  -> **Why it fails:** While weight averaging works well for SFT models, DPO models represent an exponentiated reward difference. Averaging their weights does not mathematically equate to averaging their implicit reward functions, leading to unpredictable, out-of-distribution behavior.
  -> **Warning sign:** A merged DPO model performs worse on benchmarks than its individual constituent models.

## 6. Key Quote / Core Insight

"DPO fits an implicit reward function. What that equation is saying is we are writing the reward in terms of the optimal policy itself. If we just parameterize our policy using a neural network, we can plug that into the exact same Bradley-Terry likelihood we used to train the reward model, and the intractable partition function completely cancels out."

**Insight:** The mathematical brilliance of DPO is recognizing that you don't need a separate network to act as a judge (the reward model) to train a generator (the policy). Because the optimal policy is mathematically tied to the reward, you can rewrite the preference loss function so that the language model *itself* acts as the judge and generator simultaneously, removing the need for unstable reinforcement learning algorithms.

## 7. Additional Resources & References

* **Resource:** "Learning to summarize from human feedback" (Stiennon et al., 2020) - **Type:** Paper - **Relevance:** Foundational work demonstrating scaling laws for reward models.
* **Resource:** "Training language models to follow instructions with human feedback" (InstructGPT, Ouyang et al., 2022) - **Type:** Paper - **Relevance:** Details the standard three-stage RLHF pipeline.
* **Resource:** "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov, Sharma, Mitchell, et al., 2023) - **Type:** Paper - **Relevance:** The core paper introducing the DPO algorithm.
* **Resource:** "Scaling Laws for Reward Model Overoptimization" (Gao et al., 2023) - **Type:** Paper - **Relevance:** Demonstrates the mechanics of reward hacking and the necessity of KL penalties.
* **Resource:** RewardBench (Lambert et al., 2024) - **Type:** Leaderboard/Benchmark - **Relevance:** A benchmark for evaluating the capability and safety of reward models and DPO-trained implicit reward models, hosted on Hugging Face.