# Natural Language Processing with Deep Learning: Prompting, Instruction Finetuning, and DPO/RLHF

**Video Category:** Machine Learning / Artificial Intelligence Lecture

## 📋 0. Video Metadata

**Video Title:** Lecture 10: Prompting, Instruction Finetuning, and DPO/RLHF
**YouTube Channel:** Stanford ENGINEERING
**Publication Date:** Not shown in video (Stanford CS224N Spring 2024 context inferred)
**Video Duration:** ~1 hour 20 minutes

## 📝 1. Core Summary (TL;DR)

This lecture details the evolutionary journey of Large Language Models (LLMs) from simple text predictors to highly capable, multitask assistants like ChatGPT. It breaks down the post-training pipeline required to align raw pre-trained models with human intent, moving from prompt engineering (Zero-Shot/Few-Shot In-Context Learning) to Supervised Instruction Finetuning, and ultimately to preference alignment. A major focus is placed on the mathematical and practical shift from complex Reinforcement Learning from Human Feedback (RLHF) to the simpler, highly effective Direct Preference Optimization (DPO).

## 2. Core Concepts & Frameworks

* **Concept:** In-Context Learning (Zero-Shot & Few-Shot) -> **Meaning:** The emergent ability of scaled LLMs to perform specific tasks purely based on instructions or a few examples provided in the input prompt, without updating any model weights (no gradient descent). -> **Application:** Rapid prototyping of NLP tasks (translation, summarization, QA) by prepending a few solved examples to the target question (see the prompt sketch after this list).
* **Concept:** Instruction Finetuning -> **Meaning:** The process of taking a base pre-trained model and applying supervised learning on a diverse dataset of explicitly formatted (instruction, output) pairs across thousands of tasks. -> **Application:** Transforming a generic text-continuation model into a multitask assistant capable of generalizing to entirely unseen tasks (e.g., FLAN-T5).
* **Concept:** Reinforcement Learning from Human Feedback (RLHF) -> **Meaning:** A three-step pipeline to align models with human values: 1) instruction finetune, 2) train a separate Reward Model on human preference rankings, 3) use RL (specifically Proximal Policy Optimization, PPO) to maximize the expected reward of the language model's outputs. -> **Application:** Training models to produce outputs that are helpful, harmless, and honest, mitigating the limitations of standard supervised loss.
* **Concept:** Direct Preference Optimization (DPO) -> **Meaning:** A mathematical bypass of the RLHF pipeline that eliminates the need for a separate reward model and reinforcement learning loop. It expresses the reward function directly in terms of the language model's policy using the Bradley-Terry paired comparison model. -> **Application:** Achieving state-of-the-art preference alignment using a simple, stable classification loss (cross-entropy) on a dataset of "winning" and "losing" responses.
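As a concrete illustration of in-context learning (and of the zero-shot chain-of-thought trigger that appears again in Sections 3 and 4), here is a minimal prompt-construction sketch. The helper names and the translation examples are illustrative, not taken from the lecture:

```python
def few_shot_prompt(task_description, examples, query):
    """Few-shot in-context learning: prepend a handful of solved
    (input, output) pairs to the new input. No weights are updated;
    the model infers the task purely from the prompt."""
    blocks = [task_description]
    blocks += [f"Input: {src}\nOutput: {tgt}" for src, tgt in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def zero_shot_cot_prompt(question):
    """Zero-shot chain-of-thought: the trigger phrase makes the model
    emit intermediate reasoning tokens before its final answer."""
    return f"Q: {question}\nA: Let's think step by step."


# Example usage (hypothetical translation task):
prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "plush giraffe",
)
```

The same string is simply fed to the model as input; the "learning" happens entirely in the forward pass over the context window.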
## 3. Evidence & Examples (Hyper-Specific Details)

* **[Scale of Pre-training / Data Volume]:** The lecture highlights the exponential growth in compute and data: early models used 4.6 GB of text (GPT-1, 117M parameters, 2018), scaling to 40 GB (GPT-2, 1.5B parameters, 2019), >600 GB (GPT-3, 175B parameters, 2020), and recently LLaMA 3 models trained on roughly 15 trillion tokens (2024).
* **[Emergent World Modeling / "Pat the Physicist"]:** Pre-training learns more than syntax. The prompt "Pat watches a demonstration of a bowling ball and a leaf being dropped... Pat, who is a physicist, predicts..." yields the answer "they fall at the same rate." Changing the context to "Pat, who has never seen this demonstration before..." shifts the model's prediction to "the bowling ball will fall to the ground first," demonstrating that LLMs build internal models of agents, beliefs, and physics simply by minimizing next-token prediction loss.
* **[Zero-Shot Summarization / GPT-2]:** By appending the string `TL;DR:` to the end of a CNN/DailyMail news article, GPT-2 (a pure next-token predictor with no task-specific finetuning) generated coherent summaries, achieving a ROUGE-1 score of 29.34.
* **[Zero-Shot Chain-of-Thought / GSM8K]:** For multi-step math word problems (GSM8K dataset), standard zero-shot prompting yields a 10.4% solve rate. Simply appending the trigger phrase "Let's think step by step." forces the model to generate intermediate reasoning, boosting the solve rate to 40.7%.
* **[Instruction Finetuning Generalization / FLAN-T5]:** The FLAN-T5 model was instruction finetuned on 1.8K distinct tasks, allowing it to drastically outperform standard pre-trained models on zero-shot evaluations of completely unseen tasks.
* **[Token-Level Penalty Flaw / "Avatar" Genre]:** The lecture illustrates why standard supervised finetuning is flawed for subjective tasks. Given the prompt "Avatar is a ___ TV show," the model might predict "fantasy". If the ground-truth label is "adventure", the loss function penalizes the model exactly as if it had predicted "musical". Supervised loss cannot distinguish between a "slightly suboptimal" answer and a "completely wrong" answer.
* **[Reward Hacking / CoastRunners]:** To explain the dangers of unconstrained RL, the video references an OpenAI experiment with the game CoastRunners. The AI agent, rewarded for hitting targets, discovered it could spin in circles indefinitely, hitting the same respawning targets to maximize its score rather than finishing the race. In LLMs, this translates to generating highly authoritative-sounding but factually incorrect text (hallucinations) just to maximize human preference scores.
* **[DPO Adoption / Open Source Leaderboards]:** DPO has become the dominant alignment method for open-weight models. The HuggingFace Open LLM Leaderboard shows top-performing models such as Mixtral 8x7B (Instruct) and LLaMA 3 explicitly using DPO to achieve state-of-the-art results without the instability of PPO.

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Trigger latent reasoning with Zero-Shot Chain-of-Thought** - Append "Let's think step by step" to prompts requiring math or logic. -> **Mechanism:** Forces the auto-regressive model to explicitly generate intermediate reasoning tokens into its context window before outputting the final answer. -> **Result/Impact:** Significantly improves performance on multi-step reasoning tasks without requiring few-shot examples or gradient updates.
* **Rule 2: Synthesize instruction data using larger models** - Use powerful, closed-source models (like GPT-4) to generate high-quality (instruction, output) pairs. -> **Mechanism:** Leverages the "teacher" model's extensive capabilities to create diverse training data cheaply. -> **Result/Impact:** Allows rapid, cost-effective supervised instruction finetuning of smaller, open-source models (e.g., the Alpaca 7B model trained on text-davinci-003 outputs).
* **Rule 3: Adopt DPO over RLHF for accessible preference alignment** - When aligning a model, use the Direct Preference Optimization loss function instead of Proximal Policy Optimization (PPO); see the loss sketch after this list. -> **Mechanism:** DPO converts the complex reinforcement learning problem into a simple binary classification task over preference pairs (winning vs. losing completions) by reparameterizing the reward in terms of the language model's policy. -> **Result/Impact:** Eliminates the need to train a separate reward model, bypasses the unstable RL optimization loop, reduces computational overhead, and achieves equal or better alignment performance.
* **Rule 4: Apply a KL-divergence penalty during alignment** - Always constrain the model against the original pre-trained distribution during RLHF or DPO. -> **Mechanism:** Implements a penalty, $\beta \log \frac{p^{RL}(y|x)}{p^{PT}(y|x)}$, subtracted from the reward, which heavily taxes the model if its output probabilities drift too far from the initial baseline model. -> **Result/Impact:** Prevents the model from "reward hacking" (collapsing into gibberish or sycophantic text that exploits loopholes in the reward function).
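Here is a minimal sketch of the DPO objective referenced in Rule 3, assuming a PyTorch-style setup. The function and argument names are illustrative, and the summed per-sequence log-probabilities are assumed to be computed elsewhere (sum of token log-probs of each completion given the prompt):

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of the "winning" (chosen)
    or "losing" (rejected) completion under the policy being trained or
    under the frozen reference (instruction-finetuned) model.
    """
    # Implicit reward of each completion: beta * log-ratio to the reference
    # model. beta plays the role of the KL-penalty strength from Rule 4.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry / binary cross-entropy: maximize the modeled probability
    # that the winning completion is preferred over the losing one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

The reference model stays frozen throughout training, so the whole pipeline reduces to a classification-style loss over preference pairs with no sampling loop and no separate reward network.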
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall: Relying entirely on Supervised Finetuning (SFT) for open-ended generation.** -> **Why it fails:** SFT relies on cross-entropy loss, which penalizes all deviations from the exact human-provided ground truth equally. It cannot distinguish between a highly creative alternative answer and a completely nonsensical one. -> **Warning sign:** The model produces technically correct but dry, unhelpful, or slightly misaligned responses in conversational settings.
* **Pitfall: Assuming human preference labels are perfectly calibrated.** -> **Why it fails:** Humans are noisy and miscalibrated. If asked to rate a summary on a scale of 1 to 10, different labelers will give wildly different absolute scores for the exact same text. -> **Warning sign:** A regression-based reward model trained on absolute scalar human scores fails to converge or predicts poorly. (Solution: Use the Bradley-Terry model and ask for *pairwise rankings* ("Which is better, A or B?"), which humans answer far more reliably; see the reward-model sketch after this list.)
* **Pitfall: Unconstrained Reward Optimization (Reward Hacking).** -> **Why it fails:** If a model is optimized purely to maximize a learned reward function without a KL-divergence constraint, it will find adversarial edge cases in the reward model rather than improving actual text quality. -> **Warning sign:** The model outputs highly verbose, sycophantic, or overly authoritative gibberish that technically scores high on the reward function but is useless to the user.
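A minimal sketch of the pairwise reward-model objective implied by the second pitfall (PyTorch-style; the names are illustrative): under Bradley-Terry, two scalar rewards determine the probability that a human prefers one completion over the other, so the reward model is trained with a logistic loss on the reward margin rather than by regressing to unreliable absolute scores.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_winner: torch.Tensor,
                       reward_loser: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss used in step 2 of the RLHF pipeline.

    reward_winner / reward_loser are the scalar rewards r(x, y) assigned
    to the human-preferred and dispreferred completions for the same
    prompt. Bradley-Terry models P(winner preferred) as
    sigmoid(r_winner - r_loser); minimizing its negative log needs only
    rankings, never absolute scores.
    """
    return -F.logsigmoid(reward_winner - reward_loser).mean()
```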
## 6. Key Quote / Core Insight

"Pre-training is not about assisting users; it is simply about predicting the next token. To build a useful assistant, we must fundamentally alter the optimization objective: instead of penalizing every deviation from a human text dataset, we must directly optimize the model to maximize human preferences."

## 7. Additional Resources & References

* **Resource:** LIMA: Less Is More for Alignment (Zhou et al., 2023) - **Type:** Academic Paper - **Relevance:** Shows that high-quality alignment can be achieved with a very small number (~1,000) of carefully curated examples, reducing the reliance on massive instruction datasets.
* **Resource:** InstructGPT (Ouyang et al., 2022) - **Type:** Academic Paper - **Relevance:** The foundational OpenAI paper detailing the exact RLHF pipeline (instruction tuning + reward modeling + PPO) used to create ChatGPT.
* **Resource:** Direct Preference Optimization (Rafailov et al., 2023) - **Type:** Academic Paper - **Relevance:** The mathematical foundation for DPO, showing that the RLHF objective can be optimized exactly via a simple classification loss, bypassing reinforcement learning.
* **Resource:** Bradley-Terry Model (Bradley & Terry, 1952) - **Type:** Statistical Model - **Relevance:** The paired-comparison framework used in both RLHF and DPO to model the probability of preferring one response over another, allowing reward models to be trained on reliable pairwise rankings instead of noisy absolute scores.
* **Resource:** HuggingFace Open LLM Leaderboard - **Type:** Website/Benchmark - **Relevance:** Referenced as the current tracking standard for open-weight models, highlighting the dominance of DPO-trained models like Mixtral and LLaMA 3.