# Stanford CS234: Deep Reinforcement Learning Extensions & Imitation Learning

**Video Category:** Machine Learning / Artificial Intelligence Tutorial

## 📋 0. Video Metadata

**Video Title:** Stanford CS234: Reinforcement Learning | Winter 2019 | Lecture 7 - Imitation Learning
**YouTube Channel:** Stanford Engineering
**Publication Date:** Winter 2019
**Video Duration:** ~1 hour 13 minutes

## 📝 1. Core Summary (TL;DR)

This lecture details advanced extensions to Deep Q-Networks (DQN) designed to stabilize learning and improve sample efficiency, such as Double DQN, Prioritized Experience Replay, and Dueling architectures. It then transitions into Imitation Learning, addressing the "hard exploration" problem in environments with sparse rewards, where traditional Reinforcement Learning (RL) fails. By leveraging expert demonstrations through Behavioral Cloning, Inverse Reinforcement Learning (IRL), and Apprenticeship Learning, agents can rapidly learn complex behaviors in large state spaces without requiring manually engineered, brittle reward functions.

## 2. Core Concepts & Frameworks

* **Double DQN:**
  -> **Meaning:** An extension of Q-learning that mitigates "maximization bias" (the tendency to overestimate action values) by decoupling action selection from action evaluation.
  -> **Application:** The online network $w$ is used to select the best action ($\arg\max_a Q(s', a; w)$), while the target network $w^-$ is used to evaluate the value of that specific action.
* **Prioritized Experience Replay:**
  -> **Meaning:** A sampling strategy for replay buffers that replaces uniform random sampling with a probability distribution weighted by the magnitude of a transition's Temporal Difference (TD) error.
  -> **Application:** Transitions where the agent's prediction was highly inaccurate are replayed more frequently, accelerating learning in sparse or complex environments.
* **Dueling DQN Architecture:**
  -> **Meaning:** A neural network design that splits the Q-value estimation into two separate streams: one estimating the state value $V(s)$ and the other estimating the state-action advantage $A(s, a)$. They are combined at the output layer: $Q(s,a) = V(s) + A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a')$.
  -> **Application:** Highly effective in environments where the value of being in a state is largely independent of the specific action taken (e.g., dodging obstacles in a racing game).
* **Behavioral Cloning:**
  -> **Meaning:** The simplest form of Imitation Learning, formulated as a standard supervised learning problem. It learns a policy mapping states to actions by directly minimizing the error between the agent's predictions and a dataset of expert state-action pairs.
  -> **Application:** Used in early autonomous driving (e.g., ALVINN) to map dashboard camera pixels directly to steering wheel angles.
* **Inverse Reinforcement Learning (IRL):**
  -> **Meaning:** The process of inferring the underlying reward function that an expert is implicitly optimizing, given a set of their demonstrated trajectories.
  -> **Application:** Used when defining a manual reward function is too brittle or complex, such as teaching a robotic arm to pour water or a car to drive safely in traffic.
* **DAgger (Dataset Aggregation):**
  -> **Meaning:** An iterative Imitation Learning algorithm designed to solve the compounding-error problem of Behavioral Cloning. The agent acts in the environment using its current policy, collects the states it visits, and queries the expert for the optimal actions in those specific states to augment the training dataset.
  -> **Application:** Corrects "off-distribution" failures in autonomous navigation by teaching the agent how to recover when it inevitably drifts away from the expert's ideal path (see the sketch below).
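A minimal sketch of the DAgger loop just described, assuming a Gymnasium-style `env` (`reset`/`step`), a hypothetical `expert_action(state)` oracle, and a `fit_policy` callable standing in for any supervised learner; none of these names come from the lecture.

```python
import numpy as np

def dagger(env, expert_action, fit_policy, n_iters=10, horizon=100):
    """DAgger sketch: roll out the LEARNER's policy, have the expert
    label the states the learner actually visits, aggregate, retrain.

    env           -- Gymnasium-style environment (assumed interface)
    expert_action -- hypothetical oracle mapping state -> optimal action
    fit_policy    -- supervised learner: (states, actions) -> policy fn
    """
    # Seed the dataset with an expert rollout (plain behavioral cloning).
    states, actions = [], []
    state, _ = env.reset()
    for _ in range(horizon):
        a = expert_action(state)
        states.append(state)
        actions.append(a)
        state, _, terminated, truncated, _ = env.step(a)
        if terminated or truncated:
            state, _ = env.reset()
    policy = fit_policy(np.array(states), np.array(actions))

    for _ in range(n_iters):
        state, _ = env.reset()
        for _ in range(horizon):
            states.append(state)
            actions.append(expert_action(state))  # expert labels the visited state
            a = policy(state)                     # ...but the LEARNER picks the action
            state, _, terminated, truncated, _ = env.step(a)
            if terminated or truncated:
                state, _ = env.reset()
        # Retrain on the aggregated dataset, including recovery labels.
        policy = fit_policy(np.array(states), np.array(actions))
    return policy
```

The key design point is in the inner loop: the learner's own (imperfect) actions determine which states get visited, so the aggregated dataset covers exactly the off-distribution states where naive Behavioral Cloning fails.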
## 3. Evidence & Examples (Hyper-Specific Details)

* **DQN on Atari (Mnih et al., Nature 2015):** The baseline algorithm achieved human-level performance across 49 Atari games using the exact same neural network architecture and hyperparameters, relying on experience replay and a fixed target network updated every ~10,000 steps.
* **Rainbow DQN (Hessel et al., AAAI 2018):** A chart demonstrated that combining independent improvements (Double DQN, Prioritized Replay, Dueling architectures, Distributional RL, multi-step returns as used in A3C, and Noisy Nets) into a single "Rainbow" algorithm yields a massive performance leap over the baseline DQN, crossing the 200% median human-normalized score at 200 million frames.
* **Montezuma's Revenge as a Sparse Reward Failure:** Video of a DQN agent trained for 50 million frames on the Atari game Montezuma's Revenge shows it barely making it past the first two rooms. Standard $\epsilon$-greedy exploration fails completely because the probability of randomly executing the sequence of actions required to find the first key (reward) is infinitesimally small.
* **ALVINN (Pomerleau, NIPS 1989):** A diagram showed an early neural network for autonomous driving. The input was a 30x32 video retina and an 8x32 range finder, feeding into a hidden layer of 29 units and outputting to 45 direction units representing steering angles. It was trained using purely supervised behavioral cloning from human driving data.
* **Compounding Errors in Behavioral Cloning:** A graph illustrated a test trajectory diverging from the training trajectory. If an agent trained via Behavioral Cloning makes a small error (probability $\epsilon$ per step) at time $t$, it enters a state not present in the training data. Lacking recovery data, it makes further errors, and the expected total number of errors scales quadratically: $E[\text{Total errors}] \leq \epsilon T^2$.
* **Visualizing Dueling Networks:** A diagram compared standard DQN with Dueling DQN. In standard DQN, the convolutional layers feed into a single fully connected sequence. In Dueling DQN, the network splits into a value stream ($V(s)$) and an advantage stream ($A(s,a)$) before combining into the final Q-values.
* **Feature Expectation Matching (Abbeel and Ng, 2004):** To perform Apprenticeship Learning, the agent assumes the reward is linear over features, $R(s) = w^T x(s)$. The algorithm iterates to find a policy whose discounted feature expectations $\mu(\pi) = E[\sum_t \gamma^t x(s_t) \mid \pi]$ match the expert's feature expectations $\mu(\pi^*)$, i.e., $|w^T \mu(\pi) - w^T \mu(\pi^*)| \leq \epsilon$.

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Decouple action selection and evaluation to prevent overestimation** - When using Q-learning, implement Double DQN. Use your online network weights $w$ to select the best action, $\arg\max_{a'} Q(s', a'; w)$, but use the target network weights $w^-$ to evaluate it, giving the target $r + \gamma\, Q(s', \arg\max_{a'} Q(s', a'; w); w^-)$. This prevents the positive bias introduced by the max operator.
* **Rule 2: Center your Dueling advantage stream** - When implementing a Dueling architecture, do not simply add $V(s)$ and $A(s,a)$. The two streams are unidentifiable (you can add a constant to $V$ and subtract it from $A$ without changing $Q$). Force identifiability by subtracting the mean of the advantages: $Q(s,a) = V(s) + (A(s,a) - \frac{1}{|A|}\sum_{a'} A(s,a'))$. Both rules are sketched in code below.
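A minimal PyTorch sketch of the Double DQN target from Rule 1. The names `online_net` and `target_net` are placeholders for any `nn.Module` mapping a batch of states to per-action Q-values; `dones` is assumed to be a float mask of terminal flags.

```python
import torch

@torch.no_grad()
def double_dqn_target(online_net, target_net, rewards, next_states, dones,
                      gamma=0.99):
    """Compute y = r + gamma * Q(s', argmax_a Q(s', a; w); w^-)."""
    # 1. SELECT the greedy next action with the online network (weights w).
    next_q_online = online_net(next_states)               # [batch, |A|]
    best_actions = next_q_online.argmax(dim=1, keepdim=True)

    # 2. EVALUATE that action with the frozen target network (weights w^-).
    next_q_target = target_net(next_states)               # [batch, |A|]
    q_of_best = next_q_target.gather(1, best_actions).squeeze(1)

    # Terminal transitions bootstrap nothing.
    return rewards + gamma * (1.0 - dones) * q_of_best
```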
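And a sketch of the Dueling head from Rule 2, with the mean-subtracted advantage; the layer sizes are arbitrary placeholders, not values from the lecture.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split a shared feature vector into V(s) and A(s, a) streams,
    then recombine with the advantage centered to zero mean."""

    def __init__(self, feature_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)          # [batch, 1]
        a = self.advantage(features)      # [batch, |A|]
        # Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')) forces identifiability.
        return v + a - a.mean(dim=1, keepdim=True)
```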
* **Rule 3: Use DAgger instead of naive Behavioral Cloning** - Do not train an agent solely on offline expert logs. Initialize a policy $\pi_1$, run it in the simulator to generate a trajectory, ask the expert (a human or a slow optimal planner) to label the optimal actions for the states $\pi_1$ actually visited, aggregate this new data into your dataset, and retrain. This teaches the agent how to recover from its own mistakes.
* **Rule 4: Apply IRL when reward engineering causes unintended behavior** - If writing a manual reward function (e.g., "stay on road", "avoid obstacles", "maintain speed") results in brittle or dangerous edge cases, switch to Inverse RL. Provide optimal human demonstrations and let the algorithm infer the linear combination of features $w^T x(s)$ the human is prioritizing.
* **Rule 5: Smooth out Prioritized Replay with stochasticity** - Do not strictly sample only the highest-TD-error transitions in a replay buffer, as this leads to overfitting on noisy outliers. Use stochastic prioritization, where the probability of sampling transition $i$ is proportional to $p_i^\alpha$, ensuring low-error transitions still have a non-zero chance of being sampled (see the sketch below).
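A minimal numpy sketch of Rule 5's stochastic prioritization, assuming priorities $p_i = |\delta_i| + \epsilon$ built from stored TD errors; the importance-sampling correction follows Schaul et al. (ICLR 2016), and a production buffer would use a sum-tree for $O(\log N)$ sampling rather than this $O(N)$ version.

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample replay indices with P(i) proportional to p_i^alpha,
    where p_i = |TD error_i| + eps."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()

    # High-error transitions are replayed more often, but every
    # transition keeps a non-zero probability of being chosen.
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)

    # Importance-sampling weights (annealed via beta) correct the bias
    # that non-uniform sampling introduces into the gradient estimate.
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```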
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Using identical networks for action selection and target evaluation in Q-learning.
  -> **Why it fails:** The $\max_a$ operation acts as an optimistic estimator. If noise gives a suboptimal action a temporarily high Q-value, the network will select it and use that inflated value as the target, causing uncontrolled upward drift in value estimates.
  -> **Warning sign:** Q-values diverge toward infinity and policy performance collapses.
* **Pitfall:** Assuming Behavioral Cloning will generalize perfectly to deployment.
  -> **Why it fails:** Supervised learning assumes independent and identically distributed (i.i.d.) training and test data. In MDPs, actions affect future states: a small initial error puts the agent in an unfamiliar state, where it makes a larger error, compounding quadratically, $O(T^2)$.
  -> **Warning sign:** The agent achieves near-zero loss on the training dataset but catastrophically fails (e.g., drives off the track) after a few steps in the real environment.
* **Pitfall:** Reward ambiguity in Inverse Reinforcement Learning.
  -> **Why it fails:** The IRL problem is ill-posed because infinitely many reward functions can explain a given behavior. Setting $R(s)=0$ everywhere makes any policy technically optimal.
  -> **Warning sign:** The algorithm outputs degenerate reward functions that do not actually encode the desired task constraints.
* **Pitfall:** Relying on standard $\epsilon$-greedy exploration for sparse rewards.
  -> **Why it fails:** If a reward is 100 specific actions deep (like grabbing the key in Montezuma's Revenge), the probability of finding it by random chance is $1/|A|^{100}$, which is practically zero.
  -> **Warning sign:** The agent's performance graph remains completely flat over hundreds of millions of training frames.
* **Pitfall:** DAgger requires an "always-on" expert.
  -> **Why it fails:** DAgger forces the agent into unfamiliar states and requires the expert to provide optimal labels for those bad states. If the expert is a human driver, it is dangerous and mentally taxing to ask them how to recover while the car is actively driving off a cliff.
  -> **Warning sign:** Data collection becomes bottlenecked by human fatigue or safety risks.

## 6. Key Quote / Core Insight

"If you are doing behavioral cloning, the fundamental problem is that the state distribution you encounter at test time depends on the actions your imperfect policy takes. The moment you make a mistake, you fall off the distribution of your training data, leading to compounding errors that scale quadratically with time."

## 7. Additional Resources & References

* **Resource:** Mnih et al., Nature 2015 - **Type:** Paper - **Relevance:** Foundational paper introducing DQN and its success on Atari.
* **Resource:** Van Hasselt et al., AAAI 2016 - **Type:** Paper - **Relevance:** Introduced Deep Reinforcement Learning with Double Q-learning.
* **Resource:** Schaul et al., ICLR 2016 - **Type:** Paper - **Relevance:** Introduced Prioritized Experience Replay.
* **Resource:** Wang et al., ICML 2016 - **Type:** Paper - **Relevance:** Introduced the Dueling Network Architecture for deep RL.
* **Resource:** Hessel et al., AAAI 2018 - **Type:** Paper - **Relevance:** The "Rainbow" paper showing how combining DQN extensions yields state-of-the-art performance.
* **Resource:** Pomerleau, NIPS 1989 - **Type:** Paper - **Relevance:** Introduced ALVINN, an early and successful application of Behavioral Cloning to autonomous driving.
* **Resource:** Ross et al., AISTATS 2011 - **Type:** Paper - **Relevance:** Introduced DAgger (Dataset Aggregation) to solve the compounding-error problem in imitation learning.
* **Resource:** Abbeel and Ng, ICML 2004 - **Type:** Paper - **Relevance:** Foundational paper on Apprenticeship Learning via Inverse Reinforcement Learning using feature matching.
* **Resource:** Ziebart et al., AAAI 2008 - **Type:** Paper - **Relevance:** Introduced Maximum Entropy Inverse Reinforcement Learning to address reward ambiguity.
* **Resource:** Ho and Ermon, NIPS 2016 - **Type:** Paper - **Relevance:** Introduced Generative Adversarial Imitation Learning (GAIL).