# Hierarchical RL and Skill Discovery | CS 330

**Video Category:** Machine Learning / Reinforcement Learning Tutorial

## 📋 0. Video Metadata

* **Video Title:** Hierarchical RL and Skill Discovery
* **YouTube Channel:** Stanford Engineering
* **Publication Date:** Not shown in video
* **Video Duration:** ~120 minutes

## 📝 1. Core Summary (TL;DR)

This lecture explores the transition from standard Reinforcement Learning (RL), where tasks and rewards are manually prescribed, to unsupervised "Skill Discovery" and Hierarchical RL. By utilizing information-theoretic concepts like Entropy and Mutual Information, agents can learn a diverse repertoire of predictable behaviors (skills) without explicit reward functions. These discovered skills can then be leveraged by a higher-level policy to solve complex, long-horizon tasks efficiently, effectively bridging the gap between low-level motor control and high-level abstract planning.

## 2. Core Concepts & Frameworks

* **Skill Discovery:**
  -> **Meaning:** The process of training an agent to learn a diverse set of useful behaviors (skills) in an unsupervised manner, without manually designing specific tasks or reward functions.
  -> **Application:** Dropping a robot into an unknown environment and allowing it to learn fundamental movements (e.g., walking forward, backward, jumping) purely through intrinsic motivation, which can later be used for specific tasks.
* **Hierarchical Reinforcement Learning (HRL):**
  -> **Meaning:** An architecture where control is split into multiple levels. A high-level policy (the "manager") operates at a lower temporal frequency, outputting abstract sub-goals or selecting skills. A low-level policy (the "worker") operates at a high frequency, taking the environmental state and the high-level command to execute concrete motor actions.
  -> **Application:** Baking a cake: the high-level policy decides "go to the store," while the low-level policy handles the micro-adjustments of muscle contractions to take steps.
* **Entropy ($H(p(x))$):**
  -> **Meaning:** A mathematical measure of the breadth, uncertainty, or unpredictability of a probability distribution. A uniform distribution has maximum entropy (highest uncertainty), while a deterministic outcome has zero entropy.
  -> **Application:** In Maximum Entropy RL (MaxEnt RL), adding an entropy term to the reward function forces the policy to remain stochastic and explore multiple viable paths to a goal, rather than collapsing prematurely onto a single, potentially sub-optimal trajectory.
* **KL-Divergence ($D_{KL}(q||p)$):**
  -> **Meaning:** A non-symmetric measure of the difference between two probability distributions $q$ and $p$. It quantifies how much information is lost if you use distribution $p$ to approximate distribution $q$.
  -> **Application:** Used mathematically to formulate Mutual Information and to push learned skill distributions away from a uniform prior, ensuring that the agent learns distinct, non-overlapping behaviors.
* **Mutual Information ($I(x; y)$):**
  -> **Meaning:** A measure of the mutual dependence between two variables; it quantifies how much knowing about one variable reduces uncertainty about the other.
  -> **Application:** In skill discovery algorithms like DADS, the objective is to maximize the mutual information between a latent skill vector $z$ and the future state $s'$. This ensures that invoking a specific skill leads to a highly predictable and distinct outcome (a minimal sketch of this objective appears after this list).
* **The Options Framework:**
  -> **Meaning:** A specific mathematical formulation for HRL consisting of a triplet: an initiation set (where the option can be triggered), an intra-option policy (the low-level execution), and a termination function (the probability the option ends in a given state).
  -> **Application:** A "navigate to door" option can only be initiated inside a room, uses a low-level walking policy, and its termination function spikes to $1.0$ when the agent reaches the doorway.
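To make the entropy and mutual-information objectives above concrete, here is a minimal Python sketch (not from the lecture) of two intrinsic-reward terms: the per-step entropy bonus used in MaxEnt RL, and a DIAYN-style variational lower bound on $I(s'; z)$ that rewards a skill for reaching states a learned discriminator can attribute to it. The function names, the fixed temperature `alpha`, and the hard-coded discriminator probabilities are illustrative assumptions, not the course's implementation.

```python
import numpy as np

# Minimal sketch of the two intrinsic-reward terms discussed above.
# All names (maxent_reward, diayn_reward, alpha) are illustrative placeholders.

def maxent_reward(env_reward, log_prob_action, alpha=0.1):
    """MaxEnt RL: augment the task reward with an entropy bonus.
    With a single sample, H[pi(.|s)] is estimated by -log pi(a|s)."""
    return env_reward + alpha * (-log_prob_action)

def diayn_reward(skill_id, discriminator_log_probs, num_skills):
    """DIAYN-style intrinsic reward: log q(z | s') - log p(z), a variational
    lower bound on I(s'; z) when p(z) is uniform over `num_skills` skills."""
    log_p_z = -np.log(num_skills)                        # uniform prior over skills
    return discriminator_log_probs[skill_id] - log_p_z   # high if s' reveals which skill ran

# Toy usage: a state that the discriminator confidently attributes to skill 2.
log_q_z_given_s = np.log(np.array([0.05, 0.05, 0.85, 0.05]))
print(maxent_reward(env_reward=1.0, log_prob_action=np.log(0.25)))
print(diayn_reward(skill_id=2, discriminator_log_probs=log_q_z_given_s, num_skills=4))
```

DADS replaces the discriminator $q(z \mid s')$ with a skill-dynamics model $q(s' \mid s, z)$, but the same tension between predictability and coverage is what the mutual-information objective balances.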
## 3. Evidence & Examples (Hyper-Specific Details)

* **The Meta-World Benchmark Constraint:** The speaker demonstrated the difficulty of manually designing tasks by asking the audience to invent tasks for a single-arm tabletop robot (stack, push, pull, rotate, fold cloth). Creating the Meta-World benchmark required designing 50 distinct tasks that all had similar difficulty levels and time horizons, showing that manual task engineering is unscalable and labor-intensive.
* **Biological Synergies (Frog & Human):** The lecture cited biology research showing that a frog's spinal cord uses only a small handful of core "synergies" (skills) to modulate all its movements (Tresch et al., 1999). Similarly, human grasping relies on a few principal components (Santello et al., 1998). This biological evidence motivates building artificial agents that learn a small basis of core skills.
* **Soft Q-Learning in an Ant Maze:** A visual simulation compared standard Q-learning against Soft Q-learning (MaxEnt RL). The standard Q-learning ant committed to a specific path around a wall early in training. The Soft Q-learning ant "hedged its bets," exploring both sides of the wall simultaneously until it was absolutely necessary to commit. When the goal was suddenly moved during the test, the Soft Q-learning ant adapted easily, while the standard ant failed.
* **Robustness in Lego Stacking (Haarnoja et al., 2017):** A real-world robotic arm was trained to stack a Lego block using Soft Q-learning. During execution, a human repeatedly pushed the arm away from the target. Because the policy was trained to maximize entropy (and thus maintain multiple valid trajectories to the goal), the arm smoothly recovered from the physical perturbations without failing.
* **DADS Algorithm (Dynamics-Aware Discovery of Skills):** Visual simulations showed an Ant robot trained with DADS. Unsupervised, the robot learned to map different categorical skill vectors ($z$) to specific, predictable directions of walking. Because DADS simultaneously trains a "skill dynamics model" (predicting the next state given a skill), the researchers could use this model zero-shot for model-based planning, navigating the ant to specified coordinate targets without any additional RL training (see the planning sketch after this list).
* **Relay Policy Learning (Gupta et al., 2019):** An HRL approach demonstrated on a complex kitchen robot. The low-level policy was pre-trained using imitation learning on unstructured human demonstrations. The high-level policy was then trained via RL to output intermediate state goals (e.g., "move hand to microwave handle"). This allowed the robot to sequentially open a microwave, move a kettle, and open a cabinet, a task too long-horizon for standard flat RL.
* **Option-Critic Architecture (Bacon et al., 2016):** An end-to-end HRL method tested in a four-room grid world. Visualizations of the learned termination functions showed that the low-level options naturally learned to terminate at the "bottlenecks" (the narrow doorways connecting the rooms), demonstrating that meaningful temporal abstractions can emerge from end-to-end training.
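The DADS zero-shot planning result above is easy to picture with a toy sketch: once a skill-dynamics model $q(s' \mid s, z)$ has been learned, downstream goals can be reached by searching over skill sequences with the model alone. The snippet below is a hypothetical illustration, assuming a stand-in `skill_dynamics` function in place of the learned network and a simple random-shooting MPC loop; it is not the DADS codebase.

```python
import numpy as np

# Illustrative sketch of the DADS-style "plan over skills" idea (not the paper's code).
# `skill_dynamics` is a toy stand-in for a learned model predicting the state
# reached after executing skill z for one high-level step.

rng = np.random.default_rng(0)

def skill_dynamics(state, z, num_skills=4):
    """Stand-in for q(s' | s, z): each skill moves the agent one unit
    in one of `num_skills` compass directions."""
    angle = 2 * np.pi * z / num_skills
    return state + np.array([np.cos(angle), np.sin(angle)])

def plan_skills(state, goal, horizon=5, num_candidates=256, num_skills=4):
    """Random-shooting MPC over skill sequences: roll each candidate sequence
    through the skill-dynamics model and keep the one ending closest to the goal."""
    candidates = rng.integers(num_skills, size=(num_candidates, horizon))
    best_seq, best_dist = None, np.inf
    for seq in candidates:
        s = state.copy()
        for z in seq:
            s = skill_dynamics(s, z, num_skills)
        dist = np.linalg.norm(s - goal)
        if dist < best_dist:
            best_seq, best_dist = seq, dist
    return best_seq, best_dist

# Toy usage: plan a skill sequence from the origin toward a target coordinate.
seq, dist = plan_skills(np.zeros(2), goal=np.array([3.0, 2.0]))
print("planned skill sequence:", seq, "final distance:", round(dist, 3))
```

No additional RL happens in this loop; the only learned objects are the skills themselves and the dynamics model that predicts their effects.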
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Optimize for Mutual Information to Discover Useful Skills** - When building unsupervised agents, do not rely on random motor babbling. Define an objective function that maximizes the Mutual Information between a latent skill variable ($z$) and the future state ($s'$). This forces the agent to learn behaviors that are diverse from each other but individually highly predictable.
* **Rule 2: Integrate MaxEnt RL for Robust Control** - Use Soft Actor-Critic (SAC) or Soft Q-learning instead of standard Q-learning when deploying robots in the real world. By adding an entropy-maximization term to your objective, the policy will naturally learn to accommodate perturbations and avoid brittle, single-path solutions.
* **Rule 3: Decouple Temporal Horizons using HRL** - For tasks spanning thousands of timesteps, split the architecture. Train a high-level manager that acts every $K$ steps to output a sub-goal, and a low-level worker that acts every step to achieve that sub-goal. This exponentially reduces the search space for the high-level planner (a minimal control-loop sketch follows this list).
* **Rule 4: Train Skill-Dynamics Models for Zero-Shot Transfer** - When discovering skills, simultaneously train a supervised neural network to predict $s_{t+1}$ given $s_t$ and skill $z$. You can then discard RL for downstream tasks and use standard model-based planning (such as Model Predictive Control) to sequence the skills toward new goals without retraining.
* **Rule 5: Pre-train Low-Level Skills Before End-to-End Fine-tuning** - Avoid training deep hierarchies entirely from scratch end-to-end, as gradients struggle to pass through the discrete choices of a high-level manager. Pre-train the low-level skills first (using unsupervised skill discovery or imitation learning), freeze them, train the high-level manager, and only then fine-tune end-to-end.
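As a rough illustration of Rule 3, the loop below runs a hypothetical manager every $K$ steps and a worker every step. The policies and the point-mass dynamics are placeholder assumptions meant only to show the two timescales, not a working HRL implementation.

```python
import numpy as np

# Minimal sketch of the Rule-3 control loop (illustrative, not the lecture's code).
# `manager_policy`, `worker_policy`, and the toy dynamics are hypothetical stand-ins.

rng = np.random.default_rng(1)
K = 10  # the manager acts once every K low-level steps

def manager_policy(state):
    """High-level policy: propose a nearby sub-goal in state space."""
    return state + rng.normal(scale=1.0, size=state.shape)

def worker_policy(state, subgoal):
    """Low-level policy: take a small step that reduces distance to the sub-goal."""
    direction = subgoal - state
    norm = np.linalg.norm(direction) + 1e-8
    return 0.1 * direction / norm  # action = small step toward the sub-goal

state = np.zeros(2)
for t in range(100):
    if t % K == 0:                             # manager operates on the slow timescale
        subgoal = manager_policy(state)
    action = worker_policy(state, subgoal)     # worker operates at every timestep
    state = state + action                     # toy dynamics: action is a position delta
print("final state:", np.round(state, 2))
```

Because the manager only chooses among $\sim T/K$ decisions instead of $T$, its effective planning horizon shrinks accordingly, which is the point of the decoupling.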
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Using standard RL for long-horizon tasks.
  -> **Why it fails:** The probability of a random exploration sequence successfully completing a 10,000-step task (like baking a cake) is virtually zero, so there is no reward signal to learn from.
  -> **Warning sign:** The agent's learning curve remains flat at zero reward indefinitely despite millions of environment interactions.
* **Pitfall:** Training HRL end-to-end from scratch without constraints.
  -> **Why it fails:** The low-level policy often learns to bypass the high-level commands and solve the task itself, or the high-level policy collapses to selecting only a single skill, effectively rendering the hierarchy useless.
  -> **Warning sign:** When visualizing skill usage, the high-level manager outputs the exact same skill index $z$ at every timestep.
* **Pitfall:** Optimizing only for skill diversity (DIAYN) without predictability.
  -> **Why it fails:** If the agent is rewarded only for making states distinguishable based on $z$, it may learn highly chaotic, high-variance behaviors that are easy to distinguish but impossible to use for downstream planning.
  -> **Warning sign:** The learned skills are highly distinct (e.g., jumping wildly in different ways), but a model-based planner fails to string them together to navigate to a target.
* **Pitfall:** Manual reward shaping for every sub-task.
  -> **Why it fails:** Humans are bad at specifying comprehensive rewards; the reward will likely be overfit to a specific scenario, resulting in reward hacking where the agent finds a degenerate solution that maximizes the objective but fails the actual intent.
  -> **Warning sign:** The agent achieves a high score on your reward function by vibrating in place or exploiting a physics glitch.

## 6. Key Quote / Core Insight

"We want low diversity for a fixed skill, meaning high predictability, but high diversity across different skills, meaning wide coverage of the state space. This tension between predictability and coverage is exactly what maximizing mutual information achieves."

## 7. Additional Resources & References

* **Resource:** Meta-World (Yu, Quillen, He, Julian et al., 2019) - **Type:** Benchmark - **Relevance:** Demonstrates the difficulty of manual task design and the need for meta/multi-task RL.
* **Resource:** Soft Q-Learning / Deep Energy-Based Policies (Haarnoja et al., 2017) - **Type:** Paper - **Relevance:** Foundational work on using Maximum Entropy RL for robust, multimodal policy learning.
* **Resource:** DADS: Dynamics-Aware Discovery of Skills (Sharma, Gu, Levine, Kumar, Hausman, 2019) - **Type:** Paper - **Relevance:** Demonstrates how to discover predictable skills and simultaneously learn a dynamics model for zero-shot planning.
* **Resource:** Diversity is All You Need (DIAYN) (Eysenbach, Gupta, Ibarz, Levine, 2018) - **Type:** Paper - **Relevance:** An alternative unsupervised skill discovery method focused on state distinguishability.
* **Resource:** Relay Policy Learning (Gupta, Kumar, Lynch, Levine, Hausman, 2019) - **Type:** Paper - **Relevance:** Shows how to combine imitation learning with HRL to solve complex, long-horizon kitchen tasks.
* **Resource:** Option-Critic Architecture (Bacon, Harb, Precup, 2016) - **Type:** Paper - **Relevance:** A key framework for learning options (skills) and their termination conditions end-to-end.