# Model-Based Reinforcement Learning and Monte Carlo Tree Search (CS234)

**Video Category:** Machine Learning / Reinforcement Learning Tutorial

## 📋 0. Video Metadata

**Video Title:** CS 234: Reinforcement Learning - Lecture 15 (Model-Based RL and MCTS)
**YouTube Channel:** Stanford ENGINEERING
**Publication Date:** Winter Quarter 2019
**Video Duration:** ~1 hour 11 minutes

## 📝 1. Core Summary (TL;DR)

This lecture details Model-Based Reinforcement Learning (RL), where an agent learns an explicit model of the environment's transition dynamics and reward function to plan optimal actions. It contrasts this with model-free approaches, highlighting the benefits of model-based RL, such as improved sample efficiency and zero-shot transfer when reward functions change. The video then introduces Monte Carlo Tree Search (MCTS), a powerful simulation-based planning algorithm that selectively explores large state spaces by combining simulated rollouts with tree-based planning, famously used by DeepMind's AlphaGo to master the complex board game Go.

## 2. Core Concepts & Frameworks

* **Model-Based Reinforcement Learning (RL):**
  -> **Meaning:** An RL paradigm where the agent learns a predictive model of the environment from experience (transitions $T$ and/or rewards $R$) and then uses planning algorithms (like dynamic programming) on that model to construct a value function or policy.
  -> **Application:** Used when you want to minimize real-world interactions by simulating experience internally, or when you need to adapt quickly to changing goals without relearning the environment's physics.
* **Planning (in RL):**
  -> **Meaning:** The computational process of taking a known or learned model of the world (an MDP) and using algorithms like Value Iteration, Policy Iteration, or tree search to compute an optimal policy for that specific model.
  -> **Application:** Used inside the model-based RL loop once a transition and reward model have been estimated from data.
* **Certainty Equivalence Model / Empirical Model:**
  -> **Meaning:** A tabular or approximate model constructed by taking the maximum likelihood estimate (MLE) from empirical data (e.g., counting state-action-next-state transitions and averaging rewards) and treating it as the absolute truth for planning.
  -> **Application:** A simple baseline for model learning, often implemented as a table-lookup model in discrete state spaces.
* **Expectimax / Minimax Trees:**
  -> **Meaning:** Forward-search tree structures used for planning. Expectimax takes the maximum over actions and the expectation over stochastic next states. Minimax is used in two-player adversarial games to maximize your own reward while assuming the opponent minimizes it.
  -> **Application:** Used to evaluate the exact value of taking specific actions from the current state by looking ahead a finite number of steps $H$.
* **Monte Carlo Tree Search (MCTS):**
  -> **Meaning:** A simulation-based search algorithm that builds a partial search tree rooted at the current state. It iteratively samples actions, simulates trajectories (rollouts) using a model or simulator, and updates the estimated value (Q-value) of nodes in the tree based on the averaged returns of those rollouts.
  -> **Application:** Essential for finding near-optimal moves in massive state spaces (like Go or complex robotics) where exhaustive Expectimax search is computationally impossible.
* **Upper Confidence Tree (UCT) Search:**
  -> **Meaning:** A specific action-selection strategy within MCTS that treats each node in the tree as a multi-armed bandit problem. It uses Upper Confidence Bounds (UCB) to balance exploiting known good paths and exploring lesser-visited paths during tree expansion.
  -> **Application:** Used to efficiently decide which branches of the MCTS tree to expand during simulated rollouts.
* **Self-Play:**
  -> **Meaning:** A training paradigm where an agent learns by playing games against itself (or past versions of itself), generating its own training data and curriculum.
  -> **Application:** Crucial for environments like board games where the reward signal (win/loss) is extremely sparse and external expert data is limited.

## 3. Evidence & Examples (Hyper-Specific Details)

* **Customer Service Example (Known Reward, Unknown Dynamics):** The lecturer notes that in many practical applications like customer service, the reward function is explicitly known (e.g., user engagement, purchase completion). However, the transition dynamics (how a customer's state changes in response to an action) must be learned through model-based RL.
* **Robot Navigation Zero-Shot Transfer:** If a robot has learned a dynamics model of how to navigate a room (how to turn and move forward without hitting walls) in order to reach Exit A, and the objective suddenly changes to reaching Exit B (the +1 reward is moved), the robot achieves "zero-shot transfer": it uses its existing dynamics model to replan a new path immediately, requiring zero new physical experience.
* **The "A-B" Tabular MDP Example:** A multi-step example on the slides shows a simple MDP with states A and B. A transition from A goes to B; B transitions to a terminal state, yielding a reward of 1 (75% of the time) or 0 (25% of the time). Given 8 episodes of empirical experience (e.g., a subset of episodes ending in three 1s and one 0), the lecture shows that Temporal Difference (TD) learning with infinite experience replay converges to $V(B) = 0.75$. This is exactly the value obtained by constructing a maximum likelihood estimate (MLE) table-lookup model and planning on it (see the sketch after this section).
* **MCTS Tree Expansion vs. Full Expectimax:** The slides visually contrast a full Expectimax tree (which expands every possible action and state out to horizon $H$, scaling as $|S||A|^H$) with an MCTS tree. The MCTS diagram shows an asymmetric, highly selective "best-first" search tree in which only the most promising paths are deeply expanded based on Monte Carlo rollouts, breaking the curse of dimensionality.
* **Case Study: The Game of Go:** Highlighted as a 2500-year-old game and a "grand challenge task" for AI (John McCarthy). It is played on a 19x19 board, giving a finite but massive combinatorial state space. The rules are simple, but the reward is extremely sparse (given only at the very end of a long sequence of moves). Traditional game-tree search (Minimax) fails here because the horizon ($H \approx 150$ to $200$ moves) makes the tree computationally intractable, scaling at least as $|S| \cdot |A|^{150}$.
* **Atari Game Mental Models:** The lecturer mentions psychological evidence suggesting that humans systematically build models when playing Atari games (for example, internally predicting what will happen if they move an iceberg next to a polar bear), which lets them generalize to unseen states.
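To make the certainty-equivalence idea and the A-B example concrete, here is a minimal sketch in Python: build the MLE table-lookup model from episode counts and plan on it. The episode data is an illustrative assumption (chosen so that 75% of B's terminations carry reward 1, consistent with the figures above), not the exact data from the slides.

```python
from collections import defaultdict

# Illustrative episodes for the A-B MDP: each episode is a list of
# (state, reward, next_state) transitions, where "T" marks a terminal state.
# NOTE: these counts are an assumption chosen so that 75% of B's
# terminations yield reward 1; they are not the exact slide data.
episodes = [
    [("A", 0, "B"), ("B", 0, "T")],
    [("B", 1, "T")], [("B", 1, "T")], [("B", 1, "T")],
    [("B", 1, "T")], [("B", 1, "T")], [("B", 1, "T")],
    [("B", 0, "T")],
]

# Certainty-equivalence (MLE) model: empirical transition probabilities
# and mean immediate rewards, treated as ground truth for planning.
next_counts = defaultdict(lambda: defaultdict(int))
reward_sum, visits = defaultdict(float), defaultdict(int)
for ep in episodes:
    for s, r, s_next in ep:
        next_counts[s][s_next] += 1
        reward_sum[s] += r
        visits[s] += 1

P = {s: {s2: n / visits[s] for s2, n in nxt.items()} for s, nxt in next_counts.items()}
R = {s: reward_sum[s] / visits[s] for s in visits}

# Plan on the learned model with value iteration (gamma = 1, V(terminal) = 0).
V = defaultdict(float)
for _ in range(50):
    for s in P:
        V[s] = R[s] + sum(p * V[s2] for s2, p in P[s].items() if s2 != "T")

print(V["B"])  # 0.75 -- the same value TD with infinite replay converges to
```

The takeaway matches the lecture: planning on the certainty-equivalence model and TD learning with infinite experience replay arrive at the same estimate, $V(B) = 0.75$.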
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Isolate dynamics learning from reward learning for high adaptability.** -> If you operate in an environment where goals frequently change but the physics/rules do not, use model-based RL. Learn the transition model $T$ separately so you can immediately apply dynamic programming to a new reward function $R$ without gathering new data.
* **Rule 2: Use simulation-based search (MCTS) to break the curse of dimensionality.** -> Never attempt exhaustive forward search (Expectimax/Minimax) if the state-action horizon calculation ($|S||A|^H$) exceeds your computational budget. Instead, use the model only to generate simulated samples, building an asymmetric search tree focused on high-value paths.
* **Rule 3: Implement UCT to manage exploration inside the simulation tree.** -> When building an MCTS tree, do not select rollout actions purely randomly or purely greedily. Treat every node in the simulated tree as a distinct multi-armed bandit and apply the UCB score (average return plus an exploration bonus based on visit counts) to force the search to occasionally explore unvisited branches (see the UCT sketch after Section 6).
* **Rule 4: Default to mean squared error (MSE) or KL divergence for model learning.** -> When training a transition model $P(s' \mid s, a)$ or a reward model $R(s, a)$ from a dataset of tuples $(S_t, A_t, R_t, S_{t+1})$, frame it as a supervised learning problem: use MSE for scalar reward regression and a density-estimation loss (e.g., KL divergence or cross-entropy) for stochastic state transitions (a minimal sketch appears at the end of these notes).

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall: Compounding approximation errors in model-based RL.**
  -> **Why it fails:** The agent first learns an approximate model of the world and then runs an approximate planning algorithm (e.g., deep Q-learning) on top of that flawed model. Errors in the model are exploited by the planner, leading to an escalating divergence between simulated performance and reality.
  -> **Warning sign:** The agent finds a "perfect" policy in simulation that immediately fails or acts erratically when deployed in the real environment.
* **Pitfall: Enforcing Markovian assumptions on non-Markovian data.**
  -> **Why it fails:** If the true environment requires historical context (e.g., hidden variables) but your model uses maximum likelihood estimation (MLE) based only on the immediately preceding state, the model will average over the hidden states, producing a fundamentally inaccurate transition model.
  -> **Warning sign:** Planning on the learned model yields worse results than simple model-free TD learning, because model-free methods can sometimes bypass the structural bottleneck of a flawed Markov assumption.
* **Pitfall: Confirmation bias in self-play or simulated data.**
  -> **Why it fails:** If a model is slightly inaccurate and you use it to generate simulated trajectories, the agent will learn to optimize for the flawed simulation. If you feed that simulated data back into training the model, you create a closed feedback loop of confirmation bias.
  -> **Warning sign:** The agent becomes highly overconfident in specific actions that are demonstrably suboptimal in the real world.

## 6. Key Quote / Core Insight

"One of the nice benefits of model-based RL is that if you learn a dynamics model of the world, then if someone changes the reward function, implicitly you can just do zero-shot transfer... I don't need any more experience. I can just re-plan."
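As promised in Rule 3, here is a short sketch of UCT-style selection at a single node, treating it as a multi-armed bandit. It is a simplified, single-node illustration rather than the lecture's full algorithm: there is no tree expansion or backup along the visited path, and the exploration constant `c` and the toy `simulate_return` simulator are assumptions.

```python
import math
import random

# UCT treats each tree node as a bandit: pick the action maximizing
#   Q(s, a) + c * sqrt(ln N(s) / N(s, a)),
# i.e., average rollout return plus an exploration bonus that shrinks with visits.
def uct_select(q, n_sa, n_s, c=1.4):
    """Return the index of the action with the highest UCB score."""
    for a, count in enumerate(n_sa):
        if count == 0:
            return a  # try every action once before applying the formula
    scores = [q[a] + c * math.sqrt(math.log(n_s) / n_sa[a]) for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])

# Toy stochastic "simulator" standing in for a model rollout (an assumption).
def simulate_return(action):
    true_means = [0.2, 0.5, 0.8]  # unknown to the planner
    return true_means[action] + random.gauss(0.0, 0.1)

q = [0.0, 0.0, 0.0]    # running average return per action at this node
n_sa = [0, 0, 0]       # visit counts per action
for t in range(1, 501):
    a = uct_select(q, n_sa, n_s=t)
    g = simulate_return(a)            # Monte Carlo rollout
    n_sa[a] += 1
    q[a] += (g - q[a]) / n_sa[a]      # incremental mean of rollout returns

print("visit counts:", n_sa)          # simulations concentrate on the best action
print("value estimates:", [round(v, 2) for v in q])
```

In full MCTS/UCT, this same selection rule is applied recursively at every node along a simulated trajectory, and the final rollout return is backed up to update the statistics of each node visited, which is what produces the asymmetric, best-first tree described above.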
## 7. Additional Resources & References

* **Resource:** AlphaGo (DeepMind) - **Type:** AI system / paper - **Relevance:** Cited as the premier example of combining deep learning with Monte Carlo Tree Search to solve a grand-challenge AI task.
* **Resource:** CS 238: Decision Making under Uncertainty (Mykel Kochenderfer) - **Type:** Course - **Relevance:** Recommended by the lecturer for deeper study of sequential decision-making.
* **Resource:** MS&E 351: Dynamic Programming and Stochastic Control (Ben Van Roy) - **Type:** Course - **Relevance:** Recommended for advanced theoretical topics in RL and dynamic programming.
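Finally, as a companion to Rule 4, here is a minimal sketch of model learning framed as supervised learning, assuming PyTorch is available. The synthetic batch, layer sizes, and variable names are illustrative and not from the lecture; the point is only the choice of losses: MSE for the scalar reward model and cross-entropy (a KL-style density-estimation loss) for the transition model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rule 4 as supervised learning: from a batch of (s, a, r, s') tuples, fit a
# reward model with MSE and a transition model with cross-entropy (equivalent,
# up to a constant, to KL divergence to the empirical next-state distribution).
n_states, n_actions, batch = 5, 3, 64
s = torch.randint(n_states, (batch,))       # synthetic states (illustrative)
a = torch.randint(n_actions, (batch,))      # synthetic actions
r = torch.randn(batch)                      # observed scalar rewards
s_next = torch.randint(n_states, (batch,))  # observed next states

# One-hot encode the (state, action) input.
x = torch.cat([F.one_hot(s, n_states).float(),
               F.one_hot(a, n_actions).float()], dim=1)

reward_model = nn.Linear(n_states + n_actions, 1)             # predicts R(s, a)
transition_model = nn.Linear(n_states + n_actions, n_states)  # logits over s'

opt = torch.optim.Adam(
    list(reward_model.parameters()) + list(transition_model.parameters()), lr=1e-2)

for step in range(200):
    reward_loss = F.mse_loss(reward_model(x).squeeze(-1), r)        # regression
    transition_loss = F.cross_entropy(transition_model(x), s_next)  # density estimation
    loss = reward_loss + transition_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(reward_loss), float(transition_loss))
```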