# Model-Free Policy Evaluation: Learning Without Knowing How the World Works

**Video Category:** Programming Tutorial / Machine Learning Lecture

## 📋 0. Video Metadata

**Video Title:** Emma Brunskill (CS234 Reinforcement Learning Lecture 3: Model-Free Policy Evaluation: Po...
**YouTube Channel:** Stanford Online (inferred from context and watermark)
**Publication Date:** Spring 2024 (visible on presentation slides)
**Video Duration:** ~2 hours (approximate, based on standard lecture length and context)

## 📝 1. Core Summary (TL;DR)

This lecture introduces model-free policy evaluation in reinforcement learning, detailing how an agent can estimate the expected return of a fixed policy using only empirical data gathered from interacting with the environment. It contrasts three primary methods—Monte Carlo evaluation, Temporal Difference (TD) learning, and Certainty Equivalence (batch) evaluation—highlighting their respective mechanisms. By understanding the tradeoffs between these methods regarding bias, variance, data efficiency, and the Markov assumption, practitioners can accurately evaluate policies even when the true transition dynamics and reward functions are completely unknown.

## 2. Core Concepts & Frameworks

* **Model-Free Policy Evaluation:**
  -> **Meaning:** The process of estimating the state-value function $V^\pi(s)$ (the expected discounted sum of future rewards) for a specific policy $\pi$ using only sample trajectories from direct experience.
  -> **Application:** Used when an agent is placed in a complex real-world environment (like navigating a physical robot) where writing down the exact probability of state transitions or reward distributions is impossible.
* **Monte Carlo (MC) Policy Evaluation:**
  -> **Meaning:** An evaluation method that estimates the value of a state by averaging the actual, observed empirical returns ($G_t$) from that state onward across multiple completed episodes.
  -> **Application:** Useful for environments that naturally terminate in finite episodes (e.g., a game of chess) and in scenarios where the Markov property cannot be strictly guaranteed, since MC does not rely on state transitions to estimate value.
* **Temporal Difference (TD) Learning (Bootstrapping):**
  -> **Meaning:** An algorithm that updates its value estimate for a state based on the immediate reward received plus the *estimated* value of the subsequent state ($r_t + \gamma V(s_{t+1})$). This technique of updating a guess based on another guess is called bootstrapping.
  -> **Application:** Applied in continuous, non-episodic environments or situations requiring online, step-by-step learning (e.g., adjusting financial trading strategies continuously without waiting for the market to close).
* **Certainty Equivalence (Batch Policy Evaluation):**
  -> **Meaning:** An approach that takes a batch of historical data, constructs an empirical Maximum Likelihood Estimate (MLE) model of the MDP (counting transitions to estimate probabilities and averaging observed rewards), and then uses Dynamic Programming to solve for the value function of this empirical model.
  -> **Application:** Highly effective in data-scarce environments where gathering new data is expensive or dangerous (e.g., medical treatment trials), as it extracts the maximum possible statistical information from a limited dataset.
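To make the MC and TD update rules concrete, here is a minimal Python sketch of an incremental first-visit Monte Carlo update and a TD(0) update on a single toy episode. The state names, rewards, discount factor, and learning rate are illustrative assumptions, not values from the lecture.

```python
# Minimal sketch (illustrative only, not code from the lecture): tabular incremental
# first-visit Monte Carlo and TD(0) updates on one toy episode. The state names,
# rewards, discount factor, and learning rate are all assumptions.
from collections import defaultdict

gamma = 1.0   # discount factor (assumed)
alpha = 0.1   # constant step size (assumed)

# One episode as (state, reward) pairs, where the reward is received on leaving the state.
episode = [("s3", 0.0), ("s2", 0.0), ("s1", 1.0)]

# Incremental first-visit Monte Carlo: V(s) <- V(s) + alpha * (G_t - V(s)),
# applied only after the episode ends, using the full observed return G_t.
V_mc = defaultdict(float)
G, returns = 0.0, []
for state, reward in reversed(episode):       # compute returns backwards
    G = reward + gamma * G
    returns.append((state, G))
seen = set()
for state, G_t in reversed(returns):          # forward order, first visit only
    if state not in seen:
        seen.add(state)
        V_mc[state] += alpha * (G_t - V_mc[state])

# TD(0): V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t)),
# applied after every single step, bootstrapping from the next state's current estimate.
V_td = defaultdict(float)
for t, (state, reward) in enumerate(episode):
    next_value = V_td[episode[t + 1][0]] if t + 1 < len(episode) else 0.0  # terminal = 0
    V_td[state] += alpha * (reward + gamma * next_value - V_td[state])
```

The only structural difference between the two updates is the target: the full return $G_t$ for MC (unbiased, high variance, requires the episode to finish) versus the bootstrapped target $r_t + \gamma V(s_{t+1})$ for TD(0) (biased by the current estimate, lower variance, usable online).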
## 3. Evidence & Examples (Hyper-Specific Details)

* **[Value Iteration vs. Policy Iteration Efficiency]:** To show that Value Iteration can require more than $|A|^{|S|}$ iterations, the instructor presents a 1-state, 1-action MDP with reward $r(s,a)=1$ and discount factor $\gamma=0.9$ (for this MDP, $|A|^{|S|} = 1^1 = 1$). Initializing $V_0(s)=0$, the true optimal value is $V^* = \frac{1}{1-0.9} = 10$; however, after the first iteration, $V_1(s)=1$. Value Iteration takes many subsequent iterations to asymptotically approach 10, demonstrating potential inefficiency. Policy Iteration, conversely, would find the optimal policy and evaluate it exactly in a single step.
* **[Mars Rover Trajectory Application]:** A sample episodic trajectory is shown to illustrate Monte Carlo updates: $(s_3, a_1, 0, s_2, a_1, 0, s_1, a_1, 1, \text{terminal})$. With $\gamma = 1$, the total return of the episode is 1. For First-Visit Monte Carlo with $V(s_3)$ initialized at 0 and learning rate $\alpha=1$, the incremental update formula $V(s_3) \leftarrow V(s_3) + \alpha(G - V(s_3))$ results in $V(s_3) = 0 + 1(1 - 0) = 1$.
* **[The AB Example - Contrasting MC and TD]:** Based on Sutton & Barto (Ex. 6.4), an MDP has two states, A and B, with $\gamma=1$. An agent is given a batch of 8 episodes: 1 episode goes $A \rightarrow B$ (reward 0) and then terminates (reward 0); 6 episodes start in $B$ and terminate with reward 1; 1 episode starts in $B$ and terminates with reward 0.
  * **Monte Carlo** evaluates $V(A) = 0$, because the single episode starting at A yielded a total return of 0.
  * **TD(0)** evaluates $V(B) = 6/8 = 0.75$. Because TD bootstraps, it updates $A$ based on the transition to $B$: $V(A) = r + \gamma V(B) = 0 + 1(0.75) = 0.75$. TD(0) effectively builds an implicit Maximum Likelihood Markov model of the data, converging to the Certainty Equivalence solution (a sketch reproducing this computation appears after the takeaways below).

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Deploy Monte Carlo when the Markov assumption is weak** - If your state representation does not perfectly capture all historical information needed to predict the future, use Monte Carlo evaluation. Because MC averages actual sequence returns rather than relying on transition estimates, it remains an unbiased estimator even in non-Markovian environments.
* **Rule 2: Use TD(0) to minimize variance and enable online learning** - In stochastic environments, implement TD(0) learning ($V(s_t) \leftarrow V(s_t) + \alpha[r_t + \gamma V(s_{t+1}) - V(s_t)]$) to reduce the high variance associated with full MC rollouts. This allows the agent to learn incrementally after every single step rather than waiting for an episode to terminate.
* **Rule 3: Implement learning rate decay for noisy environments** - When using incremental MC or TD updates, a constant learning rate $\alpha$ will cause the value estimate to perpetually oscillate around the true value. To guarantee convergence, decay the learning rate over time (e.g., using $\alpha = 1/N(s)$, where $N(s)$ is the number of times state $s$ has been visited).
* **Rule 4: Apply Certainty Equivalence for maximum data efficiency** - When evaluating policies offline using a small, finite dataset (like historical medical records), calculate the Maximum Likelihood Estimate of the transition probabilities $\hat{P}(s'|s,a) = \frac{N(s,a,s')}{N(s,a)}$ and run Dynamic Programming on this model. This extracts the maximum statistical information from the data: batch TD(0) converges to the same certainty-equivalence solution, whereas batch MC only minimizes mean squared error on the observed returns, which tends to generalize worse in Markov domains.
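To tie the AB example to Rule 4, here is a minimal Python sketch that computes both the batch Monte Carlo estimate and the certainty-equivalence estimate (build the MLE model, then evaluate it with dynamic-programming sweeps) from the eight episodes. The data encoding, variable names, and iteration count are illustrative assumptions, not code from the lecture.

```python
# Minimal sketch (assumed encoding, not from the lecture): batch Monte Carlo vs.
# the certainty-equivalence (MLE model + solve) estimate on the AB example.
from collections import defaultdict

gamma = 1.0
# Each episode is a list of (state, reward, next_state) steps; None marks termination.
episodes = (
    [[("A", 0.0, "B"), ("B", 0.0, None)]]   # 1 episode: A -> B -> terminal
    + [[("B", 1.0, None)]] * 6               # 6 episodes: B -> terminal, reward 1
    + [[("B", 0.0, None)]]                   # 1 episode: B -> terminal, reward 0
)

# --- Batch (first-visit) Monte Carlo: average observed returns per state ---
return_sums, visit_counts = defaultdict(float), defaultdict(int)
for episode in episodes:
    G, returns = 0.0, []
    for state, reward, _ in reversed(episode):
        G = reward + gamma * G
        returns.append((state, G))
    seen = set()
    for state, G_t in reversed(returns):     # forward order, first visit only
        if state not in seen:
            seen.add(state)
            return_sums[state] += G_t
            visit_counts[state] += 1
V_mc = {s: return_sums[s] / visit_counts[s] for s in visit_counts}

# --- Certainty equivalence: build the MLE model, then evaluate it iteratively ---
trans_counts = defaultdict(lambda: defaultdict(int))
reward_sums, state_counts = defaultdict(float), defaultdict(int)
for episode in episodes:
    for state, reward, next_state in episode:
        state_counts[state] += 1
        reward_sums[state] += reward
        trans_counts[state][next_state] += 1

V_ce = defaultdict(float)
for _ in range(1000):                        # dynamic-programming sweeps on the MLE model
    for s in state_counts:
        expected_next = sum(
            (count / state_counts[s]) * (V_ce[s2] if s2 is not None else 0.0)
            for s2, count in trans_counts[s].items()
        )
        V_ce[s] = reward_sums[s] / state_counts[s] + gamma * expected_next

print("Batch MC:", dict(V_mc))               # -> V(A) = 0.0,  V(B) = 0.75
print("Certainty equivalence:", dict(V_ce))  # -> V(A) = 0.75, V(B) = 0.75
```

Batch MC returns $V(A)=0$, while the certainty-equivalence solution, which batch TD(0) also converges to, returns $V(A)=0.75$, matching the values discussed in the AB example above.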
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Attempting to use Monte Carlo evaluation in continuous, non-episodic environments.
  -> **Why it fails:** MC relies on the total return $G_t$, which can only be calculated once an episode reaches a terminal state.
  -> **Warning sign:** The algorithm hangs indefinitely, waiting to calculate a return, or runs out of memory attempting to store an infinitely long trajectory.
* **Pitfall:** Assuming Every-Visit Monte Carlo is an unbiased estimator.
  -> **Why it fails:** If an agent visits the same state multiple times within a single episode, the returns calculated from those visits share the same subsequent trajectory and are mathematically correlated.
  -> **Warning sign:** Value estimates exhibit a persistent, systematic bias compared to First-Visit MC, particularly in highly cyclical MDPs.
* **Pitfall:** Using Temporal Difference learning in strongly non-Markovian domains.
  -> **Why it fails:** TD relies on bootstrapping: the value of the current state is updated based on the estimated value of the next state. If the state representation lacks crucial historical context, the "next state" value is fundamentally flawed, and bootstrapping actively propagates that error backward.
  -> **Warning sign:** TD value estimates diverge wildly from the actual empirical returns observed over time, while MC estimates remain accurate.

## 6. Key Quote / Core Insight

"If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning."

*Insight:* The true power of modern reinforcement learning lies in combining the direct, model-free interaction of Monte Carlo methods with the mathematical, step-by-step bootstrapping structure of Dynamic Programming.

## 7. Additional Resources & References

* **Resource:** *Reinforcement Learning: An Introduction* (Jan 1, 2018 draft) by Richard S. Sutton and Andrew G. Barto
  - **Type:** Book
  - **Relevance:** Cited directly in the lecture as the primary foundational text for this material, specifically Chapters 5.1, 5.5, and 6.1-6.3 for deep dives into MC and TD learning, and the source of the AB Example (Ex. 6.4).