# Lecture 4: Model Free Control

**Video Category:** Machine Learning / Artificial Intelligence Tutorial

## 📋 0. Video Metadata

**Video Title:** Lecture 4: Model Free Control
**YouTube Channel:** Stanford ENGINEERING
**Publication Date:** Winter 2019 (extracted from presentation slides)
**Video Duration:** ~1 hour 17 minutes

## 📝 1. Core Summary (TL;DR)

This lecture introduces "Model-Free Control," a core Reinforcement Learning framework in which an agent must figure out the optimal sequence of actions to maximize rewards without prior knowledge of the environment's dynamics or reward structure. By estimating Action-Value functions ($Q$-functions) instead of State-Value functions ($V$-functions), agents can iteratively improve their policies purely from trial-and-error experience. This addresses the fundamental challenge of acting optimally in complex, unknown environments where building an explicit simulation or model is computationally prohibitive or impossible.

## 2. Core Concepts & Frameworks

* **Model-Free Control:**
  -> **Meaning:** The process of optimizing a decision-making policy to maximize expected discounted rewards without access to an explicit model of the environment (no known transition probabilities $P$ or reward function $R$).
  -> **Application:** Training an agent to play a board game solely by playing against itself and observing wins/losses, rather than computing probabilities of every possible board state.
* **On-Policy vs. Off-Policy Learning:**
  -> **Meaning:** On-policy algorithms evaluate and improve the exact policy that is currently being used to make decisions in the environment (learning from "direct experience"). Off-policy algorithms learn the value of an optimal policy independently of the exploratory behavior policy currently being executed.
  -> **Application:** SARSA is on-policy and learns the value of the $\epsilon$-greedy path it walks. Q-learning is off-policy and learns the optimal path even while its physical robot might occasionally take random exploratory steps.
* **Generalized Policy Iteration (GPI) with Action-Values ($Q$):**
  -> **Meaning:** If the environment's model is unknown, policy improvement cannot be performed using state values ($V(s)$), because computing the best next action requires knowing the transition dynamics. Instead, GPI estimates Action-Values ($Q(s,a)$), allowing the policy to be improved simply by taking the $\arg\max$ over the $Q$-values for a given state.
  -> **Application:** Updating a robot's navigation policy by looking at its internal table of $Q(s, a)$ scores and simply picking the action with the highest score for its current location.
* **$\epsilon$-Greedy Exploration:**
  -> **Meaning:** A strategy to balance exploiting known good actions with exploring unknown actions. With probability $1 - \epsilon$, the agent chooses the action with the highest estimated $Q$-value. With probability $\epsilon$, it chooses a completely random action from the action space.
  -> **Application:** A recommendation engine showing the historically best-performing ad 90% of the time ($\epsilon = 0.1$), but showing a random ad 10% of the time to discover new high-performing creative.
* **GLIE (Greedy in the Limit with Infinite Exploration):**
  -> **Meaning:** A set of theoretical conditions required for algorithms to converge to the true optimal policy. It requires that all state-action pairs are visited infinitely often, and that the exploration rate (like $\epsilon$) gradually decays to zero over time so the policy eventually becomes purely greedy.
  -> **Application:** Setting the exploration parameter in an algorithm to $\epsilon_t = 1/t$ so that it explores heavily at the start but eventually locks into the optimal path as time $t$ goes to infinity (see the code sketch after this list).
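To make the last two concepts concrete, here is a minimal sketch (my illustration, not code from the lecture) of $\epsilon$-greedy action selection combined with a GLIE-style $1/k$ decay schedule; the `Q` dictionary layout and the function names are assumptions.

```python
import random

def epsilon_greedy_action(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest Q[(state, action)] (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def glie_epsilon(k):
    """GLIE-style schedule: epsilon_k = 1/k decays to zero as the
    episode count k grows, so the policy becomes greedy in the limit."""
    return 1.0 / k
```

With `glie_epsilon(1)` every action is random; by episode 100 the agent exploits 99% of the time, matching the "explore heavily at the start, lock in later" behavior described above.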
## 3. Evidence & Examples (Hyper-Specific Details)

* **TD-Gammon / Backgammon (1994):** Gerry Tesauro used model-free reinforcement learning coupled with a neural network to train an agent to play Backgammon. The agent learned purely from playing games against itself, without being explicitly programmed with strategy, serving as a landmark early success of model-free control in large state spaces.
* **Climate Modeling vs. Model-Free:** The speaker notes that while one *could* try to build a model for computational sustainability or climate modeling, running simulations of the actual climate is incredibly expensive and complex. Model-free methods offer a way to learn optimal interventions directly from sampled data rather than requiring a perfect, computationally intractable physical simulation.
* **On/Off-Policy Trajectory Combination:** Demonstrated via a whiteboard example. If an agent executes two trajectories, `[s1, a1, s1, a1]` and `[s1, a2, s1, a2]`, an on-policy learner only knows those exact paths. An off-policy learner can combine these separate experiences to estimate the value of a novel trajectory it has never executed, such as `[s1, a1, s1, a2]`.
* **Mars Rover Q-Learning Example:** A mathematical walkthrough on the slides. The rover has seven states $s_1, \dots, s_7$ (matching the seven entries of the $Q$-vectors below) and actions $a_1, a_2$. Action $a_1$ eventually gives a reward of $+10$, and $a_2$ gives $+5$. The starting $Q$-matrix is initialized with arbitrary values: $Q(\cdot, a_1) = [1,0,1,0,0,0,0]$ and $Q(\cdot, a_2) = [0,1,0,0,0,0,0]$. An $\epsilon$-greedy policy ($\epsilon = 0.5$) is applied. The demonstration shows how an episode is sampled (e.g., $s_3 \rightarrow a_1 \rightarrow s_2 \rightarrow a_2 \rightarrow s_1 \rightarrow a_1 \rightarrow \text{terminal}$) and how First-Visit Monte Carlo computes the return $G(s_3, a_1) = 1$, which is then used to update the previous $Q$-estimate of $1$ (a code sketch of this first-visit update follows this list).
* **Sutton and Barto – Double Q-Learning vs. Q-Learning Chart (Fig 6.7):** A chart is shown detailing a state with multiple zero-mean-reward actions. Because of Maximization Bias, standard Q-learning heavily overestimates the value of these actions, believing they are positive, and selects suboptimal actions almost 100% of the time early in training. Double Q-learning, which maintains two estimators, keeps the estimated value near the true value of $0$, so the agent's suboptimal action selection rate drops to near 5% almost immediately.
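The first-visit Monte Carlo update from the Mars rover walkthrough can be sketched as follows. This is a minimal illustration under my own assumptions (the tuple-based episode format and the incremental-mean update are not the slides' notation):

```python
from collections import defaultdict

def first_visit_mc_update(Q, counts, episode, gamma=1.0):
    """First-visit Monte Carlo control update.
    `episode` is a list of (state, action, reward) tuples."""
    G = 0.0
    returns = {}  # first-visit return for each (state, action) pair
    # Walk the episode backwards, accumulating discounted returns;
    # overwriting ensures each pair keeps its *first*-visit return.
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        returns[(state, action)] = G
    # Incremental mean: nudge each Q estimate toward the sampled return.
    for sa, ret in returns.items():
        counts[sa] += 1
        Q[sa] += (ret - Q[sa]) / counts[sa]

# Episode in the spirit of the walkthrough (rewards are assumptions):
Q = defaultdict(float)
counts = defaultdict(int)
episode = [("s3", "a1", 0.0), ("s2", "a2", 0.0), ("s1", "a1", 1.0)]
first_visit_mc_update(Q, counts, episode)
print(Q[("s3", "a1")])  # 1.0, reproducing G(s3, a1) = 1 with gamma = 1
```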
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Shift from $V$ to $Q$ when the model is unknown** - If you do not have the transition matrix ($P$) or reward function ($R$) of your environment, you cannot perform Policy Iteration using State Values ($V(s)$). You must rewrite your algorithms to estimate and track State-Action Values ($Q(s,a)$) so that policy improvement becomes a simple $\arg\max$ operation over actions.
* **Rule 2: Enforce $\epsilon$-greedy exploration to prevent premature convergence** - Never use a purely greedy policy ($\arg\max Q(s,a)$) during the learning phase in a model-free setup. Always mix in random actions with probability $\epsilon$ to ensure the agent discovers high-reward actions that currently have low initial estimates.
* **Rule 3: Decay $\epsilon$ to ensure final convergence (GLIE)** - To ensure your algorithm actually settles on the optimal policy, do not keep $\epsilon$ static indefinitely. Implement a schedule where $\epsilon$ decays toward zero over time (e.g., $\epsilon_k = 1/k$, where $k$ is the episode number).
* **Rule 4: Use SARSA for safer, on-policy online learning** - Implement SARSA ($Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_t + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$) when you want the agent to evaluate the exact policy it is currently executing, including its exploration. This is preferable when physical safety is a concern during training, as the agent learns to avoid dangerous states where random exploration might cause a failure.
* **Rule 5: Use Q-Learning for aggressive, off-policy optimization** - Implement Q-Learning ($Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_t + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]$) when you want the agent to learn the optimal path regardless of its current exploratory mistakes.
* **Rule 6: Implement Double Q-Learning to fix noisy overestimations** - If your environment has highly stochastic or noisy rewards, standard Q-learning will succumb to Maximization Bias and systematically overestimate values. Split your experience to train two independent functions ($Q_1$ and $Q_2$). Use $Q_1$ to select the best action ($a^* = \arg\max_a Q_1(S', a)$), but use $Q_2$ to evaluate its value ($\text{target} = R + \gamma Q_2(S', a^*)$). The three update rules from Rules 4-6 are sketched in code after this list.
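A minimal tabular sketch of the three updates, assuming dictionary-backed $Q$-tables (the function names and data layout are my illustration, not code from the lecture):

```python
import random

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action the policy actually takes next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy action, regardless of what the
    # (possibly exploratory) behavior policy does next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def double_q_update(Q1, Q2, s, a, r, s_next, actions, alpha, gamma):
    # Randomly pick which table to update; select the argmax action with one
    # table but evaluate it with the other to counter maximization bias.
    if random.random() < 0.5:
        Q1, Q2 = Q2, Q1
    a_star = max(actions, key=lambda a2: Q1[(s_next, a2)])
    Q1[(s, a)] += alpha * (r + gamma * Q2[(s_next, a_star)] - Q1[(s, a)])
```

Note that the three functions differ only in the bootstrap target, which is exactly the on-policy vs. off-policy vs. decoupled-selection distinction the rules describe.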
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Initializing $Q$-values to zero without forcing exploration.
  -> **Why it fails:** If a deterministic greedy policy encounters a single negative reward, it will lower that action's $Q$-value below zero and indefinitely select other unvisited actions (which are still at zero), potentially ignoring the true optimal path forever.
  -> **Warning sign:** The agent tries an action once, gets a minor penalty, and never attempts that action again, despite it leading to a massive payoff later.
* **Pitfall:** Relying on Monte Carlo (MC) Control for continuing tasks.
  -> **Why it fails:** MC Control requires calculating the total return ($G_t$) from a state all the way to the terminal state of an episode before it can make a single update to the $Q$-function.
  -> **Warning sign:** The algorithm cannot learn or update in environments that have no defined "end" or terminal state, or episodes are excessively long, causing extreme delays in learning.
* **Pitfall:** Maximization Bias in standard Q-Learning.
  -> **Why it fails:** The expectation of a maximum is greater than or equal to the maximum of the expectations ($E[\max(Q_1, Q_2)] \ge \max(E[Q_1], E[Q_2])$). When taking the maximum over noisy estimates of $Q$-values, the algorithm latches onto positive statistical noise, creating a systematic upward bias in value estimates.
  -> **Warning sign:** The agent's estimated $Q$-values continuously climb higher than the actual empirical rewards it receives from the environment, leading it to favor highly stochastic, zero-mean actions over reliable ones.

## 6. Key Quote / Core Insight

"The real problem that often comes up in reinforcement learning... is how should an agent make decisions when they don't know how the world works, and they still want to maximize their expected discounted sum of rewards."

*Refined Insight:* The core breakthrough of model-free control is realizing that an agent does not need to understand the underlying physics, probabilities, or logic of its environment to master it. By shifting focus from mapping the environment (State Values) to mapping the consequences of interactions (Action Values), an agent can organically compute optimal strategies purely through trial, error, and memory.

## 7. Additional Resources & References

* **Resource:** TD-Gammon (Gerry Tesauro, 1994) - **Type:** Mentioned Case Study - **Relevance:** Cited as a primary early example of successful model-free reinforcement learning using neural networks to master Backgammon.
* **Resource:** Reinforcement Learning: An Introduction (Richard S. Sutton and Andrew G. Barto) - **Type:** Book - **Relevance:** Referenced for "Figure 6.7," demonstrating the mathematical cause and empirical effects of Maximization Bias and the Double Q-learning fix.