# Foundations of Reinforcement Learning: Markov Decision Processes and Value Iteration
**Video Category:** Machine Learning Lecture
## 0. Video Metadata
**Video Title:** Not explicitly shown in video (Lecture on Reinforcement Learning)
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~2 hours 10 minutes
## 1. Core Summary (TL;DR)
This lecture introduces the fundamental concepts of Reinforcement Learning (RL), distinguishing it from supervised learning by emphasizing sequential decision-making in environments with delayed rewards. It formally defines the Markov Decision Process (MDP) framework, establishing the mathematical components of states, actions, transition dynamics, and reward functions. Finally, it details the Bellman equations and the Value Iteration algorithm as rigorous mathematical tools for computing optimal policies that maximize long-term discounted payoffs.
## 2. Core Concepts & Frameworks
* **Sequential Decision Making:** Unlike supervised learning, which focuses on isolated predictions (e.g., predicting housing prices), RL involves sequences of actions where current decisions influence both future states and the information collected from the environment.
* **Markov Decision Process (MDP):** The mathematical framework used to formulate RL problems. It consists of a tuple $(S, A, P_{sa}, \gamma, R)$:
* **States ($S$):** The finite set of all possible situations the agent can inhabit.
* **Actions ($A$):** The set of all possible moves the agent can make.
* **Transition Probabilities ($P_{sa}$):** The probability distribution over next states given a current state and action: $P_{sa}(s') = P(s_{t+1} = s' | s_t = s, a_t = a)$. This defines the environment's dynamics, which can be deterministic or stochastic.
* **Reward Function ($R$):** A scalar feedback signal indicating the immediate goodness of a state. $R: S \rightarrow \mathbb{R}$.
* **Discount Factor ($\gamma$):** A value $0 \le \gamma < 1$ used to weight immediate rewards more heavily than future rewards and mathematically bound infinite sums.
* **Policy ($\pi$):** A rule defining the agent's behavior. A deterministic policy $\pi: S \rightarrow A$ maps a given state to a specific action. The goal of RL is to find the optimal policy $\pi^*$.
* **Value Function ($V^\pi(s)$):** The expected total discounted payoff an agent will obtain if it starts in state $s$ and strictly follows policy $\pi$ thereafter. $V^\pi(s) = \mathbb{E}[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots | s_0 = s, \pi]$.
* **Bellman Equation:** A recursive equation that decomposes the value of a state into the immediate reward plus the discounted expected value of the subsequent state.
* **Value Iteration:** A dynamic programming algorithm used to compute the optimal value function $V^*$. It works by repeatedly updating value estimates using the Bellman optimality operator until convergence.
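The tuple above maps directly onto simple array representations. Below is a minimal sketch (the array sizes, values, and helper names are illustrative, not from the lecture) that stores an MDP as NumPy arrays and applies the Bellman equation for a fixed policy as a repeated backup:

```python
import numpy as np

# MDP tuple (S, A, P_sa, gamma, R) as arrays; sizes and values here are illustrative.
n_states, n_actions = 4, 2
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = np.array([-0.1, -0.1, -0.1, 1.0])    # R(s): one scalar reward per state
gamma = 0.9                              # discount factor, 0 <= gamma < 1

def bellman_backup(V, pi, P, R, gamma):
    """One application of the Bellman equation for a fixed deterministic policy pi:
    V(s) <- R(s) + gamma * sum_{s'} P_{s, pi(s)}(s') * V(s')."""
    return R + gamma * np.array([P[s, pi[s]] @ V for s in range(len(V))])

pi = np.zeros(n_states, dtype=int)   # a fixed policy: always take action 0
V = np.zeros(n_states)
for _ in range(200):                 # repeated backups converge to V^pi
    V = bellman_backup(V, pi, P, R, gamma)
```

Because the backup is a contraction for $\gamma < 1$, iterating it from any starting vector converges to the unique fixed point $V^\pi$.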
## 3. Evidence & Examples (Hyper-Specific Details)
* **Running Example: Robot on a 1D Tape:** The instructor uses a 1D grid with 10 discrete states ($S = \{1, 2, ..., 10\}$) to illustrate MDP concepts.
* **State:** The robot's current position (e.g., $s = 3$).
* **Actions:** The robot can attempt to move Left ($L$) or Right ($R$).
* **Goal:** Reach state 6.
* **Transition Dynamics ($P_{sa}$):** The environment is stochastic. If at state 7 the robot takes action $L$, it succeeds with a 90% probability ($P_{7,L}(6) = 0.9$) but slips and stays in place with a 10% probability ($P_{7,L}(7) = 0.1$).
* **Reward Function ($R$):** The reward at the goal state is positive ($R(6) = 1.0$), while all other states have a slight negative reward ($R(s) = -0.1$ for $s \neq 6$) to encourage the robot to find the goal quickly rather than loitering.
* **Helicopter Flight Control:** Used as a conceptual example of an environment lacking expert supervision. Unlike medical diagnosis where a doctor provides the "correct" label, flying a novel autonomous helicopter requires the algorithm to discover the best sequence of control surface adjustments purely through trial, error, and a reward function, as human experts do not inherently know the optimal mathematical policy.
* **Mathematical Bound of Discounted Payoff:** The instructor demonstrates why $\gamma < 1$ is necessary for infinite horizon problems. If the reward is bounded by $[-M, M]$, the maximum possible discounted payoff is an infinite geometric series: $M + M\gamma + M\gamma^2 + \dots$, which converges exactly to $M / (1 - \gamma)$.
* **Effective Horizon:** The instructor explains that the discount factor creates an "effective horizon length" of roughly $1/(1-\gamma)$ steps. Because $\gamma^{1/(1-\gamma)} \approx 1/e \approx 0.37$, a reward arriving that many steps in the future retains only about a third of its nominal weight. If $\gamma = 0.99$, the effective horizon is roughly 100 steps, meaning rewards beyond that point are discounted so heavily they have minimal impact on current decisions.
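The running example can be written down concretely. The sketch below fills in two details the summary above leaves open (the ends of the tape act as walls, and the goal state is not treated as absorbing); those choices are assumptions, while the slip probability, rewards, and goal location follow the lecture's numbers:

```python
import numpy as np

# 1D tape with 10 states; indices 0..9 stand for the lecture's states 1..10.
n, (LEFT, RIGHT) = 10, (0, 1)
GOAL = 5                       # state 6 in the lecture's 1-indexed numbering
gamma = 0.99
slip = 0.1                     # with probability 0.1 the robot stays in place

R = np.full(n, -0.1)           # small negative reward everywhere ...
R[GOAL] = 1.0                  # ... except the goal state

P = np.zeros((n, 2, n))        # P[s, a, s']
for s in range(n):
    lo, hi = max(s - 1, 0), min(s + 1, n - 1)   # assumption: walls at both ends
    P[s, LEFT, lo] += 1 - slip
    P[s, LEFT, s] += slip
    P[s, RIGHT, hi] += 1 - slip
    P[s, RIGHT, s] += slip

# Sanity checks matching the lecture's two calculations:
M = np.abs(R).max()
print("payoff bound M/(1-gamma):", M / (1 - gamma))          # 1.0 / 0.01 = 100.0
print("gamma^(1/(1-gamma)):", gamma ** (1 / (1 - gamma)))    # ~= 1/e ~= 0.37
```

Under this 0-indexed convention, the lecture's $P_{7,L}(6) = 0.9$ corresponds to `P[6, LEFT, 5] == 0.9`.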
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Design policies independent of history** -> Formulate your state representation so that it satisfies the Markov Property. The optimal action at time $t$ should only require the current state $S_t$, allowing you to discard the history of past states and actions.
* **Rule 2: Evaluate fixed policies using linear solvers** -> To find the value of a specific policy ($V^\pi$), construct the Bellman equation $V^\pi(s) = R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s') V^\pi(s')$. Because the policy is fixed, the `max` operator is removed, resulting in a system of $m$ linear equations (for $m$ states) that can be solved directly and efficiently via matrix operations (Policy Evaluation).
* **Rule 3: Compute optimal values via Value Iteration** -> To find the optimal value function ($V^*$), initialize a vector $V \in \mathbb{R}^m$ to zeros. Repeatedly apply the update rule: $V(s) := R(s) + \max_a \gamma \sum_{s'} P_{sa}(s') V(s')$ for all states $s$. Continue iterating this Bellman operator until the values converge to a fixed point.
* **Rule 4: Extract the optimal policy from optimal values** -> Once Value Iteration converges to $V^*$, derive the optimal policy $\pi^*$ by selecting the action that yields the highest expected next-state value: $\pi^*(s) = \arg\max_a \sum_{s'} P_{sa}(s') V^*(s')$.
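A minimal sketch of Rules 2–4 under the tabular setup above (`P` of shape `(|S|, |A|, |S|)`, `R` of shape `(|S|,)`); the function names are my own, not from the lecture:

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Rule 2: solve the linear system V = R + gamma * P_pi @ V exactly."""
    n = len(R)
    P_pi = P[np.arange(n), pi]                  # row s holds P_{s, pi(s)}(.)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

def value_iteration(P, R, gamma, tol=1e-8):
    """Rule 3: iterate V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')."""
    V = np.zeros(len(R))
    while True:
        Q = R[:, None] + gamma * (P @ V)        # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def extract_policy(P, V):
    """Rule 4: pi*(s) = argmax_a sum_{s'} P_sa(s') V*(s')."""
    return np.argmax(P @ V, axis=1)

# Usage with the 1D-tape MDP built earlier:
# V_star  = value_iteration(P, R, gamma)
# pi_star = extract_policy(P, V_star)
# V_pi    = evaluate_policy(P, R, gamma, pi_star)   # matches V_star at convergence
```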
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Utilizing purely greedy decision-making strategies. -> **Why it fails:** By choosing actions based only on immediate, short-term rewards, the agent ignores the sequential nature of the environment and the long-term consequences of its actions. -> **Warning sign:** The system achieves early minor successes but consistently fails to reach complex, distant goals that require traversing through low-reward states.
* **Pitfall:** Using an infinite time horizon without a discount factor ($\gamma = 1$). -> **Why it fails:** The mathematical sum of rewards over infinite time steps will diverge to infinity (if rewards are positive), making it impossible for the Bellman equations to converge or to definitively compare which policy is better. -> **Warning sign:** Value iteration values explode to infinity, or the algorithm never converges.
* **Pitfall:** Assuming the environment is deterministic when it is stochastic. -> **Why it fails:** If transitions are modeled as $s_{t+1} = f(s_t, a_t)$ rather than a probability distribution $P_{sa}$, the agent will fail to account for failure modes (e.g., slipping on the tape) and will overvalue risky policies. -> **Warning sign:** The theoretical value function drastically overestimates the actual rewards observed during real-world execution.
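A one-step illustration of the third pitfall, using the slip probabilities from the tape example: backing up state 7 under action $L$ with a deterministic model versus the true stochastic dynamics gives

$$V_{\text{det}}(7) = R(7) + \gamma\, V(6), \qquad V_{\text{stoch}}(7) = R(7) + \gamma\left[0.9\, V(6) + 0.1\, V(7)\right].$$

Whenever $V(6) > V(7)$, which holds near the goal, the deterministic backup is strictly larger, and that optimism compounds as the backups are repeated.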
## 6. Key Quote / Core Insight
"Reinforcement learning is about sequential decision making. It's not like you can just greedily choose a decision based on the current situation... your decisions have long-term consequences. The decision you made yesterday probably would affect the decision you make today."
## 7. Additional Resources & References
* No external resources explicitly mentioned in this video.