# Fast Reinforcement Learning: Optimism, Bayesian Bandits, and PAC Frameworks

**Video Category:** Machine Learning / Artificial Intelligence Tutorial

## 📋 0. Video Metadata

**Video Title:** Lecture 12: Fast Reinforcement Learning
**YouTube Channel:** Stanford Engineering
**Publication Date:** Winter 2019 (visible on slide)
**Video Duration:** ~1 hour 21 minutes

## 📝 1. Core Summary (TL;DR)

This lecture explores how to move Reinforcement Learning (RL) from theoretical models to practical, real-world applications by focusing on "fast learning": minimizing the data required to make optimal decisions. It contrasts frequentist approaches like Optimism Under Uncertainty (UCB) with Bayesian approaches like Thompson Sampling for solving the exploration-exploitation dilemma in Multi-Armed Bandits. Finally, it scales these concepts to full Markov Decision Processes (MDPs), demonstrating how optimistic initialization and model-based exploration bonuses can provide formal (PAC) guarantees for efficient learning.

## 2. Core Concepts & Frameworks

* **Concept:** Multi-Armed Bandits -> **Meaning:** A simplified RL setting with no state transitions, only a set of actions (arms) that each yield a reward from an unknown stochastic distribution. -> **Application:** Used for isolated decision-making problems like A/B testing or ad placement, where the environment does not change based on the action taken.
* **Concept:** Regret -> **Meaning:** A mathematical framework for assessing algorithm quality, defined as the expected opportunity loss: the difference between the expected reward of the optimal action and the expected reward of the action the algorithm actually chose. -> **Application:** Used to prove that an algorithm is systematically learning; a good algorithm must achieve sub-linear (e.g., logarithmic) cumulative regret over time.
* **Concept:** Optimism Under Uncertainty -> **Meaning:** A principle where an agent assigns an "optimistic" upper confidence bound to the estimated value of an action, scaling with the agent's uncertainty about that action. -> **Application:** Implemented in the UCB1 algorithm to force systematic exploration (see the code sketch after this list). As an action is sampled, uncertainty shrinks and the bound lowers, preventing the agent from getting stuck on suboptimal actions.
* **Concept:** Bayesian Bandits & Conjugate Priors -> **Meaning:** An approach that maintains a full probability distribution (posterior) over the unknown reward parameters of each action, updated via Bayes' rule as new data arrives. Conjugate priors (where the prior and posterior share the same parametric form) make this computationally tractable. -> **Application:** Using a Beta distribution to model uncertainty over the success rate of a Bernoulli (binary success/fail) reward, updating the parameters by simple counting.
* **Concept:** Thompson Sampling (Probability Matching) -> **Meaning:** A randomized exploration strategy that selects an action according to the exact probability that it is the optimal action, given the current posterior distributions. -> **Application:** Executed by sampling a single parameter value from each action's posterior distribution and greedily picking the action with the highest sampled value.
* **Concept:** PAC (Probably Approximately Correct) -> **Meaning:** A theoretical framework stating that an algorithm is PAC if, on all but a polynomial number of steps ($N$), it selects an action whose value is $\epsilon$-close to the optimal action's, with high probability ($1 - \delta$). -> **Application:** Used to provide formal sample complexity bounds for RL algorithms in MDPs, ensuring they will not require exponential data to converge.
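To make the optimism principle concrete, here is a minimal UCB1 sketch for a Bernoulli bandit. This is not code from the lecture; the horizon, seed, and arm probabilities (taken from the broken-toes example in Section 3) are illustrative, and the bound is the $\hat{Q}(a) + \sqrt{2\log t / N_t(a)}$ form used below.

```python
import math
import random

def ucb1(true_probs, horizon=1000, seed=0):
    """Minimal UCB1: pull each arm once, then always pull the arm
    with the highest upper confidence bound."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms   # N_t(a): number of pulls per arm
    means = [0.0] * n_arms  # Q-hat(a): empirical mean reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1  # initialization: try every arm once
        else:
            # optimism under uncertainty: empirical mean + exploration bonus
            a = max(range(n_arms),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < true_probs[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
    return counts, means

# Broken-toes arms: surgery (0.95), buddy taping (0.9), doing nothing (0.1)
counts, means = ucb1([0.95, 0.9, 0.1])
print(counts)  # pulls concentrate on the first two arms; the third stays rare
```

The key behavior: arms with few pulls keep a large bonus, so no arm is permanently ignored; as $N_t(a)$ grows, the bonus shrinks and the empirical mean dominates.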
## 3. Evidence & Examples (Hyper-Specific Details)

* **Toy Example: Ways to Treat Broken Toes (Optimism via UCB1):**
    * **Setup:** Three actions with true, unknown Bernoulli parameters (success rates): Surgery ($\theta_1 = 0.95$), Buddy Taping ($\theta_2 = 0.9$), Doing Nothing ($\theta_3 = 0.1$).
    * **Initialization:** The algorithm samples each arm once. Suppose the empirical results are: Surgery gets +1, Taping gets +1, Nothing gets 0.
    * **Computation ($t=3$):** Upper confidence bounds are calculated using the formula $UCB(a) = \hat{Q}(a) + \sqrt{\frac{2\log t}{N_t(a)}}$. With the natural log, $\sqrt{2\log 3} \approx 1.48$, so:
        * $UCB(a_1) = 1 + \sqrt{\frac{2\log 3}{1}} \approx 2.48$
        * $UCB(a_2) = 1 + \sqrt{\frac{2\log 3}{1}} \approx 2.48$
        * $UCB(a_3) = 0 + \sqrt{\frac{2\log 3}{1}} \approx 1.48$
    * **Outcome:** The algorithm breaks the tie between $a_1$ and $a_2$ randomly. The UCB of the chosen action drops on the next step as $N_t(a)$ increases, forcing eventual rotation between the two, but $a_3$ is effectively ignored: its empirical mean is 0, so its upper bound stays significantly lower.
* **Toy Example: Ways to Treat Broken Toes (Thompson Sampling):** (implemented in the code sketch after this list)
    * **Setup:** Same true parameters ($\theta_1 = 0.95$, $\theta_2 = 0.9$, $\theta_3 = 0.1$).
    * **Initialization:** Place an uninformative uniform prior over each arm using a Beta distribution: Beta(1,1).
    * **Step 1:** Sample a parameter $\theta_i \sim \text{Beta}(1,1)$ for each arm. Assume we draw: $a_1=0.3$, $a_2=0.5$, $a_3=0.6$.
    * **Action:** Select $a_3$ because 0.6 is the maximum.
    * **Observation:** The patient outcome is a failure (reward = 0).
    * **Update:** The posterior for $a_3$ becomes Beta(1, 2); $a_1$ and $a_2$ remain Beta(1,1).
    * **Step 2:** Sample again from the new distributions. Assume we draw: $a_1=0.7$, $a_2=0.5$, $a_3=0.3$.
    * **Action:** Select $a_1$.
    * **Observation:** The patient outcome is a success (reward = 1).
    * **Update:** The posterior for $a_1$ becomes Beta(2, 1).
    * **Outcome:** Thompson Sampling naturally reshapes its distributions: Beta(2,1) for $a_1$ skews right (higher probability of being optimal), while Beta(1,2) for $a_3$ skews left.
* **Empirical Comparison of UCB vs. Thompson Sampling:**
    * In a simulated run, Thompson Sampling incurs almost no regret from $a_3$ (doing nothing): after a failure it shifts the posterior of $a_3$ downward and rarely samples it again. UCB, however, is occasionally forced to sample $a_3$ simply because its visitation count $N_t(a)$ is low, causing the uncertainty bound $\sqrt{2\log t / N_t(a)}$ to grow until it surpasses the bounds of $a_1$ and $a_2$.
* **Contextual Bandits for News Article Recommendation:**
    * **Source:** Chapelle and Li, 2011.
    * **Evidence:** The paper analyzed real-world click-through rates (CTR). A chart showed Thompson Sampling vastly outperforming standard UCB algorithms (like UCB1) and $\epsilon$-greedy in maximizing normalized CTR over time. The speaker notes that this empirical success sparked a massive resurgence of interest in Thompson Sampling within the AI/CS community.
* **Model-Based Interval Estimation with Exploration Bonus (MBIE-EB):**
    * **Source:** Strehl & Littman, 2006.
    * **Evidence:** On a simulated 6-arm MDP, MBIE-EB achieved a cumulative reward of ~$1 \times 10^7$, significantly outperforming the R-Max and $\epsilon$-greedy baselines. The algorithm adds an exploration bonus to the empirical reward, $\hat{R}(s,a) + \frac{\beta}{\sqrt{n(s,a)}}$, where $\beta$ is a tuning parameter based on the required confidence ($\delta$) and the sizes of the state and action spaces.
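The Thompson Sampling walkthrough above maps directly onto a few lines of code. A minimal sketch, assuming Bernoulli rewards and Beta(1,1) priors as in the toy example; the horizon, seed, and function name are illustrative:

```python
import random

def thompson_bernoulli(true_probs, horizon=1000, seed=0):
    """Minimal Thompson Sampling: sample a success rate from each arm's
    Beta posterior, act greedily on the samples, update the chosen arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    alpha = [1] * n_arms  # Beta alpha: 1 + number of observed successes
    beta = [1] * n_arms   # Beta beta:  1 + number of observed failures

    for _ in range(horizon):
        # Probability matching: each arm is picked with exactly the
        # probability that it is optimal under the current posteriors.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        a = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[a] else 0
        if reward:
            alpha[a] += 1  # conjugate update on success
        else:
            beta[a] += 1   # conjugate update on failure
    return alpha, beta

alpha, beta = thompson_bernoulli([0.95, 0.9, 0.1])
print(alpha, beta)  # the posterior for "doing nothing" stays near Beta(1, k)
```

Selecting the arg-max of posterior samples is what makes this probability matching: the probability that an arm is optimal never has to be computed explicitly.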
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Never use greedy algorithms for exploration** - A purely greedy approach (always selecting the highest empirical mean) can incur linear total regret: if the optimal action yields a bad reward on its first pull by chance, a greedy algorithm may ignore it permanently.
* **Rule 2: Use Upper Confidence Bounds (UCB), not lower bounds** - Always select actions based on their highest potential value, not their guaranteed minimum. Using lower bounds creates confirmation bias: the algorithm repeatedly selects an initially "lucky" suboptimal arm because its lower bound grows, while ignoring optimal arms that were initially unlucky.
* **Rule 3: Use conjugate priors for computational efficiency** - When implementing Bayesian algorithms, match the prior distribution to the likelihood function to avoid intractable integration. If rewards are binary (success/fail), use a Beta prior; the update rule is computationally trivial: on a success, add 1 to $\alpha$; on a failure, add 1 to $\beta$.
* **Rule 4: Implement Thompson Sampling for empirical performance** - To implement: 1) define a prior distribution for each action; 2) loop: sample one value from each action's distribution; 3) select the action with the highest sampled value; 4) observe the reward; 5) update the chosen action's distribution via Bayes' rule.
* **Rule 5: Use optimistic initialization for model-free RL** - If you are using Q-learning or SARSA and want to force exploration without complex math, initialize all $Q(s,a)$ values to $\frac{R_{max}}{1-\gamma}$ (where $R_{max}$ is the maximum possible reward and $\gamma$ is the discount factor). The agent will explore every state-action pair because all untried actions look better than the updated, realistic values of tried actions.
* **Rule 6: Use visitation counts to build PAC-compliant model-based RL** - To build an MDP algorithm with formal PAC guarantees, maintain a count of state-action visitations $n(s,a)$ and add a reward bonus proportional to $\frac{1}{\sqrt{n(s,a)}}$. As states are explored, the bonus shrinks, naturally transitioning the algorithm from exploration to exploitation (see the sketch after this list).
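Rules 5 and 6 can be combined in a single tabular update. The sketch below is illustrative and is not the lecture's MBIE-EB algorithm (which applies the bonus inside a learned model and then plans); here the count-based bonus is simply added to the observed reward in a model-free Q-learning step, with `alpha`, `beta`, and `GAMMA` as assumed hyperparameters.

```python
import math
from collections import defaultdict

R_MAX, GAMMA = 1.0, 0.99

# Rule 5: optimistic initialization at R_max / (1 - gamma),
# so every untried (s, a) pair looks better than anything realistic.
Q = defaultdict(lambda: R_MAX / (1 - GAMMA))
counts = defaultdict(int)  # n(s, a) visitation counts

def q_update(s, a, r, s_next, actions, alpha=0.1, beta=1.0):
    """One Q-learning step with a count-based bonus (Rule 6): the bonus
    beta / sqrt(n(s,a)) shrinks with visits, moving the agent from
    exploration toward exploitation."""
    counts[(s, a)] += 1
    bonus = beta / math.sqrt(counts[(s, a)])
    target = (r + bonus) + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example step: reward 0 observed after taking "left" in state "s0"
q_update("s0", "left", 0.0, "s1", actions=["left", "right"])
```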
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Using $\epsilon$-greedy with a constant $\epsilon$. -> **Why it fails:** The algorithm explores randomly forever, so it keeps selecting suboptimal actions even after it has learned which action is best. -> **Warning sign:** The cumulative regret curve keeps growing as a straight line; it never flattens toward sub-linear growth.
* **Pitfall:** Bayesian updating with deep neural networks. -> **Why it fails:** Deep neural networks do not generally form conjugate pairs with standard data likelihoods, so writing down a closed-form posterior over millions of network weights is mathematically intractable. -> **Warning sign:** You cannot analytically update your prior, forcing you to rely on computationally expensive approximations (like MCMC) that slow down the RL loop.
* **Pitfall:** Believing model-free RL algorithms (like standard Q-learning) are PAC. -> **Why it fails:** In general, model-free algorithms lack formal performance guarantees. While optimistic initialization helps empirically, standard Q-learning can only be proven to find a near-optimal policy in a polynomial number of steps under highly constrained, specific learning-rate schedules. -> **Warning sign:** The algorithm gets stuck in local optima in environments that require deep, multi-step exploration before any reward is seen.
* **Pitfall:** Supplying highly opinionated, incorrect priors to Thompson Sampling. -> **Why it fails:** If the prior strongly asserts that an optimal arm is bad (e.g., heavily concentrated near 0), the algorithm will sample low values for that arm, rarely pick it, and therefore rarely gather the evidence needed to correct the false prior. -> **Warning sign:** The algorithm explores inefficiently and converges slowly compared to a frequentist UCB approach. Uninformative (flat) priors like Beta(1,1) are safer.

## 6. Key Quote / Core Insight

"Optimism under uncertainty allows an algorithm to essentially learn to make good decisions over time, avoiding the linear regret of greedy approaches by systematically exploring actions it is uncertain about. You construct a bound on how good the world could be, and you act as if that optimistic bound is true."

## 7. Additional Resources & References

* **Resource:** "Interval Estimation" (1993) by Leslie Kaelbling (MIT) - **Type:** Paper - **Relevance:** The foundational text that introduced the concept of estimating confidence intervals over potential rewards in RL.
* **Resource:** "Finite-time Analysis of the Multiarmed Bandit Problem" (2002) by Auer, Cesa-Bianchi, and Fischer - **Type:** Paper - **Relevance:** Formalized the UCB1 algorithm and proved its logarithmic regret bounds.
* **Resource:** "An Empirical Evaluation of Thompson Sampling" (2011) by Chapelle and Li - **Type:** Paper - **Relevance:** Showed Thompson Sampling dramatically outperforming UCB in real-world contextual bandit problems (news recommendation), sparking modern interest in the algorithm.
* **Resource:** Even-Dar and Mansour (2002) - **Type:** Paper - **Relevance:** Proved that Q-learning with specific learning-rate schedules can achieve PAC guarantees.
* **Resource:** "Is Q-learning Provably Efficient?" (NeurIPS 2018) by Jin, Allen-Zhu, Bubeck, and Jordan - **Type:** Paper - **Relevance:** Recent proof that optimistic Q-learning achieves good regret bounds, advancing model-free theoretical guarantees.
* **Resource:** "Incremental model-based learners with formal learning-time guarantees" (2006) by Strehl, A. L., Li, L., & Littman, M. L. - **Type:** Paper - **Relevance:** Introduces the MBIE-EB algorithm, showing how visitation-count reward bonuses achieve PAC RL in Markov Decision Processes.