# Advanced Probabilistic Inference: Hidden Markov Models, Particle Filtering, and Gibbs Sampling
**Video Category:** Computer Science / Artificial Intelligence Tutorial
## 0. Video Metadata
**Video Title:** Not explicitly shown (Stanford CS221 Lecture on Probabilistic Inference)
**YouTube Channel:** Stanford Engineering
**Publication Date:** Autumn 2019 (extracted from slide footers)
**Video Duration:** ~1 hour 15 minutes
## 1. Core Summary (TL;DR)
This lecture explores how to perform efficient probabilistic inference within complex Bayesian networks, specifically focusing on tracking hidden states over time. It introduces Hidden Markov Models (HMMs) for sequence data and contrasts exact inference methods with approximate ones. By leveraging the Forward-Backward algorithm, Particle Filtering, and Gibbs Sampling, AI practitioners can compute probabilities in systems ranging from simple discrete state tracking to massive, intractable models like continuous spatial tracking and image denoising.
## 2. Core Concepts & Frameworks
* **Concept:** Hidden Markov Model (HMM) -> **Meaning:** A specific type of Bayesian network designed for sequence data, consisting of a sequence of unobserved "hidden" variables ($H_1, H_2 \dots H_n$) and corresponding "observed" evidence variables ($E_1, E_2 \dots E_n$). It relies on a start distribution $p(h_1)$, transition distributions $p(h_i|h_{i-1})$, and emission distributions $p(e_i|h_i)$. -> **Application:** Used for real-time object tracking, speech recognition, and any scenario where true states must be inferred from a sequence of noisy sensor readings over time.
* **Concept:** Filtering vs. Smoothing Queries -> **Meaning:** In an HMM, "Filtering" asks for the probability of the *current* state given all evidence up to the *current* time step (e.g., $P(H_3 | E_1, E_2, E_3)$). "Smoothing" asks for the probability of a *past* state given *all* evidence across the entire timeline (e.g., $P(H_3 | E_1, E_2, E_3, E_4, E_5)$). -> **Application:** Filtering is used for real-time navigation (where am I now?), while smoothing is used for retrospective analysis (where was the object 5 minutes ago given everything I know now?).
* **Concept:** The Lattice Representation -> **Meaning:** A way to visualize all possible assignments of hidden variables over time in an HMM as paths through a grid. The columns represent time steps ($1$ to $n$), and the rows represent possible values ($1$ to $K$). Every path from start to end represents a complete assignment. -> **Application:** By converting inference into a path-summation problem, it allows the use of dynamic programming (the Forward-Backward algorithm) to avoid calculating probabilities for an exponential number of paths.
* **Concept:** Forward-Backward Algorithm -> **Meaning:** An exact dynamic programming algorithm that efficiently computes smoothing queries. It computes "Forward messages" $F_i(h_i)$ (sum of path weights from the start to node $h_i$) and "Backward messages" $B_i(h_i)$ (sum of path weights from node $h_i$ to the end). The total unnormalized probability for any node is simply $F_i(h_i) \times B_i(h_i)$. -> **Application:** Used to compute exact marginal probabilities for any hidden variable in $O(nK^2)$ time, sharing intermediate computations across all queries simultaneously.
* **Concept:** Particle Filtering -> **Meaning:** An approximate inference algorithm used when the state space $K$ is too large for exact calculation. It maintains a small set of "particles" (candidate state hypotheses), far fewer than the $K$ possible states, and advances them through three steps: Propose (move each particle according to the transition model), Weight (score each particle by how well it matches the new sensor emission), and Resample (draw a fresh set of particles with probability proportional to those weights). -> **Application:** Highly effective for continuous spatial tracking (such as autonomous driving or robotics), where it is impossible to enumerate every coordinate the object could occupy.
* **Concept:** Gibbs Sampling -> **Meaning:** A Markov Chain Monte Carlo (MCMC) algorithm for approximate inference in arbitrary factor graphs (not just HMM sequences). It initializes with a random complete assignment, then iteratively loops through individual variables, resampling the value of one variable at a time based only on its local probabilities conditioned on its immediate neighbors. -> **Application:** Used in complex, loopy graphs where exact inference is impossible, such as computer vision tasks like image denoising or segmentation.
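The Forward-Backward recurrences above can be sketched compactly. This is an illustrative implementation, not the lecture's code: the forward message accumulates path weights from the start, the backward message accumulates path weights to the end, and the smoothed marginal at each node is their normalized product.

```python
def forward_backward(start, trans, emit, observations):
    """Exact HMM smoothing via forward/backward messages.

    start[h]     = p(h1 = h)                  (length K)
    trans[g][h]  = p(h_i = h | h_{i-1} = g)   (K x K)
    emit[h][e]   = p(e_i = e | h_i = h)       (K x M)
    observations = [e_1, ..., e_n] as symbol indices
    Returns the smoothed marginals P(H_i = h | all evidence) as an n x K list.
    """
    n, K = len(observations), len(start)

    # Forward pass: F[i][h] = sum of path weights from the start to node (i, h).
    F = [[0.0] * K for _ in range(n)]
    for h in range(K):
        F[0][h] = start[h] * emit[h][observations[0]]
    for i in range(1, n):
        for h in range(K):
            F[i][h] = emit[h][observations[i]] * sum(
                F[i - 1][g] * trans[g][h] for g in range(K))

    # Backward pass: B[i][h] = sum of path weights from node (i, h) to the end.
    B = [[1.0] * K for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for h in range(K):
            B[i][h] = sum(trans[h][g] * emit[g][observations[i + 1]] * B[i + 1][g]
                          for g in range(K))

    # Unnormalized marginal at (i, h) is F[i][h] * B[i][h]; normalize each row.
    posteriors = []
    for i in range(n):
        row = [F[i][h] * B[i][h] for h in range(K)]
        z = sum(row)
        posteriors.append([w / z for w in row])
    return posteriors
```

One forward pass plus one backward pass ($O(nK^2)$ total) yields the answer to every smoothing query at once, exactly as the lecture's key quote emphasizes.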
## 3. Evidence & Examples (Hyper-Specific Details)
* **HMM Mathematical Setup (Object Tracking):** The speaker details an object tracking scenario where $H_i \in \{1 \dots K\}$ represents the true location of a car at time step $i$, and $E_i \in \{1 \dots K\}$ represents the noisy sensor reading. The start distribution $p(h_1)$ is uniform. The transition distribution $p(h_i | h_{i-1})$ dictates that cars can only move to adjacent locations (uniform over adjacent). The emission distribution $p(e_i | h_i)$ assigns probabilities to sensor readings given the actual location, factoring in expected sensor noise.
* **Whiteboard Demonstration of Forward-Backward:** The speaker draws a lattice with 3 time steps ($H_1, H_2, H_3$) and 2 possible states ($1, 2$). He assigns arbitrary integer weights to the edges to avoid decimal math. To find the probability of $H_2=2$, he calculates the sum of weights of all paths entering $H_2=2$ from the left (Forward message) and multiplies it by the sum of weights of all paths exiting $H_2=2$ to the right (Backward message). He demonstrates that this mathematically factors out the exponential complexity, calculating $(a+b)(c+d)$ instead of $ac+ad+bc+bd$.
* **Particle Filtering JavaScript Demo:** A live demo shows a yellow dot moving on a black background, representing the true object. A swarm of red dots represents the "particles" trying to track it. The speaker adjusts the `numParticles` parameter. When set too low, the red swarm loses the yellow dot and wanders randomly. When increased to $10,000$, the red blob tightly follows the yellow dot. The demo visually proves the "Resample" step, showing how particles with low weights disappear while high-weight particles duplicate and cluster around the probable location.
* **Image Denoising Demo (Gibbs Sampling):** A grid of binary pixels ($X_i \in \{0, 1\}$) represents a black-and-white image. The observed image is heavily corrupted with noise. The factor graph enforces a "coherence factor" (e.g., weight 2 if neighboring pixels are the same color, weight 1 if different) and an observation factor linking hidden pixels to the noisy observed pixels. The demo runs iterations of Gibbs Sampling; as the iterations progress from 1 to 200+, the noise visually vanishes, and the numbers "221" clearly emerge from the static, proving that local, single-variable updates eventually push the entire graph toward the global maximum probability state.
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Use Forward-Backward to avoid redundant computation** - When solving smoothing queries in an HMM, do not compute probabilities for each time step individually. Run one forward pass to compute all $F$ values, run one backward pass to compute all $B$ values, and cache them. You can then compute $P(H_i=h | E=e)$ for *any* time step $i$ instantly by multiplying $F_i(h_i) \times B_i(h_i)$ and normalizing.
* **Rule 2: Default to Particle Filtering for massive state spaces** - If your variable can take on a massive number of states (e.g., $10,000$ possible grid coordinates), the $O(nK^2)$ complexity of Forward-Backward becomes a bottleneck. Switch to Particle Filtering and set the beam size (number of particles) to a manageable number (e.g., $1,000$).
* **Rule 3: Always Resample proportionally, do not use greedy pruning** - When pruning hypotheses in Particle Filtering, never just take the top $K$ highest-weighted particles (like in Beam Search). This destroys the diversity of the probability distribution. You must resample by drawing $K$ elements *with probability proportional to their weight*, allowing a chance for lower-weight edge cases to survive and maintain an accurate distribution shape.
* **Rule 4: Apply Gibbs Sampling for cyclic/loopy graph structures** - If your Bayesian network is not a simple linear sequence (like an HMM) but a 2D grid or complex web (like image pixels), exact dynamic programming algorithms will fail due to cycles. Implement Gibbs Sampling by freezing all variables except one, calculating the probability of that one variable based on its direct neighbors, resampling it, and looping until the system converges.
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Using exact inference (Forward-Backward) on large continuous domains. -> **Why it fails:** The algorithm's time complexity is $O(nK^2)$, where $K$ is the number of possible states. If tracking coordinates on a $100 \times 100$ grid, $K = 10,000$, resulting in $100,000,000$ operations per time step. -> **Warning sign:** Computations become prohibitively slow or run out of memory when transitioning from simple discrete states to spatial tracking.
* **Pitfall:** Treating Particle Filtering exactly like standard Beam Search. -> **Why it fails:** Beam search is a heuristic that greedily keeps the top $K$ candidates. If applied to probability tracking, all particles will quickly converge onto the single highest-probability state, collapsing the distribution to a single point and failing to track multiple possible hypotheses. -> **Warning sign:** Your tracking system is overconfident, outputs a single point instead of a spread, and cannot recover if the object makes an unexpected sudden movement.
* **Pitfall:** Retaining low-weight particles in memory during tracking. -> **Why it fails:** In particle filtering, if you just propose and weight without resampling, you waste computational resources simulating particles that have effectively $0$ probability of matching reality. -> **Warning sign:** Your particle filter runs slowly but your particle cloud spans the entire map rather than tightly grouping around the target.
## 6. Key Quote / Core Insight
"The output of the Forward-Backward algorithm is not just the answer to one query, but all the smoothing queries you want. At every position, you have the distribution over the hidden variable given all the evidence. It computes all the queries in the same time it takes to compute a single query."
## 7. Additional Resources & References
* **Resource:** Stanford CS221 (Artificial Intelligence: Principles and Techniques) - **Type:** University Course - **Relevance:** The overarching curriculum providing the context, homework assignments (e.g., tracking cars), and JavaScript demos used in this lecture.