# Bayesian Networks: Probabilistic Programming
**Video Category:** Computer Science / Artificial Intelligence Tutorial
## 0. Video Metadata
**Video Title:** Bayesian networks: probabilistic programming
**YouTube Channel:** Stanford Engineering (CS221)
**Publication Date:** Not shown in video
**Video Duration:** ~15:30
## 1. Core Summary (TL;DR)
Probabilistic programming offers a powerful paradigm for defining Bayesian networks by writing randomized programs that tell a "generative story" of how data is produced. Instead of relying solely on complex mathematical notation or static graphs, this approach treats probability distributions as executable code, which makes it easier to model complex, real-world phenomena. The generative approach (modeling how hidden quantities of interest generate observable data) flips traditional classification on its head, providing a natural way to handle complex structures, prior knowledge, and missing information across domains like object tracking, NLP, and medical diagnosis.
## 2. Core Concepts & Frameworks
* **Probabilistic Program:** -> **Meaning:** A randomized program that, when executed, sets the values of random variables. It acts as a formal mathematical construct used to define a joint probability distribution over all possible assignments to those variables. -> **Application:** Used to simulate data generation and formally specify the relationships in a Bayesian network using code instead of raw equations.
* **Markov Model:** -> **Meaning:** A specific type of Bayesian network where variables form a linear chain, and each random variable depends exclusively on its immediate predecessor in the chain. -> **Application:** Used for sequential data like basic object tracking or simple language modeling (N-grams).
* **Hidden Markov Model (HMM):** -> **Meaning:** An extension of the Markov model that introduces a separation between "hidden" states (e.g., true location) and "observable" states (e.g., sensor readings). The hidden states form a Markov chain, and each observable state depends only on the hidden state at that specific time step. -> **Application:** Used for single object tracking from noisy sensor data or speech recognition; a minimal code sketch of both the Markov model and the HMM follows this list.
* **Factorial Hidden Markov Model:** -> **Meaning:** A generalized HMM designed to track multiple hidden states simultaneously. It features multiple independent Markov chains of hidden states that jointly produce a single combined observable emission at each time step. -> **Application:** Used for multi-object tracking where sensor readings conflate the signals of multiple targets.
* **Latent Dirichlet Allocation (LDA):** -> **Meaning:** A probabilistic generative model for text where documents are represented as continuous probability distributions over "topics," and topics are distributions over words. -> **Application:** Used for topic modeling in document analysis to infer what subjects a large text corpus is discussing.
* **Stochastic Block Model:** -> **Meaning:** A generative model for graphs/networks where nodes (people) are assigned latent "types", and the probability of an edge (connection) forming between two nodes is determined entirely by their respective types. -> **Application:** Used in social network analysis to infer community structures and node roles based on connectivity patterns.
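
To make the Markov model and HMM definitions concrete, here is a minimal Python sketch. It is not from the video: the function names and the noisy-column emission model are assumptions, while the right/down grid-walk dynamics follow the tracking demo described in Section 3.

```python
import random

def bernoulli(epsilon):
    """Return True with probability epsilon."""
    return random.random() < epsilon

def markov_chain(T, alpha=0.5):
    """Markov model: each position X_t depends only on X_{t-1}."""
    x = (0, 0)                                # X_0
    path = [x]
    for _ in range(T):
        if bernoulli(alpha):                  # step right with probability alpha
            x = (x[0] + 1, x[1])
        else:                                 # otherwise step down
            x = (x[0], x[1] + 1)
        path.append(x)
    return path

def hmm(T, alpha=0.5, noise=1):
    """Hidden Markov model: hidden positions H_t form a Markov chain;
    each sensor reading E_t depends only on H_t (here, a noisy column index)."""
    hidden = markov_chain(T, alpha)
    observed = [h[0] + random.randint(-noise, noise) for h in hidden]
    return hidden, observed
```

Running `hmm(10)` returns a hidden trajectory and its noisy readings; inference would reverse the arrows and recover the trajectory from the readings alone.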
## 3. Evidence & Examples (Hyper-Specific Details)
* **Basic Probabilistic Program (Alarm Network):** The speaker demonstrates a basic network defining variables `B` and `E` drawn from a Bernoulli distribution (`B ~ Bernoulli(epsilon); E ~ Bernoulli(epsilon)`). Variable `A` is defined as `A = B OR E`. The `Bernoulli` function is shown as `def Bernoulli(epsilon): return random.random() < epsilon`. Running this generates joint assignments like `(A=1, B=0, E=1)`. A runnable sketch of this program appears after this list.
* **Object Tracking (Markov Model Demo):** An interactive JavaScript grid is shown. The object starts at `X_0 = (0,0)`. The code dictates: `if Bernoulli(alpha): go right; else: go down`.
- When `alpha = 0.5`, the generated path (red squares) steps randomly right and down with equal probability.
- Changing `alpha` to `0.1` visibly forces the path straight down the left edge.
- Changing `alpha` to `0.9` forces the trajectory strictly along the top right edge.
* **Conditioned Inference (Object Tracking):** Using the same grid, the speaker asks: "What are possible trajectories given evidence $X_{10} = (8,2)$?" The demo is run using a `condition(10, [8, 2])` function, which visually restricts the sampled red paths *only* to those that successfully pass through the coordinate (8,2) at time step 10. A rejection-sampling sketch of this conditioning appears after this list.
* **Language Modeling (N-gram/Markov Model):** The speaker models speech recognition scoring using the phrase "Wreck a nice beach". The program generates $X_1$ ("Wreck"), then generates $X_2$ ("a") given $X_1$, $X_3$ ("nice") given $X_2$, etc., demonstrating a chain dependency $X_i \sim p(X_i | X_{i-1})$.
* **Single Object Tracking (HMM):** Visualized as a graph where hidden locations $H_t$ (e.g., coordinate $3,1$) generate sensor readings $E_t$ (e.g., reading value $4$). The inference task is to find the object location $H$ given only the sequence of sensor readings $E_1, E_2, E_3 \dots$.
* **Multiple Object Tracking (Factorial HMM):** The model generates locations for object $a$ ($H_t^a$) and object $b$ ($H_t^b$). At time step 1, $H_1^a$ and $H_1^b$ jointly produce a single sensor reading $E_1$. The visual graph shows arrows from both $H_1^a$ and $H_1^b$ pointing into $E_1$.
* **Document Classification (Naive Bayes):** A class label $Y$ is generated first (e.g., $Y$ = "travel"). Given this label, the program independently generates the words $W_1$ to $W_L$, one per position. The example shows $W_1$ becoming "beach" and $W_2$ becoming "Paris" based purely on the parent label "travel".
* **Topic Modeling (LDA):** The model first generates a continuous distribution vector $\alpha \in \mathbb{R}^K$ (e.g., $\alpha$ = `{travel: 0.8, Europe: 0.2}`). For each word position, it generates a specific topic $Z_i$ based on $\alpha$ (e.g., $Z_1$ = travel, $Z_L$ = Europe), and then generates the actual word $W_i$ based on $Z_i$ (e.g., $W_1$ = "beach", $W_L$ = "Euro"). This generative story is sketched in code after this list.
* **Medical Diagnosis (Bipartite Structure):** The model generates boolean states for diseases $D_i$ (e.g., Pneumonia=1, Cold=0, Malaria=0). Symptoms $S_j$ are then generated based on the combination of active diseases. For instance, "Fever=1" and "Cough=1" are generated, conditional on the parent diseases.
* **Social Network Analysis (Stochastic Block Model):** The program generates person types: $H_1$ = politician, $H_2$ = scientist, $H_3$ = scientist. It then generates binary connectivity $E_{ij}$. Because $H_2$ and $H_3$ are both scientists, $E_{23}$ evaluates to 1 (connected). Because $H_1$ is a politician and $H_3$ is a scientist, $E_{13}$ evaluates to 0 (not connected). A code sketch of this model appears after this list.
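
The alarm-network program above translates almost line-for-line into runnable Python. This is a sketch: the variable names and the `Bernoulli` helper follow the video, while the value of `epsilon` and the sampling loop are added for illustration.

```python
import random

def Bernoulli(epsilon):
    """Return True with probability epsilon (as shown in the video)."""
    return random.random() < epsilon

def alarm_network(epsilon=0.05):       # epsilon value is an assumption
    """Generative story: burglary B and earthquake E are independent
    rare events; the alarm A sounds if either occurs."""
    B = Bernoulli(epsilon)
    E = Bernoulli(epsilon)
    A = B or E
    return {"A": int(A), "B": int(B), "E": int(E)}

# Each run yields one joint assignment, e.g. {'A': 1, 'B': 0, 'E': 1}.
samples = [alarm_network() for _ in range(5)]
```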
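The `condition(10, [8, 2])` demo can be approximated with rejection sampling: simulate the walk forward many times and keep only trajectories consistent with the evidence. The sketch below assumes the demo works by filtering samples; the demo's actual internals are not shown in the video.

```python
import random

def bernoulli(alpha):
    return random.random() < alpha

def sample_path(T, alpha):
    """Forward simulation of the grid walk starting at X_0 = (0, 0)."""
    x, path = (0, 0), [(0, 0)]
    for _ in range(T):
        x = (x[0] + 1, x[1]) if bernoulli(alpha) else (x[0], x[1] + 1)
        path.append(x)
    return path

def conditioned_paths(num_samples, T, alpha, t_obs, x_obs):
    """Keep only paths that pass through x_obs at time t_obs,
    i.e. condition on the evidence X_{t_obs} = x_obs."""
    paths = (sample_path(T, alpha) for _ in range(num_samples))
    return [p for p in paths if p[t_obs] == tuple(x_obs)]

# Posterior samples of trajectories given X_10 = (8, 2):
valid = conditioned_paths(100000, 10, 0.5, 10, (8, 2))
```

With `alpha = 0.5`, only about 45/1024 (roughly 4.4%) of simulated paths survive the filter, which previews the intractability pitfall discussed in Section 5.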
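The LDA bullet also reads naturally as a short generative program. In the sketch below the topic names follow the video's example, but the word lists, the probabilities, and the use of a fixed topic distribution (rather than a draw from a Dirichlet prior) are simplifications and assumptions.

```python
import random

# Illustrative topic-to-word lists; not from the video.
TOPIC_WORDS = {
    "travel": ["beach", "flight", "hotel"],
    "Europe": ["Euro", "Paris", "Berlin"],
}

def generate_document(length, alpha=None):
    """LDA-style generative story: draw a topic Z_i for each position
    from the document's topic distribution alpha, then draw the word
    W_i from that topic's word distribution."""
    alpha = alpha or {"travel": 0.8, "Europe": 0.2}
    topics = list(alpha)
    weights = [alpha[t] for t in topics]
    words = []
    for _ in range(length):
        z = random.choices(topics, weights=weights)[0]  # Z_i ~ alpha
        w = random.choice(TOPIC_WORDS[z])                # W_i ~ p(w | Z_i)
        words.append((z, w))
    return words

# e.g. [('travel', 'beach'), ('travel', 'hotel'), ('Europe', 'Euro'), ...]
doc = generate_document(5)
```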
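Finally, the stochastic block model is sketched below. Only the structure, latent types generating observed connections, comes from the video; the type set, the uniform type prior, and the same-type/different-type edge probabilities are assumptions.

```python
import random

def stochastic_block_model(n, types=("politician", "scientist"),
                           p_same=0.9, p_diff=0.1):
    """Generative story: assign each person a latent type H_i, then
    connect each pair (i, j) with a probability that depends only on
    whether their types match."""
    H = [random.choice(types) for _ in range(n)]
    E = {}
    for i in range(n):
        for j in range(i + 1, n):
            p = p_same if H[i] == H[j] else p_diff
            E[(i, j)] = int(random.random() < p)
    return H, E

# Inference goes the other way: observe E (who talks to whom)
# and infer the hidden types H.
people, edges = stochastic_block_model(3)
```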
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Adopt the generative modeling paradigm** - Do not build models by mapping inputs directly to outputs (discriminative). Instead, define a "story" of how the hidden quantities of interest in the real world physically generate the observable data.
* **Rule 2: Use simulation to build probability intuition** - Write your probabilistic model as a program with variable parameters (like the `alpha` value in the grid tracker) and run it forward multiple times. Observing the raw outputs of the generative process helps debug model assumptions before attempting complex inference.
* **Rule 3: Frame inference as restricted simulation** - To perform probabilistic inference (answering "what happened?"), run your generative program but strictly clamp or filter the runs against the known observed evidence (e.g., forcing $X_{10} = (8,2)$). The remaining distribution of valid paths represents your answer.
* **Rule 4: Ignore missing variables entirely during inference** - When dealing with missing information (e.g., a patient was not asked if they have a specific symptom), do not impute the data or alter the network structure. Simply do not condition on that variable; the probabilistic program mathematically marginalizes it out by default. Both this rule and Rule 3 are demonstrated in the rejection-sampling sketch after this list.
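
Rules 3 and 4 can be seen together in one rejection-sampling sketch built on the disease/symptom example from Section 3. All priors and conditional probabilities here are invented for illustration, and raw rejection sampling is used only because it is easy to read; Section 5 explains why a real system needs a proper inference engine.

```python
import random

def bernoulli(p):
    return random.random() < p

def generate_patient():
    """Generative story: diseases first, then symptoms given diseases."""
    pneumonia = bernoulli(0.02)
    cold = bernoulli(0.10)
    fever = bernoulli(0.9 if pneumonia else (0.4 if cold else 0.05))
    cough = bernoulli(0.8 if (pneumonia or cold) else 0.05)
    return {"pneumonia": pneumonia, "cold": cold,
            "fever": fever, "cough": cough}

def posterior(query, evidence, num_samples=200_000):
    """Rule 3: keep only samples that match the observed evidence.
    Rule 4: variables absent from `evidence` (e.g. an unasked symptom)
    are simply never checked, so they are marginalized out automatically."""
    kept = [s for s in (generate_patient() for _ in range(num_samples))
            if all(s[k] == v for k, v in evidence.items())]
    return sum(s[query] for s in kept) / len(kept) if kept else float("nan")

# P(pneumonia = 1 | fever = 1), with 'cough' never observed:
print(posterior("pneumonia", {"fever": True}))
```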
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Treating classification strictly as an input-to-output mapping. -> **Why it fails:** It prevents the model from understanding the causal structure of the environment, making it brittle when faced with missing data or novel scenarios that require domain knowledge. -> **Warning sign:** You find it impossible to incorporate prior knowledge (like the fact that two scientists are more likely to talk) into a standard black-box classifier.
* **Pitfall:** Treating a probabilistic program *only* as executable code. -> **Why it fails:** While the code can be run to generate samples, relying on raw execution (like rejection sampling) for inference is computationally intractable for rare events. The code must be viewed as a formal definition of a joint distribution that requires specialized mathematical inference engines to solve efficiently. -> **Warning sign:** Your program runs forever trying to generate a simulation that matches a highly specific, rare set of observed evidence.
## 6. Key Quote / Core Insight
"The common paradigm in probabilistic programming is to come up with stories of how the quantities of interest generate the observed data. This is the exact opposite of how we normally do classification."
## 7. Additional Resources & References
No external resources explicitly mentioned in this video.