# Learning Parameters of Bayesian Networks: Maximum Likelihood and EM

**Video Category:** Computer Science / Probabilistic Graphical Models

## 📋 0. Video Metadata

**Video Title:** Lecture 14: learning parameters of Bayesian networks
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~80 minutes

## 📝 1. Core Summary (TL;DR)

This lecture details how to move from manually defining probability tables in a Bayesian network to learning them directly from training data. It shows that Maximum Likelihood Estimation (MLE) in fully observed datasets reduces to counting local variable assignments and normalizing the counts into probabilities. To prevent fragile models that overfit sparse data, it introduces Laplace smoothing via pseudocounts. Finally, it addresses the harder problem of learning from partially observed data (hidden variables) with the Expectation-Maximization (EM) algorithm, which iteratively "hallucinates" fractional data weights and re-estimates the parameters from them.

## 2. Core Concepts & Frameworks

* **Concept:** Conditional Independence (D-Separation) -> **Meaning:** A framework to determine whether two variables A and B are independent given evidence C by checking whether every path between them in the Directed Acyclic Graph (DAG) is structurally blocked. -> **Application:** Used to simplify probabilistic inference by identifying which parts of a network can be computed independently, allowing for parallelization and reduced computational load.
* **Concept:** Maximum Likelihood Estimation (MLE) -> **Meaning:** The principle of finding the parameters $\theta$ that maximize the probability (likelihood) of the observed training data. For fully observed Bayesian networks, this optimization problem has a closed-form solution: count the occurrences of each local assignment and normalize (see the formula after this list). -> **Application:** Automatically generating the local conditional probability tables of a network directly from a historical dataset.
* **Concept:** Parameter Sharing -> **Meaning:** A modeling decision where multiple distinct variables in a network are forced to use the exact same local conditional distribution (e.g., assuming two different users rate movies according to the same model). -> **Application:** Greatly reduces the number of parameters the network must learn, so significantly fewer training examples are needed to estimate them reliably.
* **Concept:** Laplace Smoothing -> **Meaning:** The practice of adding a small constant $\lambda$ (a pseudocount) to every event count before normalizing, ensuring that no event ever receives a strict $0\%$ probability. -> **Application:** Prevents the model from failing catastrophically when it encounters a valid event combination in production that simply happened to be absent from the limited training data.
* **Concept:** Expectation-Maximization (EM) -> **Meaning:** An iterative optimization algorithm used when some variables in the training data are hidden or unobserved. It alternates between inferring the missing data using the current parameters (E-step) and updating the parameters based on those weighted guesses (M-step). -> **Application:** Training models for clustering, speech recognition (HMMs), or any domain where the underlying causes (hidden states) of observed data are unknown.
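As a compact restatement of the count-and-normalize recipe above (the notation here is mine, not copied from the lecture slides), the closed-form MLE for each local conditional distribution of a fully observed network is

$$
\hat{p}(x_i \mid x_{\mathrm{Parents}(i)}) \;=\; \frac{\mathrm{count}(x_{\mathrm{Parents}(i)},\, x_i)}{\sum_{x_i'} \mathrm{count}(x_{\mathrm{Parents}(i)},\, x_i')},
$$

i.e., the fraction of times the value $x_i$ co-occurs with that particular parent assignment.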
## 3. Evidence & Examples (Hyper-Specific Details)

* **[Whiteboard Game: Conditional Independence]:** The speaker demonstrated D-separation rules using graph drawings to show when paths are blocked (a small numerical check of the v-structure case appears after this list):
    * $A \rightarrow C \rightarrow B$ (Chain): A and B are dependent. Given C, the path is blocked, so A and B become independent.
    * $A \leftarrow C \rightarrow B$ (Common Cause): A and B are dependent. Given C, the path is blocked, so A and B become independent.
    * $A \rightarrow C \leftarrow B$ (V-Structure): A and B are marginally independent initially. Given C (or any descendant of C), A and B become dependent (the "Explaining Away" phenomenon).
    * *Complex Graph:* $A \rightarrow D \rightarrow B$, $A \rightarrow C \leftarrow B$, and $C \rightarrow E$. A and B are dependent. Given D, A and B are independent (top path blocked by D, bottom path blocked by the unobserved v-structure at C). Given D and C, A and B are dependent (C unblocks the bottom path). Given D and E, A and B are dependent (E is a descendant of C, which unblocks the v-structure).
* **[One Variable MLE - Movie Rating]:** A single-node network $R \in \{1,2,3,4,5\}$. Given the training data `[1, 3, 4, 4, 4, 4, 5, 5]`, to find the parameters $\theta = (P(1), P(2), P(3), P(4), P(5))$ the algorithm counts occurrences (e.g., $R=4$ happens 4 times) and divides by the total (8), yielding $P(R=4) = 4/8 = 0.5$ (see the counting sketch after this list).
* **[Two Variable MLE - Genre and Rating]:** A network $G \rightarrow R$ with $G \in \{drama, comedy\}$ and $R \in \{1,2,3,4,5\}$. The data consists of full pairs like `{"G": "drama", "R": 4}`. The algorithm isolates local structures: to find $P(G)$, it counts only the $G$ occurrences. To find $P(R \mid G=drama)$, it filters the rows where $G=drama$, counts the $R$ values within that subset, and normalizes only those counts.
* **[Hidden Markov Model (HMM) Parameter Sharing]:** A network with hidden states $H_1 \rightarrow H_2 \rightarrow H_3$ emitting observations $E_1, E_2, E_3$. Instead of learning separate transition probabilities for $H_1 \rightarrow H_2$ and $H_2 \rightarrow H_3$, the model defines a single shared parameter `p_trans`. During MLE, the algorithm aggregates counts from all $H_{t-1} \rightarrow H_t$ transitions across the entire sequence into a single dictionary before normalizing, rather than treating them as distinct distributions (see the shared-counting sketch after this list).
* **[Expectation-Maximization Walkthrough]:** A network $G \rightarrow R_1$, $G \rightarrow R_2$ where $G$ (genre) is unobserved in the data (one E-step/M-step iteration is sketched in code after this list).
    * *Data:* User 1 rates 1, user 2 rates 2 (`R1=1, R2=2`).
    * *Initialization:* The parameters are randomized to break symmetry; the prior can start uniform (e.g., $P(G=drama)=0.5$, $P(G=comedy)=0.5$) as long as the local probabilities $P(R \mid G)$ are random and non-uniform.
    * *E-Step:* Using the current parameters, compute the posterior over the hidden variable. Result: $P(G=drama \mid R_1=1, R_2=2) = 0.6923$ and $P(G=comedy \mid R_1=1, R_2=2) = 0.3077$.
    * *M-Step:* Instead of adding integer counts, the algorithm "hallucinates" fractional data points. It adds $+0.6923$ to the count for the assignment `(G=drama, R1=1, R2=2)` and $+0.3077$ to the count for `(G=comedy, R1=1, R2=2)`. It then normalizes these fractional counts to produce the new $\theta$ for the next iteration.
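The "explaining away" behavior of the v-structure can be checked numerically. The sketch below uses hypothetical probabilities (not taken from the lecture): A and B are independent fair coins, and C tends to be 1 whenever A or B is 1. Brute-forcing the joint shows that A and B are independent on their own but become dependent once C is observed.

```python
import itertools

# Hypothetical v-structure A -> C <- B with binary variables.
pA = {0: 0.5, 1: 0.5}
pB = {0: 0.5, 1: 0.5}

def pC(c, a, b):
    p1 = 0.9 if (a == 1 or b == 1) else 0.1  # C is likely 1 when A or B is 1
    return p1 if c == 1 else 1.0 - p1

# Brute-force joint P(a, b, c) = P(a) * P(b) * P(c | a, b).
joint = {(a, b, c): pA[a] * pB[b] * pC(c, a, b)
         for a, b, c in itertools.product([0, 1], repeat=3)}

def prob(pred):
    """Total probability of all assignments (a, b, c) satisfying pred."""
    return sum(p for abc, p in joint.items() if pred(*abc))

def cond(event, given):
    """P(event | given) via the definition of conditional probability."""
    return prob(lambda a, b, c: event(a, b, c) and given(a, b, c)) / prob(given)

print(prob(lambda a, b, c: a == 1))                                     # 0.5
print(cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1))             # 0.5: B alone tells us nothing about A
print(cond(lambda a, b, c: a == 1, lambda a, b, c: c == 1))             # ~0.643
print(cond(lambda a, b, c: a == 1, lambda a, b, c: b == 1 and c == 1))  # 0.5: given C, learning B changes our belief about A
```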
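A minimal sketch of the count-and-normalize recipe for the two MLE examples above. The one-variable data is the list from the lecture; the rows beyond `{"G": "drama", "R": 4}` in the two-variable case are illustrative placeholders, since the lecture only shows that single row explicitly.

```python
from collections import Counter, defaultdict

# One-variable network R: count each rating, then normalize.
ratings = [1, 3, 4, 4, 4, 4, 5, 5]
counts = Counter(ratings)
p_R = {r: c / len(ratings) for r, c in counts.items()}
print(p_R[4])  # 4 / 8 = 0.5

# Two-variable network G -> R: estimate P(G) and P(R | G) from full pairs.
# Only the first row comes from the lecture; the rest is made up for illustration.
data = [{"G": "drama", "R": 4}, {"G": "drama", "R": 5}, {"G": "comedy", "R": 2}]

g_counts = Counter(row["G"] for row in data)
p_G = {g: c / len(data) for g, c in g_counts.items()}          # P(G)

r_given_g_counts = defaultdict(Counter)
for row in data:
    r_given_g_counts[row["G"]][row["R"]] += 1                  # count (G, R) jointly ...
p_R_given_G = {g: {r: c / sum(cnt.values()) for r, c in cnt.items()}
               for g, cnt in r_given_g_counts.items()}          # ... then normalize per G
print(p_R_given_G["drama"][4])  # 1 / 2 = 0.5 with this illustrative data
```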
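The HMM parameter-sharing idea can be sketched as follows: every adjacent state pair, from every sequence, feeds one shared count dictionary, and a single normalization produces `p_trans`. The state sequences below are made up, and the states are written out as observed values purely to keep the counting step visible.

```python
from collections import Counter

def estimate_shared_transition(state_sequences):
    """MLE of a single transition distribution shared across all time steps."""
    counts = Counter()
    for seq in state_sequences:
        for prev, nxt in zip(seq, seq[1:]):   # every H_{t-1} -> H_t pair ...
            counts[(prev, nxt)] += 1          # ... pooled into ONE dictionary
    totals = Counter()
    for (prev, _), c in counts.items():
        totals[prev] += c
    return {(prev, nxt): c / totals[prev] for (prev, nxt), c in counts.items()}

# Illustrative, fully observed state sequences (hypothetical values).
p_trans = estimate_shared_transition([["sunny", "sunny", "rainy"],
                                      ["sunny", "rainy", "rainy"]])
print(p_trans[("sunny", "rainy")])  # 2 of the 3 transitions out of "sunny" -> 2/3
```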
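One E-step/M-step iteration for the $G \rightarrow R_1$, $G \rightarrow R_2$ walkthrough can be sketched as below. The starting parameters are made up, so the posterior printed here differs from the 0.6923/0.3077 reported in the lecture for its own random initialization; both ratings share one $P(R \mid G)$ table, matching the parameter-sharing discussion.

```python
from collections import defaultdict

genres = ["drama", "comedy"]
p_G = {"drama": 0.5, "comedy": 0.5}                       # uniform prior
p_R_given_G = {"drama":  {1: 0.4, 2: 0.3, 3: 0.1, 4: 0.1, 5: 0.1},   # made-up,
               "comedy": {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}}   # non-symmetric tables

r1, r2 = 1, 2  # the single training example; G is unobserved

# E-step: posterior over the hidden genre, P(g | r1, r2) ∝ P(g) P(r1 | g) P(r2 | g).
weights = {g: p_G[g] * p_R_given_G[g][r1] * p_R_given_G[g][r2] for g in genres}
z = sum(weights.values())
posterior = {g: w / z for g, w in weights.items()}

# M-step: add the posterior as fractional ("hallucinated") counts, then normalize.
g_counts = defaultdict(float)
r_counts = {g: defaultdict(float) for g in genres}
for g, w in posterior.items():
    g_counts[g] += w
    r_counts[g][r1] += w
    r_counts[g][r2] += w

new_p_G = {g: c / sum(g_counts.values()) for g, c in g_counts.items()}
new_p_R_given_G = {g: {r: c / sum(cnt.values()) for r, c in cnt.items()}
                   for g, cnt in r_counts.items()}
print(posterior, new_p_G)
```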
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Isolate variables to compute maximum likelihood** -> To estimate a local conditional distribution $P(X_i \mid Parents(X_i))$ in a fully observed dataset, ignore all other variables in the network. Count the specific occurrences of $(X_i, Parents)$ and divide by the total occurrences of $(Parents)$.
* **Rule 2: Implement parameter sharing via unified counting** -> When two variables $X_i$ and $X_j$ should behave identically (e.g., multiple time steps in an HMM or multiple interchangeable users), map them to the same parameter label. Loop through the data, add counts from both variables into a single shared tracking dictionary, and normalize once at the end.
* **Rule 3: Apply Laplace smoothing to prevent fragile zero probabilities** -> Before normalizing raw counts into probabilities, add a pseudocount constant $\lambda$ (commonly $\lambda = 1$) to every possible state's count. This guarantees $P > 0$ for unseen events, preventing mathematical failures (such as multiplying by zero) during future inference (see the sketch after this list).
* **Rule 4: Break symmetry when initializing EM** -> When starting the Expectation-Maximization algorithm, never initialize the hidden-state parameters uniformly (e.g., 50/50). Initialize them with random, non-uniform distributions so the algorithm has leverage to begin pulling clusters apart.
* **Rule 5: Use fractional weights to handle unobserved data** -> When the training data lacks assignments for a hidden variable, compute the conditional distribution $P(Hidden \mid Observed)$ using the current parameters. Create weighted assignments from these probabilities and add the fractional values to your counts during the M-step (as in the EM sketch above).
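A minimal sketch of Rule 3, reusing the rating data from the one-variable MLE example: every value in the domain starts with a pseudocount $\lambda$, so states that never appear in the data still receive a nonzero probability.

```python
from collections import Counter

def smoothed_distribution(observations, domain, lam=1.0):
    """Laplace-smoothed MLE: add pseudocount lam to every state, then normalize."""
    counts = Counter(observations)
    total = len(observations) + lam * len(domain)
    return {x: (counts[x] + lam) / total for x in domain}

ratings = [1, 3, 4, 4, 4, 4, 5, 5]  # same data as the MLE example above
p = smoothed_distribution(ratings, domain=[1, 2, 3, 4, 5], lam=1.0)
print(p[2])  # (0 + 1) / (8 + 5) ≈ 0.077 instead of a brittle 0.0
print(p[4])  # (4 + 1) / (8 + 5) ≈ 0.385
```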
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Assuming that independent variables in a v-structure remain independent when the child is observed. -> **Why it fails:** The "explaining away" phenomenon means that observing a common effect creates a probabilistic dependency between its otherwise independent causes. -> **Warning sign:** You condition your inference on a node with multiple incoming arrows, accidentally coupling parent variables you believed could be processed in parallel.
* **Pitfall:** Relying on raw Maximum Likelihood Estimation with small, sparse datasets. -> **Why it fails:** If a valid, real-world combination never appears in the limited training set, MLE assigns it a strict $0.0$ probability. Any future sequence containing that event multiplies to zero, crashing the inference. -> **Warning sign:** Absolute zero probabilities appear in your learned conditional probability tables.
* **Pitfall:** Trusting the semantic meaning of the labels produced by EM. -> **Why it fails:** EM discovers latent clusters in the data, but it has no concept of human semantics; it can only recover hidden variables up to a permutation of the labels. -> **Warning sign:** EM cleanly separates two movie types, but what a human would call "comedy" the model arbitrarily tags internally as "drama".
* **Pitfall:** Expecting EM to find the objectively best possible model. -> **Why it fails:** The algorithm is only guaranteed to increase the likelihood at each iteration and converge to a *local* maximum, not the *global* maximum. -> **Warning sign:** The model converges quickly but performs poorly, indicating it got trapped in a sub-optimal local peak due to an unlucky random initialization.

## 6. Key Quote / Core Insight

"To estimate a local conditional distribution, just ignore the other variables. You simply count the local occurrences and normalize. The math of Maximum Likelihood Estimation reduces perfectly to this intuitive action."

## 7. Additional Resources & References

* **Resource:** Expectation Maximization (EM) - **Type:** Academic Paper - **Relevance:** Referenced on-screen as (Dempster 1977); this is the foundational paper (Dempster, Laird & Rubin, 1977, "Maximum Likelihood from Incomplete Data via the EM Algorithm") establishing the EM algorithm used to learn parameters from partially observed and hidden data.