# Generative Learning Algorithms: Gaussian Discriminant Analysis and Naive Bayes

**Video Category:** Machine Learning Tutorial / Academic Lecture

## 📋 0. Video Metadata

* **Video Title:** Machine Learning Lecture 5 (inferred from content)
* **YouTube Channel:** Stanford
* **Publication Date:** Not shown in video
* **Video Duration:** ~1 hour 18 minutes

## 📝 1. Core Summary (TL;DR)

This lecture introduces Generative Learning Algorithms, which represent a fundamental shift from Discriminative Algorithms (like Logistic Regression) that attempt to find a boundary separating classes. Instead of learning $p(y|x)$ directly, Generative Algorithms learn $p(x|y)$ and $p(y)$ to build a distinct statistical model of what each class "looks like" in isolation. By using Bayes' rule to compute the posterior probability, these models (specifically Gaussian Discriminant Analysis for continuous data and Naive Bayes for discrete data) can be very computationally efficient and, when their strict underlying statistical assumptions hold, require significantly less training data to reach their best performance than their discriminative counterparts.

## 2. Core Concepts & Frameworks

* **Discriminative Learning Algorithms:**
  * **Meaning:** Algorithms that learn the conditional probability $p(y|x)$ directly, or learn a direct mapping function $h_\theta(x)$ from inputs $x$ to labels $y$ (e.g., Logistic Regression, Perceptron).
  * **Application:** Used to find a decision boundary that separates positive and negative examples in the feature space by maximizing the conditional likelihood.
* **Generative Learning Algorithms:**
  * **Meaning:** Algorithms that model the distribution of features for each class separately, learning $p(x|y)$ (what the features look like given the class) and $p(y)$ (the class prior probability).
  * **Application:** Used to build generative profiles of the classes. A new example is evaluated against all class profiles using Bayes' rule, $p(y=1|x) = \frac{p(x|y=1)p(y=1)}{p(x)}$, to determine which class it most closely matches.
* **Multivariate Gaussian (Normal) Distribution:**
  * **Meaning:** A generalization of the 1D normal distribution to $n$ dimensions, parameterized by a mean vector $\mu \in \mathbb{R}^n$ and a symmetric covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$. The probability density function is $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$ (see the sketch after this list).
  * **Application:** Serves as the foundational probability distribution for modeling continuous features in Gaussian Discriminant Analysis.
* **Gaussian Discriminant Analysis (GDA):**
  * **Meaning:** A generative model for continuous features assuming $x|y=0$ and $x|y=1$ are both Multivariate Gaussian, typically sharing the same covariance matrix $\Sigma$ but having different means ($\mu_0$, $\mu_1$), with the class label $y$ following a Bernoulli distribution with parameter $\phi$.
  * **Application:** Used for classification tasks with continuous numerical data where the features are roughly normally distributed.
* **Maximum Likelihood Estimation (MLE) for Generative Models:**
  * **Meaning:** Unlike discriminative models, which maximize the conditional likelihood $\prod_i p(y^{(i)}|x^{(i)})$, generative models fit their parameters by maximizing the joint likelihood of the data: $L(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^m p(x^{(i)}, y^{(i)})$.
  * **Application:** Used to derive the specific formulas for estimating $\mu_0, \mu_1, \Sigma$, and $\phi$ directly from empirical counts and averages of the training data.
* **The Naive Bayes Assumption (Conditional Independence):**
  * **Meaning:** The assumption that, given the class label $y$, all features $x_j$ are conditionally independent of one another. Mathematically: $p(x_1, x_2, \ldots, x_n|y) = \prod_{j=1}^n p(x_j|y)$.
  * **Application:** Makes modeling high-dimensional discrete data (like text vocabularies) computationally tractable by reducing the number of required parameters from $O(2^n)$ to $O(n)$.
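To make the multivariate Gaussian density formula above concrete, here is a minimal NumPy sketch (not taken from the lecture; the function name `multivariate_gaussian_pdf` is ours) that evaluates $p(x; \mu, \Sigma)$ literally from its definition:

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, sigma):
    """Evaluate the density of N(mu, sigma) at x, directly from the formula above."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(sigma)))
    exponent = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return norm_const * np.exp(exponent)

# Standard 2D Gaussian (mu = 0, Sigma = I) evaluated at its peak, the origin:
mu, sigma = np.zeros(2), np.eye(2)
print(multivariate_gaussian_pdf(np.zeros(2), mu, sigma))            # 1/(2*pi) ~ 0.159
# Shrinking the covariance (Sigma = 0.6 I) makes the peak taller, as in the
# contour-plot demonstrations summarized in Section 3:
print(multivariate_gaussian_pdf(np.zeros(2), mu, 0.6 * np.eye(2)))  # ~0.265
```

In practice one would likely reach for `scipy.stats.multivariate_normal` instead, but spelling the formula out keeps the correspondence with the notation above explicit.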
## 3. Evidence & Examples (Hyper-Specific Details)

* **Tumor Classification (Discriminative vs. Generative Strategy):**
  * *Discriminative (Logistic Regression):* Plots benign (O) and malignant (X) tumors on a 2D plane and uses gradient descent to search for a straight line that separates the two sets.
  * *Generative (GDA):* Looks only at the malignant tumors to build one model (an ellipse containing the Xs), then looks only at the benign tumors to build a second model (an ellipse containing the Os). A new test point is evaluated to see which ellipse it matches better.
* **Visualizing Multivariate Gaussian Parameters:** Demonstrates how changing $\mu$ and $\Sigma$ alters the probability density $p(x)$ (shown as 3D bumps and 2D contour plots):
  * *Standard Gaussian:* $\Sigma = I$ (identity matrix), $\mu = [0, 0]^T$. Results in a perfectly symmetric, round, bell-shaped bump centered at the origin.
  * *Shrinking variance:* $\Sigma = 0.6I$. The bump becomes taller and narrower, because the volume under the density must still integrate to 1.
  * *Increasing variance:* $\Sigma = 2I$. The bump becomes wider and flatter.
  * *Positive correlation:* Off-diagonal elements set $>0$, e.g., $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$ and $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$. The contour ellipses become slanted, indicating that $x_1$ and $x_2$ tend to increase together. The higher the off-diagonal value, the thinner and more slanted the ellipse.
  * *Negative correlation:* Off-diagonal elements set $<0$, e.g., $\Sigma = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix}$. The ellipses slant in the opposite direction.
  * *Shifting the mean:* Changing $\mu$ (e.g., to $[1.5, 1.5]^T$ or $[-1.5, -1]^T$) shifts the center of the peak on the 2D plane without altering its shape.
* **GDA Parameter Estimation Formulas:** The lecture derives the exact MLE formulas for the GDA parameters from a training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ (implemented in the NumPy sketch at the end of this section):
  * $\phi = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}$ (fraction of positive examples).
  * $\mu_0 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 0\}}$ (mean vector of the negative examples).
  * $\mu_1 = \frac{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m \mathbb{1}\{y^{(i)} = 1\}}$ (mean vector of the positive examples).
  * $\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$ (covariance matrix averaged across all examples, each measured against its own class mean).
* **GDA vs. Logistic Regression Decision Boundary Plot:** A scatter plot shows both algorithms fit to the same data. The GDA decision boundary is the set of points where $p(y=1|x) = p(y=0|x)$ (equivalently, $p(y=1|x) = 0.5$). Because GDA assumes the same covariance matrix $\Sigma$ for both classes, the resulting decision boundary is strictly linear. The plot shows that the GDA line and the Logistic Regression line are slightly different.
* **Email Spam Filter Representation (Naive Bayes):** To map an email to a feature vector $x$, a dictionary of $n$ words (e.g., $n = 10{,}000$) is created:
  * Word 1: "a"
  * Word 2: "aardvark"
  * Word 3: "aardwolf"
  * ... Word 5,000: "buy" ...
  * Word 10,000: "zymurgy" (the chemistry of fermentation/brewing).
  * $x \in \{0, 1\}^{10000}$, where $x_j = 1$ if word $j$ appears in the email, and $0$ if it does not.
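As a companion to the estimation formulas above, here is a minimal NumPy sketch (our own illustration, not code from the lecture; the function names `fit_gda` and `predict_gda` are assumptions) that computes $\phi, \mu_0, \mu_1, \Sigma$ exactly as listed and predicts via $\arg\max_y p(x|y)p(y)$ with the shared covariance matrix:

```python
import numpy as np

def fit_gda(X, y):
    """MLE for GDA using the formulas above. X: (m, n) features, y: (m,) 0/1 labels."""
    m, _ = X.shape
    phi = np.mean(y == 1)                      # fraction of positive examples
    mu0 = X[y == 0].mean(axis=0)               # mean of the negative examples
    mu1 = X[y == 1].mean(axis=0)               # mean of the positive examples
    class_means = np.where((y == 1)[:, None], mu1, mu0)  # each example's own class mean
    diffs = X - class_means
    sigma = diffs.T @ diffs / m                # shared covariance matrix
    return phi, mu0, mu1, sigma

def predict_gda(X, phi, mu0, mu1, sigma):
    """Predict argmax_y p(x|y) p(y). The Gaussian normalization constant and p(x)
    are identical for both classes (shared Sigma), so they are omitted."""
    inv = np.linalg.inv(sigma)
    def log_score(mu, prior):
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, inv, d) + np.log(prior)
    return (log_score(mu1, phi) > log_score(mu0, 1 - phi)).astype(int)
```

Because the quadratic term $x^T\Sigma^{-1}x$ appears in both class scores and cancels, the comparison reduces to a linear function of $x$, which is why the GDA boundary in the plot described above is a straight line.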
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Choose generative models for data efficiency when the assumptions hold.** Implement GDA when you have a small training set *and* you are confident the continuous features are roughly normally distributed. Under these conditions, generative models need less data than discriminative models to converge to their best performance.
* **Rule 2: Choose discriminative models for robustness.** Implement Logistic Regression when you have abundant data or when you suspect the underlying distribution is decidedly non-Gaussian. Discriminative models make fewer assumptions and will generally match or beat generative models asymptotically as the data set grows.
* **Rule 3: Maximize the joint likelihood for generative models.** When training a generative algorithm, do not maximize the conditional likelihood $\prod_i p(y^{(i)}|x^{(i)})$; maximize the joint likelihood $\prod_i p(x^{(i)}, y^{(i)})$ to correctly fit the parameters that define the class distributions.
* **Rule 4: Use a shared covariance matrix to get a linear boundary.** When implementing GDA, use the same covariance matrix $\Sigma$ for both $p(x|y=0)$ and $p(x|y=1)$. With separate covariance matrices ($\Sigma_0$, $\Sigma_1$), the decision boundary becomes quadratic (parabolic/elliptical) rather than linear.
* **Rule 5: Apply the conditional independence assumption to discrete data.** When facing high-dimensional discrete features (like a 10,000-word dictionary), apply the Naive Bayes assumption to compute $p(x|y) = \prod_j p(x_j|y)$. Modeling the full joint distribution without this assumption requires $2^{10000} - 1$ parameters, which is impossible to compute or store.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall: Using GDA on non-Gaussian data.**
  * **Why it fails:** GDA builds in the assumption that the data follows a Gaussian distribution. If the data is actually Poisson-distributed or highly skewed, the algorithm forces a Gaussian fit anyway, resulting in inaccurate decision boundaries.
  * **Warning sign:** Poor classification performance on a dataset where Logistic Regression performs well.
* **Pitfall: Trying to compute the full joint distribution of discrete features.**
  * **Why it fails:** For a binary feature vector of length $n$, modeling the true joint distribution requires $2^n - 1$ parameters. For a 10,000-word dictionary, $2^{10000}$ is vastly larger than the number of atoms in the observable universe, making the computation impossible.
  * **Warning sign:** Out-of-memory errors or intractable formulations when handling text data without the Naive Bayes assumption.
* **Pitfall: Computing the denominator in Bayes' rule for pure prediction.**
  * **Why it fails:** When making a prediction via $\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)p(y)}{p(x)}$, computing $p(x)$ is extra work that is unnecessary, because $p(x)$ is constant with respect to $y$.
  * **Warning sign:** Wasted computation. Simply compute $\arg\max_y p(x|y)p(y)$, as in the Naive Bayes sketch below.
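Pulling Rules 3 and 5 and the last pitfall together, here is a minimal sketch (our own illustration, not code from the lecture) of the Naive Bayes spam model with binary word features: fit $p(y)$ and each $p(x_j|y)$ by counting, then predict with $\arg\max_y p(y)\prod_j p(x_j|y)$ in log space, never touching $p(x)$:

```python
import numpy as np

def email_to_features(email_words, vocabulary):
    """x_j = 1 iff dictionary word j appears in the email (the {0,1}^n encoding above)."""
    present = set(email_words)
    return np.array([1 if word in present else 0 for word in vocabulary])

def fit_naive_bayes(X, y):
    """MLE under the conditional independence assumption: p(x|y) = prod_j p(x_j|y)."""
    phi_y = np.mean(y == 1)                  # p(y = 1)
    phi_given_1 = X[y == 1].mean(axis=0)     # p(x_j = 1 | y = 1), one value per word
    phi_given_0 = X[y == 0].mean(axis=0)     # p(x_j = 1 | y = 0)
    return phi_y, phi_given_0, phi_given_1

def predict_naive_bayes(x, phi_y, phi_given_0, phi_given_1):
    """argmax_y p(x|y) p(y); p(x) is skipped because it is constant in y.
    Assumes no estimated probability is exactly 0 or 1; real implementations smooth."""
    def log_score(phi_j, prior):
        return np.sum(x * np.log(phi_j) + (1 - x) * np.log(1 - phi_j)) + np.log(prior)
    return int(log_score(phi_given_1, phi_y) > log_score(phi_given_0, 1 - phi_y))

# Hypothetical usage with a tiny stand-in for the 10,000-word dictionary:
vocab = ["a", "aardvark", "buy", "zymurgy"]
x = email_to_features(["buy", "now"], vocab)   # -> array([0, 0, 1, 0])
```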
## 6. Key Quote / Core Insight

"If you make stronger modeling assumptions, and if your modeling assumptions are roughly correct, then your model will do better, because you're telling the algorithm more information about the world. But if your assumptions are wrong—if your data is decidedly not Gaussian—then GDA will do poorly, while a robust model like Logistic Regression will adapt and still perform well."

## 7. Additional Resources & References

* **Resource:** CS228 (Probabilistic Graphical Models)
  * **Type:** Stanford Course
  * **Relevance:** Mentioned by the instructor as the course that covers the conditional independence assumptions (as used in Naive Bayes) in much greater formal detail using graphical models.
* **Resource:** Lecture Notes
  * **Type:** Course Material
  * **Relevance:** Referenced multiple times for the step-by-step mathematical proofs of the MLE parameter derivations and the transformation of the multivariate covariance formulas.