# Unsupervised Learning: Factor Analysis, PCA, and ICA
**Video Category:** Machine Learning / Computer Science Lecture
## 0. Video Metadata
**Video Title:** CS229 Lecture 18
**YouTube Channel:** Stanford ENGINEERING
**Publication Date:** Not shown in video
**Video Duration:** ~123 minutes
## 1. Core Summary (TL;DR)
This lecture concludes the module on Unsupervised Learning by exploring three foundational techniques for dimensionality reduction and latent variable modeling. It provides a conceptual wrap-up of Factor Analysis, demonstrating how it can be interpreted as a series of independent linear regression problems in a latent space. The core of the lecture then contrasts Principal Component Analysis (PCA), a non-probabilistic variance-maximization technique, with Independent Component Analysis (ICA), a probabilistic model designed to solve the "cocktail party problem" by recovering independent, non-Gaussian source signals from mixed observations.
## 2. Core Concepts & Frameworks
* **Concept:** Factor Analysis (FA) as Linear Regression -> **Meaning:** A probabilistic dimensionality reduction model that maps a low-dimensional latent variable $z$ to a high-dimensional observation $x$ with anisotropic noise. -> **Application:** The model $x = \mu + Lz + \epsilon$ can be viewed as $d$ independent linear regression problems running in parallel, where the latent variables $z$ act as the "design matrix," $L$ acts as the parameters $\theta$, and the diagonal elements of $\Psi$ act as the independent noise variance for each feature dimension (a minimal generative sketch of this model follows this list).
* **Concept:** Principal Component Analysis (PCA) -> **Meaning:** A non-probabilistic, linear dimensionality reduction algorithm that finds an orthogonal subspace (spanned by principal components) such that when data is projected onto this subspace, the variance of the projected data is maximized (or equivalently, the projection error is minimized). -> **Application:** Used for visualizing high-dimensional data, reducing the computational cost of downstream supervised learning tasks, and removing correlated noise by retaining only the eigenvectors associated with the largest eigenvalues of the sample covariance matrix.
* **Concept:** Independent Component Analysis (ICA) -> **Meaning:** A probabilistic model designed to recover unobserved, independent source signals from a set of observed linear mixtures ($x = As$, where $s$ are sources, $A$ is the mixing matrix, and $x$ are observations). -> **Application:** Solves the "Cocktail Party Problem" (blind source separation) by finding an unmixing matrix $W$ ($W \approx A^{-1}$) such that $s = Wx$, heavily relying on the assumption that the original sources are statistically independent and distinctly non-Gaussian.
* **Concept:** The Jacobian in Probability Transformations -> **Meaning:** When a random variable is transformed by a function (e.g., $s = Wx$), the probability density function of the new variable must be scaled by the absolute determinant of the Jacobian so that it still integrates to 1. -> **Application:** In ICA, the likelihood of the observed mixed data $x$ is written in terms of the source prior: $p_x(x) = p_s(Wx) \cdot |\det W|$. The $|\det W|$ term prevents the algorithm from trivially driving $W$ toward zero just to maximize $p_s(Wx)$ (see the likelihood sketch after this list).
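To make the regression view of Factor Analysis concrete, here is a minimal generative sketch of $x = \mu + Lz + \epsilon$ in NumPy. The dimensions, parameter values, and the final covariance check are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 5000                      # observed dim, latent dim, sample count (illustrative)

mu  = rng.normal(size=d)                  # mean of the observations
L   = rng.normal(size=(d, k))             # factor loadings: one row of "regression parameters" per feature
psi = 0.1 * np.abs(rng.normal(size=d))    # diagonal of Psi: independent noise variance per feature

# Generative model: z ~ N(0, I_k), eps ~ N(0, diag(psi)), x = mu + L z + eps
Z   = rng.normal(size=(n, k))             # the unobserved latent "design matrix"
eps = rng.normal(size=(n, d)) * np.sqrt(psi)
X   = mu + Z @ L.T + eps                  # each of the d columns is one parallel regression output

# Sanity check: the implied covariance of x is L L^T + Psi
print(np.abs(np.cov(X, rowvar=False) - (L @ L.T + np.diag(psi))).max())
```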
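And a minimal numerical sketch of the change-of-variables likelihood used in ICA, assuming a standard logistic density for the sources (the toy observation and matrices below are made up for illustration): the $\log|\det W|$ term is what stops a near-zero $W$ from winning.

```python
import numpy as np

def logistic_density(s):
    """Standard logistic pdf: g(s) * (1 - g(s)), with g the sigmoid."""
    g = 1.0 / (1.0 + np.exp(-s))
    return g * (1.0 - g)

def ica_log_likelihood(W, x):
    """log p_x(x) = sum_j log p_s(w_j^T x) + log|det W|  (change of variables)."""
    s = W @ x
    return np.sum(np.log(logistic_density(s))) + np.log(np.abs(np.linalg.det(W)))

x = np.array([0.3, -1.2])                     # one mixed observation (toy values)
W_good  = np.array([[2.0, 1.0], [0.5, 1.5]])  # a sensible unmixing candidate
W_small = 1e-8 * np.eye(2)                    # nearly-zero W: p_s(Wx) is as large as it can be ...

print(ica_log_likelihood(W_good, x))          # moderate log-likelihood
print(ica_log_likelihood(W_small, x))         # ... but log|det W| makes the total collapse
```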
## 3. Evidence & Examples (Hyper-Specific Details)
* **Factor Analysis Matrix Visualized:** The instructor draws a clear visual mapping comparing standard linear regression to Factor Analysis. In linear regression: $Y \approx X\theta$ (where $Y$ is a column vector, $X$ is a design matrix, $\theta$ is a parameter vector). In Factor Analysis: $X_c \approx Z L^T$, where $X_c \in \mathbb{R}^{n \times d}$ stacks the centered observations $(x^{(i)} - \mu)$ as rows, the design matrix is the unobserved $Z \in \mathbb{R}^{n \times k}$, the parameters are the rows of $L \in \mathbb{R}^{d \times k}$, and we are simultaneously fitting $d$ different output features with independent Gaussian noise variances $\Psi_{ii}$.
* **PCA Variance Maximization Intuition:** A 2D scatter plot is drawn showing a highly correlated dataset (skill vs. enjoyment of flying helicopters). Projecting these points onto a line drawn through the main axis of the data (the first principal component) results in points that are highly spread out (maximum variance). Projecting them onto a perpendicular line results in points clustered tightly together (low variance), visually showing that the direction of maximum variance captures the most meaningful structure.
* **ICA Audio Demonstration (Cocktail Party Problem):** The instructor plays audio clips to demonstrate the problem setup.
* *Mixed Audio:* Two clips are played where two different speakers (one English, one Spanish) are talking simultaneously, recorded by two different microphones at different distances. Both clips sound like an unintelligible mixture.
* *Separated Audio:* After running the ICA algorithm, the recovered source signals are played. One clip contains almost entirely the English speaker, and the other contains almost entirely the Spanish speaker, demonstrating the algorithm's ability to blindly unmix the signals without knowing the original audio or microphone placements.
* **Gaussian Ambiguity Scatter Plots:** To show why ICA requires non-Gaussian sources, a slide shows scatter plots of simulated data (a small simulation sketch follows this list).
* *Gaussian Sources:* Two independent Gaussian sources form a perfect circle. When multiplied by a mixing matrix, they form a tilted ellipse. There are infinitely many unmixing matrices (rotations) that can map that ellipse back to a circle, making it impossible to identify the *true* original axes.
* *Non-Gaussian Sources:* Two independent Laplace or Logistic distributions form a sharp "diamond" or "square" shape. When mixed, they form a skewed parallelogram. Because the original shape had distinct sharp corners (due to heavy tails), the ICA algorithm can specifically rotate the space to align with those exact corners, uniquely recovering the original axes.
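A small simulation in the spirit of those slides, assuming Laplace sources for the non-Gaussian case and an arbitrary mixing matrix; it generates the two kinds of scatter data and reports the excess kurtosis that separates them (values and seed are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
A = np.array([[2.0, 1.0], [1.0, 1.0]])   # arbitrary mixing matrix (illustrative)

s_gauss   = rng.normal(size=(2, n))      # independent Gaussians: circular joint density
s_laplace = rng.laplace(size=(2, n))     # independent Laplace: sharp "diamond" joint density

x_gauss   = A @ s_gauss                  # circle -> tilted ellipse (still rotationally ambiguous)
x_laplace = A @ s_laplace                # diamond -> skewed parallelogram (corners identify the axes)

def excess_kurtosis(v):
    """Fourth standardized moment minus 3: ~0 for Gaussian data, > 0 for heavy tails."""
    v = (v - v.mean()) / v.std()
    return (v ** 4).mean() - 3.0

print([round(excess_kurtosis(row), 2) for row in x_gauss])    # roughly 0 in both mixtures
print([round(excess_kurtosis(row), 2) for row in x_laplace])  # clearly positive in both mixtures
```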
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Standardize features before running PCA.** -> **Mechanism:** PCA maximizes variance. If one feature is measured in millimeters (large numerical variance) and another in kilometers (small numerical variance), PCA will blindly align the principal component with the millimeter feature. -> **Result:** To find true structural correlation rather than just scale differences, set $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{\sigma_j}$ so all features have mean 0 and variance 1 before computing the covariance matrix.
* **Rule 2: Determine PCA dimensions using the variance ratio.** -> **Mechanism:** Compute the eigenvalues ($\lambda_1, \lambda_2, ... \lambda_d$) of the covariance matrix. Sort them in descending order. -> **Result:** Choose the number of components $k$ such that the ratio $\frac{\sum_{i=1}^k \lambda_i}{\sum_{i=1}^d \lambda_i}$ meets a desired threshold, typically 95% or 99%, ensuring you retain the vast majority of the data's structural information while dropping noise dimensions.
* **Rule 3: Use SVD for efficient PCA computation.** -> **Mechanism:** Instead of calculating the dense $d \times d$ covariance matrix $\Sigma = \frac{1}{n}X^TX$ and finding its eigenvectors, perform Singular Value Decomposition (SVD) directly on the centered data matrix $X$. -> **Result:** This avoids explicitly forming the potentially massive covariance matrix, yielding the principal components faster and with better numerical stability (see the PCA sketch after this list).
* **Rule 4: Apply ICA only when sources are distinctly non-Gaussian.** -> **Mechanism:** ICA relies on finding the unique "corners" or non-rotational symmetries in the joint distribution of the data to find the true unmixing axes. -> **Result:** If the underlying sources resemble a Gaussian distribution (e.g., due to the Central Limit Theorem applying to the mixture), ICA will fail to find a unique solution. Use the Logistic or Laplace probability density function as the assumed prior $P(s)$ for optimization (see the ICA update sketch after this list).
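A minimal NumPy sketch combining Rules 1-3: standardize, take the SVD of the standardized data, and pick $k$ by the retained-variance ratio. The random toy data and the 95% threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20)) @ rng.normal(size=(20, 20))   # toy correlated data (illustrative)

# Rule 1: standardize every feature to mean 0, variance 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Rule 3: SVD of the data matrix instead of eigendecomposing the d x d covariance
U, sing_vals, Vt = np.linalg.svd(X_std, full_matrices=False)
eigvals = sing_vals ** 2 / X_std.shape[0]        # eigenvalues of the sample covariance, descending

# Rule 2: smallest k whose retained-variance ratio reaches the threshold
ratio = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(ratio, 0.95)) + 1

components = Vt[:k]                              # top-k principal components (one per row)
X_reduced  = X_std @ components.T                # k-dimensional projection of the data
print(k, X_reduced.shape)
```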
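And a hedged sketch of Rule 4: the standard maximum-likelihood ICA update when the sigmoid is taken as the source CDF (logistic prior), $W \leftarrow W + \alpha\,[(1 - 2g(Wx))x^\top + (W^\top)^{-1}]$. The learning rate, toy data, and number of passes are assumptions, and this is the textbook stochastic-gradient form rather than the exact code used in the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_sgd(X, alpha=0.01, passes=20, seed=0):
    """Maximum-likelihood ICA with a logistic source prior (sigmoid as source CDF).
    X holds one mixed observation per row; returns the unmixing matrix W."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.eye(d)
    for _ in range(passes):
        for i in rng.permutation(n):
            x = X[i]                                            # one observation, shape (d,)
            g = sigmoid(W @ x)
            # Stochastic gradient of  sum_j log p_s(w_j^T x) + log|det W|
            W += alpha * (np.outer(1.0 - 2.0 * g, x) + np.linalg.inv(W.T))
    return W

# Toy demo: mix two Laplace (non-Gaussian) sources, then try to unmix them
rng = np.random.default_rng(3)
S = rng.laplace(size=(2000, 2))
A = np.array([[1.0, 2.0], [0.5, 2.0]])            # unknown mixing matrix (illustrative)
X = S @ A.T                                       # observed mixtures, x = A s
W = ica_sgd(X)
print(W @ A)  # ideally close to a scaled permutation matrix, reflecting ICA's inherent ambiguities
```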
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Using PCA to separate independent signals (like audio tracks). -> **Why it fails:** PCA only enforces that the derived components are geometrically orthogonal to each other and capture maximum variance. It does not enforce statistical independence. -> **Warning sign:** Applying PCA to the Cocktail Party audio mixtures will return orthogonal vectors, but both output tracks will still sound like a garbled mixture of both speakers.
* **Pitfall:** Attempting ICA on Gaussian-distributed source data. -> **Why it fails:** The joint distribution of independent Gaussian variables is rotationally symmetric (a sphere/circle). Any rotation of the space looks identical to any other rotation. -> **Warning sign:** The algorithm converges, but the resulting "separated" signals are still arbitrary mixtures of the original sources, because the objective function provides no gradient to prefer one rotation over another.
* **Pitfall:** Expecting ICA to recover the exact amplitude or order of the original sources. -> **Why it fails:** The model $x = As$ has inherent ambiguities. If you scale a source $s_1$ by 2, and scale the corresponding column of mixing matrix $A$ by 0.5, the observation $x$ remains exactly the same. Similarly, swapping the order of sources and columns in $A$ leaves $x$ unchanged. -> **Warning sign:** The recovered audio tracks might be swapped (Speaker 2 is output as track 1) and the volume levels may be entirely different from reality. (Note: In practice, this ambiguity is harmless for simply separating the signals).
## 6. Key Quote / Core Insight
"If your data comes from a Gaussian distribution, the level curves are perfect circles. If I take the data and rotate it, the distribution looks exactly the same. Because of this rotational ambiguity, if your original sources are Gaussian, there is no way to recover the true mixing matrix. ICA only works because real-world signals, like human speech, are heavily non-Gaussian."
## 7. Additional Resources & References
No external resources explicitly mentioned in this video.