# Image Classification: Data-Driven Approaches, K-NN, and Linear Classifiers

**Video Category:** Computer Vision Tutorial

## 📋 0. Video Metadata

**Video Title:** Stanford CS231n Image Classification with Linear Classifiers
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~1 hour 25 minutes

## 📝 1. Core Summary (TL;DR)

This video introduces the foundational problem of image classification in computer vision, highlighting the "semantic gap" between the raw pixel values computers process and the high-level concepts humans perceive. To bridge this gap, the lecture advocates abandoning explicit rule-based programming in favor of a data-driven approach, where models learn from massive labeled datasets. It details two foundational algorithms: the non-parametric K-Nearest Neighbor (K-NN) classifier and the parametric Linear Classifier, explaining their mechanisms, mathematical underpinnings, distance metrics, and the critical process of hyperparameter tuning using validation sets.

## 2. Core Concepts & Frameworks

* **The Semantic Gap:**
  * **Meaning:** The disconnect between human perception of an image (e.g., "a cat") and the computer's representation (a massive 3D tensor of integer pixel intensities).
  * **Application:** Highlights why traditional rule-based programming (like edge detection) fails for object recognition; slight changes in illumination, viewpoint, or pose completely alter the pixel values while leaving the semantic meaning intact.
* **The Data-Driven Approach:**
  * **Meaning:** A paradigm where, instead of writing explicit code to identify objects, developers provide a machine learning algorithm with a massive dataset of examples (images and corresponding labels) to learn the underlying patterns automatically.
  * **Application:** This is the standard methodology for modern AI. It involves three steps: collect data, train a model, and evaluate the model on unseen test data.
* **K-Nearest Neighbors (K-NN) Classifier:**
  * **Meaning:** A simple, non-parametric algorithm that memorizes the entire training dataset. During prediction, it finds the 'K' most similar training images (using a specified distance metric) to the new test image and assigns the label based on a majority vote among those neighbors.
  * **Application:** Used as a baseline for classification, though rarely used for images in practice due to computational inefficiency at test time and the poor perceptual meaning of pixel-wise distance metrics.
* **Distance Metrics (L1 vs. L2):**
  * **Meaning:** Mathematical formulas used to quantify the "difference" between two images. L1 (Manhattan distance) sums the absolute differences of individual pixels. L2 (Euclidean distance) takes the square root of the sum of squared differences.
  * **Application:** The choice of metric is a hyperparameter. L1 is coordinate-dependent and preferred when individual features have specific, distinct meanings. L2 is rotationally invariant and preferred when features are arbitrary or generic.
* **Parametric Models (Linear Classifiers):**
  * **Meaning:** An approach that summarizes the knowledge from training data into a set of learned parameters (a weight matrix $W$ and bias vector $b$). The scoring function is defined as $f(x, W) = Wx + b$.
  * **Application:** Unlike K-NN, parametric models discard the training data after learning. Prediction is extremely fast (just a matrix multiplication), making them the building blocks for modern neural networks.
* **Softmax Classifier (Multinomial Logistic Regression):**
  * **Meaning:** An algorithm that takes the raw, uncalibrated scores output by a linear classifier and transforms them into a normalized probability distribution across all classes.
  * **Application:** It uses the function $P(Y=k|X=x) = e^{s_k} / \sum_j e^{s_j}$. This allows the model to output a confidence level (between 0 and 1) for each class, rather than just arbitrary real numbers.
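The K-NN classifier and the two distance metrics described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's code; it assumes each image has been flattened into a row vector:

```python
import numpy as np

def l1_distance(A, x):
    # L1 (Manhattan): sum of absolute element-wise differences to each row of A
    return np.abs(A - x).sum(axis=1)

def l2_distance(A, x):
    # L2 (Euclidean): square root of the sum of squared differences
    return np.sqrt(((A - x) ** 2).sum(axis=1))

class KNearestNeighbor:
    """Memorize the training set; predict by majority vote over the K closest points."""

    def __init__(self, k=3, distance=l1_distance):
        self.k = k
        self.distance = distance

    def train(self, X, y):
        # "Training" is O(1): just store the data
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        preds = np.empty(len(X), dtype=self.y_train.dtype)
        for i, x in enumerate(X):
            dists = self.distance(self.X_train, x)         # distance to every training point
            nearest = np.argsort(dists)[: self.k]          # indices of the K closest
            labels, counts = np.unique(self.y_train[nearest], return_counts=True)
            preds[i] = labels[np.argmax(counts)]           # majority vote
        return preds
```

Note the asymmetry the lecture warns about: `train` only stores references, while `predict` compares each query against the entire training set, which is why K-NN inference cost grows linearly with the dataset.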
* **Cross-Entropy Loss (Negative Log-Likelihood):**
  * **Meaning:** A loss function used with the Softmax classifier that quantifies how "unhappy" the model is. It measures the discrepancy between the predicted probability distribution and the true label.
  * **Application:** The loss is calculated as $L_i = -\log(P(Y=y_i | X=x_i))$. The goal of training is to minimize this loss, thereby maximizing the probability assigned to the correct class.

## 3. Evidence & Examples (Hyper-Specific Details)

* **Visualizing the Semantic Gap:** An image of a cat is displayed alongside its representation as an $800 \times 600 \times 3$ tensor. The speaker notes that moving the camera slightly (viewpoint variation) changes every single pixel value in that tensor, creating a completely new data point from the computer's perspective.
* **Challenges in Image Classification Examples:**
  * *Illumination:* Four images of cats in different lighting (dark room vs. bright sunlight) are shown. Pixel values change drastically based on light sources, yet the label remains "cat."
  * *Deformation:* Images of cats stretched out, sitting like humans, or contorted. The algorithm must recognize the object despite non-rigid transformations.
  * *Intraclass Variation:* A photo containing kittens of various colors, breeds, and sizes, demonstrating that a single class label encompasses massive visual diversity.
  * *Context / Background Clutter:* An image of a cat camouflaged against leaves, and an image of a dog painted with tiger stripes. The latter demonstrates that algorithms must understand context (e.g., a zoo enclosure vs. a living room) to avoid misclassification based on local textures.
* **Early Computer Vision Attempts:** The video references John Canny's 1986 paper on edge detection. Early attempts tried to extract edges, find corners, and explicitly write rules (e.g., "if 3 lines intersect, it's a corner"). This approach was too brittle to handle real-world variations.
* **L1 Distance Calculation Demonstration:** A slide shows a $4\times4$ test image patch subtracted from a $4\times4$ training image patch. The absolute differences are calculated element-wise (e.g., $|56 - 10| = 46$, $|32 - 20| = 12$) and summed to produce a final scalar distance of $456$.
* **K-NN Decision Boundary Visualization:** A 2D plot with 5 classes (represented by colors) shows how 1-Nearest Neighbor creates "islands" around noisy data points (e.g., a single yellow dot deep inside a green region creates a yellow decision-boundary pocket). Changing to $K=3$ or $K=5$ smooths out these boundaries, illustrating that higher $K$ values provide better generalization and robustness to outliers.
* **Failure of Pixel-Based Distance:** A visual comparison shows an original image of a woman and three modified versions: one with the eyes blacked out (occlusion), one shifted by 1 pixel, and one with a blue tint. The L2 pixel distance between the original and all three modified images is identical, showing that pixel distance does not correlate with human perceptual similarity.
* **Hyperparameter Tuning Visualization (Cross-Validation):** A graph plots 5-fold cross-validation accuracy on the y-axis against various values of $K$ on the x-axis for a K-NN model. The plot shows a peak in accuracy around $K \approx 7$, demonstrating the empirical necessity of testing different hyperparameters.
* **Linear Classifier Template Visualization:** A slide visualizes the learned weight matrix $W$ for a linear classifier trained on the CIFAR-10 dataset. The row of weights for the "horse" class, when reshaped into an image, looks like a blurry horse with two heads (facing left and right). This shows that a linear classifier learns a single, averaged template per class and struggles to represent multiple distinct modes of appearance.
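The parametric scoring function $f(x, W) = Wx + b$ behind the template visualization can be sketched directly. This is a hedged illustration with randomly initialized weights (the shapes match CIFAR-10, but the values are placeholders, not learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 10        # e.g. CIFAR-10's ten categories
dim = 32 * 32 * 3       # a 32x32 RGB image flattened to 3072 numbers

W = rng.normal(scale=1e-3, size=(num_classes, dim))  # one row ("template") per class
b = np.zeros(num_classes)                            # one bias per class

def scores(x, W, b):
    # f(x, W) = Wx + b: all class scores from a single matrix multiply
    return W @ x + b

x = rng.random(dim)                  # stand-in for a flattened test image
s = scores(x, W, b)                  # ten raw class scores
predicted_class = int(np.argmax(s))  # highest score wins
```

Each score is just the inner product of the image with one row of $W$, which is why reshaping a row back to $32\times32\times3$ yields the blurry averaged "template" the lecture visualizes.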
* **Softmax Math Walkthrough:** Given raw scores for an image (Cat: 3.2, Car: 5.1, Frog: -1.7), the Softmax function exponentiates them ($e^{3.2} \approx 24.5$, $e^{5.1} \approx 164.0$, $e^{-1.7} \approx 0.18$). These unnormalized values are then divided by their sum ($188.68$) to get valid probabilities (Cat: 0.13, Car: 0.87, Frog: 0.00).

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Never hardcode recognition rules.** Adopt a data-driven approach: collect a representative dataset of labeled examples, train a model to map inputs to outputs, and evaluate it on unseen data.
* **Rule 2: Split data into Train, Validation, and Test sets.** Use the training set to optimize model parameters (e.g., weights $W$). Use the validation set to select the best hyperparameters (e.g., $K$, distance metric). Keep the test set locked away until the very end to get an honest estimate of real-world performance.
* **Rule 3: Use cross-validation for small datasets.** If your dataset is small, split the training data into multiple "folds." Train on $N-1$ folds and validate on the remaining fold. Rotate the validation fold and average the results to get a robust evaluation of hyperparameters without wasting data.
* **Rule 4: Select distance metrics based on data characteristics.** Use L1 (Manhattan) distance if the individual features (axes) in your data vector have specific, independent meanings. Use L2 (Euclidean) distance if the features are arbitrary and rotation of the coordinate system should not affect the distance.
* **Rule 5: Use a "sanity check" for Softmax loss at initialization.** When initializing a model with $C$ classes using small random weights, all classes should have roughly equal probability ($1/C$), so the initial loss should be $-\log(1/C)$. For a 10-class problem like CIFAR-10, check that your initial loss is $\approx 2.3$. If it isn't, there is a bug in your initialization or loss calculation.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall: Tuning hyperparameters on the test set.**
  * **Why it fails:** The model effectively "learns" the quirks of the test data through the human's hyperparameter selections.
  * **Warning sign:** The model achieves high accuracy on the test set but performs poorly when deployed in the real world on truly novel data.
* **Pitfall: Using K-Nearest Neighbors for image classification in production.**
  * **Why it fails:** K-NN has $O(1)$ training time but $O(N)$ prediction time, requiring comparison against the entire dataset for every new query. Furthermore, it scales terribly with high dimensions (the "curse of dimensionality").
  * **Warning sign:** Inference (prediction) is slow and memory-intensive, growing linearly as the dataset grows.
* **Pitfall: Relying on pixel-wise distance metrics (L1/L2) for images.**
  * **Why it fails:** Pixel distances do not capture semantic meaning. A 1-pixel shift of an image completely changes the L2 distance but looks identical to a human.
  * **Warning sign:** The classifier categorizes images based on background color or generic layout rather than the actual object present.
* **Pitfall: Using a linear classifier for highly multimodal data.**
  * **Why it fails:** A linear classifier draws a single hyperplane in high-dimensional space and can only learn one "template" per class. It cannot separate classes whose data points fall into multiple distinct, separated clusters (e.g., horses looking left vs. horses looking right).
  * **Warning sign:** The learned weights look like a blurry mix of different sub-categories, resulting in low accuracy for classes with high visual diversity.

## 6. Key Quote / Core Insight

"Unlike sorting a list of numbers, there is no obvious way to hardcode the algorithm for recognizing a cat. The solution is the data-driven approach: we collect a dataset of images and labels, and use machine learning algorithms to train a classifier."
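The Softmax math walkthrough and the Rule 5 initialization sanity check above can be reproduced numerically. A minimal sketch (the max-subtraction trick is a standard stability measure, not something shown in the lecture):

```python
import numpy as np

def softmax(s):
    # Subtract the max score before exponentiating for numerical
    # stability; the resulting probabilities are unchanged.
    e = np.exp(s - np.max(s))
    return e / e.sum()

def cross_entropy_loss(s, correct_class):
    # L_i = -log P(Y = y_i | X = x_i)
    return -np.log(softmax(s)[correct_class])

# The lecture's walkthrough: raw scores for cat, car, frog
s = np.array([3.2, 5.1, -1.7])
p = softmax(s)                       # roughly [0.13, 0.87, 0.00]
loss = cross_entropy_loss(s, 0)      # loss if "cat" is the true class

# Rule 5 sanity check: with uniform scores over C = 10 classes,
# the loss should be -log(1/10) = log(10), about 2.3
uniform_loss = cross_entropy_loss(np.zeros(10), 0)
```

If the model assigned all its probability to the correct class, the loss would be $-\log(1) = 0$; as the correct class's probability approaches zero, the loss grows without bound, which is what drives training.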
## 7. Additional Resources & References

* **Resource:** ImageNet
  * **Type:** Dataset
  * **Relevance:** Referenced as a large-scale benchmark dataset that drove the success of modern computer vision algorithms.
* **Resource:** CIFAR-10
  * **Type:** Dataset
  * **Relevance:** Used throughout the lecture as a primary example dataset containing 60,000 $32\times32$ color images across 10 classes.
* **Resource:** FAISS (Facebook AI Similarity Search)
  * **Type:** Software Library
  * **Relevance:** Mentioned as a tool (`github.com/facebookresearch/faiss`) used in practice to perform fast, approximate nearest-neighbor searches when exact K-NN is too slow.
* **Resource:** "A Computational Approach to Edge Detection" by John Canny (IEEE PAMI, 1986)
  * **Type:** Paper
  * **Relevance:** Cited as an example of early, explicit rule-based attempts at object recognition that ultimately proved too brittle to scale.
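As a closing illustration, Rule 3's fold-rotation procedure can be sketched as follows. The `knn_predict` helper is a hypothetical minimal 1-loop K-NN included only so the sketch runs standalone; it is not the lecture's implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, X_val, k=1):
    # Minimal L1-distance K-NN, just enough to make cross_validate runnable
    preds = np.empty(len(X_val), dtype=y_train.dtype)
    for i, x in enumerate(X_val):
        nearest = np.argsort(np.abs(X_train - x).sum(axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds[i] = labels[np.argmax(counts)]
    return preds

def cross_validate(X, y, k_choices, num_folds=5):
    """Mean validation accuracy per K: train on N-1 folds, validate on the held-out fold, rotate."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for i in range(num_folds):
            # Fold i is the validation fold; concatenate the rest for training
            X_tr = np.concatenate(X_folds[:i] + X_folds[i + 1:])
            y_tr = np.concatenate(y_folds[:i] + y_folds[i + 1:])
            preds = knn_predict(X_tr, y_tr, X_folds[i], k=k)
            accs.append(float(np.mean(preds == y_folds[i])))
        results[k] = float(np.mean(accs))  # average accuracy over the rotated folds
    return results
```

Plotting `results` against the candidate $K$ values yields exactly the kind of accuracy-vs-$K$ curve the lecture uses to justify picking $K \approx 7$; the data should be shuffled before splitting so each fold contains all classes.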