# Non-Parametric Few-Shot Learning and Meta-Learning Applications

**Video Category:** Machine Learning / AI Tutorial

## 📋 0. Video Metadata

**Video Title:** Non-Parametric Few-Shot Learning CS 330
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~1 hour 24 minutes

## 📝 1. Core Summary

This lecture introduces non-parametric few-shot learning, exploring how parametric meta-training can produce highly effective non-parametric learners for low-data test environments. It addresses the inefficiency and optimization challenges of black-box and optimization-based models by teaching neural networks to map raw data into semantic embedding spaces where simple nearest-neighbor distance metrics can accurately classify novel examples. This approach enables fast, feed-forward architectures that generalize from very few examples across diverse real-world domains, including medical imaging, robotics, and language processing.

## 2. Core Concepts & Frameworks

* **Parametric vs. Non-Parametric Learners:**
  -> **Meaning:** Parametric learners represent a function with a fixed number of parameters (weights), while non-parametric learners use the training data itself (or embeddings of it) directly at inference time (e.g., K-Nearest Neighbors).
  -> **Application:** Non-parametric methods excel in the extreme low-data regime (few-shot test time), whereas parametric methods are better suited for large-scale meta-training.
* **Siamese Networks:**
  -> **Meaning:** A network architecture that takes two image inputs, passes them through identical sub-networks to extract embeddings, and outputs a binary prediction indicating whether they belong to the same class.
  -> **Application:** Used as an early approach to few-shot learning in which the model learns a distance metric via binary classification; it suffers, however, from a mismatch between the binary meta-training phase and the multi-class meta-testing phase.
* **Matching Networks:**
  -> **Meaning:** A non-parametric meta-learning framework that computes embeddings for a query image and a support set, then uses a distance metric (such as cosine distance) and a softmax function to output a probability distribution over classes.
  -> **Application:** Aligns the meta-training and meta-testing phases by training directly on N-way classification tasks, often using a bi-directional LSTM to make the support embeddings context-aware.
* **Prototypical Networks (ProtoNets):**
  -> **Meaning:** A framework that handles multiple shots (K > 1) by averaging the embeddings of all support examples within a class into a single "prototype" (centroid), then classifying query images by their distance to these prototypes.
  -> **Application:** Simplifies computation for higher-shot learning and reduces the impact of outlier support examples by creating a stable central representation for each class.
* **Algorithmic Consistency:**
  -> **Meaning:** The property of a learning algorithm whose learned procedure keeps improving as it is given more data at test time.
  -> **Application:** Optimization-based methods (like MAML) are consistent because they reduce to gradient descent, making them safe for scaling, whereas pure black-box methods (RNNs) carry no mathematical guarantee of improvement when fed more data than they saw in training.

## 3. Evidence & Examples (Hyper-Specific Details)

* **Failure of L2 Distance in Pixel Space (Zhang et al., arXiv 1801.03924):** A visual demonstration on screen compares a base image of a woman to two variants: a heavily blurred version and a slightly shifted version. While human perception sees the shifted image as essentially identical to the original, the L2 pixel distance metric calculates the heavily blurred image as being "closer." This demonstrates why raw pixel comparisons cannot be used for nearest-neighbor classification.
* **Dermatological Disease Diagnosis Case Study (Machine Learning for Healthcare Conference 2019):** Researchers tackled skin-disease classification using the Dermnet dataset (150 base classes, 50 novel classes). Because skin diseases have high intra-class variability (e.g., eczema presents differently across skin tones) and a long-tailed distribution, they used Prototypical Clustering Networks (PCN). PCN learns multiple prototypes per class to handle visual variance and integrates unlabeled support examples via k-means. PCN achieved 49.56% mean class accuracy on 5-way 10-shot tasks, heavily outperforming standard ProtoNets (37.08%) and fine-tuned ResNet baselines.
* **One-Shot Imitation Learning for Robotics (Yu*, Finn* et al., RSS 2018):** A robotic arm is tasked with manipulating objects. The model is meta-trained on video footage of a human performing the task (e.g., placing a peach into a red bowl). At meta-test time, the robot uses a teleoperated demonstration to adapt. Using a hybrid optimization-based/black-box method (MAML with an additional learned inner loss function), the robot successfully executes the policy from a novel starting position.
* **Low-Resource Molecular Property Prediction (Nguyen et al., 2020):** Applied to drug discovery to predict molecular activities with limited data. The base model uses a Gated Graph Neural Network to process molecular structures. Using MAML and its lighter-weight variants (first-order MAML and ANIL), the model outperformed transfer-learning and standard multi-task learning baselines, showing that optimization-based meta-learning can handle varying graph structures.
* **Few-Shot Human Motion Prediction (Gui et al.,
ECCV 2018):** Designed to predict future human motion coordinates from the past K time steps of motion (useful for autonomous driving). The model uses MAML coupled with an additional learned update rule and a recurrent neural network base model. A visual on screen shows the model accurately predicting the skeletal wireframe of a human transitioning from standing to sitting down.
* **Language Modeling as Few-Shot Learning (GPT-3, Brown et al., 2020):** Demonstrates a black-box meta-learner built on a massive Transformer architecture. Diverse tasks such as spelling correction ("pom -> prom"), simple math ("3+5 -> 8"), and language translation ("cheese -> fromage") are formatted entirely as text sequences. The model reads a few examples in its context window and outputs the correct string without any gradient updates or structural changes.

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Never use raw pixel distance for nearest neighbors** -> **Use a convolutional encoder** -> **Achieve semantic matching:** Always pass raw image inputs through a trained convolutional neural network (CNN) to extract latent feature vectors before applying distance metrics such as Euclidean (L2) or cosine distance.
* **Rule 2: Match meta-train and meta-test conditions** -> **Simulate testing during training** -> **Prevent distribution shift:** Ensure the number of classes (N-way) and examples per class (K-shot) used during meta-training matches the expected conditions at meta-test time.
* **Rule 3: Use Prototypical Networks for multi-shot data** -> **Average class embeddings into centroids** -> **Reduce computational overhead:** When given more than one support example per class (K > 1), compute the mean (prototype) of the class embeddings rather than measuring distances to every individual support example independently.
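Rules 1-3 can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the embedding network is assumed to have already mapped the support and query images into vectors, so the toy arrays below stand in for CNN features.

```python
import numpy as np

def prototypes(support_emb, support_labels, n_way):
    """Average each class's support embeddings into one centroid (ProtoNets)."""
    return np.stack([support_emb[support_labels == c].mean(axis=0)
                     for c in range(n_way)])

def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to the prototypes."""
    d2 = ((query_emb[None, :] - protos) ** 2).sum(axis=1)  # shape: (n_way,)
    logits = -d2
    p = np.exp(logits - logits.max())
    return p / p.sum()  # probability distribution over the N classes

# Toy 2-way, 2-shot episode in a 2-D embedding space.
support = np.array([[0.0, 0.0], [0.2, 0.0],    # class 0 support embeddings
                    [1.0, 1.0], [1.2, 1.0]])   # class 1 support embeddings
labels = np.array([0, 0, 1, 1])
protos = prototypes(support, labels, n_way=2)
probs = classify(np.array([0.1, 0.1]), protos)  # query lands near class 0
```

During meta-training the same softmax output would be fed to a cross-entropy loss and backpropagated through the embedding network, so the encoder learns a space in which these nearest-centroid decisions are accurate.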
* **Rule 4: Select optimization-based methods when future scaling is expected** -> **Rely on algorithmic consistency** -> **Ensure continued improvement:** If your deployment environment will eventually gather much more data per class, use an optimization-based method (like MAML), because it reduces to gradient descent and keeps improving with more data, whereas an RNN black-box model may break when fed sequences longer than its training context.
* **Rule 5: Implement multiple prototypes for high intra-class variance** -> **Apply internal clustering** -> **Capture diverse visual presentations:** In complex datasets (like medical imaging) where a single class can look drastically different, run k-means clustering within each class's embeddings to generate multiple distinct prototypes for the same label.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Using Siamese Networks for N-way few-shot evaluation tasks.
  -> **Why it fails:** It creates a severe discrepancy between training and inference: the network is meta-trained on a binary task (same vs. different) but must perform N-way classification at test time.
  -> **Warning sign:** The model shows high accuracy during binary validation but fails to generalize when scaled up to multiple classes in production.
* **Pitfall:** Averaging embeddings when the class data is highly multimodal.
  -> **Why it fails:** If a class contains visually distinct sub-categories, averaging their vectors produces a meaningless centroid that lands in an empty, irrelevant region of the latent space.
  -> **Warning sign:** Standard Prototypical Networks show poor accuracy on complex classes (e.g., the same disease appearing differently on varying skin tones).
* **Pitfall:** Deploying non-parametric methods for regression tasks.
  -> **Why it fails:** Non-parametric methods like Matching or Prototypical Networks compare queries against a discrete set of classes via distance metrics; they have no natural mechanism for outputting continuous numerical values.
  -> **Warning sign:** You find yourself artificially binning continuous targets into discrete classes just to make the architecture function.
* **Pitfall:** Scaling basic Matching Networks to large K (many shots).
  -> **Why it fails:** Calculating distances to every individual support example requires O(K) compute at inference, creating a bottleneck.
  -> **Warning sign:** Inference latency and memory usage grow linearly with the number of support examples you provide.

## 6. Key Quote / Core Insight

"To solve few-shot learning efficiently, we cannot rely on raw pixel comparisons; instead, we must use parametric meta-learning to construct a deep semantic embedding space where simple, non-parametric rules—like nearest neighbors—become highly accurate and computationally cheap at test time."

## 7. Additional Resources & References

* **Resource:** Siamese Neural Networks for One-shot Image Recognition (Koch et al., ICML Deep Learning Workshop 2015) - **Type:** Paper - **Relevance:** Introduces Siamese Networks, a foundational but limited binary approach to few-shot learning.
* **Resource:** Matching Networks for One Shot Learning (Vinyals et al., NeurIPS 2016) - **Type:** Paper - **Relevance:** Introduces a non-parametric method that resolves the train/test discrepancy found in Siamese Networks.
* **Resource:** Prototypical Networks for Few-shot Learning (Snell et al., NeurIPS 2017) - **Type:** Paper - **Relevance:** Establishes an efficient framework for K > 1 shot classification by averaging embeddings into class centroids.
* **Resource:** The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (Zhang et al., arXiv 1801.03924) - **Type:** Paper - **Relevance:** Provides the empirical demonstration of the failure of L2 distance in pixel space.
* **Resource:** Dermnet - **Type:** Dataset - **Relevance:** A real-world, long-tailed dataset of dermatological conditions used to test the robustness of few-shot medical-imaging models.
* **Resource:** Language Models are Few-Shot Learners (Brown et al., 2020) - **Type:** Paper - **Relevance:** The GPT-3 paper, which shows that massive Transformer models operate as gradient-free, black-box few-shot learners.
* **Resource:** Meta-Learning GNN Initializations for Low-Resource Molecular Property Prediction (Nguyen et al., 2020) - **Type:** Paper - **Relevance:** Explores MAML and graph neural networks for practical drug-discovery applications.
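As a closing illustration, the multi-prototype idea behind Rule 5 and the PCN case study can be sketched with k-means run inside one class. This is a minimal toy sketch, not the PCN implementation: the tiny hand-rolled k-means and the 2-D points standing in for learned embeddings are assumptions for the example.

```python
import numpy as np

def class_prototypes(emb, k, iters=20, seed=0):
    """Cluster one class's support embeddings into k prototypes via k-means."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest current center.
        d2 = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster goes empty.
        centers = np.stack([emb[assign == j].mean(axis=0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return centers

def dist_to_class(query, centers):
    """Distance to a class = distance to that class's nearest prototype."""
    return (((query - centers) ** 2).sum(axis=1)).min()

# A bimodal class: two visually distinct presentations of the same label.
cls = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
protos = class_prototypes(cls, k=2)
# A query near the second mode is close to the class even though the
# single averaged centroid (~[2.55, 2.5]) would sit far from both modes.
d = dist_to_class(np.array([5.0, 4.9]), protos)
```

With a single averaged prototype this query would look distant from its own class (the second pitfall above); with per-mode prototypes it is matched to the nearest one, which is the behavior PCN exploits on visually diverse disease classes.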