# Multi-Modal Foundation Models: From CLIP to Molmo and Visual Programming
**Video Category:** Machine Learning & Computer Vision Tutorial
## 0. Video Metadata
**Video Title:** Multi-Modal Foundation Models
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~1 hour 11 minutes
## 1. Core Summary (TL;DR)
The landscape of computer vision has shifted from training specialized models for individual tasks to utilizing "Foundation Models" pre-trained on massive, diverse datasets. By aligning image and text representations through contrastive learning (like CLIP), models can perform zero-shot classification and demonstrate remarkable robustness to out-of-distribution data. Furthermore, by fusing these visual encoders with Large Language Models (LLMs) to create Vision-Language Models (VLMs) like LLaVA, Flamingo, and Molmo, AI can achieve complex visual reasoning, pointing, and even generate executable code to solve compositional visual tasks.
## 2. Core Concepts & Frameworks
* **Foundation Models:** -> **Meaning:** A single, large-scale model pre-trained on a massive, diverse dataset that acts as a base layer, which can be adapted or queried out-of-the-box for a wide variety of downstream tasks. -> **Application:** Instead of training a new ResNet for a specific medical imaging dataset, you start with a pre-trained foundation model and adapt it using zero-shot prompting or a lightweight linear probe.
* **Contrastive Learning (e.g., SimCLR, CLIP):** -> **Meaning:** A self-supervised training objective that pulls together representations of related data (e.g., an image and its corresponding text description, or two augmentations of the same image) while pushing apart representations of unrelated data in the embedding space. -> **Application:** Enables models to learn rich, generalized semantic features without requiring explicit human-labeled class categories (see the loss sketch after this list).
* **Zero-Shot Classification via Text Encoders:** -> **Meaning:** Using a pre-trained text encoder to generate embedding vectors for target class labels (e.g., "plane", "dog", "bird") and classifying a new image by finding the text embedding with the highest cosine similarity (dot product) to the image's embedding. -> **Application:** Classifying objects in a new domain without collecting any training images for those specific categories (see the classification sketch after this list).
* **Vision-Language Models (VLMs):** -> **Meaning:** Architectures that fuse a pre-trained visual encoder (like CLIP) with a pre-trained Large Language Model (LLM) to process interleaved image and text inputs and generate text outputs. -> **Application:** Enabling a chatbot to answer complex questions about an uploaded image (Visual Question Answering), describe scenes, or generate code based on visual inputs.
* **Visual Programming (Chaining):** -> **Meaning:** The process of using an LLM to break down a complex visual query into a sequence of steps, generating an executable script (e.g., Python) that calls specialized foundation models (like object detectors, VQA models, or segmentation tools) to calculate the final answer. -> **Application:** Counting objects across multiple separate images by writing a loop that calls a counting tool on each image and sums the results.
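
To make the contrastive-learning bullet concrete, here is a minimal sketch (not code from the lecture) of a CLIP-style symmetric contrastive loss; it assumes a batch of N matching image-text pairs whose embeddings have already been L2-normalized.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Cosine-similarity logits between every image and every text in the batch.
    logits = image_features @ text_features.t() / temperature  # (N, N)
    # The matching pair for row i sits in column i; every other entry is a negative.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```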
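And a minimal zero-shot classification sketch for the bullet above, assuming a CLIP-style model that exposes `encode_image` / `encode_text` and a matching tokenizer (exact APIs vary by library):

```python
import torch

@torch.no_grad()
def zero_shot_classify(model, tokenizer, image_tensor, class_names):
    prompts = [f"A photo of a {name}" for name in class_names]
    text_emb = model.encode_text(tokenizer(prompts))           # (C, D)
    image_emb = model.encode_image(image_tensor.unsqueeze(0))  # (1, D)
    # Normalize so the dot product equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.t()).squeeze(0)             # (C,)
    return class_names[scores.argmax().item()]
```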
## 3. Evidence & Examples (Hyper-Specific Details)
* **Zero-Shot Robustness (CLIP vs. ResNet on ObjectNet):** The speaker demonstrates that a supervised ResNet101 achieves 76.2% on standard ImageNet but drops drastically to 32.6% on ObjectNet (a dataset with objects in unusual contexts, like a banana on the floor). In contrast, CLIP (ViT-L) achieves the same 76.2% on ImageNet but maintains a highly robust 72.3% on ObjectNet, proving it generalizes much better to unseen environments.
* **Prompt Engineering for Zero-Shot Classifiers:** Passing a single word like "dog" to CLIP's text encoder yields suboptimal results due to training bias (internet text is rarely single words). By changing the prompt to "A photo of a [category]", ImageNet accuracy increased by 1.3%. Averaging the vectors of multiple prompts (e.g., "A photo of a [dog]", "A drawing of a [dog]", "A sketch of a [dog]") yielded a +5% performance boost on ImageNet (see the prompt-ensembling sketch after this list).
* **CoCa (Contrastive Captioners):** To improve upon CLIP, CoCa added a text decoder with a captioning (autoregressive) loss on top of the contrastive loss. This hybrid objective pushed zero-shot ImageNet performance to 86.3%, and when fine-tuned, it achieved 91.0%, effectively beating purely supervised models for the first time.
* **LLaVA Architecture:** To connect a vision encoder to an LLM, LLaVA uses a simple trainable Linear Layer to project frozen CLIP image features directly into the input embedding space of a frozen LLM (like LLaMA); see the projection sketch after this list.
* **Flamingo Architecture & In-Context Learning:** Flamingo utilizes a "Perceiver Resampler" to convert variable-sized image token sequences into a fixed number of tokens. It then injects these visual tokens into the LLM at every layer using a newly added, trainable "Gated Cross-Attention" module. This architecture enabled zero-shot and few-shot in-context learning (e.g., prompting the model with alternating images and texts: `[Image of Chinchilla] "This is a chinchilla" -> [Image of Shiba] "This is a shiba" -> [New Image] "This is a..." -> Model outputs "flamingo"`). A simplified gated cross-attention sketch follows this list.
* **Molmo & The PixMo Dataset:** To overcome the limitations of noisy, scraped web data, the AI2 team built the PixMo dataset (700k pairs). Annotators recorded 60-90 seconds of speech describing images in extreme detail (shape, color, spatial layout) which was converted to text using Whisper. The resulting open-source Molmo-72B model achieved top-tier performance on human preference ELO ratings, matching or beating proprietary models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5.
* **Grounding and Pointing (Molmo):** Molmo was trained not just to output text, but to output 2D coordinate points. When prompted "Point to the handle on the white bottle", the model directly outputs the X,Y coordinates of the handle, enabling downstream robotics applications (e.g., passing those coordinates to a robotic arm for manipulation).
* **Segment Anything Model (SAM):** A segmentation foundation model trained on 1 billion masks (SA-1B dataset). It can take an image and a prompt (a point, a bounding box, or text) and output a segmentation mask. The speaker demonstrates chaining by taking Molmo's point output ("Point to the cricket bat") and feeding that specific coordinate into SAM 2 to generate a perfect pixel-level mask of the bat (sketched in code after this list).
* **VisProg (Visual Programming) for Image Editing:** Given the instruction "Replace desert with lush green grass", VisProg uses GPT-3 to generate a Python program: `OBJ0 = Seg(image=IMAGE, query='desert')` followed by `IMAGE0 = Replace(image=IMAGE, object=OBJ0, prompt='lush green grass')`.
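
A minimal sketch of the prompt-ensembling trick described above (averaging several templates into one class prototype), under the same assumed `encode_text` / tokenizer interface as the earlier sketches:

```python
import torch

TEMPLATES = ["A photo of a {}.", "A drawing of a {}.", "A sketch of a {}."]

@torch.no_grad()
def class_prototypes(model, tokenizer, class_names):
    prototypes = []
    for name in class_names:
        emb = model.encode_text(tokenizer([t.format(name) for t in TEMPLATES]))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)                 # average over the prompt templates
        prototypes.append(mean / mean.norm())  # re-normalize the mean vector
    return torch.stack(prototypes)             # (num_classes, D)
```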
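A minimal sketch of the LLaVA-style connector: one trainable linear layer mapping frozen CLIP patch features into the frozen LLM's token-embedding space (the dimensions here are illustrative, not LLaVA's exact values):

```python
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    def __init__(self, clip_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(clip_dim, llm_dim)  # the only trained component in this sketch

    def forward(self, clip_patch_features):       # (B, num_patches, clip_dim)
        # Projected patches can then be prepended to the text token embeddings.
        return self.proj(clip_patch_features)     # (B, num_patches, llm_dim)
```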
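A simplified sketch of the Flamingo-style gated cross-attention block: the tanh gate is initialized at zero so the frozen LLM's behavior is unchanged at the start of training (the real block also gates a feed-forward sublayer and adds normalization):

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, text_tokens, visual_tokens):
        # Text tokens attend to the (fixed-length) visual tokens from the resampler.
        attended, _ = self.attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        return text_tokens + torch.tanh(self.alpha) * attended  # gated residual update
```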
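A sketch of the point-then-segment chain from the demo. The `ask_molmo_for_point` helper is hypothetical; the segmentation side uses the original `segment_anything` package for illustration (the talk used SAM 2, whose API differs):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

def segment_named_object(image_rgb, object_name, sam_checkpoint="sam_vit_h.pth"):
    # 1. Ask the pointing VLM for an (x, y) location -- hypothetical wrapper.
    x, y = ask_molmo_for_point(image_rgb, f"Point to the {object_name}")
    # 2. Feed that single point into a promptable segmenter.
    sam = sam_model_registry["vit_h"](checkpoint=sam_checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)        # image as an RGB numpy array
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),       # 1 = foreground point
        multimask_output=True,            # several granularities, as in Rule 5 below
    )
    return masks[scores.argmax()]         # keep the highest-scoring mask
```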
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Use Linear Probing before Fine-Tuning:** When adapting a model like CLIP to a new dataset, do not immediately fine-tune the entire network. Instead, freeze the image encoder and train a single linear classifier layer on top of the extracted features. This requires far less data and compute while delivering massive performance gains (see the linear-probe sketch after this list).
* **Rule 2: Average Multiple Text Prompts for Zero-Shot Vectors:** When using CLIP for zero-shot classification, never use bare category names. Create a vector representation for a class by generating embeddings for multiple descriptive phrases (e.g., "A photo of a [class]", "A drawing of a [class]") and calculate the mean vector to use as your robust class prototype.
* **Rule 3: Prioritize Data Intentionality Over Sheer Scale:** For training Vision-Language Models, raw internet scraping (incidental data) plateaus in utility because it lacks spatial and structural descriptions. Invest in highly detailed, intentional human-annotated data (like the PixMo dataset) to teach the model spatial reasoning and subtle details.
* **Rule 4: Chain Models to Solve Compositional Reasoning:** If a VLM fails at complex spatial or counting logic across multiple images, switch to a Visual Programming paradigm. Use an LLM to generate code that sequences targeted foundation models (e.g., Object Detector -> Counter -> Math operator) rather than forcing a single model to do the reasoning internally (see the chaining sketch after this list).
* **Rule 5: Extract Multiple Masks for Ambiguous Prompts:** When building or querying segmentation models with ambiguous prompts (like a single click on a pair of scissors), ensure the architecture (like SAM) outputs multiple masks at different levels of granularity (e.g., the handle, the blade, and the whole scissors) so the downstream system can select the most appropriate one.
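
A minimal linear-probe sketch for Rule 1, assuming `train_features` / `val_features` are precomputed (N, D) arrays of frozen CLIP image embeddings and `train_labels` / `val_labels` are the matching integer class labels:

```python
from sklearn.linear_model import LogisticRegression

# The probe is the only trainable component; the image encoder stays frozen
# and its features are extracted once, offline.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_features, train_labels)
print("linear-probe accuracy:", probe.score(val_features, val_labels))
```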
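A sketch of the visual-programming pattern from Rule 4, i.e. the kind of program an LLM might emit for "count the boats across these images". `detect_objects` is a hypothetical open-vocabulary detector wrapper; the point is that the counting happens in ordinary Python, not inside the VLM:

```python
def count_across_images(images, category):
    total = 0
    for image in images:
        boxes = detect_objects(image, category)  # hypothetical detector tool call
        total += len(boxes)                      # counting done in code, not by the VLM
    return total
```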
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Using small batch sizes during Contrastive Learning. -> **Why it fails:** The model relies on encountering "hard negative" examples (e.g., a Welsh Corgi vs. a similar-looking dog) within the same batch to learn fine-grained distinctions. Small batches lack these difficult comparisons, causing the model to only learn high-level concepts (e.g., it learns "animal" instead of "Welsh Corgi"). -> **Warning sign:** The model accurately classifies broad categories but fails on specific sub-categories or breeds.
* **Pitfall:** Expecting CLIP to understand compositionality and spatial relationships. -> **Why it fails:** CLIP processes text somewhat like a "bag of words" and struggles with syntax and word order. -> **Warning sign:** The model outputs identical high-confidence scores for prompts describing completely different physical realities, such as "a mug in some grass" versus "some grass in a mug," or "horse eating grass" versus "grass eating horse."
* **Pitfall:** Relying solely on VLM internal reasoning for exact counting. -> **Why it fails:** VLMs (like standard LLMs) process information autoregressively and do not have native, pixel-perfect counting mechanisms, leading to hallucinations. -> **Warning sign:** When asked "How many boats are there?", the model confidently gives an incorrect number without pointing to or grounding each boat individually.
## 6. Key Quote / Core Insight
"Internet data is incidental. Human annotated data is intentional."
*Insight: Scraping billions of image-text pairs from the web yields massive scale, but the text is often subjective alt-text that ignores the actual visual reality of the image. To achieve true spatial and visual reasoning, models require intentionally crafted data where humans explicitly describe the objective visual facts, shapes, and locations within the frame.*
## 7. Additional Resources & References
* **Resource:** SimCLR - **Type:** Paper/Methodology - **Relevance:** Foundational framework for self-supervised contrastive learning on images.
* **Resource:** CLIP (OpenAI, 2021) - **Type:** Model/Paper - **Relevance:** The pivotal vision-language model establishing zero-shot classification via contrastive text-image pre-training.
* **Resource:** CoCa (Contrastive Captioners, 2022) - **Type:** Model - **Relevance:** Improved CLIP by adding a generation/captioning objective.
* **Resource:** LLaVA - **Type:** Model Architecture - **Relevance:** Demonstrates a simple, effective way to link vision encoders to LLMs via a linear projection layer.
* **Resource:** Flamingo (DeepMind) - **Type:** Model/Paper - **Relevance:** Pioneered the use of Perceiver Resamplers and Gated Cross-Attention for interleaved multimodal in-context learning.
* **Resource:** Molmo & PixMo dataset (AI2) - **Type:** Open-Source Model and Dataset - **Relevance:** Proves that smaller, highly intentional, speech-generated datasets can outperform massive noisy web datasets.
* **Resource:** Segment Anything Model (SAM) / SA-1B Dataset - **Type:** Model/Dataset - **Relevance:** The primary foundation model for promptable image segmentation.
* **Resource:** VisProg (Visual Programming) - **Type:** Methodology/Paper - **Relevance:** Shows how to use LLMs to write code that chains multiple visual tools together for compositional reasoning.