# Analysis Methods in NLP: Adversarial Testing, Probing, and Feature Attribution

**Video Category:** Natural Language Processing Tutorial / Academic Research Methodology

## 📋 0. Video Metadata

* **Video Title:** Analysis methods in NLP: Overview
* **YouTube Channel:** Stanford ENGINEERING
* **Publication Date:** Not shown in video
* **Video Duration:** ~8 minutes

## 📝 1. Core Summary (TL;DR)

This lecture introduces four structured analysis methods for Natural Language Processing (NLP) to evaluate model robustness and deeply understand internal system behavior. Moving beyond standard accuracy metrics, it outlines behavioral evaluations (adversarial testing and training) and structural evaluations (probing and feature attribution). Applying these specific techniques allows researchers to expose latent weaknesses, map internal model representations, and ultimately build more robust systems, providing a highly structured approach to writing the analysis section of academic papers.

## 2. Core Concepts & Frameworks

* **Adversarial Testing**
  -> **Meaning:** The process of challenging a model with carefully perturbed examples—often created by making minor lexical changes to standard dataset items—to expose a lack of systematicity or over-reliance on shallow heuristics. (A minimal code sketch follows this list.)
  -> **Application:** Swapping a single word in a text premise to see if an NLI (Natural Language Inference) model genuinely understands the logical relationship (entailment vs. contradiction) or if it is just guessing based on word associations.
* **Probing (Internal Representations)**
  -> **Meaning:** The technique of fitting small, supervised analytical models (probes) onto the hidden, internal layers of a deep neural network to reveal what specific linguistic properties those layers latently encode.
  -> **Application:** Testing the intermediate layers of a transformer model to determine exactly at which layer the network transitions from processing basic syntax (parts of speech) to higher-level semantic concepts (coreference).
* **Feature Attribution**
  -> **Meaning:** An introspective evaluation method that quantifies how heavily individual input features (like specific words) influence the model's final prediction, often mapping these influences visually.
  -> **Application:** Using techniques like Integrated Gradients on a sentiment classifier to color-code words, revealing whether the model correctly understands context or blindly assigns negative scores to typically negative words regardless of how they are used.
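The adversarial-testing idea above can be prototyped in a few lines. The sketch below generates Glockner-style single-word substitutions with WordNet and checks whether an off-the-shelf NLI model's label flips. The `roberta-large-mnli` checkpoint, the `nltk` WordNet interface, and the helper names `predict` and `single_word_variants` are illustrative assumptions, not the pipeline used in the lecture or in Glockner et al. (2018).

```python
# Minimal sketch: Glockner-style single-word substitutions for NLI stress tests.
# Assumes: pip install torch transformers nltk
import nltk
import torch
from nltk.corpus import wordnet as wn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("wordnet", quiet=True)

MODEL = "roberta-large-mnli"  # assumed off-the-shelf MultiNLI checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def predict(premise: str, hypothesis: str) -> str:
    """Return the model's NLI label (CONTRADICTION / NEUTRAL / ENTAILMENT)."""
    enc = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]

def single_word_variants(hypothesis: str, target: str) -> list[str]:
    """Swap `target` for WordNet synonyms and hyponyms (e.g. wine -> champagne).
    Real adversarial sets (Glockner et al. 2018) also filter candidates by hand."""
    candidates = set()
    for syn in wn.synsets(target):
        for related in [syn] + syn.hyponyms():
            candidates.update(l.name().replace("_", " ") for l in related.lemmas())
    candidates.discard(target)
    return [hypothesis.replace(target, c) for c in sorted(candidates)]

premise = "An elderly couple are sitting outside a restaurant, enjoying wine."
hypothesis = "A couple drinking wine."
print("original:", predict(premise, hypothesis))
for variant in single_word_variants(hypothesis, "wine")[:5]:
    # A systematic model should switch to NEUTRAL for hyponym swaps.
    print(variant, "->", predict(premise, variant))
```

On the lecture's example, a systematic model should keep *Entails* for true synonyms but switch to *Neutral* when "wine" is replaced by a hyponym such as "champagne".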
## 3. Evidence & Examples (Hyper-Specific Details)

* **Adversarial Testing via "Breaking NLI" (Glockner et al. 2018):** Researchers tested NLI models by making mild, single-word changes using lexical resources.
  * **Example 1:** Premise: *"A little girl kneeling in the dirt crying"* -> Entails: *"A little girl is very sad."* They changed "sad" to "unhappy". The logical relation should remain *Entails*, but standard models frequently predicted *Contradiction*, likely tripped up by the negation prefix.
  * **Example 2:** Premise: *"An elderly couple are sitting outside a restaurant, enjoying wine"* -> Entails: *"A couple drinking wine."* They changed "wine" to "champagne". The logical relation flips to *Neutral* (all champagne is wine, but not all wine is champagne). Models frequently predicted *Entails* due to a fuzzy, non-systematic understanding of the relationship between the words.
  * **Result:** While models scored in the mid-to-high 80s on the standard SNLI test set, their performance plummeted on this new adversarial set. The only exceptions were the WordNet baseline and the KIM architecture, simply because those models had access to the specific lexical resources used to build the adversarial test.
* **Modern Architecture Robustness (RoBERTa-MNLI, 2021):** An off-the-shelf RoBERTa model fine-tuned on the MultiNLI dataset was tested against the 2018 Glockner adversarial dataset. It achieved an astounding F1 score of 0.99 for contradiction, 0.92 for entailment, and an overall accuracy of 0.97. This shows that modern transformer models can learn systematic lexical relations even though their fine-tuning data was not explicitly designed to address that specific adversarial challenge.
* **Probing Internal Representations of BERT-large (Tenney et al. 2019):** A visual chart mapped the 24 layers of BERT-large. The data showed that lower layers focus heavily on syntactic elements (Parts of Speech, Constituents, Dependencies). As the layers progress toward layer 24, the model increasingly encodes complex discourse and semantic content (Entities, Semantic Role Labeling, Coreference, Relations).
* **Feature Attribution via Integrated Gradients (Sundararajan et al. 2017):** Applied to a sentiment model evaluating the phrase *"They sell a mean apple pie"* (where "mean" idiomatically means "delicious"). A sketch of the underlying computation follows this list.
  * **Visual Coding:** Blue indicated a bias toward positive predictions; red indicated a bias toward negative predictions.
  * **Test Variations:** For phrases like *"They sell a mean apple pie"*, *"They make..."*, and *"He makes..."*, the model correctly predicted Positive with high probabilities (0.85, 0.68, 0.97).
  * **Failure Mode:** For the phrase *"He sells a mean apple pie"*, the prediction inexplicably flipped to Negative (probability 0.15). The color-coded feature attribution revealed that the model heavily weighted "mean" as dark red (negative) across all examples. It did not understand the context of "mean apple pie" and broke completely under a minor, incidental grammatical variation.
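To make the attribution numbers behind that failure mode concrete, here is a minimal sketch of the Integrated Gradients computation (Sundararajan et al. 2017) applied to a sentiment classifier. The `distilbert-base-uncased-finetuned-sst-2-english` checkpoint, the zero-embedding baseline, and the `integrated_gradients` helper are assumptions chosen for illustration; the lecture does not specify which model or implementation it used.

```python
# Minimal sketch: Integrated Gradients over the input embeddings of a sentiment classifier.
# Assumes: pip install torch transformers
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed public SST-2 checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def integrated_gradients(text: str, target_label: int = 1, steps: int = 50):
    """Per-token attribution toward `target_label` (1 = POSITIVE for this checkpoint)."""
    enc = tok(text, return_tensors="pt")
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()  # (1, T, H)
    baseline = torch.zeros_like(embeds)        # zero-embedding baseline (one common choice)
    total_grad = torch.zeros_like(embeds)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = baseline + alpha * (embeds - baseline)  # straight-line path to the input
        point.requires_grad_(True)
        logits = model(inputs_embeds=point, attention_mask=enc["attention_mask"]).logits
        prob = torch.softmax(logits, dim=-1)[0, target_label]
        total_grad += torch.autograd.grad(prob, point)[0]
    # IG = (input - baseline) * average gradient along the path, summed over hidden dims.
    scores = ((embeds - baseline) * (total_grad / steps)).sum(-1).squeeze(0)
    return list(zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()))

for token, score in integrated_gradients("He sells a mean apple pie"):
    # A strongly negative score on "mean" would reproduce the failure mode described above.
    print(f"{token:>10s} {score:+.4f}")
```

In practice a PAD-token baseline and the Captum library's `IntegratedGradients` or `LayerIntegratedGradients` classes are common choices; the hand-rolled loop above is only meant to show the interpolate, accumulate, and rescale recipe.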
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Implement targeted adversarial testing** - Do not trust standard benchmark accuracy alone. Systematically modify your test examples using specific lexical changes (synonyms, hypernyms, negations) to prove your system relies on genuine language comprehension rather than exploitable dataset artifacts.
* **Rule 2: Utilize domain-specific adversarial datasets** - When analyzing a model, test it against established, stress-tested datasets built via adversarial dynamics. Use the Zellers et al. datasets for commonsense reasoning, Nie et al. for NLI, Bartolo et al. for QA, DynaSent for sentiment, and Vidgen et al. for hate speech.
* **Rule 3: Map layer functionality with probes** - If utilizing deep architectures (like 24-layer transformers), train supervised probes on intermediate layers to empirically establish what syntactic or semantic structures the model is actually learning during pre-training. (A minimal probing sketch appears after the references at the end of this document.)
* **Rule 4: Visualize decisions using feature attribution** - Apply techniques like Integrated Gradients to map word-level importance. Verify that your model is achieving correct predictions by focusing on the right contextual cues, rather than relying on brittle, default token associations that will fail in edge cases.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Writing unstructured, open-ended error analysis sections in papers.
  -> **Why it fails:** General discussions of misclassified examples lack scientific rigor and fail to pinpoint the mechanical reasons for the failure.
  -> **Warning sign:** The analysis section of your paper relies on phrases like "the model makes several errors" without providing structural probing or feature attribution data to explain why.
* **Pitfall:** Assuming models understand context based on correct overall predictions.
  -> **Why it fails:** Models may correctly guess the sentiment of a sentence by relying heavily on incidental words while entirely misunderstanding idiomatic phrases.
  -> **Warning sign:** A model predicts positive sentiment for an idiomatic phrase, but feature attribution shows it assigned heavily negative weights to key words and only succeeded by accident.
* **Pitfall:** Expecting models to generalize without systematic lexical knowledge.
  -> **Why it fails:** Models trained on standard datasets often learn fuzzy associations (like "wine" and "champagne" appearing together) without understanding strict logical boundaries (hypernyms/hyponyms).
  -> **Warning sign:** A model that scores 85% on a standard NLI task immediately begins predicting false entailments when specific nouns are swapped for related, but logically distinct, categories.

## 6. Key Quote / Core Insight

"True evaluation of an NLP system goes beyond standard benchmark accuracy; it requires deliberately breaking the model with adversarial tests to expose its limits, and probing its internal layers to prove that it is solving problems for the right structural reasons."

## 7. Additional Resources & References

* **Resource:** Breaking NLI (Glockner et al. 2018) - **Type:** Paper / Dataset - **Relevance:** Essential reference and dataset for testing the systematic lexical knowledge of Natural Language Inference models through targeted word replacement.
* **Resource:** Zellers et al. (2018, 2019) - **Type:** Papers / Datasets - **Relevance:** Resources for adversarial training and testing in commonsense reasoning.
* **Resource:** Nie et al. (2020) - **Type:** Paper / Dataset - **Relevance:** Resource for adversarial training and testing in NLI.
* **Resource:** Bartolo et al. (2020) - **Type:** Paper / Dataset - **Relevance:** Resource for adversarial training and testing in Question Answering (QA).
* **Resource:** DynaSent (Potts et al. 2020) - **Type:** Paper / Dataset - **Relevance:** Resource for adversarial training and testing in sentiment analysis.
* **Resource:** Vidgen et al. (2020) - **Type:** Paper / Dataset - **Relevance:** Resource for adversarial training and testing in hate speech detection.
* **Resource:** Tenney et al. (2019) - **Type:** Paper - **Relevance:** Foundational study on using probing to map syntactic and semantic representations across the layers of deep language models like BERT.
* **Resource:** Integrated Gradients (Sundararajan et al. 2017) - **Type:** Methodology / Paper - **Relevance:** A specific technique for feature attribution used to calculate and visualize the word-level importance of model predictions.
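As a closing, supplementary illustration of the probing workflow cited above (Tenney et al. 2019) and recommended in Rule 3, the sketch below fits a logistic-regression probe on the frozen hidden states of every layer of a transformer. The `bert-base-uncased` checkpoint, the toy part-of-speech labels, and the `layer_states` / `probe_accuracy` helpers are assumptions for illustration only; a real probing study uses an annotated corpus (e.g. Universal Dependencies) and reports held-out accuracy rather than training accuracy.

```python
# Minimal sketch: layer-wise probing of a transformer's hidden states with a linear probe.
# Assumes: pip install torch transformers scikit-learn
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-uncased"  # assumed 12-layer checkpoint; the lecture's chart used 24-layer BERT-large
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

# Toy supervision: coarse part-of-speech tags per word (a real probe uses an annotated corpus).
SENTENCES = [
    ("the girl is crying", ["DET", "NOUN", "VERB", "VERB"]),
    ("a couple drinks wine", ["DET", "NOUN", "VERB", "NOUN"]),
]

def layer_states(sentence: str):
    """Tokenize and return (encoding, per-layer hidden states); index 0 is the embedding layer."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return enc, out.hidden_states  # tuple of (1, T, H) tensors, length = num_layers + 1

def probe_accuracy(layer: int) -> float:
    """Fit a linear probe on frozen representations from `layer` and report training accuracy."""
    X, y = [], []
    for sentence, tags in SENTENCES:
        enc, states = layer_states(sentence)
        for position, word_id in enumerate(enc.word_ids()):  # map sub-words back to words
            if word_id is not None:                          # skip [CLS] / [SEP]
                X.append(states[layer][0, position].numpy())
                y.append(tags[word_id])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe.score(X, y)

num_layers = len(layer_states(SENTENCES[0][0])[1])
for layer in range(num_layers):
    print(f"layer {layer:2d}: probe accuracy = {probe_accuracy(layer):.2f}")
```

With real annotations and a held-out split, plotting probe accuracy per layer reproduces the pattern summarized in Section 3: syntactic tasks peak in the lower layers, while entities, coreference, and other semantic tasks improve toward the top.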