# Evaluation Methods and Experimental Protocols in Natural Language Understanding

**Video Category:** Machine Learning / Data Science Tutorial

## 📋 0. Video Metadata

* **Video Title:** Evaluation Methods in NLP (Inferred from visible notebook)
* **YouTube Channel:** Stanford Engineering
* **Publication Date:** Spring 2019 (Inferred from slide text)
* **Video Duration:** ~1 hour 18 minutes

## 📝 1. Core Summary (TL;DR)

This lecture outlines the rigorous experimental protocols required for Natural Language Understanding (NLU) projects, moving beyond simply chasing high accuracy numbers. It establishes that a project's true scientific value lies in formulating clear hypotheses, selecting appropriate evaluation metrics, constructing contextual baselines, and conducting fair hyperparameter optimization. Implementing these practices prevents "hill-climbing" on test sets and ensures that performance claims reflect genuine model understanding rather than statistical noise or data leakage.

## 2. Core Concepts & Frameworks

* **Experimental Protocol:** A structured, pre-defined framework for an experiment that forces researchers to state their hypotheses, data, metrics, and models *before* final execution. **Application:** Prevents ad-hoc engineering and hacking by requiring a justified plan for how data and models will come together to test a specific claim.
* **Train/Dev/Test Data Splits:** The fundamental practice of partitioning data to isolate evaluation. The model learns on the *Train* set, hyperparameter tuning is conducted on the *Dev* (development/validation) set, and final generalization is evaluated strictly once on the *Test* set. **Application:** Prevents models from memorizing the specific examples they are evaluated on, ensuring they can handle unseen data.
* **Stratified Cross-Validation:** A resampling procedure (like K-Fold) where data is partitioned, but the distribution of classes is forced to remain proportional across every split. **Application:** Crucial for unbalanced datasets; ensures that rare classes appear in the same proportion in both training and testing phases for every fold, preventing volatile evaluation numbers. (See the sketch after this list.)
* **Task-Specific Baselines:** Heuristic models designed to exploit specific biases or structures in a dataset without actually solving the intended underlying problem. **Application:** Used to establish a true lower bound for performance. For example, testing whether a model can answer a reading comprehension question just by looking at the answers and ignoring the passage.
* **Hyperparameter Optimization:** The systematic search for the best external model settings (e.g., learning rate, regularization, network depth) that are not learned during standard training. **Application:** Required to ensure a fair comparison between a novel architecture and a baseline; a new model might only win because it was tuned extensively while the baseline used default settings.
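The split discipline described above can be expressed in a few lines of scikit-learn. This is a minimal sketch assuming a synthetic dataset from `make_classification`; the toy data and variable names are illustrative, not taken from the lecture.

```python
# Sketch: carve off a held-out test set once, then use stratified CV for all
# interim decisions. Toy data stands in for a real NLU dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)

# The test set is created once, with class proportions preserved, and is not
# touched again until the very end of the project.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# All model-selection and tuning decisions use stratified CV on the remaining
# data, so every fold keeps the same 80/20 class balance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, dev_idx in cv.split(X_trainval, y_trainval):
    pass  # fit on train_idx, evaluate on dev_idx
```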
## 3. Evidence & Examples (Hyper-Specific Details)

* **Bake-off 4 Results (Data Imbalance):** The task was word-level natural language inference (predicting word entailment given two words). The dataset was highly unbalanced: 1767 negative labels vs. 446 positive labels. Because of this imbalance, the official evaluation metric was forced to be Macro F1 rather than Micro F1 or Accuracy. The baseline Macro F1 was ~0.67.
* **1st Place Model (Group 26):** Achieved a Macro F1 score of 0.7852 using a BERT sequence classification model (pre-trained BERT in PyTorch). To handle the 1767/446 data imbalance, they employed a "Random Oversampler" during preprocessing to randomly sample minority-class examples with replacement.
* **2nd Place Model (Group 9):** Achieved a Macro F1 score of 0.7541 using Facebook's InferSent model, pre-trained on SNLI (the Stanford NLI corpus). To handle the data imbalance, they used a weighted loss function with class weights `[1, 5.3]`, giving class 1 5.3 times more emphasis than class 0 during loss calculation. (Both imbalance strategies are sketched after this list.)
* **Poor Performing Bake-off Models:** A qualitative analysis of the bottom 10 models showed they relied heavily on shallow networks, linear regression, and SVMs. The speaker noted that these struggle on this task because "there is no obvious way to create clever hand-crafted feature representations," making deep neural classifiers clearly superior here. Additionally, models that used element-wise multiplication to combine vectors performed poorly.
* **Hypothesis Example (Social Science):** A hypothesis formulated by a previous project team analyzing Project Gutenberg texts: "I believe that in processing Project Gutenberg files, I can tell whether the author is a man or a woman based on their portrayal of female characters."
* **Baseline Context Example:** If a system gets 0.95 F1, it seems objectively great. However, if a simple baseline model also achieves 0.95, the task is revealed to be trivial. Conversely, an F1 of 0.60 might seem poor, but if human annotators only achieve 0.80 agreement, the 0.60 indicates strong traction on a highly difficult problem.
* **Story Cloze Task Baseline (Schwartz et al., 2017):** In a task meant to distinguish between a coherent and an incoherent ending for a story, researchers built a baseline that *only* looked at the ending options (ignoring the story entirely). This baseline performed exceptionally well, revealing severe biases in how the dataset was constructed.
* **Extreme Hyperparameter Tuning (Rajkomar et al., 2018):** To illustrate the impracticality of "perfect" tuning for most researchers, the speaker cited a paper where models were tuned using Google Vizier for a total of >201,000 GPU hours. This demonstrates why most teams must use guided sampling or random search on a fixed budget.
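The winning teams' exact preprocessing code is not shown in the lecture, but the two imbalance strategies they describe can be sketched roughly as follows, assuming `imbalanced-learn` for the oversampling and PyTorch for the weighted loss (both library choices are assumptions, not confirmed by the video).

```python
# Sketch: two ways to address the 1767 / 446 label imbalance described above.
import numpy as np
import torch
from imblearn.over_sampling import RandomOverSampler

rng = np.random.default_rng(0)
X = rng.normal(size=(2213, 16))                 # stand-in features
y = np.array([0] * 1767 + [1] * 446)            # imbalance from Bake-off 4

# Strategy 1 (1st place): randomly oversample the minority class with replacement.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_res))                       # classes are now balanced

# Strategy 2 (2nd place): keep the data as-is, but weight the loss so that
# class 1 counts 5.3 times as much as class 0.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.3]))
logits = torch.randn(8, 2, requires_grad=True)  # stand-in model outputs
targets = torch.randint(0, 2, (8,))
loss = loss_fn(logits, targets)
```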
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Define Explicit Hypotheses Upfront** - Do not just build models to "see what happens." State a clear, testable claim regarding why a specific architecture or data representation should excel on your chosen dataset before writing code.
* **Rule 2: Implement "Dummy" Baselines Immediately** - Always include naive baselines in your experimental protocol. Use `scikit-learn` tools like `DummyClassifier` (guessing the majority class or sampling labels proportionally at random) and `DummyRegressor` to establish the absolute floor of performance.
* **Rule 3: Use Stratified Splitting for Classification** - When using `scikit-learn` for classification problems, always use `StratifiedShuffleSplit` or `StratifiedKFold` instead of standard random splits to ensure minority classes are not accidentally excluded from testing folds.
* **Rule 4: Isolate the Test Set Strictly** - The test set must be run *exactly once* at the very end of the project. If you iterate on your model design after seeing test-set results, you are "hill-climbing" and invalidating your evaluation. Use cross-validation on the training data for interim development decisions.
* **Rule 5: Specify the `scoring` Parameter in Scikit-Learn** - When using functions like `cross_val_score` or `GridSearchCV`, never leave the default `scoring=None`; for classifiers this falls back to plain accuracy, which is highly misleading on unbalanced datasets. Explicitly set it (e.g., to Macro F1 via `scoring="f1_macro"`).
* **Rule 6: Tune Baselines as Aggressively as Novel Models** - To make a persuasive scientific case, you must grant the baseline models the same hyperparameter optimization budget (e.g., via `RandomizedSearchCV`) as your proposed model, to show that the architectural difference is what matters. (Rules 2, 5, and 6 are combined in the sketch after this list.)
* **Rule 7: Use Wilcoxon Signed-Rank for Model Comparison** - To argue that Model A is statistically significantly better than Model B, run both models at least 10 times on different random data splits and apply the Wilcoxon signed-rank test (recommended by Demšar, 2006) rather than just looking for non-overlapping confidence intervals. (A paired-test sketch appears at the end of this document.)
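A minimal sketch of how Rules 2, 5, and 6 fit together in scikit-learn; the logistic-regression baseline, the search space, and the toy dataset are illustrative assumptions rather than anything prescribed in the lecture.

```python
# Sketch: dummy floor, Macro F1 scoring, and a tuned baseline on toy data.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     RandomizedSearchCV)

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Rules 2 and 5: a dummy baseline scored with Macro F1, never the accuracy default.
floor = cross_val_score(DummyClassifier(strategy="most_frequent"),
                        X, y, cv=cv, scoring="f1_macro")

# Rule 6: give the baseline the same random-search tuning budget as the
# proposed model would get.
search = RandomizedSearchCV(LogisticRegression(max_iter=1000),
                            param_distributions={"C": loguniform(1e-3, 1e2)},
                            n_iter=20, cv=cv, scoring="f1_macro",
                            random_state=0)
search.fit(X, y)
print(floor.mean(), search.best_score_)
```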
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Judging a project solely by whether it hits state-of-the-art numbers.
  * **Why it fails:** This incentivizes researchers to hide negative results, overfit their models to the test set, and ignore whether the metrics actually measure understanding.
  * **Warning sign:** A paper with high numbers but no error analysis, weak baselines, or opaque hyperparameter-search details.
* **Pitfall:** Using standard K-Fold validation on small datasets.
  * **Why it fails:** The choice of K dictates the size of your training data: in 3-fold CV you train on 67% of the data, in 10-fold on 90%. If K is small, the model might fail simply due to data starvation rather than architectural flaws.
  * **Warning sign:** Highly variable performance metrics depending on which value of K is chosen.
* **Pitfall:** Ignoring data structure during splits.
  * **Why it fails:** Standard random splits destroy structural dependencies. If evaluating a dialogue agent across users, a random split might put User A's first sentence in train and User A's second sentence in test, leaking user-specific quirks.
  * **Warning sign:** High cross-validation scores that plummet in real-world deployment. (Fix: use `LeavePGroupsOut` in scikit-learn.)
* **Pitfall:** Reporting overlapping confidence intervals to claim two models perform the same.
  * **Why it fails:** Confidence intervals measure variance across runs, but if Model A beats Model B by a tiny margin on *every single split*, the difference can still be statistically significant even though the overall intervals overlap.
  * **Warning sign:** Relying on simple mean/variance tables to dismiss a baseline without running paired statistical tests such as Wilcoxon (see the sketch at the end of this document).

## 6. Key Quote / Core Insight

"We will never evaluate a project based on how 'good' the results are. We are not subject to the constraints of publication venues that favor positive evidence. We evaluate your project based on the appropriateness of the metrics, the strength of the methods, and the extent to which you are open and clear-sighted about the limits of its findings."

## 7. Additional Resources & References

* **Resource:** `scikit-learn` - **Type:** Software Library - **Relevance:** Essential toolkit for robust evaluation. Key tools referenced: `DummyClassifier`, `DummyRegressor`, `StratifiedKFold`, `GridSearchCV`, `RandomizedSearchCV`, `cross_validate`, `LeavePGroupsOut`.
* **Resource:** `scikit-optimize` - **Type:** Software Library - **Relevance:** Recommended for conducting guided/Bayesian searches through hyperparameter grids.
* **Resource:** Schwartz et al. (2017) - **Type:** Research Paper - **Relevance:** Cited as a key example of building "Task-Specific Baselines" in the Story Cloze task, proving models could guess endings without reading the story context.
* **Resource:** Rajkomar et al. (2018) - **Type:** Research Paper - **Relevance:** Cited to demonstrate extreme hyperparameter optimization (tuning with Google Vizier for >201,000 GPU hours), highlighting why pragmatic random sampling is necessary for normal budgets.
* **Resource:** Demšar (2006) - **Type:** Research Paper - **Relevance:** Recommended literature for understanding how to statistically compare classifiers across multiple datasets, specifically advocating for the Wilcoxon signed-rank test.
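To close, here is a minimal sketch of the paired comparison recommended in Rule 7 and by Demšar (2006). The per-split scores below are fabricated purely to show the mechanics of the test; they are not results from the lecture.

```python
# Sketch: paired Wilcoxon signed-rank comparison of two models across splits.
import numpy as np
from scipy.stats import wilcoxon

# Macro F1 for models A and B on the same 10 random splits (paired by split).
scores_a = np.array([0.781, 0.762, 0.793, 0.770, 0.784,
                     0.801, 0.758, 0.772, 0.790, 0.779])
scores_b = scores_a - np.linspace(0.004, 0.013, 10)  # B loses by a small but
                                                     # consistent margin

# Mean +/- std intervals for A and B overlap here, yet because A wins on every
# paired split, the signed-rank test still reports a significant difference.
print(scores_a.mean(), scores_a.std(), scores_b.mean(), scores_b.std())
res = wilcoxon(scores_a, scores_b)
print(res.statistic, res.pvalue)
```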