# Homework and bake-off: Sentiment analysis

**Video Category:** Programming Tutorial / Educational Walkthrough

## 📋 0. Video Metadata

**Video Title:** Homework and bake-off: Sentiment analysis
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~14 minutes

## 📝 1. Core Summary (TL;DR)

The video provides a comprehensive walkthrough of a Jupyter notebook for a supervised sentiment analysis assignment focused on cross-domain generalization. The core challenge is to train a ternary sentiment classifier on movie reviews (the SST-3 dataset) and evaluate it on a new, distinct dataset of restaurant reviews. The walkthrough covers the complete machine learning lifecycle: establishing simple and neural baselines, conducting systematic error analysis, and maintaining scientific integrity by never optimizing on test data.

## 2. Core Concepts & Frameworks

* **Cross-Domain Sentiment Analysis**
  * **Meaning:** Training a machine learning model on text from one domain (e.g., movie reviews) and evaluating its accuracy on text from a completely different domain (e.g., restaurant reviews).
  * **Application:** Tests a model's ability to learn generalized sentiment features rather than memorize domain-specific vocabulary.
* **Ternary Classification**
  * **Meaning:** A sentiment analysis framing in which each input text is assigned exactly one of three labels: positive, negative, or neutral.
  * **Application:** Used in this assignment, where reviews are not merely binary (good/bad) but include a large volume of neutral or objective statements.
* **Honor Code in ML Evaluation**
  * **Meaning:** The strict scientific principle of never evaluating on, tuning against, or exploring test datasets during model development; they are treated as genuinely unseen data.
  * **Application:** Students may not touch the public SST-3 test set or the bake-off test set during development, so that the final evaluation metric accurately reflects real-world generalization.
* **Macro-F1 Score**
  * **Meaning:** An evaluation metric computed by taking the F1 score of each class independently and then averaging the scores without weighting.
  * **Application:** The primary bake-off metric, because it gives every class equal weight regardless of its support, preventing models from scoring well merely by guessing the majority class in a heavily skewed dataset.
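The macro-F1 motivation above is easy to verify directly. Below is a minimal scikit-learn sketch (the gold-label counts mirror the `bakeoff_dev` skew reported in Section 3; the majority-class guesser is a fabricated strawman for illustration): high accuracy coexists with a poor macro-F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels mirroring the bakeoff_dev skew reported below:
# 1019 neutral, 777 positive, 545 negative.
gold = ["neutral"] * 1019 + ["positive"] * 777 + ["negative"] * 545

# A degenerate model that always guesses the majority class.
pred = ["neutral"] * len(gold)

print(accuracy_score(gold, pred))             # ~0.44: looks passable
print(f1_score(gold, pred, average="macro"))  # ~0.20: exposes the failure
```

Only the neutral class earns a non-zero F1 here, so the unweighted mean punishes the model for ignoring the two minority classes.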
## 3. Evidence & Examples (Hyper-Specific Details)

* **SST-3 Training Dataset Size & Subtrees:** The standard training set contains 8,544 full-sentence examples. The video demonstrates that setting `include_subtrees=True` expands the dataset to 159,274 examples by including labeled sub-phrases from the parsed trees, greatly increasing training volume at the cost of compute time (see the loading sketch after this list).
* **Bake-off Dataset Structure:** The new assessment dataset consists of restaurant reviews split into dev and test sets. Examples shown include `example_id: 57`, `"I would recommend that you make reservations in advance."`, labeled 'neutral', and `example_id: 590`, `"We were welcomed warmly."`, labeled 'positive'. The `is_subtree` feature is always `0` because these are full sentences only.
* **Label Distribution Skew:** The `bakeoff_dev` label distribution is shown explicitly: 1019 neutral, 777 positive, and 545 negative. The instructor notes that this severe imbalance will heavily influence optimization choices.
* **Softmax Baseline Performance:** A simple baseline using unigram features (splitting on whitespace) and a `LogisticRegression` wrapper yields a macro-F1 of ~0.53 on the SST-3 dev set but drops sharply to ~0.31 on the new bake-off dev set, a clear demonstration of the cross-domain generalization penalty (approximated in a sketch after this list).
* **RNN Baseline Experiment:** An RNN baseline (`TorchRNNClassifier`) runs for 49 epochs before early stopping triggers. It achieves a macro-F1 of ~0.41, roughly comparable to the shallow softmax baseline on this task.
* **Targeted Error Analysis Demonstration:** The instructor demonstrates querying an `analysis` DataFrame for specific failure modes. The code filters for the 168 examples where `(analysis['predicted_x'] == analysis['gold'])` (softmax correct), `(analysis['predicted_y'] != analysis['gold'])` (RNN incorrect), and `(analysis['gold'] == 'positive')`, yielding a targeted subset of sentences for investigating why the neural model failed where the linear model succeeded (a pandas sketch of this query appears after this list).
* **Tokenization Disparities (PTB History):** The instructor points out that the SST dataset has idiosyncratic tokenization rules (e.g., specific white-spacing) because it inherits conventions from the Penn Treebank (PTB), circa 2005. This creates a hidden mismatch with the modern tokenization used in the new restaurant-review data.
* **DynaSent Dataset as External Data:** The instructor explicitly suggests training on the DynaSent dataset. It contains restaurant reviews and was labeled with exactly the same protocols as the new bake-off dev data, making it especially valuable for this cross-domain task.
* **BERT Encoding Implementation:** Students must use the Hugging Face `transformers` library to write a function (`hf_cls_phi`) that runs text through a `BertModel` and returns the final output representation of the `[CLS]` token as a dense vector, to be used as features for downstream classifiers (a sketch follows below).
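As a concrete companion to the subtree expansion described in the first bullet, here is a minimal loading sketch. It assumes the course's `sst` reader module from the CS224u codebase and that `train_reader` accepts the `include_subtrees` flag named in the video; verify the exact signature against the notebook you were given, and note that `SST_HOME` is a hypothetical path.

```python
import sst  # course reader module distributed with the assignment (assumed)

SST_HOME = "data/sentiment"  # hypothetical path; point at your local copy

# Full sentences only: 8,544 training examples.
train_sentences = sst.train_reader(SST_HOME, include_subtrees=False)

# With labeled sub-phrases from the parse trees: 159,274 examples,
# at a correspondingly higher training cost.
train_subtrees = sst.train_reader(SST_HOME, include_subtrees=True)

print(len(train_sentences), len(train_subtrees))
```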
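The softmax baseline in the fourth bullet can be approximated without the course's `sst.experiment` wrapper. A self-contained sketch with toy data (the real run trains on SST-3 and evaluates on both dev sets; all strings below are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy corpus standing in for SST-3 train and bake-off dev.
train_texts = ["a gorgeous film", "a tedious mess", "it exists"]
train_labels = ["positive", "negative", "neutral"]
dev_texts = ["we were welcomed warmly", "the service was a mess"]
dev_labels = ["positive", "negative"]

# Unigram features obtained by splitting on whitespace, as in the video.
vectorizer = CountVectorizer(tokenizer=str.split, token_pattern=None)
X_train = vectorizer.fit_transform(train_texts)
X_dev = vectorizer.transform(dev_texts)

# Multiclass LogisticRegression acts as the softmax classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, train_labels)

preds = model.predict(X_dev)
print(f1_score(dev_labels, preds, average="macro"))
```

The ~0.53 to ~0.31 drop reported in the video comes from exactly this kind of pipeline, trained on SST-3 and scored on the restaurant reviews.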
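The intersectional query in the error-analysis bullet translates almost directly into pandas. A runnable sketch with a toy stand-in for the merged `analysis` DataFrame (in the assignment it is built by joining each model's predictions with the gold labels; the `sentence` column name is an assumption):

```python
import pandas as pd

# Toy stand-in for the merged predictions DataFrame built in the video.
analysis = pd.DataFrame({
    "sentence":    ["We were welcomed warmly.", "Service was slow.", "Food was fine."],
    "gold":        ["positive", "negative", "neutral"],
    "predicted_x": ["positive", "negative", "neutral"],  # softmax baseline
    "predicted_y": ["neutral",  "negative", "neutral"],  # RNN baseline
})

# Cases the softmax baseline gets right but the RNN gets wrong,
# restricted to gold-positive examples (168 such rows in the video's demo).
mask = (
    (analysis["predicted_x"] == analysis["gold"])
    & (analysis["predicted_y"] != analysis["gold"])
    & (analysis["gold"] == "positive")
)
print(analysis[mask])
```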
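For the `hf_cls_phi` task in the final bullet, here is a minimal sketch of the Hugging Face side. The checkpoint name and the function body are assumptions; the assignment's exact contract may differ.

```python
import torch
from transformers import BertModel, BertTokenizer

# Hypothetical checkpoint choice; the assignment may specify another one.
weights_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(weights_name)
model = BertModel.from_pretrained(weights_name)

def hf_cls_phi(text):
    """Map a string to the final-layer representation of its [CLS] token."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # [CLS] occupies the first position of the final hidden states.
    return outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()

vec = hf_cls_phi("We were welcomed warmly.")
print(vec.shape)  # (768,) for bert-base checkpoints
```

The resulting dense vectors can then feed any downstream classifier, such as the `LogisticRegression` baseline above.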
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Enforce absolute separation of test data.** Confine all model development, hyperparameter tuning, feature engineering, and error analysis exclusively to the training and development splits. Never load test data until the final system is frozen.
* **Rule 2: Augment training data using tree structures.** When using parsed datasets like SST, set `include_subtrees=True` in your data reader. This extracts labeled sub-phrases, drastically increasing the amount of training data and often improving model performance despite higher compute costs.
* **Rule 3: Disable vectorizers for sequence models.** When integrating neural networks (like RNNs) into a standard experiment framework (like `sst.experiment`), set `vectorize=False`. This passes raw token sequences to the model so it can handle its own embedding lookups, rather than sparse bag-of-words matrices.
* **Rule 4: Mix target-domain data into training.** To improve cross-domain transfer, augment your primary training dataset (e.g., movie reviews) with a small sample from the target domain (e.g., the restaurant-review dev set). This gives the model initial traction on new vocabulary and phrasing (a minimal sketch appears at the end of these notes).
* **Rule 5: Audit tokenization rules across domains.** Before assuming a model lacks reasoning capabilities, manually inspect the tokens. Resolve idiosyncratic spacing, punctuation, and tokenization rules inherited from historical datasets (like PTB) so they match the format of the target evaluation data.
* **Rule 6: Conduct intersectional error analysis.** Merge the predictions of multiple models (e.g., a simple baseline and a deep learning model) into a single DataFrame alongside the gold labels. Programmatically isolate instances where the simple model succeeds but the complex model fails in order to identify architectural blind spots.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Developing or running error analysis on the public SST-3 test set.
  * **Why it fails:** It violates scientific integrity, essentially "cheating" by letting the model design overfit to the exact examples it will be evaluated on, which destroys the validity of the benchmark.
  * **Warning sign:** A model scores exceptionally high on the SST-3 test set but fails catastrophically on the unseen bake-off test set.
* **Pitfall:** Relying on plain accuracy for imbalanced data.
  * **Why it fails:** The dev sets are heavily skewed (e.g., nearly twice as many neutral examples as negative ones). A model can achieve high overall accuracy simply by defaulting to the majority class, masking complete failure on the minority classes.
  * **Warning sign:** High overall accuracy, but the classification report shows near-zero precision or recall for the 'negative' class.
* **Pitfall:** Passing `DictVectorizer` outputs to an RNN.
  * **Why it fails:** Scikit-learn vectorizers produce sparse feature matrices that discard word order. Sequence models like RNNs require token sequences that they can map to continuous embeddings and process in order.
  * **Warning sign:** The code throws dimensionality errors, or the RNN performs no better than a unigram logistic regression because all sequential information has been lost.

## 6. Key Quote / Core Insight

"Much of the scientific integrity of our field depends on people adhering to this honor code, that is doing no development on what is test data, because test data is our only chance to get a really clear look at how our systems are generalizing to new examples and new experiences."

## 7. Additional Resources & References

* **Resource:** Stanford Sentiment Treebank (SST-3)
  * **Type:** Dataset
  * **Relevance:** The primary training dataset, containing parsed, ternary-labeled movie reviews whose tokenization conventions trace back to the Penn Treebank.
* **Resource:** DynaSent
  * **Type:** Dataset
  * **Relevance:** Recommended external dataset containing restaurant reviews labeled with exactly the same protocol as the target bake-off data; useful for cross-domain training.
* **Resource:** Hugging Face `transformers` library
  * **Type:** Software Tool
  * **Relevance:** The required Python library for instantiating `BertModel` and `BertTokenizer` to extract contextualized vector representations.
* **Resource:** `sst.experiment`
  * **Type:** Python Framework
  * **Relevance:** A custom codebase provided by the course to standardize training, evaluation, and error analysis without boilerplate, requiring models to conform to a `predict_one` method contract.
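Appendix: the domain-mixing idea from Rule 4 as a runnable pandas sketch. The frames below are toy stand-ins for the SST-3 training set and the bake-off dev set; the shared `sentence`/`label` column names and the sampling scheme are assumptions, not the course's prescribed recipe.

```python
import pandas as pd

# Toy stand-ins; in the assignment these come from the SST-3 and
# bake-off readers.
sst_train = pd.DataFrame({
    "sentence": ["A gorgeous film.", "A tedious mess."],
    "label": ["positive", "negative"],
})
bakeoff_dev = pd.DataFrame({
    "sentence": ["We were welcomed warmly.", "Make reservations in advance.",
                 "The service was slow.", "Portions were fine."],
    "label": ["positive", "neutral", "negative", "neutral"],
})

# Pull a small target-domain slice into training...
sample = bakeoff_dev.sample(n=2, random_state=42)
mixed_train = pd.concat([sst_train, sample], ignore_index=True)

# ...and keep the remainder as an untouched evaluation slice, so you
# never assess on examples you trained on.
bakeoff_eval = bakeoff_dev.drop(sample.index)
print(len(mixed_train), len(bakeoff_eval))
```

Holding out the untouched remainder preserves an honest development signal, consistent with Rule 1.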