# Neural Dependency Parsing, Language Models, and Recurrent Neural Networks (RNNs)

**Video Category:** Natural Language Processing Tutorial

## 📋 0. Video Metadata

**Video Title:** Lecture 5: Language Models and Recurrent Neural Networks
**YouTube Channel:** Stanford ENGINEERING
**Publication Date:** Not shown in video
**Video Duration:** ~80 minutes

## 📝 1. Core Summary (TL;DR)

This lecture bridges the gap between neural dependency parsing and the foundational concepts of language modeling, ultimately introducing the architecture of Recurrent Neural Networks (RNNs). It details how moving from sparse, indicator-based features to dense, continuous representations enables non-linear classifiers that significantly improve both parsing accuracy and computation speed. The presentation then deconstructs traditional n-gram language models, exposing their severe sparsity and storage limitations, and positions the RNN as a solution that handles variable-length text by applying shared weights across time steps.

## 2. Core Concepts & Frameworks

* **Indicator Features vs. Distributed Representations:**
  -> **Meaning:** Indicator features are sparse, binary representations (e.g., 1 if a specific word precedes a verb, 0 otherwise) used in traditional ML, resulting in massive, incomplete feature spaces. Distributed representations are dense, continuous vectors (embeddings) of much lower dimensionality in which semantically similar items lie geometrically close together.
  -> **Application:** Used in neural dependency parsing to replace millions of sparse features with a compact, dense input vector (~1000 dimensions), eliminating the feature extraction that previously consumed roughly 95% of parsing time.
* **n-gram Language Model:**
  -> **Meaning:** A probabilistic model that predicts the next word under a Markov assumption: the probability of the next word depends *only* on the preceding $n-1$ words.
  -> **Application:** Historically used for tasks like predictive typing and speech recognition, with probabilities estimated by counting word-sequence frequencies in a large training corpus (a count-based sketch follows this list).
* **Recurrent Neural Network (RNN):**
  -> **Meaning:** A family of neural architectures designed to process sequential data by maintaining a hidden state that is updated step by step. The defining characteristic is that the exact same weight matrices ($W_h$ and $W_e$) are applied at every single time step.
  -> **Application:** Used to build neural language models that can process arbitrary-length text without the model size growing with the input length, solving the fixed-window limitation of standard feed-forward networks.
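To make the count-based estimation concrete, here is a minimal Python sketch of a trigram model over a tiny invented corpus; the corpus, the function names, and the choice of trigrams are illustrative assumptions rather than the lecture's code. It also exhibits the two sparsity failure modes detailed in the next section: an unseen n-gram yields probability 0, and an unseen context leaves the estimate undefined.

```python
from collections import Counter

def train_trigram_counts(tokens):
    """Count trigrams and their bigram contexts from a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    return trigrams, bigrams

def trigram_prob(w1, w2, w3, trigrams, bigrams):
    """P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).

    Returns None when the bigram context was never seen (the
    undefined-denominator case) and 0.0 when only the full trigram
    is unseen: the sparsity problems smoothing and backoff address.
    """
    context_count = bigrams[(w1, w2)]
    if context_count == 0:
        return None  # denominator is zero: probability undefined
    return trigrams[(w1, w2, w3)] / context_count

# Tiny toy corpus (not the Reuters corpus mentioned in the lecture).
corpus = "the students opened their books the students opened their minds".split()
tri, bi = train_trigram_counts(corpus)

print(trigram_prob("opened", "their", "books", tri, bi))  # 0.5: seen once out of two contexts
print(trigram_prob("opened", "their", "exams", tri, bi))  # 0.0: unseen trigram
print(trigram_prob("closed", "their", "books", tri, bi))  # None: unseen context
```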
## 3. Evidence & Examples (Hyper-Specific Details)

* **Dependency Parser Performance Comparison:** A data table demonstrated the superiority of neural approaches over traditional parsing. MaltParser achieved 89.8 UAS (Unlabeled Attachment Score) at a speed of 469 sentences/sec. MSTParser achieved a higher 91.4 UAS but was extremely slow at 10 sent/sec. The neural approach (Chen & Manning 2014) achieved 92.0 UAS at 654 sent/sec, providing the best balance of high accuracy and fast inference.
* **Graph-based Neural Dependency Parser (Dozat and Manning 2017):** This model was cited as achieving state-of-the-art results (95.74 UAS, 94.08 LAS on the PTB WSJ SD 3.3 corpus) by utilizing a biaffine scoring model over a neural sequence model. However, it was noted to be significantly slower than simple neural transition-based parsers because it must compute scores for all $n^2$ possible dependencies in a sentence of length $n$.
* **Non-linear Decision Boundaries (ConvNetJS Visualization):** Visualizations built with Andrej Karpathy's ConvNetJS showed that traditional ML classifiers (like a plain softmax) draw strictly linear decision boundaries and fail to separate complex, intertwined data distributions (plotted as green and red dots). A neural network warps the original feature space so that the final linear softmax layer can effectively draw non-linear boundaries in the original space and isolate the data points correctly.
* **Dropout Regularization Method (Srivastava et al. 2014):** During training, randomly set 50% of the inputs to each neuron to 0 at each instance of evaluation (in SGD training). At test time, halve all model weights, because each neuron now receives input from roughly twice as many active units as it did during training. This acts as strong, feature-dependent regularization that prevents feature co-adaptation.
* **n-gram Sparsity Example:** To illustrate sparsity, the phrase "the students opened their ___" was used. If a 4-gram model evaluates the sequence "students opened their w" for some candidate word w and that sequence never occurred in the training data, the estimated probability is exactly 0. Worse, if the 3-gram context "students opened their" never occurred, the denominator becomes 0 and the calculation is mathematically undefined without smoothing or backoff techniques.
* **Trigram Text Generation Coherence Failure:** A trigram language model trained on a 1.7-million-word Reuters financial news corpus generated this text: *"today the price of gold per ton , while production of shoe lasts and shoe industry , the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks , sept 30 end primary 76 cts a share ."* While locally grammatical (each window of three words fits together), the overall sentence is entirely incoherent, showing that short n-grams fail to model long-range semantic structure.
* **RNN Weight Sharing Equation:** In the simple RNN architecture, the hidden state at step $t$ is defined as $h^{(t)} = \sigma(W_h h^{(t-1)} + W_e x^{(t)} + b_1)$. The matrices $W_h$ (hidden weight matrix) and $W_e$ (embedding weight matrix) are identical at every time step $t$, enforcing symmetry and keeping the number of parameters fixed regardless of input length (see the sketch after this list).
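The weight-sharing equation above can be written out in a few lines. The NumPy sketch below is a minimal illustration, assuming made-up dimensions, an arbitrary small uniform initialization, and random inputs; it simply loops the same $W_h$, $W_e$, and $b_1$ over every time step, so a 5-word input and a 500-word input use exactly the same parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_dim, embed_dim = 8, 4  # illustrative sizes, not from the lecture

rng = np.random.default_rng(0)
W_h = rng.uniform(-0.1, 0.1, (hidden_dim, hidden_dim))  # hidden-to-hidden weights
W_e = rng.uniform(-0.1, 0.1, (hidden_dim, embed_dim))   # embedding-to-hidden weights
b_1 = np.zeros(hidden_dim)

def rnn_forward(embeddings):
    """Apply h^(t) = sigma(W_h h^(t-1) + W_e x^(t) + b_1) over a sequence.

    The same W_h, W_e, and b_1 are reused at every time step, so the
    parameter count stays fixed no matter how long the input is.
    """
    h = np.zeros(hidden_dim)      # h^(0): initial hidden state
    states = []
    for x_t in embeddings:        # one step per input word embedding
        h = sigmoid(W_h @ h + W_e @ x_t + b_1)
        states.append(h)
    return states

# A sequence of 5 word embeddings; the same loop handles any length.
sequence = rng.normal(size=(5, embed_dim))
hidden_states = rnn_forward(sequence)
print(len(hidden_states), hidden_states[-1].shape)  # 5 (8,)
```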
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Use dense feature representations instead of sparse indicators** - When building NLP classifiers, represent words, POS tags, and dependency labels as $d$-dimensional dense vectors (embeddings) and concatenate them to form the input layer, avoiding the massive computational overhead of checking millions of binary conditions (see the sketch after this list).
* **Rule 2: Default to ReLU non-linearities for deep networks** - Use the Rectified Linear Unit ($f(z) = \max(0, z)$) as your primary activation function. It trains quickly and lets gradients flow backward well, without the saturation problems inherent to logistic (sigmoid) or tanh functions.
* **Rule 3: Initialize weights with small random values to break symmetry** - Never initialize weight matrices to zero. Use uniform distributions (e.g., Uniform(-r, r)) or Xavier initialization (variance inversely proportional to fan-in and fan-out) so that neurons in the same layer do not compute identical updates and fail to specialize.
* **Rule 4: Apply Dropout to prevent feature co-adaptation** - Randomly drop out neurons (typically 50% in hidden layers, less in input layers) during training to force the network to learn robust, independent features rather than relying on specific, brittle combinations of inputs.
* **Rule 5: Use the Adam optimizer as a reliable starting default** - Plain Stochastic Gradient Descent (SGD) works but requires meticulous manual tuning of the learning rate. For complex networks, default to Adam (or similar adaptive optimizers like Adagrad or RMSprop) to scale parameter updates automatically.
* **Rule 6: Vectorize your operations using matrices** - Avoid `for` loops when computing neural network operations. Looping over word vectors and concatenating them takes drastically longer (e.g., 639 μs) than executing a single matrix multiplication (e.g., 53.8 μs); vectorizing yields a one-to-two order of magnitude speedup, especially on GPUs.
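Rules 1 through 5 can be combined in one small classifier in the spirit of the dense input layer described above. The PyTorch sketch below is a hypothetical illustration: the vocabulary size, the 18/18/12 word/POS/label feature split, the embedding and hidden dimensions, and the number of output transitions are all made-up values, not the lecture's or the assignment's reference implementation.

```python
import torch
import torch.nn as nn

# All sizes below are illustrative assumptions, not lecture numbers.
VOCAB, N_POS, N_LABELS = 5000, 45, 40
EMB_DIM, HIDDEN, N_TRANSITIONS = 50, 200, 3
N_WORD_FEATS, N_POS_FEATS, N_LABEL_FEATS = 18, 18, 12   # hypothetical feature split
N_FEATS = N_WORD_FEATS + N_POS_FEATS + N_LABEL_FEATS

class DenseFeatureClassifier(nn.Module):
    """Rule 1: look up dense embeddings for word/POS/label features and
    concatenate them into one input vector instead of sparse indicators."""

    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB, EMB_DIM)
        self.pos_emb = nn.Embedding(N_POS, EMB_DIM)
        self.label_emb = nn.Embedding(N_LABELS, EMB_DIM)
        self.hidden = nn.Linear(N_FEATS * EMB_DIM, HIDDEN)
        self.dropout = nn.Dropout(p=0.5)                 # Rule 4: 50% dropout on the hidden layer
        self.out = nn.Linear(HIDDEN, N_TRANSITIONS)
        nn.init.xavier_uniform_(self.hidden.weight)      # Rule 3: Xavier init breaks symmetry
        nn.init.xavier_uniform_(self.out.weight)

    def forward(self, word_ids, pos_ids, label_ids):
        # Concatenate all feature embeddings into a single dense input layer (Rule 1).
        x = torch.cat(
            [self.word_emb(word_ids), self.pos_emb(pos_ids), self.label_emb(label_ids)],
            dim=1,
        ).flatten(start_dim=1)
        h = torch.relu(self.hidden(x))                   # Rule 2: ReLU non-linearity
        return self.out(self.dropout(h))                 # unnormalized transition scores

model = DenseFeatureClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Rule 5: Adam as the default optimizer

# One toy batch of 32 parser configurations.
words = torch.randint(0, VOCAB, (32, N_WORD_FEATS))
pos = torch.randint(0, N_POS, (32, N_POS_FEATS))
labels = torch.randint(0, N_LABELS, (32, N_LABEL_FEATS))
scores = model(words, pos, labels)                       # shape: (32, N_TRANSITIONS)
```

Because the embedding lookups and the hidden layer are single batched matrix operations, this design also follows Rule 6: no per-feature Python loops are needed.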
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Relying on indicator features in transition-based parsing
  -> **Why it fails:** Creates massive, sparse feature vectors that are often incomplete (missing contexts from the training data) and highly computationally expensive, consuming over 95% of parsing time just to compute features.
  -> **Warning sign:** Extremely slow training/inference and plateauing accuracy despite aggressive feature engineering.
* **Pitfall:** Increasing n in n-gram models to capture more context
  -> **Why it fails:** Exponentially worsens the sparsity problem (most longer n-grams will have zero counts in the corpus) and causes model storage to explode (counts must be stored for every observed combination).
  -> **Warning sign:** Model probabilities constantly collapse to zero, necessitating aggressive backoff algorithms; the model's memory footprint exceeds available RAM.
* **Pitfall:** Using a fixed-window neural language model
  -> **Why it fails:** A fixed window is fundamentally too small to capture long-range dependencies, and enlarging the window grows the weight matrix $W$ unsustainably. Furthermore, input words in different positions are multiplied by completely different weights, destroying symmetry.
  -> **Warning sign:** The model fails on tasks requiring coreference resolution across paragraphs or cannot handle inputs longer than the predefined window size.
* **Pitfall:** Assuming simple RNNs perfectly access distant context
  -> **Why it fails:** While theoretically capable of using information from many steps back, standard RNN computation is inherently sequential and slow, and in practice gradients vanish, making it exceptionally difficult to access information from many steps back.
  -> **Warning sign:** The RNN performs well on recent context but "forgets" subjects or topics established at the beginning of a long document.

## 6. Key Quote / Core Insight

"The core conceptual breakthrough of a Recurrent Neural Network is applying the exact same weight matrices repeatedly at every time step, allowing the model to process sequential inputs of arbitrary length without any increase to the model's parameter size."

## 7. Additional Resources & References

* **Resource:** Chen and Manning 2014 - **Type:** Academic Paper - **Relevance:** Introduced the first successful, fast neural dependency parser, utilizing dense word, POS, and dependency embeddings.
* **Resource:** Dozat and Manning 2017 (also Dozat, Qi, and Manning 2017) - **Type:** Academic Paper - **Relevance:** Revived graph-based dependency parsing by utilizing a biaffine scoring model on top of a neural sequence model, achieving the highest accuracy scores.
* **Resource:** Dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov 2012/JMLR 2014) - **Type:** Academic Paper - **Relevance:** The foundational paper introducing Dropout as a regularization method to prevent feature co-adaptation in neural networks.
* **Resource:** ConvNetJS by Andrej Karpathy - **Type:** Web Tool/Software - **Relevance:** Visual demonstration tool used to show how neural networks warp input spaces so that a linear final layer can produce non-linear decision boundaries.
* **Resource:** PyTorch - **Type:** Software Framework - **Relevance:** The primary deep learning framework used for Assignment 3 to build the neural dependency parser.