# Copy of CS224N PyTorch Tutorial.ipynb

**Video Category:** Programming Tutorial

## 📋 0. Video Metadata

* **Video Title:** Copy of CS224N PyTorch Tutorial.ipynb
* **YouTube Channel:** Stanford Engineering
* **Publication Date:** Not shown in video
* **Video Duration:** ~47 minutes

## 📝 1. Core Summary (TL;DR)

This tutorial provides a foundational walkthrough of PyTorch, focusing on its similarities to NumPy and its central role in building deep learning models. It introduces tensors as the fundamental data structures, explains how to manipulate them with vectorized operations and broadcasting, and demonstrates how PyTorch's autograd engine performs automatic differentiation. Finally, it shows how to compose these pieces with the `torch.nn` module to build, train, and optimize a Multi-Layer Perceptron without writing manual backpropagation code.

## 2. Core Concepts & Frameworks

* **PyTorch Tensors:**
  -> **Meaning:** Multi-dimensional arrays that act as the fundamental building blocks in PyTorch, closely resembling NumPy arrays but optimized for GPU acceleration.
  -> **Application:** Used to represent all data inputs, hidden states, and outputs within a neural network (e.g., an image is a 3D tensor: channels x height x width).
* **Broadcasting:**
  -> **Meaning:** A set of rules by which PyTorch automatically expands tensor dimensions during element-wise operations or batched matrix multiplications, provided the shapes are compatible (matching dimensions, or a "dummy" dimension of size 1).
  -> **Application:** Allows adding a bias vector to an entire batch of matrices without explicitly replicating the bias vector in memory.
* **Autograd (Automatic Differentiation):**
  -> **Meaning:** PyTorch's engine that automatically computes and stores gradients for tensor operations during the backward pass (`.backward()`).
  -> **Application:** Eliminates the need to derive and hand-code the chain rule for backpropagation when training a neural network.
* **`torch.nn` Module:**
  -> **Meaning:** A built-in library of pre-defined, high-level neural network components such as linear layers, activation functions, and loss functions.
  -> **Application:** Used to quickly construct network architectures (like a Multi-Layer Perceptron) by stacking layers, with learnable weights and biases initialized automatically.

## 3. Evidence & Examples (Hyper-Specific Details)

* **Tensor Creation from Lists:** The speaker demonstrates converting a nested Python list `[[1, 2, 3], [4, 5, 6]]` into a tensor using `torch.tensor(data)`. This creates a 2x3 tensor.
* **Explicit Data Typing:** Using `torch.tensor(data, dtype=torch.float32)` ensures the tensor holds floating-point numbers. For example, inputting `[[1, 2], [3, 4]]` with this dtype yields `tensor([[1., 2.], [3., 4.]])`.
* **Utility Tensor Generation:** The video shows `torch.zeros((2, 5))` to create a 2x5 matrix of zeros, `torch.ones((3, 4))` for a 3x4 matrix of ones, and `torch.arange(1, 10)` to create the 1D tensor `tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])`.
* **Element-wise Math:** Taking the `torch.arange(1, 10)` tensor (named `rr`) and adding 2 (`rr + 2`) outputs `tensor([3, 4, 5, 6, 7, 8, 9, 10, 11])`.
* **Matrix Multiplication (`matmul`):** Multiplying tensor `a` (shape 3x2) and `b` (shape 2x4) using `a.matmul(b)` or the shorthand `a @ b` produces a new tensor of shape 3x4. (A runnable sketch of these basics follows in the next bullet.)
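* **Runnable sketch of the tensor basics above:** a minimal reconstruction of the creation, arithmetic, and `matmul` examples, assuming a standard PyTorch install; variable names other than `rr`, `a`, and `b` are illustrative, not taken verbatim from the notebook.

  ```python
  import torch

  data = [[1, 2, 3], [4, 5, 6]]
  x = torch.tensor(data)                                     # 2x3 integer tensor
  xf = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)   # forces float entries

  zeros = torch.zeros((2, 5))        # 2x5 matrix of zeros
  ones = torch.ones((3, 4))          # 3x4 matrix of ones
  rr = torch.arange(1, 10)           # tensor([1, 2, ..., 9])
  print(rr + 2)                      # element-wise add: tensor([3, ..., 11])

  a = torch.randn(3, 2)              # contents are illustrative; only shapes matter here
  b = torch.randn(2, 4)
  print(a.matmul(b).shape)           # torch.Size([3, 4]); `a @ b` is equivalent
  ```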
* **Reshaping with `view`:** A 1D tensor of the numbers 1 through 15 (shape `[15]`) is reshaped into a 5x3 matrix using `rr.view(5, 3)`, resulting in `tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]])`.
* **NumPy Interoperability:** A NumPy array `arr = np.array([[1, 0, 5]])` is converted to a PyTorch tensor via `data = torch.tensor(arr)` and converted back to NumPy with `new_arr = data.numpy()`.
* **Collapsing Dimensions (Summation):** Given a 5x7 tensor `data`, calling `data.sum(dim=0)` collapses the rows, producing a 1D tensor of shape `[7]` containing column sums. Calling `data.sum(dim=1)` collapses the columns, producing a shape of `[5]`.
* **List Indexing:** On a 5x3 tensor `matr`, `matr[[0, 2, 4]]` extracts the 0th, 2nd, and 4th rows, outputting a 3x3 tensor.
* **Scalar Extraction:** Executing `torch.tensor(1.).item()` converts the single-element tensor into the raw Python float `1.0`.
* **Autograd Demonstration:**
  - `x = torch.tensor([2.], requires_grad=True)`
  - `y = x * x * 3` (represents $3x^2$)
  - `y.backward()` computes the derivative ($6x$).
  - `x.grad` correctly outputs `tensor([12.])`.
  - Re-running the equation `z = x * x * 3` and calling `z.backward()` without clearing the gradients changes `x.grad` to `tensor([24.])` due to accumulation.
* **`nn.Linear` Example:** A linear layer `linear = nn.Linear(4, 2)` takes an input tensor of shape `2x3x4` (where the trailing 4 matches the layer's input dimension) and outputs a tensor of shape `2x3x2`.
* **Complete Training Loop:** The video shows a 10-epoch training loop on dummy data using `model = MultiLayerPerceptron(5, 3)`, `adam = optim.Adam(model.parameters(), lr=1e-1)`, and `loss_function = nn.BCELoss()`. The printed training loss decreases monotonically from `0.6496` at Epoch 0 to `0.0014` at Epoch 9.

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Always print `.shape` when debugging** - Do not rely on visual inspection or error messages when stacking multiple operations. Insert `print(tensor.shape)` statements to verify that dimensions match your conceptual model, as PyTorch may silently broadcast or reshape data in ways you did not intend.
* **Rule 2: Pad jagged sequences to create uniform tensors** - PyTorch cannot efficiently process jagged inputs (e.g., sentences of varying lengths). Pad all inputs in a batch to the maximum length using zeros or dummy tokens so the batch forms a fixed rectangular matrix.
* **Rule 3: Use the `dim` argument to specify the dimension to collapse** - In reduction functions like `.sum()` or `.mean()`, the `dim` parameter names the dimension that will be eliminated: `dim=0` removes rows (summing down columns), and `dim=1` removes columns (summing across rows).
* **Rule 4: Call `optimizer.zero_grad()` before every backward pass** - Inside your training loop, explicitly clear old gradients before calling `loss.backward()`. Failure to do so causes gradients from previous batches to accumulate, corrupting the weight update.
* **Rule 5: Extract scalars with `.item()` to save memory** - When recording metrics like loss or accuracy, use `loss.item()` to pull the raw numerical value out of the computation graph. Storing the raw tensors in a list for logging keeps their computation graphs alive and leaks memory.
* **Rule 6: Use `nn.Sequential` for linear architectures** - Instead of manually defining and passing inputs through individual layers in the `forward` function, group simple sequential layers (e.g., Linear -> ReLU -> Linear -> Sigmoid) inside an `nn.Sequential` block during `__init__`. (A compact sketch applying Rules 4-6 follows this list.)
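The gradient-accumulation behavior behind Rule 4 can be reproduced directly with the autograd demo from the Evidence section. A minimal sketch, assuming a standard PyTorch install; only `x`, `y`, and `z` are named in the video, and the explicit `x.grad.zero_()` call is added here for illustration:

```python
import torch

x = torch.tensor([2.], requires_grad=True)
y = x * x * 3        # y = 3x^2
y.backward()         # dy/dx = 6x
print(x.grad)        # tensor([12.])

z = x * x * 3
z.backward()         # the new gradient is ADDED to the old one
print(x.grad)        # tensor([24.]) -- accumulation, not replacement

x.grad.zero_()       # manual reset; inside a training loop, call optimizer.zero_grad()
```

Below is a compact training-loop sketch that applies Rules 4-6 with the model size, optimizer, and loss named above (`MultiLayerPerceptron(5, 3)`, Adam with `lr=1e-1`, `nn.BCELoss`). The hidden-layer arrangement, the output size of 1, and the dummy data shapes are assumptions for illustration, not details confirmed by the video:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Rule 6: group simple sequential layers in nn.Sequential inside __init__.
class MultiLayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),   # assumed single-output binary classifier
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.model(x)

# Dummy data (shapes are assumptions: 10 examples, 5 features, binary labels).
X = torch.randn(10, 5)
y = torch.randint(0, 2, (10, 1)).float()

model = MultiLayerPerceptron(5, 3)
adam = optim.Adam(model.parameters(), lr=1e-1)
loss_function = nn.BCELoss()

for epoch in range(10):
    adam.zero_grad()                 # Rule 4: clear old gradients first
    y_pred = model(X)
    loss = loss_function(y_pred, y)
    loss.backward()                  # autograd fills in .grad for every parameter
    adam.step()                      # apply the weight update
    print(f"Epoch {epoch}: training loss {loss.item():.4f}")  # Rule 5: .item() for logging
```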
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Forgetting to zero out gradients (`optimizer.zero_grad()`) in the training loop.
  -> **Why it fails:** PyTorch's default behavior is to accumulate (sum) gradients across `.backward()` calls, which supports architectures like Recurrent Neural Networks.
  -> **Warning sign:** The training loss oscillates wildly, fails to converge, or explodes to NaN, and inspecting `parameter.grad` shows values growing with the number of epochs.
* **Pitfall:** Misunderstanding the `dim` argument in reduction operations.
  -> **Why it fails:** Developers intuitively assume `dim=0` means "give me the row sums." It actually means "collapse the 0th dimension (rows)," which produces column sums.
  -> **Warning sign:** The resulting tensor has a shape that does not match the expected downstream matrix multiplication, triggering a shape mismatch error later in the code.
* **Pitfall:** Relying exclusively on error messages to catch shape mismatches.
  -> **Why it fails:** PyTorch's broadcasting rules can silently expand or reshape tensors whenever the shapes are technically compatible, even if the result violates the intended logic of the network.
  -> **Warning sign:** The code runs without crashing, but the model fails to learn or outputs garbage predictions.
* **Pitfall:** Attempting to create tensors from jagged lists.
  -> **Why it fails:** PyTorch relies on highly optimized C++ and CUDA matrix operations that require rigid, fixed-dimensional arrays.
  -> **Warning sign:** Passing a list of unequal-length lists to `torch.tensor()` throws an error or forces a highly inefficient fallback.

## 6. Key Quote / Core Insight

"Printing out the shapes of all of your tensors is probably your best resource when it comes to debugging. It's kind of one of the hardest things to intuit exactly what's going on once you start stacking a lot of different operations together."

## 7. Additional Resources & References

* **Resource:** Google Colab - **Type:** Tool - **Relevance:** The cloud-based notebook environment used by the speaker to run PyTorch code interactively.
* **Resource:** PyTorch Documentation (`pytorch.org`) - **Type:** Website - **Relevance:** Recommended by the speaker for checking exact semantics of complex operations like broadcasting rules and batched matrix multiplications.
* **Resource:** NumPy - **Type:** Python Library - **Relevance:** Frequently cited as the conceptual predecessor; knowing NumPy semantics translates directly to understanding PyTorch tensor manipulation.