# PyTorch Primitives: Memory and Compute Accounting for Deep Learning
**Video Category:** Programming Tutorial / Machine Learning Engineering
## 0. Video Metadata
**Video Title:** PyTorch Primitives: Memory and Compute Accounting
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~31 minutes
## 1. Core Summary (TL;DR)
This lecture breaks down the foundational PyTorch primitives required to build and train deep learning models from scratch, with a rigorous focus on resource accounting. It solves the critical engineering problem of inefficient hardware utilization by teaching developers how to calculate the exact memory footprint and computational cost (FLOPs) of tensor operations. By mastering these low-level mechanics, from understanding floating-point bit allocations to tracing tensor memory strides, engineers can accurately estimate training times, prevent out-of-memory errors, and optimize their models for maximum hardware throughput.
## 2. Core Concepts & Frameworks
* **Concept:** Tensors -> **Meaning:** The fundamental building blocks in PyTorch for storing parameters, gradients, optimizer states, and activations. Under the hood, a tensor is a pointer to a contiguous block of allocated memory (a 1D array), coupled with metadata (shape and stride) that dictates how to mathematically interpret and access that memory. -> **Application:** Used to represent all data within a neural network; understanding their memory structure is required for efficient data manipulation without unnecessary copying (see the sketch after this list).
* **Concept:** Floating-Point Precision (FP32, FP16, BF16) -> **Meaning:** The data types used to store tensor values, defined by how their bits are divided between the sign, exponent, and fraction. FP32 (32 bits) is the standard for precision, while lower-precision formats like BF16 (16 bits) save memory by truncating the fraction bits while maintaining the exponent range of FP32. -> **Application:** Used in Mixed Precision Training, where memory-heavy activations are stored in BF16 to save VRAM and increase speed, while critical parameters and optimizer states remain in FP32 to maintain numerical stability.
* **Concept:** Floating-Point Operations (FLOPs) -> **Meaning:** A measure of computational work, defined as a basic operation like addition or multiplication. In deep learning, matrix multiplications dominate FLOP counts. -> **Application:** Used to estimate the total computational cost of training a model (e.g., calculating that a forward and backward pass takes roughly 6 FLOPs per parameter per token).
* **Concept:** Model FLOPs Utilization (MFU) -> **Meaning:** The ratio of the actual FLOPs per second achieved by a model during training to the theoretical maximum "promised" FLOPs per second of the hardware (e.g., an Nvidia H100). -> **Application:** Used as a primary metric for hardware efficiency; an MFU above 0.5 (50%) indicates good utilization, while lower numbers signal that the model is bottlenecked by memory bandwidth or communication overhead rather than compute limits.
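A minimal sketch of the storage, stride, and dtype behavior described above; it assumes a recent PyTorch install, runs on CPU, and the shapes are illustrative.

```python
import torch

# A tensor is a pointer into 1D storage plus shape/stride metadata.
x = torch.zeros(4, 8)                        # dtype defaults to float32
print(x.numel() * x.element_size())          # 32 elements * 4 bytes = 128 bytes

# Views share memory: slicing and .view() only change the metadata.
y = x.view(8, 4)
print(x.data_ptr() == y.data_ptr())          # True -- same underlying buffer
x[0, 0] = 100.0
print(y[0, 0].item())                        # 100.0 -- mutation is visible in both

# Strides: element [i, j] lives at storage offset i * stride[0] + j * stride[1].
m = torch.arange(16).view(4, 4)
print(m.stride())                            # (4, 1)
print(m[1, 2].item())                        # 6, i.e. offset 1*4 + 2*1 (values here equal their offsets)

# Dynamic range: float16 has 5 exponent bits and underflows for tiny values,
# while bfloat16 keeps float32's 8 exponent bits and preserves the magnitude.
print(torch.tensor([1e-8], dtype=torch.float16))   # tensor([0.], dtype=torch.float16)
print(torch.tensor([1e-8], dtype=torch.bfloat16))  # ~1e-8, the value survives
```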
## 3. Evidence & Examples (Hyper-Specific Details)
* **Napkin Math for Training Time:** The speaker calculates the time to train a 70-billion-parameter model on 15 trillion tokens using 1024 Nvidia H100 GPUs (worked through in the code sketch after this list).
* Formula: `Total FLOPs = 6 * (70e9 parameters) * (15e12 tokens)`.
Hardware assumption: the H100 datasheet promises ~1979 TeraFLOPs for FP16/BF16, but that figure assumes structured sparsity, so the realistic dense rate is roughly half (~989 TFLOPs); an assumed MFU of 0.5 then halves the achieved throughput again.
* Result: The calculation yields approximately 144 days of continuous training.
* **Napkin Math for Max Model Size:** The speaker calculates the largest model that can fit on a node of 8 H100 GPUs (each with 80GB of HBM memory) using the AdamW optimizer.
* Memory pool: `8 * 80e9 bytes`.
* Bytes per parameter (AdamW in FP32): 16 bytes total (4 for the parameter, 4 for the gradient, 4 for the optimizer momentum, 4 for the optimizer variance).
* Result: `(8 * 80e9) / 16` = 40 Billion parameters (Note: this rough estimate ignores activation memory, which depends on batch size and sequence length).
* **Tensor Memory Calculation:** A simple tensor created via `x = torch.zeros(4, 8)` defaults to FP32. It has 32 elements, each taking 4 bytes, resulting in a total memory footprint of 128 bytes. Conversely, a single feedforward matrix in GPT-3 (`12288 x (12288 * 4)`) stored in FP32 takes up roughly 2.3 gigabytes of memory (about half that in FP16).
* **Dynamic Range and Underflow Example:** The speaker demonstrates the risk of FP16 by creating a tensor with a small value: `torch.tensor([1e-8], dtype=torch.float16)`. Because FP16 only allocates 5 bits to the exponent, it cannot represent numbers this small, and the value underflows to `0.0`. Doing the same with `bfloat16` retains the value because BF16 uses 8 bits for the exponent (matching FP32).
* **Tensor Stride Mechanics:** Visualizing a 4x4 matrix, the speaker shows that moving to the next row (dimension 0) requires skipping 4 elements in the underlying 1D storage array (stride = 4). Moving to the next column (dimension 1) requires skipping 1 element (stride = 1). Finding the element at index `[1, 2]` involves the math: `1 * stride[0] + 2 * stride[1] = 1 * 4 + 2 * 1 = index 6`.
* **Tensor Views vs. Copies:** Calling `y = x[0]` or `y = x.view(3, 2)` does not copy the tensor data. The code `assert same_storage(x, y)` passes. Mutating `x[0][0] = 100` simultaneously mutates `y[0][0]`.
* **Linear Model FLOPs Calculation:** For a linear model `y = x @ W`, where `x` is shape `(B, D)` and `W` is shape `(D, K)`, the forward pass requires `2 * B * D * K` FLOPs (one multiplication and one addition per element). The backward pass requires exactly twice that amount (`4 * B * D * K`). This establishes the general rule that training requires ~6 FLOPs per parameter per token.
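A minimal sketch of the napkin math above, plugging in the figures quoted in this section (the halving of the datasheet number reflects its sparsity assumption; real runs will differ):

```python
# --- Training time for a 70B-parameter model on 15T tokens across 1024 H100s ---
total_flops = 6 * 70e9 * 15e12            # ~6 FLOPs per parameter per token
h100_peak = 1979e12 / 2                   # datasheet BF16 figure assumes sparsity; halve for dense
mfu = 0.5                                 # assume 50% Model FLOPs Utilization
achieved_flops_per_sec = 1024 * h100_peak * mfu
days = total_flops / achieved_flops_per_sec / 86400   # 86400 seconds per day
print(f"{days:.0f} days")                 # ~144 days

# --- Largest model that fits on one 8x80GB H100 node with AdamW in FP32 ---
node_bytes = 8 * 80e9
bytes_per_param = 4 + 4 + 4 + 4           # parameter + gradient + momentum + variance
max_params = node_bytes / bytes_per_param
print(f"{max_params / 1e9:.0f}B parameters")  # ~40B (ignores activation memory)

# --- Forward/backward FLOPs for a linear model y = x @ W ---
B, D, K = 1024, 4096, 4096
forward_flops = 2 * B * D * K             # one multiply + one add per output element
backward_flops = 2 * forward_flops        # gradients w.r.t. both x and W
print(forward_flops + backward_flops)     # 6 * B * D * K in total
```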
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Explicitly move tensors to the GPU** - Tensors are created in CPU RAM by default. To utilize hardware acceleration, you must explicitly transfer them across the PCIe bus to GPU High Bandwidth Memory (HBM) using `x = x.to("cuda:0")` or by defining the device at creation: `torch.zeros(32, 32, device="cuda:0")`.
* **Rule 2: Use BFloat16 (BF16) for matrix multiplications** - For forward and backward passes in deep learning, switch from FP32 to BF16. BF16 cuts memory usage in half while maintaining the same dynamic range as FP32, preventing the underflow/overflow crashes common with standard FP16.
* **Rule 3: Keep optimizer states and parameters in FP32** - While activations can be calculated in BF16, the model's parameters and the optimizer's running averages must be stored and updated in FP32 (Mixed Precision Training). Updating parameters in BF16 lacks the fractional resolution required for tiny gradient updates, causing the model to stagnate.
* **Rule 4: Use Einops instead of chained reshapes** - Do not use traditional PyTorch methods like `x.transpose(-2, -1).view(...)` for complex dimensional changes, as keeping track of unnamed dimensions in your head easily leads to silent bugs. Use the `einops` library to explicitly name dimensions: `einops.rearrange(x, "batch seq heads hidden -> batch seq (heads hidden)")` (see the sketch after this list).
* **Rule 5: Enforce memory contiguity before viewing** - If you perform an operation that changes the tensor's stride metadata without moving the data (like `.transpose()`), the memory is no longer contiguous. You must call `.contiguous()` before calling `.view()`, which forces PyTorch to allocate new memory and physically rearrange the data.
* **Rule 6: Set manual seeds for reproducibility** - Randomness occurs in parameter initialization, dropout, and data shuffling. To effectively debug models, fix the randomness by calling `torch.manual_seed(seed)`, `np.random.seed(seed)`, and `random.seed(seed)` at the start of your script.
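A minimal sketch of Rules 4-6, assuming the `einops` package is installed; the tensor shapes and the seed value are illustrative.

```python
import random
import numpy as np
import torch
import einops

# Rule 6: pin down all three sources of randomness up front.
seed = 0
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)

x = torch.randn(2, 3, 4, 8)   # illustrative (batch, seq, heads, hidden) activations

# Rule 5: transpose only rewrites stride metadata, so the result is no longer
# contiguous; .view() refuses to flatten it until the data is physically copied.
t = x.transpose(-2, -1)
print(t.is_contiguous())                       # False
merged_manual = t.contiguous().view(2, 3, -1)  # works only after the explicit copy

# Rule 4: einops names every dimension, so the intent is explicit and mistakes
# in dimension order fail loudly instead of silently producing wrong shapes.
merged = einops.rearrange(x, "batch seq heads hidden -> batch seq (heads hidden)")
print(merged.shape)                            # torch.Size([2, 3, 32])
```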
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Assuming `.transpose()` or `.view()` creates a new, independent tensor. -> **Why it fails:** These operations only alter the metadata (shape/stride) while pointing to the exact same underlying memory block. -> **Warning sign:** Modifying a value in the reshaped tensor unexpectedly mutates the original data source.
* **Pitfall:** Trusting the advertised "Peak FLOPs" on hardware specification sheets. -> **Why it fails:** Marketing materials (like Nvidia's 1979 TFLOPs for the H100) often quote numbers based on optimal conditions that don't apply to standard training, such as using FP8 data types or assuming 50% structured sparsity in the matrices. -> **Warning sign:** Calculating your Model FLOPs Utilization (MFU) yields a depressingly low percentage (e.g., 20%) because the denominator is based on an unachievable marketing metric.
* **Pitfall:** Using FP16 for large-scale model training. -> **Why it fails:** FP16 only allocates 5 bits to the exponent, giving it a tiny dynamic range. During training, small gradients will underflow to zero, and large activations will overflow to infinity. -> **Warning sign:** The training loss suddenly spikes to `NaN` or collapses to `0.0`.
* **Pitfall:** Loading entire large datasets into CPU memory at once. -> **Why it fails:** Language modeling datasets can be terabytes in size (e.g., LLaMA data is 2.8TB), which vastly exceeds the RAM of standard compute nodes. -> **Warning sign:** The training script crashes with a CPU Out-Of-Memory (OOM) error before training even begins. (Solution: serialize to numpy arrays and use `np.memmap` to lazy-load data from disk; a minimal sketch follows this list.)
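A minimal sketch of the `np.memmap` pattern from the last pitfall; the file name `tokens.bin` and the `uint16` dtype are hypothetical stand-ins for a token array serialized ahead of time.

```python
import numpy as np

# Map the serialized token file into virtual memory: pages are read from disk
# only when a slice is actually touched, so the dataset never has to fit in RAM.
tokens = np.memmap("tokens.bin", dtype=np.uint16, mode="r")  # hypothetical file

seq_len, batch_size = 1024, 8
starts = np.random.randint(0, len(tokens) - seq_len, size=batch_size)
# Each slice pulls in only the pages it needs; stacking copies them into a batch.
batch = np.stack([np.asarray(tokens[s : s + seq_len]) for s in starts])
print(batch.shape)  # (8, 1024)
```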
## 6. Key Quote / Core Insight
"Efficiency is the name of the game. To be efficient, you have to know exactly how many FLOPs you are actually expending. When these numbers get large, these directly translate into dollars, and you want that cost to be as small as possible."
## 7. Additional Resources & References
* **Resource:** Assignment 1 Handout - **Type:** Course Material - **Relevance:** Contains a detailed mathematical description of the Transformer architecture required for the course.
* **Resource:** "The Illustrated Transformer" - **Type:** Blog Post/Article - **Relevance:** Recommended as visual reading material to understand the conceptual overview of the Transformer model.
* **Resource:** "The Illustrated GPT-2" - **Type:** Blog Post/Article - **Relevance:** Recommended as visual reading material to understand decoder-only architectures.
* **Resource:** `einops` (Einstein Operations) - **Type:** Python Library - **Relevance:** Demonstrated as the superior, less bug-prone method for manipulating and reshaping tensors with named dimensions.
* **Resource:** Nvidia H100 Tensor Core GPU Datasheet - **Type:** Hardware Specification - **Relevance:** Used to demonstrate how to find and interpret promised FLOPs per second based on specific data types and sparsity assumptions.