# The Systems Complexity of Training Huge LLMs: Parallelism Basics

**Video Category:** Computer Science / Machine Learning Systems

## 📋 0. Video Metadata

**Video Title:** Lecture 8: Parallelism Basics
**YouTube Channel:** Stanford Engineering (CS336)
**Publication Date:** Not shown in video
**Video Duration:** ~2 hours

## 📝 1. Core Summary (TL;DR)

Training modern frontier Large Language Models (LLMs) is impossible on a single machine due to severe compute and memory bottlenecks. To scale, the industry has shifted to treating the entire datacenter as the fundamental unit of compute, which requires complex distributed-systems engineering. This lecture details the major parallelization paradigms (Data, Pipeline, Tensor, Expert, and Sequence parallelism) and how they are optimally combined to maximize GPU utilization while hiding the latency of network communication.

## 2. Core Concepts & Frameworks

* **Intra-node vs. Inter-node Parallelism:**
  -> **Meaning:** Intra-node refers to communication between GPUs within a single server chassis over ultra-fast, high-bandwidth links (e.g., NVLink). Inter-node refers to communication between different servers over slower network switches (e.g., InfiniBand).
  -> **Application:** Network-heavy strategies like Tensor Parallelism are restricted to intra-node connections (typically 8 GPUs), while Pipeline and Data Parallelism scale across the slower inter-node connections.
* **Collective Communication Primitives:**
  -> **Meaning:** Algorithmic networking operations used to distribute data across nodes without manual packet routing. Key primitives include *All-Reduce* (sum all gradients and return the sum to every node), *Reduce-Scatter* (sum gradients but give each node only one chunk of the result), and *All-Gather* (collect chunks from all nodes into a full tensor).
  -> **Application:** The equivalence `All-Reduce = Reduce-Scatter + All-Gather` lets memory-saving algorithms (like ZeRO) shard data without increasing overall communication cost (see the first sketch after this list).
* **ZeRO (Zero Redundancy Optimizer) / FSDP:**
  -> **Meaning:** A suite of memory optimization techniques for Data Parallelism. Instead of every GPU holding a full copy of the model state (weights, gradients, optimizer states), ZeRO shards these components across GPUs.
  -> **Application:** Stage 1 shards optimizer states; Stage 2 shards gradients; Stage 3 (FSDP) shards parameters, gathering them via All-Gather only for the duration of each layer's computation, which lets massive models fit into cluster memory.
* **Pipeline Parallelism (PP):**
  -> **Meaning:** Slicing the model vertically by layer depth, assigning different layers to different GPUs (e.g., GPU 1 gets layers 1-10, GPU 2 gets layers 11-20).
  -> **Application:** Used to span models across multiple physical machines. It requires "micro-batching" to keep all GPUs busy and minimize "pipeline bubbles" (idle time spent waiting for data to pass through earlier stages).
* **Tensor Parallelism (TP):**
  -> **Meaning:** Slicing the model horizontally by matrix width. A single matrix multiplication (such as an Attention or MLP projection) is split so that multiple GPUs compute fractions of it simultaneously (see the second sketch after this list).
  -> **Application:** Used to parallelize massive individual layers that would otherwise cause memory or compute bottlenecks, but requires extremely fast All-Reduce operations after every layer.
* **Expert Parallelism (EP):**
  -> **Meaning:** Used in Mixture-of-Experts (MoE) architectures, where different "expert" neural networks are placed on different GPUs and tokens are routed over the network to the correct expert.
  -> **Application:** Allows scaling a model's parameter count to hundreds of billions without proportionally increasing the FLOPs required for inference or training.
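The Reduce-Scatter + All-Gather identity above is easy to verify numerically. Below is a minimal single-process simulation in plain PyTorch, with each "rank" held as one entry of a Python list; a real run would use `torch.distributed` collectives instead, and the function names here are illustrative.

```python
import torch

WORLD = 4   # simulated number of GPUs
NUMEL = 8   # per-rank gradient size (must be divisible by WORLD)

def all_reduce(grads):
    """Every rank ends up with the full element-wise sum."""
    total = torch.stack(grads).sum(dim=0)
    return [total.clone() for _ in grads]

def reduce_scatter(grads):
    """Every rank ends up with only its 1/WORLD chunk of the sum."""
    total = torch.stack(grads).sum(dim=0)
    return list(total.chunk(WORLD))

def all_gather(chunks):
    """Every rank reassembles the full tensor from everyone's chunks."""
    full = torch.cat(chunks)
    return [full.clone() for _ in chunks]

grads = [torch.randn(NUMEL) for _ in range(WORLD)]   # one gradient per "GPU"

via_all_reduce = all_reduce(grads)
via_rs_then_ag = all_gather(reduce_scatter(grads))

assert all(torch.allclose(a, b) for a, b in zip(via_all_reduce, via_rs_then_ag))
print("All-Reduce == Reduce-Scatter + All-Gather on every rank")
```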
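Tensor Parallelism can likewise be checked on one machine. This sketch splits a two-layer MLP in the Megatron style (first weight by columns, second by rows) and shows that summing the per-shard partial outputs, which is exactly what the All-Reduce performs across GPUs, reproduces the unsharded result. The shapes and variable names are illustrative, not taken from the lecture.

```python
import torch

torch.manual_seed(0)
TP = 2                           # simulated tensor-parallel degree
d_model, d_ff = 16, 64

x = torch.randn(4, d_model)      # activations: (batch, d_model)
W1 = torch.randn(d_model, d_ff)  # up-projection
W2 = torch.randn(d_ff, d_model)  # down-projection

reference = torch.nn.functional.gelu(x @ W1) @ W2   # unsharded MLP

# Column-parallel W1: each "GPU" owns d_ff/TP output columns.
# Row-parallel W2: each "GPU" owns the matching d_ff/TP input rows.
W1_shards = W1.chunk(TP, dim=1)
W2_shards = W2.chunk(TP, dim=0)

# Each rank computes independently; because GELU is element-wise, no
# communication is needed until the final sum (the All-Reduce).
partials = [torch.nn.functional.gelu(x @ w1) @ w2
            for w1, w2 in zip(W1_shards, W2_shards)]
output = sum(partials)           # stands in for the All-Reduce across ranks

assert torch.allclose(output, reference, atol=1e-4)
print("sharded MLP matches the unsharded result")
```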
## 3. Evidence & Examples (Hyper-Specific Details)

* **Memory Cost Baseline Calculation:** In FP16/BF16 training with the Adam optimizer, every parameter requires 16 bytes of memory (2 bytes for the parameter, 2 for the gradient, 4 for the FP32 master weight, 4 for Adam's first moment, and 4 for its second moment). A 7B-parameter model therefore needs 112 GB just for the model state, exceeding the 80 GB capacity of an Nvidia A100 GPU (see the arithmetic sketch after this list).
* **TPU vs. GPU Hardware Topologies:** Google TPUs use a toroidal mesh (a 3D grid in which chips connect to their direct neighbors, wrapping around at the edges), which is cost-effective for dense models with predictable routing. Nvidia GPU clusters use "fat trees" (all-to-all spine switches). Google's recent TPUv8i shifted to an all-to-all topology specifically to handle the stochastic, heavy token routing required by modern MoE models.
* **Groq SRAM Architecture:** Groq uses SRAM exclusively instead of HBM to achieve massive bandwidth. Because SRAM lacks capacity, a single model requires hundreds of chips, and delivering equivalent performance costs roughly 4x the power consumption of a comparable Nvidia system.
* **ZeRO Stage 1 Cost Equivalence:** Naive Data Parallelism uses one All-Reduce for gradients (traffic cost: 2x the parameter count). ZeRO Stage 1 instead uses a Reduce-Scatter for gradients, updates the sharded weights locally, and then an All-Gather for the updated parameters (cost: 1x + 1x = 2x). It cuts per-parameter memory from $4+K$ bytes to $4 + K/N_{gpu}$ (with $K=12$ for mixed-precision Adam) for exactly zero extra communication cost.
* **Llama 3 405B Hardware Failures:** Over a 54-day pre-training period on a massive cluster, Meta recorded 419 unexpected interruptions: 30.1% were GPU failures, 17.2% were HBM3 memory failures, 12.9% were software bugs, and 8.4% were network switch/cable issues.
* **Llama 3 405B Parallelism Configuration:** Meta trained the 405B model with TP=8, PP=16, DP=128, and Context Parallelism (CP)=1, with a batch size of 32, achieving 43% Model FLOPs Utilization (MFU).
* **Gemma 2 (27B) Parallelism Configuration:** Google trained Gemma 2 with ZeRO-3 (FSDP) alongside Tensor Parallelism (TP=2), but no Pipeline Parallelism (PP=0).
* **Mixtral 8x22B Parallelism Configuration (via Megatron):** To train this MoE model efficiently, Nvidia recommends TP=4, PP=4, and Expert Parallelism (EP)=8.
* **DeepSeek v3 Parallelism Configuration:** To handle an extreme MoE architecture (671B parameters), DeepSeek used massive Expert Parallelism (EP=64) alongside Pipeline Parallelism (PP=16) and a surprisingly low Tensor Parallelism (TP=1).
* **Activation Recomputation (Checkpointing):** Per-layer activation memory scales as $sbh(34 + 5as/h)$ for sequence length $s$, micro-batch size $b$, hidden size $h$, and $a$ attention heads. The quadratic attention term ($5as/h$) can be dropped from memory entirely by recomputing the attention matrices during the backward pass (as Flash Attention does).
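A short script reproduces the arithmetic behind the memory-baseline and ZeRO bullets above. `K = 12` follows the ZeRO paper's count of master weights plus both Adam moments; `N_GPU = 8` is an assumed example, not a value from the lecture.

```python
# Mixed-precision Adam keeps 16 bytes of state per parameter, so a 7B model
# needs ~112 GB before activations -- more than one 80 GB A100.

PARAMS = 7e9
BYTES_PER_PARAM = {
    "bf16 parameter": 2,
    "bf16 gradient": 2,
    "fp32 master weight": 4,
    "fp32 Adam 1st moment": 4,
    "fp32 Adam 2nd moment": 4,
}
total = sum(BYTES_PER_PARAM.values())                       # 16 bytes/param
print(f"{total} B/param -> {PARAMS * total / 1e9:.0f} GB for a 7B model")

K, N_GPU = 12, 8
zero1 = 2 + 2 + K / N_GPU      # Stage 1: shard only the optimizer state
zero3 = (2 + 2 + K) / N_GPU    # Stage 3: shard params, grads, and optimizer
print(f"ZeRO-1: {zero1:.1f} B/param, ZeRO-3: {zero3:.1f} B/param ({N_GPU} GPUs)")
```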
## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Constrain Tensor Parallelism to intra-node links** - Never scale TP beyond the boundary of a single physical server (typically 8 GPUs). TP requires constant All-Reduce communication that will choke on slower inter-node InfiniBand connections.
* **Rule 2: Always enable ZeRO Stage 1** - Sharding the optimizer state is mathematically "free" in network bandwidth compared to naive Data Parallelism. There is no reason not to use it.
* **Rule 3: Overlap communication with computation in FSDP (ZeRO-3)** - When sharding parameters, exploit the hardware's independent compute and communication streams: while computing the forward/backward pass for layer $N$, simultaneously initiate the All-Gather that fetches the parameters for layer $N+1$ (a stream-overlap sketch follows this list).
* **Rule 4: Manage pipeline bubbles with massive micro-batching** - With Pipeline Parallelism, the number of micro-batches must be substantially larger than the number of pipeline stages: the ratio of idle "bubble" time to useful compute is $(N_{stages}-1)/N_{microbatches}$ (a quick calculation follows this list).
* **Rule 5: Decouple parallelisms for MoE models** - Do not apply Tensor Parallelism to the entire model when using Mixture-of-Experts. Apply TP to the dense Attention layers and Expert Parallelism (EP) to the MLP experts to prevent extreme communication bottlenecks.
* **Rule 6: Use activation checkpointing for long context windows** - As sequence length grows, attention activation memory grows quadratically. Discard the intermediate attention matrices after the forward pass and recompute them exactly when needed in the backward pass to prevent out-of-memory (OOM) errors (a checkpointing example follows this list).
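For Rule 3, here is a hand-rolled sketch of the overlap pattern, assuming a CUDA device. Real FSDP implements this internally; here the All-Gather for layer $n+1$ is imitated by an asynchronous host-to-device copy on a side stream while layer $n$'s matmul runs on the default stream. All names (`fetch`, `gather`, the layer sizes) are illustrative.

```python
import torch

def fetch(shard_cpu: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Stand-in for an All-Gather: bring a (pinned) CPU shard onto the GPU.
    # Pinned memory is what makes the copy truly asynchronous.
    return shard_cpu.to(device, non_blocking=True)

def main() -> None:
    if not torch.cuda.is_available():
        print("this sketch needs a CUDA device")
        return
    device = torch.device("cuda")
    compute = torch.cuda.current_stream()
    gather = torch.cuda.Stream()

    # "Sharded" weights live off-GPU; four 1024x1024 layers are illustrative.
    shards = [torch.randn(1024, 1024).pin_memory() for _ in range(4)]
    ready = [torch.cuda.Event() for _ in shards]
    weights = [None] * len(shards)

    with torch.cuda.stream(gather):          # prefetch layer 0's parameters
        weights[0] = fetch(shards[0], device)
        ready[0].record()

    x = torch.randn(64, 1024, device=device)
    for n in range(len(shards)):
        if n + 1 < len(shards):              # kick off the NEXT gather first...
            with torch.cuda.stream(gather):
                weights[n + 1] = fetch(shards[n + 1], device)
                ready[n + 1].record()
        compute.wait_event(ready[n])         # ...but never compute early
        x = x @ weights[n]                   # layer n overlaps with the copy
        weights[n].record_stream(compute)    # mark cross-stream use, then free
        weights[n] = None                    # "re-shard": drop the full copy
    torch.cuda.synchronize()
    print("output:", tuple(x.shape))

main()
```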
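Rule 4's bubble formula is worth sanity-checking; the 16-stage pipeline below is an illustrative value.

```python
# Idle fraction of a GPipe-style schedule is (stages-1)/(stages-1+microbatches);
# the bubble-to-compute ratio quoted in Rule 4 is (stages-1)/microbatches.

def bubble_ratio(stages: int, microbatches: int) -> float:
    """Idle 'bubble' time divided by useful compute time."""
    return (stages - 1) / microbatches

for m in (4, 16, 64, 256):
    print(f"16 stages, {m:>3} micro-batches: bubble/compute = {bubble_ratio(16, m):.2f}")
```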
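Rule 6 maps directly onto PyTorch's built-in `torch.utils.checkpoint`; the tiny MLP below is an illustrative stand-in for a transformer block.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
x = torch.randn(8, 512, requires_grad=True)

# checkpoint() frees the block's intermediate activations after the forward
# pass and re-runs the forward during backward to rebuild them on demand.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print("input grad norm:", x.grad.norm().item())
```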
## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Using naive Data Parallelism for 10B+ parameter models.
  -> **Why it fails:** Keeping full copies of the FP32 master weights and optimizer moments on every single GPU costs 16 bytes per parameter, quickly exceeding 80 GB hardware limits.
  -> **Warning sign:** Immediate OOM errors during optimizer initialization.
* **Pitfall:** Scaling Tensor Parallelism too high (e.g., TP=32 or 64).
  -> **Why it fails:** Slicing matrix multiplications too finely reduces the arithmetic intensity of each operation; the GPUs spend more time waiting on All-Reduce synchronization than doing actual math.
  -> **Warning sign:** Adding more GPUs via TP decreases overall tokens/sec throughput.
* **Pitfall:** Implementing Pipeline Parallelism with a small batch size.
  -> **Why it fails:** The pipeline cannot be filled with enough micro-batches, forcing GPUs to sit idle while the forward or backward pass propagates through earlier stages.
  -> **Warning sign:** Profiling shows each GPU active for only $1/N$ of the total time.
* **Pitfall:** Applying Expert Parallelism (EP) across slow network links.
  -> **Why it fails:** MoE routing requires an all-to-all token dispatch; if tokens must cross inter-node switches to reach their expert, latency spikes.
  -> **Warning sign:** Network wait time dominates the MLP forward pass in profiling charts.

## 6. Key Quote / Core Insight

"The new unit of compute is the data center."

*Rewritten:* To train frontier AI models, we can no longer view computation as an event happening on a single chip. The true unit of compute is the datacenter itself: a massive, fragile supercomputer where the primary engineering challenge is orchestrating data flow across thousands of GPUs to keep them from sitting idle.

## 7. Additional Resources & References

* **Resource:** PyTorch FSDP (Fully Sharded Data Parallel) Tutorial - **Type:** Software Documentation - **Relevance:** The standard implementation of the ZeRO-3 parameter sharding discussed in the lecture.
* **Resource:** ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari et al.) - **Type:** Paper - **Relevance:** The foundational paper for ZeRO Stages 1, 2, and 3.
* **Resource:** Megatron-LM Documentation (Nvidia) - **Type:** Software Library - **Relevance:** Industry standard for combining Tensor, Pipeline, and Expert parallelism.
* **Resource:** Llama 3 405B Technical Report (Meta) - **Type:** Paper - **Relevance:** Contains exact 4D parallelism configurations and hardware failure statistics at scale.
* **Resource:** DeepSeek V3 Technical Report - **Type:** Paper - **Relevance:** Showcases extreme Expert Parallelism (EP=64) routing strategies.
* **Resource:** DeepSpeed (Microsoft) - **Type:** Software Library - **Relevance:** The original library implementing the ZeRO optimization stages.
* **Resource:** Efficient Large-Scale Language Model Training on GPU Clusters (Narayanan et al., 2021) - **Type:** Paper - **Relevance:** Core literature establishing the optimal combination of 3D parallelism techniques.