# GPU Benchmarking, Profiling, and Triton Kernel Programming
**Video Category:** Programming Tutorial / Systems Engineering
## 0. Video Metadata
**Video Title:** Stanford CS336: Benchmarking/profiling + writing kernels
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video
**Video Duration:** ~1 hour 26 minutes
## 1. Core Summary (TL;DR)
This lecture bridges the gap between high-level GPU architecture and practical, high-performance kernel implementation using Python and Triton. It reveals that while modern programming models offer clean abstractions, achieving peak performance requires a deep, mechanical understanding of underlying hardware quirks such as warp scheduling, memory coalescing, bank conflicts, and register limits. By mastering these hardware realities, developers can use techniques like kernel fusion and tiling to eliminate slow memory bottlenecks and dramatically accelerate operations like matrix multiplication and activation functions.
## 2. Core Concepts & Frameworks
* **GPU Memory Hierarchy:** The structured layers of memory that dictate data access speeds.
* *High Bandwidth Memory (HBM):* The main, global memory (e.g., 80GB-192GB). It is large but slow relative to on-chip memory (high access latency).
* *L2 Cache:* Shared across the whole GPU chip, faster than HBM.
* *L1 Cache & Shared Memory:* Local to a Streaming Multiprocessor (SM). Shared memory is explicitly programmer-controlled and incredibly fast, acting as a manual cache to avoid HBM roundtrips.
* *Registers:* The fastest memory, private to individual threads.
* **Programming Model Hierarchy (CUDA/Triton):** The abstraction used to dispatch work to the GPU.
* *Thread:* The smallest unit of execution, processing a specific piece of data.
* *Thread Block (CTA, Cooperative Thread Array):* A group of threads that execute on the same SM and can communicate/synchronize via Shared Memory.
* *Grid:* A collection of Thread Blocks that executes a kernel across the entire GPU.
* **Warps & Lockstep Execution:**
* **Meaning:** Within a thread block, threads are grouped into bundles of 32, called "warps". All 32 threads in a warp must execute the exact same instruction at the exact same time (in lockstep).
* **Application:** If code contains conditional branching (`if/else`) and threads within the same warp take different paths (Control Divergence), the GPU must execute both paths sequentially, masking out the inactive threads, which halves (or worse) processing efficiency.
* **Memory Coalescing:**
* **Meaning:** When 32 threads in a warp access HBM, the hardware attempts to combine their requests into fewer, larger transactions (typically 128-byte cache lines).
* **Application:** Code must be written so that adjacent threads access adjacent memory addresses (e.g., reading a contiguous row). If threads access scattered addresses (e.g., reading a column), the GPU fetches a full 128-byte line for every single element, wasting massive amounts of bandwidth.
* **Shared Memory Bank Conflicts:**
* **Meaning:** Shared memory is physically divided into 32 "banks" (each 4 bytes wide) that can be accessed simultaneously. However, each bank can only serve one address per cycle.
* **Application:** If multiple threads in a warp try to access different addresses that map to the *same* bank, the accesses are serialized (processed one by one), creating a severe bottleneck.
* **Warp Occupancy & Register Pressure:**
* **Meaning:** An SM has a fixed pool of registers (e.g., 65,536 on an H100). The more registers each individual thread uses, the fewer total threads (and warps) can be scheduled simultaneously on that SM.
* **Application:** High register usage lowers "occupancy". Low occupancy limits the SM's ability to hide memory latency by quickly switching to other ready warps.
* **Kernel Fusion:**
* **Meaning:** Combining multiple sequential element-wise operations into a single custom kernel.
* **Application:** Instead of running an `add` kernel (read A, read B, write C to HBM), then a `multiply` kernel (read C, read D, write E to HBM), a fused kernel reads inputs once, keeps intermediate values in fast registers, and writes to HBM only at the very end.
* **Tiling:**
* **Meaning:** Breaking large datasets (like matrices) into smaller blocks ("tiles") that perfectly fit into the fast, local Shared Memory of an SM.
* **Application:** Used heavily in Matrix Multiplication (MatMul). A thread block loads a tile of Matrix A and a tile of Matrix B from slow HBM into fast Shared Memory, computes all possible partial products locally, and only writes the final result back to HBM.
## 3. Evidence & Examples (Hyper-Specific Details)
* **Hardware Evolution Specs (A100 vs H100 vs B200):**
* SM Count: A100 (108), H100 (132), B200 (148).
* Registers per SM: Consistent at 256 KB (65,536 registers).
* HBM Size: A100 (80GB), B200 (up to 192GB).
* HBM Bandwidth: ~2 TB/s on older generations scaling up to ~8 TB/s.
* **Warp Occupancy Calculation Example:**
* *Scenario:* A thread block has 128 threads. Each thread uses 160 registers. The SM has 65,536 available registers and allows a maximum of 64 concurrent warps.
* *Math:* Registers per block = 128 * 160 = 20,480.
* Blocks per SM (limited by registers) = floor(65,536 / 20,480) = 3 blocks.
* Warps scheduled = 3 blocks * (128 threads / 32 threads/warp) = 12 warps.
* *Result:* Occupancy is 12 / 64 = 18.75%. The SM is underutilized because individual threads are too "fat" (using too many registers). A scripted version of this arithmetic appears after this list.
* **Bank Conflict Worst-Case Example:**
* *Scenario:* Matrix data is stored in row-major order in shared memory. 32 threads in a warp decide to read the first column of the matrix.
* *Result:* Because they are reading a column of a 32-column matrix of 4-byte elements, they hit element indices 0, 32, 64, 96, and so on. Since shared memory is divided into 32 banks, every one of those indices maps to Bank 0, so all 32 threads request data from the same bank simultaneously. This 32-way bank conflict fully serializes the read and destroys performance.
* **Wave Quantization (Block Occupancy) Example:**
* *Scenario:* A B200 GPU has 148 SMs. A developer launches a grid with 160 thread blocks.
* *Result:* The scheduler places 148 blocks onto the 148 SMs (Wave 0). Once those finish, it must schedule the remaining 12 blocks (Wave 1). During Wave 1, 136 SMs sit completely idle waiting for the 12 to finish.
* *Fix:* Make the total number of thread blocks a clean multiple of the total SM count to ensure uniform utilization.
* **GeLU Kernel Fusion Profiling:**
* *Naive PyTorch Implementation:* Writing the mathematical formula `0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))` directly in Python.
* *Profiler Output:* The profiler reveals this generates multiple separate, tiny CUDA kernels: `BinaryFunctor` (mul), `UnaryFunctor` (pow), `CUDAFunctor_add`, `tanh_kernel`. Every step forces a read/write to slow HBM.
* *Triton Implementation:* A single compiled Triton kernel (`triton_gelu_kernel`) performs all the math inside the SM registers, reducing HBM reads/writes drastically and achieving the speed of PyTorch's optimized `builtin_gelu`.
* **Triton GeLU Code Walkthrough** (a runnable sketch follows this list):
* Threads must calculate their global position to know which data to fetch.
* `pid = tl.program_id(axis=0)` gets the block ID.
* `start = pid * BLOCK_SIZE` calculates the starting index.
* `offsets = start + tl.arange(0, BLOCK_SIZE)` creates a vector of indices for this block.
* `mask = offsets < num_elements` ensures threads don't read past the end of an irregularly sized array.
* `x = tl.load(x_ptr + offsets, mask=mask)` performs a coalesced load from HBM to registers.
* **Naive MatMul vs Tiled MatMul Arithmetic Intensity:**
* *Naive:* For an MxK matrix times a KxN matrix producing an MxN output, computing each element of C requires reading a full row of A and a full column of B, so HBM reads are on the order of `M * N * K` (each of the `M * N` outputs re-reads `K` elements from A and `K` from B). Arithmetic intensity is essentially `O(1)`, so the operation is bounded completely by memory bandwidth.
* *Tiled (Idealized):* Load a tile of A and a tile of B into shared memory and reuse them for many partial products. In the ideal limit each input element is read from HBM only once (`M*K + K*N` reads) while still performing `M*N*K` multiply-accumulates; with a tile size of `T`, each input element is re-read only about `N/T` (or `M/T`) times. This raises the arithmetic intensity to `O(tile_size)`, making the operation compute-bound rather than memory-bound.
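A scripted version of the occupancy arithmetic above, as a minimal sketch that considers only the register limit (real occupancy is also capped by shared-memory usage and per-SM block limits); the H100-style constants are the figures quoted in the lecture:

```python
def register_limited_occupancy(threads_per_block: int, regs_per_thread: int,
                               regs_per_sm: int = 65_536,
                               max_warps_per_sm: int = 64) -> float:
    regs_per_block = threads_per_block * regs_per_thread      # 128 * 160 = 20,480
    blocks_per_sm = regs_per_sm // regs_per_block             # 65,536 // 20,480 = 3
    warps_per_sm = blocks_per_sm * (threads_per_block // 32)  # 3 * 4 = 12
    return warps_per_sm / max_warps_per_sm                    # 12 / 64 = 0.1875

print(register_limited_occupancy(128, 160))  # 0.1875 -> 18.75% occupancy
```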
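Putting the GeLU walkthrough together, here is a minimal sketch of a fused Triton GeLU kernel plus its launch wrapper. The `BLOCK_SIZE` choice and the `tl.math.tanh` spelling are assumptions (older Triton releases expose tanh under a different module); the lecture's exact code may differ.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def triton_gelu_kernel(x_ptr, y_ptr, num_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                   # which block am I?
    start = pid * BLOCK_SIZE                      # starting index for this block
    offsets = start + tl.arange(0, BLOCK_SIZE)    # vector of indices for this block
    mask = offsets < num_elements                 # guard the ragged tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)       # coalesced HBM -> register load
    # All intermediate math stays in registers; only the final result is stored.
    c = 0.7978845608028654                        # sqrt(2 / pi)
    y = 0.5 * x * (1.0 + tl.math.tanh(c * (x + 0.044715 * x * x * x)))
    tl.store(y_ptr + offsets, y, mask=mask)       # single write back to HBM

def triton_gelu(x: torch.Tensor, BLOCK_SIZE: int = 1024) -> torch.Tensor:
    assert x.is_cuda and x.is_contiguous()
    y = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, BLOCK_SIZE),)          # enough blocks to cover every element
    triton_gelu_kernel[grid](x, y, n, BLOCK_SIZE=BLOCK_SIZE)
    return y
```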
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Always use CUDA Events and Synchronization for Benchmarking** - Never use standard Python `time.time()` around GPU functions without synchronization, because the GPU executes asynchronously. Create `start_event` and `end_event` with `torch.cuda.Event(enable_timing=True)`, call `start_event.record()`, run your function, call `end_event.record()`, and, crucially, `torch.cuda.synchronize()` before reading `start_event.elapsed_time(end_event)`. Do this over multiple iterations (with warm-up runs first) to get a stable average; see the timing harness after this list.
* **Rule 2: Fuse Element-wise Operations** - If you are applying sequential math operations (add, multiply, activation functions) on large tensors, do not chain separate eager-mode PyTorch ops. Write a Triton kernel or use the PyTorch compiler (`torch.compile`) to fuse them into a single kernel and avoid a catastrophic HBM bandwidth bottleneck; see the `torch.compile` sketch after this list.
* **Rule 3: Mask Memory Loads in Triton** - Always generate a boolean mask based on array bounds when loading data in Triton (`mask = offsets < num_elements`). If the tensor size is not a perfect multiple of your `BLOCK_SIZE`, unmasked loads will read out of bounds, producing garbage values or an illegal-memory-access error.
* **Rule 4: Align Thread Blocks with SM Counts** - When determining your grid size (number of thread blocks), try to make it an even multiple of the physical SMs on your target GPU (e.g., multiples of 108, 132, or 148). This prevents "wave quantization" where a small tail of blocks leaves the majority of the GPU idle.
* **Rule 5: Thread Coarsening for Low Overhead** - If your element-wise operation is extremely simple (e.g., just adding two numbers), assigning one thread per element may result in too much scheduling overhead. Implement "thread coarsening" where a single thread processes a small vector of elements (e.g., 4 or 8) sequentially.
* **Rule 6: Use Swizzling to Avoid Bank Conflicts** - When loading matrices into shared memory (especially columns), use index manipulation techniques like "swizzling" (rearranging the logical mapping to physical banks) so that column accesses hit different banks, avoiding serialization.
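A minimal timing harness following Rule 1 (the warm-up and iteration counts are arbitrary choices, not values from the lecture):

```python
import torch

def benchmark(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):                 # warm-up: trigger lazy compilation and caching
        fn(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()                # wait for all queued GPU work before reading timers
    return start.elapsed_time(end) / iters  # elapsed_time reports milliseconds
```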
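And a sketch of Rule 2 using `torch.compile` instead of a hand-written kernel (the tensor size is arbitrary):

```python
import torch

def gelu_tanh(x: torch.Tensor) -> torch.Tensor:
    # Eager mode launches a separate CUDA kernel for each op below,
    # round-tripping the intermediates through HBM every time.
    return 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x**3)))

# torch.compile typically fuses this element-wise chain into a single generated kernel,
# so the intermediates stay on-chip.
gelu_fused = torch.compile(gelu_tanh)

x = torch.randn(16_000_000, device="cuda")
y = gelu_fused(x)   # first call compiles; subsequent calls reuse the cached kernel
```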
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Measuring GPU code execution without a warm-up phase. -> **Why it fails:** The first execution of a CUDA or Triton kernel triggers lazy compilation and setup overhead, making it artificially slow. -> **Warning sign:** The first benchmarking run takes 500ms, while subsequent runs take 2ms.
* **Pitfall:** Writing code that causes Control Divergence. -> **Why it fails:** Warps execute in lockstep. If `Thread 0` takes an `if` path and `Thread 1` takes an `else` path, the SM must execute the `if` instructions for the whole warp (masking out Thread 1), then execute the `else` instructions for the whole warp (masking out Thread 0). -> **Warning sign:** Complex branching logic inside a kernel leads to execution times double or triple what is expected based on arithmetic alone.
* **Pitfall:** Assuming more threads is always better. -> **Why it fails:** If a complex kernel uses a massive number of registers per thread (e.g., 200 registers), the SM will exhaust its register pool quickly, forcing it to schedule very few warps. This low "occupancy" prevents the SM from hiding memory latency by swapping between active warps. -> **Warning sign:** Profiler shows low warp occupancy despite large grid sizes, and kernel execution is stalled waiting for memory.
* **Pitfall:** Processing column-major data with contiguous thread IDs. -> **Why it fails:** This breaks memory coalescing. Thread 0 requests element 0, Thread 1 requests element N. The hardware fetches 128-byte cache lines for *both* requests, using only 4 bytes from each. -> **Warning sign:** Profiling shows high "HBM reads/writes" but very low actual throughput relative to the theoretical max bandwidth.
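A quick way to observe the coalescing pitfall, using the `benchmark` helper sketched in Section 4 (the matrix size is arbitrary): copying a contiguous tensor reads memory in coalesced order, while materializing a transposed view forces strided reads and is typically noticeably slower.

```python
import torch

x = torch.randn(8192, 8192, device="cuda")

row_major_copy = benchmark(lambda: x.clone())            # contiguous, coalesced reads
transpose_copy = benchmark(lambda: x.t().contiguous())   # strided (column-order) reads

print(f"coalesced copy:    {row_major_copy:.3f} ms")
print(f"strided transpose: {transpose_copy:.3f} ms")
```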
## 6. Key Quote / Core Insight
"The programming model provides a beautiful abstraction of the hardware... In principle, you don't need to think about anything else for correctness. But in practice, performance is profoundly sensitive to the hardware. To actually squeeze out performance, you must deeply understand the hardware mechanismsâwarps, occupancy, and bank conflictsâthat sit beneath that abstraction."
## 7. Additional Resources & References
* **Resource:** PyTorch Profiler (`torch.profiler`) - **Type:** Tool - **Relevance:** Essential for diagnosing whether operations are fused, how much time is spent on HBM reads/writes, and identifying the exact underlying CUDA kernels being called (a usage sketch follows this list).
* **Resource:** OpenAI Triton (`triton-lang.org`) - **Type:** Language/Compiler - **Relevance:** The primary tool demonstrated for writing performant, fused kernels at the block-level without writing raw CUDA C++.
* **Resource:** NVIDIA Nsight Systems / Nsight Compute - **Type:** Tool - **Relevance:** Mentioned as the industry-standard deep profiling tools to view detailed hardware execution metrics, warp states, and memory bandwidth utilization.
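As a usage sketch for the `torch.profiler` resource above (the workload and tensor size are arbitrary), this is the kind of trace that exposes the separate mul/add/tanh kernels of an unfused GeLU:

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4096, 4096, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = 0.5 * x * (1.0 + torch.tanh(0.7978845608028654 * (x + 0.044715 * x**3)))
    torch.cuda.synchronize()

# Each underlying CUDA kernel appears as its own row with its device time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```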