# Concurrency, Locks, and Synchronization: From Hardware Primitives to Lock-Free Data Structures

**Video Category:** Computer Science / Systems Programming

## 📋 0. Video Metadata

**Video Title:** Stanford CS149: Parallel Computing - Locks and Synchronization
**YouTube Channel:** Stanford Engineering
**Publication Date:** Fall 2023 (visible on slide watermarks)
**Video Duration:** ~1 hour 15 minutes

## 📝 1. Core Summary (TL;DR)

This lecture breaks down the fundamental challenges of concurrent programming—deadlock, livelock, and starvation—and maps exactly how synchronization primitives are implemented at the hardware-cache level. It reveals that poorly designed locks (like raw `test-and-set`) can cripple system performance by generating massive cache coherence interconnect traffic. By understanding hardware mechanisms like Compare-and-Swap (CAS) and cache invalidation storms, developers can implement more efficient strategies: from Test-and-Test-and-Set and Ticket Locks to fine-grained hand-over-hand locking and, ultimately, non-blocking lock-free data structures.

## 2. Core Concepts & Frameworks

* **Concept:** Deadlock -> **Meaning:** A state where a system has outstanding operations to complete, but no operation can make progress because resources are mutually held and requested. Requires four conditions: mutual exclusion, hold-and-wait, no preemption, and circular wait. -> **Application:** Identifying system freezes where threads are permanently blocked on each other (e.g., circular message passing).
* **Concept:** Livelock -> **Meaning:** A state where a system is executing many operations and states are actively changing, but no thread is making meaningful progress toward completion. -> **Application:** Debugging systems where threads continuously abort and retry transactions without ever succeeding (e.g., two threads constantly backing off and colliding again).
* **Concept:** Starvation -> **Meaning:** The system overall is making progress, but specific processes or threads are indefinitely denied necessary resources. -> **Application:** Tuning scheduling algorithms or priority queues to ensure low-priority tasks eventually execute instead of yielding infinitely to high-priority streams.
* **Concept:** MSI Cache Coherence -> **Meaning:** A protocol (Modified, Shared, Invalid) where a cache must broadcast its intent to write (`BusRdX`) to invalidate all other cached copies of that memory line before modifying it. -> **Application:** Understanding the hidden performance cost of synchronization; if multiple threads write to the same lock variable, the cache line bounces continuously between cores.
* **Concept:** Lock-Free (Non-Blocking) Algorithms -> **Meaning:** An algorithm where, even if a thread is suspended (e.g., by the OS), other threads can still make system-wide progress. Usually relies on optimistic execution and Compare-and-Swap (CAS) retries instead of mutual exclusion. -> **Application:** Building highly scalable databases and web servers where OS context-switching a lock-holding thread would otherwise stall the entire system.

## 3. Evidence & Examples (Hyper-Specific Details)

* **Deadlock Evidence (4-Way Intersection):** Four cars arrive simultaneously at a 4-way stop. They cannot back up (no preemption/hold-and-wait), and each needs the road in front to be clear (mutual exclusion). Car A waits on B, B waits on C, C waits on D, D waits on A (circular wait). Result: the system freezes.
* **Deadlock Evidence (Real-World San Francisco Traffic):** A photo shows a San Francisco city bus taking a wide turn, blocked by a sedan that ignored the stop line. Cars are piled up behind both vehicles. Neither can relinquish resources (back up), creating an effective deadlock.
* **Deadlock Evidence (Nature):** An illustration of a bird trying to swallow a frog while the frog strangles the bird's neck. Neither trusts the other to relinquish its grip first, resulting in zero progress.
* **Deadlock Evidence (Work Queues):** Thread A pulls from an input queue and outputs to Thread B's work queue. Thread B pulls from its queue and outputs to Thread A's work queue. If both output queues become full simultaneously, neither thread can pull from its input queue. Result: deadlock.
* **Livelock Evidence (Hallway Dance):** Two people walk toward each other. Both step to their right to yield. Both step to their left to yield. They continuously change state (operations are occurring) but fail to pass each other (no progress).
* **Starvation Evidence (Half Moon Bay Traffic):** Yellow cars at a yield sign try to turn onto Highway 1. A continuous, unbroken stream of green cars passes by. The system is completing transactions (green cars progress), but the yellow cars wait indefinitely.
* **Cache Coherence Trace (P0 and P1 memory ops):**
  * `P0: LD X`: P0 issues `BusRd`, loads line X into Shared (S) state. P1 does nothing.
  * `P0: ST X <- 1`: P0 issues `BusRdX`, invalidates other copies, moves to Modified (M).
  * `P1: ST X <- 3`: P1 issues `BusRdX`. P0 flushes X to memory and moves to Invalid (I). P1 gets X in M state.
* **Basic Test-and-Set Lock (Interconnect Traffic Storm):** `test_and_set(addr)` atomically reads memory and, if the value is 0, sets it to 1. If three processors compete, P1 gets the lock. While P1 works, P2 and P3 continuously issue `test_and_set` operations. Because `test_and_set` requires write access, it issues `BusRdX` broadcasts, so the cache line violently bounces between P2 and P3, generating massive bus traffic. When P1 finally unlocks (`store 0`), it suffers a cache miss because the line is no longer in its cache.
* **Test-and-Test-and-Set Lock:** The code changes to `while (lock == 1) {}` followed by `if (test_and_set(&lock) == 0)`. During the critical section, P2 and P3 hold the cache line in Shared (S) state and spin on local cache hits, generating no bus traffic. When P1 unlocks, it invalidates P2/P3; they each take a cache miss, read 0, and trigger a one-time "feeding frenzy" of `BusRdX` requests. Result: O(P) writes only on lock release, and zero interconnect traffic during the critical section.
* **Ticket Lock (Deli Counter Model):** Implemented with two integers: `next_ticket` and `now_serving`. A thread acquires the lock via `my_ticket = atomic_increment(&next_ticket)`, then spins reading `while (my_ticket != now_serving)`. To unlock: `now_serving++`. Result: strictly fair (FIFO), with acquisition reduced to a read-only spin and only one write per lock release.
* **Compare-and-Swap (CAS) Atomic Min:** The instructor demonstrates building an `atomic_min` operation using `atomicCAS(addr, compare, val)`:
  * The thread reads `old = *addr`.
  * It calculates `new = min(old, X)`.
  * It calls `atomicCAS(addr, old, new)`. If the value in memory is still `old`, CAS writes `new`. If another thread (e.g., writing 80) modified the value while the first thread was calculating, CAS fails; the first thread loops, reloads the new value (80), recalculates, and tries again.
* **Linked List Double Insertion Data Corruption:** Thread 1 attempts to insert 6 after 5; Thread 2 attempts to insert 7 after 5. If Thread 1 points 5 to 6, but Thread 2 then points 5 to 7, node 6 is completely lost from the list.
* **Linked List Hand-Over-Hand Locking:** To delete node 11, the thread must lock node 10, then lock node 11, modify pointers, and then release both. Unlocking 10 before locking 11 is like "letting go of the monkey bars": another thread could insert or delete node 11 in that split second, breaking the traversal.

## 4. Actionable Takeaways (Implementation Rules)

* **Rule 1: Eliminate circular dependencies to prevent deadlock.** - Map resource acquisition order. If Thread A always requests Resource 1 then Resource 2, ensure Thread B never requests Resource 2 then Resource 1. Force a strict global ordering for acquiring locks.
* **Rule 2: Never use basic `test-and-set` for spinlocks under high contention.** - Raw `test-and-set` loops generate a continuous stream of cache invalidations (`BusRdX`). Always use "Test-and-Test-and-Set" (`while (lock == 1) {}`) so waiting threads spin locally on Shared cache lines without polluting the interconnect.
* **Rule 3: Use Ticket Locks to guarantee fairness.** - If starvation of specific threads is unacceptable, replace standard spinlocks with a two-variable ticket system (`next_ticket` and `now_serving`) to enforce strict FIFO execution order.
* **Rule 4: Build custom atomic operations using Compare-and-Swap (CAS) loops.** - To build thread-safe operations without locks (like `atomic_min` or `atomic_max`), read the current memory state, compute the desired state, and use CAS to attempt the write. If CAS fails (the state changed), catch the failure, reload, and retry.
* **Rule 5: Use hand-over-hand locking for fine-grained traversal.** - When traversing a linked data structure, never release the lock on node `N` until you have successfully acquired the lock on node `N+1`. You must always hold the locks covering your "blast radius" to prevent other threads from modifying the links beneath you.
* **Rule 6: Profile before applying fine-grained locking.** - Locking every node in a linked list introduces immense overhead (cache misses and memory writes at every step). If traversals are long, a coarse-grained lock (or locking blocks of 10 nodes at a time) may actually outperform node-by-node fine-grained locking.
* **Rule 7: Account for OS preemption with lock-free designs.** - In heavily threaded environments (e.g., massive web servers), rely on lock-free data structures. If a thread holding a standard lock is context-switched out by the OS, every other thread blocks for as long as the holder stays descheduled. Lock-free designs (via CAS) ensure system-wide progress continues regardless of individual thread suspensions.

## 5. Pitfalls & Limitations (Anti-Patterns)

* **Pitfall:** Coarse-grained locking on large data structures. -> **Why it fails:** Placing a single lock around an entire linked list or tree serializes all access. -> **Warning sign:** Multi-threaded application performance drops to single-threaded speeds because threads spend all their time blocked waiting for the global lock.
* **Pitfall:** Dropping all locks mid-traversal (breaking hand-over-hand). -> **Why it fails:** If a thread unlocks Node A before locking Node B, it loses all guarantees about the state of the data structure. Another thread can instantly delete Node B or insert between A and B. -> **Warning sign:** Segfaults, null pointer dereferences, or lost data during concurrent insertions/deletions.
* **Pitfall:** Using C++ `volatile` instead of `std::atomic` for synchronization. -> **Why it fails:** On modern processors with relaxed memory consistency, operations can be reordered by the hardware or compiler. If memory fences are missing, a thread might see a lock released *before* the critical section's writes are visible. -> **Warning sign:** Bizarre, seemingly impossible data corruption in a critical section despite the lock apparently functioning. (Fix: use C++11 `<atomic>`, which inserts the proper hardware memory fences.)
* **Pitfall:** Assuming LL/SC (Load-Linked / Store-Conditional) works exactly like CAS. -> **Why it fails:** LL/SC checks whether *any* write happened to the address, whereas CAS checks whether the *value* matches. CAS can therefore suffer from the ABA problem (the value changed from A to B and back to A, tricking CAS into thinking no change occurred). -> **Warning sign:** Data corruption in lock-free stacks where memory is reclaimed and reallocated rapidly.

## 6. Key Quote / Core Insight

"Having the lock and having the cache line are two entirely different things.
You can be inside a critical section holding a lock, but if waiting threads are aggressively spinning on `test-and-set`, they are violently ripping the cache line away from your processor over the interconnect. You are getting DOSed by your own waiting threads."

## 7. Additional Resources & References

* **Resource:** CUDA Standard Library - **Type:** Software Library - **Relevance:** Referenced for its built-in hardware atomic primitives (`atomicAdd`, `atomicMin`, `atomicCAS`) used in highly concurrent GPU programming.
* **Resource:** C++11 `<atomic>` - **Type:** Programming Language Feature - **Relevance:** Essential for implementing thread-safe flags and locks; ensures the compiler and hardware insert the appropriate memory fences to prevent relaxed-consistency reordering.
* **Resource:** ARM Architecture (Load-Linked/Store-Conditional) - **Type:** Hardware Architecture - **Relevance:** Mentioned as the alternative hardware approach to Compare-and-Swap (using `LDREX` and `STREX` instructions) for implementing atomic primitives without relying solely on cache-coherence bus locking.