GPU Architecture

Thread Hierarchy

Grid (= entire kernel launch)
  -> Thread Block (= group of threads that can sync + share memory)
    -> Warp (= 32 threads executing in LOCKSTEP)
      -> Thread (= single execution unit)
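
The hierarchy above can be read as a decomposition of a flat thread index. A minimal sketch, assuming a 1D launch; the function name locate_thread and its parameters are made up for illustration and are not part of any GPU API:

```python
WARP_SIZE = 32  # warp width stated in the hierarchy above


def locate_thread(global_id, threads_per_block):
    """Split a flat thread index of a 1D launch into (block, warp-in-block, lane)."""
    block, thread_in_block = divmod(global_id, threads_per_block)
    warp, lane = divmod(thread_in_block, WARP_SIZE)
    return block, warp, lane


# Thread 300 of a launch with 256 threads per block:
print(locate_thread(300, 256))  # (1, 1, 12)
```

Threads that share the same (block, warp) pair execute in lockstep; threads that share only the block can still synchronise and use the same shared memory.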

Key Concepts

Lockstep Execution:
- All 32 threads in a warp execute the SAME instruction at the SAME time
- Branch divergence: if threads in a warp take different paths in an if/else, BOTH paths are executed one after the other; threads are masked off (inactive) on the path they did not take
- This is why GPUs prefer uniform control flow within a warp
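
The cost of divergence can be illustrated with a toy model. This is a sketch under simplifying assumptions: each path has a fixed cost in instruction slots, and warp_branch_cost is a hypothetical helper, not a real profiling API.

```python
WARP_SIZE = 32


def warp_branch_cost(predicates, then_cost, else_cost):
    """Instruction slots one warp spends on an if/else, given per-thread predicates.

    A path is issued for the whole warp as soon as at least one thread takes it;
    threads on the other path are masked off but the slots are still spent.
    """
    assert len(predicates) == WARP_SIZE
    cost = 0
    if any(predicates):        # at least one thread takes the "then" path
        cost += then_cost
    if not all(predicates):    # at least one thread takes the "else" path
        cost += else_cost
    return cost


uniform = [True] * WARP_SIZE                              # every thread agrees
divergent = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # even/odd lanes disagree

print(warp_branch_cost(uniform, 10, 10))    # 10
print(warp_branch_cost(divergent, 10, 10))  # 20
```

In this model a single disagreeing thread is enough to make the warp pay for both paths.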

Memory Hierarchy:
- Registers: per-thread, fastest
- Shared memory: per-block, fast (~L1 cache speed), limited (96-192 KB), explicitly managed
- Global memory: per-device, slow, large, accessible by all threads and the host

Memory Coalescing:
- The GPU loads memory in large cache lines
- When the threads of a warp access CONTIGUOUS memory -> 1 memory transaction (fast)
- When they access RANDOM locations -> many transactions (slow)
- Rule: structured accesses >> random accesses; if an operation can be written either way, prefer gather/backpermute (irregular reads) over scatter/permute (irregular writes)
- Scatter is often worse than gather because irregular writes also create extra synchronisation/cache-line traffic
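
A rough way to see the effect is to count how many distinct cache lines one warp-wide load touches. The sketch below assumes a 128-byte line and 4-byte elements; both numbers are illustrative assumptions, and transactions is a hypothetical helper.

```python
import random

WARP_SIZE = 32
LINE_BYTES = 128  # assumed cache-line / transaction size
ELEM_BYTES = 4    # assumed element size (e.g. a 32-bit float)


def transactions(indices):
    """Number of distinct cache lines touched when thread t loads element indices[t]."""
    return len({(i * ELEM_BYTES) // LINE_BYTES for i in indices})


contiguous = list(range(WARP_SIZE))           # thread t reads element t
strided = [t * 32 for t in range(WARP_SIZE)]  # thread t reads element 32 * t
rng = random.Random(0)
scattered = [rng.randrange(1 << 20) for _ in range(WARP_SIZE)]

print(transactions(contiguous))  # 1
print(transactions(strided))     # 32
print(transactions(scattered))   # at most 32, typically close to it
```

The strided case shows that an access pattern can be perfectly regular and still uncoalesced: what matters is how many cache lines the warp touches, not whether the indices follow a formula.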

Occupancy:
- Ratio of active warps to the maximum number of warps per SM (streaming multiprocessor)
- Limited by: registers per thread, shared memory per block, threads per block
- Higher occupancy = better latency hiding (more warps to switch to when one stalls)
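
The three limits can be combined into a back-of-the-envelope estimate. This is a simplified sketch: the per-SM limits used as defaults (64 warps, 65536 registers, 96 KB shared memory, 32 blocks) are illustrative assumptions rather than the figures of a specific GPU, and real hardware additionally rounds allocations to a granularity that is ignored here.

```python
WARP_SIZE = 32


def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_warps_per_sm=64, regs_per_sm=65536,
              smem_per_sm=96 * 1024, max_blocks_per_sm=32):
    """Estimate occupancy = resident warps / max warps on one SM (assumed limits)."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    # How many blocks fit on the SM according to each resource:
    by_warps = max_warps_per_sm // warps_per_block
    by_regs = regs_per_sm // (regs_per_thread * warps_per_block * WARP_SIZE)
    by_smem = smem_per_sm // smem_per_block if smem_per_block else max_blocks_per_sm
    blocks = min(by_warps, by_regs, by_smem, max_blocks_per_sm)
    return blocks * warps_per_block / max_warps_per_sm


print(occupancy(256, 32, 0))          # 1.0  (nothing is limiting)
print(occupancy(256, 64, 0))          # 0.5  (limited by registers)
print(occupancy(256, 32, 48 * 1024))  # 0.25 (limited by shared memory)
```

Doubling the registers per thread halves the number of resident blocks in this model, which is the trade-off behind the "limited by" line above.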

CPU vs GPU:
- CPU: optimizes for latency (branch prediction, out-of-order execution, big caches)
- GPU: optimizes for throughput (many simple cores, hides latency via massive parallelism)
- CPU thread ~ GPU warp (both execute an instruction stream)
- CPU SIMD lane ~ GPU thread (both process one data element)
- CPU SMT ~ GPU warp switching (both hide latency by interleaving)

GPU Scan Communication (exam favorite)

Reduce-then-scan: Thread blocks are independent within each kernel. Blocks communicate through a global-memory array of per-block results, which requires multiple kernel launches.
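
The structure of reduce-then-scan can be sketched sequentially, with each "kernel" modelled as one pass over the blocks. This is a host-side illustration of the data flow for an inclusive prefix sum under these assumptions, not GPU code; reduce_then_scan and block_size are names chosen for the example.

```python
from itertools import accumulate


def reduce_then_scan(data, block_size):
    """Inclusive prefix sum in three passes, mirroring three kernel launches."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Kernel 1: every block independently reduces its chunk to one value and
    # writes it to the "global memory" array block_sums.
    block_sums = [sum(b) for b in blocks]

    # Kernel 2: exclusive scan over the per-block sums gives each block's offset.
    offsets = [0] + list(accumulate(block_sums))[:-1]

    # Kernel 3: every block independently scans its chunk and adds its offset.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(offset + x for x in accumulate(block))
    return out


print(reduce_then_scan([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

Within each pass no block reads anything another block writes in that same pass; the boundaries between passes are the only synchronisation points, which is why separate kernel launches are needed.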
Chained scan: Blocks communicate by blocking/spinning within a single kernel: block i waits for block i-1's result. Only one kernel launch is needed.
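
The waiting pattern of the chained scan can be imitated with CPU threads, one per block. This is an analogy under stated assumptions, not GPU code: threading.Event stands in for a flag in global memory that block i sets and block i+1 waits on, and chained_scan is a name chosen for the example.

```python
import threading
from itertools import accumulate


def chained_scan(data, block_size):
    """Inclusive prefix sum where block i waits for block i-1 (one 'launch')."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    n = len(blocks)
    prefix = [0] * n                               # published inclusive prefix per block
    ready = [threading.Event() for _ in range(n)]  # "flag in global memory" per block
    out = [None] * n

    def block_body(i):
        local = list(accumulate(blocks[i]))  # local scan needs no communication
        if i == 0:
            carry = 0
        else:
            ready[i - 1].wait()              # block until the predecessor has published
            carry = prefix[i - 1]
        prefix[i] = carry + local[-1]        # publish this block's inclusive prefix
        ready[i].set()                       # release the successor
        out[i] = [carry + x for x in local]

    workers = [threading.Thread(target=block_body, args=(i,)) for i in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [x for chunk in out for x in chunk]


print(chained_scan([1, 2, 3, 4, 5, 6, 7, 8], 4))
# [1, 3, 6, 10, 15, 21, 28, 36]
```

Each block publishes its prefix before finishing its own output, so its successor is released as early as possible. The waiting only terminates if block i-1 is actually running whenever block i waits on it; the CPU threads here provide that, whereas a GPU implementation has to ensure it explicitly.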