Parallel Scan
What is a Scan?
A scan (prefix sum) computes running totals: scanl (+) 0 [1,2,3,4] = [0,1,3,6,10]
Each \(\text{output}[i]\) = operator applied to all elements \(0 \ldots i{-}1\).
Inclusive vs Exclusive Scan
- Exclusive scan (
scanl) writes the accumulator before consuming the current element - Example:
scanl (+) 0 [1,2,3,4] = [0,1,3,6,10] - Inclusive scan (
scanl1) writes the accumulator after consuming the current element - Example:
scanl1 (+) [1,2,3,4] = [1,3,6,10] - Exam trap: make sure you know whether the current element is included or excluded
Three-Phase Parallel Scan
- Reduce phase (up-sweep): compute partial sums bottom-up in a tree
- Top-level: identity element at root
-
Down-sweep phase: propagate prefix sums back down
-
Work: \(O(n)\), Span: \(O(\log n)\)
- Efficient: Yes (work matches sequential \(O(n)\))
- Optimal: Yes (\(W = \Theta(n) = \Theta(T_{\text{seq}})\))
GPU Scan Approaches
Reduce-Then-Scan (three-phase tiled)
- Each thread block computes a local reduce of its tile → per-block totals
- Scan the small array of per-block totals (one block or recursive)
-
Each thread block does a local scan, adding its scanned prefix
-
Work: \(O(n)\), Span: \(O(\log n)\) → efficient and optimal
- Communication: via global memory (per-block totals array)
- Requires multiple kernel launches between phases (global synchronisation)
- Thread blocks are independent within each kernel
Scan-Then-Propagate (naive tiled)
- Each thread block independently scans its tile locally
-
A second pass propagates the prefix sums from earlier blocks
-
Work: \(O(n \log n)\) — each element is touched in \(O(\log n)\) levels → NOT optimal (efficient only with \(O(\log n)\) overhead)
- Simpler to implement but wastes work
Chained Scan
- Each thread block does a local scan
- After finishing, writes its total to a shared buffer
-
Next thread block waits (spins) on the previous block's result, then adds it
-
Work: \(O(n)\), single kernel
- Communication: blocking/spinning within a single kernel, passing partial results between adjacent blocks
- Thread blocks are NOT independent (ordering dependency)
Scan Properties (frequently tested MC)
| Question | Answer |
|---|---|
| Required operator property? | Associativity only — commutativity is NOT required |
| Is reduce-then-scan efficient? | Yes — \(W = O(n)\) |
| Is reduce-then-scan optimal? | Yes — \(W = \Theta(n) = \Theta(T_{\text{seq}})\) |
| Is scan-then-propagate optimal? | No — \(W = O(n \log n)\), overhead \(= O(\log n)\) |
| Can three-phase scan use 1 read + 1 write per element? | No — needs multiple passes |
| Reconstructing scan input from output? | stencil: \(\text{output}[i] - \text{output}[i{-}1]\), or zipWith |
Associativity formally: \(\forall\, x\, y\, z : x \oplus (y \oplus z) = (x \oplus y) \oplus z\)
Segmented Scan
- A scan that resets at segment boundaries
- Implemented via operator lifting: pair each value with a boolean flag
- The lifted operator must be associative (common exam trap: checking if a proposed operator is associative)
- Segment information can be represented in two common ways:
- head flags: mark the first element of each segment
- segment lengths / offsets: describe segment sizes, then convert if needed
- Flattened nested parallelism often becomes a segmented scan over one flat array plus a segment descriptor
- Be careful with empty segments: exclusive segmented scans are especially easy to get wrong if the representation allows zero-length segments
Integral Image (2D Prefix Sum)
For a 2D array, the integral image \(I[i][j] = \sum_{r \leq i,\, c \leq j} A[r][c]\).
- Apply scanl row-wise, then scanl column-wise (two passes)
- Enables any rectangular sum query in \(O(1)\): \(\text{sum}(r_1,c_1,r_2,c_2) = I[r_2][c_2] - I[r_1{-}1][c_2] - I[r_2][c_1{-}1] + I[r_1{-}1][c_1{-}1]\)
- Used for variable-radius operations (e.g., computing local averages over arbitrary box sizes) without re-scanning per query