Parallel Scan

What is a Scan?

A scan (prefix sum) computes running totals: scanl (+) 0 [1,2,3,4] = [0,1,3,6,10]

Each \(\text{output}[i]\) = operator applied to all elements \(0 \ldots i{-}1\).

Exclusive scan (scanl) writes the accumulator before consuming the current element
Example: scanl (+) 0 [1,2,3,4] = [0,1,3,6,10]
Inclusive scan (scanl1) writes the accumulator after consuming the current element
Example: scanl1 (+) [1,2,3,4] = [1,3,6,10]
Exam trap: make sure you know whether the current element is included or excluded

Each thread block independently scans its tile locally
A second pass propagates the prefix sums from earlier blocks
Work: \(O(n \log n)\) — each element is touched in \(O(\log n)\) levels → NOT optimal (efficient only with \(O(\log n)\) overhead)
Simpler to implement but wastes work

Each thread block does a local scan
After finishing, writes its total to a shared buffer
Next thread block waits (spins) on the previous block's result, then adds it
Work: \(O(n)\), single kernel
Communication: blocking/spinning within a single kernel, passing partial results between adjacent blocks
Thread blocks are NOT independent (ordering dependency)

Question	Answer
Required operator property?	Associativity only — commutativity is NOT required
Is reduce-then-scan efficient?	Yes — \(W = O(n)\)
Is reduce-then-scan optimal?	Yes — \(W = \Theta(n) = \Theta(T_{\text{seq}})\)
Is scan-then-propagate optimal?	No — \(W = O(n \log n)\), overhead \(= O(\log n)\)
Can three-phase scan use 1 read + 1 write per element?	No — needs multiple passes
Reconstructing scan input from output?	stencil: \(\text{output}[i] - \text{output}[i{-}1]\), or zipWith

Associativity formally: \(\forall\, x\, y\, z : x \oplus (y \oplus z) = (x \oplus y) \oplus z\)

A scan that resets at segment boundaries
Implemented via operator lifting: pair each value with a boolean flag
The lifted operator must be associative (common exam trap: checking if a proposed operator is associative)
Segment information can be represented in two common ways:
head flags: mark the first element of each segment
segment lengths / offsets: describe segment sizes, then convert if needed
Flattened nested parallelism often becomes a segmented scan over one flat array plus a segment descriptor
Be careful with empty segments: exclusive segmented scans are especially easy to get wrong if the representation allows zero-length segments

For a 2D array, the integral image \(I[i][j] = \sum_{r \leq i,\, c \leq j} A[r][c]\).

Apply scanl row-wise, then scanl column-wise (two passes)
Enables any rectangular sum query in \(O(1)\): \(\text{sum}(r_1,c_1,r_2,c_2) = I[r_2][c_2] - I[r_1{-}1][c_2] - I[r_2][c_1{-}1] + I[r_1{-}1][c_1{-}1]\)
Used for variable-radius operations (e.g., computing local averages over arbitrary box sizes) without re-scanning per query