Parallel Scan

What is a Scan?

A scan (prefix sum) computes running totals: scanl (+) 0 [1,2,3,4] = [0,1,3,6,10]

For an exclusive scan, each \(\text{output}[i]\) is the operator applied to elements \(0 \ldots i{-}1\); an inclusive scan also includes element \(i\).

Inclusive vs Exclusive Scan

  • Exclusive scan (scanl) writes the accumulator before consuming the current element
  • Example: scanl (+) 0 [1,2,3,4] = [0,1,3,6,10]
  • Inclusive scan (scanl1) writes the accumulator after consuming the current element
  • Example: scanl1 (+) [1,2,3,4] = [1,3,6,10]
  • Exam trap: make sure you know whether the current element is included or excluded
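The two variants can be compared directly in GHCi; a minimal sketch using the addition examples above (the names exclusive/inclusive are mine):

```haskell
-- Exclusive scan (scanl): the accumulator is written out *before* the
-- current element is consumed, so the result has n+1 elements.
exclusive :: [Int] -> [Int]
exclusive = scanl (+) 0

-- Inclusive scan (scanl1): the accumulator is written out *after* the
-- current element is consumed, so the result has n elements.
inclusive :: [Int] -> [Int]
inclusive = scanl1 (+)
```

Counting elements is the quickest exam check: the exclusive result is one longer than the input and starts with the identity.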

Three-Phase Parallel Scan

  1. Reduce phase (up-sweep): compute partial sums bottom-up in a tree
  2. Top level: the identity element is placed at the root
  3. Down-sweep phase: propagate prefix sums back down

  • Work: \(O(n)\), Span: \(O(\log n)\)
  • Efficient: Yes (work matches sequential \(O(n)\))
  • Optimal: Yes (\(W = \Theta(n) = \Theta(T_{\text{seq}})\))
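The three phases can be written as a recursion that mirrors the tree: combining neighbours is the up-sweep, the base case supplies the identity at the root, and rebuilding the list is the down-sweep. A sketch under the usual power-of-two length assumption (function names are mine):

```haskell
-- Work-efficient exclusive scan in the up-sweep / down-sweep (Blelloch) style.
-- Assumes the input length is a power of two; pairUp silently drops a
-- trailing odd element otherwise.
excScan :: (a -> a -> a) -> a -> [a] -> [a]
excScan _  e [_] = [e]                         -- root: identity element
excScan op e xs  =
  let pairs = pairUp xs                        -- up-sweep: combine neighbours
      sums  = map (uncurry op) pairs
      ps    = excScan op e sums                -- scan the partial sums
  in concat [ [p, p `op` a] | (p, (a, _)) <- zip ps pairs ]  -- down-sweep

pairUp :: [a] -> [(a, a)]
pairUp (x:y:rest) = (x, y) : pairUp rest
pairUp _          = []
```

Each level does \(O(n)\) work in total, and there are \(O(\log n)\) levels of recursion, matching the Work/Span bounds above. Note the left operand is always the earlier prefix, so only associativity is required.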

GPU Scan Approaches

Reduce-Then-Scan (three-phase tiled)

  1. Each thread block computes a local reduce of its tile → per-block totals
  2. Scan the small array of per-block totals (one block, or recursively)
  3. Each thread block does a local scan, adding its scanned prefix

  • Work: \(O(n)\), Span: \(O(\log n)\) → efficient and optimal
  • Communication: via global memory (per-block totals array)
  • Requires multiple kernel launches between phases (global synchronisation)
  • Thread blocks are independent within each kernel
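A sequential model of the three phases, with lists standing in for thread-block tiles (the tile size and all names are illustrative, not from the notes):

```haskell
-- Reduce-then-scan, modelled sequentially: phase 1 reduces each tile,
-- phase 2 scans the small totals array, phase 3 rescans each tile and
-- adds the block's exclusive prefix.
reduceThenScan :: Int -> [Int] -> [Int]
reduceThenScan tileSize xs =
  let tiles   = chunksOf tileSize xs
      totals  = map sum tiles              -- phase 1: per-block reduce
      offsets = scanl (+) 0 totals         -- phase 2: scan per-block totals
  in concat (zipWith (\off t -> map (+ off) (scanl1 (+) t)) offsets tiles)
                                           -- phase 3: local scan + offset

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)
```

On a GPU each phase is a separate kernel launch; the only cross-block communication is the totals array written in phase 1 and read in phase 3.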

Scan-Then-Propagate (naive tiled)

  1. Each thread block independently scans its tile locally
  2. A second pass propagates the prefix sums from earlier blocks

  • Work: \(O(n \log n)\) (each element is touched at \(O(\log n)\) levels) → NOT optimal; efficient only up to an \(O(\log n)\) overhead factor
  • Simpler to implement, but wastes work
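The \(O(n \log n)\) work bound has the same shape as the naive doubling scan, where each of the \(O(\log n)\) levels touches every element; a sequential sketch of that level structure (Hillis-Steele style, names mine):

```haskell
-- Naive step-efficient scan: at level d, element i picks up the value d
-- positions to its left.  There are ceil(log2 n) levels and every level
-- rebuilds all n elements, hence O(n log n) total work.
naiveScan :: (a -> a -> a) -> [a] -> [a]
naiveScan op xs = go 1 xs
  where
    n = length xs
    go d ys
      | d >= n    = ys
      | otherwise = go (2 * d) (take d ys ++ zipWith op ys (drop d ys))
```

The earlier element is always the left operand of `op`, so, as with the other variants, only associativity is assumed.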

Chained Scan

  1. Each thread block does a local scan of its tile
  2. After finishing, it writes its block total to a shared buffer
  3. The next thread block waits (spins) on the previous block's result, then adds it as its prefix

  • Work: \(O(n)\), single kernel launch
  • Communication: blocking/spinning within a single kernel, passing partial results between adjacent blocks
  • Thread blocks are NOT independent (ordering dependency)
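The ordering dependency can be modelled sequentially by threading the running total through the blocks; here the spin-wait on the previous block is replaced by an accumulator passed via mapAccumL (tile size and names are illustrative):

```haskell
import Data.List (mapAccumL)

-- Chained scan, modelled sequentially: each block scans its tile locally,
-- then receives the running total of all earlier blocks (the "carry") and
-- publishes its own total to the next block.
chainedScan :: Int -> [Int] -> [Int]
chainedScan tileSize xs =
  concat (snd (mapAccumL step 0 (chunksOf tileSize xs)))
  where
    step carry t =
      let local = scanl1 (+) t                      -- local scan in the block
      in (carry + last local, map (+ carry) local)  -- pass total onward

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = take n xs : chunksOf n (drop n xs)
```

The accumulator makes the chain of dependencies explicit: block \(i\) cannot publish until block \(i{-}1\) has, which is exactly why the thread blocks are not independent.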

Scan Properties (frequently tested MC)

| Question | Answer |
| --- | --- |
| Required operator property? | Associativity only; commutativity is NOT required |
| Is reduce-then-scan efficient? | Yes: \(W = O(n)\) |
| Is reduce-then-scan optimal? | Yes: \(W = \Theta(n) = \Theta(T_{\text{seq}})\) |
| Is scan-then-propagate optimal? | No: \(W = O(n \log n)\), overhead \(= O(\log n)\) |
| Can the three-phase scan use 1 read + 1 write per element? | No; it needs multiple passes |
| Reconstructing scan input from output? | Stencil: \(\text{output}[i] - \text{output}[i{-}1]\), or a zipWith of the output with its shift |

Associativity formally: \(\forall\, x\, y\, z : x \oplus (y \oplus z) = (x \oplus y) \oplus z\)
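For exam-style questions that ask whether a proposed operator is associative, a brute-force check over a small domain is usually enough to expose a counterexample (e.g. subtraction); a sketch:

```haskell
-- Brute-force associativity check over a finite sample domain.  Passing the
-- check does not prove associativity in general, but a failure is a
-- definitive counterexample.
associativeOn :: Eq a => [a] -> (a -> a -> a) -> Bool
associativeOn dom op =
  and [ (x `op` y) `op` z == x `op` (y `op` z)
      | x <- dom, y <- dom, z <- dom ]
```

For instance, subtraction fails already on small integers: \((1 - 2) - 3 = -4\) but \(1 - (2 - 3) = 2\).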

Segmented Scan

  • A scan that resets at segment boundaries
  • Implemented via operator lifting: pair each value with a boolean flag
  • The lifted operator must be associative (common exam trap: checking if a proposed operator is associative)
  • Segment information can be represented in two common ways:
      • head flags: mark the first element of each segment
      • segment lengths / offsets: describe segment sizes, then convert if needed
  • Flattened nested parallelism often becomes a segmented scan over one flat array plus a segment descriptor
  • Be careful with empty segments: exclusive segmented scans are especially easy to get wrong if the representation allows zero-length segments
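An inclusive segmented scan via the lifted operator can be sketched as follows, using the head-flag representation (names mine; this version deliberately sidesteps the exclusive/empty-segment subtleties mentioned above):

```haskell
-- Segmented inclusive scan by operator lifting: each value is paired with a
-- head flag (True = first element of its segment).  The lifted operator
-- resets the accumulator whenever the right operand starts a new segment,
-- and is associative whenever the underlying operator is.
segScan :: (a -> a -> a) -> [(Bool, a)] -> [a]
segScan op = map snd . scanl1 lifted
  where
    lifted (fx, x) (fy, y) = (fx || fy, if fy then y else x `op` y)
```

The OR on the flags is what makes the lifted operator associative; dropping it is a classic wrong answer in the "is this operator associative?" exam trap.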

Integral Image (2D Prefix Sum)

For a 2D array, the integral image \(I[i][j] = \sum_{r \leq i,\, c \leq j} A[r][c]\).

  • Apply an inclusive scan (scanl1) row-wise, then column-wise (two passes); the definition above is inclusive, so an exclusive scanl would add a border row/column
  • Enables any rectangular sum query in \(O(1)\): \(\text{sum}(r_1,c_1,r_2,c_2) = I[r_2][c_2] - I[r_1{-}1][c_2] - I[r_2][c_1{-}1] + I[r_1{-}1][c_1{-}1]\)
  • Used for variable-radius operations (e.g., computing local averages over arbitrary box sizes) without re-scanning per query
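A list-based sketch of the two-pass construction and the \(O(1)\) rectangle query (names mine; scanl1 keeps the array dimensions unchanged, and out-of-range indices are treated as zero so queries touching row or column 0 work):

```haskell
import Data.List (transpose)

-- 2D inclusive prefix sum: scan every row, then every column
-- (the column pass is done by transposing, scanning rows, transposing back).
integralImage :: [[Int]] -> [[Int]]
integralImage = transpose . map (scanl1 (+)) . transpose . map (scanl1 (+))

-- Rectangle sum by inclusion-exclusion on the integral image.
-- Indices are 0-based and both corners (r1,c1) and (r2,c2) are inclusive.
rectSum :: [[Int]] -> Int -> Int -> Int -> Int -> Int
rectSum img r1 c1 r2 c2 =
  at r2 c2 - at (r1 - 1) c2 - at r2 (c1 - 1) + at (r1 - 1) (c1 - 1)
  where
    at r c | r < 0 || c < 0 = 0          -- outside the image: empty sum
           | otherwise      = img !! r !! c
```

Once the integral image is built, every box query costs four lookups regardless of the box size, which is what makes variable-radius filters cheap.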