
Data-Parallel Patterns (Combinators)

The most tested topic. You MUST know when to use each pattern.

Pattern Reference Table

map
  Type:     (a -> b) -> [a] -> [b]
  What:     Apply a function to every element independently. The output array has the same shape as the input.
  Use when: each output depends on exactly one input at the same index.

zipWith
  Type:     (a -> b -> c) -> [a] -> [b] -> [c]
  What:     Combine two arrays element-wise with a binary function. Like map, but drawing from two inputs.
  Use when: each output depends on two inputs at the same index.

fold
  Type:     (a -> a -> a) -> a -> [a] -> a
  What:     Reduce an entire array to a single value by repeatedly applying a binary operator with an identity. Sequentially: a for-loop accumulator.
  Use when: you need one aggregate result (sum, max, count, any/all).

scanl
  Type:     (a -> b -> a) -> a -> [b] -> [a]
  What:     Like fold, but keeps all intermediate results. output[i] is the fold of elements 0..i-1, so output[0] is the identity and the output has one more element than the input.
  Use when: each output depends on all previous elements (running total, prefix sum, cumulative max).

scanr
  Type:     (a -> b -> b) -> b -> [a] -> [b]
  What:     Like scanl, but sweeps right-to-left. output[i] is the fold of elements i..n-1, so the last output is the identity.
  Use when: each output depends on all following elements.

stencil
  Type:     (Neighborhood a -> b) -> Boundary -> [a] -> [b]
  What:     Each output is a function of a fixed local window around the corresponding input position. The window size/shape is fixed at compile time.
  Use when: each output depends on a fixed-size local neighborhood (blur, Game of Life, convolution, finite differences).

backpermute
  Type:     Int -> (Int -> Int) -> [a] -> [a]
  What:     Read from computed indices: given an output size and an index function f, output[i] = input[f(i)]. Regular output writes, arbitrary input reads.
  Use when: each output reads from one input at a computed index (reorder, slice, reverse, gather).

permute
  Type:     (a -> a -> a) -> [a] -> (Int -> Int) -> [a] -> [a]
  What:     Write to computed indices: each input[i] is written to output[f(i)], with a combining function for collisions and a defaults array for untouched positions. Regular input reads, potentially scattered output writes.
  Use when: each input writes to a computed output index (scatter, stream compaction).

scatter
  Type:     like permute
  What:     Write specific values to specified output indices.
  Use when: you have explicit (index, value) pairs → output array.

gather
  Type:     like backpermute
  What:     Read from specified input indices.
  Use when: you have an explicit index array → read into output.
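The table's semantics can be sketched on plain Haskell lists (not Accelerate). `backpermuteL` and `permuteL` are hypothetical helper names that mirror the list-style signatures above:

```haskell
import Data.List (foldl')

-- gather: output[i] = input[f(i)], given an output size n
backpermuteL :: Int -> (Int -> Int) -> [a] -> [a]
backpermuteL n f xs = [ xs !! f i | i <- [0 .. n - 1] ]

-- scatter: input[i] is combined into defaults[f(i)]
permuteL :: (a -> a -> a) -> [a] -> (Int -> Int) -> [a] -> [a]
permuteL combine defaults f xs = foldl' write defaults (zip [0 ..] xs)
  where
    write out (i, x) =
      let j = f i
      in [ if k == j then combine v x else v | (k, v) <- zip [0 ..] out ]

main :: IO ()
main = do
  print (map (* 2) [1, 2, 3])                       -- [2,4,6]
  print (zipWith (+) [1, 2, 3] [10, 20, 30])        -- [11,22,33]
  print (foldl (+) 0 [1, 2, 3, 4])                  -- 10
  print (scanl (+) 0 [1, 2, 3])                     -- [0,1,3,6]
  print (backpermuteL 3 (\i -> 2 - i) "abc")        -- "cba" (reverse via gather)
  print (permuteL (+) [0, 0] (`mod` 2) [1, 2, 3, 4]) -- [4,6] (collisions merged with (+))
```

Note the list versions are sequential; the point of the patterns is that each clause is independent and can run in parallel.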

Accelerate Equivalents

In the course practicals, these patterns are used via the Accelerate library. The core ideas are the same — Accelerate wraps values in Exp (scalars) and Acc (Array sh e) (parallel arrays).

Simplified signatures:

map         :: (Exp a -> Exp b) -> Acc (Array sh a) -> Acc (Array sh b)
zipWith     :: (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c)
fold        :: (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (sh :. Int) a) -> Acc (Array sh a)
scanl       :: (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a) -> Acc (Vector a)
stencil     :: (Stencil sh a -> Exp b) -> Boundary (Array sh a) -> Acc (Array sh a) -> Acc (Array sh b)
backpermute :: Exp sh' -> (Exp sh' -> Exp sh) -> Acc (Array sh a) -> Acc (Array sh' a)
permute     :: (Exp a -> Exp a -> Exp a) -> Acc (Array sh' a) -> (Exp sh -> Exp sh') -> Acc (Array sh a) -> Acc (Array sh' a)

See the Accelerate Quick Reference for full details on Exp, Acc, shapes, and common idioms.

Decision Flowchart

Does each output depend on...
  -> exactly 1 input at SAME index?           --> map
  -> 2 inputs at SAME index?                  --> zipWith
  -> a FIXED LOCAL neighborhood?              --> stencil
  -> ALL previous elements?                   --> scanl
  -> ALL following elements?                  --> scanr
  -> ALL elements (single result)?            --> fold
  -> 1 input at a COMPUTED index?             --> backpermute/gather
  -> need to WRITE to computed indices?       --> permute/scatter

Key Rules

  • fold requires an associative operator (for parallel execution)
  • scan requires an associative operator
  • permute needs an associative and commutative combining function for collisions, because conflicting writes may be merged in arbitrary order
  • If both are possible on GPU, prefer backpermute / gather over permute / scatter
  • Scatter/permute is more expensive than gather/backpermute because of collisions, atomics/CAS/locks, and extra cache-line traffic
  • Nested data parallelism is a course topic, but exam questions usually expect you to reason in terms of flattening and segmented operators, not to write truly nested GPU kernels directly
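The associativity rule can be checked directly: a parallel fold may regroup the reduction tree arbitrarily, so every grouping must give the same answer. Subtraction shows what goes wrong with a non-associative operator:

```haskell
-- (+) is associative, so any reduction-tree grouping agrees.
-- (-) is not: even the two sequential orders already disagree.
main :: IO ()
main = do
  print ((1 + 2) + 3 == 1 + (2 + 3))   -- True: safe to fold in parallel
  print (foldl (-) 0 [1, 2, 3])        -- -6, i.e. ((0-1)-2)-3
  print (foldr (-) 0 [1, 2, 3])        --  2, i.e. 1-(2-(3-0))
```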

Gather vs Scatter on GPU

  • Gather / backpermute: compute where to read from
  • Scatter / permute: compute where to write to
  • If an algorithm can be phrased either way, prefer gather
  • Common reason: even with no collisions, scatter can still be slower because writes to nearby cache lines from different threads interfere with each other

Nested Data Parallelism and Flattening

  • A nested array-of-arrays can be flattened into:
      • one flat data array
      • a segment descriptor telling where each logical sub-array starts and ends
  • Once flattened, nested operations become segmented operations:
      • segmented fold
      • segmented scan
      • segmented zipWith
  • This is the standard trick for expressing nested parallel structure using flat data parallelism
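A segmented fold over flat data can be sketched on lists. `foldSegL` is a hypothetical name mirroring the idea (Accelerate's real operator is foldSeg); the segment descriptor here is a list of segment lengths:

```haskell
-- Fold each logical sub-array of a flattened nested array.
-- lens is the segment descriptor: the length of each sub-array.
foldSegL :: (a -> a -> a) -> a -> [Int] -> [a] -> [a]
foldSegL _ _ []           _  = []
foldSegL f z (len : lens) xs =
  let (seg, rest) = splitAt len xs
  in foldl f z seg : foldSegL f z lens rest

main :: IO ()
main =
  -- logical nested array [[1,2,3],[],[4,5]] flattened with lengths [3,0,2]
  print (foldSegL (+) 0 [3, 0, 2] [1, 2, 3, 4, 5])   -- [6,0,9]
```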

Advanced Techniques

Separable Stencils

A 2D stencil where the kernel can be factored into a horizontal pass followed by a vertical pass (or vice versa). Instead of an \(m \times m\) neighbourhood per output (work \(O(m^2)\) per element), you do two 1D passes (work \(O(m)\) per element).

  • Example: 2D Gaussian blur = horizontal 1D Gaussian + vertical 1D Gaussian
  • Not all kernels are separable — the kernel must be expressible as an outer product of two 1D kernels
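A minimal separability sketch, using a 3x3 box sum (an all-ones kernel, which trivially factors as the outer product of two all-ones 1D kernels) with a zero boundary:

```haskell
import Data.List (transpose)

-- One 1D pass: each output is the sum of a 3-element window (zero boundary).
box1D :: [Int] -> [Int]
box1D xs = zipWith3 (\a b c -> a + b + c) (0 : xs) xs (tail xs ++ [0])

-- 3x3 box sum as horizontal pass, then vertical pass:
-- O(m) work per element per pass instead of O(m^2) for the direct 2D window.
box2D :: [[Int]] -> [[Int]]
box2D = transpose . map box1D . transpose . map box1D

main :: IO ()
main = print (box2D [[1,1,1],[1,1,1],[1,1,1]])
  -- [[4,6,4],[6,9,6],[4,6,4]]: the centre sees the full 3x3 window of ones
```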

Stencil Boundaries and Halo Regions

  • A stencil needs a rule for what happens when the neighborhood falls outside the array bounds
  • Common boundary choices: clamp, mirror, wrap, or a fixed constant value
  • In distributed/partitioned stencil computations, threads/processors often keep ghost cells / halo regions: copies of neighboring boundary data
  • In iterative stencil computations, the halo must be updated after each iteration
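The common boundary rules can each be phrased as an index mapping from an out-of-range index back into 0..n-1. These function names are made up for illustration; the mirror rule assumes the overshoot is at most one array length, which holds for any fixed stencil window:

```haskell
clampIx, mirrorIx, wrapIx :: Int -> Int -> Int
clampIx  n i = max 0 (min (n - 1) i)   -- repeat the edge element
mirrorIx n i                           -- reflect across the edge
  | i < 0     = -i - 1
  | i >= n    = 2 * n - 1 - i
  | otherwise = i
wrapIx   n i = i `mod` n               -- treat the array as periodic

main :: IO ()
main = do
  print (map (clampIx  4) [-2, -1, 0, 3, 4, 5])   -- [0,0,0,3,3,3]
  print (map (mirrorIx 4) [-2, -1, 0, 3, 4, 5])   -- [1,0,0,3,3,2]
  print (map (wrapIx   4) [-2, -1, 0, 3, 4, 5])   -- [2,3,0,3,0,1]
```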

Integral Image (2D Prefix Sum)

For variable-radius box operations: precompute a 2D prefix sum (integral image) so any rectangular sum is an \(O(1)\) lookup.

  • Compute: scanl row-wise, then scanl column-wise
  • Query (inclusive bounds; any term with index -1 is 0): sum(r1,c1,r2,c2) = I[r2][c2] - I[r1-1][c2] - I[r2][c1-1] + I[r1-1][c1-1]
  • Use when: you need to apply the same kind of aggregation over many different box sizes/positions
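The compute and query steps above, sketched on lists (scanl1 is the identity-free scan, so the output keeps the input's shape):

```haskell
import Data.List (transpose)

-- Integral image: prefix-sum every row, then prefix-sum every column.
integral :: [[Int]] -> [[Int]]
integral = transpose . map (scanl1 (+)) . transpose . map (scanl1 (+))

-- Inclusive box sum over rows r1..r2, cols c1..c2 (0-based), O(1) per query.
boxSum :: [[Int]] -> (Int, Int) -> (Int, Int) -> Int
boxSum im (r1, c1) (r2, c2) =
    at r2 c2 - at (r1 - 1) c2 - at r2 (c1 - 1) + at (r1 - 1) (c1 - 1)
  where
    at r c | r < 0 || c < 0 = 0           -- out-of-range terms vanish
           | otherwise      = im !! r !! c

main :: IO ()
main = do
  let i = integral [[1, 2], [3, 4]]
  print i                            -- [[1,3],[4,10]]
  print (boxSum i (0, 0) (1, 1))     -- 10 (whole array)
  print (boxSum i (1, 0) (1, 1))     -- 7  (bottom row: 3 + 4)
```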

Stream Compaction (Mark–Scan–Scatter)

Extract a subset of elements that satisfy a predicate, preserving order:

  1. map — mark qualifying elements: flags[i] = predicate(input[i]) ? 1 : 0
  2. scanl — prefix sum of flags: indices = scanl (+) 0 flags — each qualifying element gets a unique output index
  3. permute — scatter qualifying elements to their computed positions
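The three steps on lists, with the scatter stage stood in for by building and ordering (index, value) write pairs:

```haskell
import Data.List (sortOn)

compact :: (a -> Bool) -> [a] -> [a]
compact p xs = map snd (sortOn fst writes)
  where
    flags   = [ if p x then 1 else 0 | x <- xs ]   -- 1. map: mark
    indices = scanl (+) 0 flags                    -- 2. scanl: exclusive prefix sum
    writes  = [ (i, x)                             -- 3. scatter: qualifying element x
              | (f, i, x) <- zip3 flags indices xs --    goes to output index i
              , f == 1 ]

main :: IO ()
main = print (compact even [3, 4, 7, 8, 10, 1])   -- [4,8,10], order preserved
```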

Water Trapping Pattern (2024 Exam)

"How much water is trapped above each bar?" — classic scan + zipWith composition:

  1. scanl max left-to-right → leftMax[i] = maximum height seen so far from the left
  2. scanr max right-to-left → rightMax[i] = maximum height seen so far from the right
  3. zipWith min → waterLevel[i] = min(leftMax[i], rightMax[i])
  4. zipWith (-) then map (max 0) → trapped[i] = max(0, waterLevel[i] - height[i])
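The four steps above, on plain lists (scanl1/scanr1 are the identity-free scans, so all arrays stay the same length):

```haskell
trapped :: [Int] -> [Int]
trapped hs =
  let leftMax  = scanl1 max hs                 -- 1. running max from the left
      rightMax = scanr1 max hs                 -- 2. running max from the right
      water    = zipWith min leftMax rightMax  -- 3. water level above each bar
  in  map (max 0) (zipWith (-) water hs)       -- 4. water depth per bar

main :: IO ()
main = print (sum (trapped [0,1,0,2,1,0,1,3,2,1,2,1]))   -- 6
```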

Worked Example: Parcel Delivery (2020 Exam)

  • Compute driving time between consecutive stops: stencil (or zipWith with shifted array)
  • Total driving time for route: fold (+)
  • Expected delivery time per stop: scanl (+) (running total of driving times)
  • Reverse part of route (2-Opt): backpermute or gather
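The four sub-tasks can be sketched on lists; the stop times below are made-up numbers, and the zipWith-with-shifted-array trick stands in for the stencil:

```haskell
-- Time per leg = difference between consecutive cumulative stop times.
legTimes :: [Int] -> [Int]
legTimes stops = zipWith (-) (tail stops) stops

main :: IO ()
main = do
  let stops = [0, 10, 25, 45]            -- cumulative times at each stop (made up)
      legs  = legTimes stops             -- [10,15,20]
  print (foldl (+) 0 legs)               -- total driving time: 45
  print (scanl1 (+) legs)                -- expected arrival per stop: [10,25,45]
  -- 2-Opt: reverse stops 1..2 with a backpermute-style index function
  let f i = if i >= 1 && i <= 2 then 3 - i else i
  print [ stops !! f i | i <- [0 .. length stops - 1] ]   -- [0,25,10,45]
```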

Worked Example: Counting Populated Cells (2024 Retake Exam)

  • Simulate one time step of growth (8-neighbor rule): stencil
  • Count total populated cells: map (bool to int), then fold (+)
  • Detect convexity (count transitions in a row): stencil (compare adjacent cells), then fold (+)
  • Extract temperatures of filled cells: map (mark filled), then scanl (prefix sum for indices), then permute (compact)
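Two of these sub-tasks sketched on a list-of-lists grid (the grids below are made-up examples):

```haskell
-- map (bool to int) over the whole grid, then fold (+)
countPopulated :: [[Bool]] -> Int
countPopulated = foldl (+) 0 . map fromEnum . concat

-- stencil comparing adjacent cells (via a shifted zipWith), then fold (+)
transitions :: [Bool] -> Int
transitions row = foldl (+) 0 (zipWith (\a b -> fromEnum (a /= b)) row (tail row))

main :: IO ()
main = do
  print (countPopulated [[True, False], [True, True]])     -- 3
  print (transitions [False, True, True, False, True])     -- 3
```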