Data-Parallel Patterns (Combinators)
The most tested topic. You MUST know when to use each pattern.
Pattern Reference Table
| Pattern | Type Signature | What it does | Use when... |
|---|---|---|---|
| map | (a -> b) -> [a] -> [b] | Apply a function to every element independently. Output array has the same shape as the input. | Each output depends on exactly one input at the same index |
| zipWith | (a -> b -> c) -> [a] -> [b] -> [c] | Combine two arrays element-wise with a binary function. Like map, but draws from two inputs. | Each output depends on two inputs at the same index |
| fold | (a -> a -> a) -> a -> [a] -> a | Reduce an entire array to a single value by repeatedly applying a binary operator with an identity. Sequentially, it is a for-loop accumulator. | You need one aggregate result (sum, max, count, any/all) |
| scanl | (a -> b -> a) -> a -> [b] -> [a] | Like fold, but keeps all intermediate results. Output[i] is the fold of elements 0..i-1; the identity is output[0]. | Each output depends on all previous elements (running total, prefix sum, cumulative max) |
| scanr | (a -> b -> b) -> b -> [a] -> [b] | Same as scanl but sweeps right-to-left. Output[i] is the fold of elements i..n-1; the identity is the last output. | Each output depends on all following elements |
| stencil | (Neighborhood a -> b) -> Boundary -> [a] -> [b] | Each output is a function of a fixed local window around the corresponding input position. The window size/shape is fixed at compile time. | Fixed-size local neighborhood (blur, Game of Life, convolution, finite differences) |
| backpermute | Int -> (Int -> Int) -> [a] -> [a] | Read from computed indices. Given an output size and an index function f, output[i] = input[f(i)]. Regular output writes, arbitrary input reads. | Each output reads from one input at a computed index (reorder, slice, reverse, gather) |
| permute | (a -> a -> a) -> [a] -> (Int -> Int) -> [a] -> [a] | Write to computed indices. Each input[i] is written to output[f(i)], with a combining function for collisions. Regular input reads, potentially scattered output writes. | Each input writes to a computed output index (scatter, stream compaction) |
| scatter | Like permute | Write specific values to specified output indices | Explicit (index, value) pairs → output array |
| gather | Like backpermute | Read from specified input indices | Explicit index array → read into output |
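As a mental model, the flat-array patterns above can be sketched as sequential Python (function names are ours; the real combinators run these loops in parallel):

```python
def par_map(f, xs):           # map: output[i] = f(input[i])
    return [f(x) for x in xs]

def zip_with(f, xs, ys):      # zipWith: output[i] = f(xs[i], ys[i])
    return [f(x, y) for x, y in zip(xs, ys)]

def fold(op, identity, xs):   # fold: reduce to one value (for-loop accumulator)
    acc = identity
    for x in xs:
        acc = op(acc, x)
    return acc

def scanl(op, identity, xs):  # scanl: a fold that keeps every intermediate
    out = [identity]          # the identity is output[0]
    for x in xs:
        out.append(op(out[-1], x))
    return out

def backpermute(n, f, xs):    # backpermute/gather: output[i] = input[f(i)]
    return [xs[f(i)] for i in range(n)]
```

For example, `scanl(lambda a, b: a + b, 0, [1, 2, 3])` gives `[0, 1, 3, 6]` — one element longer than the input, matching the table above.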
Accelerate Equivalents
In the course practicals these patterns are used via the Accelerate library. The core ideas are the same — Accelerate wraps scalar expressions in Exp and parallel arrays in Acc (Array sh e).
| Pattern | Accelerate Signature |
|---|---|
| map | (Exp a -> Exp b) -> Acc (Array sh a) -> Acc (Array sh b) |
| zipWith | (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) |
| fold | (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (sh :. Int) a) -> Acc (Array sh a) |
| scanl | (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a) -> Acc (Vector a) |
| stencil | (Stencil sh a -> Exp b) -> Boundary (Array sh a) -> Acc (Array sh a) -> Acc (Array sh b) |
| backpermute | Exp sh' -> (Exp sh' -> Exp sh) -> Acc (Array sh a) -> Acc (Array sh' a) |
| permute | (Exp a -> Exp a -> Exp a) -> Acc (Array sh' a) -> (Exp sh -> Exp sh') -> Acc (Array sh a) -> Acc (Array sh' a) |
See the Accelerate Quick Reference for full details on Exp, Acc, shapes, and common idioms.
Decision Flowchart
Does each output depend on...
-> exactly 1 input at SAME index? --> map
-> 2 inputs at SAME index? --> zipWith
-> a FIXED LOCAL neighborhood? --> stencil
-> ALL previous elements? --> scanl
-> ALL following elements? --> scanr
-> ALL elements (single result)? --> fold
-> 1 input at a COMPUTED index? --> backpermute/gather
-> need to WRITE to computed indices? --> permute/scatter
Key Rules
- fold requires an associative operator (for parallel execution)
- scan requires an associative operator
- permute needs an associative and commutative combining function for collisions, because conflicting writes may be merged in arbitrary order
- If both are possible on GPU, prefer backpermute / gather over permute / scatter
- Scatter/permute is more expensive than gather/backpermute because of collisions, atomics/CAS/locks, and extra cache-line traffic
- Nested data parallelism is a course topic, but exam questions usually expect you to reason in terms of flattening and segmented operators, not to write truly nested GPU kernels directly
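Why associativity matters: a parallel fold combines elements in a tree, not left to right, so a non-associative operator gives a different answer than the sequential loop. A small Python check (illustrative):

```python
from functools import reduce

def tree_fold(op, xs):
    # pairwise tree reduction — the evaluation order a parallel fold uses
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

nums = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda a, b: a + b
sub = lambda a, b: a - b
# associative (+): tree order agrees with the sequential left fold
assert tree_fold(add, nums) == reduce(add, nums)
# non-associative (-): the orders disagree, so a parallel fold would be wrong
assert tree_fold(sub, nums) != reduce(sub, nums)
```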
Gather vs Scatter on GPU
- Gather / backpermute: compute where to read from
- Scatter / permute: compute where to write to
- If an algorithm can be phrased either way, prefer gather
- Common reason: even with no collisions, scatter can still be slower because writes to nearby cache lines from different threads interfere with each other
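To make the two directions concrete, here is a reversal (hypothetical data) phrased both ways in Python; since a reversal's index map is its own inverse, both give the same result, and on a GPU the gather form is the one to prefer:

```python
def gather(indices, xs):    # backpermute: each output reads input[indices[i]]
    return [xs[j] for j in indices]

def scatter(indices, xs):   # permute: each input[i] writes to output[indices[i]]
    out = [None] * len(xs)
    for i, j in enumerate(indices):
        out[j] = xs[i]
    return out

data = [10, 20, 30, 40]
rev = [3, 2, 1, 0]          # index map for a reversal (its own inverse)
assert gather(rev, data) == scatter(rev, data) == [40, 30, 20, 10]
```

In the gather version the reads are scattered but the writes are regular; in the scatter version it is the writes that scatter.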
Nested Data Parallelism and Flattening
- A nested array-of-arrays can be flattened into:
- one flat data array
- a segment descriptor telling where each logical sub-array starts and ends
- Once flattened, nested operations become segmented operations:
- segmented fold
- segmented scan
- segmented zipWith
- This is the standard trick for expressing nested parallel structure using flat data parallelism
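A minimal sequential sketch of a segmented fold over a flattened nested array, using one common choice of segment descriptor (sub-array lengths):

```python
def segmented_fold(op, identity, data, seg_lengths):
    # seg_lengths is the segment descriptor: the length of each logical sub-array
    out, pos = [], 0
    for n in seg_lengths:
        acc = identity
        for x in data[pos:pos + n]:
            acc = op(acc, x)
        out.append(acc)
        pos += n
    return out

# the nested array [[1,2,3],[4],[5,6]] flattened into data + descriptor
sums = segmented_fold(lambda a, b: a + b, 0, [1, 2, 3, 4, 5, 6], [3, 1, 2])
```

Here `sums` is `[6, 4, 11]`: one result per logical sub-array, computed without any nested array structure.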
Advanced Techniques
Separable Stencils
A 2D stencil where the kernel can be factored into a horizontal pass followed by a vertical pass (or vice versa). Instead of an \(m \times m\) neighbourhood per output (work \(O(m^2)\) per element), you do two 1D passes (work \(O(m)\) per element).
- Example: 2D Gaussian blur = horizontal 1D Gaussian + vertical 1D Gaussian
- Not all kernels are separable — the kernel must be expressible as an outer product of two 1D kernels
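A sketch checking separability on a tiny image, using the 1D binomial kernel [0.25, 0.5, 0.25] (a Gaussian approximation) with clamp boundaries; the 2D kernel is its outer product with itself, so the two 1D passes must match the direct 2D stencil:

```python
def conv1d_row(img, k, r):  # horizontal pass, clamp boundary, O(m) per pixel
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k[d + r] * img[y][clamp(x + d, w)] for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv1d_col(img, k, r):  # vertical pass
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k[d + r] * img[clamp(y + d, h)][x] for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv2d(img, k2, r):     # direct 2D stencil, O(m^2) per pixel
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k2[dy + r][dx + r] * img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1) for dx in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

k1 = [0.25, 0.5, 0.25]                  # 1D binomial kernel
k2 = [[a * b for b in k1] for a in k1]  # separable: outer product of k1 with itself
img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
two_pass = conv1d_col(conv1d_row(img, k1, 1), k1, 1)
direct = conv2d(img, k2, 1)
```

The equality holds because clamping is applied independently per axis, so the summation factors exactly (up to float rounding).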
Stencil Boundaries and Halo Regions
- A stencil needs a rule for what happens when the neighborhood falls outside the array bounds
- Common boundary choices: clamp, mirror, wrap, or a fixed constant value
- In distributed/partitioned stencil computations, threads/processors often keep ghost cells / halo regions: copies of neighboring boundary data
- In iterative stencil computations, the halo must be updated after each iteration
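The common boundary rules can be written as index-mapping functions; a 3-point averaging stencil (illustrative) then just routes out-of-range reads through the chosen rule:

```python
def clamp(i, n):  return max(0, min(n - 1, i))        # repeat the edge value
def mirror(i, n): return n - 1 - abs(n - 1 - abs(i))  # reflect at the edges (offsets < n)
def wrap(i, n):   return i % n                        # treat the array as periodic

def stencil1d(xs, boundary):
    n = len(xs)
    return [(xs[boundary(i - 1, n)] + xs[i] + xs[boundary(i + 1, n)]) / 3
            for i in range(n)]

xs = [3.0, 6.0, 9.0]
assert stencil1d(xs, clamp)[0] == 4.0   # left neighbor clamped to xs[0]
assert stencil1d(xs, wrap)[0] == 6.0    # left neighbor wraps around to xs[2]
```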
Integral Image (2D Prefix Sum)
For variable-radius box operations: precompute a 2D prefix sum (integral image) so any rectangular sum is an \(O(1)\) lookup.
- Compute: `scanl` row-wise, then `scanl` column-wise
- Query: `sum(r1,c1,r2,c2) = I[r2][c2] - I[r1-1][c2] - I[r2][c1-1] + I[r1-1][c1-1]`
- Use when: you need to apply the same kind of aggregation over many different box sizes/positions
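A sketch of the integral image and its O(1) box query, with the `r1-1`/`c1-1` lookups guarded at the edges (`itertools.accumulate` plays the role of `scanl`):

```python
from itertools import accumulate

def integral_image(img):
    rows = [list(accumulate(r)) for r in img]         # scanl (+) along each row
    cols = [list(accumulate(c)) for c in zip(*rows)]  # then along each column
    return [list(r) for r in zip(*cols)]              # transpose back

def box_sum(I, r1, c1, r2, c2):
    # inclusive rectangle sum in O(1); guard the r1-1 / c1-1 edge lookups
    top    = I[r1 - 1][c2] if r1 > 0 else 0
    left   = I[r2][c1 - 1] if c1 > 0 else 0
    corner = I[r1 - 1][c1 - 1] if r1 > 0 and c1 > 0 else 0
    return I[r2][c2] - top - left + corner

I = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For instance `box_sum(I, 1, 1, 2, 2)` is 28 (the bottom-right 2×2 block: 5 + 6 + 8 + 9), regardless of the box size every query costs the same four lookups.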
Stream Compaction (Mark–Scan–Scatter)
Extract a subset of elements that satisfy a predicate, preserving order:
- map — mark qualifying elements: `flags[i] = predicate(input[i]) ? 1 : 0`
- scanl — prefix sum of flags: `indices = scanl (+) 0 flags` — each qualifying element gets a unique output index
- permute — scatter qualifying elements to their computed positions
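The three steps compose into a short sketch, with `itertools.accumulate` standing in for the prefix sum:

```python
from itertools import accumulate

def compact(predicate, xs):
    flags = [1 if predicate(x) else 0 for x in xs]  # map: mark qualifying elements
    indices = [0] + list(accumulate(flags))         # scanl (+) 0: exclusive prefix sum
    out = [None] * indices[-1]                      # total count = last scan element
    for i, x in enumerate(xs):                      # permute: scatter kept elements
        if flags[i]:
            out[indices[i]] = x
    return out

kept = compact(lambda x: x % 2 == 0, [3, 8, 1, 4, 6, 7])  # order-preserving
```

Here `kept` is `[8, 4, 6]`: each kept element lands at its scan index, so relative order is preserved.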
Water Trapping Pattern (2024 Exam)
"How much water is trapped above each bar?" — classic scan + zipWith composition:
- scanl max left-to-right → `leftMax[i]` = maximum height seen so far from the left
- scanr max right-to-left → `rightMax[i]` = maximum height seen so far from the right
- zipWith min → `waterLevel[i] = min(leftMax[i], rightMax[i])`
- zipWith (-) and map (max 0) → `trapped[i] = max(0, waterLevel[i] - height[i])`
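The whole composition in Python, with `accumulate(..., max)` standing in for the left and right max scans:

```python
from itertools import accumulate

def trapped_water(height):
    left_max = list(accumulate(height, max))                   # scanl max, left-to-right
    right_max = list(accumulate(reversed(height), max))[::-1]  # scanr max, right-to-left
    level = [min(l, r) for l, r in zip(left_max, right_max)]   # zipWith min
    return [max(0, w - h) for w, h in zip(level, height)]      # zipWith (-), map (max 0)
```

On the classic example `[0,1,0,2,1,0,1,3,2,1,2,1]` the trapped amounts sum to 6.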
Worked Example: Parcel Delivery (2020 Exam)
- Compute driving time between consecutive stops: stencil (or zipWith with shifted array)
- Total driving time for route: fold (+)
- Expected delivery time per stop: scanl (+) (running total of driving times)
- Reverse part of route (2-Opt): backpermute or gather
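A sketch with made-up 1D stop coordinates (the exam data is richer; the pattern mapping is the point):

```python
from itertools import accumulate

stops = [0, 4, 6, 13]                                  # hypothetical stop coordinates
legs = [abs(b - a) for a, b in zip(stops, stops[1:])]  # zipWith on a shifted copy
total = sum(legs)                                      # fold (+): total driving time
etas = list(accumulate(legs))                          # scanl (+): arrival time per stop
perm = [0, 2, 1, 3]                                    # 2-Opt: reverse the middle segment
new_route = [stops[i] for i in perm]                   # backpermute / gather
```

Here `legs` is `[4, 2, 7]`, `total` is 13, `etas` is `[4, 6, 13]`, and `new_route` is `[0, 6, 4, 13]`.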
Worked Example: Counting Populated Cells (2024 Retake Exam)
- Simulate one time step of growth (8-neighbor rule): stencil
- Count total populated cells: step 1 map (bool to int), step 2 fold (+)
- Detect convexity (count transitions in a row): step 1 stencil (compare adjacent), step 2 fold (+)
- Extract temperatures of filled cells: step 1 map (mark filled), step 2 scanl (prefix sum for indices), step 3 permute/backpermute (compact)
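The counting sub-tasks on a toy grid (the growth step and the compaction follow the stencil and mark–scan–scatter recipes above):

```python
grid = [[0, 1, 1],
        [1, 0, 1],
        [0, 0, 1]]

# count populated cells: map (bool -> int), then fold (+)
count = sum(sum(1 if cell else 0 for cell in row) for row in grid)

# transitions per row: stencil comparing adjacent cells, then fold (+)
transitions = [sum(1 for a, b in zip(row, row[1:]) if a != b) for row in grid]
```

Here `count` is 5 and `transitions` is `[1, 2, 1]` — a convex (single-run) row has at most 2 transitions.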