Data-Parallel Patterns (Combinators)
The most tested topic. You MUST know when to use each pattern.
Pattern Reference Table
| Pattern | Type Signature | What it does | Use when... |
|---|---|---|---|
| map | (a -> b) -> [a] -> [b] | Apply a function to every element independently. Output array has the same shape as the input. | Each output depends on exactly one input at the same index |
| zipWith | (a -> b -> c) -> [a] -> [b] -> [c] | Combine two arrays element-wise with a binary function. Like map, but draws from two inputs. | Each output depends on two inputs at the same index |
| fold | (a -> a -> a) -> a -> [a] -> a | Reduce an entire array to a single value by repeatedly applying a binary operator with an identity. Sequentially, it is a for-loop accumulator. | You need one aggregate result (sum, max, count, any/all) |
| scanl | (a -> b -> a) -> a -> [b] -> [a] | Like fold, but keeps all intermediate results. Output[i] is the fold of elements 0..i-1; the identity is output[0]. | Each output depends on all previous elements (running total, prefix sum, cumulative max) |
| scanr | (a -> b -> b) -> b -> [a] -> [b] | Same as scanl but sweeps right-to-left. Output[i] is the fold of elements i..n-1; the identity is the last output. | Each output depends on all following elements |
| stencil | (Neighborhood a -> b) -> Boundary -> [a] -> [b] | Each output is a function of a fixed local window around the corresponding input position. The window size/shape is fixed at compile time. | Fixed-size local neighborhood (blur, Game of Life, convolution, finite differences) |
| backpermute | Int -> (Int -> Int) -> [a] -> [a] | Read from computed indices. Given an output size and an index function f, output[i] = input[f(i)]. Regular output writes, arbitrary input reads. | Each output reads from one input at a computed index (reorder, slice, reverse, gather) |
| permute | (a -> a -> a) -> [a] -> (Int -> Int) -> [a] -> [a] | Write to computed indices. Each input[i] is written to output[f(i)], with a combining function for collisions. Regular input reads, potentially scattered output writes. | Each input writes to a computed output index (scatter, stream compaction) |
| scatter | Like permute | Write specific values to specified output indices | Explicit (index, value) pairs → output array |
| gather | Like backpermute | Read from specified input indices | Explicit index array → read into output |
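As a mental model, the flat-array patterns above can be sketched as sequential Python (function names are ours; the real combinators run these loops in parallel):

```python
def par_map(f, xs):           # map: output[i] = f(input[i])
    return [f(x) for x in xs]

def zip_with(f, xs, ys):      # zipWith: output[i] = f(xs[i], ys[i])
    return [f(x, y) for x, y in zip(xs, ys)]

def fold(op, identity, xs):   # fold: reduce to one value (for-loop accumulator)
    acc = identity
    for x in xs:
        acc = op(acc, x)
    return acc

def scanl(op, identity, xs):  # scanl: a fold that keeps every intermediate
    out = [identity]          # the identity is output[0]
    for x in xs:
        out.append(op(out[-1], x))
    return out

def backpermute(n, f, xs):    # backpermute/gather: output[i] = input[f(i)]
    return [xs[f(i)] for i in range(n)]
```

For example, `scanl(lambda a, b: a + b, 0, [1, 2, 3])` gives `[0, 1, 3, 6]` — one element longer than the input, matching the table above.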
Accelerate Equivalents
In the course practicals these patterns are used via the Accelerate library. The core ideas are the same — Accelerate wraps scalar expressions in Exp and parallel arrays in Acc (Array sh e).
| Pattern | Accelerate Signature |
|---|---|
| map | (Exp a -> Exp b) -> Acc (Array sh a) -> Acc (Array sh b) |
| zipWith | (Exp a -> Exp b -> Exp c) -> Acc (Array sh a) -> Acc (Array sh b) -> Acc (Array sh c) |
| fold | (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Array (sh :. Int) a) -> Acc (Array sh a) |
| scanl | (Exp a -> Exp a -> Exp a) -> Exp a -> Acc (Vector a) -> Acc (Vector a) |
| stencil | (Stencil sh a -> Exp b) -> Boundary (Array sh a) -> Acc (Array sh a) -> Acc (Array sh b) |
| backpermute | Exp sh' -> (Exp sh' -> Exp sh) -> Acc (Array sh a) -> Acc (Array sh' a) |
| permute | (Exp a -> Exp a -> Exp a) -> Acc (Array sh' a) -> (Exp sh -> Exp sh') -> Acc (Array sh a) -> Acc (Array sh' a) |
See the Accelerate Quick Reference for full details on Exp, Acc, shapes, and common idioms.
Decision Flowchart
Does each output depend on...
-> exactly 1 input at SAME index? --> map
-> 2 inputs at SAME index? --> zipWith
-> a FIXED LOCAL neighborhood? --> stencil
-> ALL previous elements? --> scanl
-> ALL following elements? --> scanr
-> ALL elements (single result)? --> fold
-> 1 input at a COMPUTED index? --> backpermute/gather
-> need to WRITE to computed indices? --> permute/scatter
Key Rules
- fold requires an associative operator (for parallel execution)
- scan requires an associative operator
- permute needs an associative and commutative combining function for collisions, because conflicting writes may be merged in arbitrary order
- If both are possible on GPU, prefer backpermute / gather over permute / scatter
- Scatter/permute is more expensive than gather/backpermute because of collisions, atomics/CAS/locks, and extra cache-line traffic
- Nested data parallelism is a course topic, but exam questions usually expect you to reason in terms of flattening and segmented operators, not to write truly nested GPU kernels directly
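Why associativity matters: a parallel fold combines elements in a tree, not left to right, so a non-associative operator gives a different answer than the sequential loop. A small Python check (illustrative):

```python
from functools import reduce

def tree_fold(op, xs):
    # pairwise tree reduction — the evaluation order a parallel fold uses
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
    return xs[0]

nums = [1, 2, 3, 4, 5, 6, 7, 8]
add = lambda a, b: a + b
sub = lambda a, b: a - b
# associative (+): tree order agrees with the sequential left fold
assert tree_fold(add, nums) == reduce(add, nums)
# non-associative (-): the orders disagree, so a parallel fold would be wrong
assert tree_fold(sub, nums) != reduce(sub, nums)
```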
Gather vs Scatter on GPU
- Gather / backpermute: compute where to read from
- Scatter / permute: compute where to write to
- If an algorithm can be phrased either way, prefer gather
- Common reason: even with no collisions, scatter can still be slower because writes to nearby cache lines from different threads interfere with each other
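To make the two directions concrete, here is a reversal (hypothetical data) phrased both ways in Python; since a reversal's index map is its own inverse, both give the same result, and on a GPU the gather form is the one to prefer:

```python
def gather(indices, xs):    # backpermute: each output reads input[indices[i]]
    return [xs[j] for j in indices]

def scatter(indices, xs):   # permute: each input[i] writes to output[indices[i]]
    out = [None] * len(xs)
    for i, j in enumerate(indices):
        out[j] = xs[i]
    return out

data = [10, 20, 30, 40]
rev = [3, 2, 1, 0]          # index map for a reversal (its own inverse)
assert gather(rev, data) == scatter(rev, data) == [40, 30, 20, 10]
```

In the gather version the reads are scattered but the writes are regular; in the scatter version it is the writes that scatter.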
Nested Data Parallelism and Flattening
- A nested array-of-arrays can be flattened into:
- one flat data array
- a segment descriptor telling where each logical sub-array starts and ends
- Once flattened, nested operations become segmented operations:
- segmented fold
- segmented scan
- segmented zipWith
- This is the standard trick for expressing nested parallel structure using flat data parallelism
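A minimal sequential sketch of a segmented fold over a flattened nested array, using one common choice of segment descriptor (sub-array lengths):

```python
def segmented_fold(op, identity, data, seg_lengths):
    # seg_lengths is the segment descriptor: the length of each logical sub-array
    out, pos = [], 0
    for n in seg_lengths:
        acc = identity
        for x in data[pos:pos + n]:
            acc = op(acc, x)
        out.append(acc)
        pos += n
    return out

# the nested array [[1,2,3],[4],[5,6]] flattened into data + descriptor
sums = segmented_fold(lambda a, b: a + b, 0, [1, 2, 3, 4, 5, 6], [3, 1, 2])
```

Here `sums` is `[6, 4, 11]`: one result per logical sub-array, computed without any nested array structure.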
Advanced Techniques
Separable Stencils
A 2D stencil where the kernel can be factored into a horizontal pass followed by a vertical pass (or vice versa). Instead of an \(m \times m\) neighbourhood per output (work \(O(m^2)\) per element), you do two 1D passes (work \(O(m)\) per element).
- Example: 2D Gaussian blur = horizontal 1D Gaussian + vertical 1D Gaussian
- Not all kernels are separable — the kernel must be expressible as an outer product of two 1D kernels
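A sketch checking separability on a tiny image, using the 1D binomial kernel [0.25, 0.5, 0.25] (a Gaussian approximation) with clamp boundaries; the 2D kernel is its outer product with itself, so the two 1D passes must match the direct 2D stencil:

```python
def conv1d_row(img, k, r):  # horizontal pass, clamp boundary, O(m) per pixel
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k[d + r] * img[y][clamp(x + d, w)] for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv1d_col(img, k, r):  # vertical pass
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k[d + r] * img[clamp(y + d, h)][x] for d in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def conv2d(img, k2, r):     # direct 2D stencil, O(m^2) per pixel
    h, w = len(img), len(img[0])
    clamp = lambda i, n: max(0, min(n - 1, i))
    return [[sum(k2[dy + r][dx + r] * img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1) for dx in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

k1 = [0.25, 0.5, 0.25]                  # 1D binomial kernel
k2 = [[a * b for b in k1] for a in k1]  # separable: outer product of k1 with itself
img = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
two_pass = conv1d_col(conv1d_row(img, k1, 1), k1, 1)
direct = conv2d(img, k2, 1)
```

The equality holds because clamping is applied independently per axis, so the summation factors exactly (up to float rounding).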
Stencil Boundaries and Halo Regions
- A stencil needs a rule for what happens when the neighborhood falls outside the array bounds
- Common boundary choices: clamp, mirror, wrap, or a fixed constant value
- In distributed/partitioned stencil computations, threads/processors often keep ghost cells / halo regions: copies of neighboring boundary data
- In iterative stencil computations, the halo must be updated after each iteration
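The common boundary rules can be written as index-mapping functions; a 3-point averaging stencil (illustrative) then just routes out-of-range reads through the chosen rule:

```python
def clamp(i, n):  return max(0, min(n - 1, i))        # repeat the edge value
def mirror(i, n): return n - 1 - abs(n - 1 - abs(i))  # reflect at the edges (offsets < n)
def wrap(i, n):   return i % n                        # treat the array as periodic

def stencil1d(xs, boundary):
    n = len(xs)
    return [(xs[boundary(i - 1, n)] + xs[i] + xs[boundary(i + 1, n)]) / 3
            for i in range(n)]

xs = [3.0, 6.0, 9.0]
assert stencil1d(xs, clamp)[0] == 4.0   # left neighbor clamped to xs[0]
assert stencil1d(xs, wrap)[0] == 6.0    # left neighbor wraps around to xs[2]
```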
Integral Image (2D Prefix Sum)
For variable-radius box operations: precompute a 2D prefix sum (integral image) so any rectangular sum is an \(O(1)\) lookup.
- Compute: `scanl` row-wise, then `scanl` column-wise
- Query: `sum(r1,c1,r2,c2) = I[r2][c2] - I[r1-1][c2] - I[r2][c1-1] + I[r1-1][c1-1]`
- Use when: you need to apply the same kind of aggregation over many different box sizes/positions
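A sketch of the integral image and its O(1) box query, with the `r1-1`/`c1-1` lookups guarded at the edges (`itertools.accumulate` plays the role of `scanl`):

```python
from itertools import accumulate

def integral_image(img):
    rows = [list(accumulate(r)) for r in img]         # scanl (+) along each row
    cols = [list(accumulate(c)) for c in zip(*rows)]  # then along each column
    return [list(r) for r in zip(*cols)]              # transpose back

def box_sum(I, r1, c1, r2, c2):
    # inclusive rectangle sum in O(1); guard the r1-1 / c1-1 edge lookups
    top    = I[r1 - 1][c2] if r1 > 0 else 0
    left   = I[r2][c1 - 1] if c1 > 0 else 0
    corner = I[r1 - 1][c1 - 1] if r1 > 0 and c1 > 0 else 0
    return I[r2][c2] - top - left + corner

I = integral_image([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

For instance `box_sum(I, 1, 1, 2, 2)` is 28 (the bottom-right 2×2 block: 5 + 6 + 8 + 9), regardless of the box size every query costs the same four lookups.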
Stream Compaction (Mark–Scan–Scatter)
Extract a subset of elements that satisfy a predicate, preserving order:
- map — mark qualifying elements: `flags[i] = predicate(input[i]) ? 1 : 0`
- scanl — prefix sum of flags: `indices = scanl (+) 0 flags` — each qualifying element gets a unique output index
- permute — scatter qualifying elements to their computed positions
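The three steps compose into a short sketch, with `itertools.accumulate` standing in for the prefix sum:

```python
from itertools import accumulate

def compact(predicate, xs):
    flags = [1 if predicate(x) else 0 for x in xs]  # map: mark qualifying elements
    indices = [0] + list(accumulate(flags))         # scanl (+) 0: exclusive prefix sum
    out = [None] * indices[-1]                      # total count = last scan element
    for i, x in enumerate(xs):                      # permute: scatter kept elements
        if flags[i]:
            out[indices[i]] = x
    return out

kept = compact(lambda x: x % 2 == 0, [3, 8, 1, 4, 6, 7])  # order-preserving
```

Here `kept` is `[8, 4, 6]`: each kept element lands at its scan index, so relative order is preserved.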
Water Trapping Pattern (2024 Exam)
"How much water is trapped above each bar?" — classic scan + zipWith composition:
- scanl max left-to-right → `leftMax[i]` = maximum height seen so far from the left
- scanr max right-to-left → `rightMax[i]` = maximum height seen so far from the right
- zipWith min → `waterLevel[i] = min(leftMax[i], rightMax[i])`
- zipWith (-) and map (max 0) → `trapped[i] = max(0, waterLevel[i] - height[i])`
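The whole composition in Python, with `accumulate(..., max)` standing in for the left and right max scans:

```python
from itertools import accumulate

def trapped_water(height):
    left_max = list(accumulate(height, max))                   # scanl max, left-to-right
    right_max = list(accumulate(reversed(height), max))[::-1]  # scanr max, right-to-left
    level = [min(l, r) for l, r in zip(left_max, right_max)]   # zipWith min
    return [max(0, w - h) for w, h in zip(level, height)]      # zipWith (-), map (max 0)
```

On the classic example `[0,1,0,2,1,0,1,3,2,1,2,1]` the trapped amounts sum to 6.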
Worked Example: Parcel Delivery (2020 Exam)
- Compute driving time between consecutive stops: stencil (or zipWith with shifted array)
- Total driving time for route: fold (+)
- Expected delivery time per stop: scanl (+) (running total of driving times)
- Reverse part of route (2-Opt): backpermute or gather
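A sketch with made-up 1D stop coordinates (the exam data is richer; the pattern mapping is the point):

```python
from itertools import accumulate

stops = [0, 4, 6, 13]                                  # hypothetical stop coordinates
legs = [abs(b - a) for a, b in zip(stops, stops[1:])]  # zipWith on a shifted copy
total = sum(legs)                                      # fold (+): total driving time
etas = list(accumulate(legs))                          # scanl (+): arrival time per stop
perm = [0, 2, 1, 3]                                    # 2-Opt: reverse the middle segment
new_route = [stops[i] for i in perm]                   # backpermute / gather
```

Here `legs` is `[4, 2, 7]`, `total` is 13, `etas` is `[4, 6, 13]`, and `new_route` is `[0, 6, 4, 13]`.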
Worked Example: Counting Populated Cells (2024 Retake Exam)
- Simulate one time step of growth (8-neighbor rule): stencil
- Count total populated cells: step 1 map (bool to int), step 2 fold (+)
- Detect convexity (count transitions in a row): step 1 stencil (compare adjacent), step 2 fold (+)
- Extract temperatures of filled cells: step 1 map (mark filled), step 2 scanl (prefix sum for indices), step 3 permute/backpermute (compact)
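The counting sub-tasks on a toy grid (the growth step and the compaction follow the stencil and mark–scan–scatter recipes above):

```python
grid = [[0, 1, 1],
        [1, 0, 1],
        [0, 0, 1]]

# count populated cells: map (bool -> int), then fold (+)
count = sum(sum(1 if cell else 0 for cell in row) for row in grid)

# transitions per row: stencil comparing adjacent cells, then fold (+)
transitions = [sum(1 for a, b in zip(row, row[1:]) if a != b) for row in grid]
```

Here `count` is 5 and `transitions` is `[1, 2, 1]` — a convex (single-run) row has at most 2 transitions.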