
Flashcards — Concurrency Quick-Fire Q&A

Try to recall the answer before reading it. Grouped by topic.


GPU Architecture

What set of threads runs in lockstep on a GPU?

All threads in a warp (32 threads).

What happens when threads in a warp take different branches in an if-else?

Both branches execute; threads not on the active branch are marked inactive. This is called branch divergence.

What is memory coalescing?

When threads in a warp access contiguous memory locations, the GPU can serve them in a single memory transaction. Scattered access requires multiple transactions.

Which is faster on a GPU: structured reads + random writes, or random reads + structured writes?

Random reads + structured writes (gather). Reads are served through the cache hierarchy, while random writes cause partial cache-line transactions and need collision handling; this is the same reason gather/backpermute is preferred over scatter/permute.

What determines GPU occupancy?

Registers per thread, shared memory per block, and threads per block. These limit how many warps can be active on an SM.

What is the GPU thread hierarchy from largest to smallest?

Grid > Thread Blocks > Warps (32 threads) > Threads.

What is the main difference between CPU and GPU design philosophy?

CPU avoids latency (branch prediction, OoO, big caches). GPU hides latency via massive parallelism (many lightweight threads).

What does a CPU thread correspond to on a GPU?

A warp. CPU SIMD lane ~ GPU thread. CPU SMT ~ GPU warp switching.


Data-Parallel Patterns

When do you use a stencil?

When each output depends on a fixed-size local neighborhood of the input (e.g., blur, Game of Life).
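A minimal list-based sketch of a 3-point stencil (the names `stencil3` and `blur` are invented for illustration; boundaries are clamped here, where a real GPU stencil would use halo cells):

```haskell
-- A 3-point stencil on a list: each output sees its left and right
-- neighbour. zipWith3 truncates to the shortest list, so padding the
-- left column with 'head' and the right column with 'last' clamps the
-- boundary. (Illustrative sketch; partial head/last assume non-empty input.)
stencil3 :: (a -> a -> a -> b) -> [a] -> [b]
stencil3 f xs = zipWith3 f (head xs : xs) xs (tail xs ++ [last xs])

-- A box blur: average each element with its neighbours.
blur :: [Double] -> [Double]
blur = stencil3 (\l c r -> (l + c + r) / 3)

main :: IO ()
main = print (blur [0, 3, 6, 9])   -- [1.0,3.0,6.0,8.0]
```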

When do you use scanl?

When each output depends on all previous elements (running total, prefix sum, cumulative operation).

When do you use fold?

When you need to reduce an entire array to a single value (sum, max, count).

When do you use backpermute?

When each output reads from one input at a computed index (structured reads from arbitrary positions).

When do you use permute?

When each input writes to a computed output index, with an associative and commutative combining function for collisions.

What is the difference between map and zipWith?

map applies a function to one array. zipWith combines two arrays element-wise.
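In Prelude terms:

```haskell
main :: IO ()
main = do
  print (map (* 2) [1, 2, 3])                 -- one array in, one out
  print (zipWith (+) [1, 2, 3] [10, 20, 30])  -- two arrays combined element-wise
```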

If both gather/backpermute and scatter/permute are possible on a GPU, which should you usually prefer?

Gather / backpermute. Scatter is usually more expensive because of irregular writes, collisions, atomics/CAS/locks, and extra cache-line traffic.
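A list-based sketch of the two access patterns (the names `gather` and `scatter` are invented here; they mirror Accelerate's backpermute and permute but work on plain lists):

```haskell
import Data.List (foldl')

-- Gather (backpermute): output position i READS input at idx !! i.
gather :: [Int] -> [a] -> [a]
gather idx xs = map (xs !!) idx

-- Scatter (permute): input position i WRITES to idx !! i, combining
-- collisions with an associative, commutative f over a defaults array.
scatter :: (a -> a -> a) -> [a] -> [Int] -> [a] -> [a]
scatter f defaults idx xs = foldl' upd defaults (zip idx xs)
  where
    upd acc (i, x) = [ if j == i then f a x else a | (j, a) <- zip [0 ..] acc ]

main :: IO ()
main = do
  print (gather [2, 0, 1] "abc")                -- "cab"
  print (scatter (+) [0, 0] [0, 1, 0] [1, 2, 3])  -- indices 0 and 0 collide
```

Note that `scatter` needs the combining function precisely because two inputs may target the same output slot; `gather` has no such problem.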

Can you do a segmented map using just a regular map?

Yes — map is element-wise and doesn't cross segment boundaries anyway.

Can you do a segmented fold using just a regular fold?

No — fold reduces the entire array; you'd need to split by segments first.

How is nested data parallelism usually flattened?

Turn it into one flat array plus a segment descriptor, then use segmented operators like segmented scan/fold.
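A sketch of the flattened representation and a segmented fold over it (the name `segFold` is invented; real segmented operators avoid the sequential `splitAt` and work in parallel over head flags):

```haskell
-- Nested input [[1,2],[3,4,5],[6]] flattened into one array
-- plus a segment descriptor of lengths.
flatData :: [Int]
flatData = [1, 2, 3, 4, 5, 6]

segLens :: [Int]
segLens = [2, 3, 1]

-- Segmented fold: split the flat array by the descriptor, fold each segment.
segFold :: (b -> a -> b) -> b -> [Int] -> [a] -> [b]
segFold _ _ []       _  = []
segFold f z (n : ns) xs = foldl f z ys : segFold f z ns rest
  where (ys, rest) = splitAt n xs

main :: IO ()
main = print (segFold (+) 0 segLens flatData)   -- one sum per segment
```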


Parallel Scan

What property must a scan operator have?

Associativity: \((a \oplus b) \oplus c = a \oplus (b \oplus c)\). Commutativity is NOT required.

What is the difference between scanl and scanl1?

scanl is exclusive (current element not yet included in the written output). scanl1 is inclusive (current element is included).
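Side by side in the Prelude (note that `scanl` also emits the final total, so its output has one more element, and `scanl1` needs a non-empty input):

```haskell
main :: IO ()
main = do
  print (scanl (+) 0 [1, 2, 3, 4])   -- exclusive flavour: starts at the seed
  print (scanl1 (+)  [1, 2, 3, 4])   -- inclusive: element i is already counted
```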

Is reduce-then-scan (three-phase tiled) efficient?

Yes. \(W = O(n)\), matching sequential.

Is reduce-then-scan (three-phase tiled) optimal?

Yes. \(W = \Theta(n) = \Theta(T_{\text{seq}})\).

Is scan-then-propagate (naive tiled) optimal?

No. \(W = O(n \log n)\), overhead \(= O(\log n)\). It can still be efficient, but it is not optimal.

How does reduce-then-scan communicate between thread blocks?

Via global memory: each block writes its local result to an array, which is then scanned.

How does chained scan communicate between thread blocks?

Blocks wait/spin on the previous block's result within a single kernel.

How do you reconstruct scan input from scan output?

Use a stencil (or zipWith): \(\text{output}[i] - \text{output}[i{-}1]\) gives the original input.
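As a list sketch (the name `unscan` is invented), pairing each inclusive-scan output with its predecessor, using 0 before the first element:

```haskell
-- Undo an inclusive (+)-scan with adjacent differences.
unscan :: Num a => [a] -> [a]
unscan ys = zipWith (-) ys (0 : ys)

main :: IO ()
main = print (unscan (scanl1 (+) [5, 2, 7]))   -- recovers the input
```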

What data structures commonly describe segments for segmented scan?

Head flags or segment lengths/offsets.


Concurrency Primitives

Is IORef read-modify-write atomic?

No. Another thread can interleave between read and write. Use atomicModifyIORef or CAS for atomicity.
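A runnable sketch of the atomic version (thread count and iteration count are arbitrary; completion is signalled with one MVar per worker):

```haskell
import Control.Concurrent
import Control.Monad
import Data.IORef

-- With plain readIORef/writeIORef two threads can both read n and both
-- write n+1, losing an update. atomicModifyIORef' performs the whole
-- read-modify-write as one atomic step.
main :: IO ()
main = do
  counter <- newIORef (0 :: Int)
  dones   <- replicateM 4 newEmptyMVar
  forM_ dones $ \d -> forkIO $ do
    replicateM_ 10000 (atomicModifyIORef' counter (\n -> (n + 1, ())))
    putMVar d ()
  mapM_ takeMVar dones          -- wait for all workers
  readIORef counter >>= print   -- always 40000, no lost updates
</imports-placeholder>
```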

Is MVar read-modify-write atomic?

Yes. takeMVar + putMVar ensures exclusive access (MVar is empty while held).

Can MVars deadlock?

Yes — when multiple MVars are used and threads acquire them in different orders. Single MVar: deadlock-free.

Is a single MVar starvation-free?

Yes — GHC uses FIFO wakeup for blocked threads on a single MVar. Multiple MVars: no such guarantee.

Is STM (TVar) deadlock-free?

Yes. STM uses optimistic concurrency, no locks, so circular wait is impossible.

Is STM starvation-free?

No. A long transaction can be repeatedly aborted by shorter conflicting transactions and re-execute indefinitely — starvation is possible.

What does retry do in STM?

Aborts the transaction and blocks until one of the read TVars changes, then retries.

Does STM guarantee fairness or progress?

No. When a read-set TVar changes, blocked transactions are woken and retried, but a long transaction can still starve.

What does orElse do in STM?

orElse a b: try transaction a; if it calls retry, try b instead.
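A small sketch using GHC's bundled stm library (the helper `takeEither` and its slot encoding are invented for illustration):

```haskell
import Control.Concurrent.STM

-- Take a value from whichever slot is full; if the first is empty its
-- retry aborts that branch and orElse falls through to the second.
-- If both are empty, the whole transaction blocks.
takeEither :: TVar (Maybe a) -> TVar (Maybe a) -> STM a
takeEither a b = takeSlot a `orElse` takeSlot b
  where
    takeSlot v = do
      mx <- readTVar v
      case mx of
        Nothing -> retry
        Just x  -> writeTVar v Nothing >> return x

main :: IO ()
main = do
  a <- newTVarIO Nothing
  b <- newTVarIO (Just "from b")
  atomically (takeEither a b) >>= putStrLn   -- a is empty, so b is tried
```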

What is the CAS loop pattern?

Read old value, compute new value, attempt atomic_compare_exchange. If it fails (value changed), loop and try again.
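A sketch of the loop in Haskell. The `cas` helper here is invented: it simulates compare-and-swap semantics on top of `atomicModifyIORef'` (real code would use a hardware CAS primitive, e.g. `casIORef` from the atomic-primops package):

```haskell
import Data.IORef

-- Simulated compare-and-swap: succeed only if the current value still
-- equals what we read earlier. (Hypothetical helper for illustration.)
cas :: Eq a => IORef a -> a -> a -> IO Bool
cas ref expected new = atomicModifyIORef' ref $ \cur ->
  if cur == expected then (new, True) else (cur, False)

-- The CAS loop: read, compute, attempt the swap; if another thread got
-- in between, the swap fails and we start over.
casModify :: Eq a => IORef a -> (a -> a) -> IO ()
casModify ref f = do
  old <- readIORef ref
  ok  <- cas ref old (f old)
  if ok then return () else casModify ref f

main :: IO ()
main = do
  r <- newIORef (10 :: Int)
  casModify r (+ 5)
  readIORef r >>= print   -- 15
```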


Work-Span Model

What is Work?

Total number of operations, as if run on 1 processor (sequential execution).

What is Span?

Length of the longest dependency chain (critical path with infinite processors).

How many processors are maximally useful?

\(\dfrac{W(n)}{S(n)}\) — the parallelism of the algorithm.

What does 'efficient' mean for a parallel algorithm?

Polylogarithmic span and polylogarithmic overhead.

What does 'optimal' mean?

Polylogarithmic span and constant overhead.

State Brent's theorem.

\(T_p \leq \dfrac{W}{p} + S\). Time with \(p\) processors is at most work/processors plus span.
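Plugging in round numbers (invented purely for illustration):

```latex
W = 10^{6},\quad S = 10^{3},\quad p = 100
\;\Rightarrow\;
T_{100} \le \frac{10^{6}}{100} + 10^{3} = 10^{4} + 10^{3} = 11{,}000 \text{ steps.}
```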

Two parallel recursive calls on n/2, plus O(n) sequential work. What's the span?

\(S(n) = S\!\left(\frac{n}{2}\right) + O(n) = O(n)\) — dominated by the sequential work at top level.

Two parallel recursive calls on n/2, plus O(1) sequential work. What's the span?

\(S(n) = S\!\left(\frac{n}{2}\right) + O(1) = O(\log n)\).


Locking & Synchronization

Name the four conditions for deadlock.

(1) Mutual exclusion, (2) Hold and wait, (3) No preemption, (4) Circular wait.

How do you prevent deadlock?

Break one condition. Most common: impose a total order on lock acquisition to prevent circular wait.
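A sketch of lock ordering with two MVars (the helper `withBoth` is invented, and exception safety is omitted; real code would use bracket):

```haskell
import Control.Concurrent

-- Every thread takes l1 before l2, so no thread can hold l2 while
-- waiting for l1 — the circular-wait condition can never arise.
withBoth :: MVar () -> MVar () -> IO a -> IO a
withBoth l1 l2 action = do
  takeMVar l1            -- always acquired first
  takeMVar l2            -- always acquired second
  r <- action
  putMVar l2 ()          -- released in reverse order
  putMVar l1 ()
  return r

main :: IO ()
main = do
  l1 <- newMVar ()
  l2 <- newMVar ()
  withBoth l1 l2 (putStrLn "critical section")
```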

Why is a single global lock bad for parallelism?

All threads serialize — only one can run at a time. No parallel speedup.

How does MVar work as a lock?

Full MVar = unlocked. takeMVar = acquire (blocks if already taken, empties it). putMVar = release (fills it back).
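The protocol as code (the wrapper name `withLock` is invented; bracket_ makes the release exception-safe):

```haskell
import Control.Concurrent
import Control.Exception (bracket_)

-- Full MVar = unlocked; takeMVar acquires (and empties it, blocking
-- anyone else); putMVar releases (fills it back).
withLock :: MVar () -> IO a -> IO a
withLock lock = bracket_ (takeMVar lock) (putMVar lock)

main :: IO ()
main = do
  lock <- newMVar ()   -- starts full, i.e. unlocked
  withLock lock (putStrLn "in critical section")
```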

What are ghost cells / halo regions in stencil computations?

Boundary copies of neighboring data that a thread/processor keeps locally so it can compute the stencil near partition boundaries.

What must happen to the halo in an iterative stencil computation?

It must be refreshed after every iteration: the neighbors' border cells change each step, so the halo copies must be re-exchanged before the next stencil application.

Why can a classic spin lock deadlock on a GPU?

Because threads in a warp execute in lockstep. If one lane acquires the lock and another lane in the same warp spins waiting for it, divergence can keep replaying the spinning branch, so the holding lane never reaches its release and the warp hangs.


Definitions

Concurrency vs parallelism?

Concurrency = dealing with multiple things at once (can be single core). Parallelism = doing multiple things at once (requires multiple cores).

Task parallelism vs data parallelism?

Task: different threads do different work, explicit sync. Data: same operation on bulk data, implicit sync, scales massively.

AoS vs SoA?

Array of Structures = [{x,y}, {x,y}]. Structure of Arrays = {xs:[...], ys:[...]}. SoA is better for SIMD/GPU (contiguous memory per field).
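In Haskell terms, `unzip` is exactly the AoS-to-SoA conversion (names `aos`/`soa` invented):

```haskell
-- AoS: one list of (x, y) pairs.
aos :: [(Double, Double)]
aos = [(1, 4), (2, 5), (3, 6)]

-- SoA: a pair of lists; each field is now contiguous, which is what
-- SIMD lanes and coalesced GPU loads want.
soa :: ([Double], [Double])
soa = unzip aos

main :: IO ()
main = print soa
```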