Skip to content

Concurrency Exam Prep

Permute vs Backpermute

Permute vs Backpermute on GPU

Backpermute / gather: each output position computes which input position to read from.
Output writes are regular/structured.
Reads may be irregular if the index function jumps around.
Permute / scatter: each input position computes which output position to write to.
Input reads are regular/structured.
Writes may be irregular, and multiple inputs may target the same output.

Performance Rule of Thumb

If both formulations are possible, prefer backpermute / gather over permute / scatter.
Random reads are slower than structured reads, but random writes are also expensive.
Scatter is usually worse than gather because it also introduces extra synchronisation/cache-line traffic, even when there is no true collision.

Collisions

permute needs a combining function when multiple inputs map to the same output index.
On the GPU, collisions are commonly handled with:
an associative and commutative merge operation when possible
an atomic update / CAS loop for word-sized values
per-element locking as a last resort

Common Exam Intuition

Backpermute is the natural fit for reorder, slice, reverse, and transpose-style problems.
Permute is the natural fit for scatter, histogram-like writes, and stream compaction.
Matrix transpose is usually expressed as backpermute, often with tiling to improve locality.