Permute vs Backpermute on GPU
- Backpermute / gather: each output position computes which input position to read from.
- Output writes are regular/structured.
- Reads may be irregular if the index function jumps around.
- Permute / scatter: each input position computes which output position to write to.
- Input reads are regular/structured.
- Writes may be irregular, and multiple inputs may target the same output.
- If both formulations are possible, prefer backpermute / gather over permute / scatter.
- Random reads are slower than structured reads, but random writes are also expensive.
- Scatter is usually worse than gather because it also introduces extra synchronisation/cache-line traffic, even when there is no true collision.
Collisions
permute needs a combining function when multiple inputs map to the same output index.
- On the GPU, collisions are commonly handled with:
- an associative and commutative merge operation when possible
- an atomic update / CAS loop for word-sized values
- per-element locking as a last resort
Common Exam Intuition
- Backpermute is the natural fit for reorder, slice, reverse, and transpose-style problems.
- Permute is the natural fit for scatter, histogram-like writes, and stream compaction.
- Matrix transpose is usually expressed as backpermute, often with tiling to improve locality.