Practice Exam 2 -- INFOB3CC Concurrency
Duration: 2 hours | Closed book | 7 questions | 39 points total
- Provide brief and concise answers.
- When asked to explain your choice on a multiple choice question, your reasoning should explain why your chosen answer is correct and why the others are not correct.
- You may assume that threads do not crash.
Question 1 -- GPU Architecture (5 pts)
This question is about the hardware of a typical GPU with lockstep execution.
1a (1 pt)
A GPU is designed to optimize a different metric than a CPU. Which of the following best describes the primary design goal of a GPU compared to a CPU?
- a. GPUs optimize for low latency on individual operations by using large caches and branch prediction.
- b. GPUs optimize for high throughput by using many simple cores and hiding latency through massive parallelism.
- c. GPUs optimize for sequential instruction throughput by using out-of-order execution.
- d. GPUs optimize for low power consumption by using fewer transistors than CPUs.
1b (1 pt)
Which of the following is stored in shared memory on a GPU?
- a. Data accessible by all threads across all thread blocks in the entire grid.
- b. Data accessible only by a single thread within a warp.
- c. Data accessible by all threads within the same thread block, explicitly managed by the programmer.
- d. Data accessible by all threads within the same warp but not other warps in the block.
- e. Data that is automatically cached from global memory by the hardware with no programmer control.
1c (1 pt)
What is the typical number of threads in a warp on modern NVIDIA GPUs?
- a. 8
- b. 16
- c. 32
- d. 64
- e. 128
1d (1 pt)
Which of the following can limit the occupancy (ratio of active warps to maximum warps) of a Streaming Multiprocessor? (There are multiple correct answers.)
- a. The number of registers used per thread.
- b. The amount of shared memory used per thread block.
- c. The total size of global memory on the device.
- d. The number of threads per thread block.
- e. The clock speed of the GPU.
1e (1 pt)
When threads in a warp access memory, the GPU can combine multiple accesses into a single memory transaction. This optimization is called memory coalescing. Under which access pattern does memory coalescing work best?
- a. Each thread accesses a random location in global memory.
- b. All threads in a warp access the exact same memory address.
- c. Threads in a warp access contiguous memory addresses (thread i accesses address base + i).
- d. Threads access memory in reverse order (thread i accesses address base + (31 - i)).
- e. Memory coalescing is equally effective regardless of access pattern.
Question 2 -- Atomic Operations (6 pts)
Besides atomic compare-and-swap, many processors also feature an atomic fetch-and-or function. Given a pointer (memory location or IORef) and a value, it atomically loads the current value in that location, computes the bitwise OR of the old value and the given value, writes the result back, and returns the old value. It is thus an atomic version of the following function:
int fetch_or(int *variable, int arg) {
    int old_value = *variable;
    *variable = old_value | arg;
    return old_value;
}
Alternatively, it is the atomic equivalent of the following Haskell code:
fetchOr :: IORef Int -> Int -> IO Int
fetchOr ref arg = do
  old <- readIORef ref
  writeIORef ref (old .|. arg)
  return old
Recall that atomic compare-and-swap (atomic_compare_exchange, or casIORef in Haskell) has the following signature:
struct Result { bool success; int original; };
Result atomic_compare_exchange(int *variable, int expected, int replacement);
It is possible to implement a lock using atomic fetch-and-or (without atomic compare-and-swap). This only requires a single variable as shared state. We call this variable state.
2a (1.5 pts)
What value(s) of state should denote that the lock is taken, and which should denote that the lock is free?
Choose the values such that the lock can be implemented using atomic fetch-and-or.
2b (2 pts)
Implement a function to acquire the lock. You may give code either in C or in Haskell. Use atomic_fetch_or to denote the atomic fetch-and-or operation.
2c (1 pt)
Implement a function to release the lock. You may give code either in C or in Haskell.
2d (1.5 pts)
While many processors support atomic_fetch_or natively, there exist some processors that do not. Fortunately, atomic_fetch_or can be implemented as a function in terms of atomic_compare_exchange.
Implement atomic_fetch_or in terms of atomic_compare_exchange (or casIORef).
Question 3 -- Concurrency Properties (6 pts)
In this question, we consider various functions in Haskell to transfer a value between two mutable variables.
3a (2 pts)
First, consider the following function using IORefs with atomicModifyIORef:
transfer :: IORef Int -> IORef Int -> Int -> IO ()
transfer from to amount = do
  atomicModifyIORef from (\v -> (v - amount, ()))
  atomicModifyIORef to (\v -> (v + amount, ()))
Assume the IORefs are accessed from multiple threads, each calling transfer (possibly with different from/to arguments). Which properties hold for this implementation?
- a. Each individual modify (decrement / increment) is performed atomically.
- b. The entire transfer (both modifications together) is performed atomically.
- c. It is wait-free.
- d. It is deadlock-free.
3b (2 pts)
Now consider a transfer using MVars:
transferMVar :: MVar Int -> MVar Int -> Int -> IO ()
transferMVar from to amount = do
  fromVal <- takeMVar from
  toVal <- takeMVar to
  putMVar from (fromVal - amount)
  putMVar to (toVal + amount)
Assume both MVars are full at the beginning of the program. Two threads run concurrently:
- Thread 1 calls transferMVar accountA accountB 10
- Thread 2 calls transferMVar accountB accountA 20
Which properties hold for this implementation?
- a. Each individual transfer is performed atomically (no intermediate state is visible).
- b. It is wait-free.
- c. It is deadlock-free.
- d. It is starvation-free.
3c (2 pts)
Finally, consider a transfer using TVars with STM:
transferSTM :: TVar Int -> TVar Int -> Int -> IO ()
transferSTM from to amount = atomically $ do
  fromVal <- readTVar from
  check (fromVal >= amount) -- 'check' calls 'retry' when the condition is False
  writeTVar from (fromVal - amount)
  toVal <- readTVar to
  writeTVar to (toVal + amount)
Which properties hold for this implementation?
- a. The entire transfer is performed atomically.
- b. It is wait-free.
- c. It is deadlock-free.
- d. It is starvation-free.
Question 4 -- Data-Parallel Pattern Selection (6 pts)
A university wants to process student exam results in parallel. They have the following data:
- An array scores of length n, containing each student's raw exam score (integer, 0--100).
- An array bonusPoints of length n, containing bonus points earned per student (integer, 0--10).
4a (1 pt)
We want to compute the final score for each student, where finalScore[i] = scores[i] + bonusPoints[i]. Which pattern should be used?
| | map | stencil | fold | scan | permute | backpermute | zipWith |
|---|---|---|---|---|---|---|---|
| | A | B | C | D | E | F | G |
| Step 1 | O | O | O | O | O | O | O |
4b (1.5 pts)
We now want to assign letter grades based on the final scores. First, we determine a pass/fail boolean for each student (pass if finalScore >= 55). Then we want to count the total number of passing students. Which patterns should be used?
| | map | stencil | fold | scan | permute | backpermute | zipWith |
|---|---|---|---|---|---|---|---|
| | A | B | C | D | E | F | G |
| Step 1 | O | O | O | O | O | O | O |
| Step 2 | O | O | O | O | O | O | O |
4c (1.5 pts)
For a "grade curve" display, we want to compute for each student (sorted by final score from lowest to highest) the cumulative number of students with that score or lower. Assume the array is already sorted. Which pattern should be used?
| | map | stencil | fold | scan | permute | backpermute | zipWith |
|---|---|---|---|---|---|---|---|
| | A | B | C | D | E | F | G |
| Step 1 | O | O | O | O | O | O | O |
4d (2 pts)
We want to produce a "dean's list" -- a compacted array containing only the final scores of students who scored 90 or above. This requires three steps. Which patterns should be used for each step?
| | map | stencil | fold | scan | permute | backpermute | zipWith |
|---|---|---|---|---|---|---|---|
| | A | B | C | D | E | F | G |
| Step 1 | O | O | O | O | O | O | O |
| Step 2 | O | O | O | O | O | O | O |
| Step 3 | O | O | O | O | O | O | O |
Question 5 -- Applied Data-Parallel Design (8 pts)
A sports analytics company wants to analyze lap times from a multi-stage cycling race. They have a one-dimensional array lapTimes of length n, where lapTimes[i] is the time in seconds that the cyclist took to complete stage i.
5a (1 pt)
The company first wants to compute the speed for each stage. They have a second array distances of length n, where distances[i] is the length of stage i in kilometres. The speed for stage i is distances[i] / lapTimes[i].
Which data-parallel pattern should be used?
5b (1.5 pts)
They want to compute the cumulative elapsed time after each stage. That is, elapsed[i] should be the total time from stage 0 through stage i.
Which data-parallel pattern should be used? Briefly explain why.
5c (1.5 pts)
They want to know the total race time (sum of all lap times). Which data-parallel pattern should be used?
5d (2 pts)
The organizers decide to reverse the order of a contiguous sub-section of stages (stages l through r inclusive) for a "what-if" analysis. Specifically, given the array of lap times and indices l and r, produce a new array where stages l..r are in reverse order and all other stages remain in place.
Which data-parallel pattern should be used? Explain briefly how you would compute the index mapping.
5e (2 pts)
Some stages are marked as "mountain stages" via a boolean array isMountain of length n. The company wants to extract a compacted array containing only the lap times of mountain stages, preserving their original order.
Describe the steps needed using data-parallel patterns. For each step, name the pattern and briefly explain what it computes.
Question 6 -- Work-Span Analysis (4 pts)
Consider the following algorithm, which performs three recursive calls on arrays of one-third the size and a loop of n iterations:
procedure phoenix(ps)
    n = length(ps)
    if n <= 1 { return 0 }
    a = phoenix(slice(ps, 0, n/3))
    b = phoenix(slice(ps, n/3, 2*n/3))
    c = phoenix(slice(ps, 2*n/3, n))
    d = 0
    for (i = 0; i < n; i++) {
        d += ps[i]
    }
    return a + b + c + d
The three recursive calls are executed in parallel. The for loop is sequential. The array length and slice functions can be executed in constant time (view the subarray between two indices, without copying).
6a (2 pts)
What is the asymptotic span of this algorithm? Show how you computed the span.
6b (2 pts)
What is the asymptotic work of this algorithm? Show how you computed the work.
Question 7 -- Efficiency and Optimality (4 pts)
Arjun has developed a sequential algorithm to simulate particle physics, which runs in O(n^2) time.
Bea has designed a new parallel algorithm for the same problem. Her algorithm has:
- Work: O(n^2 log n)
- Span: O(n)
7a (1 pt)
Is Bea's algorithm efficient? Motivate your answer.
7b (1 pt)
Is Bea's algorithm optimal? Motivate your answer.
7c (1 pt)
What is the overhead of Bea's algorithm compared to Arjun's sequential algorithm?
7d (1 pt)
Using Brent's theorem, how many processors (asymptotically) are required for Bea's parallel algorithm to be faster than Arjun's sequential algorithm? Motivate your response.
---
SOLUTIONS
Solution 1 -- GPU Architecture
1a: Answer: b
GPUs optimize for high throughput by using many simple cores and hiding latency through massive parallelism.
- (a) Wrong: This describes a CPU design philosophy (large caches, branch prediction = latency optimization).
- (c) Wrong: Out-of-order execution is a CPU technique for sequential performance.
- (d) Wrong: GPUs devote more transistors to compute units than CPUs do, not fewer; low power consumption is not their primary design goal.
1b: Answer: c
Shared memory is per-block, explicitly managed by the programmer, and accessible by all threads within the same thread block.
- (a) Wrong: That describes global memory (accessible across all blocks).
- (b) Wrong: That describes registers (per-thread).
- (d) Wrong: Shared memory is shared across the entire thread block, not just a single warp.
- (e) Wrong: Shared memory is explicitly managed, not an automatic cache. (L1/L2 caches are automatic, but shared memory is separate and programmer-controlled.)
1c: Answer: c (32)
A warp consists of 32 threads executing in lockstep on NVIDIA GPUs.
- (a), (b) Wrong: Too small.
- (d), (e) Wrong: Too large. 64 is AMD's wavefront size, not NVIDIA's warp size.
1d: Answer: a, b, d
Occupancy is limited by resources per SM: registers per thread (a), shared memory per block (b), and threads per block (d) all consume finite SM resources that determine how many warps can be active simultaneously.
- (c) Wrong: Global memory size does not affect occupancy -- occupancy is about SM resources, not device memory capacity.
- (e) Wrong: Clock speed does not affect the ratio of active warps to maximum warps.
1e: Answer: c
When threads access contiguous addresses, the GPU can combine these into a single (or minimal number of) memory transaction(s).
- (a) Wrong: Random access leads to many separate transactions, the worst case for coalescing.
- (b) Wrong: While this results in one transaction via broadcast, it does not demonstrate coalescing of multiple distinct addresses.
- (d) Wrong: Reverse order may still be coalesced in practice (within the same cache line), but the textbook answer for best coalescing is sequential/contiguous access.
- (e) Wrong: Coalescing is highly dependent on access pattern.
Solution 2 -- Atomic Operations
2a
- state = 0 means the lock is free.
- state = 1 means the lock is taken.
Rationale: fetch_or(&state, 1) on a free lock (state = 0) returns 0 and sets state to 1. If the lock is already taken (state = 1), fetch_or(&state, 1) returns 1 and leaves the state unchanged.
2b
The key insight: atomic_fetch_or(state, 1) returns the old value. If the old value was 0, we changed the lock state from free to taken and acquired it. If the old value was 1, the lock was already taken, so we keep spinning.
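A minimal sketch of this acquire loop in C11, using stdatomic's atomic_fetch_or (the shared state variable is declared here only to keep the sketch self-contained):

```c
#include <stdatomic.h>

// Shared lock state: 0 = free, 1 = taken (as chosen in 2a).
atomic_int state = 0;

// Spin until atomic_fetch_or returns 0: in that case we were the thread
// that flipped state from free (0) to taken (1), so we hold the lock.
void acquire(void) {
    while (atomic_fetch_or(&state, 1) != 0) {
        // old value was 1: lock already taken, keep spinning
    }
}
```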
2c
A direct store to 0 is correct because only the lock holder calls release.
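A corresponding release sketch in C11 (the shared variable is again declared locally for self-containment, and starts taken purely for illustration):

```c
#include <stdatomic.h>

// Shared lock state: 0 = free, 1 = taken. Starts taken for this sketch.
atomic_int state = 1;

// Only the lock holder calls release, so an unconditional atomic store
// of 0 (free) suffices; no read-modify-write is needed.
void release(void) {
    atomic_store(&state, 0);
}
```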
2d
int atomic_fetch_or(int *variable, int arg) {
    int old;
    Result result;
    do {
        old = *variable;
        result = atomic_compare_exchange(variable, old, old | arg);
    } while (!result.success);
    return result.original;
}
This is the standard CAS-loop pattern: read the current value, compute the desired new value (old | arg), then attempt to atomically swap. If another thread modified the variable between our read and the CAS, the CAS fails and we retry.
Solution 3 -- Concurrency Properties
3a: Answer: a, d
- (a) Correct: Each atomicModifyIORef is individually atomic -- the read-modify-write of a single IORef is performed as one atomic step.
- (b) Wrong: The two atomicModifyIORef calls are separate atomic operations. Another thread can observe the state between them (after from is decremented but before to is incremented). Money "disappears" temporarily -- not an atomic transfer.
- (c) Wrong: atomicModifyIORef is atomic for one IORef, but under contention it may retry internally. That means it is not generally wait-free.
- (d) Correct: IORefs never block, so deadlock is impossible.
3b: Answer: a
- (a) Correct: The takeMVar/putMVar pattern ensures that while a thread holds a value (between take and put), no other thread can access that MVar. Since both MVars are taken before either is modified, the transfer is atomic -- no intermediate state is visible.
- (b) Wrong: takeMVar blocks when the MVar is empty. This is not wait-free.
- (c) Wrong: This CAN deadlock! Thread 1 takes accountA then tries to take accountB. Thread 2 takes accountB then tries to take accountA. Both block forever -- classic circular wait deadlock.
- (d) Wrong: Since it can deadlock, it is certainly not starvation-free.
3c: Answer: a, c
- (a) Correct: STM guarantees that the entire atomically block executes as a single atomic transaction. Either all reads/writes are committed together, or none are.
- (b) Wrong: The check (which calls retry) blocks the transaction until the condition becomes true. Also, STM transactions may be retried due to conflicts. This is not wait-free.
- (c) Correct: STM uses optimistic concurrency control (no locks), so deadlock is impossible. Conflicting transactions are simply retried.
- (d) Wrong: STM is NOT starvation-free. A long transaction can be repeatedly aborted by shorter conflicting transactions and re-execute indefinitely. The GHC STM runtime does not guarantee that every transaction eventually commits.
Solution 4 -- Data-Parallel Pattern Selection
4a: Answer: G (zipWith)
finalScore[i] = scores[i] + bonusPoints[i] -- each output depends on two inputs at the same index. This is exactly the definition of zipWith (+) scores bonusPoints.
Not map (single input), not stencil (no neighborhood), not fold (not reducing), not scan (not cumulative).
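Sequentially, zipWith is just an elementwise loop over the two inputs; a small C sketch (array names follow the question, the function name is illustrative):

```c
// zipWith (+): output i depends only on the two inputs at index i,
// so all iterations are independent and can run in parallel.
void computeFinalScores(const int *scores, const int *bonusPoints,
                        int *finalScore, int n) {
    for (int i = 0; i < n; i++) {
        finalScore[i] = scores[i] + bonusPoints[i];
    }
}
```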
4b
- Step 1: A (map) -- Apply a function to each final score independently: pass[i] = finalScore[i] >= 55. Each output depends on exactly one input at the same index.
- Step 2: C (fold) -- Sum all the boolean-as-integer values to get a single count: fold (+) 0 passes. Reduces the entire array to one value.
4c: Answer: D (scan)
A cumulative count means cumulative[i] = number of students at index 0..i. This is a running total / prefix sum, which is exactly scanl (+). Each output depends on all previous elements.
Not fold (that gives a single value). Not map (needs more than one element per output). Not stencil (depends on ALL previous, not just fixed neighborhood).
4d
- Step 1: A (map) -- Mark each student as qualifying or not: qualify[i] = if finalScore[i] >= 90 then 1 else 0. Each output depends on one input at the same index.
- Step 2: D (scan) -- Compute prefix sum of the marks: indices = scanl (+) 0 qualify. This gives each qualifying student a unique output index. Each output depends on all previous elements.
- Step 3: E (permute) or F (backpermute) -- Use the computed indices to write qualifying scores into a compacted output array. This is a scatter/gather using computed indices. permute writes each qualifying input to its computed output position.
This is the standard stream compaction pattern: mark, scan, scatter.
Solution 5 -- Applied Data-Parallel Design
5a: zipWith
speed[i] = distances[i] / lapTimes[i] -- each output depends on two inputs at the same index. Use zipWith (/) distances lapTimes.
5b: scanl
elapsed[i] is the sum of all lap times from index 0 through i. This is a prefix sum / running total: scanl (+) 0 lapTimes. Each output depends on ALL previous elements, which is the defining characteristic of scan.
5c: fold
The total race time is a single aggregate value: the sum of all elements. Use fold (+) 0 lapTimes. This reduces the entire array to one value.
5d: backpermute
Backpermute reads from computed input indices to produce each output. The index function is f(i) = i for positions outside [l, r], and f(i) = l + r - i for positions inside [l, r]: position l reads from r, l+1 from r-1, and so on. This is a structured read pattern where each output reads from one input at a computed index.
Not permute: if both are possible on GPU, we usually prefer backpermute/gather over permute/scatter. Not stencil: the neighborhood is not fixed/local, because we read from potentially distant indices.
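The index mapping can be written as a small pure function; a C sketch (the function name is illustrative):

```c
// Backpermute index function for reversing stages l..r (inclusive):
// output position i reads input position sourceIndex(i, l, r).
int sourceIndex(int i, int l, int r) {
    if (i < l || i > r) {
        return i;          // outside the window: identity mapping
    }
    return l + r - i;      // inside: mirror around the window (l <-> r)
}
```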
5e: Three steps (stream compaction)
- map -- Convert isMountain booleans to integers (1 for mountain, 0 otherwise): flags = map (\b -> if b then 1 else 0) isMountain. Each output depends on one input at the same index.
- scanl -- Compute prefix sum of flags: indices = scanl (+) 0 flags. This assigns each mountain stage a unique consecutive output index. Each output depends on all previous elements.
- permute -- For each position where isMountain[i] is True, write lapTimes[i] to position indices[i] in the output array. This writes to a computed output index.
This is the standard stream compaction pattern.
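A sequential C reference of the three steps (map, scan, permute); on a GPU each loop would be the corresponding data-parallel primitive. Array names follow the question; the helper function is illustrative:

```c
#include <stdbool.h>
#include <stdlib.h>

// Stream compaction: keep lapTimes[i] where isMountain[i] is true,
// preserving order. Returns the number of elements written to out.
int compactMountain(const double *lapTimes, const bool *isMountain,
                    int n, double *out) {
    int *indices = malloc(n * sizeof(int));
    int count = 0;
    // Steps 1+2: map booleans to 0/1 flags and take their exclusive
    // prefix sum; indices[i] is the output slot for a kept element.
    for (int i = 0; i < n; i++) {
        indices[i] = count;
        count += isMountain[i] ? 1 : 0;
    }
    // Step 3: permute/scatter each kept element to its computed slot.
    for (int i = 0; i < n; i++) {
        if (isMountain[i]) {
            out[indices[i]] = lapTimes[i];
        }
    }
    free(indices);
    return count;
}
```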
Solution 6 -- Work-Span Analysis
6a: Span
The three recursive calls execute in parallel, so we take the max (all are size n/3, so max = S(n/3)). The for loop is sequential with n iterations = O(n).
Recurrence: S(n) = S(n/3) + O(n)
Solving by expansion:
S(n) = O(n) + O(n/3) + O(n/9) + O(n/27) + ...
= O(n) * (1 + 1/3 + 1/9 + 1/27 + ...)
= O(n) * (1 / (1 - 1/3))
= O(n) * 3/2
= O(n)
This is a geometric series dominated by the largest term. By the Master theorem: a=1, b=3, f(n)=n, n^(log_3(1)) = n^0 = 1 < n, so case 3 applies.
Span: O(n)
6b: Work
All three recursive calls contribute to work, plus the O(n) loop, giving the recurrence W(n) = 3 W(n/3) + O(n).
By the Master theorem: a=3, b=3, so n^(log_3(3)) = n^1 = n. We have f(n) = O(n) = O(n^(log_b(a))), so this is case 2.
Work: O(n log n)
Solution 7 -- Efficiency and Optimality
Given:
- Arjun's sequential algorithm: T_seq = O(n^2)
- Bea's parallel algorithm: Work W = O(n^2 log n), Span S = O(n)
7a: Is Bea's algorithm efficient?
No. In the course terminology, an algorithm is efficient when it has polylogarithmic span and polylogarithmic overhead. Bea's algorithm has logarithmic overhead, but its span is O(n), which is not polylogarithmic.
7b: Is Bea's algorithm optimal?
No. An algorithm is optimal when it has polylogarithmic span and constant overhead. Bea's algorithm has linear span and logarithmic overhead, so it is not optimal.
7c: What is the overhead?
Overhead = W_parallel / W_sequential = O(n^2 log n) / O(n^2) = O(log n).
The parallel algorithm does a factor of log n more total work than the sequential algorithm.
7d: How many processors for speedup?
By Brent's theorem, the parallel time with p processors is: T_p <= W/p + S = (n^2 log n)/p + n.
For Bea's algorithm to be faster than Arjun's, we need T_p < T_seq:
(n^2 log n)/p + n < n^2
The span term n is lower order, so the condition reduces to (n^2 log n)/p < n^2, which gives p > log n.
We need asymptotically more than log n processors (p = omega(log n)) for the parallel algorithm to be worthwhile over the sequential one. With p = Theta(log n) processors, the work term (n^2 log n)/p = Theta(n^2) merely matches the sequential time, and the span term O(n) is lower order. With slightly more than log n processors, we start gaining speedup.
The maximum useful number of processors is the parallelism: W/S = (n^2 log n) / n = O(n log n). Beyond this, adding processors gives no further benefit.