INFOB3CC - Concurrency - Practice Exam 1
Duration: 2 hours. Number of questions: 7. Total points: 39.
- The exam is a closed book exam.
- The exam must be made alone. No communication with others is allowed.
- Provide brief and concise answers. Overly verbose responses, or nonsense added to otherwise correct answers, may result in point deductions.
- When asked to explain your choice on a multiple choice question, your reasoning should explain why your chosen answer is correct and why the others are not correct.
- If a question does not give you all the details you need, you may make reasonable assumptions. Your assumptions must be clearly stated.
- You may assume that threads do not crash.
- You have two hours to complete the exam. You can go back to previous questions.
- Good luck! (:
Question 1 (5 pt)
This question is about the hardware of a typical GPU with lockstep execution.
a. (1 pt) What happens when threads executing in lockstep encounter a branch where some threads take the if-path and others take the else-path?
- a. The warp splits into two independent warps, one for each branch.
- b. Both branches are executed; threads not on the active branch are masked (inactive).
- c. The scheduler picks one branch randomly and all threads follow it.
- d. The kernel is aborted and relaunched with a different thread assignment.
b. (1 pt) A GPU kernel reads from an array where thread i in a warp reads element arr[i] (consecutive indices). Another kernel reads from the same array where thread i reads element arr[i * 137 % n] (pseudo-random indices). Which statement is correct?
- a. Both kernels have the same memory performance because they read the same total number of elements.
- b. The first kernel is faster because consecutive accesses are coalesced into fewer memory transactions.
- c. The second kernel is faster because the random pattern distributes load more evenly across memory banks.
- d. Performance is identical because the GPU cache handles both patterns equally well.
c. (1 pt) Which of the following factors limit the occupancy of a Streaming Multiprocessor? Select all that apply.
- a. The number of registers used per thread.
- b. The amount of shared memory used per thread block.
- c. The total size of global memory on the device.
- d. The number of threads per thread block.
- e. The clock frequency of the GPU.
d. (1 pt) Why does higher occupancy generally improve GPU performance?
- a. It increases the clock speed of the Streaming Multiprocessor.
- b. It allows the SM to switch to other warps when one warp stalls on a memory access, hiding latency.
- c. It reduces the total amount of work the GPU has to do.
- d. It enables threads within a warp to execute different instructions simultaneously.
e. (1 pt) Assume we can solve a problem with either (i) structured reads and random writes, or (ii) random reads and structured writes. The reads and writes are non-overlapping. Which option is likely faster on a GPU?
- a. Structured reads and random writes, because reads benefit more from coalescing and writes can be buffered asynchronously.
- b. Random reads and structured writes, because writes must be coalesced but reads can be cached.
- c. They perform the same, since GPUs treat reads and writes symmetrically.
- d. Random reads and structured writes, because random writes would cause race conditions.
Question 2 (6 pt)
In this question, we consider various functions in Haskell to change the value of a mutable variable. For each implementation, indicate which properties hold. Multiple answers are possible per sub-question.
First, consider the following function to change the value of an IORef:
changeIORef :: IORef Int -> IO ()
changeIORef ref = do
  currentValue <- readIORef ref
  let newValue = currentValue * 3 - 1
  writeIORef ref newValue
Assume that the IORef is only accessed via changeIORef. It is accessed from multiple threads.
a. (2 pt) Which properties hold for this implementation?
- a. It performs the change atomically
- b. It is wait-free
- c. It is deadlock-free
- d. It is starvation-free
Now consider an implementation using MVars:
changeMVar :: MVar Int -> IO ()
changeMVar var = do
  currentValue <- takeMVar var
  let newValue = currentValue * 3 - 1
  putMVar var newValue
Assume the MVar is only accessed via changeMVar and that it is filled at the beginning of the program. It is accessed from multiple threads.
b. (2 pt) Which properties hold for this implementation?
- a. It performs the change atomically
- b. It is wait-free
- c. It is deadlock-free
- d. It is starvation-free
Finally, consider an implementation using TVars:
changeTVar :: TVar Int -> IO ()
changeTVar var = atomically $ do
  currentValue <- readTVar var
  let newValue = currentValue * 3 - 1
  writeTVar var newValue
The TVar is only accessed via changeTVar, from multiple threads.
c. (2 pt) Which properties hold for this implementation?
- a. It performs the change atomically
- b. It is wait-free
- c. It is deadlock-free
- d. It is starvation-free
Question 3 (6 pt)
A bicycle repair shop maintains an inventory of spare parts. Each repair job requires a specific set of parts. Multiple mechanics work simultaneously, each taking parts from the shared inventory to complete their assigned job.
For example, the shop has parts: chain (C), brake pad (B), tire (T), spoke (S). Mechanic 1 needs {C, B} for their job. Mechanic 2 needs {B, T} for their job. Mechanic 3 needs {T, S} for their job. Each part has a limited stock count.
Each mechanic must acquire exclusive access to the stock of each required part (to decrement the count and ensure availability) before starting the repair.
a. (1.5 pt) One approach is to use a single global lock that a mechanic must acquire before accessing any part's stock. Why is this a bad solution?
b. (1.5 pt) Suppose instead each part has its own lock (an MVar). A mechanic must acquire the locks for all parts they need before starting. Show how a deadlock can occur. Give a concrete interleaving of steps for two mechanics.
c. (1.5 pt) How can the deadlock from (b) be prevented? Describe the strategy and explain which deadlock condition it breaks.
d. (1.5 pt) Implement the function acquireParts in Haskell using MVars, incorporating the deadlock prevention strategy from (c). Each part is identified by an Int and its stock is stored in an MVar Int. You are given a list of (Int, MVar Int) pairs (part ID and its MVar).
acquireParts :: [(Int, MVar Int)] -> IO [Int]
-- Should acquire all parts (take all MVars), returning the stock values.
-- Must be deadlock-free.
Question 4 (6 pt)
This question is about parallel scan algorithms.
A scan can be executed sequentially with one (memory) read and one (memory) write per element, in linear time.
a. (1 pt) Can a three-phase parallel scan be implemented with only one read and one write per element?
- a. Yes
- b. No
b. (1 pt) Is reduce-then-scan (a three-phase parallel scan) efficient?
- a. Yes
- b. No
c. (1 pt) Is reduce-then-scan (a three-phase parallel scan) optimal?
- a. Yes
- b. No
d. (1 pt) What property is absolutely required for the operator of a parallel scan?
- a. Commutativity: for all x, y: x + y = y + x
- b. Idempotency: for all x: x + x = x
- c. Associativity: for all x, y, z: x + (y + z) = (x + y) + z
- d. Distributivity: for all x, y, z: x * (y + z) = x * y + x * z
e. (1 pt) The input of a scanl with operator + can be reconstructed from its output. Which parallel pattern should be used to convert the output of a scanl back to its corresponding input?
- a. map
- b. fold
- c. scanl
- d. scanr
- e. stencil (or zipWith)
- f. permute
- g. backpermute
f. (1 pt) In a segmented scan, the operator is "lifted" to work on pairs of (flag, value). What property must this lifted operator satisfy for the segmented scan to be parallelizable?
- a. Commutativity
- b. Associativity
- c. Idempotency
- d. Distributivity
Question 5 (8 pt)
A city has n traffic intersections arranged along a main road. Each intersection has a sensor that records the number of cars passing through it during each minute. We are given a one-dimensional array traffic[0..n-1] of integers representing the car counts at each intersection for a single minute.
The city traffic department wants to perform several analyses on this data using data-parallel patterns.
a. (1 pt) The department wants to compute the total number of cars that passed through all intersections combined. Which data-parallel pattern(s) should be used?
b. (1.5 pt) For congestion analysis, the department needs the cumulative car count at each intersection: position i should contain the sum of traffic[0] + traffic[1] + ... + traffic[i-1] (the total number of cars from the start of the road up to but not including intersection i). Which pattern should be used? Give the operator and initial value.
c. (2 pt) The department wants to detect "congestion boundaries": intersections where the traffic count changes sharply compared to the previous intersection. Specifically, intersection i is a congestion boundary if |traffic[i] - traffic[i-1]| > threshold for a given threshold. Produce a boolean array marking the congestion boundaries. Which pattern(s) should be used? Explain.
d. (1.5 pt) Given the boolean array from (c), the department wants to count the total number of congestion boundaries. Describe a two-step approach using data-parallel patterns.
e. (2 pt) Finally, the department wants to extract only the traffic counts at congestion boundaries into a compact array (no gaps). For instance, if intersections 3, 7, and 12 are boundaries, the result should be [traffic[3], traffic[7], traffic[12]]. Describe a multi-step approach using data-parallel patterns. You may use results from previous sub-questions.
Question 6 (4 pt)
Consider the following algorithm, which performs two recursive calls of arrays of half the size and a loop of log(n) iterations:
procedure falcon(ps)
  n = length(ps)
  if n <= 1 { return 0 }
  a = falcon(slice(ps, 0, n * 0.5))
  b = falcon(slice(ps, n * 0.5, n))
  c = 0
  for (i = 0; i < log2(n); i++) {
    c += ps[i]
  }
  return a + b + c
The two recursive calls before the for loop are executed in parallel. The log2 function computes the base-2 logarithm. The array length and slice (view the subarray between two indices, without copying) functions can be executed in constant time.
a. (2 pt) What is the asymptotic span of this algorithm? Show how you computed the span.
b. (2 pt) What is the asymptotic work of this algorithm? Show how you computed the work.
Question 7 (4 pt)
Charles has developed a sequential algorithm to sort a dataset, which runs in O(n^2) time. Tina has the idea for a new algorithm which can be parallelised to run in O(n) steps (span), at the expense of increasing the work to O(n^2 log n).
a. (2 pt) Charles thinks their company will not be able to afford the processors necessary to make the new algorithm worthwhile. How many processors will be required for Tina's algorithm to be worthwhile (asymptotically) over Charles' algorithm? Motivate your response.
b. (2 pt) What is the maximum number of processors that the company should purchase for Tina's algorithm (asymptotically)? Motivate your response.
---
FULL SOLUTIONS
Solution Q1: GPU Architecture MC
Q1a: Answer: b
b is correct. When threads in a warp diverge at a branch, both paths are executed sequentially. Threads not on the currently active path are masked (disabled). This is called branch divergence.
- a is wrong. Warps do not split into sub-warps. The warp remains a single unit of 32 threads; it just serializes the two paths.
- c is wrong. The scheduler does not pick randomly. Both branches are executed deterministically.
- d is wrong. The kernel is not aborted. Branch divergence is handled transparently by the hardware.
Q1b: Answer: b
b is correct. When threads in a warp access consecutive memory addresses, the GPU can coalesce these into a single (or few) memory transaction(s) from a single cache line. Random access patterns like arr[i * 137 % n] scatter across many cache lines, requiring many separate memory transactions.
- a is wrong. Total number of elements read is irrelevant; what matters is the access pattern and how many memory transactions are needed.
- c is wrong. Random patterns do not help with "load distribution" -- there are no memory banks in global memory in the way shared memory has banks. The issue is cache line utilization.
- d is wrong. Caches help with spatial locality (consecutive access). Random access defeats caching.
Q1c: Answer: a, b, d
a is correct. More registers per thread means fewer threads can fit on the SM (fixed register file size), reducing occupancy.
b is correct. More shared memory per block means fewer blocks can be co-resident on the SM, reducing occupancy.
d is correct. The number of threads per block affects how many warps per block, and the SM has a maximum number of threads/warps it can handle.
- c is wrong. Global memory size is a device-level resource, not an SM-level resource. It does not limit how many warps can be active on a single SM.
- e is wrong. Clock frequency does not affect occupancy. Occupancy is about how many warps are active, not how fast they execute.
Q1d: Answer: b
b is correct. Higher occupancy means more warps are resident on the SM. When one warp stalls (e.g., on a global memory access), the SM can immediately switch to another ready warp, hiding the memory latency. This is the GPU's primary latency-hiding mechanism.
- a is wrong. Occupancy does not change clock speed.
- c is wrong. Occupancy does not reduce total work; it improves how efficiently the hardware is utilized.
- d is wrong. Threads within a warp always execute the same instruction (lockstep). Occupancy does not change this.
Q1e: Answer: a
a is correct. Per the study guide: structured reads benefit significantly from memory coalescing (threads in a warp read a contiguous cache line in one transaction). Writes can be buffered and performed asynchronously, so random writes incur less penalty than random reads. Therefore, structured reads + random writes is preferred over random reads + structured writes.
- b is wrong. This is the worse option. Random reads miss out on coalescing, which is the more important optimization.
- c is wrong. GPUs do not treat reads and writes symmetrically. Reads depend more on coalescing; writes can be asynchronous.
- d is wrong. The question states writes are non-overlapping, so race conditions are not an issue.
Solution Q2: Concurrency Properties
Q2a (IORef): Answer: b, c, d
Trace through the code:
1. readIORef ref -- reads the current value (non-blocking, always succeeds)
2. compute newValue
3. writeIORef ref newValue -- writes (non-blocking, always succeeds)
- (a) Atomic? NO. The read and write are two separate operations. Another thread can read (or write) the IORef between steps 1 and 3, causing a lost update (race condition).
- (b) Wait-free? YES. Neither readIORef nor writeIORef ever blocks. The operation always completes in a bounded number of steps regardless of what other threads do.
- (c) Deadlock-free? YES. Since IORef operations do not block, there is no way to create circular waiting.
- (d) Starvation-free? YES. A thread performing this operation is never blocked by other threads; the operation always completes.
Answer: b, c, d. The IORef read-modify-write is not atomic, but it is wait-free, deadlock-free, and starvation-free.
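For contrast, a minimal sketch of how the lost update could be avoided while staying lock-free, using atomicModifyIORef' from Data.IORef (the name changeIORefAtomic is ours, not part of the exam):

```haskell
import Data.IORef

-- Performs the read-modify-write as a single atomic operation.
-- Internally this is a compare-and-swap retry loop, so it is
-- lock-free (no updates are lost), though not strictly wait-free.
changeIORefAtomic :: IORef Int -> IO ()
changeIORefAtomic ref = atomicModifyIORef' ref (\v -> (v * 3 - 1, ()))
```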
Q2b (MVar): Answer: a, c, d
Trace through the code:
1. takeMVar var -- atomically takes the value; MVar becomes empty. Blocks if already empty.
2. compute newValue
3. putMVar var newValue -- puts value back; MVar becomes full.
- (a) Atomic? YES. Between takeMVar and putMVar, no other thread can access the MVar (it is empty, so any other takeMVar will block). The read-modify-write is effectively atomic.
- (b) Wait-free? NO. takeMVar blocks if the MVar is empty (another thread is currently holding it).
- (c) Deadlock-free? YES. There is only a single MVar. A thread takes it, computes, and puts it back. No circular dependency is possible with a single lock. (Deadlock requires multiple locks with circular wait.)
- (d) Starvation-free? YES (for a single MVar). In the course material, blocked threads on a single MVar are woken in FIFO order, which gives the fairness guarantee needed here.
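As an aside, the same take/compute/put pattern can be written with modifyMVar_ (the name changeMVar' is ours); the properties above are unchanged:

```haskell
import Control.Concurrent.MVar

-- Takes the MVar, applies the update, and puts the result back.
-- modifyMVar_ also restores the MVar if the update throws, which does
-- not matter here (threads are assumed not to crash) but is good practice.
changeMVar' :: MVar Int -> IO ()
changeMVar' var = modifyMVar_ var (\v -> pure (v * 3 - 1))
```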
Q2c (TVar): Answer: a, c
Trace through the code:
1. atomically executes the block as a single transaction.
2. readTVar var and writeTVar var happen within the transaction.
- (a) Atomic? YES. STM guarantees the entire atomically block executes as one atomic transaction.
- (b) Wait-free? NO. STM uses optimistic concurrency. If another thread commits a conflicting transaction, this transaction is retried. It may retry multiple times.
- (c) Deadlock-free? YES. STM does not use locks. It uses optimistic concurrency control, so there are no lock orderings and no possibility of circular wait.
- (d) Starvation-free? NO. STM does not guarantee fairness or progress. A long transaction can be retried indefinitely and starve.
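The same transaction can also be written with the modifyTVar' convenience from the stm library; a sketch (the name changeTVar' is ours):

```haskell
import Control.Concurrent.STM

-- The whole read-modify-write still runs inside one atomically block,
-- so the atomicity argument above is unchanged.
changeTVar' :: TVar Int -> IO ()
changeTVar' var = atomically (modifyTVar' var (\v -> v * 3 - 1))
```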
Solution Q3: Locking Scenario
Q3a
A single global lock means only one mechanic can access the inventory at a time. All mechanics are serialized: even if they need completely disjoint sets of parts (e.g., Mechanic 1 needs {C, B} and Mechanic 3 needs {T, S}), they cannot work simultaneously. This eliminates all parallelism and creates a bottleneck, making the system no faster than having a single mechanic.
Q3b
Consider Mechanic 1 needing parts {B, C} and Mechanic 2 needing parts {B, T}. Suppose the mechanics acquire locks in the order they appear in their job description:
- Mechanic 1 acquires the lock for B (takeMVar mvarB)
- Mechanic 2 acquires the lock for T (takeMVar mvarT)
- Mechanic 1 tries to acquire the lock for C -- succeeds (no contention)
This does not deadlock. Let us choose a better example:
Mechanic 1 needs {B, T}, acquires in order B then T. Mechanic 2 needs {T, B}, acquires in order T then B.
- Mechanic 1: takeMVar mvarB -- succeeds, holds B
- Mechanic 2: takeMVar mvarT -- succeeds, holds T
- Mechanic 1: takeMVar mvarT -- blocks (T held by Mechanic 2)
- Mechanic 2: takeMVar mvarB -- blocks (B held by Mechanic 1)
Deadlock! Circular wait: Mechanic 1 holds B, waits for T. Mechanic 2 holds T, waits for B.
Q3c
Impose a total order on lock acquisition. Assign each part a unique ID (or use alphabetical order). All mechanics must acquire locks in increasing order of part ID, regardless of their job description.
This breaks the circular wait condition. If all threads acquire locks in the same global order, a cycle in the wait-for graph is impossible. In the example above, both mechanics would acquire B before T. Mechanic 2 would try B first, and if Mechanic 1 already holds B, Mechanic 2 blocks without holding T -- so Mechanic 1 can proceed to acquire T and finish.
Q3d
import Data.Function (on)
import Data.List (sortBy)

acquireParts :: [(Int, MVar Int)] -> IO [Int]
acquireParts parts = do
  let sorted = sortBy (compare `on` fst) parts -- sort by part ID
  mapM (\(_, mvar) -> takeMVar mvar) sorted
By sorting the parts by their ID before acquiring, we ensure all threads acquire MVars in the same global order, preventing deadlock. mapM processes the list sequentially, taking each MVar in order and collecting the stock values into a list. (Also requires import Control.Concurrent.MVar.)
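A possible release counterpart and usage, under the assumption that stocks are put back after the job (releaseParts is our name, not asked for by the exam; release order does not affect deadlock freedom):

```haskell
import Control.Concurrent.MVar
import Data.List (sortOn)

-- Same acquisition as in the solution, written with sortOn.
acquireParts :: [(Int, MVar Int)] -> IO [Int]
acquireParts parts = mapM (takeMVar . snd) (sortOn fst parts)

-- Hypothetical counterpart: refill the MVars with the (possibly
-- decremented) stock values, in the same sorted order.
releaseParts :: [(Int, MVar Int)] -> [Int] -> IO ()
releaseParts parts stocks =
  mapM_ (uncurry putMVar) (zip (map snd (sortOn fst parts)) stocks)
```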
Solution Q4: Parallel Scan
Q4a: Answer: b (No)
A three-phase parallel scan requires multiple passes over the data (up-sweep reduce, then down-sweep propagation). Each phase reads and writes elements. This requires more than 1 read + 1 write per element total. Only the sequential scan can achieve 1 read + 1 write per element.
Q4b: Answer: a (Yes)
Reduce-then-scan (three-phase parallel scan) has work O(n), which matches the sequential scan's O(n). Since W(n) = O(T_seq(n)), it is efficient by definition.
Q4c: Answer: a (Yes)
Reduce-then-scan is the standard three-phase tiled scan. Its work is O(n), matching the best sequential scan up to a constant factor, so the overhead is constant. With logarithmic span, it is optimal in the course's terminology. Reduce-then-scan is thus both efficient and optimal.
Q4d: Answer: c (Associativity)
The scan operator must be associative: for all x, y, z: x + (y + z) = (x + y) + z. This is required so that the order in which partial results are combined in the tree does not affect the final result. Commutativity is NOT required.
- a is wrong. Commutativity is not needed. Scan preserves the left-to-right order of elements.
- b is wrong. Idempotency (x + x = x) is not required. Standard addition is not idempotent but works fine for scan.
- d is wrong. Distributivity involves two operators and is not relevant to scan.
Q4e: Answer: e (stencil or zipWith)
Given the output of scanl (+) 0, where output[i] = sum of input[0..i-1], we can recover the input by computing input[i] = output[i+1] - output[i]. This is a fixed-size local neighborhood operation (each result depends on two adjacent elements), which is a stencil (or equivalently zipWith on the array and a shifted version of itself).
- a (map) is wrong. Map processes each element independently; we need two adjacent elements.
- b (fold) is wrong. Fold reduces to a single value.
- c (scanl) is wrong. Scanl produces running totals, not differences.
- d (scanr) is wrong. Same reasoning as scanl.
- f (permute) is wrong. No index remapping needed.
- g (backpermute) is wrong. No index remapping needed.
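On a list representation, the recovery step can be sketched as a zipWith of the scan output with its own tail (unscan is our name):

```haskell
-- Recover the input of scanl (+) 0 by adjacent differences:
-- input[i] = output[i+1] - output[i]. This is the stencil/zipWith
-- over the array and a shifted copy of itself.
unscan :: [Int] -> [Int]
unscan out = zipWith (-) (tail out) out
```

For example, scanl (+) 0 [3,1,4] gives [0,3,4,8], and unscan [0,3,4,8] gives back [3,1,4].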
Q4f: Answer: b (Associativity)
The lifted operator in segmented scan must be associative for the same reason as regular scan: the parallel tree-based evaluation requires that regrouping partial computations yields the same result. This is a common exam trap -- students must verify that the proposed lifted operator is indeed associative.
Solution Q5: Applied Data-Parallel Design (Traffic)
Q5a
fold (+) -- sum all elements of the traffic array to get the total car count. Fold reduces the entire array to a single value using addition.
Q5b
scanl (+) 0 -- a left scan with addition and initial value 0. Position i in the output contains traffic[0] + ... + traffic[i-1], which is exactly the definition of a left exclusive scan.
Pattern: scanl. Operator: (+). Initial value: 0.
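A list sketch of this exclusive scan (cumulative is our name). Note that Haskell's scanl produces n+1 prefixes, so we drop the final total to keep length n:

```haskell
-- Position i holds the sum of all elements strictly before i.
cumulative :: [Int] -> [Int]
cumulative traffic = init (scanl (+) 0 traffic)
```

For example, cumulative [3,1,4] is [0,3,4].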
Q5c
stencil (or equivalently zipWith on the array and a shifted copy). Each output element at position i depends on traffic[i] and traffic[i-1] -- a fixed local neighborhood of size 2. The stencil computes |traffic[i] - traffic[i-1]| > threshold, producing a boolean.
- We use stencil because each output depends on a fixed-size neighborhood of the input (current and previous element).
- Alternatively: zipWith on traffic and tail(traffic) (or a shifted version), computing \a b -> abs (a - b) > threshold.
- This cannot be map because each output depends on two elements, not one.
Q5d
Two-step approach:
- map -- convert the boolean array to integers: True -> 1, False -> 0.
- fold (+) -- sum the integer array to get the total count of congestion boundaries.
Step 1 uses map because each output depends on exactly one input at the same index. Step 2 uses fold because we reduce the entire array to a single aggregate value.
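The two steps as a list sketch (countBoundaries is our name):

```haskell
-- Step 1 (map): booleans to 0/1. Step 2 (fold): sum them.
countBoundaries :: [Bool] -> Int
countBoundaries flags = sum (map (\b -> if b then 1 else 0) flags)
```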
Q5e
Multi-step approach to extract/compact traffic values at boundary positions:
- map the boolean boundary array to integers (True -> 1, False -> 0). (Already done in Q5d step 1.)
- scanl (+) 0 on the integer array. This gives each boundary a unique index in the output: for a boundary at position i, scan[i] is its destination index in the compact array.
- permute (scatter): for each position i where boundary[i] == True, write traffic[i] to position scan[i] in the output array.
This is the standard stream compaction pattern: map to flags, scan for indices, permute to compact.
The total number of boundaries (from Q5d or equivalently the last element of the scan + last flag) gives the length of the output array.
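The full compaction pipeline as a sketch (compact is our name), with Data.Array's array construction playing the role of the permute/scatter:

```haskell
import Data.Array

compact :: [Bool] -> [Int] -> [Int]
compact flags vals = elems (array (0, total - 1) writes)
  where
    ints   = map (\b -> if b then 1 else 0) flags  -- step 1: map
    idxs   = init (scanl (+) 0 ints)               -- step 2: exclusive scan
    total  = sum ints                              -- output length (cf. Q5d)
    writes = [(i, v) | (f, v, i) <- zip3 flags vals idxs, f]  -- step 3: scatter
```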
Solution Q6: Work-Span Analysis
Q6a: Span
The two recursive calls execute in parallel, so we take the max (which is just one call since both are size n/2). The for loop runs log2(n) iterations sequentially with O(1) work each, contributing O(log n).
Span recurrence: S(n) = S(n/2) + O(log n), with S(1) = O(1).
Expand:
S(n) = S(n/2) + log(n)
= S(n/4) + log(n/2) + log(n)
= S(n/8) + log(n/4) + log(n/2) + log(n)
= ...
= S(1) + sum_{k=0}^{log(n)-1} log(n / 2^k)
= sum_{k=0}^{log(n)-1} (log(n) - k)
= log(n) * log(n) - sum_{k=0}^{log(n)-1} k
= log^2(n) - log(n)*(log(n)-1)/2
= log^2(n) / 2 + log(n)/2
S(n) = O(log^2 n)
Q6b: Work
Both recursive calls contribute to total work, so we sum them. The for loop does O(log n) work.
Work recurrence: W(n) = 2 W(n/2) + O(log n), with W(1) = O(1).
Apply the Master theorem: a = 2, b = 2, f(n) = log(n).
n^(log_b(a)) = n^(log_2(2)) = n^1 = n.
Compare f(n) = log(n) with n^(log_b(a)) = n. Since log(n) = O(n^(1-epsilon)) for any epsilon > 0 (e.g., epsilon = 0.5: log(n) = O(n^0.5)), we are in Case 1 of the Master theorem.
W(n) = O(n)
Solution Q7: Processor Count
Setup
- Charles' sequential algorithm: T_seq(n) = O(n^2)
- Tina's parallel algorithm: Span S(n) = O(n), Work W(n) = O(n^2 log n)
- By Brent's theorem, Tina's algorithm with p processors runs in time: T_p <= W/p + S = n^2 log(n) / p + n
Q7a: How many processors for Tina's to be worthwhile?
We need T_p < T_seq, i.e.:
n^2 log(n) / p + n < n^2
For this to hold, we need at minimum the dominant term to be smaller:
n^2 log(n) / p < n^2
More precisely, n^2 log(n)/p + n < n^2 requires n^2 log(n)/p < n^2 - n ~ n^2, so p > log(n).
We need p > log(n) processors (asymptotically) for Tina's algorithm to be worthwhile. This is because the extra factor of log(n) in the work must be divided away by having at least log(n) processors, so that the work term W/p becomes comparable to n^2. The span term O(n) is already less than O(n^2), so it does not dominate once the work term is handled.
Q7b: Maximum useful processors?
The maximum useful number of processors is the parallelism: p = W(n) / S(n) = n^2 log(n) / n = n log(n).
The maximum useful number of processors is O(n log n). Beyond this, adding more processors does not reduce the running time because the execution is span-limited: T_p cannot be less than S(n) = O(n), and with p = n log(n) processors we get T_p = n^2 log(n) / (n log(n)) + n = n + n = O(n), which already matches the span.