Mr. Peabody and Sherman

Cumulative Sum

by: burt rosenberg
at: university of miami
date: 28 sep 2022

Goals

Thinking in parallel.

Specific steps

The git home for the class is found at csc-courses/accelerate.


initial array:

   +---+---+---+---+---+---+---+---+
   | 0   1   2   3   4   5   6   7 |
   +---+---+---+---+---+---+---+---+

add in the lower half of the block pair:

       +---+   +---+   +---+   +---+
+      | 0       2       4       6 |
       +---+   +---+   +---+   +---+
========================================
   +---+---+---+---+---+---+---+---+
   | 0   1 | 2   3 | 4   5 | 6   7 |
   +---+---+---+---+---+---+---+---+

blocks now of size 2, 
add in lower half of the block pair:

           +---+---+       +---+---+
+          | 1   1           5   5 |
           +---+---+       +---+---+
========================================
   +---+---+---+---+---+---+---+---+
   | 0   1   2   3 | 4   5   6   7 |
   +---+---+---+---+---+---+---+---+
   
blocks are now of size 4,
add in lower half of the block pair:

                   +---+---+---+---+
+                  | 3   3   3   3 |
                   +---+---+---+---+
========================================
   +---+---+---+---+---+---+---+---+
   | 0   1   2   3   4   5   6   7 |
   +---+---+---+---+---+---+---+---+

block size 8 solved.
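
To make the phases concrete, here is the same run traced with values, taking the initial array to be a[i] = i. Note the boxed numbers in the diagrams above are cell indices. A minimal sketch in Python:

    a = [0, 1, 2, 3, 4, 5, 6, 7]
    # phase 1: blocks 1 -> 2, add cell i-1 into each odd cell i
    for i in (1, 3, 5, 7):
        a[i] += a[i - 1]
    # a is now [0, 1, 2, 5, 4, 9, 6, 13]
    # phase 2: blocks 2 -> 4, add the top of each bottom block (cells 1 and 5)
    a[2] += a[1]; a[3] += a[1]
    a[6] += a[5]; a[7] += a[5]
    # a is now [0, 1, 3, 6, 4, 9, 15, 22]
    # phase 3: blocks 4 -> 8, add the top of the bottom block (cell 3)
    for i in (4, 5, 6, 7):
        a[i] += a[3]
    # a is now [0, 1, 3, 6, 10, 15, 21, 28], the cumulative sums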

partial-sums.ipynb

The task is, given an array a[n], to calculate A[i] = sum_{j=0..i} a[j].
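
For reference, the sequential computation is a single left-to-right pass. A minimal sketch (the function name is ours, not the notebook's):

    def cumulative_sum(a):
        # returns A with A[i] = a[0] + a[1] + ... + a[i]
        A, total = [], 0
        for x in a:
            total += x
            A.append(total)
        return A

    assert cumulative_sum([0, 1, 2, 3]) == [0, 1, 3, 6]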

Method 1

The Python notebook gives two methods that can be parallelized for the GPU. The first is a bottom-up method which maintains as its loop invariant that the partial-sum problem is solved within each block, the block size in phase i being B = 2^i.

In phase 1, this consists of updating a[i] += a[i-1] for all odd i.

To move to the next phase, the blocks are paired by their parity in the sequence of blocks, and every element in the top block is incremented by the value of the topmost element of the bottom block.

The read and write access of this method is favorable. In each phase any cell is written at most once. Each cell is read at most once (assuming a broadcast read), and the reads and writes are disjoint. There are n/2 threads used in each of log(n) phases.
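
A serial sketch of the whole method follows, with the inner loop standing in for the n/2 threads of a phase. The assignment of work to thread t is one plausible choice, not necessarily the notebook's:

    def cumsum_bottom_up(a):
        # Method 1 sketch, in place; assumes len(a) is a power of 2
        n = len(a)
        B = 1
        while B < n:
            # one phase: "thread" t serves block pair number t // B
            for t in range(n // 2):
                pair, r = divmod(t, B)
                src = 2 * B * pair + B - 1    # topmost cell of the bottom block
                dst = 2 * B * pair + B + r    # cell r of the top block
                a[dst] += a[src]
            B *= 2
        return a

    print(cumsum_bottom_up([0, 1, 2, 3, 4, 5, 6, 7]))
    # prints [0, 1, 3, 6, 10, 15, 21, 28]

Because a phase writes only to top halves and reads only from bottom halves, the serialized loop computes the same result the parallel threads would.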

Method 2

The second method keeps, for block size B, the loop invariant that

    a[i] = sum_{j} a'[i + B*j]

for every index i, the sum running over all j >= 0 with i + B*j < n, starting with B = n/2 and updating until B = 1, and where we prime the a to designate the original values in the array.

Note that for this description, we are flipping the direction of the cumulative sum, so that the final a[i] is the sum of all a'[j] for j >= i. This simplifies the case where n is not a pure power of 2, because the loop invariant then speaks of chains that start at a definite index and continue upward only as far as the array size n allows.

Update by halving the block size, and noting that

    sum_{j} a'[i + (B/2)*j] = sum_{j} a'[i + B*j] + sum_{j} a'[i + (B/2) + B*j]

which, by the invariant, means the update is a[i] += a[i + B/2], using the values held at the start of the phase.
The data access seems a little problematic, as most cells are both read from and written to in a phase. So the phase must be divided into a read sub-phase, then a write sub-phase.

The technique of synchronizing by kernel launches will need to store the read values between launches, but this incurs a memory overhead of n/2 cells. While some synchronization can be done by barriers within a thread block, that is only possible across the threads of one block, limited to 1024 threads. So a software mechanism is needed.
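
Below is a serial sketch of the second method with each phase split into the two sub-phases, the scratch list saved standing in for the values stored between launches (about n/2 cells in the first phase). It computes the flipped cumulative sum and works for any n; again the names are ours:

    def cumsum_top_down(a):
        # Method 2 sketch: on return, a[i] = sum of the original a[j], j >= i
        n = len(a)
        B = 1
        while B < n:        # round n up to a power of 2
            B *= 2
        while B > 1:
            half = B // 2
            # read sub-phase: one thread per cell i with i + half < n
            saved = [a[i + half] for i in range(n - half)]
            # write sub-phase: safe now that all reads are done
            for i in range(n - half):
                a[i] += saved[i]
            B = half
        return a

    print(cumsum_top_down([0, 1, 2, 3, 4, 5, 6, 7]))
    # prints [28, 28, 27, 25, 22, 18, 13, 7]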

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

author: burton rosenberg
update: 28 sep 2022