Lecture 8 Data Parallel Algorithms, aka Tricks with Trees¶

Review: Single Instruction Multiple Data (SIMD)

In SIMD, a single instruction operates on multiple data elements at once.

This leads naturally to data parallelism: performing the same operation on multiple values (often array elements).

Many parallel programming models today use some form of data parallelism:

SIMD units (and previously SIMD supercomputers)
CUDA / GPUs
MapReduce
MPI collectives

Some Basic Operations¶

Unary and Binary¶

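
As a minimal sketch of elementwise operations, the Python below applies one unary and one binary operation to whole arrays. The helper names `unary_map` and `binary_map` are hypothetical, and the list comprehensions stand in for what data-parallel hardware would do in a single step.

```python
import math

def unary_map(f, a):
    """Apply a unary function to every element of a."""
    return [f(x) for x in a]

def binary_map(f, a, b):
    """Apply a binary function to matching elements of a and b (same shape)."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same shape")
    return [f(x, y) for x, y in zip(a, b)]

A = [1.0, 4.0, 9.0, 16.0]
B = [1.0, 2.0, 3.0, 4.0]
roots = unary_map(math.sqrt, A)               # elementwise square root
sums = binary_map(lambda x, y: x + y, A, B)   # elementwise addition
```

Every output element depends only on the inputs at the same position, which is why these operations parallelize trivially.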

Broadcast¶

Broadcast copies a scalar to all elements of a vector (scalar -> vector).
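
A small sketch of broadcast in Python, assuming plain lists as vectors. The names `broadcast` and `add_scalar` are made up for illustration.

```python
def broadcast(scalar, n):
    """Copy one scalar into every element of a length-n vector."""
    return [scalar] * n

def add_scalar(a, s):
    """Add scalar s to each element of a by first broadcasting s to a's shape."""
    s_vec = broadcast(s, len(a))
    return [x + y for x, y in zip(a, s_vec)]

shifted = add_scalar([1, 2, 3, 4], 10)
```

Once the scalar has been broadcast, the addition is an ordinary elementwise binary operation.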

Memory Operations¶

Strided and Scatter / Gather

Array assignment works if the arrays are the same shape

C
A: double [0:4]
B: double [0:4] = [0.0, 1.1, 2.2, 3.3, 4.4]

A=B

Then A = [0.0, 1.1, 2.2, 3.3, 4.4].

May have a stride, i.e., not be contiguous in memory

C
A = B [0:4:2] // copy with stride 2 (every other element)

// now A = [0.0, 2.2, 4.4] (indices 0, 2, 4; the range 0:4 is inclusive)

C: double [0:4, 0:4] // C Matrix: 5x5

// Now C = 
// 00 01 02 03 04
// 10 11 12 13 14
// 20 21 22 23 24
// 30 31 32 33 34
// 40 41 42 43 44

A = C [*,3] // copy column of C, A = [03 13 23 33 43]

Gather (indexed) values from one array

C
X: int [0:4] = [3, 0, 4, 2, 1] // a permutation of indices 0-4 
A = B[X] // A now is [3.3, 0.0, 4.4, 2.2, 1.1]

A=B[X]: A[0,1,2,3,4] = B[3,0,4,2,1]
Hence, A = [3.3, 0.0, 4.4, 2.2, 1.1]

Scatter (indexed) values from one array

C
A[X] = B // A now is [1.1, 4.4, 3.3, 0.0, 2.2]

A[X] = B: A[3,0,4,2,1] = B[0,1,2,3,4]
A[3] = B[0] = 0.0, A[0] = B[1] = 1.1, A[4] = B[2] = 2.2, A[2] = B[3] = 3.3, A[1] = B[4] = 4.4
Hence, A = [1.1, 4.4, 3.3, 0.0, 2.2]
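
The memory operations above can be mirrored in ordinary Python as a rough check of the index arithmetic. This is only a sketch: Python slices exclude the upper bound, so the notes' inclusive `B[0:4:2]` is written `B[0:5:2]`, and the 5x5 matrix holds the strings "ij" used in the picture of C.

```python
B = [0.0, 1.1, 2.2, 3.3, 4.4]
X = [3, 0, 4, 2, 1]  # a permutation of indices 0-4

# Strided copy: every other element of B
strided = B[0:5:2]

# Column copy: C[*,3] from a 5x5 matrix whose entry (i, j) is the string "ij"
C = [[f"{i}{j}" for j in range(5)] for i in range(5)]
column = [row[3] for row in C]

# Gather: A[i] = B[X[i]]
gathered = [B[x] for x in X]

# Scatter: A[X[i]] = B[i]
scattered = [None] * len(B)
for i, x in enumerate(X):
    scattered[x] = B[i]
```

Gather reads through the index array and scatter writes through it, so with a permutation X the two results are inverse rearrangements of each other.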

Mask¶

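
The figure for this section is missing, so the following is a sketch under the usual meaning of a mask: an operation is applied only at positions where the mask is 1, and other positions keep their old value. The function name `masked_add` and the sample data are made up for illustration.

```python
def masked_add(a, b, mask):
    """Return a + b where mask is 1, and a unchanged where mask is 0."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

A = [1, 2, 3, 4, 5]
B = [10, 20, 30, 40, 50]
M = [1, 0, 1, 1, 0]
result = masked_add(A, B, M)
```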

Reduce¶

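Reduce combines all elements of a vector into a single scalar using an associative operator such as + or max (vector -> scalar).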
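
One way to picture a parallel reduce is as a tree: adjacent pairs are combined level by level until one value remains. The sketch below simulates that order sequentially. `tree_reduce` is a hypothetical name, and the operator is assumed to be associative.

```python
import operator

def tree_reduce(op, a):
    """Reduce a non-empty list with an associative op, combining pairs level by level."""
    level = list(a)
    while len(level) > 1:
        # All pairs within one level are independent, so a level is one parallel step
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired last element moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

total = tree_reduce(operator.add, [1, 2, 3, 4, 5])
biggest = tree_reduce(max, [3, 9, 2])
```

The number of levels grows like log2 of the input length, which is where the "tricks with trees" in the lecture title come from.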

Scans¶

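A scan computes every prefix of a reduction: output element i combines the input elements up to position i.

Two variants: inclusive and exclusive scans. An inclusive scan includes element i itself in output i; an exclusive scan stops at element i-1 and starts from the operator's identity.

The two are related elementwise, where ◎ is the scan operator: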

C
scan_inclusive(X) = X ◎ scan_exclusive(X).
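
A sequential reference version of both scans makes the relation easy to check. This is a sketch, not a parallel implementation. The `identity` parameter is an assumption added here so that the exclusive scan has a starting value.

```python
import operator

def scan_exclusive(op, identity, x):
    """Output i combines x[0..i-1]; output 0 is the identity."""
    out, acc = [], identity
    for v in x:
        out.append(acc)
        acc = op(acc, v)
    return out

def scan_inclusive(op, identity, x):
    """Output i combines x[0..i]."""
    out, acc = [], identity
    for v in x:
        acc = op(acc, v)
        out.append(acc)
    return out

X = [1, 2, 3, 4]
exc = scan_exclusive(operator.add, 0, X)
inc = scan_inclusive(operator.add, 0, X)
# Elementwise combination of the exclusive scan with the input itself
combined = [operator.add(e, v) for e, v in zip(exc, X)]
```

For a commutative operator such as + the order of the two operands does not matter. For a non-commutative one the exclusive prefix has to go on the left, as in the code.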

Idealized Hardware and Performance Model¶


Cost on Ideal Machine (Span)¶


Sum Scan in Parallel¶

Also known as prefix sum.

(1) Pairwise Sum Algorithm

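
The pairwise approach can be sketched recursively: sum adjacent pairs, scan the half-length array, then fill in the remaining positions. The code below is one reading of that idea for an inclusive sum scan. It is written sequentially, and the name `scan_pairwise` is hypothetical.

```python
def scan_pairwise(x):
    """Inclusive sum scan via pairwise sums; recursion depth is about log2(len(x))."""
    n = len(x)
    if n <= 1:
        return list(x)
    # Step 1: sum adjacent pairs (all pairs are independent: one parallel step)
    pairs = [x[i] + x[i + 1] for i in range(0, n - 1, 2)]
    # Step 2: recursively scan the half-length array
    s = scan_pairwise(pairs)
    # Step 3: odd positions take a scanned pair sum directly; even positions
    # add their own element to the prefix of the pairs before them
    out = [0] * n
    for i in range(n):
        if i % 2 == 1:
            out[i] = s[i // 2]
        elif i == 0:
            out[i] = x[0]
        else:
            out[i] = s[i // 2 - 1] + x[i]
    return out

prefix = scan_pairwise([3, 1, 4, 1, 5, 9, 2, 6])
```

Steps 1 and 3 each take constant parallel time, and the problem halves at every level, so the span is logarithmic in the input length.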

(2) Non-recursive exclusive scan (Blelloch Algorithm)

Genius! But it's really hard for beginners.

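
To make the non-recursive version less mysterious, here is a sketch of the up-sweep/down-sweep formulation commonly attributed to Blelloch, simulated sequentially in Python. It assumes the input length is a power of two. The function name is made up, and the inner loops are the parts that would run in parallel.

```python
def blelloch_exclusive_scan(x):
    """Exclusive sum scan (up-sweep then down-sweep); len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(x)
    # Up-sweep (reduce): build partial sums in place, doubling the stride each level
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):   # iterations are independent
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):   # iterations are independent
            left = a[i - d]
            a[i - d] = a[i]    # left child receives the parent's prefix
            a[i] += left       # right child receives prefix + left subtree sum
        d //= 2
    return a

scanned = blelloch_exclusive_scan([3, 1, 4, 1, 5, 9, 2, 6])
```

After the up-sweep the last element holds the total. Replacing it with 0 and walking back down turns every stored partial sum into the sum of everything to its left.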

Applications of Data Parallelism¶

Almost all of these applications are built on scans.

Stream Compression¶

Use a 0/1 flag array to compress a stream: keep only the elements whose flag is 1.

How can a tree algorithm (a scan) implement this 0-1 compression?

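Compression steps

Step 1: Compute the prefix sum (exclusive add scan)

Run an exclusive add scan over the flags array to obtain the index array:
index = [0, 1, 1, 2, 3, 3, 3, 4]
The index array gives each element's position in the compressed output.

Step 2: Scatter the data

Keep only the values whose flag is 1.
Use the index array to determine each destination.
When flags[i] = 1, store values[i] in result[index[i]].

Example of the resulting mapping

values[0]=3 → result[0]=3 (because flags[0]=1)
values[2]=4 → result[1]=4 (because flags[2]=1)
values[3]=1 → result[2]=1 (because flags[3]=1)
values[6]=3 → result[3]=3 (because flags[6]=1)
values[7]=1 → result[4]=1 (because flags[7]=1)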
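
The two steps can be sketched in Python as follows. The flags array is an assumption reconstructed from the index array and the five kept values, and the entries of `values` at positions with flag 0 are made-up filler. The function names are hypothetical.

```python
def exclusive_add_scan(flags):
    """Return (exclusive prefix sums of flags, total number of set flags)."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out, total

def compress(values, flags):
    """Keep values[i] where flags[i] == 1, preserving order (scan, then scatter)."""
    index, count = exclusive_add_scan(flags)
    result = [None] * count
    for i, f in enumerate(flags):     # each i is independent: a parallel scatter
        if f == 1:
            result[index[i]] = values[i]
    return result

flags = [1, 0, 1, 1, 0, 0, 1, 1]      # assumed: inferred from the index array
values = [3, 0, 4, 1, 0, 0, 3, 1]     # entries with flag 0 are made-up filler
index, count = exclusive_add_scan(flags)
result = compress(values, flags)
```

Because the scan hands every kept element a distinct destination, the scatter has no write conflicts and can run fully in parallel.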

List Ranking with Pointer Doubling¶

Problem:

Given a linked list of N nodes, find the distance (#hops) from each node to the end of the list.

C
d(n) = 0 if n.next is null
     = 1 + d(n.next) otherwise

Approach: put a processor at every node.

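
Pointer doubling can be sketched as repeated rounds in which every node adds its successor's distance to its own and then jumps to its successor's successor. The Python below simulates those synchronous rounds. Storing the list as an array of successor indices and the name `list_rank` are assumptions made for illustration.

```python
def list_rank(nxt):
    """nxt[i] is the index of node i's successor, or None at the end of the list.
    Returns d, where d[i] is the number of hops from node i to the end."""
    n = len(nxt)
    nxt = list(nxt)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    # Each round doubles the distance every pointer covers: about log2(n) rounds
    while any(p is not None for p in nxt):
        new_d, new_nxt = list(d), list(nxt)
        for i in range(n):                 # one (virtual) processor per node
            if nxt[i] is not None:
                new_d[i] = d[i] + d[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
        d, nxt = new_d, new_nxt            # synchronous update of all nodes
    return d

# The list 2 -> 0 -> 3 -> 1 -> end, stored in arbitrary array order
ranks = list_rank([3, None, 0, 1])
```

Reading from the old arrays and writing to new ones mimics processors that all step in lockstep. Updating in place would mix values from different rounds.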

Fibonacci via Matrix Multiply Prefix¶

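
Since the figure is missing, the following sketch assumes the standard formulation: with M = [[1, 1], [1, 0]], the k-th power of M contains F(k+1), F(k) and F(k-1), so an inclusive scan over n copies of M under matrix multiplication yields all Fibonacci numbers up to F(n). The function names are hypothetical and the scan is written sequentially.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices given as ((a00, a01), (a10, a11))."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )

def inclusive_scan(op, xs):
    """Sequential reference scan; op is associative, so a tree-based
    parallel scan would produce the same results."""
    out = []
    for x in xs:
        out.append(x if not out else op(out[-1], x))
    return out

def fibonacci_prefix(n):
    """Return [F(1), ..., F(n)] from a matrix-multiply scan over n copies of M."""
    M = ((1, 1), (1, 0))
    powers = inclusive_scan(matmul2, [M] * n)   # powers[k-1] is M to the k-th power
    return [p[0][1] for p in powers]            # M^k = [[F(k+1), F(k)], [F(k), F(k-1)]]

fibs = fibonacci_prefix(10)
```

The point is not speed for Fibonacci itself but the pattern: a linear recurrence becomes a scan as soon as one step can be expressed as an associative operation.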

Based on this idea, many recurrences can be converted into a matrix-multiplication prefix (scan) computation:

(1) Adding n-bit integers in O(log n) time

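The carries are the sequential part: the carry into bit i depends on the carry into bit i-1. Writing that dependence as an associative operator (a 2x2 Boolean matrix product) lets a scan compute all carries in O(log n) steps.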
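
As a hedged illustration of n-bit addition by scan, the sketch below uses the generate/propagate pair form of the carry recurrence, which is a compact way of writing the 2x2 Boolean matrix product. Bits are stored least significant first. The names `combine` and `add_bits` are made up, and the scan itself is written sequentially.

```python
def combine(lo, hi):
    """Compose carry behaviour: lo covers the lower bits, hi the bits above them.
    Each argument is a (generate, propagate) pair."""
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def add_bits(a, b):
    """Add two equal-length bit lists (least significant bit first).
    Returns len(a) + 1 sum bits, also least significant bit first."""
    n = len(a)
    gp = [(x & y, x ^ y) for x, y in zip(a, b)]   # per-bit (generate, propagate)
    # Inclusive scan with combine; sequential here, but combine is associative,
    # so a tree scan would need only O(log n) parallel steps
    prefix = []
    for item in gp:
        prefix.append(item if not prefix else combine(prefix[-1], item))
    carries = [0] + [g for g, _ in prefix]        # carries[i] = carry into bit i
    return [gp[i][1] ^ carries[i] for i in range(n)] + [carries[n]]

# 11 + 6 = 17, with bits listed least significant first
total_bits = add_bits([1, 1, 0, 1], [0, 1, 1, 0])
```

Once all carries are known, each sum bit is an independent XOR, so everything outside the scan is an elementwise operation.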

(2) Lexical analysis (tokenizing, scanning)


(3) Inverting triangular n-by-n matrices


Mapping Data Parallelism to Real Hardware¶

Not covered in these notes.