Lecture 8 Data Parallel Algorithms, aka Tricks with Trees¶
Review: Single Instruction Multiple Data (SIMD)
In SIMD, a single instruction operates on multiple data elements at once.
Data parallelism follows naturally from this: perform the same operation on multiple values (often array elements).
Nowadays, many parallel programming models use some form of data parallelism:
- SIMD units (and previously SIMD supercomputers)
- CUDA / GPUs
- MapReduce
- MPI collectives
Some Basic Operations¶
Unary and Binary¶


Broadcast¶

scalar -> all elements in one vector
Memory Operations¶
Strided and Scatter / Gather
Array assignment works if the arrays are the same shape
For example, with B = [0.0, 1.1, 2.2, 3.3, 4.4], the assignment A = B gives A = [0.0, 1.1, 2.2, 3.3, 4.4].
Arrays may have a stride, i.e., their elements need not be contiguous in memory.
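A minimal Python sketch of strided access, assuming a hypothetical 10-element array `B`, a start offset of 1, and a stride of 2 (none of these values come from the lecture):

```python
# Strided access: pick every `stride`-th element of B, starting at `start`.
# B, start and stride are hypothetical values chosen for illustration.
B = [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9]
start, stride = 1, 2

# Explicit form: element i of A comes from B[start + i * stride].
count = (len(B) - start + stride - 1) // stride  # number of strided elements
A_loop = [B[start + i * stride] for i in range(count)]

# The same access pattern written with slice notation.
A_slice = B[start::stride]

print(A_loop)   # [1.1, 3.3, 5.5, 7.7, 9.9]
print(A_slice)  # [1.1, 3.3, 5.5, 7.7, 9.9]
```

Every element access is independent of the others, which is what makes the operation data parallel.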
Gather (indexed) values from one array
- With X = [3, 0, 4, 2, 1], A = B[X] means A[i] = B[X[i]], i.e. A[0,1,2,3,4] = B[3,0,4,2,1]
- Hence, A = [3.3, 0.0, 4.4, 2.2, 1.1]
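The gather above can be sketched in Python as follows, using the same `B` and `X` as in the example; the function name `gather` is only illustrative:

```python
def gather(src, idx):
    """Return a new list A with A[i] = src[idx[i]] (reads are indexed)."""
    return [src[j] for j in idx]

B = [0.0, 1.1, 2.2, 3.3, 4.4]
X = [3, 0, 4, 2, 1]
A = gather(B, X)
print(A)  # [3.3, 0.0, 4.4, 2.2, 1.1]
```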
Scatter (indexed) values from one array
- A[X] = B, i.e. A[3,0,4,2,1] = B[0,1,2,3,4]
- Element by element: A[3] = B[0] = 0.0, A[0] = B[1] = 1.1, A[4] = B[2] = 2.2, A[2] = B[3] = 3.3, A[1] = B[4] = 4.4
- Hence, A = [1.1, 4.4, 3.3, 0.0, 2.2]
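The scatter above can be sketched the same way. The function name `scatter` and the `size` parameter are illustrative choices, and the sketch assumes `X` contains no duplicate indices (with duplicates, the result would depend on write order):

```python
def scatter(src, idx, size):
    """Return a new list A of length `size` with A[idx[i]] = src[i] (writes are indexed)."""
    out = [None] * size  # positions never written stay None
    for i, j in enumerate(idx):
        out[j] = src[i]
    return out

B = [0.0, 1.1, 2.2, 3.3, 4.4]
X = [3, 0, 4, 2, 1]
A = scatter(B, X, len(B))
print(A)  # [1.1, 4.4, 3.3, 0.0, 2.2]
```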
Mask¶

Reduce¶

Vector -> Scalar
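As a rough illustration of how a reduction maps onto a tree, here is a Python sketch that combines adjacent pairs round by round. The function name `tree_reduce` and the returned round count are illustrative, and the operator is assumed to be associative:

```python
import operator

def tree_reduce(values, op):
    """Reduce a non-empty list with an associative `op`, one tree level per round.

    Returns (result, rounds). All pair combinations within a round are
    independent, so on an ideal machine each round costs one parallel step.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine adjacent pairs (0,1), (2,3), ...
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # an odd leftover element moves up unchanged
        level = nxt
        rounds += 1
    return level[0], rounds

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], operator.add))  # (36, 3)
```

For 8 inputs the tree has 3 levels, matching the log2(n) depth one expects from a balanced reduction tree.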
Scans¶

Two variants: inclusive and exclusive scans

Obviously:
| C | |
|---|---|
| 1 |  | 
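To make the two variants concrete, here is a small sequential Python sketch built on `itertools.accumulate` (the `initial` argument needs Python 3.8+). The function names are illustrative, and the final check states one well-known relationship between the two scans, which may or may not be the one the lecture had in mind:

```python
from itertools import accumulate

def inclusive_scan(xs):
    """out[i] = xs[0] + ... + xs[i]"""
    return list(accumulate(xs))

def exclusive_scan(xs, identity=0):
    """out[i] = xs[0] + ... + xs[i-1], with out[0] = identity"""
    return list(accumulate(xs, initial=identity))[:-1]

x = [1, 2, 3, 4]
inc = inclusive_scan(x)   # [1, 3, 6, 10]
exc = exclusive_scan(x)   # [0, 1, 3, 6]

# The inclusive scan is the exclusive scan plus the input, element by element.
assert inc == [e + v for e, v in zip(exc, x)]
```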
Idealized Hardware and Performance Model¶

Cost on Ideal Machine (Span)¶


Sum Scan in Parallel¶
a.k.a. prefix sum
(1) Pairwise Sum Algorithm
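One common reading of the pairwise-sum approach is a recursive inclusive scan: add adjacent pairs, scan the half-length array recursively, then fill in the remaining positions. The Python sketch below follows that reading. The function name is illustrative and the input length is assumed to be a power of two:

```python
def pairwise_scan(x):
    """Inclusive sum scan via pairwise sums; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [x[0]]
    # 1. Pairwise sums: n/2 independent additions.
    pairs = [x[2 * i] + x[2 * i + 1] for i in range(n // 2)]
    # 2. Recursively scan the half-length array.
    s = pairwise_scan(pairs)
    # 3. Odd positions come straight from s; even positions need one more add.
    out = [0] * n
    for i in range(n // 2):
        out[2 * i + 1] = s[i]
        out[2 * i] = x[0] if i == 0 else s[i - 1] + x[2 * i]
    return out

print(pairwise_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

Each recursion level does a constant number of parallel steps and halves the problem, so the depth is logarithmic in n.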


(2) Non-recursive exclusive scan (Blelloch Algorithm)
Genius! But it's really hard for beginners.
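It may help to see the two phases written out. The sketch below is a sequential Python simulation of the usual up-sweep / down-sweep formulation of the Blelloch exclusive scan; the function name is illustrative and the length is assumed to be a power of two. The inner `for` loops are the parts that would run in parallel:

```python
def blelloch_exclusive_scan(x):
    """Exclusive sum scan, up-sweep then down-sweep; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(x)

    # Up-sweep (reduce): build partial sums up the tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):   # independent iterations
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):   # independent iterations
            left = a[i - d]
            a[i - d] = a[i]      # left child receives the parent's prefix
            a[i] += left         # right child receives prefix + left subtree sum
        d //= 2
    return a

print(blelloch_exclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [0, 1, 3, 6, 10, 15, 21, 28]
```

After the up-sweep the last element holds the total (36 here); the down-sweep replaces it with 0 and distributes the prefixes.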

Applications of Data Parallelism¶
Almost all of these applications use scans.
Stream Compression¶
Use a flag array of 0s and 1s to compress a stream: only the values whose flag is 1 are kept.

How can a tree algorithm be used to implement this 0-1 compression?

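Compression steps
Step 1: Compute the prefix sum (Exclusive Add Scan)
- Run an exclusive add scan over the flags array to obtain the index array
- index = [0, 1, 1, 2, 3, 3, 3, 4]
- The index array gives each kept element's position in the compressed output
Step 2: Scatter the data
- Keep only the values whose flag is 1
- Use the index array to determine the destination
- When flags[i] = 1, write values[i] to result[index[i]]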
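Result mapping example (source indices follow from the index array above)
- values[0] = 3 → result[0] = 3 (because flags[0] = 1)
- values[2] = 4 → result[1] = 4 (because flags[2] = 1)
- values[3] = 1 → result[2] = 1 (because flags[3] = 1)
- values[6] = 3 → result[3] = 3 (because flags[6] = 1)
- values[7] = 1 → result[4] = 1 (because flags[7] = 1)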
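The two steps can be sketched in Python as below. The `flags` array is inferred from the index array given above (an exclusive scan of it reproduces that index array); the kept values match the mapping example, while the `9` entries are hypothetical filler for the dropped positions, whose original values are not given. The scan here is sequential, standing in for a parallel tree scan:

```python
from itertools import accumulate

def compress(values, flags):
    """Keep values[i] where flags[i] == 1, using an exclusive scan plus a scatter."""
    # Step 1: exclusive add scan of the flags gives each kept element's destination.
    scanned = list(accumulate(flags, initial=0))
    total = scanned.pop()          # last entry is the number of kept elements
    index = scanned                # index[i] = number of 1-flags before position i
    # Step 2: scatter the flagged values to their destinations.
    result = [None] * total
    for i, keep in enumerate(flags):
        if keep:
            result[index[i]] = values[i]
    return index, result

flags = [1, 0, 1, 1, 0, 0, 1, 1]       # inferred from the index array in the notes
values = [3, 9, 4, 1, 9, 9, 3, 1]      # 9 = hypothetical filler for dropped elements
index, result = compress(values, flags)
print(index)   # [0, 1, 1, 2, 3, 3, 3, 4]
print(result)  # [3, 4, 1, 3, 1]
```

Because every flagged element gets a distinct destination, the scatter writes never conflict and can all happen in parallel.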
List Ranking with Pointer Doubling¶
Problem:
Given a linked list of N nodes, find the distance (#hops) from each node to the end of the list.
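A sequential Python simulation of pointer doubling might look like this. It assumes the list is stored as a successor array `nxt` in which the tail points to itself; that encoding and the function name `list_rank` are assumptions of this sketch, not taken from the lecture:

```python
def list_rank(nxt):
    """Return rank[i] = number of hops from node i to the end of the list.

    nxt[i] is the index of node i's successor; the tail satisfies nxt[t] == t.
    """
    n = len(nxt)
    nxt = list(nxt)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Repeat until every pointer has reached the tail: about log2(n) rounds.
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # One synchronous parallel step: read the old arrays, write new ones.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank

# List order 0 -> 3 -> 2 -> 1 -> 4, with node 4 as the tail.
print(list_rank([3, 4, 1, 2, 4]))  # [4, 1, 2, 3, 0]
```

Each round doubles the distance every pointer spans, which is why the number of rounds grows only logarithmically with the list length.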

Fibonacci via Matrix Multiply Prefix¶
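A short Python sketch of the idea, assuming the standard identity that the k-th power of M = [[1, 1], [1, 0]] contains F(k) in its off-diagonal entry. A prefix "product" over n copies of M then yields F(1) through F(n). `accumulate` runs sequentially here; because matrix multiplication is associative, any parallel scan could take its place. The helper names are illustrative:

```python
from itertools import accumulate

def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    (a00, a01), (a10, a11) = a
    (b00, b01), (b10, b11) = b
    return ((a00 * b00 + a01 * b10, a00 * b01 + a01 * b11),
            (a10 * b00 + a11 * b10, a10 * b01 + a11 * b11))

M = ((1, 1), (1, 0))

def fib_prefix(n):
    """Return [F(1), ..., F(n)] from the prefix products M, M^2, ..., M^n."""
    powers = accumulate([M] * n, matmul2)   # inclusive scan with matmul as the operator
    return [p[0][1] for p in powers]        # M^k = [[F(k+1), F(k)], [F(k), F(k-1)]]

print(fib_prefix(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```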

Based on this idea, many recurrences can be converted into a matrix-multiplication prefix computation:
(1) Adding n-bit integers in O(log n) time

Just take a look :)
(2) Lexical analysis (tokenizing, scanning)

(3) Inverting triangular n-by-n matrices
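For item (1), one way to see why a scan applies is the classic kill / propagate / generate view of carries, sketched below in Python. This is an equivalent formulation of the carry recurrence rather than the lecture's exact construction; the bit lists are little-endian, and all function names are illustrative. `accumulate` stands in for a parallel scan, which is valid because `combine` is associative:

```python
from itertools import accumulate

def combine(low, high):
    """Carry status of a bit range: the higher part decides unless it only propagates."""
    return low if high == 'p' else high

def add_bits(a_bits, b_bits):
    """Add two equal-length little-endian bit lists; returns len + 1 result bits."""
    # Per-bit status: generate a carry, propagate an incoming carry, or kill it.
    status = ['g' if a & b else 'p' if a ^ b else 'k'
              for a, b in zip(a_bits, b_bits)]
    # Exclusive scan: carries[i] is the status entering bit i (no carry into bit 0).
    carries = list(accumulate(status, combine, initial='k'))
    sum_bits = [a ^ b ^ (c == 'g') for a, b, c in zip(a_bits, b_bits, carries)]
    sum_bits.append(int(carries[-1] == 'g'))   # final carry-out
    return sum_bits

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

print(from_bits(add_bits(to_bits(13, 4), to_bits(11, 4))))  # 24
```

Once the carries are known from the scan, every sum bit is computed independently, so the whole addition has logarithmic depth when the scan is done with a tree.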

Mapping Data Parallelism to Real Hardware¶
Not covered in these notes.