Lecture 8 Data Parallel Algorithms, aka Tricks with Trees¶
Review: Single Instruction Multiple Data (SIMD)
In SIMD, a single instruction operates on multiple data elements at once.
From this we naturally arrive at data parallelism: performing the same operation on multiple values (often array elements).
Nowadays, many parallel programming models use some data parallelism.
- SIMD units (and previously SIMD supercomputers)
- CUDA / GPUs
- MapReduce
- MPI collectives
Some Basic Operations¶
Unary and Binary¶
Broadcast¶
Scalar -> Vector: copy one scalar into every element of a vector
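A minimal sketch of broadcast, followed by a binary element-wise operation that consumes the broadcast vector. The function names `broadcast` and `vector_add` are hypothetical and chosen only for illustration; the loops are serial stand-ins for what a SIMD unit would do in one step.

```c
#include <stddef.h>

/* Broadcast: copy the scalar s into every element of a[0..n-1]. */
void broadcast(double *a, size_t n, double s) {
    for (size_t i = 0; i < n; i++) {
        a[i] = s;
    }
}

/* Binary element-wise operation: c[i] = a[i] + b[i].
 * Every iteration is independent, so all of them could run in parallel. */
void vector_add(double *c, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

Adding a scalar to a vector can then be expressed as a broadcast followed by `vector_add`.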
Memory Operations¶
Strided and Scatter / Gather
Array assignment works if the arrays have the same shape. For example, with B = [0.0, 1.1, 2.2, 3.3, 4.4], the assignment A = B gives A = [0.0, 1.1, 2.2, 3.3, 4.4].
An array section may have a stride, i.e., its elements need not be contiguous in memory.
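A strided access can be sketched as follows. The helper name `strided_copy` and its parameter layout are assumptions made for this illustration, not the code from the lecture.

```c
#include <stddef.h>

/* Copy count elements from src to dst, where consecutive elements are
 * src_stride apart in src and dst_stride apart in dst. */
void strided_copy(double *dst, size_t dst_stride,
                  const double *src, size_t src_stride,
                  size_t count) {
    for (size_t i = 0; i < count; i++) {
        dst[i * dst_stride] = src[i * src_stride];
    }
}
```

For example, copying with `src_stride = 2` picks every other element of the source array.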
Gather (indexed) values from one array
With B = [0.0, 1.1, 2.2, 3.3, 4.4] and X = [3, 0, 4, 2, 1], A = B[X] means A[0,1,2,3,4] = B[3,0,4,2,1]
- Hence, A = [3.3, 0.0, 4.4, 2.2, 1.1]
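The gather A = B[X] can be written as a plain loop. This is a sketch with a hypothetical function name `gather`; each iteration reads through the index array and is independent of the others.

```c
#include <stddef.h>

/* Gather: a[i] = b[x[i]] for i in 0..n-1.
 * Every x[i] must be a valid index into b. */
void gather(double *a, const double *b, const size_t *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[x[i]];
    }
}
```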
Scatter (indexed) values from one array
With X = [3, 0, 4, 2, 1], A[X] = B means A[3,0,4,2,1] = B[0,1,2,3,4]:
A[3] = B[0] = 0.0, A[0] = B[1] = 1.1, A[4] = B[2] = 2.2, A[2] = B[3] = 3.3, A[1] = B[4] = 4.4
- Hence, A = [1.1, 4.4, 3.3, 0.0, 2.2]
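The scatter A[X] = B is the mirror image of gather: the index array selects where to write rather than where to read. A sketch with a hypothetical function name `scatter`:

```c
#include <stddef.h>

/* Scatter: a[x[i]] = b[i] for i in 0..n-1.
 * Every x[i] must be a valid index into a. If x contains duplicates,
 * the final value at that position depends on the write order. */
void scatter(double *a, const double *b, const size_t *x, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[x[i]] = b[i];
    }
}
```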
Mask¶
Reduce¶
Vector -> Scalar
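A sum reduction can be organized as a tree so that each level combines pairs of partial results. The sketch below simulates that pattern serially; the name `tree_reduce_sum` is hypothetical, and the input array is used as scratch space (an assumption made to keep the example short).

```c
#include <stddef.h>

/* Tree-shaped sum reduction over a[0..n-1]. Overwrites a as scratch.
 * Each pass of the outer loop is one tree level; the iterations of the
 * inner loop touch disjoint elements and could run in parallel. */
double tree_reduce_sum(double *a, size_t n) {
    if (n == 0) {
        return 0.0;
    }
    for (size_t stride = 1; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            a[i] += a[i + stride];
        }
    }
    return a[0];
}
```

The number of levels is about log2(n), which is the span of the reduction on an idealized machine.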
Scans¶
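Two variants: inclusive and exclusive scans. An inclusive scan includes element i in output i; an exclusive scan includes only the elements before i.
Obviously, for a sum scan: inclusive[i] = exclusive[i] + a[i]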
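The two scan variants can be sketched serially as below. The function names are hypothetical; the only difference between the two loops is whether the output is written before or after the current element is added.

```c
#include <stddef.h>

/* Inclusive sum scan: out[i] = in[0] + ... + in[i]. */
void inclusive_scan(const int *in, int *out, size_t n) {
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        running += in[i];
        out[i] = running;
    }
}

/* Exclusive sum scan: out[i] = in[0] + ... + in[i-1], with out[0] = 0. */
void exclusive_scan(const int *in, int *out, size_t n) {
    int running = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}
```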
Idealized Hardware and Performance Model¶
Cost on Ideal Machine (Span)¶
Sum Scan in Parallel¶
aka prefix sum
(1) Pairwise Sum Algorithm
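One way to sketch the pairwise sum idea: add adjacent pairs, recursively scan the half-length array of pair sums, then fill in the even positions from the scanned pair sums. The name `pairwise_scan`, the power-of-two restriction, and the use of `malloc` for temporaries are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Inclusive sum scan via recursive pairwise sums.
 * n must be a power of two. */
void pairwise_scan(const int *x, int *out, size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);
    if (n == 1) {
        out[0] = x[0];
        return;
    }
    size_t half = n / 2;
    int *pairs = malloc(half * sizeof *pairs);
    int *scanned = malloc(half * sizeof *scanned);
    if (pairs == NULL || scanned == NULL) {
        abort();
    }
    /* Step 1: pairwise sums (all independent). */
    for (size_t i = 0; i < half; i++) {
        pairs[i] = x[2 * i] + x[2 * i + 1];
    }
    /* Step 2: recursively scan the pair sums. */
    pairwise_scan(pairs, scanned, half);
    /* Step 3: odd positions come straight from the recursive result;
     * even positions add their own element to the previous pair's prefix. */
    for (size_t i = 0; i < half; i++) {
        out[2 * i + 1] = scanned[i];
        out[2 * i] = (i == 0) ? x[0] : scanned[i - 1] + x[2 * i];
    }
    free(pairs);
    free(scanned);
}
```

Each recursion level halves the problem, so there are about log2(n) levels, and the loops within a level have no dependencies between iterations.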
(2) Non-recursive exclusive scan (Blelloch Algorithm)
Genius! But it's really hard for beginners.
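For readers who find the picture hard to follow, here is a serial sketch of the non-recursive exclusive scan as an up-sweep followed by a down-sweep. The function name `blelloch_scan` and the power-of-two restriction are assumptions of this sketch; each inner loop touches disjoint elements, so its iterations could run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* In-place exclusive sum scan. n must be a power of two. */
void blelloch_scan(int *a, size_t n) {
    assert(n > 0 && (n & (n - 1)) == 0);

    /* Up-sweep: build partial sums up the tree (a reduction). */
    for (size_t d = 1; d < n; d *= 2) {
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            a[i] += a[i - d];
        }
    }

    /* The root now holds the total; replace it with the identity. */
    a[n - 1] = 0;

    /* Down-sweep: each node passes its value to its left child and
     * (its value + the old left child) to its right child. */
    for (size_t d = n / 2; d >= 1; d /= 2) {
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
    }
}
```

Both sweeps take about log2(n) levels and perform O(n) additions in total.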
Applications of Data Parallelism¶
Almost all of these use scans.
Stream Compression¶
Use a 0/1 flag array to compress a stream: keep only the values whose flag is 1.
How can a tree algorithm implement this 0-1 compression?
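Compression steps
Step 1: Compute the prefix sum (Exclusive Add Scan)
- Run an exclusive add scan over the flags array to get the index array
- index = [0, 1, 1, 2, 3, 3, 3, 4]
- The index array gives each element's position in the compressed output
Step 2: Scatter the data
- Keep only the values whose flag is 1
- Use the index array to determine the output position
- When flags[i]=1, store values[i] in result[index[i]]
Result mapping example
- values[0]=3 → result[0]=3 (because flags[0]=1)
- values[2]=4 → result[1]=4 (because flags[2]=1)
- values[3]=1 → result[2]=1 (because flags[3]=1)
- values[6]=3 → result[3]=3 (because flags[6]=1)
- values[7]=1 → result[4]=1 (because flags[7]=1)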
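The two steps above can be sketched as a scan followed by a scatter. The function name `compress_stream` is hypothetical, and the scan is written as a serial loop for brevity; a tree-based exclusive scan would take its place in a parallel version.

```c
#include <stddef.h>

/* Keep values[i] where flags[i] is nonzero, preserving order.
 * index is caller-provided scratch of length n.
 * Returns the number of elements written to result. */
size_t compress_stream(const int *values, const int *flags,
                       size_t *index, int *result, size_t n) {
    /* Step 1: exclusive add scan of the flags. */
    size_t running = 0;
    for (size_t i = 0; i < n; i++) {
        index[i] = running;
        running += (flags[i] != 0);
    }
    /* Step 2: scatter the flagged values to their computed positions. */
    for (size_t i = 0; i < n; i++) {
        if (flags[i]) {
            result[index[i]] = values[i];
        }
    }
    return running;
}
```

With flags = [1, 0, 1, 1, 0, 0, 1, 1], the scan produces index = [0, 1, 1, 2, 3, 3, 3, 4] and five values are kept.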
List Ranking with Pointer Doubling¶
Problem:
Given a linked list of N nodes, find the distance (#hops) from each node to the end of the list.
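A serial sketch of pointer doubling for this problem. It assumes the list is stored as an array of successor indices in which the tail points to itself; that representation, the name `list_rank`, and the temporary buffers are choices made for this illustration. The buffers simulate the synchronous update that parallel hardware would perform: every node reads the old values before any node writes new ones.

```c
#include <stdlib.h>
#include <string.h>

/* next[i] is the index of node i's successor; the tail satisfies
 * next[tail] == tail. On return, rank[i] is the number of hops from
 * node i to the tail. Overwrites next.
 * Returns 0 on success, -1 if allocation fails. */
int list_rank(size_t *next, size_t *rank, size_t n) {
    if (n == 0) {
        return 0;
    }
    size_t *new_next = malloc(n * sizeof *new_next);
    size_t *new_rank = malloc(n * sizeof *new_rank);
    if (new_next == NULL || new_rank == NULL) {
        free(new_next);
        free(new_rank);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        rank[i] = (next[i] == i) ? 0 : 1;
    }
    /* Each round doubles the distance every pointer jumps. */
    for (size_t hops = 1; hops < n; hops *= 2) {
        for (size_t i = 0; i < n; i++) {
            new_rank[i] = rank[i] + rank[next[i]];
            new_next[i] = next[next[i]];
        }
        memcpy(rank, new_rank, n * sizeof *rank);
        memcpy(next, new_next, n * sizeof *next);
    }
    free(new_next);
    free(new_rank);
    return 0;
}
```

The outer loop runs about log2(N) times, and every node is updated independently within a round.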
Fibonacci via Matrix Multiply Prefix¶
Based on this idea, we can convert many recurrences into a prefix computation over matrix multiplication:
(1) Adding n-bit integers in O(log n) time
Just take a look :)
(2) Lexical analysis (tokenizing, scanning)
(3) Inverting triangular n-by-n matrices
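The Fibonacci case from this section can be sketched as an inclusive scan whose operator is 2-by-2 matrix multiplication, using the identity M^k = [[F(k+1), F(k)], [F(k), F(k-1)]] for M = [[1, 1], [1, 0]]. The type `Mat2` and the function names are hypothetical. The scan is written serially here; because matrix multiplication is associative, a tree-based scan could replace the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* 2x2 matrix [[a, b], [c, d]]. */
typedef struct {
    uint64_t a, b, c, d;
} Mat2;

static Mat2 mat2_mul(Mat2 x, Mat2 y) {
    Mat2 r = {
        x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
        x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d
    };
    return r;
}

/* Fill fib[i] = F(i+1) for i in 0..n-1, with F(1) = F(2) = 1.
 * Results are only meaningful while the values fit in uint64_t. */
void fib_prefix(uint64_t *fib, size_t n) {
    const Mat2 step = {1, 1, 1, 0};
    Mat2 prefix = {1, 0, 0, 1}; /* identity matrix */
    for (size_t i = 0; i < n; i++) {
        prefix = mat2_mul(prefix, step); /* prefix == M^(i+1) */
        fib[i] = prefix.b;               /* top-right entry is F(i+1) */
    }
}
```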
Mapping Data Parallelism to Real Hardware¶
Ignored here.