Lecture 8 Data Parallel Algorithms, aka Tricks with Trees

Review: Single Instruction Multiple Data (SIMD)

SIMD means that a single instruction operates on multiple data elements at once.

From this, Data Parallelism follows naturally: perform the same operation on multiple values (often array elements).

Nowadays, many parallel programming models use some data parallelism.

  1. SIMD units (and previously SIMD supercomputers)
  2. CUDA / GPUs
  3. MapReduce
  4. MPI collectives

Some Basic Operations

Unary and Binary

Broadcast

scalar -> all elements in one vector
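
A minimal Python sketch of the unary, binary, and broadcast operations, using plain lists rather than any real SIMD API. The helper names `unary`, `binary`, and `broadcast` are hypothetical, chosen for illustration only. Each loop iteration is independent, so conceptually every element could be handled by its own lane or processor.

```python
import math

def unary(f, a):
    """Apply f to every element of a (conceptually all at once)."""
    return [f(x) for x in a]

def binary(op, a, b):
    """Combine two equal-length vectors element by element."""
    assert len(a) == len(b), "shapes must match"
    return [op(x, y) for x, y in zip(a, b)]

def broadcast(s, n):
    """Replicate the scalar s into a vector of length n."""
    return [s] * n

a = [1.0, 4.0, 9.0]
b = [10.0, 20.0, 30.0]

roots = unary(math.sqrt, a)                   # [1.0, 2.0, 3.0]
sums = binary(lambda x, y: x + y, a, b)       # [11.0, 24.0, 39.0]
# Scalar-times-vector: broadcast the scalar first, then multiply elementwise.
scaled = binary(lambda x, y: x * y, a, broadcast(2.0, len(a)))  # [2.0, 8.0, 18.0]
```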

Memory Operations

Strided and Scatter / Gather

Array assignment works if the arrays are the same shape

A: double [0:4]
B: double [0:4] = [0.0, 1.1, 2.2, 3.3, 4.4]

A = B

Then A = [0.0, 1.1, 2.2, 3.3, 4.4].

An array section may have a stride, i.e., its elements need not be contiguous in memory

A = B [0:4:2] // copy with stride 2 (every other element)

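// the selected elements are [0.0, 2.2, 4.4]: bounds are inclusive (as in B: double [0:4]), so indices 0, 2, 4 are copied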

C: double [0:4, 0:4] // C Matrix: 5x5

// Now C = 
// 00 01 02 03 04
// 10 11 12 13 14
// 20 21 22 23 24
// 30 31 32 33 34
// 40 41 42 43 44

A = C [*,3] // copy column of C, A = [03 13 23 33 43]
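
A small Python sketch of the strided copy and the column copy, under two assumptions: the notation's upper bound is inclusive (as the declaration `B: double [0:4]` with five elements suggests), and the matrix is stored row-major. Python's own slice end is exclusive, so the bound is written as 5 here. The names `strided`, `flat`, and `column_strided` are illustrative only.

```python
B = [0.0, 1.1, 2.2, 3.3, 4.4]

# B[0:4:2] with an inclusive upper bound selects indices 0, 2, 4.
# Python's slice end is exclusive, so the equivalent slice is B[0:5:2].
strided = B[0:5:2]                     # [0.0, 2.2, 4.4]

# 5x5 matrix whose entry (i, j) is the two-digit label "ij".
C = [[f"{i}{j}" for j in range(5)] for i in range(5)]

# Column 3 of C: one element from every row.
column = [row[3] for row in C]         # ['03', '13', '23', '33', '43']

# In a row-major layout the same column is a stride-5 access into flat memory.
flat = [x for row in C for x in row]
column_strided = flat[3::5]
```

The last two lines show why a column copy counts as a strided memory operation: consecutive column elements sit one full row (5 slots) apart.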

Gather (indexed) values from one array

X: int [0:4] = [3, 0, 4, 2, 1] // a permutation of indices 0-4 
A = B[X] // A now is [3.3, 0.0, 4.4, 2.2, 1.1]
  1. A=B[X]: A[0,1,2,3,4] = B[3,0,4,2,1]
  2. Hence, A = [3.3, 0.0, 4.4, 2.2, 1.1]

Scatter (indexed) values from one array

A[X] = B // A now is [1.1, 4.4, 3.3, 0.0, 2.2]
  1. A[X] = B: A[3,0,4,2,1] = B[0,1,2,3,4]
  2. A[3] = B[0] = 0.0, A[0] = B[1] = 1.1, A[4] = B[2] = 2.2, A[2] = B[3] = 3.3, A[1] = B[4] = 4.4
  3. Hence, A = [1.1, 4.4, 3.3, 0.0, 2.2]
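
The gather and scatter examples can be checked with a few lines of Python. This is only a sequential sketch with illustrative variable names (`gathered`, `scattered`); on a data-parallel machine every index would be handled at once, which is safe here because `X` is a permutation and no two writes hit the same slot.

```python
B = [0.0, 1.1, 2.2, 3.3, 4.4]
X = [3, 0, 4, 2, 1]            # a permutation of indices 0-4

# Gather: A[i] = B[X[i]] (indexed reads)
gathered = [B[x] for x in X]   # [3.3, 0.0, 4.4, 2.2, 1.1]

# Scatter: A[X[i]] = B[i] (indexed writes)
scattered = [None] * len(B)
for i, x in enumerate(X):
    scattered[x] = B[i]        # [1.1, 4.4, 3.3, 0.0, 2.2] once the loop ends

# Gathering the scattered array with the same X undoes the scatter.
round_trip = [scattered[x] for x in X]
```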

Mask

Reduce

Vector -> Scalar
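
A reduction is usually evaluated as a tree, which is where the "tricks with trees" in the lecture title come from. The sketch below is an assumption about how such a tree could be written, not the lecture's own code; `tree_reduce` is a hypothetical helper that also counts the number of levels. It needs an associative operator and a non-empty input.

```python
def tree_reduce(op, values):
    """Reduce a non-empty list with an associative op, one tree level at a time.

    Returns (result, number_of_levels).
    """
    level = list(values)
    steps = 0
    while len(level) > 1:
        # All pairs on one level are independent, so they could run in parallel.
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])   # odd element is carried up unchanged
        level = paired
        steps += 1
    return level[0], steps

total, steps = tree_reduce(lambda x, y: x + y, [1, 2, 3, 4, 5, 6, 7, 8])
# total == 36 after steps == 3 levels (log2 of 8)
```

Pairs are always combined left to right, so the sketch also works for associative operators that are not commutative, such as string concatenation.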

Scans

Two Variants: Inclusive and Exclusive Scans

For an add scan, the two variants are related elementwise by:

scan_inclusive(X) = X + scan_exclusive(X)
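
A short Python sketch of the two scan variants for addition, written sequentially with `itertools.accumulate` purely to pin down the definitions. The function names mirror the notation above; the identity element 0 is specific to the add scan.

```python
from itertools import accumulate

def scan_inclusive(xs):
    """out[i] = xs[0] + ... + xs[i]"""
    return list(accumulate(xs))

def scan_exclusive(xs):
    """out[i] = xs[0] + ... + xs[i-1], with out[0] = 0 (the identity for +)."""
    inclusive = list(accumulate(xs))
    return ([0] + inclusive[:-1]) if xs else []

X = [1, 2, 3, 4]
inc = scan_inclusive(X)    # [1, 3, 6, 10]
exc = scan_exclusive(X)    # [0, 1, 3, 6]

# Elementwise relation between the two variants for an add scan.
relation_holds = inc == [x + e for x, e in zip(X, exc)]
```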

Idealized Hardware and Performance Model

Cost on Ideal Machine (Span)

Sum Scan in Parallel

Also known as prefix sum.

(1) Pairwise Sum Algorithm
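
One common formulation of the pairwise scheme, sketched in Python under the assumption that the input length is a power of two: add adjacent pairs, recursively scan the half-length array, then fill in the even positions. The function name `pairwise_scan` is hypothetical, and the figures for this section may present the steps slightly differently.

```python
def pairwise_scan(xs):
    """Inclusive sum scan; len(xs) is assumed to be a power of two."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # 1. Pairwise sums of neighbours (all independent).
    pairs = [xs[2 * i] + xs[2 * i + 1] for i in range(n // 2)]
    # 2. Recursively scan the half-length array.
    sub = pairwise_scan(pairs)
    # 3. Odd positions copy the recursive result; even positions add their
    #    own value to the prefix that ends just before them.
    out = [0] * n
    for i in range(n // 2):
        out[2 * i + 1] = sub[i]
        out[2 * i] = xs[2 * i] + (sub[i - 1] if i > 0 else 0)
    return out

prefix = pairwise_scan([1, 2, 3, 4, 5, 6, 7, 8])
# [1, 3, 6, 10, 15, 21, 28, 36]
```

Each recursion level halves the problem and does a constant number of parallel steps, so the depth grows with log n.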

(2) Non-recursive exclusive scan (Blelloch Algorithm)

Genius! But it's really hard for beginners.
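
For readers who find the diagram hard to follow, here is a sequential Python sketch of the usual two-phase description (up-sweep, then down-sweep). It assumes a non-empty input whose length is a power of two, and the name `blelloch_exclusive_scan` is illustrative. The inner loops at each distance `d` touch disjoint elements, which is what makes each level parallel.

```python
def blelloch_exclusive_scan(xs):
    """Exclusive sum scan; len(xs) must be a non-zero power of two."""
    a = list(xs)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    # Up-sweep (reduce): build partial sums up the implicit tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: clear the root, then push prefixes back down.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]    # left child receives the parent's prefix
            a[i] += left       # right child receives prefix + left subtree sum
        d //= 2
    return a

scanned = blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
# [0, 3, 4, 11, 11, 15, 16, 22]
```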

Applications of Data Parallelism

Almost all of these applications use scans.

Stream Compression

Use a 0/1 flag array to compress a stream: keep only the elements whose flag is 1

How can a tree algorithm (a scan) implement this 0/1 compression?

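Compression steps

Step 1: Compute the prefix sum (Exclusive Add Scan)

  • Run an exclusive add scan over the flags array to obtain the index array
  • index = [0, 1, 1, 2, 3, 3, 3, 4]
  • index[i] is the position element i takes in the compressed output (meaningful only where flags[i]=1)

Step 2: Scatter the data

  • Keep only the values whose flag is 1
  • Use the index array to determine each destination
  • When flags[i]=1, write values[i] to result[index[i]]

Example mapping to result

  • values[0]=3 → result[0]=3 (because flags[0]=1)
  • values[2]=4 → result[1]=4 (because flags[2]=1)
  • values[3]=1 → result[2]=1 (because flags[3]=1)
  • values[6]=3 → result[3]=3 (because flags[6]=1)
  • values[7]=1 → result[4]=1 (because flags[7]=1)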
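
The two steps can be sketched in Python as follows. The `flags` array is the one implied by the index array `[0, 1, 1, 2, 3, 3, 3, 4]`; the values at the dropped positions (2, 5, 2 here) are hypothetical fillers, since only the kept values appear in the notes. The helper name `compress` is also an assumption.

```python
from itertools import accumulate

def compress(values, flags):
    """Keep values[i] where flags[i] == 1, using an exclusive add scan + scatter."""
    inclusive = list(accumulate(flags))
    # Exclusive scan: number of kept elements strictly before position i.
    index = [s - f for s, f in zip(inclusive, flags)]
    total = inclusive[-1] if inclusive else 0
    result = [None] * total
    # Scatter: kept elements have distinct targets, so the writes are independent.
    for i, f in enumerate(flags):
        if f == 1:
            result[index[i]] = values[i]
    return index, result

values = [3, 2, 4, 1, 5, 2, 3, 1]   # entries with flag 0 are made-up fillers
flags = [1, 0, 1, 1, 0, 0, 1, 1]
index, result = compress(values, flags)
# index  == [0, 1, 1, 2, 3, 3, 3, 4]
# result == [3, 4, 1, 3, 1]
```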

List Ranking with Pointer Doubling

Problem:

Given a linked list of N nodes, find the distance (#hops) from each node to the end of the list.

d(n) = 0                if n.next is null
     = 1 + d(n.next)    otherwise

Approach: put a processor at every node
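
A sequential simulation of pointer doubling in Python, assuming the list is stored as an array of successor indices (`None` marks the tail). The representation and the name `list_rank` are assumptions made for this sketch. Each round, every node adds its successor's distance to its own and then jumps its pointer to the successor's successor, so the number of rounds grows with log N.

```python
def list_rank(next_node):
    """next_node[i] is the index of node i's successor, or None at the tail.

    Returns dist, where dist[i] is the number of hops from node i to the tail.
    """
    n = len(next_node)
    nxt = list(next_node)
    dist = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(p is not None for p in nxt):
        # Double-buffer so every "processor" reads the previous round's state.
        new_dist = list(dist)
        new_nxt = list(nxt)
        for i in range(n):                     # conceptually one processor per node
            if nxt[i] is not None:
                new_dist[i] = dist[i] + dist[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]       # pointer doubling
        dist, nxt = new_dist, new_nxt
    return dist

# List order 2 -> 0 -> 3 -> 1 -> end, stored in arbitrary array order.
ranks = list_rank([3, None, 0, 1])
# ranks == [2, 0, 3, 1]
```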

Fibonacci via Matrix Multiply Prefix
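
The idea rests on the standard identity that the n-th power of the matrix [[1, 1], [1, 0]] holds F(n+1), F(n), F(n), F(n-1). An inclusive scan of matrix products over n copies of that matrix therefore yields every Fibonacci number up to F(n). The Python sketch below uses `itertools.accumulate` as a sequential stand-in for the scan; because matrix multiplication is associative, any parallel scan could replace it. The names `matmul2` and `fib_prefix` are hypothetical.

```python
from itertools import accumulate

def matmul2(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )

M = ((1, 1), (1, 0))

def fib_prefix(n):
    """Return [F(1), ..., F(n)] via an inclusive scan of n copies of M."""
    powers = accumulate([M] * n, matmul2)    # M, M^2, ..., M^n
    return [p[0][1] for p in powers]         # the top-right entry of M^k is F(k)

fibs = fib_prefix(8)
# [1, 1, 2, 3, 5, 8, 13, 21]
```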

Based on this idea, many recursive problems can be converted into a matrix-multiplication prefix computation:

(1) Adding n-bit integers in O(log n) time

Just take a look :)
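
For a concrete look, here is a sketch of the standard carry-lookahead formulation, which may differ in detail from the lecture's figure. Each bit position is summarized by a (generate, propagate) pair, and composing these pairs is associative, so all carries come out of a single scan. The names `combine` and `add_bits` are hypothetical, and `itertools.accumulate` again stands in sequentially for a parallel scan.

```python
from itertools import accumulate

def combine(lo, hi):
    """Compose the (generate, propagate) summary of a low block with a higher one."""
    g_lo, p_lo = lo
    g_hi, p_hi = hi
    return (g_hi or (p_hi and g_lo), p_hi and p_lo)

def add_bits(a, b):
    """Add two equal-length bit lists (least significant bit first)."""
    # Bit i generates a carry if both inputs are 1, propagates one if exactly one is 1.
    gp = [((x & y) == 1, (x ^ y) == 1) for x, y in zip(a, b)]
    prefix = list(accumulate(gp, combine))         # inclusive scan of summaries
    carries = [False] + [g for g, _ in prefix]     # carries[i] = carry into bit i
    total = [x ^ y ^ int(c) for x, y, c in zip(a, b, carries)]
    return total + [int(carries[-1])]              # append the carry-out bit

# 5 + 3 = 8, bits written least significant first.
bits = add_bits([1, 0, 1], [1, 1, 0])
# bits == [0, 0, 0, 1]
```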

(2) Lexical analysis (tokenizing, scanning)

(3) Inverting triangular n-by-n matrices

Mapping Data Parallelism to Real Hardware

Ignored here.