Lecture 10: Advanced MPI and Collective Communication Algorithms¶
With the development of LLMs, we need to handle more and more deep learning workloads through parallelism. Nowadays, distributed deep learning is all about collectives.
Collective communication libraries:

- Facebook: gloo
- NVIDIA: nccl

But we need a standard library. MPI is that standard, and today we will introduce it.
Collective Data Movement¶
Some basic operations were introduced in Chapter 9. Here is a summary of the collective data movement routines.
- Many routines: `Allgather`, `Allgatherv`, `Allreduce`, `Alltoall`, `Alltoallv`, `Bcast`, `Gather`, `Gatherv`, `Reduce`, `Reduce_scatter`, `Scan`, `Scatter`, `Scatterv`.
- `All` versions deliver results to all participating processes, not just the root.
- `v` versions allow the chunks to have variable sizes.
- `Allreduce`, `Reduce`, `Reduce_scatter`, and `Scan` take both built-in and user-defined combiner functions.
Introduction¶
We use SUMMA as a running example to show how MPI lets different groups of processes communicate.

SUMMA: Scalable Universal Matrix Multiplication Algorithm, which computes C = A × B on a 2D grid of processes.
MPI_Comm_split

MPI_Comm_split's main job is to split an existing communicator into several sub-communicators. It is like dividing a crowd of people into smaller teams. Internally, MPI implements it with roughly this algorithm:

- Use MPI_Allgather to get the color and key from each process.
- Count the number of processes with the same color; create a communicator with that many processes. If this process has MPI_UNDEFINED as its color, create a communicator with a single member.
- Use the key to order the ranks.
The color argument is the key to grouping:

- color is like a team jersey: processes with the same color value are placed in the same new communicator
- color must be a non-negative integer or MPI_UNDEFINED
- if a process passes MPI_UNDEFINED as its color, it is not assigned to any new communicator
Suppose we have 6 processes (ranks 0-5) and want to split them into two groups:
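A minimal sketch of the split just described (the variable names here are illustrative, not necessarily the original lecture code):

```c
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

int color = rank % 2;   // even ranks -> group 0, odd ranks -> group 1
MPI_Comm newcomm;
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);
```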
As discussed before, this is a single piece of code that runs on every process, yet it has a different effect on each process.
This produces two new communicators:

- the color=0 group contains ranks 0, 2, 4
- the color=1 group contains ranks 1, 3, 5

The key argument determines the rank order inside each new communicator:

- processes with smaller key values receive smaller new ranks
- if key values are equal, ties are broken by the rank order in the original communicator
- it is common to simply pass the original rank as the key
Code for SUMMA optimization:
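A sketch of the core SUMMA loop, assuming a q × q process grid (q = √P) where each process owns n × n blocks `A_local`, `B_local`, and `C_local`; all names here are assumptions rather than the original lecture code:

```c
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

// One broadcast pair per block index k; the row and column
// communicators are built with MPI_Comm_split as shown above.
void summa(int n, const double *A_local, const double *B_local,
           double *C_local) {
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    int q = (int)sqrt((double)P);          // process grid is q x q
    int my_row = rank / q, my_col = rank % q;

    // color = my grid row (or column), key = my position inside it
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, my_row, my_col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, my_col, my_row, &col_comm);

    double *Abuf = malloc(n * n * sizeof(double));
    double *Bbuf = malloc(n * n * sizeof(double));

    for (int k = 0; k < q; k++) {
        // owner of the k-th block column of A broadcasts along its row
        if (my_col == k) memcpy(Abuf, A_local, n * n * sizeof(double));
        MPI_Bcast(Abuf, n * n, MPI_DOUBLE, k, row_comm);

        // owner of the k-th block row of B broadcasts along its column
        if (my_row == k) memcpy(Bbuf, B_local, n * n * sizeof(double));
        MPI_Bcast(Bbuf, n * n, MPI_DOUBLE, k, col_comm);

        // local update: C_local += Abuf * Bbuf
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int l = 0; l < n; l++)
                    C_local[i * n + j] += Abuf[i * n + l] * Bbuf[l * n + j];
    }

    free(Abuf);
    free(Bbuf);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```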
MPI built-in collective computation operations:

- `MPI_MAX`: maximum
- `MPI_MIN`: minimum
- `MPI_PROD`: product
- `MPI_SUM`: sum
- `MPI_LAND`: logical and
- `MPI_LOR`: logical or
- `MPI_LXOR`: logical xor
- `MPI_BAND`: bitwise and
- `MPI_BOR`: bitwise or
- `MPI_BXOR`: bitwise xor
- `MPI_MAXLOC`: maximum value and location
- `MPI_MINLOC`: minimum value and location
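For instance, a global sum with `MPI_Allreduce` (a hypothetical snippet; `local` stands for whatever each rank computed):

```c
double local = 1.0;     // each rank's partial result (assumed)
double global = 0.0;

// combine the per-rank values with MPI_SUM; every rank receives the result
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
```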
Implementation¶
How are collectives implemented in MPI?
Overview¶
Here we just focus on Broadcast and AllGather.
AllGather¶
- Ring Algorithm: each process passes blocks around a ring; after p - 1 steps every process holds all p blocks (see the sketch after this list).
- Recursive Doubling Algorithm: partners at distance 1, 2, 4, ... exchange everything collected so far, finishing in log₂(p) steps for power-of-two p.
- The Bruck Algorithm: a recursive-doubling-style scheme that also handles non-power-of-two process counts in ceil(log₂(p)) steps.
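A sketch of the ring algorithm (the function name and block layout are assumptions): each rank starts with its own block and, in every step, forwards the block it received in the previous step.

```c
#include <mpi.h>
#include <string.h>

// Each rank contributes `blk` doubles; recvbuf holds p * blk doubles,
// laid out so the block from rank i starts at recvbuf + i * blk.
void ring_allgather(const double *sendbuf, double *recvbuf, int blk,
                    MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int left = (rank - 1 + p) % p;
    int right = (rank + 1) % p;

    memcpy(recvbuf + rank * blk, sendbuf, blk * sizeof(double));
    for (int step = 0; step < p - 1; step++) {
        // in step s we forward the block that originated at (rank - s) mod p
        int sendidx = (rank - step + p) % p;
        int recvidx = (rank - step - 1 + p) % p;
        MPI_Sendrecv(recvbuf + sendidx * blk, blk, MPI_DOUBLE, right, 0,
                     recvbuf + recvidx * blk, blk, MPI_DOUBLE, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```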
Broadcast¶
Broadcast based on binomial tree
Scenario: don't confuse this with the allgather scenario above. Here only the root node holds the original data, and it needs to send that data to all other nodes.
Steps by time step (8 processors, root = processor 0):

Time 0

- Processor 0 holds the initial data X
- Processor 0 → Processor 4: send X
- State:
  - Processor 0: holds X
  - Processor 4: received and holds X
  - All other processors: no data

Time 1

- Processor 0 → Processor 2: send X
- Processor 4 → Processor 6: send X
- State:
  - Processor 0: holds X
  - Processor 2: received and holds X
  - Processor 4: holds X
  - Processor 6: received and holds X
  - All other processors: no data

Time 2

- Processor 0 → Processor 1: send X
- Processor 2 → Processor 3: send X
- Processor 4 → Processor 5: send X
- Processor 6 → Processor 7: send X
- Final state:
  - All processors (0-7) hold the same data X

This is why the algorithm is called a broadcast: every processor ends up with the same data X as the root. Thanks to the binomial tree structure, the data spreads in log₂(P) = log₂(8) = 3 steps.
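A sketch of this binomial-tree broadcast for a power-of-two number of processes with root 0 (the function name is an assumption; a real `MPI_Bcast` hides this logic inside the library):

```c
#include <mpi.h>

// At distance `mask`, every rank that already holds the data sends it to
// rank + mask; the mask shrinks p/2, p/4, ..., 1, giving log2(p) rounds.
void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int mask = p / 2; mask > 0; mask >>= 1) {
        if (rank % (2 * mask) == 0) {
            // this rank holds the data: forward to the partner `mask` away
            MPI_Send(buf, count, type, rank + mask, 0, comm);
        } else if (rank % mask == 0) {
            // first round this rank participates: receive from rank - mask
            MPI_Recv(buf, count, type, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
        }
    }
}
```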
Broadcast based on Scatter / Allgather: scatter the message into chunks, then allgather the chunks. This variant is preferred for large messages.
Synchronization¶
```c
int MPI_Barrier(MPI_Comm comm)
```
Example:
MPI_Barrier is like a rendezvous point or checkpoint: it guarantees that all processes have reached this point before any of them may continue. Imagine the following scenario:

- Process 0 runs fast and reaches the barrier after 2 seconds
- Process 1 is slower and needs 5 seconds to reach the barrier
- Process 2 is the slowest and needs 8 seconds
- Every process must wait for the slowest one (8 seconds) before execution continues
PS:

- Blocks until all processes in the group of the communicator comm call it.
- Almost never required in a parallel program:
  - a barrier forces processes to wait, which lowers parallel efficiency
  - most MPI collective operations (such as `MPI_Bcast` and `MPI_Reduce`) already include implicit synchronization
- Occasionally useful in measuring performance and load balancing:
```c
MPI_Barrier(comm);                 // ensure all processes start simultaneously
start_time = MPI_Wtime();
// ... (execution part to be measured)
MPI_Barrier(comm);                 // ensure all processes have finished
end_time = MPI_Wtime();
duration = end_time - start_time;  // total time across all processes
```
Nonblocking collective communication also exists in MPI, but we do not need to focus on it for now.
Hybrid Programming with Threads¶
- MPI describes parallelism between processes (with separate address spaces)
- Thread parallelism provides a shared-memory model within a process
OpenMP, Pthreads, and MPI are common models.
- All MPI
  - MPI between processes both within a node and across nodes
  - MPI internally uses shared memory to communicate within a node
- MPI + OpenMP
  - Use OpenMP within a node and MPI across nodes
- MPI + Pthreads
  - Use Pthreads within a node and MPI across nodes

The latter two approaches are known as "hybrid programming".
In short:

- We always use MPI between processes.
- We use OpenMP or Pthreads within a process.
Thread Model in MPI¶
MPI defines four levels of thread safety:

- `MPI_THREAD_SINGLE`: only one thread exists in the application
- `MPI_THREAD_FUNNELED`: multithreaded, but only the main thread makes MPI calls (the one that called `MPI_Init_thread`)
- `MPI_THREAD_SERIALIZED`: multithreaded, but only one thread at a time makes MPI calls
- `MPI_THREAD_MULTIPLE`: multithreaded, and any thread can make MPI calls at any time (with some restrictions to avoid races)
The thread levels are ordered by increasing generality: the looser the restrictions on which threads may call MPI, the higher the level. Consequently, if an application works in FUNNELED mode, it also works in SERIALIZED mode. The level is requested via MPI_Init_thread:
```c
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided)
```
- `required`: the thread support level requested by the application
- `provided`: the thread support level actually provided by the MPI implementation
MPI_THREAD_SINGLE¶
There are no threads in the system
- E.g. there are no OpenMP parallel regions
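A minimal sketch (not necessarily the original lecture code): the program is single-threaded, so plain `MPI_Init` suffices.

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, buf;
    MPI_Init(&argc, &argv);               // implies MPI_THREAD_SINGLE
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = rank * rank;                    // purely sequential computation

    // all MPI calls come from the one and only thread
    MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```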
MPI_THREAD_FUNNELED¶
All MPI calls are made by the master thread.
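A minimal sketch, assuming OpenMP (not necessarily the original lecture code): threads share the computation, and MPI calls are funneled through the master thread after the parallel region.

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, sum = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    #pragma omp parallel reduction(+:sum)
    {
        sum += omp_get_thread_num();      // threads compute, no MPI here
    }

    // back on the master thread, the one that called MPI_Init_thread:
    // at this level only it may make MPI calls
    MPI_Bcast(&sum, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```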
MPI_THREAD_SERIALIZED¶
Only one thread can make MPI calls at a time
- Protected by OpenMP critical regions
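A sketch under assumed conditions (exactly 2 MPI processes with the same OpenMP thread count): every thread may make MPI calls, but an OpenMP critical region guarantees that only one does so at a time.

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int msg = omp_get_thread_num();
        #pragma omp critical              // one thread at a time calls MPI
        {
            if (rank == 0)
                MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```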
MPI_THREAD_MULTIPLE¶
Any thread can make MPI calls any time (w/ restrictions)
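A sketch under the same assumptions (2 processes, equal thread counts), but now threads call MPI concurrently; the OpenMP thread id serves as the message tag so that thread `tid` on rank 0 pairs with thread `tid` on rank 1.

```c
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int msg = tid;
        // no mutual exclusion: any thread may call MPI at any time
        if (rank == 0)
            MPI_Send(&msg, 1, MPI_INT, 1, tid, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&msg, 1, MPI_INT, 0, tid, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```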
- A fully thread-safe implementation will support MPI_THREAD_MULTIPLE. ("True safety means the baseline is protected.")
- A program that calls MPI_Init (instead of MPI_Init_thread) should assume that only MPI_THREAD_SINGLE is supported. ("At startup, always plan for the worst case.")
Specification
- Ordering: When multiple threads make MPI calls concurrently, the outcome will be as if the calls executed sequentially in some (any) order
- Blocking: Blocking MPI calls will block only the calling thread and will not prevent other threads from running or executing MPI functions
Ordering and Blocking issues can lead to deadlocks and race conditions. You can refer to previous slides.
Currently, the "easiest" OpenMP programs only need FUNNELED (the second level).
One-sided Communication¶
The basic idea of one-sided communication models is to decouple data movement from process synchronization.

- We should be able to move data without requiring the remote process to synchronize
- Each process exposes a part of its memory to other processes
- Other processes can directly read from or write to this memory
Examples¶
Two-sided communication vs. one-sided communication. Scenario: delay.

This figure contrasts the two modes of inter-process communication: one-sided and two-sided.

Upper half: two-sided communication

- Process 0 executes a SEND operation
- Process 1 executes a RECV operation
- If Process 0's send is delayed, the entire exchange is affected

Lower half: one-sided communication

- Process 0 uses PUT to send data and GET to fetch data
- Process 1 may be delayed (DELAY), but this does not affect Process 0's operations
- Process 0 completes its sends and fetches independently
Public Memory and Local Memory¶
- Any memory used by a process is, by default, only locally accessible.
```c
malloc(sizeof(int)); // accessed only by its own process
```
- User has to make an explicit MPI call to declare a memory region as remotely accessible.
- MPI terminology for remotely accessible memory is a "window".
- A group of processes collectively create a "window".
- Window: a patch of each participating process's own memory that every process which agreed to the window's existence can see into
- Once a memory region is declared as remotely accessible, all processes in the window can read/write data to this memory without explicitly synchronizing with the target process.
Windows are like special zones (duty-free zones): once a set of processes agrees to establish one, later operations on data inside the region no longer need to explicitly "ask the owner's permission".
Window creation models¶
- `MPI_WIN_CREATE`: you already have an allocated buffer that you would like to make remotely accessible.
- `MPI_WIN_ALLOCATE`: you want to create a buffer and directly make it remotely accessible.
- `MPI_WIN_CREATE_DYNAMIC`: the memory attached to the window can change dynamically.
  - You don't have a buffer yet, but will have one in the future.
  - You may want to dynamically add/remove buffers to/from the window.
  - Most flexible :)
- `MPI_WIN_ALLOCATE_SHARED`: you want multiple processes on the same node to share a buffer.
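For example, MPI_WIN_CREATE exposes a buffer you already own (a hypothetical snippet; the buffer name and size are assumptions):

```c
int buf[1000];                    // pre-existing local buffer
MPI_Win win;

// collectively expose buf for RMA; disp_unit = sizeof(int) lets remote
// offsets be expressed in ints rather than bytes
MPI_Win_create(buf, 1000 * sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

// ... RMA operations on win ...

MPI_Win_free(&win);               // buf itself remains valid afterwards
```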
MPI_WIN_ALLOCATE¶
- Create a remotely accessible memory region in an RMA window.
- Only data exposed in a window can be accessed with RMA ops.
```c
int MPI_Win_allocate(MPI_Aint size, int disp_unit, MPI_Info info,
                     MPI_Comm comm, void *baseptr, MPI_Win *win)
```
- This is a collective call: every process in the communicator must execute it
- Each process allocates at least size bytes of local memory and receives a pointer baseptr to it
- The allocated memory is local to each process, starting at address baseptr
- Although the memory is local, other processes can access it through RMA (Remote Memory Access) operations
- The returned win object can be used by all processes in the communicator to perform RMA operations

The return value is local memory, but this memory can be accessed by other processes.
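A sketch of typical usage, assuming each process exposes 1000 ints (the array size and variable names are assumptions):

```c
#include <mpi.h>

int main(int argc, char **argv) {
    int *a;
    MPI_Win win;

    MPI_Init(&argc, &argv);

    // collective call: every process allocates and exposes 1000 ints
    MPI_Win_allocate(1000 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &a, &win);

    // a[] is local memory, but remotely accessible via RMA on win

    MPI_Win_free(&win);   // frees the window and the memory it allocated
    MPI_Finalize();
    return 0;
}
```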
MPI_WIN_CREATE_DYNAMIC¶
- Create an RMA window, to which data can later be attached
- Only data exposed in a window can be accessed with RMA ops
- Initially "empty"
- The application can dynamically attach/detach memory to this window by calling `MPI_Win_attach` / `MPI_Win_detach`
- The application can access data in this window only after a memory region has been attached
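A sketch of the attach/detach lifecycle (sizes and names are assumptions):

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int *a = NULL;
    MPI_Win win;

    MPI_Init(&argc, &argv);

    // create an initially empty dynamic window
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    // allocate memory later and attach it to the window
    a = malloc(1000 * sizeof(int));
    MPI_Win_attach(win, a, 1000 * sizeof(int));

    // a[] is now remotely accessible via RMA operations on win

    MPI_Win_detach(win, a);   // detach before freeing the memory
    free(a);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```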
Data movement¶
MPI provides the ability to read, write, and atomically modify data in remotely accessible memory regions:

- `MPI_PUT`
- `MPI_GET`
- `MPI_ACCUMULATE`
- `MPI_GET_ACCUMULATE`
- `MPI_COMPARE_AND_SWAP`
- `MPI_FETCH_AND_OP`
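For instance, a passive-target put into rank 1's window (a hypothetical snippet; `win` is a window such as the ones created above):

```c
int value = 42;

// lock rank 1's window, write one int at displacement 0, then unlock;
// the target makes no MPI calls for this transfer
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
MPI_Put(&value, 1, MPI_INT,      // origin buffer
        1,                       // target rank
        0, 1, MPI_INT,           // target displacement, count, datatype
        win);
MPI_Win_unlock(1, win);          // completes the put at the target
```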
RMA Synchronization Models¶
Two modes:

(1) Passive mode: one-sided, asynchronous communication. The target does not participate in the communication operation.

(2) Active mode: two-sided, synchronous communication.
Active and passive:

- Active Target Mode
  - Both sides must participate in and coordinate the communication
  - The origin opens its access epoch with Start and closes it with Complete
  - The target opens its exposure epoch with Post and closes it with Wait
  - The whole exchange requires synchronization between the two sides
- Passive Target Mode
  - Only the origin participates; the target process takes no part in the operation
  - Lock and Unlock operations control access to the target's memory
  - Similar to a shared-memory access model
  - The target process is entirely unaware that other processes are accessing its memory
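A sketch of active-target (post-start-complete-wait) synchronization, assuming two processes and an existing window `win` (all names are assumptions): rank 0 is the origin, rank 1 the target.

```c
MPI_Group world_group, peer_group;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

if (rank == 0) {
    int target = 1, value = 42;
    MPI_Group_incl(world_group, 1, &target, &peer_group);
    MPI_Win_start(peer_group, 0, win);   // origin: open access epoch
    MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);               // origin: close access epoch
} else if (rank == 1) {
    int origin = 0;
    MPI_Group_incl(world_group, 1, &origin, &peer_group);
    MPI_Win_post(peer_group, 0, win);    // target: expose the window
    MPI_Win_wait(win);                   // target: wait until origin is done
}

MPI_Group_free(&peer_group);
MPI_Group_free(&world_group);
```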