
Lecture 10: Advanced MPI and Collective Communication Algorithms

With the development of LLMs, we need to handle ever larger deep learning workloads in parallel. Nowadays, distributed deep learning is all about collectives.

Collective communication libraries:

  • Facebook: gloo
  • Nvidia: nccl

These are vendor-specific, so we still need a standard interface.

MPI serves as that standard, and today we will introduce this library.

Collective Data Movement

Some basic operations were already introduced in Chapter 9.


Now we can summarize the collective data movement routines.

  1. Many Routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gather, Gatherv, Reduce, Reduce_scatter, Scan, Scatter, Scatterv.
  2. The "All" versions deliver results to all participating processes, not just the root.
  3. The "v" versions allow the chunks to have variable sizes.
  4. Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
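
For example, the "v" variant lets each rank contribute a different amount of data. Below is a minimal sketch (my own, not from the lecture; the buffer sizes and values are made up) in which rank r contributes r+1 integers and MPI_Allgatherv collects all of them on every process:

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendcount = rank + 1;                  /* each rank sends a different amount */
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;

    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int r = 0; r < size; r++) {           /* per-rank chunk sizes and offsets */
        recvcounts[r] = r + 1;
        displs[r] = total;
        total += recvcounts[r];
    }

    int *recvbuf = malloc(total * sizeof(int));
    MPI_Allgatherv(sendbuf, sendcount, MPI_INT,
                   recvbuf, recvcounts, displs, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < total; i++) printf("%d ", recvbuf[i]);
        printf("\n");
    }

    free(sendbuf); free(recvcounts); free(displs); free(recvbuf);
    MPI_Finalize();
    return 0;
}
```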

Introduction

We use SUMMA as an example to show how MPI lets different groups of processes communicate.

SUMMA: Scalable Universal Matrix Multiply

```c
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm);
```

MPI's internal algorithm for MPI_Comm_split:

  1. Use MPI_Allgather to get the color and key from each process
  2. Count the number of processes with the same color and create a communicator with that many processes. If this process has MPI_UNDEFINED as its color, it is not placed in any new communicator.
  3. Use key to order the ranks.
MPI_Comm_split

MPI_Comm_split splits an existing communicator into several sub-communicators. It is like dividing a group of people into smaller teams.

The color argument is the key to the grouping:

  • color is like the color of a team jersey: processes with the same color value end up in the same new communicator
  • color must be a nonnegative integer or MPI_UNDEFINED
  • a process whose color is MPI_UNDEFINED is not placed in any new communicator

Suppose there are 6 processes (rank 0-5) and we want to split them into two groups:

```c
// Group processes by the parity of their rank
int color = rank % 2;  // even ranks get color = 0, odd ranks get color = 1
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);
```

As before, this is a single line of code that runs on every process, but it has a different effect on different processes.

This produces two new communicators:

  • the color=0 group: processes with rank 0, 2, 4
  • the color=1 group: processes with rank 1, 3, 5

The key argument determines the ordering of processes inside each new communicator:

  • processes with smaller key values get smaller new ranks
  • if key values are equal, the rank order of the original communicator is kept
  • usually you can simply pass the original rank as the key
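
As a quick illustration of the key argument (a minimal sketch of my own, not from the lecture), passing key = size - rank reverses the ordering inside each group:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size, newrank;
    MPI_Comm newcomm;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* same odd/even grouping as above, but key = size - rank reverses the
     * ordering: the highest original rank becomes rank 0 of its group */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, size - rank, &newcomm);
    MPI_Comm_rank(newcomm, &newrank);
    printf("world rank %d -> new rank %d\n", rank, newrank);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
```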


Code for SUMMA optimization:

```c
void SUMMA(double *mA, double *mB, double *mc, int p_c) {
    // Assumes rank, Atemp, Btemp, size, N and p are defined elsewhere:
    // rank is this process's rank, Atemp/Btemp are staging buffers,
    // size is the local block size, and p is the total number of processes.

    int row_color = rank / p_c; // p_c = sqrt(p) for simplicity
    MPI_Comm row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row_color, rank, &row_comm);
    // processes in the same block row go into one communicator (for A)

    int col_color = rank % p_c;
    MPI_Comm col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col_color, rank, &col_comm);
    // processes in the same block column go into another communicator (for B)

    for (int k = 0; k < p_c; ++k) {
        // the k-th block owner in each row/column stages its block ...
        if (col_color == k) memcpy(Atemp, mA, size);
        if (row_color == k) memcpy(Btemp, mB, size);

        // ... and broadcasts it along its row (for A) and its column (for B)
        MPI_Bcast(Atemp, size, MPI_DOUBLE, k, row_comm);
        MPI_Bcast(Btemp, size, MPI_DOUBLE, k, col_comm);

        SimpleDGEMM(Atemp, Btemp, mc, N/p, N/p, N/p);
    }
}
```

MPI Built-in Collective Computation Operations

  • MPI_MAX: maximum
  • MPI_MIN: minimum
  • MPI_PROD: product
  • MPI_SUM: sum
  • MPI_LAND: logical and
  • MPI_LOR: logical or
  • MPI_LXOR: logical xor
  • MPI_BAND: bit-wise and
  • MPI_BOR: bit-wise or
  • MPI_BXOR: bit-wise xor
  • MPI_MAXLOC: maximum value and location
  • MPI_MINLOC: minimum value and location
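
As a quick illustration (my own sketch, not from the slides), here is a minimal program that uses two of these operations: MPI_SUM to add one value per rank, and MPI_MAXLOC to find the largest value together with the rank that owns it:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI_SUM: every rank contributes one value, all ranks receive the total */
    int value = rank + 1, total;
    MPI_Allreduce(&value, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* MPI_MAXLOC: find the maximum value and the rank that holds it.
     * The operand must pair the value with its "location" (MPI_2INT). */
    struct { int val; int loc; } in = { value, rank }, out;
    MPI_Allreduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d, max = %d at rank %d\n", total, out.val, out.loc);

    MPI_Finalize();
    return 0;
}
```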

Implementation

How are collectives implemented in MPI?

Overview


Here we just focus on Broadcast and AllGather.

AllGather


Ring Algorithm

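A minimal sketch of the ring allgather (my own code, not the lecture's): in step s every rank forwards the block it received in the previous step to its right neighbor while receiving a new block from its left neighbor, so after p-1 steps every rank holds all p blocks.

```c
#include <mpi.h>
#include <string.h>

/* all must have room for p * blocklen ints; myblock holds this rank's block */
void ring_allgather(const int *myblock, int blocklen, int *all, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    int right = (rank + 1) % p;
    int left  = (rank - 1 + p) % p;

    /* place our own block into the result buffer first */
    memcpy(all + rank * blocklen, myblock, blocklen * sizeof(int));

    for (int s = 0; s < p - 1; ++s) {
        int send_idx = (rank - s + p) % p;      /* block forwarded in this step */
        int recv_idx = (rank - s - 1 + p) % p;  /* block arriving in this step  */
        MPI_Sendrecv(all + send_idx * blocklen, blocklen, MPI_INT, right, 0,
                     all + recv_idx * blocklen, blocklen, MPI_INT, left, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```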

Recursive Doubling Algorithm

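A similar sketch of recursive doubling (my own code, assuming the number of processes p is a power of two): in step s each rank exchanges everything it has accumulated so far with the partner whose rank differs in bit s, so the data held doubles every step and log2(p) steps suffice.

```c
#include <mpi.h>
#include <string.h>

/* all must have room for p * blocklen ints; p is assumed to be a power of two */
void recursive_doubling_allgather(const int *myblock, int blocklen,
                                  int *all, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    memcpy(all + rank * blocklen, myblock, blocklen * sizeof(int));

    for (int dist = 1; dist < p; dist <<= 1) {
        int partner = rank ^ dist;
        /* each side currently holds a contiguous run of dist blocks */
        int send_start = rank    & ~(dist - 1);
        int recv_start = partner & ~(dist - 1);
        MPI_Sendrecv(all + send_start * blocklen, dist * blocklen, MPI_INT, partner, 0,
                     all + recv_start * blocklen, dist * blocklen, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```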

The Bruck Algorithm


Broadcast

Broadcast based on binomial tree


Scenario: don't confuse this with the allgather scenario above. Here only the root node holds the original data, and it needs to send it to all other nodes.

Steps by time step

Time 0

  • Processor 0 holds the initial data X
  • Processor 0 → Processor 4: send X
  • State:
    • Processor 0: holds X
    • Processor 4: received and now holds X
    • All other processors: no data

Time 1

  • Processor 0 → Processor 2: send X
  • Processor 4 → Processor 6: send X
  • State:
    • Processor 0: holds X
    • Processor 2: received and now holds X
    • Processor 4: holds X
    • Processor 6: received and now holds X
    • All other processors: no data

Time 2

  • Processor 0 → Processor 1: send X
  • Processor 2 → Processor 3: send X
  • Processor 4 → Processor 5: send X
  • Processor 6 → Processor 7: send X
  • Final state:
    • all processors (0-7) hold the same data X

This is why the algorithm is a broadcast: in the end every processor has obtained the same data X as the root. Thanks to the binomial-tree structure, the data reaches everyone in log₂(P) = log₂(8) = 3 steps.
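
A minimal sketch of this pattern (my own code, assuming the root is rank 0 and the number of ranks is a power of two), which reproduces the timeline above for 8 processes:

```c
#include <mpi.h>

void binomial_bcast(int *buf, int count, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    /* mask starts at the highest power of two below p and halves each step:
     * with p = 8 this gives 0->4, then {0,4}->{2,6}, then {0,2,4,6}->{1,3,5,7} */
    for (int mask = p / 2; mask > 0; mask >>= 1) {
        if (rank % (2 * mask) == 0) {
            /* this rank already has the data; forward it */
            MPI_Send(buf, count, MPI_INT, rank + mask, 0, comm);
        } else if (rank % (2 * mask) == mask) {
            /* this rank receives the data in this step */
            MPI_Recv(buf, count, MPI_INT, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}
```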

Broadcast based on Scatter / Allgather

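The idea is to split the message into p chunks, scatter them from the root, and then reassemble the full buffer everywhere with an allgather, which is attractive for large messages. A hedged sketch of my own (assuming count is divisible by the number of processes):

```c
#include <mpi.h>

void bcast_scatter_allgather(double *buf, int count, int root, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int chunk = count / p;   /* assume count % p == 0 for simplicity */

    /* Step 1: root scatters the buffer so that rank i ends up holding chunk i */
    if (rank == root)
        MPI_Scatter(buf, chunk, MPI_DOUBLE,
                    MPI_IN_PLACE, chunk, MPI_DOUBLE, root, comm);
    else
        MPI_Scatter(NULL, chunk, MPI_DOUBLE,
                    buf + rank * chunk, chunk, MPI_DOUBLE, root, comm);

    /* Step 2: an in-place allgather reassembles all chunks on every rank */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  buf, chunk, MPI_DOUBLE, comm);
}
```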

Synchronization

```c
MPI_Barrier(comm);
```

Example:

MPI_Barrier is like a checkpoint: it makes sure that all processes have reached this point before any of them continues. Imagine a scenario:

  • Process 0 runs fast and reaches the barrier after 2 seconds
  • Process 1 runs slower and needs 5 seconds to reach the barrier
  • Process 2 is the slowest and needs 8 seconds to reach the barrier
  • Every process must wait for the slowest one (8 seconds) before it can continue

PS:

  1. Blocks until all processes in the group of the communicator comm call it.
  2. Almost never required in a parallel program
    • a barrier forces processes to wait, which reduces parallel efficiency
    • most MPI collectives (such as MPI_Bcast and MPI_Reduce) already contain implicit synchronization
  3. Occasionally useful for measuring performance and load balancing
    ```c
    MPI_Barrier(comm);  // ensure all processes start simultaneously
    start_time = MPI_Wtime();
    // ... (execution part to be measured)
    MPI_Barrier(comm);  // ensure all processes have finished
    end_time = MPI_Wtime();
    duration = end_time - start_time; // elapsed time until every process has finished
    ```

Nonblocking collective communication also exists in MPI, but we do not need to focus on it for now.

Hybrid Programming with Threads

  • MPI describes parallelism between processes (with separate address spaces)
  • Thread parallelism provides a shared-memory model within a process

OpenMP, Pthreads, and MPI are common models.

  • All MPI
    • MPI between processes both within a node and across nodes
    • MPI internally uses shared memory to communicate within a node
  • MPI + OpenMP
    • Use OpenMP within a node and MPI across nodes
  • MPI + Pthreads
    • Use Pthreads within a node and MPI across nodes

The latter two approaches are known as "hybrid programming".

In short:

  1. We always use MPI between processes.
  2. We use OpenMP or Pthreads within a process.

Thread Model in MPI

alt text

MPI defines four levels of thread safety

  • MPI_THREAD_SINGLE: only one thread exists in the application
  • MPI_THREAD_FUNNELED: multithreaded, but only the main thread makes MPI calls (the one that called MPI_Init_thread)
  • MPI_THREAD_SERIALIZED: multithreaded, but only one thread at a time makes MPI calls
  • MPI_THREAD_MULTIPLE: multithreaded and any thread can make MPI calls at any time (with some restrictions to avoid races – see next slide)

The levels are listed in increasing order: the fewer restrictions a level places on which threads may call MPI, the higher the level.

-> If an application works at the FUNNELED level, it also works at the SERIALIZED level (and above).

```c
int MPI_Init_thread(int *argc, char ***argv, int required, int *provided);
```
  • required: the thread support level requested by the application
  • provided: the thread support level actually provided by MPI implementation

MPI_THREAD_SINGLE

There are no threads in the system

  • E.g. there are no OpenMP parallel regions
```c
int main(int argc, char **argv)
{
    int buf[100], i;
    MPI_Init(&argc, &argv); // start using MPI

    for (i = 0; i < 100; i++)
        compute(buf[i]);

    /* Do MPI stuff */

    MPI_Finalize(); // end MPI

    return 0;
}
```

MPI_THREAD_FUNNELED

All MPI calls are made by the master thread.

```c
int main(int argc, char **argv)
{
    int buf[100], i, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    // The application asks for MPI_THREAD_FUNNELED level support.

    // Check that the MPI implementation provides
    // the required level of thread support.
    if (provided < MPI_THREAD_FUNNELED)
        MPI_Abort(MPI_COMM_WORLD, 1);
        // If the provided level is lower than the requested MPI_THREAD_FUNNELED,
        // call MPI_Abort to terminate all MPI processes.

#pragma omp parallel for
    for (i = 0; i < 100; i++)
        compute(buf[i]);

    /* Do MPI stuff (from the main thread only) */

    MPI_Finalize();
    return 0;
}
```

MPI_THREAD_SERIALIZED

Only one thread can make MPI calls at a time

  • Protected by OpenMP critical regions
```c
int main(int argc, char **argv) {
    int buf[100], i, provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);

    if (provided < MPI_THREAD_SERIALIZED)
        MPI_Abort(MPI_COMM_WORLD, 1);

#pragma omp parallel for
    for (i = 0; i < 100; i++) {
        compute(buf[i]);
#pragma omp critical
        {
            /* Do MPI stuff (one thread at a time) */
        }
    }

    MPI_Finalize();
    return 0;
}
```

MPI_THREAD_MULTIPLE

Any thread can make MPI calls any time (w/ restrictions)

```c
int main(int argc, char **argv) {
    int buf[100], i, provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);

#pragma omp parallel for
    for (i = 0; i < 100; i++) {
        compute(buf[i]);
        /* Do MPI stuff (any thread, any time) */
    }

    MPI_Finalize();
    return 0;
}
```

  1. A fully thread-safe implementation will support MPI_THREAD_MULTIPLE.
    • "Real thread safety is what guards the bottom line."
  2. A program that calls MPI_Init (instead of MPI_Init_thread) should assume that only MPI_THREAD_SINGLE is supported.
    • "At startup, always plan for the worst case."

Specification

  1. Ordering: When multiple threads make MPI calls concurrently, the outcome will be as if the calls executed sequentially in some (any) order
  2. Blocking: Blocking MPI calls will block only the calling thread and will not prevent other threads from running or executing MPI functions

Ordering and Blocking issues can lead to deadlocks and race conditions. You can refer to previous slides.
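
To make the blocking rule concrete, here is a hedged sketch of my own (not from the slides): one thread blocks in MPI_Recv waiting for a message that the same process only sends later from another thread. Because a blocking call blocks only the calling thread, the second thread can still issue the matching MPI_Send and the program completes.

```c
#include <mpi.h>
#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank, value = 0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

#pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            /* blocks only this thread: the message comes from the same rank */
            MPI_Recv(&value, 1, MPI_INT, rank, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            int data = 42;
            MPI_Send(&data, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
        }
    }

    printf("rank %d received %d\n", rank, value);
    MPI_Finalize();
    return 0;
}
```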

Currently

"Easiest" OpenMP programs only need FUNNELED (level - 2).

One-sided Communication

The basic idea of one-sided communication models is to decouple data movement from process synchronization.

  • A process should be able to move data without requiring the remote process to synchronize
  • Each process exposes a part of its memory to other processes
  • Other processes can directly read from or write to this memory


Examples

Two-sided Communication


One-sided Communication


Scenario: Delay


This figure contrasts two ways for processes to communicate: one-sided versus two-sided communication.

Top: two-sided communication

  • Process 0 performs a SEND
  • Process 1 performs a RECV
  • If Process 0 is delayed when sending the message, the whole communication is held up

Bottom: one-sided communication

  • Process 0 uses PUT to write data and GET to read data
  • Process 1 may be delayed (DELAY), but that does not affect Process 0's operations
  • Process 0 can complete its puts and gets ==independently==

Public Memory and Local Memory

  1. Any memory used by a process is, by default, only locally accessible.
    ```c
    malloc(sizeof(int)); // accessible only by the process that allocated it
    ```
  2. The user has to make an explicit MPI call to declare a memory region as remotely accessible.
    • The MPI terminology for remotely accessible memory is a "window".
    • A group of processes collectively creates a "window".
    • Window: a piece of memory that is directly exposed to every process that took part in creating it.
  3. Once a memory region is declared as remotely accessible, all processes in the window can read/write data in this memory without explicitly synchronizing with the target process.

A window is like a special duty-free zone: once several processes agree to set it up, later operations on data inside this region no longer have to explicitly "mind the feelings" of the other processes.

Window creation models

  • MPI_WIN_CREATE: You already have an allocated buffer that you would like to make remotely accessible.
  • MPI_WIN_ALLOCATE: You want to create a buffer and directly make it remotely accessible.
  • MPI_WIN_CREATE_DYNAMIC: the memory attached to the window can change dynamically.
    • You don't have a buffer yet, but will have one in the future.
    • You may want to dynamically add/remove buffers to/from the window.
    • Most flexible :)
  • MPI_WIN_ALLOCATE_SHARED: You want multiple processes on the same node to share a buffer

MPI_WIN_ALLOCATE

  1. Create a remotely accessible memory region in an RMA window.
  2. Only data exposed in a window can be accessed with RMA ops.
```c
int MPI_Win_allocate(
    MPI_Aint size,   // size of local data in bytes (nonnegative integer)
    int disp_unit,   // local unit size for displacements, in bytes (positive integer)
    MPI_Info info,   // info argument (handle)
    MPI_Comm comm,   // communicator (handle)
    void *baseptr,   // pointer to exposed local data
    MPI_Win *win);   // window (handle)
```
  1. This is a collective call: every process in the communicator must execute it
  2. Each process allocates at least size bytes of local memory and returns a pointer to it in baseptr
  3. The allocated memory is local to each process and starts at baseptr
  4. Although the memory is local, it can be accessed by other processes through RMA (Remote Memory Access) operations
  5. The returned win object can be used by all processes in the communicator to perform RMA operations

The returned memory is local, but it can be accessed by other processes.

```c
int main(int argc, char **argv) {
    int *a;
    MPI_Win win;

    MPI_Init(&argc, &argv);

    /* collectively create remotely accessible memory in a window */
    MPI_Win_allocate(1000*sizeof(int), sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &a, &win);
    /* Array 'a' is now accessible from all processes in MPI_COMM_WORLD */

    MPI_Win_free(&win);
    MPI_Finalize();

    return 0;
}
```

MPI_WIN_CREATE_DYNAMIC

  1. Create an RMA window, to which data can later be attached
    • Only data exposed in a window can be accessed with RMA ops
  2. Initially "empty"
    • Application can dynamically attach/detach memory to this window by calling MPI_Win_attach/detach
    • Application can access data on this window only after a memory region has been attached
```c
int main(int argc, char **argv) {
    int *a;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win); // win is empty now

    /* create private memory */
    a = (int *) malloc(1000 * sizeof(int));
    /* use private memory like you normally would */
    a[0] = 1;
    a[1] = 2;

    /* locally declare the memory as remotely accessible */
    MPI_Win_attach(win, a, 1000*sizeof(int));

    /* Array 'a' is now accessible from all processes */

    /* undeclare the remotely accessible memory */
    MPI_Win_detach(win, a); free(a);
    MPI_Win_free(&win);

    MPI_Finalize();
    return 0;
}
```

Data movement

MPI provides the ability to read, write, and atomically modify data in remotely accessible memory regions.

  • MPI_PUT
  • MPI_GET
  • MPI_ACCUMULATE
  • MPI_GET_ACCUMULATE
  • MPI_COMPARE_AND_SWAP
  • MPI_FETCH_AND_OP
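
For reference, the prototypes of the two basic movement calls are shown below (taken from the MPI standard, not from the slides):

```c
int MPI_Put(const void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win);

int MPI_Get(void *origin_addr, int origin_count, MPI_Datatype origin_datatype,
            int target_rank, MPI_Aint target_disp, int target_count,
            MPI_Datatype target_datatype, MPI_Win win);
```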


RMA Synchronization Models


Two Modes


(1) Passive Mode: One-sided, asynchronous communication

The target does not participate in the communication operation.

(2) Active Mode: Two-sided, synchronous communication

Active and Passive
  • Active Target Mode
    • both sides take part in and coordinate the communication
    • the target opens an exposure epoch with Post and closes it with Wait
    • the origin opens an access epoch with Start and closes it with Complete
    • the whole procedure requires the two sides to synchronize with each other
  • Passive Target Mode
    • only the origin takes part in the communication; the target process does not participate
    • Lock and Unlock operations control access to the target memory
    • similar to a shared-memory access pattern
    • the target process is completely unaware that other processes are accessing its memory
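
As a closing illustration of the passive target mode, here is a minimal sketch of my own (not the lecture's code; run with at least 2 processes): rank 0 locks rank 1's window, writes one value with MPI_Put, and unlocks, while rank 1 only creates the window and never posts a matching call.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    base[0] = -1;

    MPI_Barrier(MPI_COMM_WORLD);   /* make sure every window is initialized */

    if (rank == 0) {
        int value = 42;
        /* access epoch on target rank 1: lock, put, unlock */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 1, 0, win);
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */,
                0 /* target displacement */, 1, MPI_INT, win);
        MPI_Win_unlock(1, win);    /* the put is guaranteed complete here */
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        /* a local lock/unlock makes the update visible under either memory model */
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        printf("rank 1 sees %d\n", base[0]);
        MPI_Win_unlock(1, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```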