Collective Communication Algorithms¶

Classic Collective Communication Algorithms¶

Though the collective routines are presented to the programmer through a clean API, the collective communication library must implement algorithms that perform the actual intra-node and inter-node communications.

尽管通过干净的API向程序员展示了集体例程，但集体通信库必须实现执行实际节点内和节点间通信的算法。

In recent years, new advancements have been made in the development of collective communication algorithms. Table 2 briefly overviews the representative state-of-the-art collective communication algorithms and their features. Among these algorithms, the Ring, Binomial Tree, Recursive-Doubling, and Recursive-Halving algorithms are the most widely used in HPC workloads. Hence, we will focus on these algorithms in particular.

近年来，集体通信算法的发展取得了新的进展。表2简要概述了代表性的最新集体通信算法及其特征。在这些算法中，环算法、二项树算法、递归倍增算法和递归减半算法是高性能计算工作负载中最广泛使用的。因此，我们将特别关注这些算法。

alt text

To estimate the latency and bandwidth of collective communication algorithms, we use the \(\alpha - \beta\) cost model [50]. The time taken by the bidirectional communication between processes is \(\alpha + n \beta\) and the unidirectional communication is \(\alpha_{\text{uni}} + n \beta_{\text{uni}}\) [51]. In this function, \(\alpha\) is the latency startup time per message, \(\beta\) is the transfer time per byte, \(n\) is the number of transferred bytes, and \(p\) is the number of processes in the communication. In the case of reduction operations, \(\gamma\) is the computation cost per byte for one process.

为了估算集体通信算法的延迟和带宽，我们使用\(\alpha - \beta\)成本模型 [50]。进程间双向通信所花费的时间是\(\alpha + n \beta\)，单向通信的时间是\(\alpha_{\text{uni}} + n \beta_{\text{uni}}\) [51]。在这个函数中，\(\alpha\)是每条消息的启动延迟时间，\(\beta\)是每字节的传输时间，\(n\)是传输的字节数，\(p\)是通信中的进程数量。在归约操作的情况下，\(\gamma\)是一个进程每字节的计算成本。

Relationship

Ring - All-Reduce and All-Gather
Binomial Tree - Broadcast and Reduce
Recursive-Doubling - All-Gather and Reduce-Scatter and All-Reduce
Recursive-Halving - Reduce-Scatter

Ring¶

The Ring algorithm is traditionally utilized for All-Gather. The implementation of All-Gather in this method is that data are transferred around a virtual ring of processes. First, each process sends its chunk of data to the following process in the ring and receives the chunk of data from the previous process in the ring. From the second step, each process sends the data it received from the previous process in the first step to the following process. If \(p\) is the number of processes, it takes \(p - 1\) steps to complete the entire algorithm. If \(n\) is the total amount of data to be gathered on each process, then at every step, each process sends and receives \(n/p\) amounts of data. Therefore, the time taken by this algorithm is \(T_{\text{ring All-Gather}} = (p - 1)\alpha + \left(\frac{(p - 1)}{p}\right)n \beta\) [43].

环算法传统上用于全收集（All-Gather）。这种方法实现全收集的方式是将数据传输到一个虚拟环中的进程。首先，每个进程将其数据块发送到环中的下一个进程，并从环中的前一个进程接收数据块。从第二步开始，每个进程将其在第一步中从前一个进程接收到的数据发送到下一个进程。如果\(p\)是进程的数量，则需要\(p - 1\)步才能完成整个算法。如果\(n\)是每个进程要收集的总数据量，那么在每一步，每个进程发送和接收\(n/p\)的数据量。因此，该算法所需的时间为\(T_{\text{ring All-Gather}} = (p - 1)\alpha + \left(\frac{(p - 1)}{p}\right)n \beta\) [43]。

The Ring algorithm is also used for All-Reduce. There are two phases in All-Reduce: Reduce-Scatter and All-Gather. The Ring algorithm can be applied for All-Gather, and the Reduce-Scatter phase can be performed in the Pairwise-Exchange algorithm. When the number of processes is not a power of 2, this algorithm performs well in bandwidth utilization. However, the latency of this algorithm grows linearly as the number of processes increases. Therefore, this algorithm is only suitable for small or medium-sized processes or large vectors. For All-Reduce, the time taken is \(T_{\text{ring All-Reduce}} = 2(p - 1)\alpha + 2n \beta + n \gamma - \left(\frac{1}{p}\right)(2n \beta + n \gamma)\) [44].

环算法也用于全规约（All-Reduce）。全规约分为两个阶段：规约-散布（Reduce-Scatter）和全收集（All-Gather）。环算法可以应用于全收集，而规约-散布阶段可以通过配对交换（Pairwise-Exchange）算法来执行。当进程数量不是2的幂时，该算法在带宽利用率方面表现良好。然而，该算法的延迟随着进程数量的增加而线性增长。因此，该算法仅适用于小型或中型进程或大向量。对于全规约，所需的时间为\(T_{\text{ring All-Reduce}} = 2(p - 1)\alpha + 2n \beta + n \gamma - \left(\frac{1}{p}\right)(2n \beta + n \gamma)\) [44]。

alt text

Binomial Tree¶

The Binomial Tree algorithm is commonly used for Broadcast in MPICH. First, process (root + (p/2)) receives data from the root. From the second step, this process and the root act as new roots in their respective subtrees. This algorithm runs recursively and takes a total of \(\log p\) steps. In this algorithm, each process communicates \(n\) bytes of data at any step. Therefore, the time taken by this algorithm to perform Broadcast is \(T_{\text{tree Broadcast}} = (\log p)(\alpha + n \beta)\) [43]. This algorithm performs well when communicating short messages because of the logarithmic latency term. As a result, the Binomial Tree algorithm can be a good choice when working with short messages (e.g., < 12 KB) or when the number of processes is less than 8.

二项树算法通常用于 MPICH 中的广播。首先，进程 (root + (p/2)) 从根接收数据。从第二步开始，该进程和根在各自的子树中充当新的根。该算法递归运行，总共需要 \(\log p\) 步。在此算法中，每个进程在任何步骤中都会传输 \(n\) 字节的数据。因此，该算法执行广播所需的时间为 \(T_{\text{tree Broadcast}} = (\log p)(\alpha + n \beta)\) [43]。由于对数延迟项，该算法在传输短消息时表现良好。因此，当处理短消息（例如，小于 12 KB）或进程数量少于 8 时，二项树算法是一个不错的选择。

The Binomial Tree algorithm can also efficiently implement Reduce. The Binomial Tree algorithm takes \(\log p\) steps to complete the process, and the amount of data is \(n\) at each step. In general, the time taken by this algorithm is \(T_{\text{tree Reduce}} = (\log p)(\alpha + n \beta + n \gamma)\) [43]. Owing to the \(\log p\) steps, the Binomial Tree algorithm performs Reduce efficiently for short messages. For Reduce, the Binomial Tree algorithm is used for short messages (e.g., ≤ 2 KB) when the reduction operation is predefined. Because the user-defined reduction operations may pass or break up derived datatypes to do the complex Reduce-Scatter, the Binomial Tree algorithm is used for all message sizes when the reduction operations are user-defined. When executing the All-Reduce process, the algorithm first does a Reduce to rank 0 and then performs a Broadcast.

二项树算法还可以高效地实现规约。二项树算法需要 \(\log p\) 步来完成该过程，每步的数据量为 \(n\)。通常，该算法所需的时间为 \(T_{\text{tree Reduce}} = (\log p)(\alpha + n \beta + n \gamma)\) [43]。由于有 \(\log p\) 步，二项树算法在处理短消息时高效地执行规约。对于规约，当规约操作是预定义时，二项树算法用于短消息（例如，≤ 2 KB）。因为用户定义的规约操作可能会传递或分解派生数据类型以执行复杂的规约-散布操作，当规约操作是用户定义的时，二项树算法用于所有消息大小。在执行全规约过程时，该算法首先对 rank 0 进行规约，然后执行广播。

Binomial Tree Algorithm

初始化：根进程开始拥有数据。
递归分发：根进程将数据发送给一半进程，然后这些进程再将数据递归地发送给其他进程。这个过程一直持续，直到所有进程都收到数据。
递归步骤数：该算法递归运行，总共需要 \(\log p\) 步。在每一步中，每个进程传输 \(n\) 字节的数据。

假设有8个进程（编号0到7），进程0是根进程，初始拥有数据。

Text Only
     0
    / \
    1   2
  / |   |\
 3  4   5 6
/
7

数据传输步骤

第一步：根进程（进程 0 ）将数据发送给进程 1

Text Only
0 -> 1

第二步：进程 0 将数据发送给进程 2，进程 1 将数据发送给进程 3

Text Only
0 -> 2
1 -> 3

第三步：进程 0 将数据发送给进程 4，进程 1 将数据发送给进程 5，进程 2 将数据发送给进程6，进程 3 将数据发送给进程7

Text Only
-> 4
-> 5
-> 6
-> 7

到此为止，所有进程都收到了数据

Text Only
Step 1:           Step 2:           Step 3:
0                 0                 0
|                 |                 |
1                 1                 1
                / |               / | \
                2  3              2  3  4
                |                / \  / \
                0               0  6 5  7

这样，每个进程都会在\(\log p\)步内收到数据。

二项树算法通过递归的方式在多进程之间传递数据，每一步都有多个进程同时传递数据，从而高效地完成广播操作。这种结构使得它在处理短消息和进程数量较少的情况下非常高效。

Recursive-Doubling¶

Recursive-Doubling is an efficient algorithm for All-Gather. In the first step, each process sends and receives data from its neighbors. In the second step, each process sends and receives data from the process that is two processes away from it. In the third step, the process exchanges data from the process that is four processes away from it, and so on. In this way, when the number of processes is a power of 2, all data communication can be completed in \(\log p\) steps. The amount of data exchanged by each process is \(n/p\) in the first step, \(2n/p\) in the second step, and so on. In the last step, the amount of data is \((2^{\log(p-1)} n)/p\). In general, the total time taken by this algorithm is \(T_{\text{rec-dbl}} = \log p \alpha + ((p - 1)/p)n \beta\) [43]. Due to the communication's mathematical features, Recursive-Doubling works very well for situations where the number of processes is a power of 2, but it does not work well when the number of processes is not a power of 2.

递归加倍（Recursive-Doubling）是一种高效的全收集（All-Gather）算法。在第一步，每个进程从其邻居处发送和接收数据。在第二步，每个进程从距离它两个进程的位置发送和接收数据。在第三步，进程从距离它四个进程的位置交换数据，依此类推。通过这种方式，当进程数是2的幂时，所有的数据通信可以在\(\log p\)步内完成。每个进程在第一步中交换的数据量为\(n/p\)，在第二步中为\(2n/p\)，依此类推。在最后一步中，数据量为\((2^{\log(p-1)} n)/p\)。一般来说，该算法所需的总时间为\(T_{\text{rec-dbl}} = \log p \alpha + ((p - 1)/p)n \beta\) [43]。由于通信的数学特性，递归加倍在进程数量是2的幂的情况下效果非常好，但在进程数量不是2的幂时效果不好。

The Recursive-Doubling algorithm can be used in Reduce-Scatter, similar to the one used in All-Gather. However, more data are communicated in Reduce-Scatter than in All-Gather. In step 1 of Reduce-Scatter, the data needed for their result in each process is not exchanged, and the amount of data exchanged by each process is \((n - \frac{n}{p})\); in step 2, the data required by themselves and by the processes communicated in the previous step in each process are not exchanged, and the amount of data exchanged by each process is \((n - \frac{2n}{p})\); in step 3, it is \((n - \frac{4n}{p})\); and so on. Therefore, the time taken by this algorithm is \(T_{\text{rec-dbl Reduce-Scatter}} = \log p \alpha + (\log p - \frac{(p - 1)}{p})n \beta + \log p (\frac{(p - 1)}{p})n \gamma\) [43]. This algorithm works well for concise messages (e.g., < 32 B).

递归加倍（Recursive-Doubling）算法可以用于Reduce-Scatter，类似于在All-Gather中使用的方式。然而，在Reduce-Scatter中比在All-Gather中传输更多的数据。在Reduce-Scatter的第1步中，每个进程所需的数据不会被交换，每个进程交换的数据量为\((n - \frac{n}{p})\)；在第2步中，每个进程所需的数据以及在上一步中通信的进程的数据不会被交换，每个进程交换的数据量为\((n - \frac{2n}{p})\)；在第3步中为\((n - \frac{4n}{p})\)，依此类推。因此，该算法所需的时间为\(T_{\text{rec-dbl Reduce-Scatter}} = \log p \alpha + (\log p - \frac{(p - 1)}{p})n \beta + \log p (\frac{(p - 1)}{p})n \gamma\) [43]。该算法在处理简洁信息时效果很好（例如，小于32字节的消息）。

The Recursive-Doubling algorithm can also perform All-Reduce similarly to how it performs All-Gather, except that each communication step of the Recursive-Doubling algorithm also involves a local reduction. The Recursive-Doubling algorithm performs well for short messages and long messages with user-defined reduction operations. The time taken by this algorithm for All-Reduce is \(T_{\text{rec-dbl All-Reduce}} = \log p \alpha + n \log p \beta + n \log p \gamma\) [44].

递归加倍（Recursive-Doubling）算法也可以执行All-Reduce，类似于执行All-Gather的方式，不同之处在于递归加倍算法的每个通信步骤还涉及局部归约。递归加倍算法在处理短消息和具有用户定义归约操作的长消息时表现良好。该算法执行All-Reduce所需的时间为\(T_{\text{rec-dbl All-Reduce}} = \log p \alpha + n \log p \beta + n \log p \gamma\) [44]。

alt text

Recursive-Halving¶

Similar to applying the Recursive-Doubling algorithm for All-Gather, the Recursive-Halving algorithm can be used to perform Reduce-Scatter. First, processes at a distance of \(p/2\) away exchange data with each other. Each process performs both the sending and receiving operations. All processes need the sent data in the other half, and all processes need the received data in its half. The reduction operation is performed on the received data. The reduction can be made because the procedure is commutative. Second, processes at a distance of \(p/4\) away exchange data with each other: each process performs both the sending and receiving operations. All processes need the sent data in the other half of the current subtree, and all processes require the received data in its half of the current subtree. The reduction operation is performed on the received data. This procedure is performed recursively, halving the data communicated at each step. The total number of steps of this process is \(\log p\). Therefore, if \(p\) is a power of 2, the time taken by this algorithm is given by \(T_{\text{rec-half Reduce-Scatter}} = \log p \alpha + \left( \frac{p-1}{p} \right)n \beta + \left( \frac{p-1}{p} \right)n \gamma\) [44].

类似于将递归加倍（Recursive-Doubling）算法应用于All-Gather，递归减半（Recursive-Halving）算法可以用于执行Reduce-Scatter。首先，距离为\(p/2\)的进程相互交换数据。每个进程执行发送和接收操作。所有进程都需要另一半中的发送数据，并且所有进程都需要自己这半中的接收数据。归约操作在接收数据上执行。由于该过程是交换的，所以可以进行归约。其次，距离为\(p/4\)的进程相互交换数据：每个进程执行发送和接收操作。所有进程都需要当前子树另一半中的发送数据，并且所有进程需要当前子树中自己这半中的接收数据。归约操作在接收数据上执行。该过程递归地执行，每步交换的数据减半。此过程的总步数为\(\log p\)。因此，如果\(p\)是2的幂，则该算法所需的时间为\(T_{\text{rec-half Reduce-Scatter}} = \log p \alpha + \left( \frac{p-1}{p} \right)n \beta + \left( \frac{p-1}{p} \right)n \gamma\) [44]。

xCCL Collective Communication Algorithms¶

We mainly focus on MSCCL in this part

NCCL¶

Ring. The Ring algorithm is used for All-Reduce in NCCL to move data across all GPUs [17]. The data are split into multiple chunks and transferred one by one during the operation. This pipeline modality reduces the idle time the GPU spends waiting for data. However, the latency of the Ring algorithm for All-Reduce increases with the number of GPU devices. Since NCCL is implemented with CUDA, one CUDA thread block is allocated to one ring direction in this library.

环算法（Ring）。 环算法在NCCL中用于在所有GPU之间进行All-Reduce数据传输[17]。数据被分成多个块并在操作过程中一个一个地传输。这种流水线模式减少了GPU等待数据的空闲时间。然而，环算法在All-Reduce操作中的延迟随着GPU设备数量的增加而增加。由于NCCL是使用CUDA实现的，在该库中，一个CUDA线程块被分配给一个环方向。

Double Binary Trees. Since the latency of the Ring algorithm increases with the number of GPUs, it is not suitable for communication among a large number of GPUs. The Double Binary Tree algorithm was proposed to solve this problem because of its logarithmic latency [18]. Based on the architecture of a binary tree, the leaves of one binary tree can be used as the nodes of another. Almost every rank is connected to two parents and two children ranks, except for the root ranks. Compared with the Ring algorithm, the latency of Double Binary Trees is more negligible in the NCCL test on various large machines.

双二叉树（Double Binary Trees）。 由于环算法的延迟随着GPU数量的增加而增加，它不适用于大量GPU之间的通信。提出了双二叉树算法来解决这个问题，因为它具有对数延迟[18]。基于二叉树的结构，一个二叉树的叶子可以用作另一个二叉树的节点。几乎每个rank都连接到两个父节点和两个子节点rank，除了根rank。与环算法相比，在各种大型机器上的NCCL测试中，双二叉树的延迟更为微不足道。

Double Binary Trees [link]

I prefer to ref nvidia-nccl file

第一棵二叉树（main tree）

Text Only
          0
        /   \
       1     2
      / \   / \
     3   4 5   6
                \
                 7

第二棵二叉树（ori-leaf tree）

解释

第一棵二叉树：

节点0将数据A0发送给节点1和节点2
节点1将数据A1发送给节点3和节点4
节点2将数据A2发送给节点5和节点6
节点5将数据A5发送给节点7
节点6将数据A6发送给节点7

第二棵二叉树：

节点3将数据A3发送给节点4和节点5
节点5将数据A5发送给节点6和节点7
节点6将数据A6发送给节点7

分析

每个节点（除了根节点）都有两个父节点和两个子节点。双二叉树的优势在于它能将大量节点的通信分解为多个较小的步骤，每一步都只涉及到相邻节点的通信，从而减少了整体延迟
注意：其他节点与leaf-node的通信，用的是main tree；leaf-node与leaf-node的通信，用的是ori-leaf tree
这种设计是“分层”思想的体现，减小了整体时延

alt text

MSCCL¶

Ring. MSCCL implements Ring for All-Reduce, Reduce-Scatter, and All-Gather [19]. MSCCL allocates multiple channels to one logical ring. In this way, different point-to-point connections can be implemented between the same pairs of GPUs. The protocol for scheduling a logical ring onto one channel varies according to the message size. This strategy enables the logical ring's distribution across channels and efficiently overlaps point-to-point operations.

环形算法。 MSCCL 为 All-Reduce、Reduce-Scatter 和 All-Gather 实现了环形算法[19]。MSCCL 为一个逻辑环分配了多个通道。通过这种方式，可以在相同的 GPU 对之间实现不同的点对点连接。根据消息大小，调度逻辑环到一个通道的协议有所不同。该策略使逻辑环分布在多个通道上，并有效地重叠点对点操作。

Detailed Explanation of Ring Algorithm in MSCCL

Ring Algorithm in MSCCL (Multi-Channel Synchronous Collective Library)

MSCCL uses the ring algorithm to perform collective operations like All-Reduce, Reduce-Scatter, and All-Gather. Here’s a step-by-step breakdown of how it works and its benefits:

(1) Ring Algorithm Overview

The ring algorithm is a common method for performing collective communication in parallel computing. In a ring algorithm, each process (or GPU) sends and receives data from its neighboring processes in a circular (ring) fashion. For example, in an All-Reduce operation, each process contributes its data, and the result is distributed back to all processes.

(2) MSCCL’s Multi-Channel Implementation

MSCCL enhances the traditional ring algorithm by using multiple channels for communication. This means that the same logical ring of processes can utilize several communication paths (channels) simultaneously , leading to more efficient data transfer.

(3) Logical Ring and Channels

Logical Ring: A sequence of processes arranged in a circular manner, where each process communicates with its next and previous neighbor.
Channels: Independent communication paths that can be used to transfer data simultaneously.

By distributing the logical ring across multiple channels, MSCCL can reduce communication bottlenecks and improve overall efficiency.

(4) Point-to-Point Connections

With multiple channels, MSCCL can __establish different point-to-point connections between the same pairs of GPU__s. This means that even if two GPUs need to communicate frequently, the workload can be spread across multiple channels, reducing congestion and improving performance.

(5) Scheduling Protocol

The protocol for scheduling which logical ring uses which channel can vary depending on the size of the message being communicated:

Small Messages: May use fewer channels to reduce overhead.
Large Messages: May use more channels to maximize bandwidth utilization.

By dynamically adjusting the channel allocation based on message size, MSCCL ensures efficient use of resources.

(6) Overlap of Point-to-Point Operations

The strategy of distributing the logical ring across channels allows MSCCL to overlap point-to-point operations. This means that while one part of the data is being sent over one channel, another part can be sent over a different channel simultaneously. This overlapping reduces idle time and enhances communication efficiency.

Traditional Ring Algorithm (Single Channel)

GPU 0 sends data to GPU 1.
GPU 1 sends data to GPU 2.
GPU 2 sends data to GPU 3.
GPU 3 sends data back to GPU 0.

MSCCL Multi-Channel Implementation

MSCCL allocates 2 channels to the logical ring.
Channel 1:
- GPU 0 sends data to GPU 1.
- GPU 1 sends data to GPU 2.
Channel 2:
- GPU 2 sends data to GPU 3.
- GPU 3 sends data back to GPU 0.

By using two channels, the communication steps can occur simultaneously on different paths, reducing overall communication time.

Benefits

Reduced Latency: By overlapping communication, MSCCL reduces the time each GPU spends waiting for data.
Improved Bandwidth Utilization: Multiple channels ensure that the communication bandwidth is fully utilized, especially for large messages.
Scalability: The approach scales better with the number of GPUs, making it suitable for large-scale systems.

All-Pairs. In MSCCL, three different types of GPU buffers are available for input, output, and scratch data chunks. Because the Ring algorithm is unsuitable for small message sizes, MSCCL implements All-Pairs for All-Reduce when the message size is small [19]. The Ring algorithm proceeds in two steps: each rank receives the chunk of data from every rank and performs computation operations, and then the chunks of the result data are broadcast to every other rank. Compared with 2n - 2 steps in the Ring algorithm, only two communication steps are used in the All-Pairs algorithm, which makes the latency of the All-Pairs algorithm lower.

全对（All-Pairs）算法

在 MSCCL 中，有三种不同类型的 GPU 缓冲区可用于处理数据块：输入（input）、输出（output）和暂存（scratch）。由于环算法不适用于小消息大小，当消息较小时，MSCCL 实现了 All-Pairs 算法来进行 All-Reduce 操作[19]。环算法分两个步骤进行：每个进程从所有其他进程接收数据块并执行计算操作，然后结果数据块被广播到所有其他进程。相比于环算法的 2n - 2 步，全对算法仅需两个通信步骤，这使得全对算法的延迟更低。

Hierarchical. Different algorithms can be applied to perform All-Reduce according to the input configurations. Besides the above-mentioned algorithms, Hierarchical is another possible one in MSCCL [19]. There are four communication steps in this algorithm. The first step is to perform Reduce-Scatter within a node, the second step is to perform Reduce-Scatter across nodes, the third step is to perform All-Gather across nodes, and the final step is to perform All-Gather within a node.

分层（Hierarchical）算法

根据输入配置，可以应用不同的算法来执行 All-Reduce。除了上述提到的算法外，分层算法也是 MSCCL 中的一种可能选择[19]。该算法包含四个通信步骤。第一步是在节点内执行 Reduce-Scatter，第二步是在节点之间执行 Reduce-Scatter，第三步是在节点之间执行 All-Gather，最后一步是在节点内执行 All-Gather。

Two-Step. The traditional All-to-All algorithm only implements one communication step, but the number of small chunks transferred across nodes is large. In MSCCL, a two-step All-to-All algorithm is implemented with aggregated cross-node communication to reduce the cost [19].

两步（Two-Step）算法

传统的 All-to-All 算法只实现一个通信步骤，但跨节点传输的小数据块数量很大。在 MSCCL 中，通过聚合跨节点通信来实现两步 All-to-All 算法，以降低成本[19]。