Collective Communication Routines

Collective communication operations are an essential tool used in many high-performance computing applications to move and process data within multiprocess systems. Although there are many named routines, as listed in Table 1, some are especially important for machine learning applications. Programmers can use individual or combinations of collective routines to build distributed training strategies. In this section, we will review routines that are implemented in contemporary collective communication libraries. A high-level review of collective algorithms is included in Section 4.


Broadcast

Broadcast = one send to all

The Broadcast collective operation describes a process whereby the root node distributes the same data to all nodes within the system. After the Broadcast operation is complete, every node will hold the same data. Broadcast is one of the two most common collectives in DL training applications (along with All-Reduce; see Subsection 2.6) and can be used for tasks such as sending training data to all processes.

Example: Consider a system with four processes in Fig. 2. Process \(p_0\) holds data \(D\). After the Broadcast collective runs, processes \(p_0\), \(p_1\), \(p_2\), and \(p_3\) will all hold data \(D\).
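As a concrete sketch of this example, the snippet below uses mpi4py (an assumed library choice; MPI, NCCL, and similar libraries expose an equivalent routine). It is intended to be launched as four processes, e.g. `mpirun -n 4 python script.py`, where `script.py` is whatever file holds the code.

```python
# Minimal mpi4py sketch of Broadcast; run with: mpirun -n 4 python script.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Only the root process p0 starts with the data D; every other process starts empty.
data = {"D": [0.1, 0.2, 0.3]} if rank == 0 else None

# After the Broadcast, every process holds the same object D.
data = comm.bcast(data, root=0)
print(f"p{rank} holds {data}")
```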

All-Gather

All-Gather = all sending and all receiving

The All-Gather collective operation results in each node receiving data from all nodes within the system. Essentially, All-Gather can be described as all processes performing a Broadcast operation with their respective data or as all nodes performing a Gather operation. Note that this is not necessarily how All-Gather is actually implemented.

Example: Consider a system with three processes in Fig. 2. Process \(p_0\) holds data \(D_0\), process \(p_1\) holds data \(D_1\), and process \(p_2\) holds data \(D_2\). After All-Gather completes, processes \(p_0\), \(p_1\), and \(p_2\) will all hold data \(D_0\), \(D_1\), and \(D_2\).
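A minimal sketch of the three-process All-Gather example, again using mpi4py as an assumed illustration library (launch with `mpirun -n 3 python script.py`):

```python
# Minimal mpi4py sketch of All-Gather.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process contributes its own piece D_i ...
local = f"D{rank}"

# ... and receives the pieces from every process, ordered by rank.
gathered = comm.allgather(local)
print(f"p{rank} holds {gathered}")  # with three processes: ['D0', 'D1', 'D2'] on every rank
```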

Scatter

Scatter = split up and send to each one

Unlike Broadcast, in which one node sends the same data to every other process, the Scatter collective operation involves a single process transmitting different data to the other processes based on some splitting pattern or rule. By the traditional definition of Scatter, the rule is that the input data are divided into \(n\) pieces where \(n\) is the number of processes in the system. Each piece is then sent to its corresponding process.

Example: Consider a system with three processes in Fig. 2. Process \(p_0\) holds data vector \(\mathbf{v_d} = (D_0, D_1, D_2)\) where \(D_0, D_1\), and \(D_2\) are data. When Scatter is run, vector \(\mathbf{v_d}\) is divided into component pieces \(D_0, D_1\), and \(D_2\). Data \(D_0\) remains on \(p_0\), \(D_1\) is sent to process \(p_1\), and \(D_2\) is sent to process \(p_2\). There is a clear benefit to using Scatter over Broadcast when dividing work among processes, as each process will not waste memory holding data it does not need. Network bandwidth can also be conserved by avoiding unnecessary data transfer operations [^1].
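A minimal sketch of the Scatter example under the same assumptions (mpi4py, three processes):

```python
# Minimal mpi4py sketch of Scatter.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The root holds the full vector v_d = (D_0, D_1, D_2); the other processes hold nothing.
v_d = [f"D{i}" for i in range(size)] if rank == 0 else None

# Each process receives only the piece intended for it.
piece = comm.scatter(v_d, root=0)
print(f"p{rank} holds {piece}")  # p0 -> 'D0', p1 -> 'D1', p2 -> 'D2'
```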

All-to-All (v)

All-to-All ~ All-to-Allv = transpose

All-to-Allv (note the added "v") is like the standard All-to-All, except that participating processes are not restricted to uniform data sizes and can instead send messages of variable size. In the general All-to-All operation, each process sends data to every other process in the system; the resulting data layout is effectively a transpose of the layout present before the operation. The All-to-All collective is vital when high-performance switching between data and model parallelism is required during deep learning training, because this switch can be described as a transpose.

Example: Consider a system where there are three processes in Fig. 2. Each process \(p_0, p_1\), and \(p_2\) holds unique data \(A_i, B_i\), and \(C_i\) where \(i\) corresponds to the process number. After the All-to-All collective completes, process \(p_0\) will hold data \(A_0, A_1\), and \(A_2\); process \(p_1\) will hold data \(B_0, B_1\), and \(B_2\); and process \(p_2\) will hold data \(C_0, C_1\), and \(C_2\).
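A minimal sketch of the All-to-All example under the same assumptions (mpi4py, three processes). The pickle-based `alltoall` shown here already allows entries of different sizes, which also covers the All-to-Allv case; mpi4py's buffer-based `Alltoallv` exposes the variable send/receive counts explicitly.

```python
# Minimal mpi4py sketch of All-to-All as a transpose.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Process p_i holds (A_i, B_i, C_i); the j-th entry is destined for process p_j.
send = [f"{chr(ord('A') + j)}{rank}" for j in range(size)]

# alltoall sends send[j] to process j and collects one entry from every process,
# so the overall layout is transposed: p_0 ends up with (A_0, A_1, A_2), and so on.
recv = comm.alltoall(send)
print(f"p{rank} sent {send} and now holds {recv}")
```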

Reduce

Reduce = merge and set to one

The Reduce collective refers to a process in which a single node receives data from each node in the system and applies some operation on those data, resulting in a single output. Note that this operation can be anything, provided it is associative. This allows the operation to be performed in parallel while maintaining the correctness and determinism of the program.

Example: Consider a system with three processes (as in Fig. 2): \(p_0, p_1\), and \(p_2\). Process \(p_0\) holds data \(D_0\), process \(p_1\) holds \(D_1\), and process \(p_2\) holds \(D_2\). When the Reduce collective is performed, data from process \(p_0\), data from process \(p_1\), and data from process \(p_2\) will be combined to produce result \(f(D_0, D_1, D_2) = D_\rho\). If \(p_0\) is set as the destination process, then result \(D_\rho\) will be sent to \(p_0\).
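A minimal sketch of the Reduce example, with summation standing in for the associative operation \(f\) (both mpi4py and the choice of sum are assumptions for illustration):

```python
# Minimal mpi4py sketch of Reduce; run with: mpirun -n 3 python script.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process holds its own D_i; here D_i is simply the number rank + 1.
d_i = rank + 1

# Summation is used as the associative operation f. Only the destination process
# p_0 receives D_rho = f(D_0, D_1, D_2); the other processes receive None.
d_rho = comm.reduce(d_i, op=MPI.SUM, root=0)
print(f"p{rank} holds {d_rho}")  # with three processes: p0 prints 6, p1 and p2 print None
```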

Note that the Reduce collective does not itself distribute result \(D_\rho\) to the other processes. Instead, one must either broadcast result \(D_\rho\) as shown in Subsection 2.1 or use the All-Reduce collective operation as explained in Subsection 2.6. If the result \(D_\rho\) must be broken up before being distributed to the other processes, either the Scatter operation can be used after Reduce, or the Reduce-Scatter operation can be used in place of both as explained in Subsection 2.3 and Subsection 2.7 respectively.

Note

Reduce does not distribute its result by itself:

Reduce collective does not itself distribute result \(D_\rho\) to the other processes.

To distribute the result, use Broadcast or All-Reduce:

Instead, one must either broadcast result \(D_\rho\) as shown in Subsection 2.1 or use the All-Reduce collective operation as explained in Subsection 2.6.

If the result must be broken up before distribution, use Scatter after Reduce, or Reduce-Scatter:

If the result \(D_\rho\) must be broken up before being distributed to the other processes, either the Scatter operation can be used after Reduce, or the Reduce-Scatter operation can be used in place of both as explained in Subsection 2.3 and Subsection 2.7 respectively.

All-Reduce

All-Reduce = Reduce + (then) Broadcast

At a high level, the All-Reduce collective can be described as a Reduce step followed by a Broadcast step. After the operation completes, all processes in the system will hold the result of the Reduce operation. All-Reduce is used extensively in data-parallel distributed deep learning training tasks to compute and communicate gradients during the backpropagation step.

Example: Consider a system with three processes (as in Fig. 2f): \(p_0, p_1\), and \(p_2\). Process \(p_0\) holds \(D_0\), \(p_1\) holds \(D_1\), and \(p_2\) holds \(D_2\). When the All-Reduce collective operation is performed, data from process \(p_0\), data from process \(p_1\), and data from process \(p_2\) will be combined to produce result \(f(D_0, D_1, D_2) = D_\rho\). The result \(D_\rho\) is then sent to each of the processes \(p_0, p_1\), and \(p_2\). All-Reduce implementations are tuned for higher performance than running Reduce and Broadcast sequentially, even though both approaches result in all processes holding \(D_\rho\).
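The same example as a minimal mpi4py sketch; compared with the Reduce sketch above, every process ends up holding \(D_\rho\), with no separate Broadcast step:

```python
# Minimal mpi4py sketch of All-Reduce; run with: mpirun -n 3 python script.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process holds D_i, e.g. a local gradient contribution.
d_i = rank + 1

# Every process receives D_rho = f(D_0, D_1, D_2), here computed with summation.
d_rho = comm.allreduce(d_i, op=MPI.SUM)
print(f"p{rank} holds {d_rho}")  # with three processes: every rank prints 6
```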

Reduce-Scatter

As the name implies, the Reduce-Scatter collective is best described as the combination of the Reduce operation and the Scatter operation in the given order. This definition, however, is not fully descriptive as it is the result of the Reduce operation that must be divided into \(n\) pieces so that it can be distributed to the processes in the system [21].

Example: In a system with three processes as shown in Fig. 2g: \(p_0, p_1\), and \(p_2\), where each process holds corresponding data \(D_0, D_1\), and \(D_2\), the Reduce portion of Reduce-Scatter produces an output vector \(\mathbf{v_\rho} = (D_{\rho0}, D_{\rho1}, D_{\rho2})\). The components \(D_{\rho0}, D_{\rho1}\), and \(D_{\rho2}\) are then scattered (e.g., via the Scatter collective) to processes \(p_0, p_1\), and \(p_2\), respectively.
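A minimal sketch of the Reduce-Scatter example using mpi4py's buffer-based `Reduce_scatter_block` with NumPy arrays (assumed choices), with element-wise summation as the reduction:

```python
# Minimal mpi4py + NumPy sketch of Reduce-Scatter; run with: mpirun -n 3 python script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process holds a full-length vector D_i; here every entry of D_i is rank + 1.
d_i = np.full(size, rank + 1, dtype="d")

# The vectors are reduced element-wise into v_rho, and each process keeps exactly
# one component of the result: p_k receives D_rho_k = sum_i D_i[k].
d_rho_k = np.empty(1, dtype="d")
comm.Reduce_scatter_block(d_i, d_rho_k, op=MPI.SUM)
print(f"p{rank} holds {d_rho_k}")  # with three processes: every rank prints [6.]
```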

Note

The accompanying figure (Fig. 2g) appears to contain a typo: the data held by \(p_0, p_1\), and \(p_2\) should be \(D_{\rho0}, D_{\rho1}\), and \(D_{\rho2}\), respectively.