MSCCLANG EXAMPLE

This section introduces MSCCLang through a running example, hierarchical AllReduce, and defines common terminology used throughout the paper.

Terminology. In a cluster of 𝑁 nodes or machines with 𝐺 GPUs each, the rank of a GPU is identified by a tuple (𝑛, 𝑔) where 𝑛 is the node index and 𝑔 is the GPU index within the node, or alternatively by the integer value 𝑛 × 𝐺 + 𝑔. We refer to GPUs by their tuple and single integer ranks interchangeably.
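As a quick illustration, the mapping between the two rank representations can be written in a few lines of plain Python (our own sketch, not part of the MSCCLang API), using the example value G = 3:

```python
G = 3  # GPUs per node in the running example

def tuple_to_rank(n: int, g: int) -> int:
    """(node, gpu) tuple -> integer rank n * G + g."""
    return n * G + g

def rank_to_tuple(rank: int) -> tuple:
    """Integer rank -> (node, gpu) tuple."""
    return rank // G, rank % G

assert tuple_to_rank(1, 0) == 3      # GPU (1, 0) is rank 3
assert rank_to_tuple(4) == (1, 1)    # rank 4 is GPU (1, 1)
```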

Collectives operate on buffers of data divided into chunks, which represent contiguous spans of elements with a uniform size. Chunks are the finest granularity at which data is sent in a collective.

(Figure 1 omitted.)

Hierarchical AllReduce. Figure 1 shows the workings of this algorithm. For a topology of 𝑁 (= 2) nodes and 𝐺 (= 3) GPUs per node, the algorithm splits the input buffer into 𝑁 × 𝐺 (= 6) chunks. The algorithm proceeds in four phases. The first phase is an intra-node ReduceScatter that computes the sum of buffers within a node, with the result “scattered” across the GPUs. In this example, this is done through a Ring algorithm. GPU 1 sends 𝑁 chunks (chunk 0 and chunk 1) to GPU 2, which adds them to its corresponding chunks before sending them to GPU 0. In the end, GPU 0 has the intra-node sum of these 𝑁 chunks, shown lightly shaded in the figure. Each of the other GPUs obtains the intra-node sum of 𝑁 other chunks by executing a similar ring, as shown in the figure.
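The route of chunk 0 and chunk 1 through node 0's ring in this first phase can be traced in plain Python; the snippet below is only an illustration of the data flow described above, not MSCCLang code:

```python
# GPUs 0, 1, 2 of node 0 each start with their own values for chunks 0 and 1.
bufs = {gpu: {c: 10 * gpu + c for c in (0, 1)} for gpu in (0, 1, 2)}

for c in (0, 1):
    bufs[2][c] += bufs[1][c]   # GPU 1 sends chunk c to GPU 2, which reduces it in place
    bufs[0][c] += bufs[2][c]   # GPU 2 forwards the partial sum to GPU 0

# GPU 0 now holds the intra-node sum of chunks 0 and 1 (the lightly shaded result).
assert bufs[0][0] == sum(10 * g + 0 for g in (0, 1, 2))
assert bufs[0][1] == sum(10 * g + 1 for g in (0, 1, 2))
```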

The second phase is an inter-node ReduceScatter, where GPUs with the same intra-node index communicate to sum their chunks across nodes. For instance, GPU 0 (i.e., (0, 0)) and GPU 3 (i.e., (1, 0)) use a Ring algorithm to add the intra-node sums of chunk 0 and chunk 1. The result is scattered with each GPU having one chunk of the AllReduce result, which is shown as darkly shaded in the figure. The final two phases are an inter-node AllGather followed by an intra-node AllGather, both of which follow a similar Ring algorithm to distribute these chunks to all GPUs.
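The four phases can be checked end to end with a small chunk-movement simulation. The following is our own plain-Python sketch of the data flow for the example topology (N = 2 nodes, G = 3 GPUs per node); ring_reduce_scatter and ring_all_gather are illustrative helpers, not MSCCLang primitives. The final assert confirms that every GPU ends up with the element-wise sum of all input buffers.

```python
from itertools import product

N, G = 2, 3                                   # nodes, GPUs per node
ranks = list(product(range(N), range(G)))     # a rank is the tuple (node, gpu)
CHUNKS = N * G

def ring_reduce_scatter(ring, groups, bufs):
    """After this call, ring[i] holds the sum over the ring of every chunk in groups[i]."""
    R = len(ring)
    for i, group in enumerate(groups):
        for c in group:
            for step in range(1, R):          # walk from ring[i+1] around the ring back to ring[i]
                src, dst = ring[(i + step) % R], ring[(i + step + 1) % R]
                bufs[dst][c] += bufs[src][c]

def ring_all_gather(ring, groups, bufs):
    """After this call, every rank in the ring holds ring[i]'s copy of the chunks in groups[i]."""
    R = len(ring)
    for i, group in enumerate(groups):
        for c in group:
            for step in range(R - 1):
                src, dst = ring[(i + step) % R], ring[(i + step + 1) % R]
                bufs[dst][c] = bufs[src][c]

# Distinct starting values so any mis-routed chunk shows up in the final check.
bufs = {(n, g): [(n * G + g + 1) * (c + 1) for c in range(CHUNKS)] for (n, g) in ranks}
expected = [sum((n * G + g + 1) * (c + 1) for (n, g) in ranks) for c in range(CHUNKS)]

# GPU g of each node owns the chunk group [g*N, (g+1)*N); node n owns chunk g*N + n within that group.
node_groups = [list(range(g * N, (g + 1) * N)) for g in range(G)]

for n in range(N):      # phase 1: intra-node ReduceScatter
    ring_reduce_scatter([(n, g) for g in range(G)], node_groups, bufs)
for g in range(G):      # phase 2: inter-node ReduceScatter
    ring_reduce_scatter([(n, g) for n in range(N)], [[g * N + n] for n in range(N)], bufs)
for g in range(G):      # phase 3: inter-node AllGather
    ring_all_gather([(n, g) for n in range(N)], [[g * N + n] for n in range(N)], bufs)
for n in range(N):      # phase 4: intra-node AllGather
    ring_all_gather([(n, g) for g in range(G)], node_groups, bufs)

assert all(bufs[r] == expected for r in ranks)   # every GPU ends with the full AllReduce result
```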

MSCCLang Program. The MSCCLang DSL is embedded in Python and allows users to write communication algorithms by declaratively specifying chunk routes across the GPUs to implement a collective. We call such specifications chunk-oriented. Figure 3 shows the code for the hierarchical AllReduce algorithm. When interpreted as a Python program, the execution mimics the description in Figure 1. Figure 3a creates the four phases: 𝑁 and 𝐺 instances of intra-node and inter-node ReduceScatter and AllGather, respectively. Figure 3b implements ReduceScatter and AllGather using the Ring algorithm. In MSCCLang, a chunk is identified by its rank and its index into a buffer in the rank, as shown in Line 8 and Line 17. As this chunk is routed across the ring, ReduceScatter performs a reduce at Line 11 while AllGather performs a copy at Line 20. Section 3 explains the MSCCLang DSL in detail.

(The code figure from the paper is omitted here.)
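Since the code figure is omitted, the sketch below gives a flavor of what such a chunk-oriented program might look like. It is a hypothetical reconstruction, not the paper's Figure 3: the primitives chunk(), reduce(), and copy() mirror the operations named above, but the ChunkRef stand-in class is ours and the exact signatures in the released msccl-tools DSL may differ. Here the stand-ins simply print the declared chunk routes.

```python
class ChunkRef:
    """Stand-in for a DSL chunk reference: a chunk is named by (rank, index) in the input buffer."""
    def __init__(self, rank, index):
        self.rank, self.index = rank, index

    def reduce(self, dst, index):
        # Route this chunk to rank `dst` and add it into dst's chunk `index`.
        print(f"reduce: rank {self.rank}[{self.index}] -> rank {dst}[{index}]")
        return ChunkRef(dst, index)

    def copy(self, dst, index):
        # Route this chunk to rank `dst`, overwriting dst's chunk `index`.
        print(f"copy:   rank {self.rank}[{self.index}] -> rank {dst}[{index}]")
        return ChunkRef(dst, index)

def chunk(rank, index):
    return ChunkRef(rank, index)

def ring_reduce_scatter(ring, chunks_per_rank, offset=0):
    R = len(ring)
    for i in range(R):                                    # ring[i] ends up owning its chunk group
        for index in range(offset + i * chunks_per_rank, offset + (i + 1) * chunks_per_rank):
            c = chunk(ring[(i + 1) % R], index)           # start one step past the owner...
            for step in range(2, R + 1):                  # ...and reduce around the ring back to it
                c = c.reduce(ring[(i + step) % R], index)

def ring_all_gather(ring, chunks_per_rank, offset=0):
    R = len(ring)
    for i in range(R):
        for index in range(offset + i * chunks_per_rank, offset + (i + 1) * chunks_per_rank):
            c = chunk(ring[i], index)                     # the owner already holds the reduced chunk...
            for step in range(1, R):                      # ...and copies it around the ring
                c = c.copy(ring[(i + step) % R], index)

N, G = 2, 3
for n in range(N):        # N instances of the intra-node ReduceScatter
    ring_reduce_scatter([n * G + g for g in range(G)], chunks_per_rank=N)
for g in range(G):        # G instances of the inter-node ReduceScatter
    ring_reduce_scatter([n * G + g for n in range(N)], chunks_per_rank=1, offset=g * N)
for g in range(G):        # G instances of the inter-node AllGather
    ring_all_gather([n * G + g for n in range(N)], chunks_per_rank=1, offset=g * N)
for n in range(N):        # N instances of the intra-node AllGather
    ring_all_gather([n * G + g for g in range(G)], chunks_per_rank=N)
```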

MSCCLang Architecture. Figure 2 describes the components of the MSCCLang framework. Given an MSCCLang program, the MSCCLang compiler lowers it into an intermediate representation called MSCCL-IR, which is directly interpreted by MSCCLang’s runtime. The compiler traces the program to capture the chunk dependencies in a Chunk DAG and performs several optimizations such as aggregation, instruction fusion, and parallelization. The compiler then schedules the program onto thread blocks, using MSCCLang DSL directives so that the user may control the optimizations and scheduling choices. The compiler ensures that distributed execution correctly implements the chunk-oriented semantics of the input program, with a guaranteed absence of deadlocks and data races.
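To make these components concrete, the fragment below illustrates the kind of information a Chunk DAG node or scheduled MSCCL-IR operation might carry: what moves where, which operations it depends on, and the thread block and channel it was scheduled onto. The Op dataclass and its field names are hypothetical illustrations, not the compiler's actual IR format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Op:
    """Hypothetical illustration of one scheduled operation in an MSCCL-IR-like form."""
    kind: str            # e.g. "send", "recv", "recv_reduce_copy"
    rank: int            # GPU executing the operation
    peer: int            # GPU on the other end of the transfer
    buffer_index: int    # which chunk of the buffer is moved
    depends_on: List[int] = field(default_factory=list)   # ops that must finish first
    thread_block: int = 0                                  # thread block chosen during scheduling
    channel: int = 0                                       # connection used for the transfer

# One edge of the chunk dependencies from Figure 1: GPU 2 receives chunk 0 from GPU 1,
# reduces it, and only then may send the partial sum on to GPU 0.
ops = [
    Op("send",             rank=1, peer=2, buffer_index=0),
    Op("recv_reduce_copy", rank=2, peer=1, buffer_index=0, depends_on=[0]),
    Op("send",             rank=2, peer=0, buffer_index=0, depends_on=[1], thread_block=1),
]
```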

(Figure 2 omitted.)

The MSCCLang runtime executes MSCCL-IR as a single CUDA kernel and performs additional optimizations such as pipelining to improve thread block and link utilization. The key advantage of MSCCLang is that users get algorithmic flexibility to specify custom communication algorithms in a high-level DSL while still getting the performance of hand-written kernels.
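As a rough intuition for why pipelining helps (a simplified back-of-envelope model, not MSCCLang's actual scheduler or cost model): splitting a chunk into k sub-chunks lets consecutive hops of a multi-hop route overlap, so moving one chunk over h links takes roughly (h + k - 1) sub-chunk transfer times instead of h full-chunk transfer times.

```python
MB = 1 << 20

def route_time(chunk_bytes, hops, subchunks, link_bandwidth):
    """Idealized time to move one chunk over `hops` links when it is split into
    `subchunks` pipelined pieces (ignores latency and other overheads)."""
    piece = chunk_bytes / subchunks
    return (hops + subchunks - 1) * piece / link_bandwidth

# A 4 MB chunk crossing 2 links at 10 MB/s per link:
print(route_time(4 * MB, hops=2, subchunks=1, link_bandwidth=10 * MB))  # 0.8 s, no pipelining
print(route_time(4 * MB, hops=2, subchunks=4, link_bandwidth=10 * MB))  # 0.5 s, 4-way pipelining
```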

DSL: domain-specific language.