EVALUATION

We evaluate MSCCLang by implementing classic and custom algorithms for the commonly used collectives AllReduce and AllToAll. We optimize each algorithm’s schedule for various GPU system configurations and input sizes. All our programs require fewer than 30 lines of code and took between 15 minutes and an hour to write and manually optimize.

Experimental Setup. Experiments are performed on two GPU clusters: Azure ND A100 v4-series (NDv4) and NVIDIA DGX2s. Each NDv4 node (Figure 7) contains 8 NVIDIA A100 GPUs connected by 12 third-generation NVLinks to 6 NVSwitches, for a total of 600 GB/s bi-directional bandwidth. For cross-node communication, each pair of GPUs within a node shares a single PCIe switch that connects to 2 HDR InfiniBand NICs, each running at 25 GB/s.

Each DGX2 node contains 16 NVIDIA V100s divided into two boards of 8 GPUs. GPUs on each board are connected by 6 second-generation NVLinks to 6 NVSwitches, and every NVSwitch is connected by 8 NVLinks to its counterpart NVSwitch on the other board. For cross-node communication, each pair of GPUs shares a single PCIe switch that is connected to 1 HDR InfiniBand NIC running at 25 GB/s.

MSCCLang is built on top of NCCL-2.8.4-1 [27]. When applicable, we compare against collectives implemented in NCCL or expert hand-optimized implementations. For custom algorithms and collectives for which hand-optimized CUDA implementations were not previously available, we provide best-effort hand-written kernels. All experiments are averaged over 50 iterations after a warmup period of 20 iterations. Each algorithm’s optimizations are tuned for the two systems. We focus our analysis on the A100 system, since similar trends are seen on the V100 system, and discuss notable exceptions.

[Figure 7: NDv4 node topology]

[Figure 8: Evaluation results]

AllReduce

AllReduce is an MPI collective that globally reduces input buffers across GPUs and replicates the results to all GPUs. We implemented three AllReduce algorithms that target single-node and multi-node systems.

MPI

MPI here refers to the Message Passing Interface, a standardized communication protocol for parallel computing. MPI provides a set of APIs for exchanging data and coordinating work between compute nodes (such as multiple machines or multiple GPUs) in a distributed computing environment.

In high-performance computing (HPC), MPI is widely used to write parallel programs that run efficiently on multi-processor systems. It supports several communication patterns, including point-to-point communication and collective operations (such as AllReduce and AllToAll), which makes it a cornerstone of parallel computing.

Ring AllReduce

7.1.1 Ring AllReduce. A Ring AllReduce with R ranks divides each rank’s input buffer into R chunks. Ranks are logically connected in a ring, and each chunk traverses the ring twice starting from the corresponding rank. The first traversal reduces all corresponding chunks and the second traversal copies the result to all ranks. We implement our ring with a ReduceScatter followed by an AllGather from Figure 3b, using a list of all ranks in the node, an offset of 0, and a count of 1 for the parameters.

Concretely:

  1. ReduceScatter: the input buffer is split into R chunks, one per rank, and each rank reduces the chunk assigned to it as partial results arrive.
  2. AllGather: each rank then collects the reduced chunks from all other ranks, so that every rank ends up with the complete reduced result.

Structured this way, Ring AllReduce provides an efficient reduce-and-distribute mechanism that is well suited to large-scale parallel computation across many ranks.

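To make the data movement concrete, below is a minimal Python simulation of the ReduceScatter-then-AllGather structure described above. It models each rank as a list of R chunks rather than using the actual MSCCLang DSL, so all names here are illustrative.

```python
# Illustrative simulation of Ring AllReduce (ReduceScatter + AllGather).
# Ranks are modeled as lists of R integer chunks; this is not the MSCCLang DSL.

def ring_allreduce(buffers):
    R = len(buffers)                       # number of ranks; each buffer has R chunks

    # ReduceScatter: chunk c moves around the ring accumulating partial sums
    # and ends up fully reduced on rank (c - 1) % R after R - 1 steps.
    for step in range(R - 1):
        for rank in range(R):
            dst = (rank + 1) % R
            chunk = (rank - step) % R      # chunk currently forwarded by this rank
            buffers[dst][chunk] += buffers[rank][chunk]

    # AllGather: each fully reduced chunk traverses the ring once more,
    # copying the final value to every rank in another R - 1 steps.
    for step in range(R - 1):
        for rank in range(R):
            dst = (rank + 1) % R
            chunk = (rank + 1 - step) % R  # reduced chunk this rank already holds
            buffers[dst][chunk] = buffers[rank][chunk]
    return buffers

if __name__ == "__main__":
    R = 4
    bufs = [[10 * rank + c for c in range(R)] for rank in range(R)]
    expected = [sum(10 * r + c for r in range(R)) for c in range(R)]
    assert all(b == expected for b in ring_allreduce(bufs))
```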

Our ring implementation distributes a single logical ring across multiple channels by varying the channel of copy and reduce operations. We tune the number of channels per ring, parallelization, and protocol for the system. While examining NCCL’s codebase, we found and experimentally validated that NCCL’s Ring schedule is roughly equivalent to scheduling a logical ring onto one channel, parallelizing the entire program 24 times, and varying the protocol based on the buffer size.

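As a rough sketch of what varying the channel means, the snippet below assigns each chunk’s ring operations to a channel round-robin. The policy and channel count are illustrative assumptions, not NCCL’s or MSCCLang’s actual scheduler.

```python
# Illustrative sketch: spread one logical ring's operations across several
# channels by assigning each chunk's copy/reduce operations round-robin.
# The round-robin policy is an assumption for illustration only.

def assign_channels(num_ranks, num_channels):
    """Map (step, rank, chunk) of one ReduceScatter traversal to a channel."""
    schedule = {}
    for step in range(num_ranks - 1):
        for rank in range(num_ranks):
            chunk = (rank - step) % num_ranks
            # Keeping every operation on a chunk on the same channel keeps that
            # chunk's ring-order dependency chain within one channel.
            schedule[(step, rank, chunk)] = chunk % num_channels
    return schedule

if __name__ == "__main__":
    for (step, rank, chunk), ch in sorted(assign_channels(4, 2).items()):
        print(f"step {step}: rank {rank} forwards chunk {chunk} on channel {ch}")
```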

We compare our Ring implementations against NCCL’s Ring implementation in Figure 8a. The MSCCLang Ring implementation outperforms NCCL by up to 1.9× when the buffer size is between 32KB and 3MB. Distributing a logical ring across multiple channels enables better overlapping of sends and receives resulting in performance gains. However, this distribution uses more resources thus limiting the chunk parallelization. For buffer sizes greater than 32MB, more parallelization is required, and the best MSCCLang configurations matched NCCL’s performance by scheduling a logical ring onto one channel and parallelizing the program 24 times.

All Pairs AllReduce

7.1.2 All Pairs AllReduce. One advantage of MSCCLang is the ability to explore different algorithms easily. All Pairs is an algorithm we developed while exploring algorithmic optimizations for AllReduce that targets small buffer sizes. This algorithm uses two communication steps: each rank gathers a chunk from every rank, computes the sum, and broadcasts the chunk to every other rank.

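As with Ring, a toy Python simulation (not the MSCCLang program itself) makes the two steps explicit: rank r is responsible for chunk r, gathers it from all peers, reduces it, and broadcasts the result.

```python
# Illustrative simulation of All Pairs AllReduce: two communication steps,
# gather-and-reduce followed by broadcast. Not the MSCCLang DSL program.

def all_pairs_allreduce(buffers):
    R = len(buffers)  # number of ranks; each buffer holds R chunks

    # Step 1: rank r gathers chunk r from every peer and reduces it.
    reduced = [sum(buffers[peer][r] for peer in range(R)) for r in range(R)]

    # Step 2: rank r broadcasts its reduced chunk r to every peer.
    for r in range(R):
        for peer in range(R):
            buffers[peer][r] = reduced[r]
    return buffers

if __name__ == "__main__":
    R = 8
    bufs = [[100 * rank + c for c in range(R)] for rank in range(R)]
    expected = [sum(100 * r + c for r in range(R)) for c in range(R)]
    assert all(b == expected for b in all_pairs_allreduce(bufs))
```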

Since we do not have a CUDA baseline to compare against, we plot the speedup of MSCCLang’s All Pairs against NCCL’s Ring algorithm in Figure 8a.

The speedups provided by All Pairs are driven by algorithmic and scheduling optimizations. Ring and All Pairs exchange the same volume of data, but All Pairs has better latency because it uses 2 communication steps compared with Ring’s 2R − 2 steps. For buffer sizes from 1KB to 1MB, All Pairs is up to 1.8× faster than NCCL, depending on the number of instances used to optimize the program.

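Under a simple latency-only cost model (an illustrative assumption, with a fixed per-step latency α and the step counts quoted above), the gap is

$$
T_{\text{Ring}} \approx (2R - 2)\,\alpha \qquad \text{vs.} \qquad T_{\text{AllPairs}} \approx 2\,\alpha,
$$

so for R = 8 ranks Ring pays roughly 14 latency terms where All Pairs pays 2; since both algorithms move the same total volume, the advantage is most visible at small buffer sizes.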

Hierarchical AllReduce

The final AllReduce algorithm we analyze is the Hierarchical AllReduce algorithm described in Section 2. Figure 8c plots the speedup of the Hierarchical AllReduce implemented in MSCCLang against NCCL. Depending on the input size, we apply different optimizations to the same base algorithm. For small sizes we are up to 1.4× faster than NCCL. For large buffers, greater than 1GB, our implementation is up to 11% faster than NCCL.

In red, we plot the speedup of the same algorithm implemented with NCCL collectives. That implementation is significantly slower than both the MSCCLang and NCCL implementations due to the overhead of multiple kernel launches and the lack of cross-kernel optimizations. Using MSCCLang optimizations to execute the algorithm in a single kernel and to pipeline thread blocks significantly improves the algorithm’s performance, making it faster than NCCL’s implementation for a large range of buffer sizes.

Two-Step AllToAll

AllToAll is an MPI collective that transposes a buffer of data between GPUs such that chunk i on GPU j ends up on GPU i at index j, and is commonly implemented as a set of point-to-point sends and receives between all GPUs. Because each GPU exchanges data with every other GPU, AllToAll is a very communication-intensive collective.

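The transpose semantics can be captured in a few lines of Python; this toy model describes the collective’s result, not a real multi-GPU implementation.

```python
# Toy model of AllToAll semantics: chunk i on GPU j ends up on GPU i at index j.

def alltoall(send):
    R = len(send)  # R GPUs, each holding R chunks
    return [[send[j][i] for j in range(R)] for i in range(R)]

if __name__ == "__main__":
    send = [[f"gpu{j}.chunk{i}" for i in range(4)] for j in range(4)]
    recv = alltoall(send)
    assert recv[2][3] == send[3][2]  # chunk 2 of GPU 3 lands on GPU 2 at index 3
```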

For AllToAlls spanning tens to hundreds of nodes, the naive implementation requires only one communication step, but it sends many small chunks to other nodes over IB, which is expensive due to IB’s high per-send overhead. We implement a Two-Step AllToAll algorithm that aggregates cross-node sends, reducing the total overhead of the IB sends. The algorithm is described in Figure 9, and we used MSCCLang’s default scheduling with 1 instance and tuned the protocol for the buffer size.

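To see why aggregation helps, the sketch below counts cross-node (IB) sends for a cluster of N nodes with G GPUs each. It is a simplified model of the scheme in Figure 9, so the exact aggregation layout is an assumption.

```python
# Illustrative count of IB sends: naive AllToAll vs. Two-Step AllToAll on
# N nodes with G GPUs each. A simplified model, not the MSCCLang program.

def naive_ib_sends(N, G, chunk_bytes):
    """Every GPU sends one small chunk to each GPU on every other node."""
    per_gpu = (N - 1) * G
    return N * G * per_gpu, chunk_bytes

def two_step_ib_sends(N, G, chunk_bytes):
    """Chunks headed to the same remote GPU are first gathered inside the node
    (over NVLink), so each GPU issues one aggregated IB send per remote node."""
    per_gpu = N - 1
    return N * G * per_gpu, G * chunk_bytes

if __name__ == "__main__":
    N, G, chunk = 16, 8, 64 * 1024
    print("naive:    %d IB sends of %d bytes each" % naive_ib_sends(N, G, chunk))
    print("two-step: %d IB sends of %d bytes each" % two_step_ib_sends(N, G, chunk))
```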

[Figure 9: Two-Step AllToAll algorithm]

We compare the performance of our implementation against a hand-optimized CUDA implementation of the Two-Step algorithm in Figure 8e. At large sizes, the MSCCLang implementation is up to 1.3× faster than the hand-optimized implementation. Note that at smaller sizes, between 2MB and 64MB, there are large fluctuations in speedup caused by congestion in the IB network, which is shared with other cloud tenants; the general trends nevertheless show that MSCCLang’s optimizations improve performance.

For reference, we also plot NCCL’s performance relative to the hand-optimized implementation. In general, both Two-Step implementations provide significant improvements over NCCL. However, for larger buffer sizes, greater than 512MB, the hand-optimized Two-Step implementation is slower than NCCL, while the MSCCLang implementation is 20% faster.

The hand-optimized version is implemented using point-to-point primitives exposed by NCCL, but it lacks the scheduling decisions made by the compiler that divide communication across multiple parallel thread blocks. MSCCLang seamlessly handles aggregating chunks in the scratch buffer (Line 12 of Figure 9), while the handwritten implementation requires a separate kernel that copies and contiguously arranges chunks in a scratch buffer for the aggregated IB send, resulting in extra synchronization overhead. Furthermore, the MSCCLang implementation is much more succinct, requiring only 15 lines of code, while the hand-optimized kernel requires roughly 70 lines.

Custom Collectives: AllToNext

A key feature of MSCCLang is the ability to implement new collective communication patterns quickly and efficiently. We demonstrate this on a new collective called AllToNext. This collective involves R GPUs, where GPU i sends a buffer of data to GPU i + 1, with the last GPU sending nothing. This communication pattern exists in applications that process data in a pipelined fashion across multiple GPUs. In a naive implementation of AllToNext, every GPU directly sends its buffer to the next GPU. However, on a distributed system made up of multiple nodes with heterogeneous links, throughput is bottlenecked by the low-bandwidth inter-node network links. Furthermore, on nodes with multiple network links this uses only one link, wasting inter-node bandwidth. For example, an A100 node has 8 IB links, but a single send can only utilize 1 of them.

We designed the AllToNext algorithm specifically to address this problem by utilizing all available IB links in a node. Within a node, GPUs directly send their buffers to the next GPU; when transferring across a node boundary (Figure 10), all GPUs within the node cooperatively send the buffer to utilize all IB links. Specifically, on a scaled-down A100 system (N = 2, G = 3), when GPU (0, 2) sends to GPU (1, 0), it divides its buffer into G = 3 chunks and scatters them among all GPUs in node 0. Each GPU (0, g) directly sends its chunk to the corresponding GPU on the next node (1, g), and finally all chunks are gathered back onto GPU (1, 0).

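The cross-node step can be sketched in Python as a scatter, G parallel IB sends, and a gather. The structure follows the scaled-down example above; the helper name is hypothetical.

```python
# Illustrative simulation of AllToNext's cross-node step: the last GPU of a
# node sends its buffer to the first GPU of the next node by splitting it
# across all G local GPUs so that all G IB links are used in parallel.

def cross_node_send(buffer, G):
    # Scatter: split the sender's buffer into G pieces across its node.
    pieces = [buffer[g::G] for g in range(G)]

    # Parallel IB sends: GPU (n, g) forwards piece g to GPU (n + 1, g).
    # Modeled here as a plain hand-off; in reality each piece uses its own NIC.
    received = pieces

    # Gather: reassemble the pieces on GPU (n + 1, 0).
    out = [None] * len(buffer)
    for g, piece in enumerate(received):
        out[g::G] = piece
    return out

if __name__ == "__main__":
    G = 3                                  # GPUs per node, as in the example
    buf = list(range(12))                  # buffer on GPU (0, 2)
    assert cross_node_send(buf, G) == buf  # GPU (1, 0) receives the full buffer
```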

[Figure 10: AllToNext cross-node transfer]

Figure 8g plots the speedup of AllToNext on 3×A100 nodes against a handwritten CUDA baseline in which each GPU directly sends its entire buffer to the next GPU using NCCL’s send and receive primitives. When sending small buffers, AllToNext performs worse than the baseline due to the overhead of the extra communication steps. For larger buffers, however, AllToNext begins to show improvement over the baseline, and is ultimately up to 14.5× faster for large buffers. The best-performing degree of parallelization depends on the buffer size. For small buffer sizes, less parallelization provides better performance, as the benefit from parallelizing communication does not offset the cost of initializing extra resources. As the buffer sizes increase, programs with more parallelization produce larger speedups because the initialization overhead is amortized over more communication.

Important

For small buffer sizes, less parallelization provides better performance, as the benefit from parallelizing communication does not offset the cost of initializing extra resources.

As the buffer sizes increase, programs with more parallelization produce larger speedups as the initialization overhead is amortized over more communication.

SCCL Comparison

SCCL [4] is an automatic collective communication algorithm generator that similarly considers both the latency and the bandwidth of each link to find optimal solutions. SCCL implements these algorithms from scratch using its own point-to-point communication protocol, which works only for GPUs interconnected with NVLink. The focus of SCCL is mostly on generating custom algorithms, while the focus of MSCCLang is on implementing any custom algorithm over any interconnect, such as InfiniBand, shared host memory, or NVLink.

Figure 11 compares the performance of the (1,2,2) AllGather algorithm from SCCL on DGX-1 8×V100 GPUs (see Table 4 in Section 5.4 of [4]) using the SCCL and MSCCLang implementations. The MSCCLang implementation is clearly faster for small sizes thanks to the LL protocol, but the Simple protocol is not as performant as the SCCL protocol for medium sizes. The reason is that the SCCL implementation uses a direct copy from source to destination for point-to-point communication, while the MSCCLang protocols use FIFO slots as intermediate buffers, as explained in Section 6.1. This means the SCCL protocol has a smaller memory footprint than the MSCCLang Simple protocol and is therefore more efficient. The SCCL direct-copy approach could also be implemented in the MSCCLang Simple protocol, but we leave that for future work.

Point-to-Point Connections

Remote Buffers (a review of Section 6.1):

  • NCCL abstracts different kinds of interconnects from CUDA code by providing intermediate buffers of a constant size of b bytes for sends to write to and receives to read from.
  • These buffers are subdivided into s FIFO slots, which allows s sends to finish without waiting for receives (1 ≤ s ≤ 8).
  • The MSCCLang compiler rejects any schedule with more than s outstanding sends to avoid deadlocks. By default, 512KB ≤ b ≤ 5MB and 1 ≤ s ≤ 8 (the exact values are defined by the protocol, explained later); a small model of this constraint is sketched below.
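The sketch below treats each connection as a bounded queue of s FIFO slots (a hypothetical model, not NCCL’s implementation): a send blocks once s sends are outstanding, which is exactly what the compiler’s schedule must avoid to remain deadlock-free.

```python
# Hypothetical model of NCCL-style intermediate buffers: a connection has s
# FIFO slots, so at most s sends can complete before a matching receive frees
# a slot. A schedule needing more than s outstanding sends would block here,
# which is why the MSCCLang compiler rejects such schedules.

from collections import deque

class Connection:
    def __init__(self, slots):
        self.slots = slots
        self.fifo = deque()

    def send(self, chunk):
        if len(self.fifo) >= self.slots:
            raise RuntimeError("send would block: all FIFO slots are full")
        self.fifo.append(chunk)      # producer writes into a free slot

    def recv(self):
        return self.fifo.popleft()   # consumer frees the oldest slot

if __name__ == "__main__":
    conn = Connection(slots=2)
    conn.send("chunk0")
    conn.send("chunk1")
    try:
        conn.send("chunk2")          # a third outstanding send exceeds s = 2
    except RuntimeError as e:
        print(e)
    conn.recv()                      # a receive frees a slot...
    conn.send("chunk2")              # ...so the blocked send can now proceed
```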

End-to-End Results

Custom algorithms generated by MSCCLang are used by Azure OpenAI to accelerate Copilot, speeding up GPU time by 20%. MSCCLang is also used for training a large Mixture-of-Experts model [28] for speech, language, and vision on 256×A100 GPUs, providing a 1.10-1.89× speedup, depending on the model architecture, when using the Azure Container for PyTorch [32].
