INTRODUCTION

Recent trends in machine learning (ML) point towards model sizes growing at a much faster rate than a single GPU’s memory capacity and computational power [1, 29]. This necessitates distributing model parameters across multiple GPUs [9, 25, 39] for both model training and model inference. The resulting cost of communication increases as a percentage of total GPU execution time as models become larger. For instance, training Resnet50 [17] with ≈100MB of parameters spends 3% of the time in communication [35], while training DeepLight [10] with ≈2GB of parameters spends 79% of its time in communication on the same distributed system. Therefore, optimizing communication will be critical for future ML workloads.

Isn't the share of time that large models spend on communication strikingly high?

Communication kernels in ML workloads support Message Passing Interface (MPI) collective communication operations, such as AllReduce, AllGather, and AllToAll [14]. These collectives cooperatively exchange data across GPUs using various communication algorithms [41]. Vendor libraries, like NCCL [27] and RCCL [33], provide high-performance implementations of a few standard algorithms, namely Ring and Tree. Recent research [4, 6, 44, 45] has shown the promise of custom algorithms that are tailored for underlying interconnection topologies and input sizes. However, these works do not implement low-level optimizations such as pipelining, parallelization, and fusion that are necessary for maximizing performance. Partly to avoid the complexity of implementing such low-level optimizations, many works [6, 44, 45] compose existing vendor library implementations; doing so not only incurs the cost of multiple kernel launches but also loses the opportunity to perform optimizations that cross kernel boundaries.
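
To make the notion of a "communication algorithm" concrete, here is a minimal sketch (not taken from the paper) that prints the per-step send/receive schedule of a textbook Ring AllGather on N ranks: each rank starts with its own chunk and, after N-1 steps, holds all N chunks.

```c
#include <stdio.h>

/* Illustrative only: print the per-step send/receive schedule of a textbook
 * Ring AllGather on N ranks. Each rank starts with its own chunk and, after
 * N-1 steps, holds all N chunks. No real communication happens here. */
#define N 4

int main(void) {
    for (int step = 0; step < N - 1; step++) {
        for (int rank = 0; rank < N; rank++) {
            int send_chunk = (rank - step + N) % N;     /* chunk forwarded this step */
            int recv_chunk = (rank - step - 1 + N) % N; /* chunk arriving this step  */
            int next = (rank + 1) % N;                  /* ring neighbor to send to  */
            int prev = (rank - 1 + N) % N;              /* ring neighbor to recv from */
            printf("step %d: rank %d sends chunk %d to rank %d, receives chunk %d from rank %d\n",
                   step, rank, send_chunk, next, recv_chunk, prev);
        }
    }
    return 0;
}
```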

Why does composing existing vendor-library implementations cause multiple kernel launches and prevent optimization across kernel boundaries?
  1. Independent kernel implementations: the collective operations in vendor libraries (AllReduce, AllGather, AllToAll) are typically implemented as standalone kernels, so every call launches a new kernel (see the sketch after this list).
  2. Launch overhead: every kernel launch carries setup costs, including configuring kernel arguments, allocating resources, and issuing the launch itself; these costs accumulate and can noticeably hurt overall performance.
  3. Context switching: each launch also forces the GPU to switch from its current task to the new one, adding further latency.
  4. Lost optimization opportunities: with vendor libraries, each kernel executes independently and cannot share data or resources with the others, which rules out cross-kernel optimizations such as data reuse, pipelining, and parallelization.
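
As a concrete illustration of that composition pattern, below is a minimal sketch that builds an AllReduce out of two separate NCCL calls, ncclReduceScatter followed by ncclAllGather. Each call is its own kernel launch, and nothing can be fused or pipelined across the boundary between them. composed_allreduce is a hypothetical helper written for this note, not an NCCL or MSCCLang API, and the buffer setup is assumed.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* Composed AllReduce = ReduceScatter + AllGather, issued as two separate NCCL
 * calls (and therefore two kernel launches). Buffers are device pointers of
 * `count` floats; `count` is assumed divisible by the number of ranks; the
 * communicator and stream are created elsewhere. */
void composed_allreduce(const float* sendbuf, float* recvbuf, size_t count,
                        ncclComm_t comm, cudaStream_t stream) {
    int rank, nranks;
    ncclCommUserRank(comm, &rank);
    ncclCommCount(comm, &nranks);
    size_t shard = count / nranks;

    /* Kernel launch #1: rank r ends up with the reduced shard r,
     * written into its slot of recvbuf. */
    ncclReduceScatter(sendbuf, recvbuf + (size_t)rank * shard,
                      shard, ncclFloat, ncclSum, comm, stream);

    /* Kernel launch #2: gather every rank's reduced shard so all ranks hold
     * the full reduced vector (in place: the send slot lives inside recvbuf). */
    ncclAllGather(recvbuf + (size_t)rank * shard, recvbuf,
                  shard, ncclFloat, comm, stream);
}
```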

This paper proposes Microsoft’s Collective Communication Language (MSCCLang), a unified system for generating high-performance implementations of custom communication algorithms. MSCCLang consists of a domain-specific language (DSL) for specifying communication algorithms, a compiler for generating high-performance executables from these high-level specifications, and an efficient runtime for execution. For a given collective communication algorithm, a developer can explore different implementations and optimizations in MSCCLang without fearing data races or deadlocks and without writing any C/CUDA code, while enjoying the performance of hand-written code. Additionally, MSCCLang can automatically check whether an implementation correctly realizes a collective before it runs on hardware. Lastly, the runtime is API-compatible with NCCL, allowing existing ML workloads to easily convert to MSCCLang, inherit NCCL’s support for a diverse set of GPUs and interconnects, and safely fall back to NCCL for scenarios unsupported in MSCCLang. MSCCLang is publicly available at https://github.com/microsoft/msccl and https://github.com/microsoft/msccl-tools.
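
Because the runtime keeps NCCL's API, an existing call site needs no source changes; what differs is only which library the binary links against and which compiled MSCCLang algorithm the runtime loads, configured outside this code. The snippet below is a minimal sketch of such an unchanged call site; gradient_allreduce and the buffer setup are assumptions for illustration.

```c
#include <nccl.h>
#include <cuda_runtime.h>

/* An unchanged NCCL call site: because the MSCCLang runtime exposes NCCL's
 * API, this code works as-is whether it is linked against stock NCCL or
 * against the MSCCL runtime (which may run a custom compiled algorithm for
 * the collective). The gradient buffer, communicator, and stream are assumed
 * to be initialized elsewhere. */
void gradient_allreduce(float* grads, size_t count,
                        ncclComm_t comm, cudaStream_t stream) {
    /* In-place AllReduce (sum) over the gradient buffer. */
    ncclAllReduce(grads, grads, count, ncclFloat, ncclSum, comm, stream);
}
```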

We evaluate MSCCLang on two distributed GPU systems: a cluster of 8×A100 nodes and a cluster of 16×V100 nodes. We show that, for a given algorithm, MSCCLang implementations match, and often beat, the performance of a hand-written implementation. This includes an AllToAll algorithm on multiple nodes that is up to 1.3× faster than a hand-optimized implementation and a Ring AllReduce algorithm that is up to 1.9× faster than NCCL’s optimized implementation. Additionally, we make a case for custom collectives by replacing simple point-to-point communication with a new collective called AllToNext. Lastly, the MSCCLang system is used at Microsoft to serve a public-facing language model on 8×A100 GPUs and to train a large Mixture-of-Experts model for speech, language, and vision on 256×A100 GPUs, providing 1.22-1.29× and 1.10-1.89× speedups, respectively.
