RELATED WORK

Optimizing Collectives. The Message Passing Interface (MPI) [11] is a popular abstraction for communication primitives. Efficient algorithms for implementing these primitives have been a long-studied research area [5, 31, 41]. Prior work has optimized algorithms for specific topologies such as mesh, hypercube, or fat-tree [2, 3, 37] and for clusters of shared-memory processors [34, 40, 42, 43]. Motivated by recent ML workloads, Horovod [38] implements collective primitives by using NCCL within a node and MPI across nodes. Others, such as BlueConnect [7] and PLink [24], exploit the hierarchical network topology of a cloud system or a data center to improve the performance of collective primitives. Recent work focuses on automatically generating new collective algorithms, either by packing trees [44] or by using a constraint solver to generate Pareto-optimal algorithms [4]. In contrast, this work focuses on a high-level language for specifying these algorithms and efficiently running them on state-of-the-art accelerators.

In-network aggregation is another direction for accelerating reduction-based communication primitives using custom hardware. The Mellanox Scalable Hierarchical Aggregation and Reduction Protocol (SHArP) [13] is one such technique available in InfiniBand switches. Programmable switches such as SwitchML [36] and ATP [22] share a similar idea of offloading GPU reductions to network switches in order to accelerate AllReduce in deep learning workloads. Beyond switches, BluesMPI [16], ACCL [18], and BytePS [21] offload communication primitives to SmartNICs, FPGAs, and spare CPU nodes, respectively. These works all introduce extra hardware to raise the bandwidth available to primitives, whereas MSCCLang focuses solely on the software stack to program and optimize collective communication algorithms on existing hardware.

Recent works [15, 20, 30, 46] have shown the advantage of overlapping computation and communication when optimizing distributed ML workloads. While our focus here is on specifying communication collectives, extending MSCCLang to also specify the scheduling of computation is interesting future work.

Dataflow Languages. The chunk-oriented programming style of MSCCLang is motivated by dataflow programming languages. The design of the language is particularly influenced by declarative coordination languages such as Linda [12] and Concurrent Collections [19]. Rather than using explicit tuples, MSCCLang uses implicit chunk identifiers to coordinate multiple ranks. Cilk [23] also influenced MSCCLang in that the deterministic semantics of a program is specified by the sequential semantics of the host language.
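
To make this chunk-oriented style concrete, the sketch below shows how a ring AllGather could be written: each chunk is identified implicitly by the (rank, buffer, index) location it occupies, and the sequential order of the copy calls in the host Python program fixes the deterministic semantics. This is a minimal sketch paraphrased from the paper's examples; the identifiers used here (MSCCLProgram, chunk, Buffer, copy, fully_connected, AllGather, XML) are assumptions and may differ from the released msccl.language API.

    # Illustrative sketch only: the identifiers below are assumed from the
    # paper's examples and may not match the released msccl.language API.
    from msccl.language import MSCCLProgram, chunk, Buffer, XML
    from msccl.language.collectives import AllGather
    from msccl.topologies import fully_connected

    def allgather_ring(size):
        topology = fully_connected(size)        # every pair of ranks is connected
        collective = AllGather(size, 1, False)  # one chunk per rank, out-of-place
        with MSCCLProgram("allgather_ring", topology, collective, 1):
            for r in range(size):
                # A chunk is named implicitly by its location
                # (rank r, input buffer, index 0) -- no explicit tuples.
                c = chunk(r, Buffer.input, 0)
                dst = (r + 1) % size
                while dst != r:
                    # Forward the chunk one hop around the ring; the sequential
                    # order of these copies in Python defines the program's
                    # deterministic semantics, and the compiler infers the
                    # cross-rank dependences between chunk operations.
                    c = c.copy(dst, Buffer.output, r)
                    dst = (dst + 1) % size
            XML()  # lower the program to the runtime's executable format

Written this way, coordination between ranks is determined entirely by where chunks are read and written, mirroring the coordination-language heritage noted above.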

CONCLUSION

MSCCLang is a novel software system designed for implementing GPU collective communication. It provides a domain-specific language for flexibly expressing collective implementations and a compiler that lowers the DSL to a low-level representation which is efficiently executed by an optimized runtime. We evaluated MSCCLang by implementing the common collectives AllToAll and AllReduce on different GPU systems, where they outperform the state-of-the-art GPU collective library. Additionally, we introduced a custom collective, AllToNext, which demonstrates the flexibility to develop new collectives that are not part of the standard MPI interface. We believe the programmability of MSCCLang will empower ML researchers to optimize existing collectives or explore new ones in their GPU workloads.

ACKNOWLEDGMENTS

We would like to thank our colleagues at Microsoft Research in the RiSE and Systems & Networking Group for their early feedback on this work and the Azure HPC team for their help on experiments. We also thank our anonymous reviewers for their detailed feedback.