
xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

Abstract

Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training approaches. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of the distributed training process, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (namely xCCL, including NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance using industry-developed benchmarks (i.e., NCCL Tests and PARAM), and discuss key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.

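The gradient sharing mentioned above is typically realized as an all-reduce collective in data-parallel training. As an illustration only (not code from the paper), the sketch below uses NCCL's single-process, multi-GPU API (ncclCommInitAll, ncclAllReduce) to sum per-GPU buffers that stand in for local gradients; the buffer size is illustrative and error handling is omitted for brevity.

```c
/* Minimal sketch (assumes CUDA and NCCL are installed): sum "gradient"
 * buffers across all visible GPUs with a single all-reduce, as done in
 * data-parallel training. Buffer size is illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev < 1) { fprintf(stderr, "no CUDA devices\n"); return 1; }

    const size_t count = 1 << 20;   /* elements per per-GPU gradient buffer */
    ncclComm_t   *comms   = malloc(ndev * sizeof(ncclComm_t));
    cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
    float       **sendbuf = malloc(ndev * sizeof(float*));
    float       **recvbuf = malloc(ndev * sizeof(float*));

    /* Allocate per-GPU buffers standing in for locally computed gradients. */
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc((void**)&sendbuf[i], count * sizeof(float));
        cudaMalloc((void**)&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* One NCCL communicator per GPU, all within this process
     * (NULL device list uses devices 0..ndev-1). */
    ncclCommInitAll(comms, ndev, NULL);

    /* Sum the gradients across GPUs; every rank receives the reduced result. */
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    /* Wait for the collective to complete on every GPU. */
    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }

    for (int i = 0; i < ndev; ++i) {
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    free(sendbuf); free(recvbuf); free(streams); free(comms);
    return 0;
}
```

The same all-reduce pattern appears, with library-specific APIs, in the other xCCLs surveyed in this paper (e.g., oneCCL, RCCL, and Gloo).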

Keywords

collective, deep learning, distributed training, GPUDirect, RDMA (remote direct memory access)