Conclusions

This paper presented an extensive survey of industry-led collective communication libraries (xCCL), which are frequently used in distributed deep learning training workloads. We started at the physical network topology layer that underlies all communication between devices. We then discussed the data transfer algorithms used in collective routines. Next, we explored different industry solutions by comparing their feature sets and explaining real-world deep learning application use cases. We evaluated xCCL performance by running two industry-developed benchmarks, NCCL Tests and PARAM, and based on our results we explained the performance characteristics of the evaluated xCCLs. We also discussed why xCCLs are gaining traction in industry even though classic communication libraries, such as MPI implementations, already exist. We further explained how these libraries take advantage of hardware accelerators and fast interconnects to support deep learning training workloads. Through our tests and investigation, we determined that NCCL is currently the most mature collective communication library. We hope that future efforts will explore the optimizations present in NCCL and effectively apply them to other xCCLs.

Acknowledgements

On the momentous occasion of Prof. Kai Hwang’s 80th birthday, we would like to express our deepest gratitude and admiration for his exceptional contributions to the field of parallel computing, as well as for his unwavering commitment to educating and inspiring generations of students, including ourselves. We also thank the anonymous reviewers for their insightful comments and suggestions.