Introduction

Designing high-performance communication subsystems is one of the most challenging tasks in scalable parallel computing, as communication performance directly influences the execution efficiency of large-scale distributed software.


Collectives are a form of organized communication that has become ubiquitous in parallel computing, distributed computing, and high-performance computing (HPC) applications.


Collective communication operations, such as Broadcast and All-Reduce, can aggregate and disseminate data across multiple processes while retaining a relatively simple API (Application Programming Interface).
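
As a minimal sketch of how simple this API surface is, the following C program (assuming any MPI implementation, e.g., MPICH or Open MPI; the values exchanged are illustrative) broadcasts a value from rank 0 and then sums one contribution per process with All-Reduce:

```c
/* Minimal sketch of the collective API surface: Broadcast + All-Reduce.
 * Compile with an MPI compiler wrapper such as mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Broadcast: the root (rank 0) disseminates a value to all processes. */
    int config = (rank == 0) ? 42 : 0;
    MPI_Bcast(&config, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* All-Reduce: every process contributes a value; all receive the sum. */
    int local = rank + 1, global_sum = 0;
    MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: config=%d, sum=%d\n", rank, config, global_sum);
    MPI_Finalize();
    return 0;
}
```

Built and launched the usual way (e.g., `mpicc demo.c -o demo && mpirun -np 4 ./demo`), each rank prints the same broadcast value and the same global sum; all routing, buffering, and synchronization is handled inside the library.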


Collectives abstract away much of the complexity of managing communication; however, it is critical that both the collective communication implementation and the programming model chosen are well-architected, well-designed, and optimized for the particular intended application.


Having existed for almost 30 years, Message Passing Interface (MPI) is one of the most widely used programming models for large-scale scientific applications that involve collective communication. Due to its high speed and portability, MPI has become the model favored in the academic community. There are various implementations of the MPI programming model, such as MPICH, MVAPICH, and Open MPI. Despite the age of MPI and the development of new collective communication models and libraries, few have been able to compete with MPI in terms of popularity and generality. Some examples of newer non-MPI libraries are OpenSHMEM (Open-source Symmetric Hierarchical Memory), UCX (Unified Communication X), and UCC (Unified Collective Communication). Fig. 1 shows an overview of the classic collective communication libraries, modern collective communication libraries, as well as related communication hardware and interconnects.


In recent years, machine learning (ML), especially deep learning (DL), has become an extremely hot topic, and there have been numerous advancements in many scientific fields such as computer vision and natural language processing. With continuously increasing data volume and model sizes, methods for reducing training and inference time have themselves become important research topics. For example, the pre-trained Transformer GPT-3 (Generative Pre-trained Transformer 3) model contains approximately 175 billion parameters and may take multiple days (or more) to train on advanced GPU-based clusters.


Long training times are often considered blockers for the practical deployment of such models. The situation is similarly severe for industry-scale ML/DL models such as deep learning recommendation models (DLRM). Therefore, it is necessary to accelerate these processes through effective use of parallel computing, and collectives have the potential to significantly influence performance and scalability. Under the influence of ML/DL, optimizations of collective routines have been heavily investigated. This evolution also extends to related communication hardware and interconnects. For example, the traditional Remote Direct Memory Access (RDMA) communication mechanism has been widely used in areas such as HPC, big data, key-value stores, and high-performance cloud computing workloads. With the advance of ML, there is an increasing demand for RDMA and GPUDirect RDMA (GDR), with interconnect speed requirements reaching 400 Gbps per port.
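
To make these numbers concrete, here is a back-of-the-envelope estimate (our own illustration, not a figure from the survey) using the standard bandwidth-optimal ring all-reduce cost model, in which each of the $p$ participants sends and receives $2\,\frac{p-1}{p}\,S$ bytes for a message of size $S$ over a link of bandwidth $B$:

$$
T_{\text{allreduce}} \;\approx\; \frac{2\,\frac{p-1}{p}\,S}{B} \;\xrightarrow{\;p \gg 1\;}\; \frac{2S}{B}
$$

Assuming fp16 gradients for a GPT-3-scale model ($S \approx 175 \times 10^9 \times 2~\text{bytes} = 350$ GB) and a 400 Gbps link ($B = 50$ GB/s), a single full-gradient all-reduce takes roughly $2 \times 350 / 50 = 14$ seconds, ignoring latency and compute/communication overlap. This is why both the collective algorithm and the interconnect speed matter so much at this scale.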


However, while MPI has enjoyed success in the academic world, it is not widely adopted in the industry. Instead, many industry-leading companies like NVIDIA and Microsoft have developed their own collective communication libraries for deep learning applications. Most notably, the NVIDIA Collective Communications Library (NCCL), first released by NVIDIA in 2015, has gained enough traction to inspire other companies to develop and deploy similar collective libraries such as AMD's ROCm Collective Communication Library (RCCL) and parts of Gloo. In this paper, we refer to such collective communication libraries as xCCL. The evolution from MPI-dominated collectives for classic HPC scenarios to emerging hardware-accelerated collectives for deep learning scenarios is shown in Fig. 1.
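
To illustrate how closely NCCL's C API mirrors the MPI-style collectives above, here is a minimal single-process, multi-GPU all-reduce sketch (the device count and buffer sizes are illustrative assumptions, and error handling is omitted for brevity):

```c
/* Minimal single-process, multi-GPU all-reduce using NCCL's public C API.
 * Link with -lnccl -lcudart. */
#include <nccl.h>
#include <cuda_runtime.h>

#define NDEV 4          /* assume 4 GPUs on this node */
#define COUNT (1 << 20) /* 1M floats per GPU */

int main(void) {
    ncclComm_t comms[NDEV];
    int devs[NDEV] = {0, 1, 2, 3};
    float *sendbuf[NDEV], *recvbuf[NDEV];
    cudaStream_t streams[NDEV];

    /* One communicator per GPU, all driven by this single process. */
    ncclCommInitAll(comms, NDEV, devs);

    for (int i = 0; i < NDEV; ++i) {
        cudaSetDevice(devs[i]);
        cudaMalloc((void **)&sendbuf[i], COUNT * sizeof(float));
        cudaMalloc((void **)&recvbuf[i], COUNT * sizeof(float));
        cudaMemset(sendbuf[i], 0, COUNT * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    /* Group the per-GPU calls so NCCL launches them as one collective. */
    ncclGroupStart();
    for (int i = 0; i < NDEV; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], COUNT, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < NDEV; ++i) {
        cudaSetDevice(devs[i]);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```

The group semantics let a single thread drive all local GPUs; in multi-node deployments the same `ncclAllReduce` call is used, with communicators instead initialized per rank from a `ncclUniqueId` exchanged out of band (commonly via MPI).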


This momentum has motivated us to pose several research questions:

1) What makes the contemporary xCCL libraries more attractive than classic MPI designs?

2) What are the performance characteristics of each collective communication library?

3) How are these xCCL libraries designed? Are there shared design patterns, and if so, why?


To answer these questions, we survey the current state-of-the-art collective communication libraries (i.e., xCCL), focusing on those developed for industry deep learning workloads. We investigate the features of these xCCLs, compare their performance with experiments, and discuss key takeaways and interesting observations.


The rest of this paper is organized as follows. Section 2 introduces widely used collective communication routines. Section 3 and Section 4 describe popular physical network topologies and collective algorithms, respectively. In Section 5, we present the impact of collectives on machine learning training, along with several case studies from industry. In Section 6, we survey representative industry-developed collective communication libraries and introduce their features. In Section 7, we select several libraries, run benchmark experiments, and compare their performance characteristics. Section 8 discusses our observations and insights. Lastly, Section 9 discusses related work and Section 10 concludes this paper.


Outline
  • Section 2: collective communication routines
  • Section 3/4: network topologies and collective algorithms
  • Section 5: impact of collectives on ML & case studies
  • Section 6: collective communication libraries and their features
  • Section 7: experiments & benchmarks
  • Section 8/9/10: discussion and conclusion

The main contributions of this paper are as follows.

  • Summarizing and studying the collective communication operations, network topologies, and algorithms that underpin contemporary distributed deep learning training.

  • Discussing industry collective communication solutions through case studies and a detailed examination of collective communication libraries.

  • Comparing the performance of current collective communication libraries using industry-made benchmarks.


[Fig. 1: Overview of classic and modern collective communication libraries, and related communication hardware and interconnects]