Industry Solutions — xCCL

The rise in the popularity of distributed deep learning has contributed to growing interest in fast, efficient, and portable collective implementations. A summary of industry collective communication libraries is shown in Table 3.

[Table 3: summary of industry collective communication libraries]

In this part, we focus mainly on MSCCL.

Microsoft MSCCL

The Microsoft Azure team proposed the Microsoft Collective Communication Library (MSCCL) [36] to make creating and executing custom collective communication algorithms much easier. MSCCL is made up of three components: GC3 [73], TACCL [74], and SCCL [75]. GC3 provides a data-oriented domain-specific language (DSL) and a corresponding compiler to simplify GPU communication programming. TACCL guides a synthesizer to automatically generate algorithms. SCCL synthesizes collective communication algorithms tailored to the hardware topology. With these three components, custom collective communication algorithms can be implemented efficiently and flexibly in MSCCL.

Architecture

[Fig. 8: MSCCL architecture overview]

For a given collective communication algorithm and physical topology, MSCCL can explore different implementations and optimizations through high-level specifications. MSCCL generates efficient custom communication algorithms from a chunk-oriented program, as shown in Fig. 8(d). The chunk-oriented program specifies how each chunk is routed from source to destination. To specify chunk routing through GPUs, GC3 uses a DSL and TACCL uses a communication sketch. Once the program is created, it is traced into a chunk directed acyclic graph (DAG). The instruction DAG (distinct from the chunk DAG) is then created by expanding the chunk operations into instruction operations. After that, the instruction DAG is scheduled and compiled into an intermediate representation (IR). Once the IR is generated, the MSCCL runtime executes it efficiently, since the runtime inherits NCCL's ability to set up point-to-point links over various interconnects such as NVLink and PCIe.

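To make this concrete, here is a minimal sketch of such a chunk-oriented program: a ring AllGather written in the style of the GC3/MSCCLang Python DSL from the msccl-tools repository. The module paths and call signatures follow that repository's public examples but may differ across versions, so treat this as illustrative rather than canonical.

```python
from msccl.language import *
from msccl.topologies import fully_connected
from msccl.language.collectives import AllGather

def allgather_ring(size):
    # Logical topology plus the collective's pre/postconditions.
    topology = fully_connected(size)
    collective = AllGather(size, 1, True)  # one chunk per rank, in-place
    with MSCCLProgram('allgather_ring', topology, collective, instances=1):
        for r in range(size):
            # Rank r's single input chunk ...
            c = chunk(r, Buffer.input, 0)
            # ... is forwarded hop by hop around the ring, landing at
            # output index r on every other rank.
            nxt = (r + 1) % size
            while nxt != r:
                c = c.copy(nxt, Buffer.output, r)
                nxt = (nxt + 1) % size
        XML()    # lower the traced program to MSCCL's XML IR
        Check()  # verify the program realizes the AllGather postcondition

allgather_ring(8)
```

Tracing this program yields the chunk DAG described above: each copy becomes a node whose dependencies follow from which operation produced its source chunk.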

Framework Support

Because MSCCL's API is compatible with NCCL, it is convenient to integrate the MSCCL runtime into state-of-the-art deep learning frameworks such as PyTorch by swapping out the NCCL backend for the MSCCL backend.

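Since the swap happens at the shared-library level, framework code does not change. Below is a hedged sketch of a PyTorch script that would run unmodified under MSCCL; the launch line in the comment follows the msccl repository's README, and the library path and XML file name are placeholders for your own build.

```python
# Hypothetical launch (paths are placeholders):
#   MSCCL_XML_FILES=allgather_ring.xml \
#   LD_PRELOAD=/path/to/msccl/build/lib/libnccl.so \
#   torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # PyTorch still asks for the "nccl" backend; because MSCCL exposes
    # the same C API, no Python-side changes are needed.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x)  # dispatched to ncclAllReduce, served by MSCCL
    assert int(x[0].item()) == dist.get_world_size()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```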

MSCCL Runtime

MSCCL DSL

The DSL is a chunk-oriented dataflow language for writing efficient communication kernels. The programmer specifies in this language how chunks are routed across GPUs, as in the AllGather sketch above.

MSCCL Runtime

The IR is the executable artifact generated by MSCCL's compiler, and it is what the MSCCL runtime executes. The runtime extends NCCL, reusing NCCL's point-to-point send and receive functionality, and is backward compatible with NCCL's API.

MSCCL Compiler

The MSCCL compiler traces the program to record the chunk dependencies in the chunk DAG. The compiler then performs a series of optimizations and schedules the resulting chunk DAG onto the thread blocks specified in the IR. The MSCCL DSL lets users guide how the compiler optimizes and schedules the program.

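As a deliberately simplified illustration of the trace step (hypothetical scaffolding, not MSCCL's actual internals), recording each copy as it executes and remembering which operation last produced each chunk location is enough to build the chunk DAG:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkOp:
    """One node of the chunk DAG: a copy of a chunk between locations."""
    op_id: int
    src: tuple   # (rank, buffer, index)
    dst: tuple   # (rank, buffer, index)
    deps: list = field(default_factory=list)  # op_ids this copy waits on

class Tracer:
    """Records chunk copies and the producer of each chunk location."""
    def __init__(self):
        self.ops = []
        self.producer = {}  # (rank, buffer, index) -> producing ChunkOp

    def copy(self, src, dst):
        op = ChunkOp(len(self.ops), src, dst)
        if src in self.producer:
            # This copy must wait for whichever op last wrote src.
            op.deps.append(self.producer[src].op_id)
        self.producer[dst] = op
        self.ops.append(op)
        return op

# Tracing a two-hop relay (rank 0 -> 1 -> 2) yields a two-node chain.
t = Tracer()
t.copy((0, 'input', 0), (1, 'output', 0))
t.copy((1, 'output', 0), (2, 'output', 0))
for op in t.ops:
    print(op.op_id, op.src, '->', op.dst, 'deps:', op.deps)
```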

Optimization

Optimizing the program's schedule is important for performance. A set of scheduling directives is used to manage the parallelization trade-offs when scheduling instructions onto multiple thread blocks (a sketch using these directives follows the list). There are several aspects:

  1. Multiple connections may exist between the same pair of GPUs; they are labeled as channels to distinguish them, so that the most efficient channel can be allocated to a particular operation.
  2. A transfer can be broken up into multiple smaller transfers to increase execution parallelism.
  3. When multiple contiguous chunks are sent from one GPU to another, aggregating them into a single transfer reduces latency.

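In the MSCCLang DSL these knobs surface as scheduling directives on each copy. The following is a hedged sketch reusing the ring AllGather from above: per the GC3 paper and the msccl-tools examples, `ch` pins a channel (aspect 1), the `instances` argument replicates the algorithm so transfers are split into smaller parallel pieces (aspect 2), and `sendtb`/`recvtb` pin the sending and receiving thread blocks; exact keyword names may vary across versions.

```python
from msccl.language import *
from msccl.topologies import fully_connected
from msccl.language.collectives import AllGather

def allgather_ring_scheduled(size, instances=2):
    topology = fully_connected(size)
    collective = AllGather(size, 1, True)
    # `instances` replicates the algorithm so each logical transfer is
    # split into smaller parallel pieces (aspect 2 above).
    with MSCCLProgram('allgather_ring_sched', topology, collective, instances):
        for r in range(size):
            c = chunk(r, Buffer.input, 0)
            nxt = (r + 1) % size
            while nxt != r:
                # Pin this hop to channel 0, sending from thread block 0
                # and receiving into thread block 1 (aspect 1). Keyword
                # names follow the GC3 paper / msccl-tools examples.
                c = c.copy(nxt, Buffer.output, r, ch=0, sendtb=0, recvtb=1)
                nxt = (nxt + 1) % size
        XML()
        Check()

allgather_ring_scheduled(8)
```

Aggregation (aspect 3) corresponds to referencing several contiguous chunks with a single chunk reference so that they travel as one transfer.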

Practical Workloads and Applications

MSCCL Performance

MSCCL has been used for inference with a public-facing language model on 8x A100 GPUs, accelerating its GPU operations by 1.22x–1.29x depending on the input batch size. MSCCL has also been used to train a sizeable Mixture-of-Experts model on 256x A100 GPUs, providing a 1.10x–1.89x speed-up depending on the Mixture-of-Experts model architecture.

Others

The other libraries in Table 3 are omitted here. :)