MSCCLANG RUNTIME

The MSCCLang runtime executes programs by directly interpreting MSCCL-IR. The runtime is an extension of NCCL, and it inherits infrastructure for establishing point-to-point (P2P) connections over various interconnects, including NVLink, PCIe, shared host memory, InfiniBand (IB), and TCP. All MSCCL-IR generated by our compiler is guaranteed to be correct, but some programs may only be performant for a range of buffer sizes. Therefore, the runtime dynamically selects the right algorithm to invoke based on user-configurable size ranges and falls back to NCCL's built-in algorithms otherwise. This allows a user to hyper-optimize MSCCLang programs for a specific use case.

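As a rough sketch of how such size-based dispatch might work (the AlgoEntry struct and mscclSelectAlgo function are hypothetical illustrations, not the actual runtime API):

#include <stddef.h>

/* Hypothetical size-range table: each registered MSCCL-IR program is
 * considered valid only within its configured byte range. */
struct AlgoEntry {
    const char *name;   /* MSCCL-IR program name */
    size_t minBytes;    /* inclusive lower bound */
    size_t maxBytes;    /* inclusive upper bound */
};

static const struct AlgoEntry kAlgos[] = {
    { "allreduce_hierarchical", 0, 1 << 20 },         /* illustrative range */
    { "allreduce_ring", (1 << 20) + 1, ~(size_t)0 },  /* illustrative range */
};

/* Return the first program whose range covers nBytes, or NULL to
 * signal a fallback to NCCL's built-in algorithms. */
const char *mscclSelectAlgo(size_t nBytes) {
    for (size_t i = 0; i < sizeof(kAlgos) / sizeof(kAlgos[0]); i++)
        if (nBytes >= kAlgos[i].minBytes && nBytes <= kAlgos[i].maxBytes)
            return kAlgos[i].name;
    return NULL;
}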

Point-to-Point Connections

Remote Buffers. NCCL abstracts different kinds of interconnects from CUDA code by providing intermediate buffers of a constant size of 𝑏 bytes for sends to write to and receives to read from. These buffers are subdivided into 𝑠 FIFO slots, which allows 𝑠 sends to finish without waiting for receives (1 ≤ 𝑠 ≤ 8). The MSCCLang compiler prevents schedules with more than 𝑠 outstanding sends, to avoid deadlocks. By default, 512KB ≤ 𝑏 ≤ 5MB and 1 ≤ 𝑠 ≤ 8 (exact values are defined by the protocol, explained later).

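The slot mechanics imply a simple flow-control rule: a sender may run at most 𝑠 slots ahead of its receiver. A minimal device-side sketch of that rule (single-threaded for clarity; the names and the copy path are illustrative, not NCCL's actual implementation):

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: sendCnt/recvCnt are counters shared between
 * sender and receiver; slots is the remote buffer, divided into s
 * slots of slotBytes each. */
__device__ void sendChunk(volatile uint64_t *sendCnt,
                          volatile uint64_t *recvCnt,
                          char *slots, size_t slotBytes, int s,
                          const char *src, size_t bytes) {
    /* All s slots in flight: the send must wait for a receive to drain one. */
    while (*sendCnt - *recvCnt >= (uint64_t)s)
        ;
    char *slot = slots + (*sendCnt % s) * slotBytes;
    memcpy(slot, src, bytes);   /* stand-in for the real copy path */
    __threadfence_system();     /* make the data visible to the peer */
    (*sendCnt)++;               /* publish the completed send */
}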

Remote buffers are allocated on different memories depending on the interconnect type. For NVLink or PCIe connections within a node, the buffer is allocated on the receiving GPU. For cross-node IB connections, two buffers are allocated: one on the sending GPU and another on the receiving GPU. The IB driver transfers data between the buffers via GPUDirect RDMA [26], with a CPU helper thread initiating the RDMA transfers. Other types of interconnects involve host memory, but we omit their description as they are not used on our evaluation systems.

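A sketch of this placement policy (the ConnType enum and allocRemoteBuffers function are hypothetical; only the allocation sites follow the description above):

#include <cuda_runtime.h>
#include <stddef.h>

typedef enum { CONN_NVLINK, CONN_PCIE, CONN_IB } ConnType;

/* Illustrative: where the remote buffer(s) of size b live per
 * interconnect type. */
void allocRemoteBuffers(ConnType t, size_t b,
                        void **sendBuf, void **recvBuf) {
    switch (t) {
    case CONN_NVLINK:
    case CONN_PCIE:
        /* Intra-node: a single buffer on the receiving GPU. */
        *sendBuf = NULL;
        cudaMalloc(recvBuf, b);          /* called on the receiver */
        break;
    case CONN_IB:
        /* Cross-node: one buffer per side; a CPU helper thread moves
         * data between them with GPUDirect RDMA. */
        cudaMalloc(sendBuf, b);          /* on the sending GPU */
        cudaMalloc(recvBuf, b);          /* on the receiving GPU */
        break;
    }
}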


Channels. As explained in Section 5, each P2P connection in NCCL requires a channel, which is an internal NCCL data structure that distinguishes different P2P connections between the same pair of GPUs.


Protocols. NCCL implements three communication protocols, Simple, LL128, LL, that trade off latency and bandwidth. Simple has the highest bandwidth and latency, LL has the lowest bandwidth and latency, and LL128’s performance is in-between [27]. The protocol also defines the remote buffer size and the number of slots. The user may set a desired protocol in the DSL, which is stored in the MSCCL-IR.

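To make the trade-off concrete, one can imagine the protocol choice fixing the buffer geometry roughly like this (the specific numbers below are assumptions consistent with the 512KB ≤ 𝑏 ≤ 5MB and 1 ≤ 𝑠 ≤ 8 bounds above, not NCCL's actual defaults):

#include <stddef.h>

typedef enum { PROTO_LL, PROTO_LL128, PROTO_SIMPLE } Proto;

/* Illustrative protocol-to-geometry mapping: remote buffer size b
 * and slot count s as functions of the protocol. */
void protoBufferGeometry(Proto p, size_t *bytes, int *slots) {
    switch (p) {
    case PROTO_LL:     *bytes = 512 * 1024;        *slots = 8; break;
    case PROTO_LL128:  *bytes = 4096 * 1024;       *slots = 8; break;
    case PROTO_SIMPLE: *bytes = 5 * 1024 * 1024;   *slots = 2; break;
    }
}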

Interpreter

Initialization. In the initialization phase of the runtime, an MSCCL-IR program is parsed and stored in GPU memory. When the runtime invokes the interpreter for a given program, it concurrently launches all the required thread blocks with a cooperative kernel launch [8]. Note that all thread blocks must execute at the same time due to potential cross thread block dependencies between them. Consequently, the compiler can only generate IRs that do not use more thread blocks than the available Streaming Multiprocessors (SMs). The connections needed by the thread blocks in every program (Figure 4) are also created at this point.

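A sketch of how such a launch might look with the CUDA runtime API (interpreterKernel and launchInterpreter are hypothetical names; the one-block-per-SM check mirrors the constraint above):

#include <cuda_runtime.h>

__global__ void interpreterKernel(void *ir);   /* hypothetical entry point */

/* Launch all thread blocks at once with a cooperative launch, after
 * checking the program needs no more blocks than there are SMs. */
cudaError_t launchInterpreter(void *irDev, int numThreadBlocks,
                              int threadsPerBlock, cudaStream_t stream) {
    int dev, numSMs;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    if (numThreadBlocks > numSMs)
        return cudaErrorInvalidConfiguration;  /* cannot co-schedule all blocks */

    void *args[] = { &irDev };
    return cudaLaunchCooperativeKernel((void *)interpreterKernel,
                                       dim3(numThreadBlocks),
                                       dim3(threadsPerBlock),
                                       args, 0 /* shared mem */, stream);
}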

Instruction Data Structure. The execution engine of the MSCCLang runtime is an efficient interpreter written in CUDA, shown in Figure 5, which runs a list of instructions on each thread block. Line 1 shows the elements of an instruction: step is the instruction index in an array, opCode identifies the instruction type, srcPtr and dstPtr are the input and output pointers, and srcOff and dstOff are their corresponding offsets. Each pointer can be one of the input, output, or scratch buffers, and an offset is a chunk index into the buffer. count is the number of consecutive chunks this instruction executes on (see aggregation in Section 2). The last arguments are for cross thread block synchronization: depBid and depStep are arrays of the thread block IDs and instruction steps, respectively, that this instruction depends on. hasDep is a boolean flag indicating whether other instructions depend on this instruction.


Figure 5: MSCCL-IR Interpreter
struct Instruction {
    int step, opCode, srcOff, dstOff, count;
    void *srcPtr, *dstPtr;
    int depBid[D], depStep[D];
    bool hasDep;
};
  • step: the instruction's index in the array
  • opCode: the instruction type
  • srcPtr and dstPtr: the input and output pointers
  • srcOff and dstOff: their corresponding offsets (an offset is a chunk index into the buffer)
  • count: the number of consecutive chunks this instruction executes on
  • depBid / depStep: the thread block IDs and instruction steps this instruction depends on
  • hasDep: a boolean flag indicating whether other instructions depend on this instruction

Pipelining. The outer-most loop in the interpreter is the pipelining loop, shown at Line 10 of Figure 5. As described in Section 6.1, the remote buffers for each P2P connection have a fixed size. Therefore, if the size of a chunk is larger than a remote buffer slot, the chunk is split into multiple tiles such that each tile fits in a slot.

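The tiling itself is just a ceiling division over the slot size; a trivial sketch:

#include <stddef.h>

/* Number of slot-sized tiles a chunk is split into. */
size_t numTiles(size_t chunkBytes, size_t slotBytes) {
    return (chunkBytes + slotBytes - 1) / slotBytes;
}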

Rather than serially processing each tile within a chunk, the interpreter pipelines execution for performance. Consider the hierarchical AllReduce in Figure 1. It starts with an intra-node ReduceScatter, followed by an inter-node ReduceScatter and AllGather, and ends with an intra-node AllGather. If the interpreter serially executed each chunk's tiles, the inter-node communication links would sit idle during the intra-node phases and vice versa (Figure 6). Instead, the interpreter pipelines execution of the tiles by processing tile 1, then tile 2, and so on, so that the inter-node and intra-node links are utilized concurrently.


This part is truly brilliant!!!!!

[Figure 6: serial vs. pipelined tile execution]

Pipelining improves performance by increasing link and SM utilization in the system. Users may configure MSCCLang's tile size for more aggressive pipelining. However, as tile sizes decrease, the performance benefit of pipelining diminishes due to the increased startup cost of executing more sends.


Instruction Loop. The inner-most loop of the interpreter, at Line 12, decodes the instructions in the input MSCCL-IR and executes them in order. A list of switch-case statements at Line 18 decides which instruction to execute.

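Putting the pieces together, the loop nest might be sketched as follows (the opcode names and helper functions are hypothetical; the structure mirrors the pipelining loop at Line 10, the instruction loop at Line 12, and the dispatch at Line 18, reusing the Instruction struct from Figure 5):

/* Hypothetical opcode values and helper prototypes for this sketch. */
enum { OP_SEND, OP_RECV, OP_RECV_REDUCE_COPY };

__device__ void waitForDependencies(const struct Instruction *ins);
__device__ void signalStep(int s);
__device__ void doSend(const struct Instruction *ins, int tile);
__device__ void doRecv(const struct Instruction *ins, int tile);
__device__ void doRecvReduceCopy(const struct Instruction *ins, int tile);

__global__ void interpreter(struct Instruction *insts, int numInsts,
                            int numTiles) {
    for (int tile = 0; tile < numTiles; tile++) {      /* pipelining loop */
        for (int i = 0; i < numInsts; i++) {           /* instruction loop */
            struct Instruction ins = insts[i];
            waitForDependencies(&ins);                 /* cross-block sync */
            switch (ins.opCode) {                      /* dispatch */
            case OP_SEND: doSend(&ins, tile); break;
            case OP_RECV: doRecv(&ins, tile); break;
            case OP_RECV_REDUCE_COPY: doRecvReduceCopy(&ins, tile); break;
            /* ... remaining opcodes ... */
            }
            if (ins.hasDep) signalStep(ins.step);      /* publish progress */
        }
    }
}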

Cross Thread Block Synchronization. Cross thread block synchronization is not natively supported in CUDA. However, the interpreter runs all thread blocks concurrently, which allows thread blocks to synchronize via semaphores stored in global memory. Each thread block has a semaphore (semaphore[bid] in Figure 5) that is initialized to 0. When an instruction's hasDep flag is set (Line 25), a CUDA __syncthreads and a __threadfence are issued to flush the caches, and then the semaphore is set to the running step s (Line 27). If an instruction depends on instructions from other thread blocks, it waits for the semaphores of all those thread blocks to be set (Line 16).

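A sketch of the two sides of this handshake, giving hypothetical bodies to the signalStep and waitForDependencies helpers prototyped in the sketch above (D is the dependency-array bound from Figure 5; the -1 sentinel for unused dependency slots and MAX_BLOCKS are assumptions):

#define MAX_BLOCKS 80   /* hypothetical bound: one block per SM */

/* One global-memory semaphore per thread block, initialized to 0;
 * semaphore[bid] holds the last completed step of block bid. */
__device__ volatile int semaphore[MAX_BLOCKS];

/* Producer side (cf. Lines 25-27 of Figure 5): flush, then publish. */
__device__ void signalStep(int s) {
    __syncthreads();           /* every thread finished the instruction */
    __threadfence();           /* flush writes so other blocks see them */
    if (threadIdx.x == 0)
        semaphore[blockIdx.x] = s;
}

/* Consumer side (cf. Line 16): spin until each dependency's block has
 * reached the required step. */
__device__ void waitForDependencies(const struct Instruction *ins) {
    if (threadIdx.x == 0) {
        for (int d = 0; d < D; d++) {
            if (ins->depBid[d] < 0) break;   /* assumed unused-slot sentinel */
            while (semaphore[ins->depBid[d]] < ins->depStep[d])
                ;                            /* busy-wait on the dependency */
        }
    }
    __syncthreads();           /* release the rest of the block */
}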
