LOWERING MSCCLANG PROGRAMS
This section explains how MSCCLang’s compiler lowers programs into instructions by first tracing them into a directed acyclic graph (DAG) of operations, which we call the Chunk DAG, and then further lowering into an Instruction DAG. Section 5 discusses how the compiler schedules these instructions into code targeting the low-level MSCCL-IR for our runtime, as well as the optimization interface the DSL provides for controlling scheduling decisions.
Tracing
The compiler traces a program by sequential execution into a Chunk DAG, which captures the global view of chunk movement and naturally exposes the program’s parallelism. The graph includes source nodes for all input chunks. Every copy and reduce operation is also a node, and the edges between nodes are dependencies between operations that arise from chunk movement (true dependencies) and reusing buffer indices (false dependencies).
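The tracing step described above can be illustrated with a minimal sketch. It is not the MSCCLang implementation: the names `ChunkDag`, `Op`, `copy`, and `reduce` are hypothetical, a buffer slot is assumed to be a `(rank, buffer, index)` tuple, and a reduce is assumed to read both its source and its destination. An edge is labelled "true" when an operation reads a slot that an earlier operation wrote, and "false" when it merely overwrites a slot that an earlier operation read or wrote.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Hypothetical model for illustration only; not the MSCCLang API.
Slot = Tuple[int, str, int]  # (rank, buffer name, chunk index)


@dataclass
class Op:
    kind: str                 # "start" (input chunk), "copy", or "reduce"
    src: Optional[Slot]
    dst: Slot
    deps: dict = field(default_factory=dict)  # predecessor op id -> "true" | "false"


class ChunkDag:
    def __init__(self, input_slots):
        self.ops = []
        self.last_writer = {}  # slot -> id of the op that last wrote it
        self.readers = {}      # slot -> ids of ops that read it since that write
        for slot in input_slots:
            self._add(Op("start", None, slot))  # source node per input chunk

    def _add(self, op):
        op_id = len(self.ops)
        reads = [] if op.src is None else [op.src]
        if op.kind == "reduce":
            reads.append(op.dst)  # assumed: a reduce also reads its destination
        # False dependencies: this op reuses (overwrites) the destination slot.
        for prev in self.readers.get(op.dst, []):
            op.deps[prev] = "false"
        if op.dst in self.last_writer:
            op.deps[self.last_writer[op.dst]] = "false"
        # True dependencies: this op consumes a chunk produced earlier.
        # A true dependency overrides a false one on the same predecessor.
        for slot in reads:
            if slot in self.last_writer:
                op.deps[self.last_writer[slot]] = "true"
        # Update bookkeeping for later operations.
        for slot in reads:
            self.readers.setdefault(slot, []).append(op_id)
        self.last_writer[op.dst] = op_id
        self.readers[op.dst] = []
        self.ops.append(op)
        return op_id

    def copy(self, src, dst):
        return self._add(Op("copy", src, dst))

    def reduce(self, src, dst):
        return self._add(Op("reduce", src, dst))


# Two input chunks on rank 0; the second copy reuses the slot the first one read.
dag = ChunkDag([(0, "in", 0), (0, "in", 1)])
first = dag.copy((0, "in", 0), (1, "scratch", 0))
second = dag.copy((0, "in", 1), (0, "in", 0))
```

Here `second` must wait for `first` only because it reuses the buffer index `(0, "in", 0)`, which is exactly the kind of false dependency the Chunk DAG records alongside the true ones.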
Figure 4 depicts a subset of the Chunk DAG of Figure 3 that traces chunk 0 across every rank. The Chunk DAG preserves the hierarchical structure of the program, with the first two levels of reduces corresponding to the intra-node ReduceScatter and the last reduce corresponding to the inter-node ReduceScatter.
Instruction Generation
The compiler expands each chunk operation node into instruction nodes to generate the Instruction DAG. Instructions are either point-to-point communication primitives or local primitives that are executed by a single GPU. The instructions are listed below:
- send(buffer, index) / recv(buffer, index): sends the chunk at the given buffer index to the remote GPU, or receives a chunk from the remote GPU into that location.
- reduce(srcBuf, srcInd, dstBuf, dstInd): locally applies a pre-defined reduction operation to the corresponding chunks and stores the result in the destination.
- copy(srcBuf, srcInd, dstBuf, dstInd): performs a local copy of a chunk from a source location to a destination.
- recvReduceCopy(srcBuf, srcInd, dstBuf, dstInd): a fused instruction that receives a chunk, reduces it with a source chunk, and locally copies the result to the destination. Abbreviated as rrc.
- recvReduceCopySend, recvReduceSend, recvCopySend: additional fused instructions that perform a receive, a send, and an optional reduction of a chunk. Abbreviated as rrcs, rrs, and rcs respectively.
The fused instructions can be implemented by composing send, recv, reduce, and copy instructions. However, fused implementations can optimize away global memory accesses as intermediate values are transferred through GPU registers.
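To make the equivalence between a fused instruction and its composition concrete, here is a small host-side model. It is a sketch under stated assumptions, not runtime code: buffers are modelled as a dict of lists, `wire` stands in for the data arriving from the remote GPU, the reduction is assumed to be an element-wise sum, and the function names `rrc_composed` and `rrc_fused` are hypothetical. The composed version writes every intermediate value to a scratch buffer (standing in for global memory), while the fused version keeps it in a local value (standing in for registers).

```python
# Hypothetical model: bufs maps a buffer name to a list of chunks,
# and each chunk is a list of floats.

def rrc_composed(bufs, wire, src, dst, scratch=("scratch", 0)):
    """recvReduceCopy built from three base instructions: recv, reduce, copy."""
    sb, si = scratch
    b, i = src
    db, di = dst
    # recv(scratch): store the incoming chunk to memory.
    bufs[sb][si] = list(wire)
    # reduce(src -> scratch): load both chunks, store the sum back.
    bufs[sb][si] = [x + y for x, y in zip(bufs[b][i], bufs[sb][si])]
    # copy(scratch -> dst): load the result and store it again.
    bufs[db][di] = list(bufs[sb][si])


def rrc_fused(bufs, wire, src, dst):
    """Fused recvReduceCopy: the received values never touch a scratch buffer."""
    b, i = src
    db, di = dst
    bufs[db][di] = [x + y for x, y in zip(bufs[b][i], wire)]


def make_bufs():
    return {"in": [[1.0, 2.0]], "out": [[0.0, 0.0]], "scratch": [[0.0, 0.0]]}


composed, fused = make_bufs(), make_bufs()
rrc_composed(composed, [10.0, 20.0], ("in", 0), ("out", 0))
rrc_fused(fused, [10.0, 20.0], ("in", 0), ("out", 0))
```

Both calls leave the same chunk in `out`; only the fused variant leaves the scratch buffer untouched, which mirrors the memory traffic that fusion avoids.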
The compiler expands chunk operations differently depending on whether they are local or remote. A remote copy expands into a send and a receive instruction, and a remote reduce expands into a send and a receiveReduceCopy instruction. For local copy or reduction operations, MSCCLang generates only a single local instruction. Note that instructions such as receiveCopySend cannot be generated this way, because they require looking at two chunk operations.
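Local and remote operations
In MSCCLang, a chunk operation is either local or remote. A local operation takes place on a single GPU, whereas a remote operation involves communication between GPUs.
Remote copy
When a chunk must be copied from one GPU to another, the compiler expands the operation into two instructions:
- send: sends the data from the source GPU.
- recv: receives the data on the destination GPU.
Remote reduce
When a reduction must be performed across GPUs (for example, summing the data of two chunks), the compiler expands the operation into two instructions:
- send: sends the data from the source GPU.
- recvReduceCopy: receives the data on the destination GPU, reduces it, and stores the result in the destination location.
Local operations
A local copy or reduction runs on a single GPU and involves no cross-GPU communication, so the compiler generates only one local instruction.
That instruction is either a plain copy or a reduce.
Special case
Some fused instructions (such as receiveCopySend) span more than one chunk operation. They cannot be produced by the expansion above, because generating them requires the compiler to analyze several chunk operations together.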
The compiler connects the two instructions resulting from a remote operation by a communication edge that indicates that the receiving side synchronizes with the sender. It also preserves the original edges of the Chunk DAG as processing edges, which represent the execution-order dependencies within ranks.
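The expansion rules above can be sketched as a single function. This is an illustrative model rather than the compiler's code: `ChunkOp`, `Instr`, and `expand` are hypothetical names, and only communication edges are modelled (the processing edges inherited from the Chunk DAG are left out for brevity).

```python
from dataclasses import dataclass


# Hypothetical types for illustration; not the MSCCLang compiler's data structures.
@dataclass(frozen=True)
class ChunkOp:
    kind: str       # "copy" or "reduce"
    src_rank: int
    dst_rank: int


@dataclass(frozen=True)
class Instr:
    rank: int       # GPU that executes the instruction
    name: str       # "send", "recv", "recvReduceCopy", "copy", or "reduce"


def expand(op):
    """Return (instructions, communication_edges) for one chunk operation."""
    if op.src_rank == op.dst_rank:
        # Local operation: a single local instruction, no communication edge.
        return [Instr(op.dst_rank, op.kind)], []
    # Remote operation: a send on the source paired with a receive-side instruction.
    send = Instr(op.src_rank, "send")
    recv_name = "recv" if op.kind == "copy" else "recvReduceCopy"
    recv = Instr(op.dst_rank, recv_name)
    # The communication edge records that the receiver synchronizes with the sender.
    return [send, recv], [(send, recv)]


remote_reduce = expand(ChunkOp("reduce", src_rank=0, dst_rank=1))
local_copy = expand(ChunkOp("copy", src_rank=2, dst_rank=2))
```

Note that no branch of `expand` can emit a receiveCopySend, since each call sees only one chunk operation; that instruction needs a later pass that looks at two of them.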
Instruction Fusion
The initial instruction generation pass only uses a subset of the available instructions and excludes the fused instructions that combine a receive and a send. The compiler performs a series of peephole optimizations to combine consecutive base instructions into fused instructions.
rcs. Rewrites a back-to-back receive and send on the same chunk into a fused receiveCopySend. If there are multiple sends dependent on the receive, the send on the longest path in the Instruction DAG is fused.
rrcs. Rewrites a back-to-back receiveReduceCopy and a send on the same chunk into a receiveReduceCopySend.
rrs. A special case of the previous optimization: if the result of the rrc is never used locally (i.e. it is later overwritten), the reduction result does not need to be saved locally and a more efficient receiveReduceSend instruction is used instead.
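Handling receive and send operations efficiently
When these operations appear back to back in a program:
rcs (receiveCopySend)
Original operations: a receive that is immediately followed by a send on the same chunk is combined with that send.
- Before: two separate instructions, a receive followed by a send.
- After: the compiler fuses them into a single receiveCopySend instruction, removing the intermediate step.
Special case: if several sends depend on the receive, the compiler fuses the send that lies on the longest path in the Instruction DAG, to get the most out of the optimization.
rrcs (receiveReduceCopySend)
Original operations: a receiveReduceCopy that is immediately followed by a send on the same chunk is combined with that send.
- Before: two separate instructions, a receiveReduceCopy (which itself receives, reduces, and copies) followed by a send.
- After: the compiler fuses them into a single receiveReduceCopySend instruction, removing intermediate steps and memory accesses.
rrs (receiveReduceSend)
- Special case: when the result of the receiveReduceCopy is never used locally (for example, it is later overwritten), the compiler uses the more efficient receiveReduceSend instruction.
- Before: two separate instructions, a receiveReduceCopy followed by a send.
- After: the compiler fuses them into a single receiveReduceSend instruction, avoiding the unnecessary local store and intermediate steps.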
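The three rewrites can be sketched as one peephole step over a toy Instruction DAG. This is a simplified illustration, not the compiler's pass: `Node`, `longest_path`, and `fuse` are hypothetical names, dependencies are stored as successor lists, and whether an rrc result is used locally is assumed to be supplied as a precomputed flag rather than derived by analysis.

```python
from dataclasses import dataclass, field


# Hypothetical node type for illustration; eq=False keeps identity-based comparison.
@dataclass(eq=False)
class Node:
    name: str                                   # "recv", "rrc", "send", ...
    chunk: tuple                                # (buffer, index) the instruction touches
    succs: list = field(default_factory=list)   # instructions that depend on this one
    used_locally: bool = True                   # assumed precomputed; matters only for rrc


def longest_path(node):
    """Number of nodes on the longest dependency path starting at node."""
    return 1 + max((longest_path(s) for s in node.succs), default=0)


FUSED_NAME = {"recv": "rcs", "rrc": "rrcs"}


def fuse(node):
    """Fuse a recv or rrc with one dependent send on the same chunk, if any."""
    if node.name not in FUSED_NAME:
        return node
    sends = [s for s in node.succs if s.name == "send" and s.chunk == node.chunk]
    if not sends:
        return node
    # With several candidate sends, pick the one on the longest path.
    best = max(sends, key=longest_path)
    if node.name == "rrc" and not node.used_locally:
        node.name = "rrs"    # the reduced result need not be stored locally
    else:
        node.name = FUSED_NAME[node.name]
    # The fused node absorbs the send and inherits its outgoing edges.
    node.succs = [s for s in node.succs if s is not best] + best.succs
    return node


chunk = ("in", 0)
short_recv = Node("recv", chunk)
long_recv = Node("recv", chunk, succs=[Node("copy", chunk)])
short_send = Node("send", chunk, succs=[short_recv])
long_send = Node("send", chunk, succs=[long_recv])
recv = fuse(Node("recv", chunk, succs=[short_send, long_send]))
unused = fuse(Node("rrc", chunk, succs=[Node("send", chunk)], used_locally=False))
```

In this example `recv` becomes an rcs that absorbs `long_send`, while `short_send` stays a separate instruction that still depends on it; `unused` becomes an rrs because its flag says the reduced chunk is never read locally.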
Figure 4 depicts the hierarchical AllReduce Instruction DAG for chunk 0 up to the inter-node ReduceScatter. The compiler has expanded each operation node into two instruction nodes, with communication edges connecting each matching send and receive. Highlighted in green is a back-to-back rrc and send that are fused into an rrs instruction.