
SCHEDULING MSCCLANG PROGRAMS

After a program is lowered into an Instruction DAG, it is scheduled into a MSCCL-IR program that specifies the program's execution. At a high level, this process assigns every instruction to a thread block that will execute it and every communication edge to a channel identifying the connection through which data is transferred.


MSCCL-IR. MSCCL-IR is a tree data structure (Figure 4) that divides a collective into per-GPU programs, which are subdivided into thread blocks, each containing a list of sequentially executed instructions.

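To make the shape of this tree concrete, here is a minimal Python sketch. The class and field names (Program, Gpu, ThreadBlock, Instruction, send_peer, recv_peer) are illustrative assumptions, not the actual MSCCL-IR schema:

Python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Instruction:
    op: str                # e.g. "send", "recv", "copy", "reduce"
    buffer: str            # buffer the instruction operates on
    index: int             # chunk offset within the buffer
    count: int = 1         # number of contiguous chunks

@dataclass
class ThreadBlock:
    channel: int                       # channel identifying the connection
    send_peer: Optional[int] = None    # at most one GPU this block sends to
    recv_peer: Optional[int] = None    # at most one GPU this block receives from
    instructions: List[Instruction] = field(default_factory=list)  # run sequentially

@dataclass
class Gpu:
    rank: int
    thread_blocks: List[ThreadBlock] = field(default_factory=list)

@dataclass
class Program:             # root of the MSCCL-IR tree
    name: str
    gpus: List[Gpu] = field(default_factory=list)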

Thread blocks can make two uni-directional connections to a send peer and a receive peer that are used for sending and receiving chunks. GPUs are allowed to have multiple redundant connections to the same GPU that are identified by a channel. Channels are similar to tags in MPI that identify the route a send takes.


GPUs are allowed to have multiple redundant connections to the same GPU that are identified by a channel

Thread block connections

Each thread block can establish two uni-directional connections: a send connection and a receive connection.
  • The send connection is used to send chunks to other GPUs.
  • The receive connection is used to receive chunks from other GPUs.

Redundant connections

A GPU may establish multiple redundant connections to the same peer GPU.
  • These redundant connections are distinguished by a "channel".
  • Channels act like tags in MPI: they identify the route a send takes.

Role of channels

Channels manage and distinguish different connection paths, ensuring data travels over the intended route.

A concrete example with two GPUs, GPU A and GPU B: one thread block on GPU A can send chunks over one connection to a thread block on GPU B, while another thread block on GPU A sends chunks over a different connection to a different thread block on GPU B.

These connections are distinguished by channels (like tags) so the data streams are never mixed up.

Our design restricts thread blocks to at most one send and one receive connection so that two thread blocks do not serialize over the same connection. Similarly, a connection can only have one sending and one receiving thread block. The compiler ensures this constraint is honored during scheduling.

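A minimal check of these two constraints, reusing the illustrative Program/Gpu/ThreadBlock classes from the sketch above (the one-send/one-receive restriction per thread block is already structural there, since each ThreadBlock has a single send_peer and a single recv_peer field):

Python
def validate(program: Program) -> None:
    for gpu in program.gpus:
        sending, receiving = set(), set()
        for tb in gpu.thread_blocks:
            # A connection is identified by (peer, channel); at most one
            # thread block per GPU may send or receive on it.
            if tb.send_peer is not None:
                conn = (tb.send_peer, tb.channel)
                assert conn not in sending, f"two thread blocks send on {conn}"
                sending.add(conn)
            if tb.recv_peer is not None:
                conn = (tb.recv_peer, tb.channel)
                assert conn not in receiving, f"two thread blocks receive on {conn}"
                receiving.add(conn)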


DSL Scheduling Directives

Optimizing a program's schedule is crucial for extracting performance. The MSCCLang DSL provides a set of scheduling directives so that users can optionally specify optimizations on top of a program. These optimizations trade off parallelization, achieved by assigning instructions across multiple thread blocks, against occupancy constraints, since each thread block consumes GPU resources.


Channel Directives. A program may use multiple connections between the same pair of GPUs that are differentiated by their channels. Programs can specify that an operation utilizes a particular channel with an optional parameter ch. The code below schedules two copies between the same pair of GPUs on different channels which ensures they can execute in parallel:


Python
c.copy(rank, buf, idx, ch=0)
c.copy(rank, buf, idx, ch=1)

In the hierarchical AllReduce, we manually scheduled intra-node ReduceScatters onto channel 0, the inter-node AllGathers and ReduceScatters onto channel 1, and the intra-node AllGather onto channel 2.


Channel settings:
  • intra-node ReduceScatter - channel_0
  • inter-node ReduceScatter - channel_1
  • inter-node AllGather - channel_1
  • intra-node AllGather - channel_2
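
In DSL terms, that manual schedule might look like the fragment below. ReduceScatter, AllGather, and the ch parameter follow the paper's examples, but the exact signatures here are assumptions for illustration:

Python
for n in range(N):                                 # each node n of N nodes
    local_ranks = [i + n * G for i in range(G)]    # the G GPUs inside node n
    ReduceScatter(local_ranks, 0, N, ch=0)         # intra-node, channel 0
# ... inter-node ReduceScatter and AllGather with ch=1 ...
# ... intra-node AllGather with ch=2 ...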

Chunk Parallelization. An important performance aspect is the amount of parallelism used for transfers. Chunk parallelization is an automatic optimization that breaks up a transfer into multiple smaller transfers that execute in parallel. In the hierarchical AllReduce, we parallelize the intra-node ReduceScatters and AllGathers using the parallelize modifier:


Python
for n in range(N):
    local_ranks = [i+n*G for i in range(G)]
    with parallelize(N):
        ReduceScatter(local_ranks, 0, N)
...

Parallelizing a code fragment by N has the effect of creating N parallel instances of the underlying copy and reduce operations, where each operation operates on 1/N of the data. The compiler duplicates the instruction nodes corresponding to the fragment and ensures that each instance's channels do not intersect so that the instances execute in parallel.

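Conceptually, the duplication pass behaves like the following sketch: each of the N instances covers a 1/N slice of the chunks and is pinned to its own channel so the instances never serialize. The Op fields here are illustrative, not the compiler's actual representation:

Python
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Op:
    kind: str      # "copy" or "reduce"
    index: int     # starting chunk
    count: int     # number of chunks
    channel: int   # connection the transfer uses

def parallelize_fragment(ops: List[Op], n: int) -> List[Op]:
    out = []
    for op in ops:
        assert op.count % n == 0, "chunk count must divide evenly"
        step = op.count // n
        for i in range(n):
            # instance i handles chunks [index + i*step, index + (i+1)*step)
            # on its own channel, so the n instances can run in parallel
            out.append(replace(op, index=op.index + i * step,
                                   count=step,
                                   channel=op.channel + i))
    return out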

There are two advantages of chunk parallelization. First, this enables parallelization of compute heavy aspects of the algorithm such as reductions. Second, parallelization can increase the utilization of high-bandwidth links by allowing multiple thread blocks to simultaneously use the underlying link. Our experience has shown that a single thread block in an NVIDIA A100 GPU is not capable of saturating the bandwidth of its outgoing NVLink. The user should carefully choose the parallelization factor, as increasing it beyond a certain point will reduce performance due to competition for bandwidth.


Aggregation. When multiple chunks are transferred from one GPU to another and these chunks are contiguous, it is sometimes more efficient to aggregate them into a single network transfer. Assuming an α-β communication cost model, each send has a start-up cost α and a per-byte cost β. By aggregating sends, the compiler amortizes the start-up cost. However, aggressive aggregation can slow down a program: all aggregated chunks must be ready before the send is executed, so a single delayed chunk blocks the progress of multiple chunks.

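Under this model, aggregating n sends of B bytes each replaces a cost of n(α + βB) with α + βnB, saving (n-1)α in start-up cost. A quick numeric sketch with made-up constants:

Python
alpha = 5e-6           # per-send start-up cost in seconds (illustrative)
beta = 1e-10           # per-byte cost in seconds (illustrative)
B, n = 1 << 20, 8      # 1 MiB per chunk, 8 contiguous chunks

separate = n * (alpha + beta * B)    # n individual sends
aggregated = alpha + beta * n * B    # one aggregated send
print(separate, aggregated)          # aggregation saves (n - 1) * alpha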

Users specify aggregated sends by passing multi-count chunk references to copy and reduce operations. For example, Line 8 and Line 17 indicate that N chunks should be aggregated into a single send during the intra-node ReduceScatters and AllGathers.

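For instance, a multi-count chunk reference covering N contiguous chunks might be built and copied as below; the chunk helper and its size parameter follow the style of the paper's examples, so treat the exact signature as an assumption:

Python
c = chunk(rank, buf, idx, size=N)   # reference to N contiguous chunks
c.copy(peer, buf, idx)              # compiled into a single aggregated send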

[Figure: hierarchical AllReduce program listing]

Scheduling

Given a program's Instruction DAG and scheduling directives, the compiler assigns every instruction node to a thread block and every remaining communication edge to a channel. This assignment respects the constraint that each thread block can have at most one send and one receive peer. Additionally, for correctness, the assignment must not introduce deadlocks, which are possible due to the sequential execution order of instructions within a thread block.


Channel Assignment. The compiler assigns every communication edge according to the user's scheduling directives. Any remaining communication edges are assigned to the lowest valid channel, with two exceptions. First, communication edges generated from a parallelized code fragment are scheduled onto different channels so that they do not serialize. Second, a series of fused instructions shares the same channel; the compiler ensures this by assigning the lowest channel that satisfies all communication edges in the chain.

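A minimal sketch of this default policy: a chain of fused instructions is treated as one unit, and the lowest channel that is free on every link of the chain is chosen (a plain edge is just a chain of length one; the names here are illustrative):

Python
def lowest_valid_channel(chain, used):
    """chain: list of (src, dst) links covered by one fused instruction
    sequence; used: set of (src, dst, channel) assignments made so far."""
    ch = 0
    while any((src, dst, ch) in used for (src, dst) in chain):
        ch += 1                      # try the next channel up
    for (src, dst) in chain:
        used.add((src, dst, ch))     # reserve the channel on every link
    return ch

Because each assignment reserves its channel in used, edges duplicated from a parallelized fragment over the same pair of GPUs naturally land on distinct channels.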

Thread Block Assignment. The compiler’s thread block assignment policy implements a greedy heuristic that attempts to schedule instructions in the order they will be ready. The high level steps of the routine are as follows:


  1. Assign instruction priority by calculating the depth (maximum hops from a root node) and reverse depth (maximum hops to a leaf node) for every instruction in the Instruction DAG. This prioritizes instructions that are enabled earlier and have more downstream dependencies respectively.

  2. Create thread blocks by scanning through all instructions per GPU and creating thread blocks for every unique (send-peer, receive-peer, channel) tuple.

  3. Sort instructions into a global topological order with respect to their dependencies, using a heap ordered by the priorities from step 1.

  4. Assign each instruction to its matching thread block in the topological order. If an instruction has multiple candidates (e.g. local copies can happen on any thread block), the thread block whose latest assigned instruction is earliest is chosen. (A sketch of the full routine follows this list.)
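
The four steps might be sketched as follows, assuming each DAG node carries illustrative fields rank, send_peer, recv_peer, channel, and deps (the nodes it depends on); this is a sketch of the heuristic, not the compiler's actual code:

Python
import heapq

def assign_thread_blocks(nodes):
    # Step 1: depth (max hops from a root) and reverse depth (max hops to a leaf).
    users = {n: [] for n in nodes}
    for n in nodes:
        for d in n.deps:
            users[d].append(n)
    depth, rdepth = {}, {}
    def get_depth(n):
        if n not in depth:
            depth[n] = 0 if not n.deps else 1 + max(get_depth(d) for d in n.deps)
        return depth[n]
    def get_rdepth(n):
        if n not in rdepth:
            rdepth[n] = 0 if not users[n] else 1 + max(get_rdepth(u) for u in users[n])
        return rdepth[n]

    # Step 2: one thread block per unique (rank, send_peer, recv_peer, channel).
    blocks = {}
    for n in nodes:
        blocks.setdefault((n.rank, n.send_peer, n.recv_peer, n.channel), [])

    # Step 3: a heap pops ready instructions in priority order: enabled
    # earlier (small depth) first, more downstream work (large rdepth) first.
    indeg = {n: len(n.deps) for n in nodes}
    ready = [(get_depth(n), -get_rdepth(n), id(n), n) for n in nodes if not n.deps]
    heapq.heapify(ready)

    # Step 4: assign in topological order; among candidate blocks pick the
    # one whose latest assigned instruction is earliest.
    last, step = {b: -1 for b in blocks}, 0
    while ready:
        _, _, _, n = heapq.heappop(ready)
        candidates = [b for b in blocks if b[0] == n.rank
                      and (n.send_peer is None or b[1] == n.send_peer)
                      and (n.recv_peer is None or b[2] == n.recv_peer)
                      and (n.channel is None or b[3] == n.channel)]
        best = min(candidates, key=lambda b: last[b])
        blocks[best].append(n)
        last[best] = step
        step += 1
        for u in users[n]:            # release instructions this one enabled
            indeg[u] -= 1
            if indeg[u] == 0:
                heapq.heappush(ready, (get_depth(u), -get_rdepth(u), id(u), u))
    return blocks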

An Instruction DAG is guaranteed to have a global topological order because it was generated by sequentially tracing a MSCCLang program. By assigning instructions to thread blocks in a topological order that respects communication and processing edges, the implicit dependencies introduced by sequential execution within thread blocks cannot produce cycles, so the MSCCL-IR is free of deadlocks.


No Deadlock!

Global topological order of the Instruction DAG

Nodes in the DAG represent instructions; edges represent dependencies between instructions.

Because a MSCCLang program is traced sequentially, the dependencies between instructions preserve program order, so the DAG has no cycles.

This guarantees that the graph's nodes have a global topological order: processing can start from the nodes with no predecessors (the roots) and proceed layer by layer until every node has been handled.

Communication and processing edges

Communication edges represent instruction dependencies across thread blocks, e.g. between send and receive instructions.

Processing edges represent instruction dependencies within a rank, i.e. the order in which instructions execute.

During instruction assignment, these dependencies must be preserved so that every instruction executes after the instructions it depends on.

Avoiding deadlock

By assigning instructions in topological order, all implicit dependencies are satisfied: every instruction is assigned and executed after its predecessors.

Because the DAG is acyclic, the sequential execution of instructions within a thread block introduces no new cyclic dependencies, so no deadlock can arise.
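One way to see the guarantee: take the DAG's communication and processing edges, add the implicit edge between each pair of consecutive instructions in every thread block, and check that the combined graph is still acyclic. A minimal cycle check using Kahn's algorithm; any cycle found here would be a scheduling-induced deadlock:

Python
from collections import deque

def has_deadlock(num_instrs, edges):
    """edges: (u, v) pairs meaning instruction u must run before v,
    combining explicit DAG edges with the implicit sequential edges
    inside each thread block."""
    succ = [[] for _ in range(num_instrs)]
    indeg = [0] * num_instrs
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(i for i in range(num_instrs) if indeg[i] == 0)
    done = 0
    while queue:
        u = queue.popleft()
        done += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return done != num_instrs    # leftover instructions lie on a cycle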

Cross Thread Block Synchronization Insertion. An instruction's dependencies are captured in the Instruction DAG as communication edges between sends and receives, and processing edges that indicate execution order within a rank. The MSCCL-IR program must respect this order to be correct and data-race free. While communication edges synchronize implicitly, because a receive blocks until the matching send, certain processing edges need explicit synchronization.


Instructions within a thread block are executed sequentially, and thus any processing edges between them are already satisfied. However, instructions across thread blocks execute out of order. Processing edges between different thread blocks are explicitly preserved in the MSCCL-IR file as cross thread block dependencies. Instructions with cross thread block dependencies have a dep modifier which identifies the instruction(s) that must execute before them.


cross thread block dependencies

Cross thread block synchronization insertion

The execution order of instructions in different thread blocks is nondeterministic. To guarantee that instructions execute in the correct order, the compiler must explicitly preserve these dependencies.

In the MSCCL-IR file this is expressed as cross thread block dependencies: an instruction with such a dependency carries a dep modifier identifying the instruction(s) that must execute first.

In detail:

  1. Sequential execution within a thread block: instructions in the same thread block execute in order, so the dependencies (processing edges) between them are already satisfied; no extra synchronization is needed.

  2. Out-of-order execution across thread blocks: the execution order of instructions in different thread blocks is nondeterministic. To guarantee correct data flow and prevent data races, cross thread block dependencies must be handled explicitly.

  3. Cross thread block dependencies: processing edges between different thread blocks are preserved in the MSCCL-IR file as cross thread block dependencies and enforced through explicit synchronization.

The dep modifier

An instruction with cross thread block dependencies carries a dep modifier that identifies the instructions that must execute before it. This maintains the correct execution order even across thread blocks, avoiding data races and deadlocks.
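
As an illustration only (the real MSCCL runtime synchronizes on the GPU, not with Python threads), a dep can be modeled as an event that the dependent thread block waits on before executing:

Python
import threading

def make_runner(instructions):
    # one completion event per instruction id
    done = {i["id"]: threading.Event() for i in instructions}
    def run(instr):
        for d in instr.get("dep", []):   # cross thread block dependencies
            done[d].wait()               # block until that instruction ran
        # ... perform the copy / reduce / send / recv here ...
        done[instr["id"]].set()          # signal completion to dependents
    return run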