Two-tree algorithms for full bandwidth broadcast, reduction and scan¶

Note

Abstract¶

We present a new, simple algorithmic idea for the collective communication operations broadcast, reduction, and scan (prefix sums). The algorithms concurrently communicate over two binary trees which both span the entire network. By careful layout and communication scheduling, each tree communicates as efficiently as a single tree with exclusive use of the network. Our algorithms thus achieve up to twice the bandwidth of most previous algorithms. In particular, our approach beats all previous algorithms for reduction and scan. Experiments on clusters with Myrinet and InfiniBand interconnect show significant reductions in running time for all three operations sometimes even close to the best possible factor of two.

我们提出了一种用于集体通信操作（广播、规约和扫描（前缀和））的新型简单算法。该算法在两棵覆盖整个网络的二叉树上同时进行通信。通过精心的布局和通信调度，每棵树在专用网络上的通信效率与单棵树相当。因此，我们的算法在带宽上比大多数先前的算法提高了两倍。特别是，我们的方法在规约和扫描操作上优于所有先前的算法。在使用Myrinet和InﬁniBand互连的集群上进行的实验显示，所有三种操作的运行时间均显著减少，有时甚至接近最佳可能的两倍改进。

Keywords¶

Message-passing parallel programming
Broadcast
Reduction
Parallel prefix (scan)
Bipartite-edge coloring (二分边着色)

Introduction¶

Parallel programs for distributed memory machines can be structured as sequential computations plus calls to a small number of (collective) communication primitives. Part of the success of communication libraries such as MPI (the Message Passing Interface, see [20]) is that this approach gives a clear division of labor: the programmer thinks in terms of these primitives and the library is responsible for implementing them efficiently. Hence, there has been intensive research on optimal algorithms especially for basic, collective communication primitives that involve sets of processors [1,3,4,10,17,19,21,23,24]. Constant factors matter eminently. This paper is concerned with the somewhat surprising observation that for a standard cost model and three of the most important primitives – broadcast, 1 reduction 2 (accumulate), and scan 3 (prefix sum) – a simple algorithmic approach closes the gap between previously known approaches and the lower bound of the execution time given by the (bidirectional) communication bandwidth of the machine. For broadcast this has previously been achieved also by other algorithms [2,9,12,23], but these are typically more complicated and/or do not extend to the reduction and parallel prefix operations. We believe the results achieved for reduction and parallel prefix to be the theoretically currently best known. The algorithms are simple to implement, and have been implemented within the framework of NEC proprietary MPI libraries [16].

并行程序在分布式内存机器上的结构可以是顺序计算加上少量（集体）通信原语的调用。像MPI（消息传递接口，参见[20]）这样的通信库取得成功的部分原因在于这种方法提供了明确的分工：程序员以这些原语为思考对象，而库则负责高效地实现它们。因此，针对涉及一组处理器的基本集体通信原语，已经进行了大量关于最优算法的研究[1,3,4,10,17,19,21,23,24]。常数因子极其重要。本文关注一个有些令人惊讶的观察结果，即对于标准成本模型和三种最重要的原语 —— 广播、规约（累加）和扫描（前缀和）—— 一种简单的算法方法缩小了已知方法和机器的（双向）通信带宽所给出的执行时间下限之间的差距。对于广播，这已通过其他算法实现[2,9,12,23]，但这些算法通常更复杂和/或不适用于规约和并行前缀操作。我们认为对于规约和并行前缀所取得的结果在理论上是目前已知的最优结果。这些算法简单易实现，并已在NEC专有的MPI库框架内实现[16]。

In Section 2 we review related work and basic ideas relevant for the subsequent discussion. In particular, our algorithms adopt the widely used approach to split the message into k packets and sending them along a binary tree in a pipelined fashion. The central new idea behind our 2Tree algorithms is to use two such binary trees at once, each handling half of the total message. Section 3 explains how these trees are constructed and how communication can be scheduled so that the both trees can work at the same bandwidth one would get with a single tree with complete use of the network. This is possible because we can pair leaves of one tree with interior nodes of the other tree. When a processing element (PE) sends to its ‘second’ child in one tree, it can simultaneously receive from its parent in the other tree. Note that this covers the entire communication requirement of a leaf. Scheduling the communication optimally is possible by modeling the communication requirements as a bipartite graph and amounts to solving a bipartite edge-coloring problem.

在第二节中，我们回顾了相关工作和与后续讨论相关的基本思想。特别地，我们的算法采用了将消息分割成k个数据包并以流水线方式沿着二叉树发送的广泛使用的方法。我们的2Tree算法背后的核心新思想是同时使用两棵这样的二叉树，每棵树处理一半的总消息量。第三节解释了如何构建这些树以及如何调度通信，使得这两棵树可以以单棵树完全利用网络时的带宽工作。这是可能的，因为我们可以将一棵树的叶节点与另一棵树的内部节点配对。当一个处理单元（PE）在一棵树上向其“第二”子节点发送时，它可以同时从另一棵树上的父节点接收。注意，这覆盖了叶节点的全部通信需求。通过将通信需求建模为二部图并解决一个二部边着色问题，可以实现通信的最优调度。

Section 4 adapts the 2Tree idea to three different collective communication operations. For a broadcast, the PE sending the message is not directly integrated into the two trees but alternates between sending packets to each of the roots of the two trees. Reduction is basically a broadcast with inverted direction of communication (plus the appropriate arithmetical operations on the data). For scanning, both trees (which have different root PEs) work independently. Otherwise, the necessary communications resemble a reduction followed by a broadcast.

第四节将2Tree思想应用于三种不同的集体通信操作。对于广播，发送消息的处理单元（PE）没有直接集成到两棵树中，而是交替地将数据包发送到两棵树的根节点。规约基本上是通信方向相反的广播（加上适当的数据算术操作）。对于扫描，两棵树（具有不同的根PE）独立工作。除此之外，必要的通信类似于规约后再进行广播。

Outline

Section 2: basic ideas and related work
Section 3: construction and communication scheduling
Section 4: adaptation to different collective communication operations
- broadcast
- reduction
- scanning
- necessary communications
Section 5: important refinements of 2Tree idea
Section 6: implementation results
Section 7: summary and future research

Important refinements of the 2Tree idea are discussed in Section 5. Perhaps the most surprising result is a simple yet nontrivial algorithm for computing the communication schedule of each PE in \(O(\log p)\) steps without any communication between the PEs. We also outline how to adapt the 2Tree algorithms to the simplex model and to networks with small bisection bandwidth. In Section 6 we report implementation results which indicate that our new algorithms outperform all previous algorithms in some situations. Section 7 summarizes the results and outlines possible future research.

第五节讨论了2Tree思想的重要改进。或许最令人惊讶的结果是一个简单但非平凡的算法，该算法可以在没有处理单元（PE）之间通信的情况下，以\(O(\log p)\)步骤计算每个PE的通信调度。我们还概述了如何将2Tree算法适应于单工模型和具有小截面带宽的网络。在第六节中，我们报告了实现结果，这些结果表明在某些情况下，我们的新算法优于所有先前的算法。第七节总结了结果并概述了可能的未来研究方向。

Adapt 2Tree algorithms to simplex model

在计算机网络和通信领域，“单工模型”指的是一种通信模式，其中数据传输是单向的，即信息只能从发送方传送到接收方，接收方不能回复发送方。

例如，电视广播和无线电广播就是单工通信的典型例子，信号只能从广播塔传送到接收设备，而接收设备不能发送任何信息回去。

在上下文中，将2Tree算法适应于单工模型可能意味着要考虑单向通信的限制，并相应地调整算法的设计和实现，以确保在这种通信环境下仍能高效工作。