Network Topologies for Collectives¶

In MPI, the Communicator construct is an abstraction that hides the complexity of lower-level communication between processes. This makes programming much more convenient. However, the physical network topology, i.e., not just the virtual topology associated with MPI Communicators, chosen can heavily impact the performance of collective communications. This is true for traditional collective applications, and there are many studies focused on designing network topology-aware collective algorithms to best take advantage of different network architectures [22]. The network topology is especially important when considering hardware-accelerated collectives because poor architecture decisions have the potential to wipe out the performance gains realized by using accelerator-specific communications in communication-bound applications [23].

在MPI中，通信器构造是一个抽象，隐藏了进程之间低层通信的复杂性。这使得编程更加方便。然而，所选择的物理网络拓扑（即不仅仅是与MPI通信器相关的虚拟拓扑）会严重影响集体通信的性能。这对于传统的集体应用程序也是如此，并且有许多研究专注于设计网络拓扑感知的集体算法，以最大程度地利用不同的网络架构 [22]。在考虑硬件加速的集体通信时，网络拓扑尤其重要，因为糟糕的架构决策可能会消除通过使用特定于加速器的通信在通信密集型应用中实现的性能提升 [23]。

Hypercube¶

The hypercube network topology [24] consists of a set of nodes connected in a multi-dimensional cube. Increasing the number of nodes in the system will increase the dimensionality of the hypercube. Complete hypercubes must contain exactly \(2^k\) nodes, where \(k\) is the hypercube dimensionality. As a comparison, incomplete hypercubes can have any number of nodes [25]. The simplest hypercube network is two nodes connected by a single link and is described as a 1D (one-dimensional) hypercube. A 2D hypercube is four nodes connected in a square pattern. Figure 3 shows an example of a 4D hypercube containing 16 nodes and is, as a consequence, complete.

超立方体网络拓扑 [24] 由一组在多维立方体中连接的节点组成。增加系统中的节点数量将增加超立方体的维度。完整的超立方体必须包含恰好 \(2^k\) 个节点，其中 \(k\) 是超立方体的维度。相比之下，不完整的超立方体可以有任意数量的节点 [25]。最简单的超立方体网络是两个通过单一链接连接的节点，被描述为一维 (1D) 超立方体。二维 (2D) 超立方体是四个以方形排列连接的节点。图3展示了一个包含16个节点的四维 (4D) 超立方体的例子，因此，它是完整的。

alt text

While hypercube topologies may not be deployed as frequently as other topologies in high-performance computing (HPC) applications in favor of architectures such as Fat-Tree (see Subsection 3.4) and Dragonfly+ (see Subsection 3.5), it serves as an important reference that other more recent topologies can be compared against. Hypercube topologies are resistant to node failures due to their high connectivity; however, the same high connectivity can result in scaling issues and higher complexity for networks with larger numbers of nodes [25]. They are also unique in that incomplete hypercube topologies can exhibit different performance characteristics than complete hypercube topologies [24, 25].

尽管在高性能计算 (HPC) 应用中，超立方体拓扑可能不如其他拓扑（如Fat-Tree（见3.4节）和Dragonfly+（见3.5节））那样频繁部署，但它作为其他更新拓扑的一个重要参考进行比较。超立方体拓扑由于其高连通性，对节点故障具有很强的抵抗力；然而，同样的高连通性也会导致具有大量节点的网络出现扩展问题和更高的复杂性 [25]。此外，不完整的超立方体拓扑表现出的性能特征也与完整的超立方体拓扑不同 [24, 25]。

Note

Hypercube topologies are resistant to node failures due to their high connectivity.

在超立方体拓扑中，每个节点都与多个其他节点相连。这种高连通性意味着即使一个节点故障，数据仍然可以通过许多其他路径传输。因此，即使某些节点出现故障，网络仍然可以继续运行

High connectivity can result in scaling issues and higher complexity for networks with larger numbers of nodes.

随着超立方体中节点数量的增加，连接数也呈指数增长。这种高连通性使得管理和扩展网络变得困难，因为处理增加的连接数变得更加复杂。此外，维护这些连接所需的资源也会变得相当可观，给可扩展性带来挑战

They are also unique in that incomplete hypercube topologies can exhibit different performance characteristics than complete hypercube topologies.

不完整的超立方体拓扑不具备完整超立方体拓扑的全部 \(2^k\) 个节点。正因为如此，网络行为和性能可能与完整超立方体有所不同。例如，数据传输路径、冗余性和容错性可能会有所不同，从而导致网络在不同条件下的运行效率不同

数据传输路径：在不完整的超立方体中，某些路径可能比完整的超立方体更长或更短，导致数据传输时间不同
冗余性：完整超立方体具有高度冗余性，即使多个节点故障，仍能保持高效运行。而不完整的超立方体由于节点较少，冗余性降低，可能在节点故障时性能下降
容错性：完整超立方体的高连通性提供了更好的容错能力，不完整超立方体在某些节点故障时可能无法提供同样的容错性能，从而影响网络的整体可靠性

Ring (圆环)¶

A ring topology [26] is a configuration where all the members are connected in a conceptually circular fashion. Hence, each member has two connections: one to each of its immediate neighbors. To communicate, packets of data are transmitted from one device to the next until reaching the destination. There are two transfer modes: the unidirectional ring network, in which packets of data travel in only one direction, and the bidirectional ring network, in which packets may travel in either direction.

环形拓扑 [26] 是一种配置，其中所有成员以概念上的圆形方式连接。因此，每个成员都有两个连接：一个连接到每个直接邻居。为了通信，数据包从一个设备传输到下一个设备，直到到达目的地。有两种传输模式：单向环形网络，其中数据包仅在一个方向上传输，以及双向环形网络，其中数据包可以在任一方向上传输。

The ring topology has three advantages. First, in both transmission modes of the ring topology, all data flow in only one direction, thus reducing packet collisions. Second, devices can be added easily without affecting the transmission speed. Finally, there is no need for a central server to coordinate network connectivity.

环形拓扑有三个优点。首先，在环形拓扑的两种传输模式中，所有数据仅朝一个方向流动，从而减少了数据包碰撞。其次，设备可以轻松添加，而不会影响传输速度。最后，不需要中央服务器来协调网络连接。

At the same time, the ring topology possesses notable disadvantages. First, in the worst case, data transmitted through the network must pass through all devices, which makes data transmission less flexible. Second, the entire topology will be affected if one machine experiences a failure. Also, the channel utilization of the ring topology is inefficient for short packets, and bandwidth fragmentation may occur.

同时，环形拓扑也有显著的缺点。首先，在最坏的情况下，通过网络传输的数据必须经过所有设备，这使得数据传输的灵活性降低。其次，如果一台机器发生故障，整个拓扑结构都会受到影响。此外，环形拓扑对短数据包的信道利用率低，可能会发生带宽碎片化。

alt text

Torus (环面)¶

A torus topology [27] is a generalization of the ring topology to higher dimensions, where the ring topology is viewed as a one-dimensional (1D) torus. In a 1D torus, as mentioned in Subsection 3.2, each member is connected to its two neighbors. Communication can occur bidirectionally (−x, +x).

环面拓扑 [27] 是对环形拓扑的高维推广，其中环形拓扑被视为一维 (1D) 环面。在一维环面中，如3.2节所述，每个成员连接到它的两个邻居。通信可以在双向（−x, +x）进行。

In a 2D torus configuration, there are two dimensions consisting of \(m\) rows and \(n\) columns. Each member in the topology is connected to its four immediate neighbors, and communication can occur in four directions (−x, +x, −y, +y).

在二维环面配置中，有两个维度，分别由 \(m\) 行和 \(n\) 列组成。拓扑中的每个成员都连接到其四个直接邻居，通信可以在四个方向（−x, +x, −y, +y）进行。

Similarly, in an N-dimensional torus topology, each member is connected to its \(2N\) neighbors, with communication possible in \(2^N\) directions, as each member node will have two neighbors in each dimension. Some examples of torus topologies of different dimensions are shown in Figure 4 (1D) and Figure 5 (2D).

类似地，在N维环面拓扑中，每个成员连接到其 \(2N\) 个邻居，通信可以在 \(2^N\) 个方向上进行，因为每个成员节点在每个维度中都有两个邻居。不同维度的环面拓扑示例见图4（1D）和图5（2D）。

alt text

One advantage of the torus topology is that it significantly decreases the topology diameter and reduces the cost of adding new members. All one must do is add additional links [28]. The torus topology can also provide higher bandwidth and lower latency than some other network topologies while still achieving high scalability. This is because the torus topology is homogeneous, and member nodes can communicate with one another via multiple routes [29]. For the same reason, it is also able to consume less power.

环面拓扑的一个优点是显著减少了拓扑直径，并降低了添加新成员的成本。只需添加额外的连接即可 [28]。环面拓扑还可以提供比一些其他网络拓扑更高的带宽和更低的延迟，同时仍然实现高可扩展性。这是因为环面拓扑是同质的，成员节点可以通过多个路径进行通信 [29]。由于相同的原因，它也能消耗更少的电力。

Note

(1) 减少拓扑直径

拓扑直径是 网络中任意两个节点之间的最长最短路径的长度 。也就是说，从一个节点到另一个节点，最短路径经过的节点数。

在环面拓扑中，每个节点都有两个维度上的连接（例如，在二维环面中，每个节点与其上、下、左、右的四个邻居相连）。这种结构使得节点之间的连接更加紧密，从而降低了网络的直径。

示例：在一维环面中，假设有10个节点，最远的两个节点之间的距离最多为5（环形结构允许数据从一个方向绕回）。在二维环面中，节点之间的最长距离被进一步缩短，因为每个节点可以通过更多的路径到达其他节点。

(2) 降低添加新成员的成本

在环面拓扑中，添加新节点通常只需要新增几条连接线，而不需要重构整个网络。

例如：

一维环面：在原有的环形网络上添加一个新节点时，只需将新节点与两个相邻的旧节点连接即可。
二维环面：添加新节点时，可以将其连接到周围的四个节点，这样不会对现有网络的其他部分产生重大影响。

The torus topology also has certain limitations. First, as dimensionality increases, the number of physical network connections necessarily increases. This means that wiring becomes more complex, and deployment costs grow [29]. Second, as new member nodes are added to a given dimension, it will require more energy and take longer to communicate within the dimension, as each message must travel through more nodes [29, 30]. To address this problem, a modified version of the torus topology, called folded-torus topology, was developed [31].

环面拓扑也有一定的限制。首先，随着维度的增加，物理网络连接的数量必然增加。这意味着布线变得更加复杂，部署成本也随之增加 [29]。其次，当在给定维度中添加新节点时，通信将需要更多的能量，并且由于每条消息必须通过更多的节点，通信时间也会变长 [29, 30]。为了解决这个问题，开发了一种环面拓扑的修改版本，称为折叠环面拓扑 [31]。

Fat-Tree¶

The fat-tree topology [32] is one of the most widely used topologies for efficient data communication. Unlike traditional tree structures in computer science, the fat-tree topology resembles trees in the real world. In a traditional tree topology, all branches have the same thickness, whereas in a (bandwidth) fat-tree topology, the communication bandwidth gets larger, i.e., the fat-tree gets "fatter," as one moves closer to the root. An illustration of a fat-tree topology is given in Figure 6.

胖树拓扑 [32] 是用于高效数据通信的最广泛使用的拓扑结构之一。与计算机科学中的传统树形结构不同，胖树拓扑类似于现实世界中的树木。在传统的树形拓扑中，所有的分支厚度相同，而在（带宽）胖树拓扑中，通信带宽随着离根部越来越近而变得更大，即胖树变得“更胖”。图6中给出了胖树拓扑的示意图。

In the fat-tree topology, only the leaves are used for computation, and all the other nodes are strictly for communication.

For example, when a leaf node wants to communicate with another leaf node, data will flow up the hierarchy recursively until a shared ancestor with the second leaf node is found. Data then flows back down the hierarchy to the second leaf node.

在胖树拓扑中，仅叶节点用于计算，而所有其他节点则专门用于通信。例如，当一个叶节点想要与另一个叶节点通信时，数据将递归地向上流动，直到找到与第二个叶节点共享的祖先节点。然后，数据会从层次结构中下降至第二个叶节点。

There are numerous advantages to using the fat-tree topology. First, the average distance between nodes grows logarithmically（对数） since it is a tree structure in nature [12]. In addition, it has also been proven that fat-trees are recursively scalable and partitionable with multiple well-designed routing algorithms [34]. Some other advantages of the fat-tree topology [33] include its symmetry, regularity, and high connectivity [35], which are attributable to its tree structure.

使用胖树拓扑有许多优点。首先，由于其本质上是一种树形结构，节点之间的平均距离以对数形式增长 [12]。此外，已证明胖树可以通过多种精心设计的路由算法实现递归的可扩展性和可分区性 [34]。胖树拓扑的其他优点 [33] 包括其对称性、规律性和高连通性 [35]，这些特性都归因于其树形结构。

Why average distance between nodes grows __logarithmically（对数）is an advantage

减少了延迟

对数增长意味着随着网络规模的增大，节点之间的平均距离增加得较慢，从而减少了数据传输的延迟。

提高了网络效率

节点之间的平均距离较短，数据包的传输路径较短，提高了数据传输效率，减少了网络拥塞和传输时间。

优化了路由

对数形式的平均距离简化了路由算法，使得网络能够更快地找到最佳路径，提高了网络性能。

支持大规模网络

对数增长特性使得胖树拓扑能够有效支持大规模网络，在网络节点数目非常大的情况下也能保持良好的性能和低延迟。

One disadvantage is the bandwidth requirement for the branches connected to the root [36]. This bandwidth requirement will get higher and higher as the fat-tree grows bigger, leading to a challenge in implementation. Another disadvantage is that due to the structure of the fat-tree topology, it is necessary to traverse all nodes between two leaf nodes when communicating data. In this case, load balancing and scheduling become another challenge [37].

一个缺点是连接到根节点的分支的带宽需求 [36]。随着胖树拓扑的规模变大，这种带宽需求会越来越高，从而导致实现上的挑战。另一个缺点是由于胖树拓扑的结构，在两个叶节点之间进行数据通信时，需要遍历所有中间节点。在这种情况下，负载均衡和调度成为另一个挑战 [37]。

alt text

Dragonfly and Dragonfly+¶

Another commonly used topology is the dragonfly topology [38], which is a hierarchical structure made up of multiple levels (i.e., routers and groups). An example of the dragonfly topology is shown in Figure 7. At the lowest level, each router is connected to multiple (p)terminals. Above this level is the group, which is a collection of routers (a routers) that have connections to routers in the same group (local channels) and connections to routers in other groups (h global channels). In a dragonfly topology, there will be many groups. All routers within a group work as a “virtual router” that has a very high radix \(k\). This radix is equal to the number of routers in the group multiplied by the number of connections each router has, \(k = a \times (p + h)\). All the numbers \(a\), \(p\), and \(h\) can be adjusted in accordance with the deployment requirements.

另一种常用的拓扑结构是龙飞拓扑 [38]，这是一种由多个层级（即路由器和组）组成的分层结构。龙飞拓扑的一个例子如图7所示。在最低层，每个路由器连接到多个终端(p个)。在这一层之上是组，组是由多个路由器（a 个路由器）组成的集合，这些路由器与同一组内的其他路由器（本地通道）和其他组的路由器（h个全球通道）有连接。在龙飞拓扑中，会有许多组。组内的所有路由器作为一个“虚拟路由器”工作，具有非常高的基数 \(k\)。这个基数等于组内路由器的数量乘以每个路由器的连接数，\(k = a \times (p + h)\)。所有的数值 \(a\)、\(p\) 和 \(h\) 可以根据部署需求进行调整。

Note

\(k = a \times (p + h)\)

k: 每组等效成为一个“虚拟交换机”，它的端口数是多少
a: 每组内的原本交换机个数
p: 每个交换机连接p个host，即：连接host的port数目是p
h: 每个交换机连接h个对外其他组的交换机(global link)，即：对外的port数目是h

根据端口数进行等效交换机，上述公式很显然

alt text

Modularity is one of the advantages that the dragonfly topology can provide [39]. Because the designs of intra-group connections and inter-group connections are decoupled, the wiring within a group does not affect the number of groups in the topology. In addition, this modular design also leads to the high scalability of the topology [38, 39]. The dragonfly topology is able to scale to a high number of nodes by simply increasing the effective radix, while still keeping a relatively low number of hops [38].

模块化是龙飞拓扑能够提供的优势之一 [39]。由于组内连接和组间连接的设计是分离的，组内的布线不会影响拓扑中组的数量。此外，这种模块化设计也使得拓扑具有高度的可扩展性 [38, 39]。龙飞拓扑可以通过简单地增加有效基数来扩展到大量节点，同时保持相对较低的跳数 [38]。

At the same time, this high number of connections also leads to a high construction cost for dragonfly topologies [39]. Additionally, these various connections can bring the topology high path diversity, causing low network utilization and throughput under certain traffic patterns [40]. To address this problem, an extended version called dragonfly+ has been introduced in recent years [41]. In the dragonfly+ topology, routers inside the group are connected in a Clos-like topology [32]. It also shows higher scalability and better router utilization [41].

与此同时，大量的连接也导致龙飞拓扑的建设成本很高 [39]。此外，这些各种连接可以为拓扑带来高路径多样性，在某些流量模式下会导致低网络利用率和吞吐量 [40]。为了解决这个问题，近年来引入了一个扩展版本，称为龙飞+ [41]。在龙飞+拓扑中，组内的路由器以类似Clos的拓扑结构连接 [32]。它还显示出更高的可扩展性和更好的路由器利用率 [41]。