
BACKGROUND

Current Data Center Network Topologies

We conducted a study to determine the current best practices for data center communication networks. We focus here on commodity designs leveraging Ethernet and IP; we discuss the relationship of our work to alternative technologies in Section 7.

Topology

Typical architectures today consist of either two- or three-level trees of switches or routers. A three-tiered design (see Figure 1) has a core tier at the root of the tree, an aggregation tier in the middle, and an edge tier at the leaves of the tree. A two-tiered design has only the core and the edge tiers. Typically, a two-tiered design can support between 5K and 8K hosts. Since we target approximately 25,000 hosts, we restrict our attention to the three-tier design.

Switches at the leaves of the tree have some number of GigE ports (48–288) as well as some number of 10 GigE uplinks to one or more layers of network elements that aggregate and transfer packets between the leaf switches. In the higher levels of the hierarchy there are switches with 10 GigE ports (typically 32–128) and significant switching capacity to aggregate traffic between the edges.

(Figure 1: common hierarchical data center interconnect topology, with core, aggregation, and edge tiers)

Oversubscription

Many data center designs introduce oversubscription as a means to lower the total cost of the design. We define the term oversubscription to be the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular communication topology. An oversubscription of 1:1 indicates that all hosts may potentially communicate with arbitrary other hosts at the full bandwidth of their network interface (e.g., 1 Gb/s for commodity Ethernet designs). An oversubscription value of 5:1 means that only 20% of available host bandwidth is available for some communication patterns. Typical designs are oversubscribed by a factor of 2.5:1 (400 Mbps) to 8:1 (125 Mbps) [1]. Although data centers with oversubscription of 1:1 are possible for 1 Gb/s Ethernet, as we discuss in Section 2.1.4, the cost for such designs is typically prohibitive, even for modest-size data centers. Achieving full bisection bandwidth for 10 Gb/s Ethernet is not currently possible when moving beyond a single switch.

Oversubscription

Ratio = \(\frac{\text{worst-case achievable aggregate bandwidth among end hosts}}{\text{total bisection bandwidth}}\)

Common ratios: 2.5:1 (400 Mbps per host), 5:1 (200 Mbps per host), 8:1 (125 Mbps per host)
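
To make the definition concrete, here is a minimal Python sketch (our own illustration, with hypothetical names) that converts an oversubscription ratio into the worst-case bandwidth each host can count on, assuming commodity 1 Gb/s host interfaces as above:

```python
# Minimal sketch (hypothetical helper, not from the paper): worst-case per-host
# bandwidth under a given oversubscription ratio, assuming 1 Gb/s host NICs.

def worst_case_host_bandwidth_mbps(oversubscription: float, nic_mbps: float = 1000.0) -> float:
    """Worst-case achievable bandwidth per host in Mb/s; pass 2.5 for a 2.5:1 ratio."""
    return nic_mbps / oversubscription

for ratio in (1.0, 2.5, 5.0, 8.0):
    print(f"{ratio}:1 oversubscription -> {worst_case_host_bandwidth_mbps(ratio):.0f} Mb/s per host")
# 1:1 -> 1000, 2.5:1 -> 400, 5:1 -> 200, 8:1 -> 125, matching the figures in the text.
```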

Multi-path Routing

Delivering full bandwidth between arbitrary hosts in larger clusters requires a “multi-rooted” tree with multiple core switches (see Figure 1). This in turn requires a multi-path routing technique, such as ECMP [19]. Currently, most enterprise core switches support ECMP. Without the use of ECMP, the largest cluster that can be supported with a singly rooted core with 1:1 oversubscription would be limited to 1,280 nodes (corresponding to the bandwidth available from a single 128-port 10 GigE switch).

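To spell out the arithmetic behind that limit: a single 128-port 10 GigE core switch provides \(128 \times 10\ \text{Gb/s} = 1{,}280\ \text{Gb/s}\) of capacity, which at 1 Gb/s per host supports at most 1,280 hosts with 1:1 oversubscription.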

To take advantage of multiple paths, ECMP performs static load splitting among flows. This does not account for flow bandwidth in making allocation decisions, which can lead to oversubscription even for simple communication patterns. Further, current ECMP implementations limit the multiplicity of paths to 8–16, which is often less diversity than required to deliver high bisection bandwidth for larger data centers. In addition, the number of routing table entries grows multiplicatively with the number of paths considered, which increases cost and can also increase lookup latency.

Note

ECMP (Equal-Cost Multi-Path routing) performs static load splitting among flows but does not consider flow bandwidth, so it can lead to oversubscription even for simple communication patterns. This is because ECMP assigns each flow to a path based on a hash, spreading flows across multiple paths without regard to each flow's actual bandwidth demand (a short sketch of this hash-based splitting follows the list below).

  1. Static allocation

    • ECMP splits load statically, spreading flows evenly across the available paths rather than adjusting to each flow's actual bandwidth demand. Even when flows on some paths demand far more bandwidth than flows on others, ECMP keeps distributing flows uniformly, so individual paths can become overloaded.
  2. Flow imbalance

    • Because bandwidth demands can differ greatly between flows, a simple static split cannot adapt to these differences. When several high-bandwidth flows are assigned to the same path, that path can become oversubscribed, causing congestion and degraded performance.
  3. No awareness of flow bandwidth

    • When placing flows, ECMP looks mainly at path availability rather than at how much bandwidth each flow actually uses. Because flow bandwidth is ignored, some paths may end up carrying more traffic than others, even when their capacity cannot accommodate all of the flows assigned to them, ultimately resulting in oversubscription.
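
The following short Python sketch (our own illustration, not code from the paper; the flows and hash choice are hypothetical) shows why hash-based static splitting ignores bandwidth: each flow is pinned to whichever path its 5-tuple hashes to, so two large flows can land on the same path while other paths sit nearly idle:

```python
# Illustrative sketch of hash-based static ECMP splitting (our own example, not the
# paper's code). Each flow is pinned to one of `num_paths` equal-cost paths by hashing
# its 5-tuple; the path choice never looks at the flow's bandwidth demand.

import zlib
from collections import defaultdict

def ecmp_path(flow_5tuple: tuple, num_paths: int) -> int:
    """Choose a path index for a flow; every packet of the flow follows this path."""
    return zlib.crc32(repr(flow_5tuple).encode()) % num_paths

# Hypothetical flows: (src, dst, src_port, dst_port, proto) plus a demand in Mb/s.
flows = [
    (("10.0.0.1", "10.4.0.2", 33001, 80, "tcp"), 900),  # large flow
    (("10.0.0.2", "10.4.0.3", 33002, 80, "tcp"), 900),  # large flow
    (("10.0.0.3", "10.4.0.4", 33003, 80, "tcp"), 10),   # small flow
    (("10.0.0.4", "10.4.0.5", 33004, 80, "tcp"), 10),   # small flow
]

offered_load = defaultdict(float)  # path index -> offered load in Mb/s
for five_tuple, demand_mbps in flows:
    offered_load[ecmp_path(five_tuple, num_paths=4)] += demand_mbps

print(dict(offered_load))
# Whenever the two 900 Mb/s flows hash to the same path, that path is offered
# 1800 Mb/s while other paths stay nearly idle -- local oversubscription even
# though plenty of aggregate capacity exists elsewhere.
```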

Cost

The cost for building a network interconnect for a large cluster greatly affects design decisions. As we discussed above, oversubscription is typically introduced to lower the total cost. Here we give the rough cost of various configurations for different numbers of hosts and oversubscription ratios using current best practices. We assume a cost of $7,000 for each 48-port GigE switch at the edge and $700,000 for 128-port 10 GigE switches in the aggregation and core layers. We do not consider cabling costs in these calculations.

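As a back-of-the-envelope illustration of how these prices translate into Figure 2-style totals, here is a rough Python sketch. It is our own simplified model (edge switches sized by host count, one aggregation-layer 10 GigE port for each 10 Gb/s of edge uplink, plus matching aggregation-to-core ports), not the exact accounting behind Figure 2, so its numbers only approximate the curves:

```python
# Rough, back-of-the-envelope cost model for the traditional hierarchical design
# (our own simplification; Figure 2 in the paper uses its own, more detailed
# accounting, so these numbers only approximate its curves).
#
# Assumptions taken from the text where possible:
#   * 48-port GigE edge switches at $7,000 each, one host per GigE port;
#   * 128-port 10 GigE switches at $700,000 each in the aggregation and core layers;
#   * bandwidth that must leave the edge = hosts * 1 Gb/s / oversubscription;
#   * each 10 Gb/s of that bandwidth consumes three 10 GigE ports in total
#     (aggregation port facing the edge, aggregation port facing the core, core port).

import math

EDGE_PRICE, EDGE_PORTS = 7_000, 48
BIG_PRICE, BIG_PORTS = 700_000, 128

def rough_cost_usd(hosts: int, oversubscription: float) -> int:
    edge_switches = math.ceil(hosts / EDGE_PORTS)
    core_bound_gbps = hosts / oversubscription
    ten_gige_ports = 3 * core_bound_gbps / 10
    big_switches = math.ceil(ten_gige_ports / BIG_PORTS)
    return edge_switches * EDGE_PRICE + big_switches * BIG_PRICE

for ratio in (1.0, 3.0):
    print(f"20,000 hosts at {ratio}:1 -> ${rough_cost_usd(20_000, ratio) / 1e6:.1f}M")
# Under this simplified model, 20,000 hosts at 1:1 come out around $36M
# (the paper's Figure 2 reports roughly $37M for that configuration).
```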

Figure 2 plots the cost in millions of US dollars as a function of the total number of end hosts on the x axis. Each curve represents a target oversubscription ratio. For instance, the switching hardware to interconnect 20,000 hosts with full bandwidth among all hosts comes to approximately $37M. The curve corresponding to an oversubscription of 3:1 plots the cost to interconnect end hosts where the maximum available bandwidth for arbitrary end host communication would be limited to approximately 330 Mbps. We also include the cost to deliver an oversubscription of 1:1 using our proposed fat-tree architecture for comparison.

Overall, we find that existing techniques for delivering high levels of bandwidth in large clusters incur significant cost and that fat-tree based cluster interconnects hold significant promise for delivering scalable bandwidth at moderate cost. However, in some sense, Figure 2 understates the difficulty and expense of employing the highest-end components in building data center architectures. In 2008, 10 GigE switches are on the verge of becoming commodity parts; there is roughly a factor of 5 differential in price per port per bit/sec when comparing GigE to 10 GigE switches, and this differential continues to shrink. To explore the historical trend, we show in Table 1 the cost of the largest cluster configuration that could be supported using the highest-end switches available in a particular year. We based these values on a historical study of product announcements from various vendors of high-end 10 GigE switches in 2002, 2004, 2006, and 2008.

Clos Networks / Fat-Trees

Today, the price differential between commodity and non-commodity switches provides a strong incentive to build large-scale communication networks using many smaller commodity switches rather than fewer larger and more expensive ones. More than fifty years ago, similar trends in telephone switches led Charles Clos to design a network topology that delivers high levels of bandwidth for many end devices by appropriately interconnecting smaller commodity switches [11].

We adopt a specific instance of a Clos topology called a fat-tree [23] to interconnect commodity Ethernet switches. A k-ary fat-tree is organized as shown in Figure 3. There are k pods, each containing two layers of k/2 switches. Each k-port switch in the lower layer is directly connected to k/2 hosts. The remaining k/2 ports are connected to k/2 aggregation switches in the upper layer of the hierarchy.

(Figure 3: simple fat-tree topology with k = 4)

There are \((k/2)^2\) k-port core switches. Each core switch has one port connected to each of the k pods. The \(i\)-th port of any core switch is connected to pod \(i\) such that consecutive ports in the aggregation layer of each pod switch are connected to core switches in strides of k/2. Generally, a fat-tree built with k-port switches supports \(k^3/4\) hosts. In this paper, we focus on designs where k = 48, though our approach generalizes to arbitrary values of k.

An advantage of the fat-tree topology is that all switching elements are identical, enabling us to leverage inexpensive commodity parts for all of the switches in the communication architecture. Further, fat-trees are rearrangeably non-blocking, meaning that for arbitrary communication patterns, there exists some set of paths that will fully utilize all available bandwidth to the end hosts in the topology. Achieving an oversubscription ratio of 1:1 in practice may be challenging due to the need to prevent packet reordering in TCP flows.

Figure 3 shows the simplest non-trivial instance of the fat-tree with k = 4. All hosts connected to the same edge switch form their own subnet. Therefore, all traffic to a host connected to the same lower-layer switch is switched, while all other traffic is routed.

As an example, a fat-tree built from 48-port GigE switches would consist of 48 pods, each containing an edge layer and an aggregation layer with 24 switches each. The edge switches in every pod are connected to 24 hosts each. The network supports 27,648 hosts, organized into 1,152 subnets with 24 hosts each. There are 576 equal-cost paths between any pair of hosts in different pods. The cost of deploying such a network architecture would be approximately $8.64M, compared to $37M for traditional techniques described earlier.
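
These counts follow mechanically from the structure described above; a small Python sketch (our own helper, not from the paper) makes the arithmetic explicit for any even k:

```python
# Structural parameters of a k-ary fat-tree built from identical k-port switches,
# following the description above (k must be even). Our own helper, for checking
# the arithmetic.

def fat_tree_params(k: int) -> dict:
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),         # k pods, k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),  # k pods, k/2 aggregation switches per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k ** 3 // 4,                  # k/2 hosts per edge switch
        "subnets": k * (k // 2),               # one subnet per edge switch
        "inter_pod_paths": (k // 2) ** 2,      # equal-cost paths between hosts in different pods
    }

print(fat_tree_params(4))   # the k = 4 example of Figure 3
print(fat_tree_params(48))  # 27,648 hosts, 1,152 subnets, 576 core switches and paths
```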

Summary