INTRODUCTION

Growing expertise with clusters of commodity PCs has enabled a number of institutions to harness petaflops of computation power and petabytes of storage in a cost-efficient manner. Clusters consisting of tens of thousands of PCs are not unheard of in the largest institutions, and thousand-node clusters are increasingly common in universities, research labs, and companies. Important application classes include scientific computing, financial analysis, data analysis and warehousing, and large-scale network services.

Today, the principal bottleneck in large-scale clusters is often inter-node communication bandwidth. Many applications must exchange information with remote nodes to proceed with their local computation. For example, MapReduce [12] must perform significant data shuffling to transport the output of its map phase before proceeding with its reduce phase. Applications running on cluster-based file systems [18, 28, 13, 26] often require remote-node access before proceeding with their I/O operations. A query to a web search engine often requires parallel communication with every node in the cluster hosting the inverted index to return the most relevant results [7]. Even between logically distinct clusters, there are often significant communication requirements, e.g., when updating the inverted index for individual clusters performing search from the site responsible for building the index. Internet services increasingly employ service-oriented architectures [13], where the retrieval of a single web page can require coordination and communication with literally hundreds of individual sub-services running on remote nodes. Finally, the significant communication requirements of parallel scientific applications are well known [27, 8].

There are two high-level choices for building the communication fabric for large-scale clusters. One option leverages specialized hardware and communication protocols, such as InfiniBand [2] or Myrinet [6]. While these solutions can scale to clusters of thousands of nodes with high bandwidth, they do not leverage commodity parts (and are hence more expensive) and are not natively compatible with TCP/IP applications. The second choice leverages commodity Ethernet switches and routers to interconnect cluster machines. This approach supports a familiar management infrastructure along with unmodified applications, operating systems, and hardware. Unfortunately, aggregate cluster bandwidth scales poorly with cluster size, and achieving the highest levels of bandwidth incurs non-linear cost increases with cluster size.

For compatibility and cost reasons, most cluster communication systems follow the second approach. However, communication bandwidth in large clusters may become oversubscribed by a significant factor depending on the communication patterns. That is, two nodes connected to the same physical switch may be able to communicate at full bandwidth (e.g., 1Gbps) but moving between switches, potentially across multiple levels in a hierarchy, may limit available bandwidth severely. Addressing these bottlenecks requires non-commodity solutions, e.g., large 10Gbps switches and routers. Further, typical single-path routing along trees of interconnected switches means that overall cluster bandwidth is limited by the bandwidth available at the root of the communication hierarchy. Even as we are at a transition point where 10Gbps technology is becoming cost-competitive, the largest 10Gbps switches still incur significant cost and still limit overall available bandwidth for the largest clusters.
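
To make the oversubscription problem concrete, the sketch below computes the oversubscription ratio of a single edge switch in a traditional tree topology. The numbers (48 hosts with 1Gbps interfaces behind one 10Gbps uplink) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# A minimal sketch (not from the paper) of how oversubscription arises in a
# tree of switches. All link speeds and port counts below are hypothetical.

def oversubscription_ratio(hosts_per_switch, host_link_gbps, uplink_gbps):
    """Worst-case host demand through a switch's uplink divided by uplink capacity.

    A ratio of 1 means hosts can reach full interface speed even when all
    traffic leaves the switch; larger values mean the uplink is the bottleneck.
    """
    worst_case_demand = hosts_per_switch * host_link_gbps
    return worst_case_demand / uplink_gbps

# 48 hosts with 1Gbps NICs behind an edge switch with a single 10Gbps uplink:
print(oversubscription_ratio(48, 1, 10))  # 4.8, i.e. roughly 4.8:1 oversubscription
```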

In this context, the goal of this paper is to design a data center communication architecture that meets the following goals:

  • Scalable interconnection bandwidth: it should be possible for an arbitrary host in the data center to communicate with any other host in the network at the full bandwidth of its local network interface.

  • Economies of scale: just as commodity personal computers became the basis for large-scale computing environments, we hope to leverage the same economies of scale to make cheap off-the-shelf Ethernet switches the basis for large-scale data center networks.

  • Backward compatibility: the entire system should be backward compatible with hosts running Ethernet and IP. That is, existing data centers, which almost universally leverage commodity Ethernet and run IP, should be able to take advantage of the new interconnect architecture with no modifications.

We show that by interconnecting commodity switches in a fat-tree architecture, we can achieve the full bisection bandwidth of clusters consisting of tens of thousands of nodes. Specifically, one instance of our architecture employs 48-port Ethernet switches capable of providing full bandwidth to up to 27,648 hosts. By leveraging strictly commodity switches, we achieve lower cost than existing solutions while simultaneously delivering more bandwidth. Our solution requires no changes to end hosts, is fully TCP/IP compatible, and imposes only moderate modifications to the forwarding functions of the switches themselves. We also expect that our approach will be the only way to deliver full bandwidth for large clusters once 10 GigE switches become commodity at the edge, given the current lack of any higher-speed Ethernet alternatives (at any cost). Even when higher-speed Ethernet solutions become available, they will initially have small port densities at significant cost.
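
As a quick check on the 27,648-host figure, the sketch below evaluates the standard k-ary fat-tree counts (k^3/4 hosts and 5k^2/4 switches when built from k-port switches, as developed later in the paper) for k = 48. The code is only an arithmetic illustration, not part of the system itself.

```python
# Host and switch counts for a fat-tree built entirely from k-port switches.
# The formulas follow the standard k-ary fat-tree construction: k pods, each
# with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches.

def fat_tree_counts(k):
    """Return (hosts, switches) for a fat-tree of k-port switches (k even)."""
    assert k % 2 == 0, "k must be even"
    hosts = k ** 3 // 4                  # k pods * (k/2 edge switches) * (k/2 hosts each)
    edge = aggregation = k * (k // 2)    # per layer, summed across all k pods
    core = (k // 2) ** 2
    return hosts, edge + aggregation + core

print(fat_tree_counts(48))  # (27648, 2880): 27,648 hosts from 2,880 48-port switches
```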
