Revisiting Network Support for RDMA

Abstract

The advent of RoCE (RDMA over Converged Ethernet) has led to a significant increase in the use of RDMA in datacenter networks. To achieve good performance, RoCE requires a lossless network, which is in turn achieved by enabling Priority Flow Control (PFC) within the network. However, PFC brings with it a host of problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. Rather than seek to fix these issues, we instead ask: is PFC fundamentally required to support RDMA over Ethernet?

We show that the need for PFC is an artifact of current RoCE NIC designs rather than a fundamental requirement. We propose an improved RoCE NIC (IRN) design that makes a few simple changes to the RoCE NIC for better handling of packet losses. We show that IRN (without PFC) outperforms RoCE (with PFC) by 6-83% for typical network scenarios. Thus not only does IRN eliminate the need for PFC, it improves performance in the process! We further show that the changes that IRN introduces can be implemented with modest overheads of about 3-10% to NIC resources. Based on our results, we argue that research and industry should rethink the current trajectory of network support for RDMA.

TL;DR

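Today's datacenter networks running RoCE commonly rely on a mechanism called PFC to achieve high performance, but PFC itself introduces new problems such as congestion spreading. The authors argue that the root of the problem lies in the design of the NIC and propose an improved NIC (IRN). The new NIC no longer needs PFC, performs better than the original, and adds only a small cost.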

Introduction

Datacenter networks offer higher bandwidth and lower latency than traditional wide-area networks. However, traditional endhost networking stacks, with their high latencies and substantial CPU overhead, have limited the extent to which applications can make use of these characteristics. As a result, several large datacenters have recently adopted RDMA, which bypasses the traditional networking stacks in favor of direct memory accesses.

RDMA over Converged Ethernet (RoCE) has emerged as the canonical method for deploying RDMA in Ethernet-based datacenters [23, 38]. The centerpiece of RoCE is a NIC that (i) provides mechanisms for accessing host memory without CPU involvement and (ii) supports very basic network transport functionality. Early experience revealed that RoCE NICs only achieve good end-to-end performance when run over a lossless network, so operators turned to Ethernet’s Priority Flow Control (PFC) mechanism to achieve minimal packet loss. The combination of RoCE and PFC has enabled a wave of datacenter RDMA deployments.

However, the current solution is not without problems. In particular, PFC adds management complexity and can lead to significant performance problems such as head-of-the-line blocking, congestion spreading, and occasional deadlocks [23, 24, 35, 37, 38]. Rather than continue down the current path and address the various problems with PFC, in this paper we take a step back and ask whether it was needed in the first place. To be clear, current RoCE NICs require a lossless fabric for good performance. However, the question we raise is: can the RoCE NIC design be altered so that we no longer need a lossless network fabric?

We answer this question in the affirmative, proposing a new design called IRN (for Improved RoCE NIC) that makes two incremental changes to current RoCE NICs: (i) more efficient loss recovery, and (ii) basic end-to-end flow control to bound the number of in-flight packets (§3). We show, via extensive simulations on a RoCE simulator obtained from a commercial NIC vendor, that IRN performs better than current RoCE NICs, and that IRN does not require PFC to achieve high performance; in fact, IRN often performs better without PFC (§4). We detail the extensions to the RDMA protocol that IRN requires (§5) and use comparative analysis and FPGA synthesis to evaluate the overhead that IRN introduces in terms of NIC hardware resources (§6). Our results suggest that adding IRN functionality to current RoCE NICs would add as little as 3-10% overhead in resource consumption, with no deterioration in message rates.
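
The two changes named above can be pictured with a small toy model. This is a minimal sketch and not the IRN design itself: the class `BoundedSender`, the `inflight_cap` parameter, and the choice to resend only the lost packet are assumptions made for illustration. The text here states only that in-flight packets are bounded and that loss recovery is made more efficient.

```python
from collections import deque


class BoundedSender:
    """Toy sender: caps in-flight packets and resends only lost ones.

    All names and policies here are hypothetical illustrations.
    """

    def __init__(self, total_packets, inflight_cap):
        if inflight_cap < 1:
            raise ValueError("inflight_cap must be at least 1")
        self.total = total_packets
        self.cap = inflight_cap   # assumed bound on unacknowledged packets
        self.next_new = 0         # next never-sent sequence number
        self.inflight = set()     # sent but not yet acknowledged
        self.retx = deque()       # lost packets waiting to be resent
        self.acked = set()

    def sendable(self):
        """Return the packets that may be sent now without exceeding the cap."""
        out = []
        while len(self.inflight) < self.cap:
            if self.retx:
                seq = self.retx.popleft()   # retransmissions go first
            elif self.next_new < self.total:
                seq = self.next_new
                self.next_new += 1
            else:
                break
            self.inflight.add(seq)
            out.append(seq)
        return out

    def on_ack(self, seq):
        """Record an acknowledgement and free one in-flight slot."""
        self.inflight.discard(seq)
        self.acked.add(seq)

    def on_loss(self, seq):
        """Queue only the lost packet for resend, not the packets after it."""
        if seq in self.inflight:
            self.inflight.discard(seq)
            self.retx.append(seq)

    def done(self):
        return len(self.acked) == self.total
```

With `BoundedSender(total_packets=6, inflight_cap=3)`, the first call to `sendable()` releases packets 0, 1, and 2. If packet 0 is then acknowledged and packet 1 is reported lost, the next call releases packet 1 again plus the new packet 3, so the sender never has more than three packets outstanding.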

A natural question that arises is how IRN compares to iWARP. iWARP [33] long ago proposed a philosophy similar to IRN's: handling packet losses efficiently in the NIC rather than making the network lossless. What we show is that iWARP's failing was in its design choices. The differences between the iWARP and IRN designs stem from their starting points: iWARP aimed for full generality, which led its designers to put the full TCP/IP stack on the NIC, requiring multiple layers of translation between RDMA abstractions and traditional TCP bytestream abstractions. As a result, iWARP NICs are typically far more complex than RoCE ones, with higher cost and lower performance (§2). In contrast, IRN starts with the much simpler design of RoCE and asks what minimal features can be added to eliminate the need for PFC.

More generally: while the merits of iWARP vs. RoCE have been a long-running debate in industry, there is no conclusive or rigorous evaluation that compares the two architectures. Instead, RoCE has emerged as the de facto winner in the marketplace, and brought with it the implicit (and still lingering) assumption that a lossless fabric is necessary to achieve RoCE's high performance. Our results are the first to rigorously show that, counter to what market adoption might suggest, iWARP in fact had the right architectural philosophy, although a needlessly complex design approach.

Hence, one might view IRN and our results in one of two ways: (i) a new design for RoCE NICs which, at the cost of a few incremental modifications, eliminates the need for PFC and leads to better performance, or, (ii) a new incarnation of the iWARP philosophy which is simpler in implementation and faster in performance.
