Background

We begin by reviewing some relevant background.

Infiniband RDMA and RoCE

RDMA has long been used by the HPC community in special-purpose Infiniband clusters that use credit-based flow control to make the network lossless [4]. Because packet drops are rare in such clusters, the RDMA Infiniband transport (as implemented on the NIC) was not designed to efficiently recover from packet losses. When the receiver receives an out-of-order packet, it simply discards it and sends a negative acknowledgement (NACK) to the sender. When the sender sees a NACK, it retransmits all packets that were sent after the last acknowledged packet (i.e., it performs a go-back-N retransmission).
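To make this concrete, here is a minimal sketch of the go-back-N behavior just described, written in Python purely for illustration (this is not NIC firmware; the class and callback names are assumptions of this sketch, and the NACK is assumed to carry the receiver's next expected sequence number).

```python
class GoBackNReceiver:
    """Out-of-order packets are simply discarded and NACKed."""

    def __init__(self, send_nack, deliver):
        self.send_nack = send_nack   # callback: NACK carrying the next expected seq
        self.deliver = deliver       # callback: hand an in-order payload upward
        self.expected = 0

    def on_packet(self, seq, payload):
        if seq == self.expected:
            self.deliver(payload)
            self.expected += 1
        else:
            # Drop any out-of-order packet and ask the sender to go back to `expected`.
            self.send_nack(self.expected)


class GoBackNSender:
    """On a NACK, retransmit everything sent after the last acknowledged packet."""

    def __init__(self, transmit):
        self.transmit = transmit     # callback that puts (seq, payload) on the wire
        self.next_seq = 0            # sequence number for the next new packet
        self.unacked = {}            # seq -> payload for packets not yet acknowledged

    def send(self, payload):
        self.unacked[self.next_seq] = payload
        self.transmit(self.next_seq, payload)
        self.next_seq += 1

    def on_ack(self, seq):
        # Cumulative ACK: everything up to and including `seq` is delivered.
        for s in list(self.unacked):
            if s <= seq:
                del self.unacked[s]

    def on_nack(self, expected_seq):
        # Packets before `expected_seq` are implicitly acknowledged ...
        self.on_ack(expected_seq - 1)
        # ... and every later packet is retransmitted, even ones that arrived fine.
        for s in range(expected_seq, self.next_seq):
            self.transmit(s, self.unacked[s])
```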

To take advantage of the widespread use of Ethernet in datacenters, RoCE [5, 9] was introduced to enable the use of RDMA over Ethernet. RoCE adopted the same Infiniband transport design (including go-back-N loss recovery), and the network was made lossless using PFC.

RoCE && RoCEv2

We use the term RoCE for both RoCE [5] and its successor RoCEv2 [9] that enables running RDMA, not just over Ethernet, but also over IP-routed networks.

Keynote: RoCE && RoCEv2
  1. RoCE (RDMA over Converged Ethernet): the first generation of the protocol, commonly also called RoCEv1. Its core goal is to run RDMA (Remote Direct Memory Access) over Converged Ethernet, i.e., an Ethernet fabric that can carry multiple traffic classes while guaranteeing no packet loss.
  2. RoCEv2 (RDMA over Converged Ethernet Version 2): the second generation of RoCE, with much broader applicability and better scalability.

The fundamental difference between the two lies in the network layer at which they operate and in how their packets are encapsulated, which directly determines whether the traffic can be routed.

| Feature | RoCE (RoCEv1) | RoCEv2 |
| --- | --- | --- |
| Network layer | Layer-2 (L2) protocol | Layer-3 (L3) protocol |
| Encapsulation | RDMA packets are carried directly inside Ethernet frames, identified by a dedicated EtherType (0x8915). | RDMA packets are carried inside UDP/IP, identified by a dedicated UDP destination port (4791). |
| Routability | Not routable. Operating at L2 with no IP addressing, traffic is confined to a single L2 broadcast domain (e.g., one VLAN) and cannot cross a router. | Routable. Because packets carry an IP header, standard L3 routers can forward them across subnets like any other IP traffic; this is why the paper says RoCEv2 runs over "IP-routed networks". |
| Typical deployment | Small clusters with a simple topology where all servers sit in the same L2 network. | Datacenters of any scale, especially large and hyperscale networks spanning many racks and subnets. |
| Load balancing | Limited; relies mainly on L2 information such as MAC addresses. | Can hash on IP-header fields (source/destination IP and ports) for effective ECMP (equal-cost multipath) load balancing, making better use of network bandwidth. |

TL;DR

  1. RoCEv1 is a Layer-2 technology: it grafts RDMA directly onto Ethernet, but at the cost of confining traffic to the local broadcast domain, which limits scalability.
  2. RoCEv2 neatly solves this by wrapping RDMA traffic in standard UDP/IP packets, so it can be routed freely across Layer-3 IP networks like ordinary traffic. This greatly improves its flexibility and scalability and has made it the mainstream choice in modern large datacenters, especially AI and HPC clusters (the encapsulation difference is sketched below).
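To make the encapsulation difference above concrete, the sketch below builds simplified RoCEv1 and RoCEv2 framings with Python's `struct` module. Header fields are pared down, the IP checksum is left at zero, and the Infiniband transport headers and payload are treated as an opaque blob; apart from EtherType 0x8915 and UDP destination port 4791, all names and values are assumptions of this sketch.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915   # RoCEv1: dedicated EtherType
ROCE_V2_UDP_DPORT = 4791     # RoCEv2: dedicated UDP destination port

def ethernet_header(dst_mac: bytes, src_mac: bytes, ethertype: int) -> bytes:
    return dst_mac + src_mac + struct.pack("!H", ethertype)

def rocev1_frame(dst_mac: bytes, src_mac: bytes, ib_payload: bytes) -> bytes:
    # RoCEv1: the IB transport packet rides directly in an Ethernet frame,
    # so there is no IP header to route on and traffic stays within one L2 domain.
    return ethernet_header(dst_mac, src_mac, ROCE_V1_ETHERTYPE) + ib_payload

def rocev2_packet(src_ip: bytes, dst_ip: bytes, ib_payload: bytes) -> bytes:
    # RoCEv2: the IB transport packet rides in UDP/IP, so ordinary L3 routers
    # (and ECMP hashing on the IP/UDP headers) can handle it like any IP packet.
    udp_len = 8 + len(ib_payload)
    udp = struct.pack("!HHHH", 49152, ROCE_V2_UDP_DPORT, udp_len, 0) + ib_payload
    ip = struct.pack("!BBHHHBBH4s4s",
                     0x45, 0, 20 + udp_len,   # version/IHL, TOS, total length
                     0, 0,                    # identification, flags/fragment offset
                     64, 17, 0,               # TTL, protocol=UDP, checksum (left 0 here)
                     src_ip, dst_ip)
    return ip + udp

# Illustrative usage with placeholder addresses and an opaque IB payload.
frame = rocev1_frame(b"\x02" * 6, b"\x02" * 6, b"BTH+payload")
packet = rocev2_packet(b"\x0a\x00\x00\x01", b"\x0a\x00\x00\x02", b"BTH+payload")
```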

Priority Flow Control

Priority Flow Control (PFC) [6] is Ethernet’s flow control mechanism, in which a switch sends a pause (or X-OFF) frame to the upstream entity (a switch or a NIC) when a queue exceeds a certain configured threshold. When the queue drains below this threshold, an X-ON frame is sent to resume transmission. When configured correctly, PFC makes the network lossless (as long as all network elements remain functioning). However, this coarse reaction to congestion is agnostic to which flows are causing it, which results in various performance issues that have been documented in numerous papers in recent years [23, 24, 35, 37, 38]. These issues range from mild (e.g., unfairness and head-of-line blocking) to severe, such as the “pause spreading” highlighted in [23] and even network deadlocks [24, 35, 37]. In an attempt to mitigate these issues, congestion control mechanisms have been proposed for RoCE (e.g., DCQCN [38] and Timely [29]) which reduce the sending rate on detecting congestion, but they are not enough to eradicate the need for PFC. Hence, there is now broad agreement that PFC makes networks harder to understand and manage, and can lead to myriad performance problems that then need to be dealt with.
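As an illustration of the pause mechanism just described (and only that; real switches derive PFC thresholds from buffer headroom, link speed, and cable length), here is a minimal per-priority sketch in Python. The class and callback names are hypothetical.

```python
class PfcIngressQueue:
    """Illustrative PFC state for one priority class on one switch ingress queue."""

    def __init__(self, threshold_bytes: int, send_pause_frame):
        self.threshold = threshold_bytes          # configured PFC threshold
        self.send_pause_frame = send_pause_frame  # callback toward the upstream switch/NIC
        self.depth = 0                            # current queue occupancy in bytes
        self.upstream_paused = False

    def enqueue(self, packet_bytes: int):
        self.depth += packet_bytes
        if self.depth > self.threshold and not self.upstream_paused:
            # Pause everything the upstream sends on this priority class;
            # PFC cannot single out the flows actually causing the buildup.
            self.send_pause_frame("X-OFF")
            self.upstream_paused = True

    def dequeue(self, packet_bytes: int):
        self.depth = max(0, self.depth - packet_bytes)
        if self.depth <= self.threshold and self.upstream_paused:
            self.send_pause_frame("X-ON")
            self.upstream_paused = False
```

Because the pause applies to an entire priority class rather than to the offending flows, well-behaved flows sharing that class are blocked too, which is where the unfairness and head-of-line blocking mentioned above come from.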

iWARP vs RoCE

iWARP [33] was designed to support RDMA over a fully general (i.e., not loss-free) network. iWARP implements the entire TCP stack in hardware along with multiple other layers that it needs to translate TCP’s byte stream semantics to RDMA segments. Early in our work, we engaged with multiple NIC vendors and datacenter operators in an attempt to understand why iWARP was not more broadly adopted (since we believed the basic architectural premise underlying iWARP was correct). The consistent response we heard was that iWARP is significantly more complex and expensive than RoCE, with inferior performance [13].

We also looked for empirical datapoints to validate or refute these claims. We ran RDMA Write benchmarks on two machines connected to one another, using Chelsio T580-CR 40Gbps iWARP NICs on both machines for one set of experiments, and Mellanox MCX416A-BCAT 56Gbps RoCE NICs (with link speed set to 40Gbps) for another. Both NICs had similar specifications, and at the time of purchase, the iWARP NIC cost $760, while the RoCE NIC cost $420. Raw NIC performance values for 64-byte batched Writes on a single queue-pair are reported in Table 1. We find that iWARP has 3× higher latency and 4× lower throughput than RoCE.

These price and performance differences could be attributed to many factors other than transport design complexity (such as differences in profit margins, supported features and engineering effort) and hence should be viewed as anecdotal evidence at best. Nonetheless, they show that our conjecture (in favor of implementing loss recovery at the endhost NIC) was certainly not obvious based on current iWARP NICs.

Our primary contribution is to show that iWARP, somewhat surprisingly, did in fact have the right philosophy: explicitly handling packet losses in the NIC leads to better performance than having a lossless network. However, efficiently handling packet loss does not require implementing the entire TCP stack in hardware as iWARP did. Instead, we identify the incremental changes to be made to current RoCE NICs, leading to a design which (i) does not require PFC yet achieves better network-wide performance than both RoCE and iWARP (§4), and (ii) is much closer to RoCE’s implementation with respect to both NIC performance and complexity (§6) and is thus significantly less complex than iWARP.
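Purely as an illustration of the architectural point (the paper's actual NIC changes are described in §4, not in this section), the sketch below shows one textbook alternative to the go-back-N receiver above: a selective-repeat-style receiver that buffers out-of-order packets and requests retransmission only of the missing ones. The names are hypothetical and the bookkeeping is deliberately simplified; this is not necessarily the paper's design.

```python
class SelectiveRepeatReceiver:
    """Illustrative receiver that tolerates loss without go-back-N (not the paper's design)."""

    def __init__(self, send_nack, deliver):
        self.send_nack = send_nack      # callback: request one missing sequence number
        self.deliver = deliver          # callback: hand an in-order payload to the application
        self.expected = 0               # next in-order sequence number
        self.out_of_order = {}          # seq -> payload buffered past a hole

    def on_packet(self, seq, payload):
        if seq == self.expected:
            self.deliver(payload)
            self.expected += 1
            # Drain any buffered packets that are now in order.
            while self.expected in self.out_of_order:
                self.deliver(self.out_of_order.pop(self.expected))
                self.expected += 1
        elif seq > self.expected:
            # Unlike the go-back-N receiver, keep the packet and ask only for the
            # missing ones (duplicate NACKs are possible; a real design would suppress them).
            self.out_of_order[seq] = payload
            for missing in range(self.expected, seq):
                if missing not in self.out_of_order:
                    self.send_nack(missing)
        # seq < expected: duplicate of something already delivered; ignore it.
```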
