Introduction

Remote Direct Memory Access (RDMA) allows end hosts to directly exchange data in main memory while offloading network I/O responsibilities from the CPU onto RDMA-capable network interface cards (RNICs). Given the significant performance benefits that RDMA brings, it has been widely used in high-performance computing (HPC) settings deployed over proprietary Infiniband networks [51]. For its low latency and to free up precious CPU cycles, modern data centers now use RoCEv2 [14] to actively adopt RDMA on Ethernet networks [17, 21, 25].

Data center network topologies (e.g., leaf-spine) are typically designed to scale while providing sufficient redundancy. Specifically, there are multiple end-to-end paths between any two server racks. Thus, to maximally utilize the available network capacity, load balancing is performed to spread network traffic across different paths. Equal Cost Multi-Path (ECMP), in particular, is widely used in today’s data centers [38]. The next-hop path is selected by hashing the packet header fields and taking the result modulo the number of available paths. Packets from a flow therefore always map onto the same path, and thus they are delivered in the order in which they were sent.
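
To make the hash-then-modulo step concrete, below is a minimal sketch of ECMP next-hop selection over a flow's 5-tuple. The hash function and field choice are illustrative assumptions, not the exact scheme used by any particular switch.

```python
# Minimal sketch of ECMP next-hop selection (illustrative, not a specific switch's hash).
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Hash the flow's 5-tuple and take the result modulo the number of paths."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return zlib.crc32(key) % num_paths

# All packets of a flow share the same 5-tuple, so they all take the same path.
path = ecmp_next_hop("10.0.0.1", "10.0.1.2", 40000, 4791, 17, 4)  # UDP port 4791 is RoCEv2
```

Because every packet of a flow produces the same key, ordering is preserved, but the choice ignores how loaded the selected path actually is.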

However, numerous studies have shown that ECMP is unable to distribute the load evenly over different paths when the traffic is highly skewed [9, 27, 52]. A plethora of works have been proposed to address the shortcomings of ECMP. For instance, some works use per-packet switching [18, 23] to achieve near-optimal load balance, but this results in massive amounts of out-of-order packets. Other works [11, 35, 52, 59, 62] split a flow into chunks of packets based on inactive time gaps, so-called flowlet switching. Although flowlet switching provides a way to perform load balancing while avoiding out-of-order packets, it is an opportunistic mechanism. Hence, the efficiency of flowlet-based approaches depends on the traffic characteristics, i.e., whether flowlets are available.

The motivation for this work comes from the observation that existing load balancing algorithms that improve upon ECMP are designed for TCP rather than RDMA. In Fig. 1, we show how RDMA workloads perform with existing load balancing algorithms on our hardware testbed (see §4.2 for setup details). Regardless of the traffic load, existing approaches perform no better than, and often worse than, ECMP. We identify two causes for this performance degradation: (i) RDMA flow characteristics, and (ii) RDMA’s response to out-of-order packets.

Flowlet

A flowlet is a burst of packets within a flow, separated from other bursts by idle gaps. The mechanism was first proposed more than a decade ago and is a powerful load balancing technique, particularly well suited to handling network asymmetry.

Flowlet switching involves the following key elements:

  • Elasticity: flowlets are highly elastic, and their size adjusts automatically to the traffic conditions on the transmission path.
  • Boundary determination:
    • Flowlet boundaries are typically determined by an idle time gap
    • A typical boundary setting is around 10 ms
    • Choosing the boundary value is critical:
      • Too small a value leads to out-of-order packets
      • Too large a value reduces the load balancing benefit
  • Path selection (a minimal sketch follows this list):
    • When a new flowlet is detected, a new transmission path can be chosen for it
    • The path choice can be optimized based on the current network load
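
The following is a minimal sketch of the flowlet-switching idea under the assumptions above: per-flow state records the time of the last packet and the currently assigned path, and an idle gap larger than the threshold opens a new flowlet that may take a new path. The threshold value and random path choice are illustrative; real schemes typically pick the least-loaded path.

```python
# Minimal sketch of flowlet switching (illustrative values and state layout).
import random

FLOWLET_GAP = 500e-6    # inactive-time threshold in seconds (assumed value)
flow_state = {}         # flow_key -> (last_seen_time, assigned_path)

def route_packet(flow_key, now, num_paths):
    last_seen, path = flow_state.get(flow_key, (None, None))
    # An idle gap larger than the threshold starts a new flowlet,
    # which is free to take a different path without risking reordering.
    if last_seen is None or now - last_seen > FLOWLET_GAP:
        path = random.randrange(num_paths)  # real schemes would pick the least-loaded path
    flow_state[flow_key] = (now, path)
    return path
```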

As an example of how a flowlet gap arises, consider a file transfer scenario:

Initial transmission:

  1. TCP starts sending the first group of packets (Flowlet 1)
  2. The sending rate is high, and the packets are closely spaced

Network congestion occurs:

  1. Some packets are lost
  2. TCP congestion control is triggered
  3. The sending rate drops
  4. A noticeable transmission gap appears

A new flowlet forms:

  1. The packets after the gap form a new flowlet
  2. This new flowlet can be sent over a different path

RDMA flow characteristics: Fig. 2 shows the flowlet sizes for TCP and RDMA traffic using different inactive time thresholds ranging from 1µs to 500µs. We see that for RDMA, flowlets are noticeably larger compared to TCP even for a small inactive time threshold of 10µs. This implies that, for a given inactive time gap, there are significantly fewer chances to find flowlets in RDMA traffic compared to TCP. These observations are similar to those reported by Lu et al. [42]. This can be attributed to the fact that TCP transmits in bursts (e.g., TSO [34, 46]) and uses ACKs with batch optimization in order to achieve I/O and CPU efficiency, which naturally creates time gaps between transmissions [20]. On the other hand, RDMA performs hardware-based packet pacing per connection (i.e., rate shaping), resulting in a continuous packet stream with small time gaps. Thus, due to the lack of sufficiently large flowlet gaps, flowlet switching-based approaches cannot work well with RDMA.
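
As a rough illustration of how the measurement behind Fig. 2 could be reproduced from a packet trace, the sketch below splits one flow's packets into flowlets for a given inactive-time threshold. The trace format (a list of (timestamp, bytes) tuples) is a hypothetical assumption, not the paper's actual tooling.

```python
# Hedged sketch: compute flowlet sizes from a single flow's packet trace.
def flowlet_sizes(trace, gap_threshold):
    """trace: time-ordered list of (timestamp_in_seconds, packet_bytes) tuples."""
    sizes, current, prev_ts = [], 0, None
    for ts, nbytes in trace:
        # An idle period longer than the threshold closes the current flowlet.
        if prev_ts is not None and ts - prev_ts > gap_threshold:
            sizes.append(current)
            current = 0
        current += nbytes
        prev_ts = ts
    if current:
        sizes.append(current)
    return sizes

# With RDMA's per-connection hardware pacing, inter-packet gaps rarely exceed a few
# microseconds, so most of a flow collapses into one large flowlet; bursty TCP
# traffic yields many smaller flowlets for the same threshold.
```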

RDMA’s response to out-of-order packets: RoCEv2 inherits many of the design assumptions of RDMA in Infiniband networks, one of which is that there is generally no loss in the network and therefore packets are delivered in order [28]. As a result, when an RNIC receives a packet out of order, it treats it as an indication of packet loss (e.g., due to network congestion) and immediately initiates loss recovery, which causes the sending RNIC to decrease its sending rate. On the contrary, TCP is more tolerant of out-of-order packets, buffering some of them without immediate rate reductions or retransmissions (e.g., by waiting for up to 3 dup-ACKs [13] or more [16]). Also, compared to TCP, which is generally more programmable through kernel-level changes, RNICs are mostly fixed-function and typically have limited resources (e.g., for packet buffering) [60].
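
To make the difference between the two recovery behaviors concrete, here is a toy receiver model contrasting Go-Back-N and Selective Repeat when one packet arrives late. It only illustrates the general mechanisms; it is not the RNICs' actual firmware logic, and the sequence numbers are made up.

```python
# Toy receiver: Go-Back-N discards everything after a gap, Selective Repeat buffers it.
def receive(seq_stream, selective_repeat):
    expected, delivered, buffered, gap_events = 0, [], {}, 0
    for seq in seq_stream:
        if seq == expected:
            delivered.append(seq)
            expected += 1
            # Selective Repeat can flush any buffered in-order continuation.
            while selective_repeat and expected in buffered:
                delivered.append(buffered.pop(expected))
                expected += 1
        elif seq > expected:
            gap_events += 1           # gap detected: triggers loss recovery / rate reduction
            if selective_repeat:
                buffered[seq] = seq   # SR keeps the out-of-order packet
            # GBN drops it, forcing retransmission of everything from `expected`.
    return delivered, gap_events

# A single late packet: sequence 2 overtakes sequence 1.
print(receive([0, 2, 1, 3], selective_repeat=True))   # ([0, 1, 2, 3], 1)
print(receive([0, 2, 1, 3], selective_repeat=False))  # ([0, 1], 2) until retransmissions arrive
```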

To quantify how out-of-order packets affect RDMA performance, we conduct two experiments consisting of one sender and one receiver connected to an Intel Tofino2 programmable switch [5]. The sender and receiver are both equipped with NVIDIA Mellanox ConnectX-5 (CX5) [48] or ConnectX-6 (CX6) [49] RNICs, which support different loss-recovery mechanisms, namely Go-Back-N (GBN) and Selective Repeat (SR), respectively. We artificially induce out-of-order packet arrivals by randomly selecting a packet from the RDMA flow and recirculating it in the switch before forwarding it.

Fig. 3 compares the FCTs for short (10KB) and long (1MB) flows. We note that RDMA is highly sensitive to even a single out-of-order packet arrival. Generally, we observe that CX6 (using SR) exhibits better performance than CX5 (using GBN) due to fewer retransmissions. Nevertheless, in both cases, the performance is impacted by the reception of out-of-order packets, which causes the sender to reduce its sending rate.

The reason for the rate reduction is that the RNIC interprets the detection of a packet gap as packet loss, even though the cause can be either a packet drop or an out-of-order packet arrival. If the cause is out-of-order packets, however, the rate reduction is unnecessary, and together with spurious retransmissions, it leads to lower network utilization.

In this paper, we pose the following question: Is it possible to support fine-grained rerouting for RDMA flows to spread and load balance the traffic among multiple paths without causing out-of-order packet arrivals? We answer this question in the affirmative and propose a solution called ConWeave (or Congestion Weaving), a load balancing framework designed for RDMA in data centers.

ConWeave has two components, one running in the source and the other in the destination top-of-rack (ToR) switch. The component running in the source ToR switch continuously monitors the path delay for active flows and attempts to reroute whenever congestion is detected, instead of passively waiting for flowlet gaps. Such rerouting without sufficient packet time gaps can result in out-of-order packet arrivals at the destination ToR switch.
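
The following is a minimal sketch of this source-ToR behavior: keep a current path per flow, and when a per-RTT delay sample on that path exceeds a congestion threshold, immediately switch to another path. The threshold, the random path choice, and the state layout are illustrative assumptions rather than ConWeave's exact algorithm.

```python
# Hedged sketch of congestion-triggered rerouting at the source ToR (illustrative only).
import random

RTT_THRESHOLD = 300e-6   # assumed per-path congestion threshold in seconds
flow_path = {}           # flow_key -> currently assigned path id

def on_rtt_sample(flow_key, rtt_sample, num_paths):
    path = flow_path.setdefault(flow_key, random.randrange(num_paths))
    if rtt_sample > RTT_THRESHOLD:
        # Congestion detected on the current path: reroute immediately rather than
        # waiting for a flowlet gap. The destination ToR masks the resulting
        # out-of-order arrivals (see the reordering sketch below).
        path = random.choice([p for p in range(num_paths) if p != path])
        flow_path[flow_key] = path
    return path
```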

The key idea of ConWeave is to mask out-of-order packets from the RDMA end-hosts connected to the destination ToR switch. We do so by exploiting the state-of-the-art queue pausing/resuming features [39] on the Intel Tofino2 [5] programmable switch to put packets back in order, entirely in the data plane. To ensure that this can be done given the available hardware resources, ConWeave reroutes traffic in a principled manner such that out-of-order packets only arrive in a predictable pattern and these packets can be put back in the original order in the data plane efficiently. Notably, ConWeave is end-host agnostic. It is designed to be introduced in the network (i.e., at the ToRs) to work seamlessly with existing RNICs and applications without any modifications.
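
Conceptually, the masking can be pictured as follows: packets that arrive early over the new path are held until the tail of traffic still in flight on the old path has arrived (or a timeout fires), and are then released in order. The sketch below models this in software with explicit new-path/old-path-tail markers, which are assumptions for illustration; the actual system achieves the same effect with Tofino2 queue pausing/resuming entirely in the data plane.

```python
# Conceptual model of masking out-of-order arrivals after a reroute (not switch code).
from collections import deque

class ReorderMasker:
    def __init__(self, deliver):
        self.deliver = deliver          # callback toward the end host
        self.held = deque()             # new-path packets that arrived early
        self.waiting_for_tail = True    # still expecting traffic on the old path

    def on_packet(self, pkt, on_new_path, is_old_path_tail):
        if self.waiting_for_tail and on_new_path:
            self.held.append(pkt)       # "pause": buffer packets from the new path
            return
        self.deliver(pkt)               # old-path packets pass through untouched
        if is_old_path_tail:
            self.waiting_for_tail = False
            while self.held:            # "resume": release held packets in order
                self.deliver(self.held.popleft())

masker = ReorderMasker(deliver=print)
masker.on_packet({"seq": 5}, on_new_path=True,  is_old_path_tail=False)  # held
masker.on_packet({"seq": 4}, on_new_path=False, is_old_path_tail=True)   # delivers seq 4, then 5
```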

As shown in Figure 4, ConWeave presents a new paradigm in load balancer designs. With in-network reordering capabilities, ConWeave can reroute traffic frequently while keeping out-of-order packets at bay. As a result, ConWeave is able to reach a better operating point than existing schemes.

The contributions of this paper are as follows:

• We present a lightweight design for resolving packet reordering on a commodity programmable switch. The system has been implemented using P4 running on the Intel Tofino2 [5] switch.

• We design ConWeave, a load balancer that performs per-RTT latency monitoring for active flows and path switching, while masking out-of-order packet arrivals using the above in-network packet reordering scheme.

• We evaluate ConWeave in both software simulations and on a hardware testbed. Our results show that ConWeave improves the average and 99th-percentile FCT by up to 42.3% and 66.8%, respectively, compared to the state-of-the-art.

The paper is organized as follows: in §2, we discuss how reordering can be performed using programmable switches; then, we discuss the design and implementation in §3; later, the evaluation results of ConWeave are presented in §4; before wrapping up in §7, we offer additional discussions and outline future work in §5 and summarize related work in §6.

Structure

Clearly, §2 and §3 are the core of this paper.