ConWeave Design

The newfound capability to handle out-of-order packets in the network opens up new opportunities for fine-grained load balancing mechanisms. Thus, we design ConWeave, a new load-balancing framework that tightly couples the network’s reordering capability with “cautious” rerouting decisions.

Here, we describe the design and implementation of ConWeave. First, we provide an overview in §3.1, followed by discussions on the key components of ConWeave in §3.2 and §3.3. Finally, we discuss the implementation aspects of ConWeave in §3.4. We list the key parameters and packet types in Table 1 and Table 2, respectively.

Overview of ConWeave

The key idea of ConWeave is that we make decisions frequently (approximately every RTT) to determine if rerouting is advantageous based on network measurements. However, we need to take care that packets will be rerouted only when out-of-order packets can be efficiently sorted in the network prior to delivery to the end hosts.

Figure 7 depicts the overview of ConWeave:

• There are two components, one running on the source ToR switch and the other on the destination ToR switch. The ToR switches are connected through the data center network. We assume the use of some form of source routing so that the source ToR switch can "pin" a flow to a given path.

• The component on the source ToR performs the following functions: (1) latency monitoring to identify "bad" paths to avoid, (2) selecting a new path if congestion is detected, and (3) implementing the mechanism that ensures rerouting can be done "safely" without causing out-of-order arrivals at the end hosts.

• The component at the destination ToR switch provides a packet reordering function that masks out-of-order delivery caused by rerouting.

To make further discussion of ConWeave more concrete, we refer to Fig. 8 when presenting the details using the example of a flow arrival and its rerouting. The example can be generalized to the case of many flows.

"Cautious" Rerouting Decisions

Ideally, we want fine-grained traffic rerouting, e.g., packet spraying, to maximize network utilization. However, this increases the number of packets that would arrive out-of-order in unpredictable patterns and would require multiple rounds of sorting at the receiving end. Thus, it is crucial for the rerouting design to produce predictable packet arrival patterns in order to exploit the hardware reordering capabilities efficiently. How can this be done?

We perform rerouting under the following three conditions: (i) the existing path is congested, (ii) there exists a viable path that is not congested, and (iii) any out-of-order packets caused by previous reroutes have been received at the destination ToR.

Conditions (i) and (ii) are imposed to ensure that rerouting is needed and a good alternative path is available. Condition (iii) is imposed to produce predictable arrival patterns, in the sense that any flow can have in-flight packets on at most two paths at any instant in time. The reason is the following. For rerouting to occur, condition (iii) must be met: all packets sent before the previous rerouting have already arrived at the destination ToR switch, so all current in-flight packets are traveling on a single path. After rerouting, there can be at most two active paths with in-flight packets. Condition (iii) then prohibits another rerouting until the condition becomes true again.
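
To make the three conditions concrete, here is a minimal Python sketch of the per-flow decision logic; the names (flow.clear_received, rtt_threshold, the paths table) are illustrative stand-ins for ConWeave's actual data-plane state, not its implementation.

```python
def choose_reroute(flow, paths, rtt_threshold):
    """Return a new path only if all three ConWeave rerouting conditions hold (sketch)."""
    # (i) the current path is congested, e.g., its measured RTT exceeds the cutoff
    current_congested = paths[flow.current_path].rtt > rtt_threshold

    # (ii) there exists a viable path that is not congested
    alternatives = [p for p in paths
                    if p != flow.current_path and paths[p].rtt <= rtt_threshold]

    # (iii) out-of-order packets from the previous reroute have all reached the
    #       DstToR (signalled by CLEAR), so in-flight packets sit on a single path
    previous_reroute_cleared = flow.clear_received

    if current_congested and alternatives and previous_reroute_cleared:
        return min(alternatives, key=lambda p: paths[p].rtt)  # best candidate path
    return None  # keep the current path
```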

Next, we describe ConWeave’s rerouting mechanism using Figure 8. In the initial state, a flow always begins with a new epoch.

(Figure 8: step-by-step illustration of ConWeave's rerouting mechanism.)

Masking Packet Reordering

To ensure in-order packet delivery to the end hosts, we make use of the sorting primitives outlined in §2. Knowing that any flow can only have in-flight packets in at most two paths, sorting the packet streams is simple and can be done using only one queue to hold the out-of-order packets.
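
Because a flow can have in-flight packets on at most two paths, the destination-side logic only needs one hold queue per rerouted flow: park REROUTED packets until the old path's TAIL arrives, then flush. A minimal Python sketch of that behaviour follows (the switch realizes it with a paused/resumed hardware queue rather than software buffering):

```python
from collections import deque

class FlowReorderer:
    """Sketch of DstToR reordering: REROUTED packets that arrive before the old
    path's TAIL are parked in a single queue; the TAIL releases them in order."""

    def __init__(self):
        self.hold = deque()        # the one queue of out-of-order packets
        self.tail_arrived = False  # has the old path's TAIL been seen?

    def on_packet(self, pkt, deliver):
        if pkt.rerouted and not self.tail_arrived:
            self.hold.append(pkt)            # out of order: park until TAIL arrives
            return
        deliver(pkt)                         # in-order packet (old path, or after TAIL)
        if pkt.tail:
            self.tail_arrived = True
            while self.hold:                 # "resume": flush the parked packets
                deliver(self.hold.popleft())
```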

Implementation

We implement ConWeave’s data plane on an Intel Tofino2 programmable switch in ∼2400 lines of P4_16 [15] code. The data plane consists of two key components, i.e., the rerouting module and the sorting module. The implementation is available under [4].

ConWeave packet headers: We depict the layout of ConWeave’s packet headers in Fig. 10. To minimize overhead, ConWeave repurposes the reserved fields (which are not included in the invariant CRC computation) in the RDMA BTH header [14]. We use 8 bits to hold the PathID field, which allows us to express up to 255 uplink paths in a 2-tier Clos topology in our prototype. Next, the 3-bit Opcode field is used to differentiate between packets, i.e., normal, RTT_REQUEST, RTT_REPLY, CLEAR, and NOTIFY. The 2-bit Epoch field indicates the epoch of the packet. Finally, the remaining 2 bits are allocated to the REROUTED and TAIL flags, respectively. In addition, we include a separate ConWeave header to carry the TX_TSTAMP (the time when the packet leaves the SrcToR) and the TAIL_TX_TSTAMP (the time when the last TAIL was sent by the SrcToR).
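
The field widths above sum to 15 bits (8 + 3 + 2 + 1 + 1). As a rough illustration of the layout, here is a Python sketch that packs and unpacks those fields; the concrete bit positions and opcode numbering are assumptions made for the example, not the prototype's actual encoding.

```python
# Assumed bit positions, high to low: PathID(8) | Opcode(3) | Epoch(2) | REROUTED(1) | TAIL(1)
OPCODES = {"NORMAL": 0, "RTT_REQUEST": 1, "RTT_REPLY": 2, "CLEAR": 3, "NOTIFY": 4}

def pack_conweave_bits(path_id, opcode, epoch, rerouted, tail):
    assert 0 <= path_id < 256 and 0 <= epoch < 4
    return (path_id << 7) | (OPCODES[opcode] << 4) | (epoch << 2) \
           | (int(rerouted) << 1) | int(tail)

def unpack_conweave_bits(value):
    return {
        "path_id":  (value >> 7) & 0xFF,
        "opcode":   (value >> 4) & 0x7,
        "epoch":    (value >> 2) & 0x3,
        "rerouted": (value >> 1) & 0x1,
        "tail":      value       & 0x1,
    }

assert unpack_conweave_bits(pack_conweave_bits(42, "CLEAR", 1, True, False))["path_id"] == 42
```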

RDMA BTH Header

The BTH (Base Transport Header) is the basic transport header of an RDMA packet, used to carry and process RDMA messages across the network.

  • The BTH carries the destination Queue Pair and the packet sequence number, allowing the receiver to correctly reassemble the packets it receives
  • This mechanism ensures in-order delivery and data integrity

For the specifics, it is worth revisiting the RDMA paper.

ConWeave packets: We refer to Table 2 for the ConWeave packets. ConWeave piggybacks information on existing packets to reduce overhead. More specifically, the DstToR mirrors the RTT_REQUEST it receives and modifies it before sending the modified packet back to the SrcToR as the RTT_REPLY. For CLEAR, we mirror and modify the TAIL or the T_resume timer packet at the DstToR. Finally, for NOTIFY, the DstToR mirrors and modifies the packet carrying the congestion signals (e.g., with the ECN bit marked) and then sends it to the SrcToR. Note that ConWeave control packets are always transmitted with the highest priority in the network and with payload truncation to ensure a low-latency feedback loop.

This paragraph is a bit hard to follow, so here is some additional explanation:

ConWeave's Packet Handling Mechanism

ConWeave reduces overhead by piggybacking information on existing packets; three kinds of control packets are involved:

RTT Measurement Packets

  Original packet (RTT_REQUEST) -> DstToR -> modified and returned (RTT_REPLY)

Example:

  1. The SrcToR sends an RTT_REQUEST packet to measure latency
  2. On receiving it, the DstToR modifies the packet contents (e.g., adds a timestamp)
  3. The modified packet is returned to the SrcToR as the RTT_REPLY

CLEAR Packets

  TAIL packet or timer packet -> mirrored and modified at the DstToR -> returned to the SrcToR

Example:

  1. A TAIL packet is sent when the flow ends
  2. Or a packet is generated when the resume timer fires
  3. The DstToR mirrors and modifies it
  4. The result is returned to the SrcToR to indicate that the associated state can be cleared

NOTIFY Packets

  Packet carrying a congestion mark -> mirrored and modified at the DstToR -> notifies the SrcToR

Example:

  1. Somewhere in the network a packet is marked with the ECN congestion flag
  2. The DstToR detects the congestion mark
  3. It mirrors the packet and adds the necessary information
  4. The mirrored packet is sent to the SrcToR to report the congestion

Key Characteristics

  1. All control packets are transmitted at the highest priority
  2. Payloads are truncated to keep the feedback loop low-latency
  3. Existing packets are reused instead of generating new ones, reducing extra overhead

The concrete implementation details below are, in my opinion, less important and can be skipped for now:

Timestamp resolution: Timestamps are used extensively in ConWeave. To minimize bandwidth and header overhead, we use only 16-bit timestamps in the ConWeave header (e.g., TX_TSTAMP and TAIL_TX_TSTAMP). With 16 bits, we can keep track of up to 32 ms at 1 µs resolution. The most significant bit is used to keep track of potential wraparounds. We believe this is sufficient to handle the worst-case ToR-to-ToR path delay in data center networks.
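
The wraparound handling boils down to modulo-2^16 arithmetic on microsecond timestamps. Here is a small Python sketch of the kind of computation involved (illustrative; the switch encodes the wraparound hint in the header's most significant bit):

```python
TS_BITS = 16               # 16-bit timestamps at 1 us resolution
TS_MOD = 1 << TS_BITS      # 65536 us of raw range; ~32 ms usable with the MSB reserved

def ts16(clock_us):
    """Truncate a free-running microsecond clock to a 16-bit timestamp."""
    return clock_us & (TS_MOD - 1)

def elapsed_us(tx_tstamp, rx_clock_us):
    """Delay in microseconds, tolerating a single 16-bit wraparound (sketch)."""
    return (ts16(rx_clock_us) - tx_tstamp) % TS_MOD

# Example: sent at 65,530 us, received at 65,545 us -> the counter wrapped,
# but the computed delay is still 15 us.
assert elapsed_us(ts16(65_530), 65_545) == 15
```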

3.4.1 Rerouting Module. To perform continuous RTT monitoring (§3.2.1), we use register arrays to store the timestamp at which the last RTT_REQUEST was sent in the data plane. Every forwarded packet checks against the stored timestamp to determine whether the RTT_REPLY cutoff has been exceeded, so that rerouting may be triggered. In addition, we maintain a set of states to keep track of rerouting status, e.g., timestamps to track connection status, the current epoch, and whether the current path has been rerouted.

For rerouting to happen (§3.2.2), there need to be available paths to select from. We keep track of the uplink path statuses using a 4-way associative hash table implemented with four register arrays spanning four pipeline stages. A packet accesses all four registers to sample two paths and then decides whether to reroute.
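
A rough Python analogue of that lookup is sketched below; the per-way hashing, table sizes, and two-path sampling policy are assumptions made for illustration, not the actual register layout.

```python
import random

class PathStatusTable:
    """Illustrative 4-way path-status table: four arrays (one per pipeline stage),
    each indexed by a per-way hash of the path ID."""

    WAYS, SLOTS = 4, 64

    def __init__(self):
        self.arrays = [[None] * self.SLOTS for _ in range(self.WAYS)]  # register arrays

    def _slot(self, way, path_id):
        return hash((way, path_id)) % self.SLOTS    # per-way hash (assumption)

    def mark(self, path_id, congested):
        """Record the latest status of a path in the first way that can hold it."""
        for way in range(self.WAYS):
            idx = self._slot(way, path_id)
            entry = self.arrays[way][idx]
            if entry is None or entry[0] == path_id:
                self.arrays[way][idx] = (path_id, congested)
                return

    def sample_good_path(self, all_paths, exclude, k=2):
        """Sample up to k candidate paths and return one not marked congested."""
        candidates = [p for p in all_paths if p != exclude]
        for path_id in random.sample(candidates, min(k, len(candidates))):
            marked_bad = any(self.arrays[w][self._slot(w, path_id)] == (path_id, True)
                             for w in range(self.WAYS))
            if not marked_bad:
                return path_id
        return None
```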

3.4.2 Reordering Module. To reorder packets, we make use of the queue pause/resume feature on the Intel Tofino2 [5, 39] to hold the out-of-order packets, i.e., REROUTED packets that arrive before the TAIL. For each downlink (depending on the port link rate), we dedicate N − 1 of the N queues (e.g., 31 out of 32 queues for a 100G link). At any given time, reordering can be done for up to M × (N − 1) flows, where M is the number of downlinks to servers. Later, in §4.1.3, we show that only a fraction of the queues is needed.
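
The capacity bound is simple arithmetic; with hypothetical numbers (the 32-downlink figure is an assumption for the example, not a figure from the paper):

```python
N = 32              # queues on a 100G port, of which N - 1 hold out-of-order packets
M = 32              # hypothetical number of server-facing downlinks on the ToR
print(M * (N - 1))  # 992 flows can be reordered concurrently on this switch
```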

To reorder packets, a flow first needs to be assigned an available queue to hold its out-of-order packets. Similarly, we use a 4-way associative hash table, realized with four register arrays, to look up available queues. To deal with TAIL losses, we use an individual resume timer for each queue. Since today’s programmable switches lack timers, we realize the resume timer by mirroring the first out-of-order packet with payload truncation, appending the specific connection information (e.g., connection ID and assigned queue) to the header, and then recirculating it. On every recirculation, the packet checks the associated timeout value.
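
Since there is no hardware timer, the "timer" is just a truncated packet that keeps re-entering the pipeline until it either observes the flushed queue or times out. A software analogue of that loop, with illustrative names, might look like this:

```python
import time

def resume_timer(conn_id, queue_id, timeout_us,
                 queue_is_flushed, resume_queue, recirc_delay_us=1.0):
    """Software stand-in for the recirculating resume-timer packet (sketch).
    On Tofino2 this is a mirrored, payload-truncated packet that recirculates
    roughly every microsecond; here a loop plays that role."""
    deadline = time.monotonic() + timeout_us / 1e6
    while True:
        if queue_is_flushed(conn_id, queue_id):
            return "dropped"                  # TAIL arrived and the queue drained
        if time.monotonic() >= deadline:
            resume_queue(conn_id, queue_id)   # TAIL presumed lost: flush anyway
            return "fired"
        time.sleep(recirc_delay_us / 1e6)     # one "recirculation" pass
```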

Once the assigned queue is flushed, the recirculated packet updates the corresponding entry in the hash table to mark the queue as available for other flows, and the recirculated packet itself is then dropped. Note that Tofino2 supports 400 Gbps of recirculation bandwidth and one recirculation in ConWeave typically takes ≈1 µs. Thus, recirculation adds no queuing delay unless the total volume of recirculated packets exceeds one BDP of the recirculation loop (≈50 KB), or the number of connections concurrently undergoing reordering exceeds ∼800, which is extremely rare.
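
The ≈50 KB and ∼800-connection figures follow from the recirculation bandwidth and latency; a quick check (the 64-byte truncated-packet size is an assumption made for the estimate):

```python
recirc_bw_bps = 400e9                    # Tofino2 recirculation bandwidth
recirc_pass_s = 1e-6                     # one recirculation takes about 1 us

bdp_bytes = recirc_bw_bps / 8 * recirc_pass_s
print(bdp_bytes)                         # 50_000 bytes -> the ~50 KB loop BDP

truncated_pkt_bytes = 64                 # assumed size of a truncated timer packet
print(bdp_bytes // truncated_pkt_bytes)  # ~781 -> on the order of ~800 timers in flight
```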

3.4.3 Dataplane resource utilization. ConWeave uses stateful ALUs (SALUs) to maintain the connection states, e.g., timestamps, timers, path status, available queues, etc., in the data plane. In our prototype implemented on the Intel Tofino2, ConWeave requires ∼22% of the total SRAM memory and uses ∼44% of the available SALUs. The usage of other hardware resources (e.g., hash bits and VLIW instructions) is no more than ∼15% of what is available on the switch. In the current prototype implementation, we did not integrate ConWeave with the reference data-center switch implementation, i.e., switch.p4. Instead, we applied our own L2/L3 switching and routing implementation to realize a multi-tier topology via network virtualization. Based on the current implementation, we believe there is sufficient headroom for integration with other data-plane programs. In cases where there are more active connections than ConWeave supports, ConWeave falls back to ECMP for the rest while dynamically maintaining hardware state for hot and active connections [33, 57].
