
Reordering out-of-order packets in the network: A primer

Generally, end-hosts assume the responsibility for putting out-of-order packets back in order. With the emergence of commodity programmable switches, we believe that the network itself can, and should, play an important role in reordering packets. Ideally, with data-plane support, more fine-grained load-balancing mechanisms can be introduced if the network can mask this out-of-order behavior from the end hosts. So the question becomes: how much reordering can commodity programmable switches, e.g., the Intel Tofino switches, support?


Scenario: To make the discussion more concrete, we use the hypothetical scenario presented in Fig. 5. In the example, we assume packets are sent every 2 µs. Fig. 5a shows four packets of a single flow transmitted across paths with different delays, e.g., 9 µs, 6 µs, and 1 µs for the paths via S1, S2, and S3, respectively. At the source-ToR, the packets are sent to paths S1, S2, and S3 to find a better path in an iterative way. Fig. 5b shows the packet in-flight times and the sequence of arrivals at the destination-ToR, i.e., 4, 2, 1, and 3.


[Fig. 5: Example scenario — (a) four packets of one flow sent every 2 µs across paths with 9 µs, 6 µs, and 1 µs delays; (b) in-flight times and the arrival order 4, 2, 1, 3 at the destination-ToR.]
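
To make the timing concrete, here is a minimal sketch that recomputes the arrival order from the numbers above. The figure itself is not reproduced here, so the per-packet path assignment below is only a hypothetical one that happens to be consistent with the stated arrival order 4, 2, 1, 3; the 2 µs send interval and the 9/6/1 µs path delays are taken from the scenario text.

```python
# One possible reconstruction of the Fig. 5 timing (the path choices for
# packets 3 and 4 are assumptions, not taken from the figure).
SEND_INTERVAL_US = 2
PATH_DELAY_US = {"S1": 9, "S2": 6, "S3": 1}

# Hypothetical packet -> path assignment that reproduces the arrival order 4, 2, 1, 3.
assignment = {1: "S1", 2: "S2", 3: "S2", 4: "S3"}

arrival_time = {
    seq: (seq - 1) * SEND_INTERVAL_US + PATH_DELAY_US[path]
    for seq, path in assignment.items()
}
arrival_order = sorted(arrival_time, key=arrival_time.get)
print(arrival_time)   # {1: 9, 2: 8, 3: 10, 4: 7}
print(arrival_order)  # [4, 2, 1, 3]
```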

Reordering on a Programmable Switch

In this paper, packet reordering refers specifically to the process of buffering and releasing the received packets at the destination-ToR switch so that the packets will be “restored” back to the same order as they were sent by the source-ToR switch.

To answer the question of how much reordering a commodity programmable switch can perform, we first explore what features are needed to reorder packets in the data plane. For simplicity, we assume that there is no packet loss. Basically, we need to keep track of the next expected packet sequence number to ensure in-order delivery, and hold out-of-order packets on a per-flow basis if needed. For example, if the expected sequence number is 1 but a packet with sequence number 2 arrives, it has to be held until packet 1 arrives.

To realize this logic, we need two key elements:

(1) a primitive to hold/release packets (say, a queue),

(2) stateful memory/operations that allow us to update/check state upon packet arrival. The state operations include keeping track of the packet sequence number and the associated queue being used.
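
A minimal sketch of this per-flow hold/release logic, written in host-side Python purely for illustration (on the switch, element (1) would be a real queue and element (2) a register, not a Python dictionary):

```python
# Minimal per-flow reordering state: the next expected sequence number plus a
# buffer of packets that arrived too early, keyed by sequence number.
class FlowReorderState:
    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq   # element (2): stateful memory
        self.held = {}              # element (1): hold primitive (queues on real hardware)

    def on_arrival(self, seq: int, pkt) -> list:
        """Return the packets that can now be forwarded, in order."""
        released = []
        if seq != self.next_seq:
            self.held[seq] = pkt    # out of order: hold until the gap is filled
            return released
        released.append(pkt)        # in order: forward immediately
        self.next_seq += 1
        while self.next_seq in self.held:   # drain held packets that are now in sequence
            released.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return released
```

With arrivals 2 then 1, for example, `on_arrival(2, "p2")` returns `[]` and `on_arrival(1, "p1")` returns `["p1", "p2"]`.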


Required features: Fortunately, recent programmable switches have such capabilities. Using the Tofino2 [5] as an example:

(1) Up to 128 First-In-First-Out (FIFO) queues per egress port with priority-based queue scheduling.

(2) Pause/Resume capability of an individual queue while maintaining line-rate packet processing for other queues.

(3) Tens of MBs of stateful memory (e.g., register arrays) which can be updated by ALUs in the data plane.


Realizing mechanism: In Fig. 6, we illustrate how the above primitives are used in the example scenario. In the beginning, we assign a queue Q0 dedicated to forwarding in-order packets. When packet 4 arrives, we compare its sequence number with the state NEXT_SEQ and find that it is out of order, so we hold it in the empty queue Q1. Similarly, when packet 2 arrives out of order, we hold it in Q2. Next, packet 1 arrives in order, so we immediately forward it through Q0. At this point, packet 2 is the next packet in sequence and is released from Q2, and NEXT_SEQ is increased to 3 accordingly. Lastly, when packet 3 arrives, we forward it, flush Q1, and increment NEXT_SEQ to 5 in a similar way.


[Fig. 6: Reordering walk-through for the arrival order 4, 2, 1, 3 — out-of-order packets are held in spare queues (Q1, Q2) and released as NEXT_SEQ advances.]
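
The walk-through can be reproduced with a small simulation; this is only an illustrative model of the mechanism as described above (the queue names and the function name are ours, not a Tofino program):

```python
# Toy model of the queue-based reordering walk-through in Fig. 6.
# Q0 forwards in-order packets immediately; each spare queue holds one
# out-of-order packet and is flushed when its packet becomes next in sequence.
def reorder_trace(arrivals, first_seq=1, spare_queues=("Q1", "Q2", "Q3")):
    next_seq = first_seq
    held = {}                      # seq -> name of the queue holding it
    free = list(spare_queues)      # queues currently empty
    trace = []
    for seq in arrivals:
        if seq != next_seq:
            q = free.pop(0)        # park the packet in an empty queue
            held[seq] = q
            trace.append(f"pkt {seq}: out of order, hold in {q}")
            continue
        trace.append(f"pkt {seq}: in order, forward via Q0")
        next_seq += 1
        while next_seq in held:    # flush queues whose packet is now next
            q = held.pop(next_seq)
            free.append(q)
            trace.append(f"pkt {next_seq}: release from {q}")
            next_seq += 1
        trace.append(f"NEXT_SEQ = {next_seq}")
    return trace

for line in reorder_trace([4, 2, 1, 3]):
    print(line)
# pkt 4 held in Q1, pkt 2 held in Q2, pkt 1 forwarded and pkt 2 released
# (NEXT_SEQ = 3), then pkt 3 forwarded and pkt 4 released (NEXT_SEQ = 5).
```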

Practical Considerations

While the above sorting mechanism is logically simple, there are a few practical issues.

Limited queue resources: The idea of holding out-of-order packets in the data plane is bounded by the number of queues available. For instance, while our example needs only two queues to hold two out-of-order packets, holding N out-of-order packets would require N queues in the worst case (imagine N packets arriving in reverse order). Unfortunately, only a limited number of queues can be repurposed for sorting on hardware. Thus, this approach may not be amenable to cases where there can be arbitrary out-of-order arrivals.
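
A quick way to see the worst case is to count, for a given arrival order, how many packets are held simultaneously under the one-packet-per-queue scheme. The helper below is a rough sketch of that count, not part of the paper's design:

```python
# Peak number of hold queues needed for a given arrival order, assuming
# one held packet per queue (as in the scheme above).
def peak_queues(arrivals):
    next_seq, held, peak = 1, set(), 0
    for seq in arrivals:
        if seq != next_seq:
            held.add(seq)
            peak = max(peak, len(held))
            continue
        next_seq += 1
        while next_seq in held:
            held.remove(next_seq)
            next_seq += 1
    return peak

print(peak_queues([4, 2, 1, 3]))           # 2 queues, as in the example
print(peak_queues(list(range(8, 0, -1))))  # 7: reverse order is the worst case
```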


Lack of a sorting primitive: One may consider using packet recirculation within the switch for sorting. However, such an implementation is inefficient. In fact, it is not a viable option, as there are scenarios where the rate of incoming out-of-order packets exceeds the rate of in-order packet forwarding. In such cases, the recirculation port overflows, resulting in packet drops. Another option is to use a sorted queue (e.g., PIFO [56]). Unfortunately, such data structures are not available on existing switch ASICs, and their approximate implementations (e.g., AIFO [61] and SP-PIFO [10]) fall short of providing strictly in-order delivery unless substantial hardware resources are dedicated to reordering.
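
For context, a PIFO (push-in first-out) queue admits each packet at an arbitrary position determined by its rank and dequeues only from the head. The sketch below is a host-side conceptual model of a PIFO ranked by sequence number, with the dequeue additionally gated on the next expected sequence number for strict in-order release; it is meant to show what the missing primitive would have to do, not something available on today's switch ASICs.

```python
import bisect

# Conceptual PIFO keyed by sequence number: push-in anywhere (sorted insert),
# pop only from the head. For strict in-order delivery the head is additionally
# gated on the next expected sequence number.
class SeqPifo:
    def __init__(self, first_seq: int = 1):
        self.buf = []              # kept sorted by sequence number
        self.next_seq = first_seq

    def push(self, seq: int):
        bisect.insort(self.buf, seq)

    def pop_ready(self):
        """Release every head-of-queue packet that is now in sequence."""
        out = []
        while self.buf and self.buf[0] == self.next_seq:
            out.append(self.buf.pop(0))
            self.next_seq += 1
        return out

p = SeqPifo()
for seq in (4, 2, 1, 3):
    p.push(seq)
    print(seq, "->", p.pop_ready())   # 4 -> [], 2 -> [], 1 -> [1, 2], 3 -> [3, 4]
```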

In addition, to achieve high-speed processing (e.g., tens of terabits per second), switching ASICs allow only a limited set of stateful operations and impose strict time constraints (e.g., per stage in the packet-processing pipeline). Therefore, a mechanism that works within these restrictions yet still fulfills the packet-reordering requirement is needed.
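
To make the constraint concrete: per-packet state access in such pipelines is roughly one bounded read-modify-write per register per stage, with no loops over buffered packets. The snippet below is our own abstraction of such a single-stage update, not Tofino's actual programming interface.

```python
# Conceptual model of a single-stage stateful update: the packet reads the
# register once, applies a simple ALU predicate, writes back once, and
# leaves with a verdict. No loops, no scanning of buffered packets.
def next_seq_stage(register: dict, flow_id: int, pkt_seq: int) -> str:
    expected = register.get(flow_id, 1)   # one read
    if pkt_seq == expected:
        register[flow_id] = expected + 1  # one conditional write
        return "forward"
    return "hold"                         # out of order: park in a queue

regs = {}
print([next_seq_stage(regs, 0, s) for s in (4, 2, 1, 3)])
# ['hold', 'hold', 'forward', 'hold'] -- releasing held packets must happen
# through a separate mechanism (e.g., flushing a paused queue), not by
# iterating over state in the pipeline.
```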


ASIC

An ASIC (Application Specific Integrated Circuit) is the dedicated integrated circuit on a switch and its core component. It is designed at the hardware level and is mainly responsible for fast packet processing.

In a switch, the ASIC handles packet routing, forwarding, filtering, QoS, and related functions. Because these functions are implemented ==in hardware==, the ASIC offers higher performance and lower latency than software implementations. An ASIC is designed for a specific application scenario, so its performance can be optimized for that scenario.

The ASIC and the CPU play different but complementary roles in a switch:

  • ASIC responsibilities:
    • Performs the main L2/L3 forwarding
    • Maintains the MAC address table and the L3 forwarding table
    • Executes fast, hardware-level forwarding
  • CPU responsibilities:
    • Handles forwarding control
    • Maintains software tables (routing table, ARP table, etc.)
    • Programs the ASIC's hardware forwarding tables
    • Processes special packets that the ASIC cannot handle

Dealing with packet loss: Detecting out-of-order packet arrivals can be done by keeping track of the packet sequence number. However, naive sequence tracking will stall when a packet loss occurs. The standard solution is to set a default waiting time and proceed to the next sequence number after the timeout. However, it is non-trivial to set a proper timeout value given that there can be substantial variability in path delay. Performing this task in the data plane with limited resources makes it even more challenging.
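
A minimal sketch of the timeout idea, assuming a fixed per-flow waiting budget; the timeout value and the names used here are illustrative, and choosing the budget well is exactly the hard part the paragraph refers to.

```python
# Sketch of timeout-based loss handling for the sequence tracker.
TIMEOUT_US = 50  # illustrative waiting budget; too small reorders needlessly, too large stalls

class TimeoutReorder:
    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq
        self.held = {}           # out-of-order packets, keyed by sequence number
        self.gap_since = None    # when we first started waiting for next_seq

    def on_arrival(self, seq: int, pkt, now_us: float) -> list:
        if seq >= self.next_seq:          # ignore packets we already skipped past
            self.held[seq] = pkt
        # Timeout: assume the missing packet(s) are lost and jump to the oldest
        # packet we actually hold. (A real design also needs this to fire without
        # waiting for a new arrival, e.g., via a periodic timer.)
        if self.held and self.gap_since is not None and now_us - self.gap_since > TIMEOUT_US:
            self.next_seq = min(self.held)
        released = []
        while self.next_seq in self.held:
            released.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        if not self.held:
            self.gap_since = None          # not blocked on anything
        elif self.gap_since is None or released:
            self.gap_since = now_us        # newly blocked on a (new) gap
        return released
```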


Sorting Primitive

In high-speed networks, packets may arrive out of order. To guarantee quality of service, these out-of-order packets need to be reordered. However, implementing a sorting function on a switch ASIC faces serious challenges.

Existing solutions and their limitations:

1. Packet recirculation scheme

```text
ingress --> ASIC processing --> recirculation port --> ASIC processing again --> egress
```

  • Problems
    • When the arrival rate of out-of-order packets exceeds the forwarding rate of in-order packets, congestion builds up
    • The recirculation port may overflow, causing packet drops
    • Extra processing latency is added

2. Sorted queue scheme (PIFO)

```text
packet --> priority computation --> insert into queue by priority --> output in order
```

  • Limitations
    • Existing ASICs do not support a true PIFO structure
    • Approximate implementations (e.g., AIFO, SP-PIFO) require substantial hardware resources