IRN Design
We begin by describing the transport logic for IRN. For simplicity, we present it as a general design independent of the specific RDMA operation types. We go into the details of handling specific RDMA operations with IRN later in §5.
Changes to the RoCE transport design may introduce overheads in the form of new hardware logic or additional per-flow state. With the goal of keeping such overheads as small as possible, IRN strives to make minimal changes to the RoCE NIC design in order to eliminate its PFC requirement, as opposed to squeezing out the best possible performance with a more sophisticated design (we evaluate the small overhead introduced by IRN later in §6).
IRN, therefore, makes two key changes to current RoCE NICs, as described in the following subsections: (1) improving the loss recovery mechanism, and (2) adding basic end-to-end flow control (termed BDP-FC), which bounds the number of in-flight packets by the bandwidth-delay product of the network. We justify these changes by empirically evaluating their significance and exploring some alternative design choices later in §4.3. Note that these changes are orthogonal to the use of explicit congestion control mechanisms (such as DCQCN [38] and Timely [29]) that, as with current RoCE NICs, can be optionally enabled with IRN.
IRN’s Loss Recovery Mechanism
As discussed in §2, current RoCE NICs use a go-back-N loss recovery scheme. In the absence of PFC, redundant retransmissions caused by go-back-N loss recovery result in significant performance penalties (as evaluated in §4). Therefore, the first change we make with IRN is a more efficient loss recovery scheme, based on selective retransmission (inspired by TCP’s loss recovery), where the receiver does not discard out-of-order packets and the sender selectively retransmits the lost packets, as detailed below.
Upon every out-of-order packet arrival, an IRN receiver sends a NACK, which carries both the cumulative acknowledgment (indicating its expected sequence number) and the sequence number of the packet that triggered the NACK (as a simplified form of selective acknowledgement or SACK).
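The receiver behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the NIC implementation: the names `Ack` and `ReceiverSketch` are hypothetical, packets are modelled as bare sequence numbers, and the choice to send a plain cumulative ACK for every in-order packet is an assumption, since the text only specifies what happens on out-of-order arrivals.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Ack:
    cumulative: int             # next sequence number the receiver expects
    sack: Optional[int] = None  # set only on a NACK: the packet that triggered it

    @property
    def is_nack(self) -> bool:
        return self.sack is not None


@dataclass
class ReceiverSketch:
    expected: int = 0                                # next in-order sequence number
    buffered: Set[int] = field(default_factory=set)  # out-of-order packets kept, not dropped

    def on_packet(self, seq: int) -> Ack:
        if seq == self.expected:
            # In-order arrival: advance past any buffered packets that are now contiguous.
            self.expected += 1
            while self.expected in self.buffered:
                self.buffered.remove(self.expected)
                self.expected += 1
            return Ack(cumulative=self.expected)
        if seq > self.expected:
            # Out-of-order arrival: keep the packet and send a NACK carrying
            # both the cumulative ack and the triggering sequence number.
            self.buffered.add(seq)
            return Ack(cumulative=self.expected, sack=seq)
        # Duplicate of an already delivered packet (assumed: just re-ack).
        return Ack(cumulative=self.expected)
```

For instance, if packets 0, 1 and 3 arrive in that order, the third reply is a NACK with `cumulative=2` and `sack=3`. Once packet 2 arrives, the cumulative acknowledgement jumps to 4.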
An IRN sender enters loss recovery mode when a NACK is received or when a timeout occurs. It also maintains a bitmap to track which packets have been cumulatively and selectively acknowledged. When in the loss recovery mode, the sender selectively retransmits lost packets as indicated by the bitmap, instead of sending new packets. The first packet that is retransmitted on entering loss recovery corresponds to the cumulative acknowledgement value. Any subsequent packet is considered lost only if another packet with a higher sequence number has been selectively acked. When there are no more lost packets to be retransmitted, the sender continues to transmit new packets (if allowed by BDP-FC). It exits loss recovery when a cumulative acknowledgement greater than the recovery sequence is received, where the recovery sequence corresponds to the last regular packet that was sent before the retransmission of a lost packet.
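One way to read the sender rules above is the following sketch. The class `IrnSenderSketch` and all of its field names are hypothetical, a Python set stands in for the hardware bitmap, and the BDP-FC check on new packets is deliberately left out to keep the example short. The cumulative acknowledgement is taken to be the next sequence number the receiver expects.

```python
class IrnSenderSketch:
    """Hypothetical sender state for IRN-style selective retransmission."""

    def __init__(self, total_packets: int) -> None:
        self.total = total_packets
        self.next_new = 0        # next never-sent sequence number
        self.cum_acked = 0       # every sequence number below this is acknowledged
        self.sacked = set()      # stands in for the bitmap of selective acks
        self.in_recovery = False
        self.recovery_seq = -1   # last regular packet sent before recovery began
        self.retx_ptr = 0        # next candidate to examine for retransmission
        self.first_retx_pending = False

    def _enter_recovery(self) -> None:
        self.in_recovery = True
        self.recovery_seq = self.next_new - 1
        self.first_retx_pending = True

    def on_ack(self, cumulative: int, sack=None) -> None:
        self.cum_acked = max(self.cum_acked, cumulative)
        # Entries at or below the cumulative ack no longer need tracking.
        self.sacked = {s for s in self.sacked if s > self.cum_acked}
        if sack is not None and sack > self.cum_acked:
            self.sacked.add(sack)
            if not self.in_recovery:
                self._enter_recovery()
        # Exit once the cumulative ack has passed the recovery sequence.
        if self.in_recovery and self.cum_acked > self.recovery_seq:
            self.in_recovery = False

    def on_timeout(self) -> None:
        # Assumption: a timeout (re)starts recovery only if data is outstanding.
        if self.cum_acked < self.next_new:
            self._enter_recovery()

    def next_packet(self):
        """Return ("retx", seq), ("new", seq), or None if nothing is left to send."""
        if self.in_recovery:
            if self.first_retx_pending and self.cum_acked < self.next_new:
                # The first retransmission is the packet at the cumulative ack.
                self.first_retx_pending = False
                self.retx_ptr = self.cum_acked + 1
                return ("retx", self.cum_acked)
            highest_sacked = max(self.sacked, default=-1)
            seq = max(self.retx_ptr, self.cum_acked)
            # A later packet counts as lost only if a higher one was sacked.
            while seq < highest_sacked:
                if seq not in self.sacked:
                    self.retx_ptr = seq + 1
                    return ("retx", seq)
                seq += 1
            self.retx_ptr = seq
        if self.next_new < self.total:
            seq = self.next_new
            self.next_new += 1
            return ("new", seq)
        return None
```

With packets 0 to 5 in flight and packet 2 lost, the first NACK puts the sender in recovery with a recovery sequence of 5, and it retransmits packet 2. Once packets 3 to 5 have been selectively acknowledged, it goes back to sending new packets. It leaves recovery when the cumulative acknowledgement exceeds 5.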
SACKs allow efficient loss recovery only when there are multiple packets in flight. For other cases (e.g., for single-packet messages), loss recovery gets triggered via timeouts. A high timeout value can increase the tail latency of such short messages. However, keeping the timeout value too small can result in too many spurious retransmissions, affecting the overall results. An IRN sender, therefore, uses a low timeout value of \(RTO_{low}\) only when there is a small number N of packets in flight (such that spurious retransmissions remain negligibly small), and a higher value of \(RTO_{high}\) otherwise. We discuss how the values of these parameters are set in §4, and how the timeout feature in current RoCE NICs can be easily extended to support this in §6.
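The two-level timeout can be sketched as a single selection function. The function name, the microsecond units and the numbers in the usage note are illustrative assumptions only; the actual parameter values are the ones the paper sets in §4. Reading "a small number N of packets in flight" as "at most N" is also an assumption.

```python
def select_rto(packets_in_flight: int, n_threshold: int,
               rto_low_us: int, rto_high_us: int) -> int:
    """Pick the retransmission timeout (in microseconds) for the current state.

    With few packets in flight, a loss is unlikely to be revealed by a NACK,
    so the short timeout is used; otherwise the longer one avoids spurious
    retransmissions.
    """
    if packets_in_flight <= n_threshold:
        return rto_low_us
    return rto_high_us
```

With placeholder values `n_threshold=3`, `rto_low_us=100` and `rto_high_us=300`, a flow with one packet in flight gets the 100 µs timeout, while a flow with 50 packets in flight gets 300 µs.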
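This part is quite subtle, and the original paper is the best reference. The following AI-generated example is included to make it easier to follow:
(1) The old, clumsy way (RoCE's go-back-N)
Suppose Lao Wang is mailing Xiao Mei a 10-volume comic set (#1 to #10).
- Lao Wang ships #1, #2, #3, #4, #5... one by one, in order.
- Then something goes wrong: the courier loses volume #3.
- Xiao Mei receives #1 and #2, and then #4 suddenly shows up!
- Xiao Mei's clumsy approach: she is stubborn and insists on receiving everything in order. When #4 arrives she thinks, "That's not right, my #3 isn't here yet! I don't want this #4!" and throws it away. She also throws away #5, #6, #7... as the courier delivers them.
- Meanwhile, she phones Lao Wang (sends a NACK): "Hey, Lao Wang, I'm still waiting for #3!"
- Lao Wang's clumsy approach: he gets the call and only knows that #3 has not arrived. He has no way of knowing which of the later volumes Xiao Mei kept, so the simplest thing he can do is: "Fine, I'll start from #3 and resend everything after it!"
Result: Lao Wang resends #3, #4, #5, #6, #7, #8, #9 and #10. Volumes #4 to #10 were actually delivered the first time and only thrown away by Xiao Mei, so resending them causes a great deal of waste and congestion 😅😅😅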
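(2) The new, smart way (IRN's selective retransmission)
Let's see how the smarter IRN handles the same situation, with volume #3 lost again.
- Xiao Mei receives #1 and #2, and then #4 suddenly shows up.
- Xiao Mei's smart approach: she does not throw #4 away! She accepts it and keeps it on a nearby bookshelf.
- Then she makes a more informative phone call to Lao Wang (a NACK carrying a SACK): "Hey, Lao Wang! I have received #1 and #2 (this is the cumulative acknowledgement). Also, I just received #4, which tells me I am missing #3 (this is the selective acknowledgement)!"
- Lao Wang's smart approach: he keeps a checklist (the bitmap). After the call he ticks off #1, #2 and #4 on it. One look at the checklist and he understands: "Oh, only #3 got lost!"
- So he repacks exactly that one volume, #3, and sends it again.
Result: nothing is wasted! Lao Wang resends only the volume that was really lost. Once Xiao Mei receives the resent #3, it joins #4, #5... on the bookshelf to complete the set. Very efficient! 😍😍😍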
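The difference between the two outcomes in the comic-book example can be checked with a small calculation. This is a simplified model rather than a protocol implementation. The function names are hypothetical, volumes are numbered from 1, all volumes are assumed to have been sent before the first NACK arrives, and retransmissions are assumed never to be lost.

```python
from typing import Set


def go_back_n_retransmissions(total: int, lost: Set[int]) -> int:
    """Packets resent by go-back-N: everything from the first lost one onward."""
    if not lost:
        return 0
    return total - min(lost) + 1


def selective_retransmissions(lost: Set[int]) -> int:
    """Packets resent by selective retransmission: only the lost ones."""
    return len(lost)


if __name__ == "__main__":
    print(go_back_n_retransmissions(10, {3}))  # volumes #3..#10 are resent
    print(selective_retransmissions({3}))      # only volume #3 is resent
```

For the 10-volume set with #3 lost, this gives 8 retransmissions under go-back-N against 1 under selective retransmission.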
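(3) How does IRN handle timeouts?
The method above rests on one precondition: Xiao Mei has to receive an out-of-order parcel (such as #4) before she can tell that #3 was lost 🌟
But what if the very last parcel is the one that gets lost? Or what if only a single parcel is sent, and it is lost?
In that case Xiao Mei never receives an out-of-order parcel and can only wait. Lao Wang hears nothing back and can only wait as well. The only way out of this awkward stalemate is a timeout.
IRN uses a smart alarm-clock strategy for this:
Scenario 1: sending a big pile of comics
Lao Wang ships 10 volumes at once. He knows that even if one is lost along the way, other volumes will soon arrive and Xiao Mei will call to report the gap. So he can afford to be patient and set the alarm for a longer time (the higher timeout value \(RTO_{high}\)). This keeps him from nervously assuming a loss and resending at random just because the courier is slightly slow.
Scenario 2: sending a single comic
This time Lao Wang sends Xiao Mei just one out-of-print signed volume. If this lone parcel is lost, no out-of-order arrival will ever reveal it, and both of them would wait for a long time.
So this time Lao Wang sets a very short alarm (the lower timeout value \(RTO_{low}\)). If the alarm goes off before Xiao Mei has replied "received", Lao Wang concludes right away that something went wrong and sends another copy. This ensures that a lost single-packet message is recovered quickly instead of leaving both sides waiting.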
IRN’s BDP-FC Mechanism
The second change we make with IRN is introducing the notion of a basic end-to-end packet-level flow control, called BDP-FC, which bounds the number of outstanding packets in flight for a flow by the bandwidth-delay product (BDP) of the network, as suggested in [17]. This is a static cap that we compute by dividing the BDP of the longest path in the network (in bytes) by the packet MTU set by the RDMA queue-pair (typically 1KB in RoCE NICs). An IRN sender transmits a new packet only if the number of packets in flight (computed as the difference between the current packet’s sequence number and the last acknowledged sequence number) is less than this BDP cap.
BDP-FC improves performance by reducing unnecessary queuing in the network. Furthermore, by strictly upper bounding the number of out-of-order packet arrivals, it greatly reduces the amount of state required for tracking packet losses in the NICs (discussed in more detail in §6).
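As a rough sketch, the BDP cap and the send check can be written as below. The function names, the integer units (bits per second and microseconds) and rounding the cap down are assumptions made for illustration. The link speed and round-trip time in the usage note are placeholder numbers, not values taken from this section.

```python
def bdp_cap_packets(bandwidth_bps: int, rtt_us: int, mtu_bytes: int) -> int:
    """Static BDP-FC cap: BDP of the longest path (in bytes) divided by the MTU."""
    bdp_bytes = bandwidth_bps * rtt_us // (8 * 1_000_000)
    # Assumption: round down, but always allow at least one packet in flight.
    return max(1, bdp_bytes // mtu_bytes)


def can_send_new_packet(next_seq: int, last_acked_seq: int, cap: int) -> bool:
    """A new packet may be sent only while the in-flight count is below the cap."""
    in_flight = next_seq - last_acked_seq
    return in_flight < cap
```

For example, a 40 Gbps path with a 24 µs round-trip time has a BDP of 120,000 bytes. With a 1024-byte MTU, `bdp_cap_packets(40_000_000_000, 24, 1024)` gives a cap of 117 packets, so the sender stalls on new packets once 117 are unacknowledged.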
As mentioned before, IRN’s loss recovery has been inspired by TCP’s loss recovery. However, rather than incorporating the entire TCP stack as is done by iWARP NICs, IRN: (1) decouples loss recovery from congestion control and does not incorporate any notion of TCP congestion window control involving slow start, AIMD or advanced fast recovery, (2) operates directly on RDMA segments instead of using TCP’s byte stream abstraction, which not only avoids the complexity introduced by multiple translation layers (as needed in iWARP), but also allows IRN to simplify its selective acknowledgement and loss tracking schemes. We discuss how these changes affect performance towards the end of §4.