Discussion and future work

ConWeave on SmartNICs: While ConWeave is designed to exploit the capabilities of commodity programmable switches, our key insights can also be applied to recent SmartNICs. For instance, the switch logic of ConWeave can be implemented on the Nvidia BlueField DPUs [47] or the Intel E2000 IPUs [32], albeit with different trade-offs in resource usage, deployment cost, and complexity. Nevertheless, running ConWeave on ToR switches has the following advantages. First, it incurs less redundancy in path state maintenance as the ToR switch serves as a natural aggregation point for path monitoring, selection, and switching. Second, since the resources used for packet reordering on the ToR switch can be shared by all servers in the rack, this can lead to reduced resource usage due to statistical multiplexing.

Scaling to larger networks: Switch hardware resource limitations present a scalability challenge for ConWeave when the network expands to millions of servers. We note that hardware resource usage, such as the number of queues and the amount of on-chip memory, is proportional to the number of active flows that necessitate packet reordering at the destination ToR. The number of such flows, in turn, depends only on the number of servers per rack and applications per server. Neither number grows significantly with increasing network size. For example, in a fat-tree topology [8], the size of the network grows cubically with the number of servers per rack.

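
To make the scaling argument concrete, the sketch below computes the standard counts for a k-ary fat-tree. The formulas are the well-known ones for this topology, not taken from the paper: per-rack server count grows linearly in k while total network size grows cubically, so per-ToR reordering state stays roughly flat as the network expands.

```python
# Back-of-envelope sizing for a k-ary fat-tree [8].
# Standard textbook formulas; illustrative, not from the ConWeave paper.
def fat_tree_size(k: int) -> dict:
    """Return basic counts for a k-ary fat-tree (k must be even)."""
    assert k % 2 == 0
    servers_per_rack = k // 2      # hosts under each ToR (edge) switch
    racks = k * k // 2             # k pods, each with k/2 edge switches
    total_servers = k ** 3 // 4    # grows cubically in k
    return {"servers_per_rack": servers_per_rack,
            "racks": racks,
            "total_servers": total_servers}

# Scaling k from 16 to 64: the network grows 64x in size,
# but each ToR only sees 4x more servers under it.
small, large = fat_tree_size(16), fat_tree_size(64)
print(large["total_servers"] // small["total_servers"])        # 64
print(large["servers_per_rack"] // small["servers_per_rack"])  # 4
```
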
In the rare cases where switch hardware resources are exhausted, unresolved out-of-order packets can lead to performance degradation. To reduce the likelihood of resource exhaustion, one can consider either using external switch memory [36] such as host DRAM and SmartNICs to buffer packets temporarily, or applying admission control so that destination ToRs permit source ToRs to perform rerouting only when there are spare resources. We leave these investigations for future work.

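
The admission-control idea can be sketched as a simple resource-reservation handshake. This is a hypothetical illustration of the mechanism described above, not part of the ConWeave prototype; the class and method names are invented.

```python
# Hypothetical sketch of the admission-control idea: a destination ToR
# grants rerouting permission only while spare reordering queues remain.
# All names and the grant/release protocol are illustrative assumptions.
class ReorderAdmission:
    def __init__(self, total_queues: int):
        self.free_queues = total_queues

    def request_reroute(self) -> bool:
        """A source ToR asks permission before rerouting a flow here."""
        if self.free_queues > 0:
            self.free_queues -= 1   # reserve one reordering queue
            return True
        return False                # exhausted: stay on the current path

    def release(self) -> None:
        """Called once reordering for the rerouted flow completes."""
        self.free_queues += 1

adm = ReorderAdmission(total_queues=2)
print(adm.request_reroute(), adm.request_reroute(), adm.request_reroute())
# True True False
```
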
Interaction with congestion control: The primary focus of this work is DCQCN, the de facto standard transport protocol in commodity RNICs. In the ConWeave design, DCQCN's ECN-based congestion marking offers two distinct advantages. First, it ensures that the delay resulting from packet reordering is not erroneously attributed to network congestion. Second, the ECN threshold provides valuable insight into the minimal time required to alleviate congestion within a queue. On the other hand, ConWeave is also compatible with delay-based protocols such as Swift [37]. However, in these cases it is essential that any delay incurred by packet reordering at the destination ToR switch not be interpreted as a congestion signal.

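
As a minimal sketch of the requirement for delay-based protocols, a sender could discount the time a packet spent held in the destination ToR's reordering buffer before using the RTT sample as a congestion signal. This is illustrative only; the function and field names are assumptions, not Swift's actual mechanism.

```python
# Illustrative only: exclude reordering-buffer hold time from an RTT
# sample so a delay-based protocol (e.g., Swift [37]) does not misread
# reordering delay as congestion. Names/units are assumptions.
def congestion_delay(rtt_sample_us: float, reorder_hold_us: float,
                     base_rtt_us: float) -> float:
    """Queueing delay attributable to congestion, excluding reordering."""
    adjusted = rtt_sample_us - reorder_hold_us   # remove buffer hold time
    return max(0.0, adjusted - base_rtt_us)

# A 30us RTT sample with 12us spent waiting for reordering, over a 15us
# base RTT, reflects only 3us of real queueing delay.
print(congestion_delay(30.0, 12.0, 15.0))  # 3.0
```
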
Integrating with rate control: In its current prototype, while ConWeave avoids congested paths through rapid and frequent rerouting, it does not take into account the effect on rate control. For instance, after a flow is rerouted from a congested path to an idle one, congestion feedback from the previous path can still unnecessarily reduce its rate. It will be interesting to investigate how a predictable, scheduled load balancing mechanism like ConWeave can be co-designed with a rate control mechanism.

Example scenario:

  1. A flow is transmitted on path A, where congestion occurs
  2. ConWeave switches the flow to the idle path B
  3. Congestion feedback from path A still reaches the sender
  4. On receiving the congestion signal, the sender lowers its transmission rate
  5. But the traffic is already on the new path B, so lowering the rate is unnecessary
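
One conceivable co-design, sketched below, is to tag congestion feedback with a path epoch so the sender ignores feedback from a path it has already left. This is a hypothetical mechanism under stated assumptions, not part of the ConWeave prototype.

```python
# Hedged sketch of a possible co-design: tag congestion feedback (CNPs)
# with a path epoch and drop stale feedback after a reroute.
# Hypothetical mechanism; names and the 0.5 decrease factor are invented.
class EpochedSender:
    def __init__(self, rate_gbps: float):
        self.rate_gbps = rate_gbps
        self.path_epoch = 0            # bumped on every reroute

    def reroute(self) -> None:
        self.path_epoch += 1           # steps 1-2: move to a new path

    def on_cnp(self, feedback_epoch: int) -> None:
        """Steps 3-5: apply feedback only if it matches the current path."""
        if feedback_epoch == self.path_epoch:
            self.rate_gbps *= 0.5      # simple multiplicative decrease

s = EpochedSender(rate_gbps=100.0)
s.reroute()                # flow has moved off the congested path A
s.on_cnp(feedback_epoch=0) # stale CNP from path A is ignored
print(s.rate_gbps)         # 100.0
```
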

Incremental deployment: ConWeave’s design allows operators to incrementally deploy ConWeave in data centers running alongside non-ConWeave ToR switches. For inter-rack communications that involve non-ConWeave ToR switches, the default ECMP is applied. The optimum partial deployment strategy for maximum benefits remains an area for further investigation in future research.

RDMA load balancing in data centers: There is a plentiful literature on data center network load balancing at granularities ranging from per-flow to per-packet. Apart from the conventional Equal-Cost Multi-Path (ECMP), existing works predominantly leverage flowlets [11, 35, 52, 59], sub-streams of a flow separated by inactivity gaps, to proactively avoid out-of-order delivery. However, as highlighted previously, such an opportunistic design turns out to be ineffective for RDMA given the fewer chances to reroute [42]. Per-packet rerouting (e.g., spraying [18] or DRILL [23]) may provide near-optimal load balancing if out-of-order delivery does not matter, but it incurs an enormous performance impact on RDMA. In contrast, ConWeave masks out-of-order delivery in the network while load balancing at a fine granularity. Table 5 summarizes the existing literature and ConWeave in the context of RDMA.

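
Flowlet splitting can be illustrated in a few lines: a new flowlet (and thus a safe rerouting opportunity) begins whenever the gap since a flow's previous packet exceeds an inactivity threshold. A minimal sketch, with an invented 50 microsecond threshold rather than any cited system's value:

```python
# Minimal illustration of flowlet splitting [11, 35, 52, 59]: a new
# flowlet starts when the inter-packet gap exceeds an inactivity
# threshold. Threshold and structure are illustrative assumptions.
def flowlet_ids(arrival_times_us, gap_us=50.0):
    """Assign a flowlet id to each packet of one flow."""
    ids, current, prev = [], 0, None
    for t in arrival_times_us:
        if prev is not None and t - prev > gap_us:
            current += 1               # inactivity gap: safe to reroute
        ids.append(current)
        prev = t
    return ids

# Packets 10us apart stay in one flowlet; a 200us pause opens a new one.
print(flowlet_ids([0, 10, 20, 220, 230]))  # [0, 0, 0, 1, 1]
```

For a persistently backlogged RDMA flow such gaps rarely appear, which is why this opportunistic design yields few rerouting chances [42].
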
Some works consider a multi-path transport design with end-host modifications. MP-RDMA [42] proposes a multi-path RDMA transport through custom-designed RNICs, and a purely software-based implementation has also been proposed [58]. However, these approaches may not be compatible with legacy RNICs and thus cannot be easily deployed in data centers. In contrast, ConWeave is complementary to existing routing protocols and operates on current commodity programmable switches and RNICs.

Packet reordering on programmable switches: With the emergence of data plane programmability, efforts have been made to fully or partially offload end-host functions to the switching hardware for performance acceleration. The packet reordering (or sorting) function on programmable switches has been explored in the context of packet scheduling. For instance, many queue abstraction designs have been proposed to flexibly express a variety of scheduling algorithms and to be efficiently implemented on programmable switches [10, 54–56, 61]. However, their primitives are substantially more expensive to support, as reordering packets on a per-flow basis in hardware imposes more complex requirements. ConWeave's requirement is simpler and is designed to satisfy the packet reordering needs of load balancing.

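
The simpler requirement ConWeave targets, releasing a flow's packets in sequence while briefly holding out-of-order arrivals, can be sketched in software. This is a conceptual illustration of the reordering primitive, not the switch implementation itself.

```python
# A minimal per-flow reordering buffer, sketched in software to show the
# primitive: hold out-of-order arrivals and release packets in sequence.
# Conceptual illustration only, not ConWeave's switch data plane.
def reorder(packets):
    """packets: iterable of sequence numbers; returns them in order."""
    expected, held, out = 0, set(), []
    for seq in packets:
        held.add(seq)
        while expected in held:        # drain any run that became in-order
            held.remove(expected)
            out.append(expected)
            expected += 1
    return out

print(reorder([0, 2, 3, 1, 5, 4]))  # [0, 1, 2, 3, 4, 5]
```
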
Offloading packet reordering to the application/transport layer: Instead of avoiding out-of-order delivery in the network, some works [22, 26, 27, 45] implement a dedicated reordering buffer at the transport or application layer. However, these approaches are either too complex to implement on commodity RNIC hardware or incur significant CPU overhead, negating the benefits of using RDMA. Furthermore, they predominantly rely on congestion-oblivious rerouting (i.e., packet spraying), whose susceptibility to network asymmetry is well documented in previous studies [59, 62].

Load balancing in HPC: The emergence of AI/ML applications has led to a strong emphasis within the high-performance computing (HPC) community on achieving optimal (RDMA) network load balancing. Concurrent with ConWeave, leading companies in the field of HPC, including Cisco [3], NVIDIA [7], and Broadcom [2], have developed proprietary systems that incorporate per-packet rerouting and packet reordering capabilities in their switches and/or DPUs integrated into SmartNICs. ConWeave shares the same RDMA load-balancing goal as these systems and closely aligns with this emerging industrial trend. ConWeave's design and its publicly available implementation can serve as a benchmark for further research on RDMA load balancing using commodity programmable hardware.

SmartNIC

A SmartNIC is an intelligent network interface card that allows additional software to be loaded after purchase to add or support new features. Compared with a traditional NIC, a SmartNIC has stronger compute capability and larger on-board memory.

  1. Basic NIC stage
    • Mainly handles basic data-frame processing
    • Provides virtual machine network access through DPDK and SR-IOV
  2. First-generation SmartNIC
    • Implements complex network data-plane functions
    • Provides programmable hardware acceleration engines
  3. Second-generation SmartNIC
    • Enhances control-plane and data-plane acceleration
    • Supports more infrastructure-layer services