PowerTCP: Pushing the Performance Limits of Datacenter Networks
I did a first pass with Gemini. This paper is obviously pure congestion control (CC) territory, so I'm just reading it for fun and to build up some background.
This paper targets the limitations of existing congestion control (CC) algorithms in datacenter networks and proposes a new algorithm called PowerTCP.
Analysis of the existing problem: the authors draw an analogy between existing CC algorithms and electrical circuits
- Voltage-based: reacts to absolute network state (e.g., queue length, RTT); examples: DCTCP, CUBIC
    - Pro: stable
    - Con: reacts slowly / tends to overreact
- Current-based: reacts to the rate of change of state (e.g., the RTT gradient); example: TIMELY
    - Pro: reacts quickly
    - Con: lacks a unique equilibrium point, so stability is weaker
- Conclusion: a single dimension cannot deliver both fast response and precise control
Core idea - power:
- Introduces the notion of "network power", i.e., voltage × current
- In a network, this roughly corresponds to (queue length + BDP) × (total transmission rate)
- PowerTCP uses this combined metric to update the congestion window (a sketch follows at the end of this summary)
PowerTCP characteristics:
- Mechanism: relies mainly on in-band network telemetry (INT) to obtain precise queue and bandwidth information from the switches
- Advantages: combines the strengths of the voltage-based and current-based approaches, reacting to traffic bursts within one or two RTTs while keeping the system stable and queue buildup, and hence latency, very low
- Performance: in simulations, 99.9th-percentile short flow completion times are reduced by 80% compared to DCQCN (and by 33% compared to HPCC)
θ-PowerTCP:
- For scenarios without programmable-switch support, the authors also propose a pure end-host approximation that estimates power from RTT measurements
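To make the window update concrete, here is a minimal Python sketch built only from the quantities summarized above. It is my illustration, not the paper's exact control law: the equilibrium "base power" normalization (BDP × bandwidth), the additive term `beta`, and the smoothing factor `gamma` are assumptions.

```python
def powertcp_update(cwnd, rate, qlen, bdp, bandwidth, gamma=0.9, beta=1.0):
    """Sketch of a power-based congestion-window update (illustrative only).

    voltage = queue length + BDP        -> absolute network state
    current = total transmission rate   -> network dynamics
    power   = voltage * current
    """
    voltage = qlen + bdp   # bytes: queued bytes plus bandwidth-delay product
    current = rate         # bytes/s arriving at the bottleneck
    power = voltage * current

    # Assumed normalization: at equilibrium (empty queue, rate == bandwidth)
    # power equals bdp * bandwidth, so normalized power is exactly 1.
    norm_power = power / (bdp * bandwidth)

    # Shrink the window when normalized power exceeds 1 (congestion building)
    # and grow it when below 1; beta is an additive-increase term and gamma
    # an EWMA smoothing factor, both illustrative.
    target = cwnd / norm_power + beta
    return gamma * target + (1.0 - gamma) * cwnd
```

At equilibrium (empty queue, sending rate equal to link bandwidth) the normalized power is 1 and the window is left unchanged; a growing queue (voltage) or a rate spike (current) immediately pushes normalized power above 1 and shrinks the window, without waiting for a queue to build up.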
Introduction
The performance of more and more cloud-based applications critically depends on the underlying network, requiring datacenter networks (DCNs) to provide extremely low latency and high bandwidth. For example, in distributed machine learning applications that periodically require large data transfers, the network is increasingly becoming a bottleneck [36]. Similarly, stringent performance requirements are introduced by today's trend of resource disaggregation in datacenters where fast access to remote resources (e.g., GPUs or memory) is pivotal for the overall system performance [36]. Building systems with strict performance requirements is especially challenging under bursty traffic patterns as they are commonly observed in datacenter networks [12,16,47,53,55].
These requirements introduce the need for fast and accurate network resource management algorithms that optimally utilize the available bandwidth while minimizing packet latencies and flow completion times. Congestion control (CC) plays an important role in this context, being "a key enabler (or limiter) of system performance in the datacenter" [34]. In fact, fast-reacting congestion control is not only essential to efficiently adapt to bursty traffic [29,48], but is also becoming increasingly important in the context of emerging reconfigurable datacenter networks (RDCNs) [13,14,20,33,38,39,50]. In these networks, a congestion control algorithm must be able to quickly ramp up its sending rate when high-bandwidth circuits become available [43].
Traditional congestion control in datacenters revolves around a bottleneck link model: the control action is related to the state, i.e., the queue length at the bottleneck link. A common goal is to efficiently control queue buildup while achieving high throughput. Existing algorithms can be broadly classified into two types based on the feedback that they react to. In the following, we will use an analogy to electrical circuits to describe these two types. The first category of algorithms reacts to the absolute network state, such as the queue length or the RTT: a function of network "effort", or voltage, defined as the sum of the bandwidth-delay product and in-network queuing. The second category of algorithms rather reacts to variations, such as the change of RTT. Since these changes are related to the network "flow", we say that these approaches depend on the current, defined as the total transmission rate. We tabulate our analogy and the corresponding network quantities in Table 1. According to this classification, we call congestion control protocols such as CUBIC [21], DCTCP [7], or Vegas [15] voltage-based CC algorithms, as they react to absolute properties such as the bottleneck queue length, delay, Explicit Congestion Notification (ECN), or loss. Recent proposals such as TIMELY [41] are current-based CC algorithms, as they react to variations such as the RTT gradient. In conclusion, we find that existing congestion control algorithms are fundamentally limited to one of the two dimensions (voltage or current) in the way they update the congestion window.
We argue that the input to a congestion control algorithm should rather be a function of the two-dimensional state of the network (i.e., both voltage and current) to allow for more informed and accurate reaction, improving performance and stability. In our work, we show that there exists an accurate relationship between the optimal adjustment of the congestion window, the network voltage and the network current. We analytically show that the optimal window adjustment depends on the product of network voltage and network current. We call this product network power: current × voltage, a function of both queue lengths and queue dynamics.
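Written out with explicit symbols (notation mine), the quantities defined in the preceding paragraphs are:

$$
v(t) = q(t) + \mathrm{BDP}, \qquad \lambda(t) = \text{total transmission rate}, \qquad \Gamma(t) = v(t)\,\lambda(t),
$$

where $q(t)$ is the bottleneck queue length, $v(t)$ the network voltage, $\lambda(t)$ the network current, and $\Gamma(t)$ the resulting network power.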
Figure 1 illustrates our classification. Existing protocols depend on a single dimension, voltage or current. This can result in imprecise congestion control as the protocol is unable to distinguish between fundamentally different scenarios, and, as a result, either reacts too slowly or overreacts, both impeding performance. Accounting for both voltage and current, i.e., power, balances accurate inflight control and fast reaction, effectively providing the best of both worlds.
In this paper we present POWERTCP, a novel power-based congestion control algorithm that accurately captures both voltage and current dimensions for every control action using measurements taken within the network and propagated through in-band network telemetry (INT). POWERTCP is able to utilize available bandwidth within one or two RTTs while being stable, maintaining low queue lengths, and resolving congestion rapidly. Furthermore, we show that POWERTCP is Lyapunov-stable as well as asymptotically stable, and has a convergence time as low as five update intervals (Appendix A). This makes POWERTCP highly suitable for today's datacenter networks and dynamic network environments such as in reconfigurable datacenters.
POWERTCP leverages in-network measurements at programmable switches to accurately obtain the bottleneck link state. Our switch component is lightweight and the required INT header fields are standard in the literature [36]. We also discuss an approximation of POWERTCP for use with non-programmable, legacy switches.
To evaluate POWERTCP, we focus on a deployment scenario in the context of RDMA networks where the CC algorithm is implemented on a NIC. Our results from large-scale simulations show that POWERTCP reduces the 99.9th-percentile short flow completion times by 80% compared to DCQCN [56] and by 33% compared to the state-of-the-art low-latency protocol HPCC [36]. We show that POWERTCP maintains near-zero queue lengths without affecting throughput or incurring long flow completion times even at 80% load. As a case study, we explore the benefits of POWERTCP in reconfigurable datacenter networks where it achieves 80−85% circuit utilization and reduces tail latency by at least 2× compared to the state-of-the-art [43]. Finally, as a proof-of-concept, we implemented POWERTCP in the Linux kernel and the telemetry component on an Intel Tofino programmable line-rate switch using P4 [18].
In summary, our key contributions in this paper are:
- We reveal the shortcomings of existing congestion control approaches, which either only react to the current state or to the dynamics of the network, and introduce the notion of power to account for both.
- POWERTCP, a power-based approach to congestion control at the end-host, which reacts faster to changes in the network such as the arrival of a burst or fluctuations in available bandwidth.
- An evaluation of the benefits of POWERTCP in traditional DCNs and RDCNs.
- As a contribution to the research community and to facilitate future work, all our artefacts have been made publicly available at: https://powertcp.self-adjusting.net.
The performance of more and more cloud applications depends critically on the underlying network, which demands that datacenter networks (DCNs) deliver extremely low latency and high bandwidth. For example, in distributed machine learning applications that periodically perform large data transfers, the network is increasingly becoming the system bottleneck [36]. Likewise, the trend toward resource disaggregation in datacenters brings strict performance requirements, as fast access to remote resources (such as GPUs or memory) is pivotal for overall system performance [36]. Under the bursty traffic patterns commonly observed in datacenter networks, building systems that satisfy strict performance requirements is highly challenging [12,16,47,53,55].
These requirements call for fast and accurate network resource management algorithms that optimally utilize the available bandwidth while minimizing packet latency and flow completion times. Congestion control (CC) plays an important role in this context, regarded as "a key enabler (or limiter) of system performance in the datacenter" [34]. In fact, fast-reacting congestion control is not only essential for efficiently adapting to bursty traffic [29,48], but is also increasingly important in emerging reconfigurable datacenter networks (RDCNs) [13,14,20,33,38,39,50]. In such networks, the congestion control algorithm must be able to ramp up its sending rate quickly when a high-bandwidth circuit becomes available [43].
Traditional datacenter congestion control revolves around a bottleneck link model: the control action is tied to the network state, i.e., the queue length at the bottleneck link. The common goal is to effectively control queue buildup while achieving high throughput.
Existing algorithms can be broadly classified into two types based on the feedback they react to. Below, we use an electrical-circuit analogy to describe them:
- The first category of algorithms reacts to the absolute network state, e.g., the queue length or the round-trip time (RTT):
    - This can be viewed as a function of the network's "effort", or voltage
    - Defined as the sum of the bandwidth-delay product (BDP) and in-network queuing
- The second category reacts to changes in state, e.g., the rate of change of the RTT
    - Since these changes relate to the network's "flow", we say such approaches depend on the current, i.e., the total transmission rate
- Table 1 lists the analogy and the corresponding network quantities
According to this classification:
- We call protocols such as CUBIC [21], DCTCP [7], or Vegas [15] voltage-based CC algorithms, since they react to absolute properties such as the bottleneck queue length, delay, Explicit Congestion Notification (ECN), or loss
- Recent proposals such as TIMELY [41] are current-based CC algorithms, since they react to variations such as the RTT gradient

In summary, we find that existing congestion control algorithms are fundamentally limited to a single dimension, voltage or current, when updating the congestion window.
We argue that the input to a congestion control algorithm should instead be a function of the two-dimensional network state (i.e., both voltage and current), enabling a more informed and accurate reaction and thus improving performance and stability.
In this paper, we show that there is a precise relationship between the optimal congestion-window adjustment, the network voltage, and the network current.
Our analysis shows that the optimal window adjustment depends on the product of network voltage and network current. We call this product network power: current × voltage, a function of both queue lengths and queue dynamics.
Figure 1 illustrates our classification:

Existing protocols depend on a single dimension (voltage or current). This can make congestion control imprecise: the protocol cannot distinguish fundamentally different scenarios and therefore either reacts too slowly or overreacts, both of which hurt performance.
Accounting for both voltage and current, i.e., power, balances accurate inflight control with fast reaction, effectively giving the best of both worlds.
In this paper, we present PowerTCP, a novel power-based congestion control algorithm.
Using measurements taken inside the network and carried via in-band network telemetry (INT), PowerTCP accurately captures both the voltage and the current dimension for every control action.
What is INT, and how does it differ from ordinary network measurement?
In-band Network Telemetry (INT) is a network measurement technique that allows network devices (such as switches and routers) to embed measurement information into packets as they are forwarded. End hosts can thus obtain the network state in real time, e.g., delay, loss rate, and queue lengths, enabling more precise network management and optimization.
(1) Ordinary network measurement
Mostly end-to-end "black box" guessing.
Pure end-to-end (e.g., RTT):
- The switches in the middle take no part in the measurement; they just forward, essentially statelessly
- The sender transmits packets; the receiver receives them and returns ACKs
- The sender computes the time difference and guesses what happened along the path:
    - Queuing? A broken fiber forcing a detour? It can only guess roughly; the sender never learns what actually happened
Lightweight assistance (e.g., ECN/DCTCP):
- The switches participate, but only by setting a mark
    - e.g., setting the ECN bit in the IP header to 1
- A switch does not tell the sender how long its queue actually is; it only says "I'm a bit congested (above my threshold)"
- This is a binary (0/1) signal, so it is very coarse
(2) INT
This is not the switch sending the sender a separate message! That would generate a flood of extra control packets and blow up the network!
Instead, the telemetry piggybacks on data packets: a "super white box"!
- The sender sends an ordinary data packet
- Hop 1 (a switch) receives the packet and inserts its current metadata into the header ("I am Switch A, current queue length 5 KB, bandwidth 100G")
- Hop 2 (a switch) receives the packet and appends its own metadata ("I am Switch B, current queue length 0 KB, ...")
- The receiver gets the packet, now "inflated" with telemetry
    - At this point, the receiver holds a detailed health report for every node along the path
- The key step: when the receiver replies with an ACK, it extracts the most important information (usually the queue state of the most congested hop on the path, i.e., the bottleneck) and carries it back to the sender in the ACK (see the toy sketch below)
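A toy Python rendering of this piggybacking flow. The class names and the "longest queue = bottleneck" selection rule are my simplifications; real INT header formats and the feedback actually echoed in ACKs are richer.

```python
from dataclasses import dataclass, field

@dataclass
class IntRecord:
    switch_id: str
    qlen_bytes: int     # queue length observed at this hop
    bandwidth_bps: int  # link bandwidth at this hop

@dataclass
class Packet:
    payload: bytes
    int_stack: list = field(default_factory=list)  # telemetry piggybacked in the header

def switch_forward(pkt: Packet, switch_id: str, qlen: int, bw: int) -> Packet:
    """Each hop appends its own metadata to the packet in-band,
    instead of sending separate control messages to the sender."""
    pkt.int_stack.append(IntRecord(switch_id, qlen, bw))
    return pkt

def receiver_feedback(pkt: Packet) -> IntRecord:
    """The receiver picks the bottleneck hop (here simply the longest
    queue) and echoes only that record back to the sender in the ACK."""
    return max(pkt.int_stack, key=lambda r: r.qlen_bytes)

# Walk one packet through the two hops from the example above.
pkt = Packet(payload=b"data")
pkt = switch_forward(pkt, "A", qlen=5_000, bw=100 * 10**9)
pkt = switch_forward(pkt, "B", qlen=0, bw=100 * 10**9)
print(receiver_feedback(pkt))  # -> IntRecord(switch_id='A', qlen_bytes=5000, ...)
```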
PowerTCP can fully utilize the available bandwidth within one or two RTTs while remaining stable, keeping queue lengths low, and resolving congestion quickly.
Furthermore, we show that PowerTCP is Lyapunov-stable and asymptotically stable, with a convergence time as low as five update intervals (Appendix A). This makes PowerTCP well suited to today's datacenter networks and to dynamic environments such as reconfigurable datacenters.
PowerTCP leverages in-network measurements at programmable switches to accurately obtain the bottleneck link state. Our switch component is lightweight, and the required INT header fields are standard in the literature [36]. We also discuss an approximation of PowerTCP for non-programmable legacy switches.
To evaluate PowerTCP, we focus on a deployment scenario in RDMA networks, where the CC algorithm runs on the NIC. Large-scale simulations show that PowerTCP reduces 99.9th-percentile short flow completion times by 80% compared to DCQCN [56] and by 33% compared to the state-of-the-art low-latency protocol HPCC [36]. PowerTCP maintains near-zero queue lengths without sacrificing throughput or inflating long flow completion times, even at 80% load.
As a case study, we explore the benefits of PowerTCP in reconfigurable datacenter networks, where it achieves 80-85% circuit utilization and reduces tail latency by at least 2× compared to the state of the art [43]. Finally, as a proof of concept, we implemented PowerTCP in the Linux kernel and the telemetry component on an Intel Tofino programmable line-rate switch using P4 [18].
In summary, the key contributions of this paper are:
- Revealing the shortcomings of existing congestion control approaches, which react only to the current state or only to the dynamics of the network, and introducing the notion of power to account for both
- PowerTCP, a power-based congestion control approach at the end host that reacts faster to network changes (e.g., the arrival of a burst, fluctuations in available bandwidth)
- An evaluation of the benefits of PowerTCP in traditional DCNs and RDCNs
- As a contribution to the research community and to facilitate future work, all artefacts are publicly available at: https://powertcp.self-adjusting.net
Related Work
Dealing with congestion has been an active research topic for decades with a wide spectrum of approaches, including buffer management [3,10,17] and scheduling [9,25,45,46]. In the following, we will focus on the most closely related works on end-host congestion control.
Approaches such as [7,51,56] (e.g., DCTCP, D²TCP) rely on ECN as the congestion signal and react proportionally. Such algorithms require the bottleneck queue to grow up to a certain threshold, which results in queuing delays. ECN-based schemes remain oblivious to congestion onset and intensity. Protocols such as TIMELY [41], SWIFT [34], CDG [23], DX [35] rely on RTT measurements for window update calculations. TIMELY and CDG partly react to congestion based on delay gradients, remaining oblivious to absolute queue lengths. TIMELY, for instance, uses a threshold to fall back to proportional reaction to delay instead of delay gradient. SWIFT, a successor of TIMELY, only reacts proportionally to delay. As a result, SWIFT cannot detect congestion onset and intensity unless the distance from target delay significantly increases. In contrast, θ-POWERTCP, while also being a delay-based congestion control algorithm, updates the window sizes using the notion of power. As a result, θ-POWERTCP accurately detects congestion onset even at near-zero queue lengths.
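As a rough sketch of how such a delay-based variant could estimate power from RTT samples alone, here is my derivation under textbook queueing assumptions (queue ≈ (RTT − base RTT) × bandwidth), not necessarily the paper's exact θ-POWERTCP estimator:

```python
def estimate_norm_power_from_rtt(rtt, prev_rtt, dt, base_rtt, bandwidth):
    """Estimate normalized network power from RTT only (no switch support).

    Assumed relations (my derivation, not quoted from the paper):
      queue    q ~= (rtt - base_rtt) * bandwidth
      voltage  v  = q + BDP = rtt * bandwidth     (BDP = base_rtt * bandwidth)
      current  l ~= dq/dt + bandwidth = bandwidth * (rtt_gradient + 1)
    """
    rtt_gradient = (rtt - prev_rtt) / dt
    voltage = rtt * bandwidth
    current = bandwidth * (rtt_gradient + 1.0)
    base_power = (base_rtt * bandwidth) * bandwidth  # equilibrium power
    return (voltage * current) / base_power          # 1.0 at equilibrium
```

With a flat RTT at the base value this returns 1.0; a rising RTT raises both factors at once, which is how a delay-only variant can still see congestion onset early.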
XCP [30], D³ [52], RCP [19] rely on explicit network feedback based on rate calculations within the network. However, the rate calculations are based on heuristics and require parameter tuning to adjust for different goals such as fairness and utilization. HPCC [36] introduces a novel use of in-band network telemetry and significantly improves the fidelity of feedback. Our work builds on the same INT capabilities to accurately measure the bottleneck link state. However, as we show analytically and empirically, HPCC's control law then adjusts rate and window size solely based on observed queue lengths and lacks control accuracy compared to POWERTCP. Our proposal POWERTCP uses the same feedback signal but uses the notion of power to update window sizes, leading to significantly more fine-grained and accurate reactions.
Receiver-driven transport protocols such as NDP [22], HOMA [42], and Aeolus [26] have received much attention lately. Such approaches are conceptually different from classic transmission control at the sender. Importantly, receiver-driven transport approaches make assumptions on the uniformity in datacenter topologies and oversubscription [22]. POWERTCP is a sender-based classic CC approach that uses our novel notion of power and achieves fine-grained control over queuing delays without sacrificing throughput.
| Category | Representative algorithms | Core mechanism | Main drawback/limitation | PowerTCP's improvement |
|---|---|---|---|---|
| ECN-based | DCTCP, DCQCN | ECN mark feedback | Needs queue buildup to trigger, adding latency; cannot sense congestion onset | Works even at near-zero queues; reacts more sharply |
| Delay-based | TIMELY, SWIFT | RTT or RTT gradient | Sluggish (needs delay to grow significantly) or ignores absolute queue length | Uses the notion of power to detect congestion precisely at low latency |
| Explicit feedback | HPCC, XCP | In-network computation or INT | HPCC adjusts based on queue length (voltage) alone; accuracy is limited | Combines voltage and current (i.e., power) for more precise control |
| Receiver-driven | NDP, HOMA | Receiver-side scheduling | Relies on assumptions about topology uniformity; limited generality | Sender-based classic CC without such topology assumptions |