
ABM: Active Buffer Management in Datacenters

First skimmed it with Gemini. This paper is clearly squarely in the CC domain; as usual, I'm reading it for fun and to build background.

(1) Core background and problem

Background: modern datacenter switches typically use a shared-buffer architecture to absorb traffic bursts and avoid packet drops

Status quo: today's management mechanisms are typically layered

  1. Buffer Management (BM): decides each queue's maximum length at the device level (e.g., Dynamic Thresholds, DT); concerned with the spatial allocation of the buffer
  2. Active Queue Management (AQM): decides, at the queue level, whether each packet is dropped/marked (e.g., RED, ECN); concerned with temporal/delay control (see the sketch after this list)
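
To make the division of labor concrete, below is a minimal sketch of this two-level admission control (my illustration, not the paper's code): BM caps each queue from device-level state, and AQM then decides per packet within that cap. All names and constants (ALPHA, BUFFER, aqm_decision) are illustrative assumptions.

```python
# Hypothetical two-level admission control: BM (spatial) then AQM (temporal).

ALPHA = 1.0          # DT scaling parameter (configuration)
BUFFER = 12_000_000  # total shared buffer in bytes (illustrative)

def bm_threshold(total_occupancy: int) -> float:
    """Dynamic Thresholds (DT): cap each queue at alpha * remaining buffer."""
    return ALPHA * (BUFFER - total_occupancy)

def admit(pkt_len: int, queue_len: int, total_occupancy: int, aqm_decision):
    # Level 1: BM (device level) -- reject anything beyond the DT cap.
    if queue_len + pkt_len > bm_threshold(total_occupancy):
        return "drop"
    # Level 2: AQM (queue level) -- mark or drop within the allocated slice;
    # aqm_decision is a RED- or Codel-style per-queue rule.
    return aqm_decision(queue_len)
```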

Problem: BM and AQM lack coordination, which leads to three main problems

  1. Lack of isolation: traffic of different priorities interferes, even producing "priority inversion"
  2. Uncontrollable drain time: causes high queuing delay, increasing the completion time of short flows
  3. Unpredictable burst tolerance: hard to absorb bursts effectively while guaranteeing high throughput

(2) ABM's design

  • Core idea: ABM combines BM's spatial view (total buffer occupancy) with AQM's temporal view (queue drain rate)
  • Algorithm logic: ABM dynamically computes a per-queue threshold; the core formula accounts for the following factors (a hedged sketch follows this list)
    • Remaining global buffer space: inherited from the DT algorithm, senses overall congestion
    • Queue drain rate: introduces the time dimension, ensuring the allocated buffer can actually drain promptly
    • Number of congested queues in the same priority: ensures fairness and isolation within a priority
    • A configuration parameter: sets the priority's weight
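
Below is a hedged sketch of how these four factors might compose. The multiplicative form (weight over congested-queue count, times normalized drain rate, times remaining buffer) follows my reading of the paper; the variable names are mine, and the authoritative formula is in the paper's Section 3.

```python
def abm_threshold(alpha_p: float,      # configured weight of priority p
                  n_p: int,            # congested queues within priority p
                  drain_rate: float,   # this queue's dequeue rate (mu_i)
                  line_rate: float,    # port capacity (mu), used to normalize
                  buffer_size: float,  # total shared buffer (B)
                  occupancy: float) -> float:  # current total occupancy Q(t)
    remaining = buffer_size - occupancy       # spatial signal, as in DT
    drain_frac = drain_rate / line_rate       # temporal signal, as in AQM
    return (alpha_p / max(n_p, 1)) * drain_frac * remaining
```

A queue that drains slowly is allotted proportionally less buffer, which is what makes the drain time bounded (see the properties in (3) below).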

(3) ABM's three key properties

The paper proves analytically that ABM has the following properties:

  1. Isolation: guarantees each priority a minimum available buffer and caps its maximum occupancy, preventing any single traffic class from monopolizing the buffer
  2. Bounded drain time: limits queuing delay so the buffer is released quickly even under heavy load (see the derivation after this list)
  3. Predictable burst tolerance: burst tolerance no longer depends on uncontrollable background traffic, only on the burst arrival rate
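
Why the drain-rate factor yields property 2 (my reconstruction from the threshold form sketched above, not the paper's full proof): a queue's length q_i never exceeds its threshold T_i, so

```latex
\text{drain time}_i = \frac{q_i}{\mu_i}
  \le \frac{T_i}{\mu_i}
  = \frac{\alpha_p}{n_p}\cdot\frac{\mu_i}{\mu}\cdot\frac{B - Q(t)}{\mu_i}
  = \frac{\alpha_p \,\bigl(B - Q(t)\bigr)}{n_p\,\mu}
  \le \frac{\alpha_p\, B}{n_p\,\mu}
```

The queue's own rate \mu_i cancels, so the bound is uniform across queues of a priority: slow queues simply get proportionally less buffer.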

Introduction

Network devices are equipped with a buffer to avoid drops during transient congestion and to absorb bursts. To reduce cost and maximize utilization, the on-chip buffer is shared across its queues. This sharing naturally leads to various problems. Concretely, the excessive growth of a queue might harm the performance of another queue, which might be starved, deprived of throughput, etc. Worse yet, such harmful interference might occur across queues that are seemingly independent e.g., queues that are mapped to different ports or queues that are formed by independent applications.

Network devices typically employ a hierarchical packet admission control to orchestrate the use of the shared space. First, a Buffer Management (BM) algorithm [16, 22, 33] dynamically splits the buffer space across queues. Second, an Active Queue Management (AQM) algorithm [24, 32, 42] manages the buffer slice that BM allocates to each individual queue by selectively admitting the incoming packets. Historically, BM and AQM evolved independently with orthogonal goals. We visualize this in Figure 1. BM aims at achieving isolation across queues by managing the spatial allocation of the buffer at the device level. Intuitively, BM's goal is to avoid long-lived queue starvation, effectively enforcing fairness in the steady state. For instance, Dynamic Thresholds [22] aims at weighted fairness across multiple queues in a device. In contrast, AQM's goal is to maintain stable queuing delays by managing the temporal allocation of the buffer at the queue level. Intuitively, AQM prevents bufferbloat by preventing packets from staying in the buffer for "too long" [24, 32, 42]. For instance, ECN-based AQMs such as RED [24] control queue lengths via ECN marking; delay-based AQMs such as Codel [32, 42] hold the queuing delay at a fixed reference value.
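
For concreteness, the classic DT rule from [22] caps each queue at a fixed multiple of the currently unused buffer:

```latex
T(t) = \alpha \cdot \bigl(B - Q(t)\bigr)
```

where B is the total buffer size, Q(t) the instantaneous total occupancy, and \alpha a configuration parameter. The threshold depends only on space; how fast a queue drains plays no role, which is precisely the gap ABM fills.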

While this decoupling has been reasonable and successful in the past as it allowed BM and AQM schemes to evolve further, two recent datacenter trends make the need for coordination between them pressing. First, buffer size is not keeping up with the increase in switch capacity [20, 27]. In effect, BM no longer has enough buffer available to provide isolation to each queue. Second, as traffic becomes more bursty and incast scenarios more prevalent [18, 43, 50], the transient state of the buffer needs to be controlled at the device-level [20]. To keep up with these trends, a buffer-sharing scheme needs to provide isolation, bounded drain time and high burst tolerance.

In this paper, we show that today's BM and AQM schemes are fundamentally unable to independently satisfy these requirements. Driven by this insight, we propose ABM, an Active Buffer Management algorithm that incorporates the insights from both BM and AQM to achieve high burst absorption without sacrificing throughput. Concretely, ABM leverages both total buffer occupancy at the device level and the individual queue drain time. Essentially, ABM is a function of both the spatial (used by BM) and temporal (used by AQM) characteristics of the buffer, effectively providing the best of both worlds as shown in Figure 1. We analytically show that, unlike the state of the art, ABM achieves strong isolation properties and maintains stable buffer drain time. This allows ABM to provide high and predictable burst absorption while achieving high throughput. We consider ABM practical, as it operates using statistics that are already used by BM or AQM algorithms; thus, ABM is well within the capabilities of today's devices.

Our results from large-scale simulations show that ABM improves the flow completion times for short flows by up to 94% compared to existing BM and AQM schemes. Moreover, ABM is not only compatible with advanced congestion control algorithms (e.g., TIMELY, DCTCP and PowerTCP) but improves their performance in terms of tail FCTs by up to 76% under bursty workloads. Finally, we show that unlike traditional buffer management schemes, ABM works well on various buffer sizes, including shallow buffers (e.g., Tomahawk [1, 3]).

We view our work as a first step towards a new class of ABM algorithms that react to both total buffer occupancy and queuing delay.

In summary, our contributions in this paper are:

  • We reveal the fundamental limitations of BM and AQM schemes that prevent them from optimally absorbing bursts (Section 2.2).
  • We analytically show the critical limitations of the state-of-the-art buffer management scheme (Section 2.3).
  • We design Active Buffer Management (ABM), an algorithm that achieves high burst absorption and maintains high throughput by leveraging both total buffer occupancy and queue drain time (Section 3).
  • An extensive evaluation that demonstrates the benefits of ABM in the datacenter context (Section 4).
  • As a contribution to the research community, to ensure reproducibility and facilitate future work, we made all of our artifacts publicly available at https://github.com/inet-tub/ns3-datacenter.

Network devices are typically equipped with a buffer to avoid drops during transient congestion and to absorb bursts.

To reduce cost and maximize utilization, the on-chip buffer is shared across all of its queues.

This sharing naturally leads to various problems. Concretely:

Excessive growth of one queue can harm the performance of another, leaving it starved, deprived of throughput, and so on.

Worse yet, such harmful interference can occur even between seemingly independent queues, e.g., queues mapped to different ports, or queues formed by independent applications.

Network devices typically employ hierarchical packet admission control to orchestrate the use of the shared space.

First, a Buffer Management (BM) algorithm [16, 22, 33] dynamically splits the buffer space across queues.

Second, an Active Queue Management (AQM) algorithm [24, 32, 42] manages the buffer slice that BM allocates to each individual queue by selectively admitting incoming packets.

Historically, BM and AQM evolved independently, with orthogonal goals, as visualized in Figure 1:


  • BM aims to achieve isolation across queues by managing the spatial allocation of the buffer at the device level
    • Intuitively, BM's goal is to avoid long-lived queue starvation, effectively enforcing fairness in steady state
    • For instance, Dynamic Thresholds (DT) [22] aims at weighted fairness across a device's queues
  • AQM aims to maintain stable queuing delays by managing the temporal allocation of the buffer at the queue level
    • Intuitively, AQM prevents bufferbloat by keeping packets from staying in the buffer "too long" [24, 32, 42]
    • For instance, ECN-based AQMs (e.g., RED [24]) control queue length via ECN marking, while delay-based AQMs (e.g., Codel [32, 42]) hold the queuing delay at a fixed reference value; both families are sketched below
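
As a simplified illustration of the two families (my sketch: real RED operates on an EWMA of the queue length, and real Codel uses a more elaborate control law; see [24, 32, 42]):

```python
def red_mark_probability(avg_qlen: float, min_th: float, max_th: float) -> float:
    """Length-based AQM (RED-style): marking probability rises linearly
    between a minimum and a maximum threshold."""
    if avg_qlen <= min_th:
        return 0.0
    if avg_qlen >= max_th:
        return 1.0
    return (avg_qlen - min_th) / (max_th - min_th)

def delay_based_mark(sojourn_time_s: float, target_s: float = 0.005) -> bool:
    """Delay-based AQM (Codel-style): react once the packet's queuing
    delay exceeds a fixed reference value."""
    return sojourn_time_s > target_s
```

A rule like either of these is what the aqm_decision hook in the earlier admission sketch stands for.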

Although this decoupling was reasonable and successful in the past (it allowed BM and AQM schemes to keep evolving), two recent datacenter trends make coordination between the two pressing:

First, buffer sizes are not keeping up with the growth in switch capacity [20, 27].

In effect, BM no longer has enough buffer to provide isolation to every queue.

Second, as traffic becomes burstier and incast scenarios more prevalent [18, 43, 50], the transient state of the buffer needs to be controlled at the device level [20].

To keep up with these trends, a buffer-sharing scheme needs to provide:

  1. Isolation
  2. Bounded drain time
  3. High burst tolerance

In this paper, we show that existing BM and AQM schemes are fundamentally unable to satisfy these requirements on their own. Driven by this insight, we propose ABM, an Active Buffer Management algorithm that fuses the ideas of BM and AQM to achieve high burst absorption without sacrificing throughput.

Concretely, ABM leverages both the device-level total buffer occupancy and each individual queue's drain time.

Essentially, ABM is a function of both the buffer's spatial characteristics (used by BM) and its temporal characteristics (used by AQM), effectively combining the strengths of both, as shown in Figure 1.

We show analytically that, unlike the state of the art, ABM achieves strong isolation properties and maintains stable buffer drain time. This lets ABM provide high and predictable burst absorption while achieving high throughput. We consider ABM practical: the statistics it uses are already consumed by BM or AQM algorithms, so ABM is well within the capabilities of today's devices.

Large-scale simulation results show that, compared to existing BM and AQM schemes, ABM shortens the flow completion times (FCT) of short flows by up to 94%. Moreover, ABM is not only compatible with advanced congestion control algorithms (e.g., TIMELY, DCTCP, and PowerTCP) but also improves their tail FCTs by up to 76% under bursty workloads.

Finally, we show that, unlike traditional buffer management schemes, ABM works well across a range of buffer sizes, including shallow buffers (e.g., Tomahawk [1, 3]).

We view this work as a first step towards a new class of ABM algorithms that react to both total buffer occupancy and queuing delay.

In summary, the contributions of this paper are:

  • We reveal the fundamental limitations that prevent BM and AQM schemes from optimally absorbing bursts (Section 2.2)
  • We analytically show the critical limitations of the state-of-the-art buffer management scheme (Section 2.3)
  • We design Active Buffer Management (ABM), an algorithm that achieves high burst absorption and maintains high throughput by leveraging both total buffer occupancy and queue drain time (Section 3)
  • We conduct an extensive evaluation demonstrating ABM's benefits in the datacenter context (Section 4)
  • As a contribution to the research community, to ensure reproducibility and facilitate future work, we make all of our artifacts publicly available at https://github.com/inet-tub/ns3-datacenter

Optimally managing switch buffers has been an active area of research for more than two decades, with a wide range of approaches including BM [16, 17, 19, 21-23, 33, 44], AQM [24, 32, 42, 49], scheduling [15, 30, 47], and end-host congestion control [12, 14, 35, 38-40, 48].

BM schemes such as FAB [16], Cisco's IB [5], TDT [31] and EDT [46] rely on DT [22] and attempt to give more of the remaining buffer to short flows. By relying on DT, these schemes inherit its pitfalls (Section 2.3).

AQM schemes such as RED [24], ARED [25], Codel [32] and PIE [42] control queue lengths or delays, but under unrealistic assumptions. Indeed, AQM schemes assume per-queue buffer isolation, i.e., that the maximum length per queue is static. Yet, in practice, this length changes dynamically, often depriving AQM of the buffer required to operate normally.

End-host congestion control algorithms have the potential to reduce buffer requirements, but are orthogonal to our work. Algorithms such as DCTCP [14] and DCQCN [51] use marking schemes as feedback to adjust sending rates. TIMELY [38] uses an RTT-gradient approach to react rapidly to the onset of congestion. Even more advanced algorithms use a variety of feedback signals, e.g., HPCC [35] uses in-flight bytes, Swift [34] uses delay, and PowerTCP [12] uses power. Yet, end-host congestion control algorithms cannot act on packets within the first RTT, and are fundamentally unable to orchestrate buffer sharing at all times.

The paper groups the past two decades of research on switch buffer management into three categories: Buffer Management (BM), Active Queue Management (AQM), and end-host congestion control (CC)

  1. Buffer Management (BM) schemes

    • Representative algorithms: FAB [16], Cisco's IB [5], TDT [31], EDT [46]
    • Core mechanism:
      • Rely on the Dynamic Thresholds (DT) [22] algorithm
      • Try to give more of the remaining buffer to short flows
    • Limitation: by relying on DT, these schemes inherit all of DT's pitfalls (e.g., lack of isolation, uncontrollable drain time)
  2. Active Queue Management (AQM) schemes

    • Representative algorithms: RED [24], ARED [25], Codel [32], PIE [42]
    • Core mechanism: control queue lengths or delays
    • Limitations:
      • Built on an unrealistic assumption: static per-queue buffer isolation (i.e., a fixed maximum queue length)
      • Fails in practice: in a shared buffer the per-queue limit changes dynamically, often depriving AQM of the buffer it needs to operate normally
  3. End-host congestion control

    • Representative algorithms:
      • Marking-based: DCTCP [14], DCQCN [51]
      • RTT-gradient-based: TIMELY [38]
      • Other feedback signals: HPCC (in-flight bytes) [35], Swift (delay) [34], PowerTCP (power) [12]
    • Relation to this work: orthogonal; has the potential to reduce buffer requirements
    • Limitations:
      • Cannot react to packets within the first RTT
      • Fundamentally unable to orchestrate the switch's buffer sharing at all times

| Category | Examples | Mechanism / Goal | Limitations |
| --- | --- | --- | --- |
| Buffer Management (BM) | FAB [16], Cisco IB [5], TDT [31], EDT [46] | Rely on Dynamic Thresholds (DT); aim to leave more buffer for short flows | Inherit DT's pitfalls (Section 2.3); cannot solve the fundamental shared-buffer problems |
| Active Queue Management (AQM) | RED [24], ARED [25], Codel [32], PIE [42] | Control queue lengths or queuing delays | Assume static per-queue isolation; often starved of buffer in a dynamically shared buffer |
| End-host Congestion Control | DCTCP [14], TIMELY [38], HPCC [35], PowerTCP [12] | Adjust sending rates using feedback such as ECN, RTT, and in-flight bytes | Cannot act on first-RTT congestion; cannot directly orchestrate the switch's buffer sharing |