
CellReplay: Towards accurate record-and-replay for cellular networks

Core problem: existing cellular-network record-and-replay tools (including the de facto standard, Mahimahi) exhibit significant errors and cannot accurately reflect how cellular network performance changes under different workloads.

(1) Motivation

Complexity of cellular networks: cellular performance (bandwidth, latency) is heavily affected by wireless interference, base-station scheduling, mobility, and similar factors, and it varies not only over time but also with the workload.

Limitations of existing tools (Mahimahi's problems):

Mahimahi uses a "Saturator" that continuously sends packets to fill the link, records the maximum achievable throughput as packet delivery opportunities (PDOs), and applies a fixed propagation delay during replay.

Main flaws:

  1. It underestimates round-trip time (RTT): the base RTT of a cellular network changes dynamically, but Mahimahi uses half of a fixed minimum RTT as the propagation delay, underestimating RTT by roughly 13-17% on average.
  2. It ignores the dependence of bandwidth on workload: the bandwidth a cellular network allocates to a user depends on how much data the user sends.
    • Short bursts usually receive lower bandwidth than long streams.
    • Mahimahi always replays the high bandwidth observed under saturation, so performance evaluations of lightly loaded applications are overly optimistic (e.g., web page load times are underestimated).

(2) Design

To address these problems, the authors propose CellReplay, whose core idea is workload-aware dual recording with interpolated replay.

  • Recording phase:

    • Two phones record simultaneously to capture network behavior under two extreme workloads:
      1. Light workload: runs packet-train probing to capture the dynamic base RTT and light-workload bandwidth (light PDOs).
      2. Heavy workload: runs a Mahimahi-style Saturator to capture the maximum bandwidth (heavy PDOs).
  • Replay phase:

    • The emulator switches dynamically between the light and heavy traces according to the traffic pattern the application actually sends.
    • Mechanism:
      1. When the application starts sending, the light trace is used first.
      2. If the transfer lasts long enough (the packet sequence grows), the emulator transitions smoothly and splices in the heavy trace.
      3. After an idle period, the system resets to the light state.
  • Calibration: before recording, parameters (probe packet size, spacing, etc.) are calibrated automatically to fit the specific network environment.
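The replay-side switching described above can be sketched as a tiny state machine. This is our own simplification for illustration; the threshold and idle-reset values below are assumptions, not CellReplay's calibrated parameters:

```python
# Illustrative sketch of CellReplay-style trace switching (not the authors'
# code). BURST_THRESHOLD and IDLE_RESET_S are assumed values for the demo.

BURST_THRESHOLD = 25   # packets in a burst before splicing to the heavy trace
IDLE_RESET_S = 0.1     # idle gap (seconds) that resets the link to "light"

class TraceSelector:
    """Decide, per packet, whether the light or heavy trace governs delivery."""

    def __init__(self):
        self.burst_len = 0
        self.last_send = None

    def trace_for_packet(self, now):
        # An idle period resets the emulated link back to the light state.
        if self.last_send is not None and now - self.last_send > IDLE_RESET_S:
            self.burst_len = 0
        self.last_send = now
        self.burst_len += 1
        return "heavy" if self.burst_len > BURST_THRESHOLD else "light"
```

Short bursts never leave the light trace, while a sustained transfer crosses the threshold and is served at the heavy (saturator-like) rate until the connection goes idle.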

Core idea

(1) Mahimahi records under a single preset workload, which is not realistic enough.

(2) CellReplay records under two preset workloads (heavy/light).

Based on the heavy/light presets, it interpolates to produce the most realistic behavior for the current workload.

CellReplay shows that cellular-network emulation must account for the dependence of network performance on traffic load. By recording both a light and a heavy state and interpolating dynamically between them, it provides a more accurate emulation platform than the existing state of the art.

Introduction

Cellular network performance, including bandwidth and latency, can vary significantly due to factors such as wireless interference, environmental obstructions, and handovers, especially in mobile environments [12,13,22,23,34]. The gold standard for evaluating application and protocol performance on cellular networks is, hence, to test them directly on live cellular networks. However, live testing is time-consuming, as experiments must be conducted across many different network conditions and can produce different results—due to different signal strengths, types of wireless service (e.g., 5G millimeter wave, 5G low-band, and 4G), kinds of interference, rates of mobility, physical locations, etc. Repeating each experiment multiple times is crucial to ensure statistically reliable results given the performance variability in cellular networks. In addition to being time-consuming, experiments are often difficult to reproduce. A lack of control over the environment makes it infeasible, for instance, to compare the effects of a protocol change under identical network conditions.

Thus, researchers and app developers often turn to simulation or emulation for much of their evaluation, hoping to replicate a representative environment that yields performance similar to a live network. Simulators and emulators, such as ns3 [28] or Linux’s netem-tc, offer various options for configuring delay, jitter, bandwidth, packet loss, and more. While they can adjust these parameters to emulate specific and realistic conditions, properly tuning them to accurately represent the dynamic behavior of real-world cellular networks remains a challenging and open problem.

A more realistic approach is to record network performance traces (e.g., latency, bandwidth, or packet loss) over time using predefined workloads (e.g., RTT probing) on a real-world network and replay those traces in an emulated network for the tested apps. This method allows for recording different traces under various conditions (e.g., locations) and testing multiple apps using such recorded traces. Record-and-replay emulation was pioneered by Noble et al. [27]. More recently, the Mahimahi network emulator [25] can also replay recorded cellular network traces and has been instrumental in the design and evaluation of several notable networked systems and protocols (e.g., [5,14,18,20,24,26,30,31,36,37,39,41,43,45,46]).

However, we found that Mahimahi can produce inaccurate results compared to real-world tests in important cases, particularly for latency-sensitive and bursty workloads. For instance, in our evaluation, we observed an average bias of approximately 17.1% in web page load times (PLTs) when comparing Mahimahi emulation to running the application in the same commercial cellular environment where the traces were recorded. This error is a persistent underestimation of the PLT rather than just random variation. This issue affects other applications as well and the error may even be greater. For example, we observed a 49% error for 250 KB file downloads when Mahimahi emulated a commercial Verizon 5G as shown in §5.6.

Thus, despite record-and-replay emulation being practical and widely used, it does not support high-fidelity testing of networked systems and protocols. Minimizing emulation error is crucial, particularly for wireless protocol and application research, where record-and-replay emulations are often the most feasible evaluation platform. These errors could affect any evaluation and may even alter its conclusions, as we demonstrated in the ABR algorithms use case (§5.9). Therefore, we asked: What is causing this emulation error? And, is there a way to fix it to faithfully record and replay real-world cellular network performance?

Writing takeaways

Why study an improved Mahimahi?

(1) Existing record-and-replay methods have significant errors.

(2) Minimizing emulation error matters -> the current errors can even change a study's conclusions.

Our contributions:

  1. How the errors affect conclusions ("why this problem is worth solving")
  2. What causes the errors ("how the problem arises")
  3. How we correct the errors ("how we solve the problem")

Our first contribution is to study how the record-and-replay method used by Mahimahi can result in persistent bias (§3). Mahimahi records packet delivery opportunities by continuously saturating the link with packets (a “saturator” workload) and noting when packets arrive at the endpoint. It then replays this trace as a schedule for when the link can deliver packets after delaying those packets using a fixed propagation delay, for any workload.

However, we found that this method causes two fundamental issues. First, it fails to fully capture network base latency changes, which are prevalent in cellular networks. In fact, our measurements show that Mahimahi underestimates RTT by 13.25% and 16.88% across two operators. Second, the available bandwidth that a cellular network provides to an end-to-end connection depends significantly on that connection’s workload. For example, in our measurement using Verizon 5G, a long train with 100 back-to-back packets experiences a 2.6 times higher delivery rate than a short train with 10 packets. In such cases, Mahimahi’s saturator (i.e., heavy traffic) would see a higher rate than what shorter traffic should experience. This dependency between cellular network available bandwidth and workload poses a fundamental challenge for record-and-replay because the whole point is to record one trace (which is necessarily running one workload) and replay that trace under a variety of applications for testing. If available bandwidth depends on the workload, is faithful record-and-replay feasible?
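To make the workload dependence concrete, here is a toy delivery-rate calculation over synthetic packet-train arrival times (the traces and the resulting ratio are made up for illustration; they are not the paper's measurements):

```python
# Compute the mean delivery rate of a packet train from receiver-side
# arrival timestamps, as a train-probing client would.

PKT_BYTES = 1400  # probe packet size used throughout this summary

def delivery_rate_mbps(arrivals):
    """Mean delivery rate (Mbit/s) over one train of equal-sized packets."""
    span = arrivals[-1] - arrivals[0]
    return (len(arrivals) - 1) * PKT_BYTES * 8 / span / 1e6

# Synthetic traces: the short train is paced out more slowly, while the
# long train is served at a saturator-like rate.
short_train = [i * 0.0020 for i in range(10)]   # one packet every 2.0 ms
long_train = [i * 0.0008 for i in range(100)]   # one packet every 0.8 ms
ratio = delivery_rate_mbps(long_train) / delivery_rate_mbps(short_train)
```

With these assumed traces the long train sees a 2.5x higher delivery rate, qualitatively matching the short-vs-long gap described above.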

Our second contribution is to address these fundamental problems in a record-and-replay system called CellReplay. To solve the workload-dependence problem, one obvious approach would be to record performance under every possible workload. However, this is impractical and degenerates into simply testing every application directly on the live network, which is what record-and-replay emulation is trying to avoid. In other words, we can only record a limited number of different workloads. Another option would be to build a white-box emulation of providers’ underlying resource allocation policies; but these are proprietary and vary across providers, so we seek a black-box method based on end-to-end observations.

The approach we take is to record just two representative workloads (light and heavy) simultaneously, chosen at extremes on the range of traffic patterns, and then interpolate between them during replay to achieve high accuracy across a wide range of workloads. During the recording phase, we use two phones: one running a heavy saturator workload and the other running a light workload. The light workload is calibrated to capture RTTs and light-workload bandwidth, but is not too light as to capture the network’s transition from light to heavy bandwidth allocations. During replay, the emulator applies delay using the RTT trace and initially releases packets according to the light trace. It then splices in the heavy trace during longer packet sequences before eventually returning to the light trace after an idle period. This technique addresses the two key problems with Mahimahi’s approach mentioned above, namely capturing (1) dynamic RTTs and (2) bandwidth that depends on workload.

We implemented CellReplay using an architecture similar to Mahimahi—an emulated network interface that can be used by unmodified applications. Using randomized trials, we evaluated CellReplay’s accuracy by comparing the application performance when running under CellReplay emulation to the live networks. We tested two commercial providers’ 5G mid-band and low-band deployments, and covered multiple network conditions, including non-ideal conditions (e.g., in a crowded library) and mobility (e.g., driving). We evaluated two real-world application traffic patterns: randomized file downloads and web page loads with HTTP/1.1 and HTTP/2. These applications cover a variety of workloads, ranging from periodic small to heavy flows in file downloads to complex interleaved traffic from web page loads. Additionally, we used CellReplay to evaluate the startup phase of multiple adaptive bitrate (ABR) implementations for 4K video streaming.
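The splice the authors describe can be rendered as a small scheduling function. This is our simplified reading with synthetic PDO traces; the switch-over count is an assumed parameter, and a real emulator would consume opportunities rather than nudging past them with an epsilon:

```python
from bisect import bisect_left

SPLICE_AFTER = 25  # packets served from the light trace before splicing (assumed)

def delivery_schedule(send_time, n_packets, light_pdos, heavy_pdos):
    """Emulated delivery times for an n-packet burst sent at send_time.

    Both PDO traces are sorted lists of opportunity timestamps and are
    assumed long enough that we never run off the end.
    """
    out, t = [], send_time
    for i in range(n_packets):
        trace = light_pdos if i < SPLICE_AFTER else heavy_pdos
        j = bisect_left(trace, t)  # next opportunity at or after t
        t = trace[j]
        out.append(t)
        t += 1e-9                  # step past the consumed opportunity
    return out
```

With a light trace offering one opportunity every 2 ms and a heavy trace one every 0.5 ms, the first 25 packets of a long burst are paced at the light rate and the remainder at the heavy rate.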

We find that CellReplay substantially reduces emulation error. In web page load tests, CellReplay reduces emulation error from 17.1% with Mahimahi to 6.7%, representing a 60.8% improvement. For randomized file download tests, CellReplay lowers mean file download time errors from 7.9%-49% with Mahimahi to just 0.2%-22.4%. Moreover, CellReplay achieves lower error when replicating application performance under non-ideal network conditions, such as inside a basement (15.22% error in Mahimahi vs. 5.87% in CellReplay) and a crowded library (22.51% vs. 8.47%), and during user mobility, such as walking (14.48% vs. 4.13%) and driving (13.15% vs. 6.97%). Finally, we demonstrate CellReplay’s usefulness in evaluating ABR algorithms, as it preserves the relative ordering of ABR performance and avoids the biases observed in Mahimahi. We discuss challenges and future directions for improvement in §6. We release CellReplay along with its recorded traces as open source at https://github.com/williamsentosa95/cellreplay.

2.1 Cellular network record-and-replay

The goal of record-and-replay network emulation (within the scope of this paper) is to emulate the end-to-end network performance of an application communicating between two endpoints, ensuring performance similar to that of a live network counterpart. During the recording phase, user equipment (UE) and the server send traffic according to a predefined workload (e.g., sending packets beyond the link bottleneck rate), while observed performance metrics (e.g., throughput) are logged. This workload should be independent of the tested applications, allowing us to record traces once and reuse them for multiple applications, regardless of whether they are UDP- or TCP-based. During replay, this trace is consumed by an emulated network interface. Real applications (e.g., a web browser and web server) can connect through this interface, and traffic between the endpoints will experience artificial network conditions (e.g., time-varying latency and bandwidth) as if they were communicating over a cellular network, even though they reside on the same physical host. Our goal is for any metrics of interest—including transport-level and application-level metrics such as flow completion time or web PLT—to closely match those of the live network.

Record-and-replay can be applied to any type of network, but our interest here is on cellular networks. Their performance can be time-varying, vendor-dependent, and environment-dependent, making it difficult to generate conditions—whether in simulators, emulators with handpicked or even calibrated [42] parameters, or testbeds—that match real-world complexity. Thus, record-and-replay is especially useful in such environments, but it is also challenging to execute well.

Note that record-and-replay deals with end-to-end conditions and does not require any link- or physical-layer information or support from network operators. Like past work, we do not need to determine which hops along the path cause certain performance effects. This means that the observed performance, and its replay, may result from a combination of sources (e.g., the 5G RAN, service provider core, or the Internet to a remote endpoint). However, major performance variations are expected to originate from the cellular network [22]. We sometimes refer to the observed performance as coming from a cellular link, the path, or simply the network; all terms are equivalent for our purposes.

Network emulators. Popular network emulators, such as NetEm [16] and dummynet [32], can emulate cellular networks. Google Chrome also provides configuration profiles with fixed latency and bandwidth for cellular networks, such as "Fast" and "Slow" 3G [3]. Pantheon [42] provides calibrated emulators based on parameters like fixed propagation delay, bottleneck link rate, isochronicity, etc. These configurations are tuned to match packet traces collected from a path (including cellular network) using various congestion control protocols. iBox [7] extends this by incorporating cross-traffic. However, fixed parameters, by definition, do not capture time-varying effects, which are common in cellular networks.

Record-and-replay network emulation. Noble et al. pioneered the concept of recording the end-to-end network characteristics of a wireless network and replaying them in an emulated network in 1997 [27]. However, it is designed to emulate WaveLAN, which differs fundamentally from modern cellular networks. More recently, NemFi [21] was introduced as a record-and-replay emulator for WiFi. NemFi’s design is specific to WiFi (e.g., emulating frame aggregation) and it is not readily applicable to emulating cellular network paths.

Mahimahi. In 2015, Netravali et al. demonstrated a framework for recording and replaying HTTP traffic [25] called Mahimahi, which also included a network emulator derived from CellSim [38] to replay time-varying uplink and downlink rates in cellular networks. Mahimahi has since become the state-of-the-art record-and-replay emulator for cellular networks and is widely used to evaluate various networked applications.

We detail Mahimahi’s record-and-replay approach, as it serves as an important reference for this paper. Fig. 1 illustrates the process for the uplink only, as the same approach applies to the downlink. Mahimahi records time-varying link rates using a Saturator, which saturates both the uplink and downlink with MTU-sized packets (e.g., 1500 bytes) to ensure that the base station always has packets to deliver. The endpoint then records the arrival time of each packet. During emulation, Mahimahi treats each arrival timestamp as an opportunity to deliver a packet. A sequence of such timestamps constitutes a packet delivery opportunity (PDO) trace. Each PDO entry represents an opportunity to deliver an MTU-sized amount of data, which can be either a single MTU-sized packet or multiple smaller packets whose combined sizes add up to the MTU. If no packets are queued for delivery when the PDO occurs, the opportunity is lost. Mahimahi also emulates the RTT delays on a cellular link, albeit using a fixed propagation delay. That delay is determined by measuring the minimum packet RTT (e.g., via ICMP ping) and halving that value.
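Our reading of this record-and-replay loop as code (a simplification for illustration, not Mahimahi's actual implementation; for instance, we fold the whole propagation delay into the packet's arrival at the emulated link):

```python
from collections import deque

MTU = 1500  # each delivery opportunity carries up to one MTU of data

def replay(pdo_times, sends, prop_delay):
    """Replay a PDO trace; sends is a list of (send_time, size_bytes)."""
    pending = deque(sorted(sends))  # packets not yet at the emulated link
    queue = deque()                 # packets waiting for an opportunity
    delivered = []
    for t in pdo_times:             # one delivery opportunity per timestamp
        while pending and pending[0][0] + prop_delay <= t:
            queue.append(pending.popleft())
        budget = MTU                # an opportunity moves at most MTU bytes
        while queue and queue[0][1] <= budget:
            budget -= queue[0][1]
            delivered.append((queue.popleft(), t))
        # If nothing was queued, the opportunity is simply lost.
    return delivered
```

Note how the fixed `prop_delay` only shifts when packets become eligible; the actual delivery instants always come from the recorded opportunity timestamps.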

3 Live record-and-replay is hard

Why is record-and-replay challenging? Also, why does the current state-of-the-art method (i.e., Mahimahi) fail to accurately replicate the performance of networked applications on a cellular network? Below, we answer both questions using measurements and insights from real cellular networks.

The measurements in this section were collected from two commercial cellular networks: T-Mobile 5G mid-band and Verizon 5G low-band, using a Samsung Galaxy S22 (SGS) phone tethered to a laptop. The laptop, equipped with an Intel i7 CPU and 16GB RAM, ran Ubuntu 20.04 and served as our client. The server was located within close proximity (<10 miles) of the client. We confirmed that all results remained consistent when using a different phone model (Google Pixel 5).

3.1 Variability in base RTT

The base RTT is defined as the round-trip time (RTT) of a packet from the client to the server when there is no self-inflicted congestion. In cellular networks, this RTT is expected to be variable, as packets frequently experience delays (jitter) due to link-layer retransmissions, channel contention, base station scheduling, and device mobility. Emulating this variability is critical for testing latency-sensitive applications such as VR/AR and remote driving.

Mahimahi also emulates delay variability based on packet delivery traces. Despite using a fixed propagation delay, it must hold the packet until it sees a PDO before releasing it (Fig. 1). However, our measurements suggest that it fails to fully capture the base RTT variability. To quantify this error, we compared the packet RTT reported in live experiment with that of Mahimahi.

Specifically, we conducted repeated packet RTT tests and Mahimahi recordings individually over live networks, following the randomized trial approach (§5.3). The packet RTT test involves a client sending a 1400-byte UDP packet (roughly an MTU-size) every 50 ms to our echo server and noting each packet’s RTT. We repeated this test 10 times, and both the RTT test and Mahimahi recording session lasting 60 seconds. Next, we ran the exact same packet RTT test under Mahimahi’s emulated interface, using the recorded trace and setting the propagation delay to half of the minimum RTT from the live packet RTT tests.
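The probe loop just described can be sketched as follows (our own client code; the echo server's host and port are placeholders for the paper's setup):

```python
import socket
import time

def rtt_probe(host, port, n_probes=10, interval_s=0.05, size=1400):
    """Send `size`-byte UDP probes every `interval_s`; return per-probe RTTs.

    Probes with no echo within 1 s are recorded as None (lost).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(1.0)
    rtts = []
    for i in range(n_probes):
        payload = i.to_bytes(4, "big") + bytes(size - 4)  # sequence + padding
        start = time.monotonic()
        sock.sendto(payload, (host, port))
        try:
            sock.recvfrom(size)
            rtts.append(time.monotonic() - start)
        except socket.timeout:
            rtts.append(None)
        time.sleep(interval_s)
    return rtts
```

A run such as `rtt_probe(SERVER_HOST, SERVER_PORT, n_probes=1200)` would cover the 60-second session described above at the 50 ms probing interval.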

Figure 2 shows the cumulative distribution function (CDF) of packet RTTs on the live network and Mahimahi replay. It indicates that Mahimahi underestimates packet RTT (by 16.88% and 13.25% at the median for T-Mobile and Verizon, respectively), and its distribution differs from that of the live network (as seen in the shape of the CDF curve). This suggests that simply increasing Mahimahi’s fixed propagation delay (i.e., shifting the CDF curve to the right) does not capture the variability. Note that this experiment was performed under stationary conditions with a strong signal, where network performance is more stable. In a mobile scenario, where packet RTT can vary more or even change, Mahimahi’s fixed propagation delay approach may perform even worse.

This is because PDOs, in principle, only partially capture base delay changes. Figure 3 illustrates a case where the packet base delay changes at t2 from 20 ms to 40 ms, and Mahimahi fails to apply the correct delay for a certain packet. Note that base delay changes may occasionally occur in live cellular networks due to factors such as increased retransmission delays caused by a weakened radio signal. In this illustration, during the recording phase, the four packets delivered from t1 to t2 experience a 20 ms base delay, while packets from t2 to t3 experience a 40 ms base delay. The receiver indeed perceives a delay change since, after receiving four packets, it does not receive any packets for 20 ms before receiving the next set. This 20 ms “blackout” period is also reflected in the PDO trace during the replay.

A PDO blackout means no packet delivery. Any packets scheduled for delivery during this period will be delayed until the next available opportunity. As a result, only packets arriving during the blackout period (relative to the link emulator) will experience a delay, while others will not. However, sparse workloads, such as those in Fig. 3b, may have packets arriving outside the blackout period and thus not experiencing any delay.

Conclusion: Fixed propagation delay and PDOs are insufficient to model cellular network delay variability. Therefore, we need to record packet RTTs over time through probing during the recording phase and apply time-varying delays during the replay.
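A minimal sketch of the fix this conclusion calls for, as we understand it (our illustration, not CellReplay's exact mechanism): keep the probe's (timestamp, RTT) samples and, for each replayed packet, apply half the RTT sampled nearest before its send time.

```python
from bisect import bisect_right

def one_way_delay(rtt_trace, t):
    """rtt_trace: time-sorted (timestamp, rtt_seconds) samples from probing.

    Returns half the latest RTT sample at or before time t, clamped to the
    first sample for times before the trace starts.
    """
    times = [ts for ts, _ in rtt_trace]
    i = max(bisect_right(times, t) - 1, 0)
    return rtt_trace[i][1] / 2

# Made-up trace in which the base delay doubles partway through:
trace = [(0.0, 0.040), (1.0, 0.040), (2.0, 0.080)]
```

Unlike a fixed propagation delay, this lookup reproduces base-delay shifts like the one in Figure 3 for every packet, not only for packets that happen to land inside a PDO blackout.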

3.2 Performance depends on workload

Recall that Mahimahi uses a Saturator to ensure that the network always has packets ready to send, and so any available PDOs will be consumed and recorded. Then, a subset of those PDOs is used when replaying any given workload. An underlying assumption is that the same PDOs would have been available for the replayed workload. However, our experiments on live cellular networks show that the network substantially changes the PDOs it provides depending on the workload. We reached this conclusion upon observing that short flows consistently experience lower bandwidth than longer flows.

To demonstrate this, we conducted live experiments in which our server periodically sent packet trains to a client. Each train consists of N back-to-back UDP packets, each 1400 bytes, followed by a 100 ms gap (long enough to clear out any packets from previous trains). We refer to N as the train size. Each packet in a train is tagged with a train number and a sequence number reflecting its order within a train (P_0, ..., P_{N-1}). The client then records each packet’s arrival time. Additionally, the client sends a 100-byte ACK back to the server upon receiving the last packet of a train (P_{N-1}). The train completion time (TCT) is defined as the time at which the server receives the ACK minus the time it sent P_0. We performed this test with different train sizes following our randomized trial approach (§5.3). Each test lasted 5 seconds, and there are 12 tests with different train sizes (1, 10, 25, 50, ..., 500) in one trial. We repeated this trial 50 times. Finally, we also recorded network performance with the Saturator, as Mahimahi would.

Figure 4 shows the mean TCT for each train size on T-Mobile and Verizon 5G. The dashed black line represents the TCT if the link had a fixed bandwidth equal to that observed by the Saturator (equivalent to the mean TCT with Mahimahi replaying the Saturator). If network performance were independent of workload, then the mean bandwidth would remain the same for all train sizes, and the mean TCT would follow a linear function of the amount of data being delivered, i.e., a linear function of N. In particular, it should coincide with the bandwidth observed by Saturator. However, the observed TCTs do not conform to a straight line and generally do not match the Saturator line, with TCTs being up to 11.5% higher than the Saturator in T-Mobile and 35.8% higher in Verizon. This indicates that the service experienced by the train workloads is consistently and significantly different than the service experienced by the Saturator’s heavy workload.

To better understand these observations, we examined the arrival times of packets within each train, revealing the PDO patterns recorded by different trains. For each packet P_i ∈ {P_0, ..., P_{N-1}} in each train, we calculated its relative arrival time as t(P_i) − t(P_0), where t(·) denotes the receiver’s observed arrival time of a packet. We present the mean relative arrival time as a function of packet sequence number i for different train sizes N in Figure 5.
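The per-index averaging behind Figure 5 takes only a few lines (our helper code; the toy timestamps in the usage below are illustrative):

```python
def mean_relative_arrivals(trains):
    """trains: per-train lists of receiver arrival times, all of size N.

    Returns, for each sequence number i, the mean of t(P_i) - t(P_0)
    across trains.
    """
    n = len(trains[0])
    return [sum(tr[i] - tr[0] for tr in trains) / len(trains)
            for i in range(n)]
```

For example, `mean_relative_arrivals([[0.0, 2.0, 3.0], [10.0, 14.0, 15.0]])` averages the two trains' relative arrival curves index by index; a steeper curve means a lower delivery rate, as noted below.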

These results confirm that the link’s rate (or PDO) depends on the workload. The network provides a lower delivery rate for the first few packets in any train (note that in these plots, a higher slope indicates a lower delivery rate). As the train progresses, the delivery rate increases, and the slope begins to approach that of the Saturator. We also repeated the same packet train tests in the opposite (uplink) direction but found that the link’s rate remains uniform regardless of N (i.e., it follows a straight line). We suspect that the rate-workload dependence in the downlink results from the operator’s proprietary packet scheduling implementation. For example, cellular network packet scheduling may depend on historical application traffic and the current queue depth when scheduling packets [9]. This rate-workload dependence was observed across all conditions tested in §5, including both peak and off-peak hours.

Interestingly, T-Mobile and Verizon have dramatically different implementations. T-Mobile’s delivery rate approaches the Saturator’s rate (slope) as i increases, mostly regardless of N. Verizon’s delivery rate, on the other hand, asymptotes to significantly different values depending on N, with larger N approaching the Saturator’s heavy-workload delivery rate more closely. Additionally, in T-Mobile, the first 50 packets of trains with N > 50 are delivered more slowly than the 50 packets of the train with N = 50, whereas Verizon shows an inverted behavior. This further complicates record-and-replay, as we aim for a general and relatively accurate approach across different operators and locations.

We also found a more minor way in which performance depends on workload: the RTT of a packet varies with packet length by an amount not explained by throughput. For instance, based on our measurements on Verizon, the RTT of a 100-byte packet is 6.8 ms faster than that of a 1400-byte packet, even though the difference in serialization time at the bottleneck link rate (60 Mbps) should have been only ≈ 0.17 ms. This outcome aligns with findings from the prior latency study on 5G [12] and may be attributed to the additional time required for reassembling larger data chunks. Due to space constraints, we omit detailed results.

Conclusion: Cellular network performance can depend significantly on the workload. Our observations indicate that cellular providers allocate delivery rates (i.e., bandwidth or PDOs) differently for light and heavy workloads. The Saturator forces the link into its heaviest workload mode, which generally increases the available bandwidth. Consequently, using Saturator’s PDOs for a lighter workload can result in consistent bias (flows complete faster than they should). This suggests that diverse workloads are needed to capture different PDOs, but choosing the right representative workloads remains challenging.


CellReplay

4.1 Design overview

At a high level, we want to solve the problems of capturing time-varying base RTT (§3.1) and workload-dependent performance (§3.2) in both record and replay.

We begin with the latter problem (§3.2). To achieve highly accurate emulation, an obvious solution is to record performance under different workloads. However, recording every possible workload is impractical and degenerates into simply testing the apps directly on the live network. We also aim for the recorded workload to be independent of the tested apps (§2.1). Therefore, we can only record a limited number of different workloads.

From §3, we observe that short and continuous traffic are handled differently, while medium-length flows exhibit performance somewhere between the extremes. Inspired by this observation, our key approach is to record two workloads chosen at the extreme points on the spectrum of traffic patterns: (1) Packet train probing to capture link PDOs under short and bursty load (light PDOs), and (2) Saturator to capture PDOs under heavy continuous load (heavy PDOs). These workloads capture the essential behavior of link rate differentiation under light and heavy flows. We used two phones to record both traces simultaneously and show in our evaluation that interference between them is limited in practice.

During replay, we leverage both PDOs to match the provided workload. When the application under test begins sending packets, we initially release the first sequence of packets according to the light PDOs and then transition to heavy PDOs as the packet sequence lengthens. After a certain gap in the workload, we return to the light PDO trace.

Returning to the problem of time-varying RTT (§3.1), we design the packet trains to avoid inflating queues, so that it gives us a good measurement of base RTT. The packet train probing serves a dual purpose: to record changing base RTTs (for any workload) and PDOs for shorter packet sequences.

Finally, the effectiveness of the above design depends on parameter choices. For example, a too-small train will not capture the network’s light workload behavior completely, forcing us to go to the heavy PDO trace too soon; if the trains are too large, we cannot sample frequently as network performance may then resemble that of a heavy workload, and there is a risk of inflating base RTT measurements due to congestion. Thus, before recording, we conduct a calibration phase to determine train size, train gaps, and other parameters that will yield the least error.

In summary, CellReplay has three components. When recording network traces in a specific environment, we first perform an automated calibration of parameters in that environment, and then start recording live traces by running packet train probing and Saturator in parallel. These traces are then used to emulate the network during replay. The following subsections detail each of these components: record (§4.2) and replay (§4.3), before returning to calibration (§4.4), which is best understood after seeing the rest of the design.

4.2 Recording network traces

There are three time-series metrics we want to record: (1) base delay, (2) light PDOs, and (3) heavy PDOs.

Base delay and light PDOs. The base delay trace should reflect the network’s round-trip time (RTT) without any queueing delays introduced by the workload itself. Ideally, this trace would be captured by periodically measuring the RTTs of small packets. The one-way base delay can be estimated by halving the RTT.

Light PDOs can be captured by periodically sending a limited number of back-to-back packets, i.e., a packet train, in both the uplink and the downlink. The number of packets should be small enough to capture the network’s light workload behavior and ideally some of the transition to moderate workload, without pushing the network into heavy workload mode. In particular, the train should be short enough to avoid “warming up” the network for the following train. As a result, both the base delay trace and light PDOs share similar requirements. We can collect both simultaneously using a packet train probing workload on a single device. This workload uses MTU-sized packets, as a significant amount of traffic is still required to capture the transition point between light and heavy modes.

Figure 6 provides an example of how this process works. In every G ms, (a) the client sends U back-to-back MTU-sized packets to the server. Upon receiving the first packet of the train, (b) the server sends back D back-to-back MTU-sized packets. The server also (c) records each packet’s arrival within that train and uses it to calculate the uplink light PDOs as the arrival time of each packet minus the arrival time of the first packet (since, during replay, the base delay will be added). When the client receives the corresponding downlink train, (d) it infers the current base RTT as the receipt time of the first downlink packet minus the send time of the first uplink packet (within that train). It then calculates the downlink light PDOs based on packet arrival times, just as the server did.
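The client-side bookkeeping in step (d) can be sketched as follows (function and variable names are ours; times are in seconds): given the send time of the train's first uplink packet and the arrival times of the D downlink packets, infer the base-RTT sample and the downlink light PDOs.

```python
# Sketch of step (d) of the packet-train record process (names are ours).

def process_train(first_uplink_sent, downlink_arrivals):
    # Base RTT: receipt time of the first downlink packet minus the send
    # time of the first uplink packet of the same train.
    base_rtt = downlink_arrivals[0] - first_uplink_sent
    # Light PDOs: arrivals relative to the train's first packet (the
    # base delay is added back during replay).
    light_pdos = [t - downlink_arrivals[0] for t in downlink_arrivals]
    return base_rtt, light_pdos

rtt, pdos = process_train(10.0, [10.030, 10.031, 10.033])
# rtt is approximately 0.030 s; pdos approximately [0.0, 0.001, 0.003]
```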

Heavy PDOs. The heavy PDOs are collected using a Saturator (similar to Mahimahi) that saturates the link with packets beyond its bottleneck rate, effectively “requesting” the link to remain in max bandwidth mode. In practice, we developed our own Saturator tool, which sends MTU-sized packets in both the uplink and downlink at fixed upload and download rates, eliminating the need for two phones as in Mahimahi’s Saturator [38]. We overestimated (by 25%) the max link bandwidth measured using an existing bandwidth test application like iperf or speedtest. We confirmed that the reported throughput from our Saturator is similar to UDP iperf.

However, running both Saturator and packet train probing on a single device is not feasible, as the Saturator will overload the queue, leading to two issues: inflating the base delay measurement and keeping the link in maximum bandwidth state. One solution is to run these workloads in separate trials, which may be permissible under stationary conditions but is less ideal under mobility. Alternatively, we chose to perform these workloads on separate identical phones placed in close proximity. This is possible since most (if not all) cellular network providers employ user-separated queues [38] such that the Saturator traffic will not inflate the packet train probing measurement results. Beyond the known separation of queues, we confirmed that light vs. heavy bandwidth allocation is also separated on both Verizon and T-Mobile: when one phone runs the Saturator, the other phone running packet train probing still experiences light-workload service.

Note that the two-phone method is not without limitations. The phones may not always connect to the same base station all the time, especially in a mobile environment where handoffs could occur slightly differently. We leave this discrepancy for future work.

4.3 Replaying network traces

CellReplay takes input traces of base delay, light PDOs, and heavy PDOs over time to emulate network performance in a virtual interface. At a high level, CellReplay first applies a base delay to each packet based on the delay trace, adjusts the delay for any latency offset from packet-size calibration (§4.4), and then releases packets according to either the light or heavy PDOs.

In more detail, CellReplay operates in two states: active and inactive. Initially, CellReplay is in the inactive state until it receives a packet at some time t, relative to the start of the emulation. This event triggers CellReplay to enter the active state, which involves preparation as shown in Figure 7. CellReplay searches for the most recent base delay (DELAY) and light PDOs (LightPDO) where the timestamp is ≤ t. Since the trace is sampled per G, linear interpolation is used to assign DELAY to packets arriving between two samples. CellReplay then saves DELAY and constructs temporary PDOs (TempPDO) by adding t + DELAY to every PDO entry in LightPDO. It then concatenates these with the suffix of the heavy PDOs, starting from t + DELAY + max(LightPDO) + 1. The system is now done entering the active state.
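A minimal sketch of this active-state preparation, under our own assumptions about the data layout: times are in ms, the base-delay trace is a list of (timestamp, delay) samples spaced G apart, light PDOs are release times relative to the train start, and heavy PDOs are absolute release times on the recorded timeline.

```python
# Sketch of entering the active state (names and layout are ours).
import bisect

def enter_active(t, delay_trace, light_pdo, heavy_pdo):
    # DELAY: linearly interpolate the base delay between the two samples
    # surrounding t (t is assumed >= the first sample timestamp).
    times = [ts for ts, _ in delay_trace]
    j = bisect.bisect_right(times, t) - 1
    if j + 1 < len(delay_trace):
        (t0, d0), (t1, d1) = delay_trace[j], delay_trace[j + 1]
        delay = d0 + (d1 - d0) * (t - t0) / (t1 - t0)
    else:
        delay = delay_trace[j][1]  # past the last sample: hold its value
    # TempPDO: light PDOs shifted to absolute time, then the suffix of the
    # heavy PDOs starting from t + DELAY + max(LightPDO) + 1.
    start = t + delay
    temp = [start + p for p in light_pdo]
    splice = start + max(light_pdo) + 1
    temp += [p for p in heavy_pdo if p >= splice]
    return delay, temp

delay, temp = enter_active(25, [(0, 20.0), (50, 30.0)], [0, 2, 4],
                           [53, 55, 57, 59])
# delay = 25.0 (interpolated); temp = [50.0, 52.0, 54.0, 55, 57, 59]
```

Early entries of the returned schedule come from the light trace; later ones are the heavy-trace suffix, so a long-running flow slides into heavy-workload service exactly as described above.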

As long as the system remains in the active state, packets are initially delayed by DELAY plus a size-based delay compensation comp(size(P)). As discussed in §3.2, base delay may depend on packet size; the specific adjustment comp(·) is determined during calibration. DELAY and TempPDO remain unchanged unless the system enters an inactive state. After a packet is delayed, it is either placed in a PDO queue or dropped if the queue exceeds B bytes. Packets are dequeued according to the time schedule in TempPDO using byte-wise dequeueing. This process mirrors Mahimahi’s PDO replay, with CellReplay using the temporary (concatenated) PDO trace. As a result, early packets in the active state will experience light PDOs, while later packets will experience heavy PDOs. Once F milliseconds pass without any packets in the PDO queue, CellReplay returns to the inactive state. Any future arriving packet will then trigger the procedure to reenter the active state, as described above.

4.4 Parameter Calibration

We describe how to select values for the parameters in Table 1. The parameters \(U\), \(D\), \(G_{min}\), \(F\), and \(comp(s)\) are exclusive to CellReplay and are calibrated in every new environment before recording traces. This process is automated. \(B\) is a standard network emulation parameter, derived using a classical max-min approach [11]. For details, see §A.3.

Setting \(U\) and \(D\). We profile the network to determine a packet train size that provides the best overall approximation of the network across other sizes. We first conduct randomized experiments with different packet train sizes (the same as §3.2) using a fixed train gap that is conservatively large enough to ensure the link returns to its light-workload state. In our implementation, we use a 100 ms gap, and the set of train sizes we consider is \(\{5, 25, 50, 75, \dots, X\}\) where \(X\) is chosen such that the resulting mean sending rate (including gaps between trains) is half of the bottleneck throughput.

After running for 10 trials, we compute \(R_N(i)\) for each train size, which represents the mean relative arrival time of the \(i\)-th packet in an \(N\)-packet train. The relative arrival time is the packet’s arrival time minus that of the first packet in its train. \(R_N\) essentially represents the mean light PDOs of an \(N\)-sized train. We further define \(R^*_N(i)\) as the estimated mean arrival time of the \(i\)-th packet in replay mode, assuming we choose to record trains of size \(N\). More specifically, recall that during replay, we follow light PDOs before splicing in the heavy PDOs; therefore, \(R^*_N(i) = R_N(i)\) for \(i \leq N\), and otherwise, \(R^*_N(i) = R_N(N) + heavy(i-N)\), where \(heavy(x)\) is the delivery delay of \(x\) packets based on the mean throughput of the heavy workload.

The purpose of \(R^*_N(i)\) is to help us calculate the estimation error of \(R_N(N)\) (i.e., the mean relative arrival time of the last packet of an \(N\)-packet train) for every other train size \(N\) that we have tested. Let \(L\) be the train size used to estimate the error for other trains. Fig. 8(a) shows the estimation error when using \(L = 100\). The blue line is \(R^*_{100}(i)\), which represents the PDOs based on the concatenation of mean light and heavy PDOs. If we use that to estimate \(R_N(N)\) of other trains \(N \in \{25, 50, 200\}\), the prediction will result in some error (shown in red). For each train size \(L\), we compute the mean error over all tested train sizes, and our chosen train size is the \(L\) that yields the smallest mean error. We conduct the entire procedure in two directions (uplink and downlink) separately to choose the train lengths \(U\) and \(D\).
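The selection of \(L\) can be sketched as follows (names and the synthetic R values are ours; indices are 0-based, so `R[N][N - 1]` is the last packet's mean relative arrival time in ms, and `heavy_rate` is the heavy workload's mean throughput in packets per ms):

```python
# Sketch of the train-size selection in §4.4.

def r_star(R, L, i, heavy_rate):
    # Predicted mean arrival of packet i when recording trains of size L:
    # follow L's light PDOs to the train end, then splice in heavy(x).
    if i < L:
        return R[L][i]
    return R[L][L - 1] + (i - (L - 1)) / heavy_rate

def choose_train_size(R, heavy_rate):
    sizes = sorted(R)
    best, best_err = None, float("inf")
    for L in sizes:
        # Mean error of predicting the last-packet arrival of every
        # other tested train size using L's spliced curve.
        errs = [abs(r_star(R, L, N - 1, heavy_rate) - R[N][N - 1])
                for N in sizes if N != L]
        if sum(errs) / len(errs) < best_err:
            best, best_err = L, sum(errs) / len(errs)
    return best

# Synthetic example: 2 ms/packet for the first 10 packets, then a tail
# slightly slower than the pure heavy rate of 1 packet/ms.
R = {
    5: [2.0 * i for i in range(5)],
    10: [2.0 * i for i in range(10)],
    20: [2.0 * i if i < 10 else 18.0 + 1.2 * (i - 9) for i in range(20)],
}
# r_star(R, 10, 19, 1.0) -> 28.0; choose_train_size(R, 1.0) -> 20
```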

Finding \(G_{min}\). If the train gap \(G\) is too small, the link will not have enough time to reset to its light-workload mode before the next train arrives. We typically set \(G = 50\) ms for stationary conditions and \(G = 100\) ms for mobile conditions. However, in certain environments, this may not be large enough. We thus aim to find \(G_{min}\), the smallest value at which the link has enough time to return to its light-workload mode, ensuring that our chosen \(G\) is at least that value.

We conduct another randomized experiment, this time testing different train gaps. Again, we send sequences of trains; however, in this case, we fix the train length at our chosen value (\(U\) or \(D\), for uplink or downlink, respectively, in separate experiments) and vary the gap \(g\). We begin with a conservatively large gap (as in the previous experiment) and test gaps of decreasing size; in our implementation, \(g \in \{100, 90, 80, \dots, 10\}\) ms. Let \(r_{last}(g)\) denote the mean relative arrival time of the last packet in trains with gap \(g\). Intuitively, as \(g\) decreases to a too-small size, the link will begin staying in its heavy-workload mode, causing \(r_{last}(g)\) to decrease. We set \(G_{min}\) as the smallest \(g\) for which \(r_{last}(g)\) is within 20\% of its value with the conservatively large gap, i.e., \(r_{last}(100ms)\). In the example of Fig. 8(b), CellReplay selects \(G_{min} = 30\) ms.
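This rule, and the derivation of \(F\) from the same data described in §A.2, can be sketched as follows (the `r_last` values are illustrative, in ms, not measurements):

```python
# Sketch of the G_min rule: r_last maps each tested gap g (ms) to the
# mean relative arrival time of the train's last packet.

def find_g_min(r_last, baseline=100, tolerance=0.20):
    base = r_last[baseline]
    # Smallest gap whose r_last stays within 20% of the baseline value.
    return min(g for g, r in r_last.items()
               if abs(r - base) <= tolerance * base)

r_last = {100: 10.0, 90: 9.9, 50: 9.5, 30: 8.4, 20: 6.0, 10: 4.0}
g_min = find_g_min(r_last)   # -> 30, as in the Fig. 8(b) example
# F then follows from the same data (§A.2): the chosen gap minus the
# time the queue takes to clear, observable as r_last(G_min).
f = g_min - r_last[g_min]    # -> 21.6 ms
```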

Inferring \(F\). Recall that \(F\) determines how long CellReplay’s emulated link remains idle in the active state before transitioning back to the inactive state. We derive \(F\) using the same data collected to select \(G_{min}\), which involves calculating the difference between \(G_{min}\) and the time required for the queue to clear, which is observable as \(r_{last}(G_{min})\). For details, see §A.2.

Inferring \(comp(s)\). We profile how RTT is affected by packet size by sending randomly sized packets between \(\{100, 200, \dots, 1400\}\) bytes every 50 ms to a receiver that responds with a 100-byte ACK. We then measure the RTT difference for a packet size of \(s\) compared to the RTT of 1400-byte packets and model this difference as \(comp(s)\). We describe this in more detail in §A.1.
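A sketch of a table-based \(comp(s)\) with linear interpolation between the probed sizes (the sample offsets below are illustrative, not measured values):

```python
# Sketch: comp(s) as measured RTT offsets (ms) relative to 1400-byte
# packets, linearly interpolated between probed sizes.

def make_comp(samples):
    sizes = sorted(samples)
    def comp(s):
        s = min(max(s, sizes[0]), sizes[-1])   # clamp to probed range
        for a, b in zip(sizes, sizes[1:]):
            if a <= s <= b:
                fa, fb = samples[a], samples[b]
                return fa + (fb - fa) * (s - a) / (b - a)
    return comp

comp = make_comp({100: -6.8, 800: -3.0, 1400: 0.0})
# comp(100) -> -6.8 (small packets finish ~6.8 ms earlier, per §3.2)
```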

(1) Design Overview

  • Core goals:
    1. Capture the time-varying base RTT
    2. Handle the dependence of network performance on the workload
  • Core idea: record only two extreme workloads rather than every possible one
    1. Light workload: use packet train probing to capture the base RTT and the light-workload delivery opportunities (Light PDOs)
    2. Heavy workload: use the Saturator to capture delivery opportunities at maximum bandwidth (Heavy PDOs)
  • Hybrid replay: during replay, dynamically interpolate and switch between the light and heavy traces according to the traffic the application actually sends
Takeaways
  1. Interpolating between the Light and Heavy traces
  2. Automated, environment-specific parameter calibration
    • The idea: before the real race (recording), let the athlete (the network) run a few warm-up laps (calibration), record its recovery speed (Gmin) and burst duration (U/D), then plan the race strategy from those measurements

(2) Recording Network Traces

  • What is recorded:

    1. Base delay: the RTT without queueing delay, obtained via probe packets
    2. Light PDOs: obtained by periodically sending short back-to-back packet trains
    3. Heavy PDOs: obtained by a Saturator that keeps sending above the bottleneck rate
  • Two-device strategy:

    • To keep the two workloads from interfering with each other (e.g., the Saturator filling queues and polluting the base RTT measurement), two identical phones record simultaneously
    • One runs the light-workload probing, the other the heavy-workload Saturator
    • Per-user queue separation in cellular networks keeps the two measurements independent

(3) Replaying Network Traces

  • Mechanism: CellReplay acts as a virtual network interface with two states, Active and Inactive
  • Dynamic splicing algorithm:
    • When an arriving packet activates the system, packets are first released according to the recorded Light PDOs
    • If the packet sequence grows beyond what the light trace covers, the Heavy PDOs are smoothly spliced in
    • After a sufficiently long idle gap, the system resets to the light-workload state
  • Delay handling: the recorded base delay is applied, with a small compensation per packet size
A closer look: how the compensation, interpolation, and splicing work

(1) How interpolation and splicing work

The key cellular trait: bandwidth is low when a transfer starts (cold start) and rises once data keeps flowing (warm state).

  • Light trace: records the network's cold-start behavior (the first few packets are delivered slowly).
  • Heavy trace: records the network's warm-state behavior (full speed).

CellReplay's goal: when the application sends data, first follow the light trace's schedule; if the application sends many packets, transition smoothly to the heavy trace's schedule.

CellReplay maintains a state machine with two states: Inactive and Active.

Step 1: Activation and lookup

Suppose a packet arrives at emulation time t while the system is currently Inactive.

CellReplay treats this as the start of a new transfer and does two things:

  1. Base delay interpolation:

    • The recorded trace is discrete (e.g., one sample every 50 ms)
    • If the packet arrives at t = 25, the system linearly interpolates between the base delays recorded at t = 0 and t = 50 to obtain the precise base delay at that instant
    • Note: this is the main place the "interpolation" applies, to obtain an accurate startup delay
  2. Constructing the temporary schedule (TempPDO):

    • This is the crucial step. The system does not simply "switch" modes; it builds a fresh packet release schedule (TempPDO) on the spot, spliced from two parts
    • Head (from the Light trace):
      • Take a recorded set of light PDOs (e.g., the relative arrival times of the first 10 packets) and shift them by the current time t plus the base delay \(BaseDelay\)
    • Tail (from the Heavy trace):
      • Once the light trace is exhausted (say, from the 11th packet on), the system fills in later release opportunities from the heavy trace
      • The splice point: on the heavy trace's timeline, find the instant just after the light trace ends, take the heavy PDOs from there on, and append them to the schedule
      • The arithmetic: start of the heavy part = t + \(BaseDelay\) + max(LightPDO) + 1
      • Effect: if traffic keeps flowing, the emulated bandwidth transitions seamlessly to the "full-speed" state

Step 2: Queue execution

The system now holds the synthesized schedule TempPDO:

  • Packet 1: released at its light-trace time (light mode)
  • Packet 2: released at its light-trace time (light mode)
  • ...
  • Packet N: released at its light-trace time (light mode ends)
  • Packet N+1: released at its heavy-trace time (entering heavy mode)
  • Packet N+2: released at its heavy-trace time (heavy mode, faster rate)

Runtime logic: the emulator queues the packets the application sends.

  • If the application sends only 5 packets, all of them fall in the light-mode window (emulating slow short flows)
  • If it sends 1000 packets, the first N go out in light mode and the remaining 900+ naturally slide into the heavy-mode window and enjoy the high bandwidth

Step 3: Reset / Inactive

The system does not stay in the Active state forever. A key parameter here is **F (the idle gap)**.

  • If the queue empties and no new packet arrives for F milliseconds
  • CellReplay deems the connection "cooled down"
  • The state machine switches back to Inactive
  • Consequence: the next arriving packet triggers Step 1 again, restarting from the Light trace

(2) How are the parameters tuned, and how do we know they work well?

Covered under Parameter Calibration below

(4) Parameter Calibration

  • Automated calibration: before the actual recording, the system automatically runs a calibration routine to adapt to the specific network environment
  • Key parameters:
    • Train lengths (U, D): find the packet-train length that captures light-workload behavior while predicting the transition to heavy workload with the least error
    • Train gap (\(G_{min}\)): find the smallest gap that gives the link enough time to return to its light-workload (reset) state between probes
    • Size compensation (comp(s)): measure how packet size affects RTT (small and large packets may differ), so replay can correct for it

How exactly should the parameters be tuned, and how do we make sure they are effective?

CellReplay works largely because it is not just a "recording tool" but a system with automated calibration built in

Before the actual recording, CellReplay runs a probing routine to learn the current network's temperament. How does it know a parameter is suitable?

The core logic: use one part of the measurements to predict another, already-known part, and pick whatever yields the smallest prediction error (similar to cross-validation in machine learning)

[1] Train lengths (U and D): how long still counts as a "light" workload?

This is the most critical parameter:

  • If the train is too short (e.g., a single packet), it cannot capture the network's cold-to-warm transition
  • If it is too long, the network goes straight into the warm state and the point of recording a "light" workload is lost

  • Calibration method (a "bake-off"):

    1. Full-range testing: the system first sends trains of various lengths (N), e.g., 5, 25, 50, ..., 500 packets, and records their actual delivery schedules
    2. Simulated prediction: suppose length L = 100 is the candidate standard; the system uses L's data to predict the behavior of other lengths (say N = 200 or N = 300)
    3. Error computation: compare the predicted schedule against the measured schedule (e.g., the real N = 200 trace) and compute the error
    4. Selection: each length takes a turn as the standard (L); its mean prediction error over all other lengths is computed, and the length with the smallest mean error becomes the recording parameter (U or D)

Conclusion: how do we know a length is suitable? Because it performs best at predicting every other length

[2] Minimum gap (\(G_{min}\)): how long before an idle network "cools down"?

We need to know how long the network must idle before it resets to its light-workload (cold-start) state. If the gap is too short, the network stays in high-speed mode and the next probe is distorted

Calibration method (shrinking the gap):

  1. Baseline: first probe with a large gap (e.g., 100 ms) and record the "standard" cold-start delivery rate
  2. Stress test: keep the train length fixed and keep shrinking the gap between trains (90 ms, 80 ms, ..., 10 ms)
  3. Finding the knee:
    • When the gap shrinks past some critical point, delivery suddenly speeds up (the network has not "cooled down" yet and still holds its high-speed resources)
    • The paper's criterion: once the observed delivery behavior deviates from the baseline by more than 20%, the gap is considered too short
  4. Setting the parameter: take the smallest gap that still behaves like the baseline as \(G_{min}\)

[3] Idle timeout (F): when does the emulator switch back to "light" mode?

This parameter decides when the replay state machine resets from Active (splicing mode) back to Inactive (waiting mode)

Calibration method:

  • Derived directly from \(G_{min}\)
  • The logic: since \(G_{min}\) is the physical time the real network needs to reset, the emulator's reset countdown F should match that time (minus the time it takes the queue to drain)

[4] Size compensation (comp(s)): how much slower are large packets?

The RTT difference between packet sizes is not explained by serialization time (size/bandwidth) alone; network equipment may simply take longer to handle large packets

Calibration method:

  • Send packets of various sizes (100 B, 200 B, ..., 1400 B)
  • Measure their RTTs
  • Fit a simple function comp(s) relating packet size to the extra delay; during replay, look up and compensate the delay based on each application packet's size
What is the criterion for a "suitable" parameter?

In short, CellReplay does not rely on rules of thumb or default values

Its criterion for "suitable" is: in the current network environment, here and now, this parameter set reproduces the traffic patterns observed during calibration (from short to long flows) with the smallest numerical error, verifying in situ!!!

It is like letting the athlete (the network) run a few warm-up laps (calibration) before the real race (recording), recording their recovery speed (Gmin) and burst duration (U/D), and then planning the race strategy from those measurements

Evaluation

Our goal is to evaluate the accuracy of CellReplay’s emulation in replicating application performance compared to its live network counterpart. We also compare CellReplay with Mahimahi [25]. We implemented the CellReplay record tool in Java and Python 3 to send and receive UDP packets. We extended the Mahimahi shell to support CellReplay replay, allowing unmodified applications to run inside the shell and experience the emulated network conditions induced by CellReplay. For more details, refer to §B.

The evaluation includes experiments that test CellReplay’s accuracy across (1) different networked applications, including web browsing and random file transfers using TCP, (2) different cellular providers and technologies, including T-Mobile, Verizon, 5G mid-band, and 5G low-band, and (3) different environmental conditions, such as good signal strength, weak signal strength, crowded areas, and various mobility levels (stationary, walking, and driving). For full details on the environments and their calibration parameters, refer to §C. Finally, we present a use case of using CellReplay and Mahimahi to evaluate ABR algorithms.


5.1 Experimental setup (only this subsection is expanded)

We designed two test setups: a live network and an emulation test setup, as shown in Figure 9. The live network test setup was used for running application tests on the live network. During the tests, we tethered a laptop to phones connected to 5G or 4G networks. The client application (e.g., a web browser) and the application server (e.g., a web server) communicate via a UDP tunnel (based on [42]). We deployed our server (i.e., the remote endpoint) in the same geographical area (within 10 miles of the phones) to minimize the network path length and, consequently, reduce the likelihood of experiencing congestion over long paths. We also used a similar setup, albeit without a tunnel, to record the cellular network traces using UDP traffic. However, instead of a single phone, we used two identical phones to separately perform packet train probing and the Saturator workload.

We used the emulation test setup to test applications under an emulated network interface that employed either CellReplay’s or Mahimahi’s replay approach. Although this setup is similar to the live network test, we made two modifications: (1) we replaced the USB-tethered interface with a direct high-speed Ethernet connection to our server via a single switch, and (2) we ran the tunnel inside either CellReplay’s or Mahimahi’s replay network-emulation shell, which emulates the network using recorded traces. The same client and server devices were used for record and replay.

End-point specifications. We used two Samsung Galaxy S22 (SGS) and two Google Pixel 5 (Pixel) phones for testing. We had an unlimited data plan from both T-Mobile and Verizon. Since we observed no performance difference between the SGS and Pixel across different operators, we connected our SGS devices to T-Mobile and our Pixel devices to Verizon for convenience. The laptop in our setups featured an Intel Core i7 CPU, 16GB RAM, and a 512GB SSD, running Ubuntu 20.04.


Discussion and future work

We discuss CellReplay’s use cases, limitations, and plans for future improvements.

Use cases. As shown in §5.9, CellReplay can be used to evaluate new applications and protocols on cellular networks and provide more accurate emulation of real network performance compared to the state-of-the-art approach. CellReplay is superior for latency-sensitive applications, as it can emulate base delay variability, and applications with variable flow sizes, as it can emulate the bandwidth-workload dependency. Adaptive applications (e.g., ABR) that react to network measurement will also receive more accurate performance results with CellReplay.

We also provide traces that researchers and developers can use for testing on CellReplay. While recording traces with CellReplay requires a bit more effort than Mahimahi due to the use of two phones, this is a minor issue: once a diverse set of traces is recorded, users can easily replay them without the phones (just as in Mahimahi). Since CellReplay’s implementation is based on Mahimahi’s shell, unmodified applications can easily use CellReplay’s emulated interface.

Inaccuracies in CellReplay. While we have made significant progress in faithfully replaying cellular performance, CellReplay involves several simplifications and assumptions: (1) CellReplay does not record and replay random packet losses (although it drops packets when the queue overflows and can be manually configured for a set random drop rate). We notice, however, that cellular links under stationary conditions are robust to random packet drops (e.g., due to packet corruption) due to link-layer retransmission [22]. However, packet drops are more frequent during handovers in mobility [15]. (2) CellReplay uses fixed calibration parameters before each recording session. A more adaptive selection of parameters could help when network conditions change during recording. (3) CellReplay’s two-phone setup has some weaknesses. Under mobility, both phones may connect to different base stations and report different performances. Moreover, although we did not observe major interference, greater interference may occur with other providers and conditions. In the future, we will be able to use a single phone with Dual-SIM Dual-Active (DSDA) modem [1] for recording, as DSDA allows simultaneous traffic transmission across two SIMs. Each of these areas represents an opportunity for future improvement. We note, however, that our evaluation results account for the errors caused by these inaccuracies.

Improving CellReplay interpolation accuracy. A straightforward approach is to gather additional data points for interpolation. However, since we need to record data simultaneously (e.g., while walking or driving), we are limited to running only a few workloads with a few phones. Another viable approach is to leverage ML to model complex, workload-dependent network performance and providers’ resource allocation policies. We can train an ML model based on recorded workload and performance traces to predict network performance (e.g., PDOs) for a given test workload. However, this approach may require extensive data collection to capture RAN scheduling behavior, increasing the recording effort and making it time-consuming.

Adding more cellular network specific features. CellReplay could be improved by explicitly emulating cellular network-specific features, especially those that affect application performance. These include radio resource control (RRC) delays, handover, and other relevant factors.

Other limitations. CellReplay probes UDP traffic to record network traces, meaning it cannot capture the effects of network discrimination based on IP protocol types, such as from TCP middlebox intervention [8].

  1. Use cases: compared with the current state of the art (Mahimahi), CellReplay is better suited to evaluating
    • Latency-sensitive applications: it emulates the dynamic variation of the base delay
    • Applications with variable flow sizes: it emulates the dependence of bandwidth on workload
    • Adaptive applications (e.g., ABR video streaming): they receive more accurate network feedback
  2. Current limitations and sources of error
    • No random-loss recording: only queue-overflow drops are emulated; random losses are not recorded (link-layer retransmission masks most losses when stationary, but drops are more frequent during handovers under mobility)
    • Fixed calibration: parameters are calibrated once before recording and cannot adapt as network conditions change during recording
    • Weaknesses of the two-phone setup:
      • Under mobility, the two phones may attach to different base stations
      • There is a potential risk of radio interference (though the authors observed no severe interference)
  3. Future Work
    • Hardware: use a single phone with DSDA (Dual-SIM Dual-Active) support instead of two phones, recording both streams on one device
    • Algorithm and interpolation improvements:
      • Add more interpolation data points (difficult, limited by the number of devices)
      • Introduce machine learning (ML): learn the complex workload-performance relationship and the operators' scheduling policies, though this requires large amounts of training data
    • More cellular-specific features: explicitly emulate cellular mechanisms such as radio resource control (RRC) delays and handover effects