SplitSim: Large-Scale Simulations for Evaluating Network Systems Research¶
This paper addresses a central difficulty in network and distributed systems research: how to balance simulation scale, fidelity, and resource efficiency.
(1) Background and Problem
Current dilemma:
- Physical testbeds: not large enough, and inflexible (hard to customize hardware or topology)
- Protocol-level simulation (e.g. ns-3): lacks end-host software/hardware detail, so accuracy falls short
- Full-system simulation (e.g. SimBricks): faithful, but extremely resource-hungry and hard to scale (e.g. simulating thousands of nodes requires thousands of CPU cores)
Core challenge:
- A balance must be found between fidelity, simulation time, and compute resources, and existing tools lack this flexibility
(2) Design
SplitSim is a framework based on modular simulation that achieves large-scale, low-cost end-to-end simulation through four techniques:
Mixed-Fidelity Simulations:
- Core idea:
  - Not every node needs full-system simulation
  - Use accurate simulators (e.g. gem5, QEMU) for key nodes, and low-detail protocol-level simulators (e.g. ns-3) for background-traffic nodes
- Benefit: drastically reduces CPU resource requirements
Parallelization through Decomposition:
- Core idea: split a single bottleneck simulator (e.g. single-threaded ns-3 or gem5) into multiple parallel processes
- Mechanism: SplitSim adapters and trunk channels connect the split processes and keep them synchronized
Lightweight Profiling (WTPG):
- Tool: the Wait-Time Profile Graph (WTPG)
- Purpose: visualizes "who waits for whom", helping users identify the bottleneck component and decide what to further parallelize or consolidate
Configuration & Orchestration:
- Abstraction: separates the "system configuration" (topology, hardware parameters) from the "simulation implementation" (which simulators, how many processes)
- Benefit: users can switch simulation strategies flexibly (e.g. from full fidelity to mixed fidelity) without rewriting the system definition
(3) Evaluation
- Scale:
  - Simulated a 1200-node datacenter network on a single 24-core machine: 20 seconds of system time took only 175 minutes
- Performance comparisons:
  - More than 2x faster than the SST simulator
  - Parallelized gem5 (8-core configuration) achieved roughly 5x speedup
  - ns-3 parallelization performance matches or slightly exceeds native parallelization, with greater flexibility
Introduction¶
Research on large-scale network and distributed systems often faces the challenge of access to adequate testbeds. Most researchers and even many practitioners do not have access to testbeds that are large enough and/or provide the necessary flexibility, control, and hardware. For example, a new datacenter congestion control algorithm might require specific configuration parameters at each network switch in the datacenter, or a distributed system accelerated with in-network processing might require new programmable switches deployed at specific points in the network.
In these cases we typically rely on a patchwork evaluation that combines end-to-end measurements in a small physical testbed with protocol-level simulations for evaluating at scale. However, this methodology compromises accuracy of end-to-end system behaviors at scale. The physical testbed is by necessity too small and protocol-level simulations do not model many system components, from NIC behavior, to host-interconnects, memory hierarchy, and the whole OS and application-level software stack.
We argue that simulated end-to-end testbeds can bridge this gap. In this paper, we take a top-down approach: starting from detailed but slow small-scale end-to-end simulations, we tackle the practical and fundamental challenges in scaling up.
We use existing work on modular end-to-end simulation [11, 18] as a starting point. Modular end-to-end simulations combine and connect different best-of-breed simulators for different system components, and through modularity flexibly cover a broad range of use-cases. These simulations scale up by running separate components as parallel processes, communicating and synchronizing them efficiently, locally or distributed across physical machines. SimBricks [11] has demonstrated that this approach scales to simulate a network of 1000 single-core hosts and NICs running Memcached on Linux. However, while technically feasible, this simulation of 10 s of application workload required 6–20 h of simulation time, depending on configuration, on 26 machines with 96 vCPUs, for $600–$2000 today on EC2.
With SplitSim, we enable large-scale simulations with more reasonable cost-benefit ratios through a combination of techniques. First, we provide a configuration and orchestration framework for end-to-end simulations that simplifies simulating a concrete system flexibly in different ways, by separating the configuration of the simulated system from concrete simulation instantiation choices. We then leverage modularity to implement mixed-fidelity simulations, where some system component instances are simulated in less accurate simulators to drastically reduce the CPU resources needed, while key instances remain in accurate simulators. Next, we design generic building blocks for reducing simulation time by parallelizing bottleneck simulators, decomposing them into multiple parallel, connected, and synchronized processes. Finally, we introduce lightweight synchronization and communication profiling to inform the user about bottlenecks and resource efficiency across component simulator instances.
In our evaluation, we demonstrate that SplitSim enables evaluation of large-scale systems in networks of up to 1200 hosts, while running complete OS and application stacks for key nodes. SplitSim simulations thus enable full end-to-end application evaluation in large networks. By combining mixed-fidelity simulation with parallelization through decomposition, SplitSim can simulate 20 seconds of a large-scale system in less than 3 hours while running on a single machine. The modular approach and flexible configuration and orchestration framework make SplitSim suitable for a broad range of evaluation use-cases. SplitSim is open source and the full paper artifact is available here: https://github.com/simbricks/conext25-artifact.
This work does not raise ethical issues.
Background and Motivation¶
2.1 Requirements for Evaluating Large-Scale Network Systems¶
We argue tooling for practical evaluation of large scale systems should be:
- End-to-End: Obtain full-system measurements with all relevant hardware (switches, topology, NICs, server-internals, etc.) and software (application, OS, library) components.
- Scalable: Support evaluation of systems with realistic scale, e.g. 100–1000s of hosts.
- Efficient: Evaluation within feasible resource limits, in particular processor cycles.
- Fast: Keep evaluation and measurement times manageable.
- Flexible: Support a broad range of different system configurations.
- Easy to Use: Make it easy for users to configure and run evaluations and measurements.
2.2 Existing Approaches Fall Short¶
We now overview typical evaluation approaches used in network research when physical testbeds are out of reach, and explain why they are insufficient. Table 1 shows an overview.
Performance Estimators. Theoretical models [6, 22] have long served as valuable tools for describing network states and estimating key metrics such as throughput and latency. While they are helpful when network behavior can be precisely articulated through equations, they lack the fine-grained packet-level visibility offered by other approaches. These approaches also do not enable an end-to-end evaluation but instead rely on coarse-grained modeling for a system.
Recent efforts emulate networks using deep neural networks to estimate user-relevant metrics such as delay and packet loss. MimicNet [21], for instance, learns performance metrics at cluster granularity and generates estimated packet traces. DeepQueueNet [20] refines this approach by learning device-level performance metrics, thereby enhancing packet visibility within the cluster. These estimations are derived by running inference over input data, which lends itself well to massive parallelization, enabling rapid results for large-scale networks. However, the deep neural net's behavior is not interpretable. Additionally, to model different network configurations, the model has to be reconstructed and re-trained, incurring substantial computing and engineering effort.
Discrete Event Network Simulation. DES-based simulators such as ns-3 [15] and OMNeT++ [10] are extensively used in networking as they are well suited to model packets in networks. They provide detailed insights into network behavior at the packet and protocol level, enabling in-depth analysis. These simulators model network events with timestamps, such as a packet being generated at a host, transmitted through a link, and received at the other end. Events are processed in chronological order to update the state of the simulated network. Other simulators leverage DES to simulate other components, such as gem5 [4] or Simics [13] for computers including detailed processor and memory hierarchy.
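The chronological event-processing loop described above can be sketched in a few lines of Python (a generic illustration of discrete-event simulation, not actual ns-3 or gem5 code):

```python
import heapq

class DES:
    """Minimal discrete-event simulator: events are (timestamp, action)
    pairs processed in chronological order. Timestamps are integer ns."""
    def __init__(self):
        self.now = 0
        self._queue = []   # min-heap ordered by timestamp
        self._seq = 0      # tie-breaker so equal-time events stay ordered

    def schedule(self, delay_ns, action):
        heapq.heappush(self._queue, (self.now + delay_ns, self._seq, action))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()

# Example from the text: a packet is generated at a host, transmitted
# through a link (2 ms latency), and received at the other end.
sim = DES()
log = []
def receive():
    log.append(("rx", sim.now))
def transmit():
    log.append(("tx", sim.now))
    sim.schedule(2_000_000, receive)   # link latency: 2 ms
sim.schedule(1_000_000, transmit)      # packet generated at t = 1 ms
sim.run()
# log == [("tx", 1000000), ("rx", 3000000)]
```

Sequential DES processes this queue on a single core, which is why large networks with vast event volumes simulate so slowly.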
However, sequential DES struggles with scalability when modeling large systems. For example, simulating a few seconds of a modern datacenter network—with vast traffic from thousands of end hosts—can take hours or even days. Parallelizing DES [2, 3, 14, 17] can reduce simulation times. The primary challenge is the difficulty of efficiently parallelizing while accurately synchronizing DES. For example, with ns-3's native parallelization, we found partitioning a network simulation across 16 processes on 16 cores resulted in only a 3.8x speedup. Clean-slate parallel designs [8, 19] further improve parallelism, but lack the rich feature sets developed over decades of research and necessarily focus on simulating specific components rather than entire end-to-end systems.
More fundamentally, a DES-based simulation only models behavior explicitly included in the models of the simulator. Given the growing complexity of modern network systems and all the components involved, it appears infeasible to add all necessary models to one single simulator. Individual simulation models also fundamentally need to choose what to model and what to skip to keep simulation overhead manageable. As such, an individual simulator may be well suited for one use, but inadequate for another.
Modular Simulation. To address this, modular simulation frameworks such as SimBricks [11] and SST [18] allow users to combine multiple simulators into one complete system. For example, SimBricks combines simulators for host, hardware device, and network to construct an end-to-end simulation, and runs them in parallel. By combining different best-of-breed simulators, this approach leverages the substantial engineering effort that has gone into developing simulators for different components, and combines this effort across components. Modular simulation is also flexible: depending on the choice of simulators, a simulation can be configured to be accurate but slow, or faster and less accurate. This works well, but for large-scale systems, these simulations use computational resources inefficiently. Each system component needs an additional simulator instance running on a separate core, and many cores will waste precious cycles waiting for bottleneck simulators.
SplitSim. With SplitSim we aim to address these shortcomings, by pragmatically leveraging existing pieces and the decades of engineering effort, but augmenting their capabilities to make practical end-to-end simulation of large scale systems possible. We make it easy to simulate a system configuration of interest in different ways, to flexibly explore different simulator choices trade-offs. SplitSim also supports the user in locating simulation bottlenecks, and parallelizing bottleneck simulators while consolidating others. Overall, SplitSim thereby enables practical & flexible end-to-end evaluation of large-scale systems in simulation.
2.3 Technical Challenges¶
Based on the observations above, we use modular end-to-end simulations as a starting point, and examine the compounding challenges in scaling up to meet the requirements for large-scale network systems.
High Resource Requirements for Detailed Simulators. In general, detailed simulators are slower and less resource efficient compared to less-detailed simulators. To obtain meaningful end-to-end measurements, we generally require functionally and timing accurate simulators for all component types in the system, be it processor and memory subsystem, hardware devices such as NICs, or the actual network topology. As a result, modular end-to-end simulations of large-scale systems are prohibitively expensive, requiring hundreds or thousands of processor cores for many hours to simulate seconds.
Simulations Bottlenecked by the Slowest Component. A modular simulation comprising multiple synchronized components can, by construction, only proceed as fast as its slowest component. Slow bottleneck simulators cause two separate problems. First, overall simulation times will be long for simulations that include even just a single slow simulator component. Second, a typical end-to-end simulation will naturally contain component simulators that run at different speeds, so faster components waste many processor cycles waiting for slower simulators. At scale, this results in substantial waste.
Hard to Understand Simulation Performance. To make matters worse, finding bottlenecks in simulations comprising tens to thousands of communicating and synchronized components is a challenge. The most efficient simulation synchronization mechanisms rely on polling shared memory state. Thus, all components will commonly show 100% CPU utilization, and a regular profiler will indicate lots of time spent in the functions that poll for messages. Based on these indicators it is hard to tell whether a simulator is bottlenecked or communicating heavily, especially when combined with heavy compiler optimization. Blocking will also naturally propagate through dependent system components.
Navigating Fidelity, Time, and Cost Trade-offs. Achieving high simulation fidelity while minimizing simulation time and computational resource usage is inherently challenging. Simulating large-scale systems necessarily requires some compromise across three key dimensions: fidelity, time, and resource demand. Users want to make the best use of their available time, computational budget, and simulation fidelity by navigating this trade-off space effectively. There is a large space of options here: different choices of simulation models, parallelization strategies, and strategies for mapping onto available physical resources. Exploring this space fundamentally requires trial-and-error with different configurations. Unfortunately, this is difficult to navigate with existing tools.
Complex Configuration and Execution. More generally, configuring and running simulations for large-scale end-to-end system is a complex task. Many instances of different simulators for different components need to be configured, connected, and then executed in a coordinated manner. The first problem is the complexity: each simulator has its own mechanism and abstractions for configuring it, and there is a substantial learning curve whenever a user looks to use a new simulator. Second, this is complicated by the fact that any non-trivial evaluation typically will need to simulate multiple different configurations of its system, and often needs to explore different simulators and simulator configurations to identify suitable configurations. Finally, once the user has chosen a system and simulation configuration, all components need to be connected together, started in the correct order respecting dependencies, outputs need to be collected, and finally all simulators need to be cleanly terminated. Even with more than a handful of components a manual approach is prohibitively complex and laborious.
Design and Implementation¶
SplitSim combines four techniques to address these technical challenges. Figure 2 shows an overview. First, SplitSim reduces the resources needed for large-scale simulations by enabling mixed-fidelity simulations (3.1), where expensive detailed simulators are replaced with faster, less resource-intensive simulations in part of the system, while keeping detailed simulation in other parts. To increase simulation speed and avoid poorly utilized processor cores, SplitSim provides generic building blocks for parallelizing bottleneck component simulators by decomposing them into parallel processes (3.2). SplitSim helps users identify bottleneck component simulators and largely idle component simulators with a cross-simulator synchronization and communication profiler (3.3). Finally, SplitSim streamlines configuring and running a broad range of different system and simulation configurations, with programming abstractions for configuration and communication (3.4).
3.1 Mixed-Fidelity Simulations¶
To reduce the computational resources necessary for large-scale end-to-end simulations, we propose mixed-fidelity simulation. The idea is to retain fully detailed end-to-end simulation for a subset of the system's components, while using less resource-intensive simulations for less critical areas of the system.
Reducing Simulation Detail in Non-Critical Components. The underlying insight is that full detail is typically not required in every component of the system. A common example is running a system as part of a larger network to evaluate the effect of other background traffic, congestion, etc. in the network; here, protocol-level simulation of the hosts generating this background traffic is completely sufficient. However, where detailed simulation is required and where detail can be sacrificed depends on the system and evaluation goal. When evaluating peak system throughput for a client-server system, modeling internal client detail is not essential: as long as client requests arrive at the required rate and with the correct protocol or format, the server behavior will be the same. When evaluating end-to-end request latency, on the other hand, client-internal behavior is likely to significantly affect measured latency, so at least the clients that measure the latency need to be simulated in full detail. For all three of these examples, instead of simulating all hosts end-to-end with detailed architectural simulators, such as QEMU or gem5, we can simulate a specific subset of them at the protocol level, e.g. in ns-3 or OMNeT++ (Figure 3). A similar approach applies to other system components: for example, instead of running expensive RTL-level simulations for all NICs or switches in projects that propose new hardware designs, a judicious combination of RTL simulations with faster, more efficient lightweight simulation models drastically reduces the cycles needed.
Enabling Mixed-Fidelity End-to-End Simulations. At a technical level, SplitSim enables mixed-fidelity simulations through modular composition, inherited from SimBricks. Component simulators are connected through fixed message-passing interfaces, primarily Ethernet packets and PCI, and individual simulators are thus decoupled from how these messages are generated.
New Challenges. However, while mixed-fidelity simulations can drastically reduce the computational cost of large-scale simulations, configuring and running them gives rise to, or exacerbates, the other three challenges. First, these simulations often create heavy bottlenecks for simulation speed, introducing significant imbalance that leads other simulators to waste cycles waiting and leaves cores idle. In most SimBricks simulations we have run, the end-host simulators (QEMU or gem5) are the slowest component by a significant margin; however, once we move a few hundred or thousand hosts into the ns-3 network, ns-3 slows down the whole simulation by 3–5×. In the following two subsections we discuss how SplitSim enables users to locate (3.3) and mitigate (3.2) such bottlenecks. Finally, configuring mixed-fidelity simulations and exploring different levels of detail in different system components is particularly complicated and laborious for users. In addition to building host disk images with applications and configurations, setting up commands for each host to run, etc., a mixed-fidelity simulation now also requires configuring additional simulators, e.g. ns-3, to implement similar functionality through their abstractions. SplitSim simplifies this, in part, through the configuration and orchestration framework (3.4).
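To illustrate the fidelity-assignment idea concretely (the names and helper below are hypothetical, not SplitSim's actual API), a mixed-fidelity plan might keep only the measuring hosts in a detailed simulator and push everything else down to protocol level:

```python
# Hypothetical fidelity tiers; simulator names match those in the text.
DETAILED = "gem5"   # full-system, cycle-level: slow but accurate
PROTOCOL = "ns-3"   # protocol-level only: cheap background-traffic hosts

def assign_fidelity(hosts, measured):
    """Assign a simulator per host: hosts that take measurements stay in
    the detailed simulator; the rest run at protocol level."""
    return {h: (DETAILED if h in measured else PROTOCOL) for h in hosts}

hosts = [f"h{i}" for i in range(8)]
# E.g. for a latency evaluation, only the two measuring clients need
# full detail; the other six hosts just generate background traffic.
plan = assign_fidelity(hosts, measured={"h0", "h1"})
assert plan["h0"] == "gem5" and plan["h7"] == "ns-3"
```

Even in this toy setting, six of eight hosts no longer need a dedicated detailed-simulator instance, which is where the CPU savings come from.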
3.2 Parallelizing Through Decomposition¶
In general, parallelizing simulators is a challenging problem with different approaches for different types of simulators. These are well-studied, but at least the few major simulators relevant for end-to-end simulations are either sequential (gem5) or scale poorly (ns-3, OMNeT++) [21]. Moreover, the existing parallelization approaches often require intrusive changes to simulators. In SplitSim, we instead propose simpler, easy-to-integrate building blocks to parallelize system simulators with modular architectures, such as ns-3, OMNeT++, and gem5.
The key idea in the SplitSim parallelization approach is to decompose these simulators at component boundaries into multiple separate processes.
We then leverage the well-defined module interfaces for connecting and synchronizing the parallel processes with SplitSim adapters, which translate these interfaces into messages on SimBricks channels, the same channels also used to interconnect other SplitSim simulator components. Using the same mechanisms enables re-use, gives SplitSim visibility into the simulation structure for effective orchestration, and enables use of the SplitSim profiler for these newly parallel components.
3.2.1 Building Blocks.
Base Adapter. We build on the SimBricks adapters in gem5, ns-3, and OMNeT++, which are all implemented within the device abstractions of each simulator and implement both synchronization and communication through the channel. Based on these, we define an abstract SplitSim base adapter for each simulator that implements initialization and synchronization, but is not specific to a particular SimBricks channel type. This base adapter can then be used to implement multiple specific protocol adapters without re-implementing the common functionality. This includes adapters for the existing SimBricks protocols, but also makes it easy to implement adapters for internally connecting and synchronizing pieces of a simulator.
Trunk Adapter. Many non-trivial partitions require multiple connections between some pairs of processes. In principle, multiple instances of the SplitSim adapter can be used here, and this just works. However, it unnecessarily incurs the synchronization overhead once per adapter. To address this, SplitSim introduces trunk channels, which multiplex messages for multiple upper-layer channels over one synchronized SimBricks channel. The implementation tags messages going across with a sub-channel identifier for demultiplexing at the receiver.
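The trunk mechanism can be illustrated with a small sketch (hypothetical Python, not the SimBricks implementation): every message carries a sub-channel tag, so one synchronized channel serves many logical links:

```python
from collections import defaultdict

class TrunkChannel:
    """Sketch of a trunk channel: messages from several logical
    sub-channels share one synchronized link; each message is tagged
    with its sub-channel id so the receiver can demultiplex."""
    def __init__(self):
        self._wire = []   # stands in for the single shared SimBricks channel

    def send(self, sub_id, payload):
        # Tag the message with the sub-channel identifier.
        self._wire.append((sub_id, payload))

    def receive_all(self):
        # Demultiplex at the receiver: one inbox per sub-channel.
        inboxes = defaultdict(list)
        for sub_id, payload in self._wire:
            inboxes[sub_id].append(payload)
        self._wire.clear()
        return inboxes

trunk = TrunkChannel()
trunk.send(0, "pkt-A")   # sub-channel 0: e.g. link h0 <-> switch
trunk.send(1, "pkt-B")   # sub-channel 1: e.g. link h1 <-> switch
trunk.send(0, "pkt-C")
inboxes = trunk.receive_all()
assert inboxes[0] == ["pkt-A", "pkt-C"] and inboxes[1] == ["pkt-B"]
```

The point of the design is that synchronization cost is paid once for the trunk, not once per logical sub-channel.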
3.2.2 Examples.
Multi-Core gem5. Figure 4 demonstrates how we use SplitSim adapters to parallelize multi-core simulations in gem5. Changes to gem5 are limited to 1) implementing the adapters as simulation objects in gem5 and serializing the already message-based memory packet interface to messages, and 2) changing the gem5 Python configuration script to only instantiate the relevant components for each process.
Parallel ns-3 and OMNeT++. We also implemented SplitSim parallelization for the ns-3 and OMNeT++ network simulators. Here we also instantiate different parts of the overall network topology in separate processes, and replace links going across components with SplitSim trunk link adapters. For network simulators we rely on the user to configure the partitioning and create the adapters, either manually or through our configuration framework (3.4).
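As a rough illustration of the partitioning step (the helper below is hypothetical, not part of SplitSim), the cross-partition links are exactly the ones that must be replaced by trunk link adapters:

```python
def partition_links(links, part_of):
    """Split a topology's links into per-process internal links and
    cross-partition links, which must become trunk link adapters."""
    internal, cross = [], []
    for a, b in links:
        (internal if part_of[a] == part_of[b] else cross).append((a, b))
    return internal, cross

# Two-process split of a tiny dumbbell topology: hosts h0/h1 and switch
# s0 in process 0; switch s1 and hosts h2/h3 in process 1.
part_of = {"h0": 0, "h1": 0, "s0": 0, "s1": 1, "h2": 1, "h3": 1}
links = [("h0", "s0"), ("h1", "s0"), ("s0", "s1"), ("s1", "h2"), ("s1", "h3")]
internal, cross = partition_links(links, part_of)
assert cross == [("s0", "s1")]   # only the inter-switch link crosses processes
```

Only the single inter-switch link crosses the process boundary, so only one trunk adapter pair is needed; choosing cuts with few crossing links keeps synchronization overhead low.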
3.3 Lightweight Profiling for Synchronization and Communication¶
To address the challenges in understanding simulation performance, deciding what to parallelize, consolidate, and generally find bottlenecks, SplitSim includes profiling infrastructure. The SplitSim profiler measures metrics related to SplitSim cross-simulator synchronization and communication in each component simulator. The profiler comprises two components: instrumentation in each simulator, and post-processing to aggregate the collected metrics and present them to the user.
3.3.1 Lightweight Instrumentation. SplitSim instruments each adapter, both for communication across simulator components and within the processes of a particular component, with lightweight metric collection and logging. First, each adapter continuously counts 1) the CPU cycles spent blocked waiting for a synchronization message from the peer that allows the simulation to proceed, 2) the data messages sent to peer simulators, and 3) the incoming data messages processed. Second, each simulator can be set to periodically, e.g. every 10 s, log the values of these counters for each adapter, along with the current time stamp counter and that simulator's current simulation time.
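A minimal sketch of these per-adapter counters and periodic log entries (illustrative Python; the field names are assumptions, not SplitSim's actual log format):

```python
import time

class AdapterCounters:
    """Per-adapter counters as described above: cycles blocked on
    synchronization, data messages sent, and data messages processed."""
    def __init__(self):
        self.sync_wait_cycles = 0
        self.msgs_sent = 0
        self.msgs_processed = 0

    def log_line(self, sim_time_ps):
        # Periodic log entry: absolute counter totals plus the current
        # timestamp counter and the simulator's current simulation time.
        return {
            "tsc": time.monotonic_ns(),      # stand-in for the TSC
            "sim_time_ps": sim_time_ps,
            "sync_wait_cycles": self.sync_wait_cycles,
            "msgs_sent": self.msgs_sent,
            "msgs_processed": self.msgs_processed,
        }

c = AdapterCounters()
c.msgs_sent += 1
c.sync_wait_cycles += 1200
entry = c.log_line(sim_time_ps=10_000)
assert entry["msgs_sent"] == 1 and entry["sync_wait_cycles"] == 1200
```

Logging absolute totals (rather than deltas) is what lets the post-processor later subtract any two entries and drop warm-up and cool-down periods freely.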
3.3.2 Profiler Post-Processing. After the simulation terminates, either because it completes or because the user stops it, the profiler post processor ingests and parses these logs. As each simulator logs absolute totals for each value, we calculate the difference between a late entry towards the end and an early entry towards the beginning, dropping a configurable number of warm-up and cool-down lines. Each log entry contains both the simulation time and processor time stamp counter, thereby providing a reference for simulation time and physical system time.
Metrics Calculated. The post processor first calculates a global metric, simulation speed, by dividing the difference in simulation time by the difference in time stamp counter cycles (as all simulators are synchronized, this value is the same for each simulator). For each simulator we also calculate their efficiency as the fraction of cycles not spent on receive, transmit, or synchronization in the SplitSim adapter. This metric is useful to determine when diminishing returns for parallelizing SplitSim simulations set in.
Wait-Time Profile Graph. The main output for understanding SplitSim simulation performance and for localizing bottlenecks, is the wait-time profile graph (WTPG). The WTPG contains a node for each simulator instance, and a pair of opposite directed edges for each SplitSim channel connecting two simulators. The profiler annotates each edge with the fraction of cycles that the simulator at the source of the edge has spent waiting for synchronization messages from the destination simulator of the edge. As such, the graph visualizes "who waits for who". Additionally, the profiler annotates each node with the total number of cycles that node spends waiting across simulators. Based on this value, we also color nodes on a spectrum from green to red, with red for nodes that spend few cycles waiting for other nodes, and green for nodes that spend many cycles waiting for other nodes. Typically, nodes that spend little time waiting, are the bottleneck simulators and will stand out in red. If in doubt, the edge labels allow users to confirm that their neighbors spend significant cycles waiting on them. Figure 5 shows an example of a WTPG for a SplitSim simulation.
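The post-processing metrics above reduce to simple arithmetic over pairs of log entries; a sketch under assumed field names (not SplitSim's actual code):

```python
def simulation_speed(first, last):
    """Global simulation speed: simulated time advanced per TSC cycle,
    computed from an early and a late log entry (warm-up/cool-down
    entries already dropped)."""
    return (last["sim_time_ps"] - first["sim_time_ps"]) / (last["tsc"] - first["tsc"])

def efficiency(total_cycles, rx_cycles, tx_cycles, sync_cycles):
    """Fraction of cycles NOT spent on receive, transmit, or
    synchronization in the SplitSim adapter."""
    return 1.0 - (rx_cycles + tx_cycles + sync_cycles) / total_cycles

def wtpg_edges(channels):
    """Wait-time profile graph edges: for each channel (src, dst), the
    fraction of src's cycles spent waiting for sync messages from dst."""
    return {(src, dst): wait / total for (src, dst), (wait, total) in channels.items()}

first = {"sim_time_ps": 0, "tsc": 0}
last = {"sim_time_ps": 2_000_000, "tsc": 8_000_000}
assert simulation_speed(first, last) == 0.25    # 0.25 ps simulated per cycle
assert efficiency(1000, 100, 100, 300) == 0.5
edges = wtpg_edges({("ns3", "gem5"): (900, 1000)})
assert edges[("ns3", "gem5")] == 0.9            # ns3 spends 90% waiting on gem5
```

In the WTPG, a node like `gem5` here, with heavy incoming wait edges and little waiting of its own, is the red-colored bottleneck candidate.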
为了应对理解仿真性能、决定并行化或整合什么组件以及普遍查找瓶颈的挑战, SplitSim 包含了性能分析基础设施. SplitSim 分析器测量各组件模拟器中与 SplitSim 跨模拟器同步及通信相关的指标.
分析器由两个组件组成:
- 每个模拟器中的 "插桩" (instrumentation)
- 用于聚合收集到的指标并呈现给用户的 "后处理" (post-processing)
3.3.1 轻量级插桩 (Lightweight Instrumentation)
SplitSim 对每个适配器进行插桩, 包括跨模拟器组件的通信以及特定组件进程内的通信, 以进行轻量级指标收集和日志记录. 首先, 每个适配器持续计数: 1) 等待来自对等端的同步消息以允许仿真继续所阻塞的 CPU 周期数; 2) 向对等模拟器发送数据消息的数量; 3) 处理传入数据消息的数量. 其次, 每个模拟器可以被设置为周期性地 (例如每 10 秒) 记录每个适配器的这些计数器值、当前时间戳计数器以及该模拟器的当前仿真时间.
3.3.2 分析器后处理 (Profiler Post-Processing)
仿真终止后 (无论是完成还是用户停止), 分析器后处理器会摄取并解析这些日志. 由于每个模拟器记录的是各项值的绝对总数, 我们通过计算接近结束时的条目与接近开始时的条目之间的差值, 并丢弃可配置数量的预热和冷却行来进行处理. 每个日志条目都包含仿真时间和处理器时间戳计数器, 从而为仿真时间和物理系统时间提供了参考.
计算指标:
后处理器首先通过将仿真时间的差值除以时间戳计数器周期的差值来计算全局指标——仿真速度 (由于所有模拟器都是同步的, 每个模拟器的该值相同). 对于每个模拟器, 我们还将其效率计算为未在 SplitSim 适配器中用于接收、发送或同步的周期比例. 该指标对于确定并行化 SplitSim 仿真何时出现收益递减非常有用.
等待时间分布图 (Wait-Time Profile Graph, WTPG):
用于理解 SplitSim 仿真性能和定位瓶颈的主要输出是等待时间分布图 (WTPG).

WTPG 为每个模拟器实例包含一个节点, 并为连接两个模拟器的每个 SplitSim 通道包含一对反向有向边. 分析器在每条边上标注源节点模拟器等待来自目标节点模拟器的同步消息所花费的周期比例. 因此, 该图可视化了 "谁在等谁".
此外, 分析器用节点跨模拟器等待所花费的总周期数来标注每个节点. 基于该值, 我们还将节点在从绿色到红色的光谱上进行着色, 红色表示等待其他节点周期较少的节点, 绿色表示等待其他节点周期较多的节点.
通常, 等待时间很少的节点是瓶颈模拟器, 会以红色突出显示. 如有疑问, 边标签允许用户确认其邻居是否花费大量周期在等待它们. 图 5 展示了一个 SplitSim 仿真的 WTPG 示例.
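The post-processing steps above (differencing counter totals, computing simulation speed and efficiency, and deriving WTPG edge labels) can be sketched as follows. This is a minimal illustration, not the actual SplitSim implementation; the log-entry field names (`tsc`, `sim_time`, `adapter_cycles`, `wait`) are hypothetical stand-ins for the real log format.

```python
# Minimal sketch of the profiler post-processing. Each simulator's log is a
# list of periodic entries holding absolute counter totals; we diff a late
# entry against an early one (dropping warm-up/cool-down lines) and derive
# simulation speed, efficiency, and WTPG edge annotations.

def analyze(sim_logs, warmup=1, cooldown=1):
    results = {}
    for name, entries in sim_logs.items():
        first = entries[warmup]                      # early entry
        last = entries[len(entries) - 1 - cooldown]  # late entry
        d_tsc = last["tsc"] - first["tsc"]
        d_sim = last["sim_time"] - first["sim_time"]
        d_adapter = last["adapter_cycles"] - first["adapter_cycles"]
        # WTPG edge labels: fraction of cycles spent waiting on each peer.
        edge_wait = {peer: (last["wait"][peer] - first["wait"][peer]) / d_tsc
                     for peer in last["wait"]}
        results[name] = {
            # Simulated time per host cycle; equal across synchronized sims.
            "speed": d_sim / d_tsc,
            # Fraction of cycles NOT spent on rx/tx/sync in the adapter.
            "efficiency": 1 - d_adapter / d_tsc,
            "edge_wait": edge_wait,
            # Node annotation: total waiting; low values flag bottlenecks (red).
            "total_wait": sum(edge_wait.values()),
        }
    return results

# Toy logs for two synchronized simulators: "gem5" barely waits (it is the
# bottleneck), while "ns3" spends 30% of its cycles waiting on gem5.
logs = {
    "gem5": [{"tsc": t, "sim_time": t // 10, "adapter_cycles": t // 20,
              "wait": {"ns3": t // 50}} for t in (0, 1000, 2000, 3000)],
    "ns3": [{"tsc": t, "sim_time": t // 10, "adapter_cycles": 2 * t // 5,
             "wait": {"gem5": 3 * t // 10}} for t in (0, 1000, 2000, 3000)],
}
report = analyze(logs)
```

Since both simulators advance in lockstep, both report the same simulation speed, while the low `total_wait` of the gem5 node marks it as the component to split next.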
3.4 Configuration and Orchestration¶
Finally, we are left with addressing the complexity of configuring and running a broad range of large-scale simulations. SplitSim addresses this with an orchestration framework. The orchestration framework aims to reduce user configuration complexity by providing natural abstractions for specifying the configuration of the simulated system separately from the implementation choices for how to simulate the system. SplitSim then applies the specified implementation choices and coordinates the execution of the simulation, including starting up each component simulator, connecting them, collecting outputs, and, eventually, cleaning up after termination.
Crucially, with the SplitSim configuration abstractions, users can solve many simulation configuration tasks fully within SplitSim, without resorting to manually configuring specific simulators through their individual configuration mechanisms. For such tasks, the SplitSim abstractions also provide a level of portability, by fully separating system configurations from concrete simulator choices. At the same time, the SplitSim orchestration framework can easily be bypassed where necessary, and users can resort to manually configuring specific simulators. SplitSim orchestration aims to make easy tasks easy and complex tasks possible.
3.4.1 System Configuration Abstraction. The goal of the SplitSim system configuration abstraction is to specify the configuration of the simulated system separately from concrete choices of how to simulate it. We represent the system configuration as a hierarchy of Python objects. At the root we have the SystemConfiguration object, which contains a list of all system components. A system component can be a host, a NIC, a switch, or a PCI or Ethernet link. Each component object carries the expected attributes. For example, a host object may specify the number of cores, memory, disk image, applications to run, IP address, etc. A link object specifies latency and bandwidth, along with its two endpoints. The key design consideration for this abstraction is to specify system characteristics while abstracting away simulation details.
Since we use Python for defining SplitSim system configurations, users can use all Python language features for assembling this configuration, in particular relying on loops for instantiating repeated patterns, or using functions and modules for factoring out reusable configuration parts. For example, for the experiments in this paper we use the same parameterizable large-scale network topology across multiple experiments and have defined it in a function shared between them.
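As a concrete illustration of this configuration style, a parameterizable topology can be factored into a plain Python function. The class and attribute names below are hypothetical stand-ins for the actual SplitSim API; the point is that ordinary loops and dataclasses assemble the component hierarchy.

```python
# Sketch of a SplitSim-style system configuration: a hierarchy of plain
# Python objects, assembled by a reusable, parameterizable function.
# All class names here are illustrative, not the real SplitSim classes.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    cores: int = 1
    memory_mb: int = 512
    ip: str = ""

@dataclass
class Switch:
    name: str

@dataclass
class EthLink:
    a: object            # one endpoint
    b: object            # other endpoint
    latency_us: float = 1.0
    bandwidth_gbps: float = 10.0

@dataclass
class SystemConfiguration:
    components: list = field(default_factory=list)

def star_topology(n_hosts):
    """Reusable topology: n hosts, each linked to a single switch."""
    cfg = SystemConfiguration()
    sw = Switch("sw0")
    cfg.components.append(sw)
    for i in range(n_hosts):
        h = Host(f"host{i}", cores=2, ip=f"10.0.0.{i + 1}")
        cfg.components.append(h)
        cfg.components.append(EthLink(h, sw))
    return cfg

cfg = star_topology(4)
```

Because the topology is just a function, the same definition can be reused across experiments with different parameters, exactly as described above.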
3.4.2 Implementation Choices. After a user has assembled a system configuration in the simulation Python script, the second step is to generate one or more different concrete simulation instantiations. Users instantiate a SplitSim simulation by choosing specific simulators and translating the system configuration for the corresponding system components into configurations for these concrete simulators. We specify the resulting instantiated simulation using the existing SimBricks abstractions for describing interconnected instances of component simulators.
In general, there are many different instantiation strategies. Instead of trying to automate this inherently complex step with a one-size-fits-all approach, SplitSim opts for an extensible and flexible approach by merely providing library routines for common instantiation strategies. For example, one strategy we commonly use instantiates all hosts as separate processes of a specified host simulator, QEMU or gem5, instantiates all NICs with a particular NIC simulator, and simulates the whole network topology in one ns-3 process. A generalized version of this strategy instead first applies a partition function, provided as a parameter, to divvy up the network topology components into different partitions that run in separate ns-3 processes. For the network topology above, we have implemented a couple of different partition strategy functions that we use across the experiments in this paper.
As an instantiated simulation configuration is just a regular SimBricks configuration, comprising SimBricks orchestration Python objects, SplitSim users can manually modify this configuration afterwards when the need arises.
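The partition-function idea can be sketched as follows. Both `instantiate` and `by_groups_of` are hypothetical illustrations of the strategy, not the real SplitSim or SimBricks library routines, and the returned "plan" stands in for the SimBricks configuration objects the real framework produces.

```python
# Sketch of an instantiation strategy: one process per host, and the network
# divided into ns-3 processes by a user-supplied partition function.
# Function names and the plan format are illustrative.

def instantiate(cfg, host_sim="qemu", partition=None):
    """Map a system configuration to (simulator, components) process plans."""
    parts = partition(cfg["switches"]) if partition else [cfg["switches"]]
    plan = [(host_sim, [h]) for h in cfg["hosts"]]   # one process per host
    plan += [("ns3", p) for p in parts]              # one ns-3 process per part
    return plan

def by_groups_of(k):
    """Example partition strategy: fixed-size groups of network components."""
    def split(switches):
        return [switches[i:i + k] for i in range(0, len(switches), k)]
    return split

cfg = {"hosts": ["h0", "h1"], "switches": ["s0", "s1", "s2", "s3"]}
plan = instantiate(cfg, host_sim="gem5", partition=by_groups_of(2))
```

Passing a different partition function (or none) changes how the network is decomposed without touching the system configuration, which is exactly the separation the abstraction is meant to provide.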
3.4.3 Running Simulations. Finally, to actually run SplitSim simulations, we leverage the existing SimBricks orchestration framework runtime. Since the instantiation above produces a SimBricks configuration as its output, this can directly be passed through for execution.
3.5 Workflow¶
The usage model and workflow of SplitSim are illustrated in Figure 2. The user begins by writing a configuration script that specifies the configuration of the system to be evaluated. Next, based on the system configuration, the user specifies possible simulation instantiation choices. Since it is difficult to predict bottlenecks and determine in advance how to divide the simulation workload across processes, the user typically prepares multiple candidate configurations with different fidelity and performance properties. Each configuration is then executed with the SplitSim profiler for a short duration to evaluate performance. The profiler generates a WTPG for each configuration, informing the user of bottlenecks and idle components. Based on these insights, the user refines the configuration script by further splitting bottleneck components into additional processes or consolidating idle ones. This iterative process continues until the workload is sufficiently balanced. Finally, the user executes the final configuration to collect the full simulation results.
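The iterative refinement in this workflow can be sketched as a simple loop: profile briefly, find the bottleneck (the component that waits the least, per the WTPG), split it, and repeat until no component stands out. The `run_profiled` and `split` callbacks here are hypothetical placeholders for short profiling runs and configuration edits, not real SplitSim functions.

```python
# Sketch of the iterative tuning loop from the workflow description.
# run_profiled(config) -> {component: wait fraction} (WTPG node annotations);
# split(config, component) -> new config with that component decomposed.

def refine(config, run_profiled, split, threshold=0.1, max_rounds=5):
    for _ in range(max_rounds):
        waits = run_profiled(config)
        bottleneck = min(waits, key=waits.get)   # least waiting = bottleneck
        if waits[bottleneck] >= threshold:       # nothing stands out: balanced
            return config
        config = split(config, bottleneck)
    return config

# Toy stand-ins: splitting "net" once removes the imbalance.
def fake_profile(cfg):
    if cfg == ["net", "host"]:
        return {"net": 0.02, "host": 0.6}        # "net" is the bottleneck
    return {"net.a": 0.3, "net.b": 0.3, "host": 0.4}

def fake_split(cfg, comp):
    return [c for c in cfg if c != comp] + [comp + ".a", comp + ".b"]

final = refine(["net", "host"], fake_profile, fake_split)
```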
Looking Forward¶
In this paper we have introduced SplitSim, a system and methodology for enabling end-to-end evaluation of large-scale systems in simulation when physical testbeds are out of reach. SplitSim enables users to easily navigate the inherent trade-offs between fidelity, simulation time, and computational resources. In our evaluation we have shown that SplitSim can simulate multiple hosts with full OS and application software stacks, along with their NICs, as part of a large-scale network of 1200 hosts, run this simulation on a single physical machine, and complete a 20 s simulation run in under four hours. Finally, SplitSim drastically lowers the barrier to entry for such simulations, by enabling users to configure simulations comprising multiple different simulators without needing expertise in configuring each and every one of them.
Navigating Fidelity Trade-offs. We have shown that mixed-fidelity simulation can significantly reduce both simulation time and computational resource usage while maintaining reasonable accuracy compared to full-fidelity simulation. However, finding the ideal mixed-fidelity configuration, with maximal performance but adequate fidelity for a concrete evaluation use case, a priori remains an open problem. In our evaluation we have relied on intuition to determine plausible configurations, relaxing fidelity where it seems less critical, and then validated these configurations against much shorter full-fidelity simulations (where possible). We expect that developing a robust methodology for determining this a priori will require substantial measurement studies across a broad range of different systems and simulation models, which we hope our community will develop over time. Until then, the current process is somewhat laborious, but since evaluations in our community involve many runs of similar system configurations and parameters, SplitSim still enables drastic time and resource savings.
Towards Automating Simulation Configuration. With SplitSim, we have introduced abstractions that separate configuration of the system, the "what", from the choice of simulation configuration, the "how". Further, through its profiler, SplitSim also provides a feedback mechanism to determine where problems lie. This leaves a typically large search space, but also an automated means of testing different points and obtaining feedback. In future work, based on these building blocks, we plan to develop automated methods to assist users in configuring large-scale simulations. Additionally, we expect that by collecting this data across many configurations, it will become possible to automatically provide implementation suggestions a priori based on the system configuration.