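SimBricks: End-to-End Network System Evaluation with Modular Simulation

An in-depth, high-quality paper

This paper is well worth studying for:

The "jigsaw-style" writing approach (assembling existing pieces into one system)
Building up a list of simulation/emulation systems to compare against
Common problems in "jigsaw-style" work:
- interfaces
- msg queue
- clock sync
- …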

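TLDR

First, let Gemini read it once

Existing network system simulation tools cannot evaluate a full system "end-to-end" (or they do so inefficiently and with poor modularity). In response, the paper proposes SimBricks, a new modular simulation framework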

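Core challenges and motivation

Limitations of physical testbeds:
- Physical environments are expensive to build, and cutting-edge hardware such as new ASICs or high-performance NICs is hard to obtain before it is commercially available
- Physical deployment is extremely challenging for large networks with thousands of nodes
Shortcomings of existing simulators:
- Traditional tools lack end-to-end simulation capability and usually simulate only a single component (e.g. only the host architecture or only network protocols)
- Different simulators are often incompatible with one another
- It is hard to balance simulation accuracy against scalability in system size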

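SimBricks design

SimBricks takes a modular approach: by defining standardized component interfaces, it decomposes several mature simulators and recombines them into a virtual end-to-end testbed

Decoupled interfaces (narrow-waist architecture):
- Interfaces follow the natural boundaries of physical systems, mainly a PCIe interface connecting hosts to devices and an Ethernet interface connecting NICs to the network
Message passing:
- Each simulator runs as a separate process and exchanges asynchronous messages over optimized shared-memory queues
Efficient synchronization protocol:
- Replaces poorly scaling global barrier synchronization with pairwise synchronization
- The protocol uses the inherent propagation latency of links as synchronization slack, maximizing parallel execution while keeping clocks accurate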

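Implementation and integration

SimBricks already integrates a range of simulators:

Host simulation: gem5 (detailed architectural simulation) and QEMU (fast functional simulation)
NIC simulation: a behavioral model of the Intel i40e NIC and a detailed RTL-level model of the Corundum FPGA NIC
Network simulation: ns-3, OMNeT++, and the Intel Tofino switch simulator
Other components: the FEMU NVMe SSD storage model and the Menshen programmable switch pipeline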

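Key evaluation results

Worth studying how the authors present these

Scalability:
- Local scale-up: tens of complete simulated hosts on a single physical server
- Distributed scale-out: with TCP/RDMA proxies, simulations scale to 1000 simulated hosts
- In the experiments, growing the simulation from 40 to 1000 hosts increases gem5 simulation time by only 14%
Accuracy validation: SimBricks reproduces key findings from congestion control (e.g. DCTCP), in-network compute (e.g. NOPaxos), and NIC hardware architecture research
Performance: compared with the conventional dist-gem5 synchronization mechanism, SimBricks reduces simulation time by about 74% at a scale of 32 nodes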

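How jigsaw-style work shows its strengths

Scalability: "I am the jigsaw": others run on one machine, I run on many; others are "bricks", I am the "floor"
Accuracy validation: "whatever others can do, mine must come close, without breaking anything"
Performance: "whatever others can do, mine does faster"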

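Limitations and non-goals

Not an accelerator: the framework itself does not speed up individual component simulators
Limited semantic compatibility: it cannot bridge semantic gaps between simulation models
- For example, a packet-based gem5 host cannot be connected directly to a flow-based simulator that does not operate on packets
Hardware constraint: the current evaluation focuses mainly on single-core simulated hosts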

Introduction

Our community expects research ideas to be implemented and evaluated as part of a complete system "end-to-end" in realistic conditions. End-to-end evaluation is important as many factors in each system component affect the overall behavior in subtle and unpredictable ways. Yet evaluation in full physical testbeds is frequently infeasible. Work might require cutting edge commercial hardware that is not yet available at the time of publication [32, 33, 35, 50], develop hardware extensions to existing proprietary hardware [51], or propose entirely new ASIC hardware architectures [9, 13, 25, 27, 34, 36, 53, 55]. The trend towards increasingly specialized hardware, including SmartNICs, programmable switches, and other accelerators, further exacerbates this. Finally, work on network protocols and congestion control necessitates evaluation in large scale networks with hundreds to thousands of hosts.

When a full evaluation in a physical testbed is not possible, simulation has long offered an alternative. In networking, we use ns-2 [43], ns-3 [44], and OMNeT++ [57] to evaluate protocols and algorithms; computer architects rely on system simulators such as gem5 [8], while hardware designers employ RTL simulators such as Modelsim [52] or Verilator [54]. While network systems do benefit from these simulators [4, 28, 41], they do not enable end-to-end evaluation, as no existing simulator simulates all required components in a testbed: hosts, devices, and the full network.

In this paper, we demonstrate how to enable end-to-end network system simulation by combining different simulators to cover the necessary functionality. Instead of building a new simulator, throwing away decades of work, we connect existing and new simulators for hosts, hardware devices, and networks – into full system simulations capable of running unmodified operating systems, drivers, and applications. Existing simulators, however, are standalone and not designed to be combined with other simulators. To achieve modular end-to-end simulation, we thus need to overcome three technical challenges: 1) no interfaces to connect simulators together, 2) efficient, scalable, and correct synchronization of simulator clocks, and 3) combining mutually incompatible simulation models.

We present the design and implementation of SimBricks, a modular simulation framework for end-to-end network system simulations. SimBricks defines interfaces for interconnecting simulators based on natural component boundaries in physical systems, specifically PCIe and Ethernet links. Individual component simulators run in parallel as separate processes, and communicate via message passing only between connected peers through optimized shared memory queues. With this message transport, we co-design a protocol that leverages simulation topology and latency at component boundaries for efficient and accurate synchronization of simulator clocks. For scaling out simulations across physical hosts, we introduce a proxy to forward messages over TCP or RDMA.

Currently, SimBricks integrates QEMU [46] and gem5 [8] as host simulators, Verilator [54] as an RTL hardware simulator for hardware devices, and ns-3 [44], OMNeT++ [57], as well as the Intel Tofino simulator [23] for network simulation. Further, we have integrated open source RTL designs for the Corundum FPGA NIC [16] and the Menshen switch pipeline [58] to showcase SimBricks's generality. We have also implemented fast behavioral simulators, e.g. for the Intel X710 40G NIC [22], and ported the FEMU NVMe SSD model [31] into SimBricks. In combination, these simulators enable a broad range of end-to-end configurations for different use-cases.

In our evaluation, we demonstrate that SimBricks enables end-to-end simulation of existing network systems at small and large scales. We also reproduce key results from congestion control [3], in-network compute [33], and FPGA NIC design [16] in SimBricks. SimBricks obtains more realistic results compared to ns-3 in isolation (§3). SimBricks also scales to 1000 hosts and NICs with only a 14% increase in simulation time compared to a 40-host simulation (§7.4). Finally, SimBricks provides deep visibility and control of low-level system behaviors, facilitating evaluation and performance debugging (§8.1).

We make the following technical contributions:

• Modular architecture for end-to-end system simulation (§5.1) combining host, device, and network simulators.

• Co-designed message transport and synchronization mechanism for parallel and distributed simulations (§5.5, §5.2) leveraging pairwise message passing to efficiently ensure correct simulation, even at scale.

• SimBricks, a prototype implementation of our architecture (§6) with integrations for existing and new simulators.

SimBricks is available open source at https://simbricks.github.io. This work does not raise any ethical issues.

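The research community expects ideas to be implemented and evaluated "end-to-end", as part of a complete system under realistic conditions. End-to-end evaluation matters because many factors in each system component affect overall behavior in subtle and unpredictable ways. Yet evaluation on a full physical testbed is often infeasible: the work may need cutting-edge commercial hardware that is not yet available at publication time, hardware extensions to existing proprietary hardware, or entirely new ASIC hardware architectures. The trend toward increasingly specialized hardware, including SmartNICs, programmable switches, and other accelerators, makes this worse. Finally, work on network protocols and congestion control needs evaluation in large networks with hundreds to thousands of hosts.

When a full evaluation on a physical testbed is not possible, simulation has long served as the alternative:

In networking, we use ns-2, ns-3, and OMNeT++ to evaluate protocols and algorithms
Computer architects rely on system simulators such as gem5
Hardware designers use RTL simulators such as Modelsim or Verilator

Network systems do benefit from these simulators, but none enables end-to-end evaluation, because no existing simulator covers every required testbed component: hosts, devices, and the full network.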

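In this paper, we show how to enable end-to-end network system simulation by combining different simulators to cover the necessary functionality.

Rather than building a new simulator and throwing away decades of work, we connect existing and new simulators for hosts, hardware devices, and networks into full-system simulations that run unmodified operating systems, drivers, and applications.

Existing simulators, however, are standalone and were not designed to be combined with others. To achieve modular end-to-end simulation, we need to overcome three technical challenges:

No interfaces for connecting simulators together
Efficient, scalable, and correct synchronization of simulator clocks
Combining mutually incompatible simulation models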

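We present the design and implementation of SimBricks, a modular simulation framework for end-to-end network system simulation:

SimBricks defines interfaces for interconnecting simulators based on natural component boundaries in physical systems, specifically PCIe and Ethernet links
Individual component simulators run in parallel as separate processes and exchange messages only with connected peers, through optimized shared-memory queues
On top of this message transport, we co-design a protocol that uses the simulation topology and the latency at component boundaries to synchronize simulator clocks efficiently and accurately
To scale simulations out across physical hosts, we introduce a proxy that forwards messages over TCP or RDMA

Currently, SimBricks integrates QEMU and gem5 as host simulators, Verilator as an RTL simulator for hardware devices, and ns-3, OMNeT++, and the Intel Tofino simulator for network simulation. We have also integrated open source RTL designs for the Corundum FPGA NIC and the Menshen switch pipeline to showcase the generality of SimBricks. In addition, we implemented fast behavioral simulators (e.g. for the Intel X710 40G NIC) and ported the FEMU NVMe SSD model into SimBricks. In combination, these simulators enable a broad range of end-to-end configurations for different use cases.

In our evaluation, we demonstrate that SimBricks enables end-to-end simulation of existing network systems at small and large scales. We also reproduce key results from congestion control, in-network compute, and FPGA NIC design in SimBricks. SimBricks obtains more realistic results than ns-3 in isolation. It also scales to 1000 hosts and NICs with only a 14% increase in simulation time compared to a 40-host simulation. Finally, SimBricks provides deep visibility into and control over low-level system behavior, which helps evaluation and performance debugging.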

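We make the following technical contributions:

A modular architecture for end-to-end system simulation, combining host, device, and network simulators
A co-designed message transport and synchronization mechanism for parallel and distributed simulations, using pairwise message passing to efficiently ensure correct simulation, even at scale
SimBricks, a prototype implementation of our architecture, with integrations for existing and new simulators

SimBricks is open source at github/simbricks. This work does not raise any ethical issues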

Simulation Background

Simulators employ techniques such as discrete event simulation, binary translation, and hardware virtualization, to simulate system components at various scales and levels of detail. Network simulators, such as ns-2 [43], ns-3 [44], and OMNeT++ [57], use discrete event simulation to model packets traversing network topologies. Computer architecture simulators, such as gem5 [8], QEMU [46], and Simics [37], simulate full computer systems capable of running unmodified guest software, including operating systems, with different and sometimes configurable degrees of detail. These simulators also include I/O devices, but often only implement the minimum features for basic functionality. Hardware RTL simulations, such as xsim [59] and Verilator [54], help test and debug hardware designs cycle by cycle against testbenches. In all three cases individual components are simulated in isolation.

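Simulators use techniques such as discrete event simulation, binary translation, and hardware virtualization to simulate system components at various scales and levels of detail

Network simulators (e.g. ns-2, ns-3, and OMNeT++) use discrete event simulation to model packets traversing network topologies
Computer architecture simulators (e.g. gem5, QEMU, and Simics) simulate full computer systems that run unmodified guest software, including operating systems, at different and sometimes configurable levels of detail
- These simulators also include I/O devices, but often implement only the minimum features needed for basic functionality
Hardware RTL simulators (e.g. xsim and Verilator) help test and debug hardware designs cycle by cycle against testbenches

In all three cases, individual components are simulated in isolation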

Advantages. The main motivation for simulation is that a physical implementation is often not feasible. Simulations are also portable as they decouple the simulated system from the host system. Many are deterministic (with explicit seeds for randomness), providing reproducible results. Simulators are also flexible; implemented as software they can be modified, and frequently offer parameters representing a broad range of configurations. Finally, simulations provide great visibility, and can log details about the system, without affecting behavior.

Disadvantages. Simulations also have some common drawbacks. Long simulation times are common – architectural simulators often only simulate hundreds or thousands of system cycles a second [26, 55], and simulating a few milliseconds of a large scale topology in ns-3 can take many hours. Different simulators strike different trade-offs between accuracy and simulation time, depending on the intended use-case. Finally, simulation results are only as good as the simulator, and may not be representative unless validated against a physical testbed.

Comparison to Emulation. Emulations replicate externally visible behavior of a system without modeling internal details, and typically run at close to interactive speeds. For example, Mininet [30] emulates OpenFlow networks with multiple end-hosts running real Linux applications at near native speed on a single physical host, by using Linux containers and kernel network features. However, as emulation uses wall-clock time, it only works as long as all components can keep up in real time. Simulations, in contrast, rely on virtual time which can slow down without affecting simulated behavior. Additionally, emulation does not model internals of a system that could affect system behavior, e.g., interactions between NIC and drivers. As such, emulation is primarily useful for interactive testing or performance evaluation when fidelity is not crucial.

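(1) Advantages: the main motivation for simulation is that a physical implementation is often infeasible

Portable: simulations decouple the simulated system from the host system
Many simulations are deterministic (with explicit seeds for randomness), giving reproducible results
Flexible:
- Implemented as software, they can be modified, and they frequently offer parameters covering a broad range of configurations
Great visibility: they can log details about the system without affecting its behavior

(2) Disadvantages: simulations also share some common drawbacks, namely run time and result quality

Long simulation times are common:
- Architectural simulators often simulate only hundreds or thousands of system cycles per second
- Simulating a few milliseconds of a large-scale topology in ns-3 can take many hours
- Different simulators strike different trade-offs between accuracy and simulation time, depending on the intended use case
Simulation results are only as good as the simulator, and may not be representative unless validated against a physical testbed

(3) Comparison to Emulation:

Emulation replicates the externally visible behavior of a system without modeling internal details, and typically runs at close to interactive speed

For example, Mininet uses Linux containers and kernel network features to emulate OpenFlow networks with multiple end hosts running real Linux applications at near-native speed on a single physical host

However, because emulation uses wall-clock time, it only works as long as all components can keep up in real time

Simulation, in contrast, relies on virtual time, which can slow down without affecting simulated behavior

In addition, emulation does not model system internals that could affect behavior, such as interactions between NIC and drivers

Emulation is therefore mainly useful for interactive testing, or for performance evaluation when fidelity is not crucial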
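The difference between virtual time and wall-clock time can be made concrete with a tiny discrete-event loop. This is only an illustrative sketch, not code from any of the simulators named above: `run_events`, `LINK_LATENCY_NS`, and the event names are hypothetical. It shows that making event handlers slower in real time leaves the simulated timestamps unchanged.

```python
import heapq

LINK_LATENCY_NS = 500  # hypothetical propagation delay in virtual nanoseconds


def run_events(initial_events, slow_handler=False):
    """Run a minimal discrete-event loop; return the (virtual_time, name) trace."""
    queue = list(initial_events)
    heapq.heapify(queue)  # events are ordered by virtual timestamp
    trace = []
    while queue:
        now, name = heapq.heappop(queue)
        if slow_handler:
            sum(range(10_000))  # burn wall-clock time; virtual time is unaffected
        trace.append((now, name))
        if name == "send":
            # A send at virtual time `now` is received one link latency later.
            heapq.heappush(queue, (now + LINK_LATENCY_NS, "recv"))
    return trace


fast = run_events([(0, "send"), (100, "timer")])
slow = run_events([(0, "send"), (100, "timer")], slow_handler=True)
print(fast == slow)
```

An emulator that timestamps events with the wall clock would instead record different times for the slow run, which is why emulation only works while every component keeps up in real time.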

Systems Research Challenges

Systems research faces additional challenges that complicate using simulation during prototyping and evaluation.

Systems research faces additional challenges that complicate the use of simulation during prototyping and evaluation

Not end-to-end. First and foremost, no existing simulator covers all required components for network systems with sufficient features and detail, precluding end-to-end evaluation. While existing simulators cover individual components, such as computer architecture, hardware devices, and networks, they only do so in isolation with no mechanism for combining them into complete systems. As a result, we are left with non-end-to-end "piecemeal" evaluation, where different components are evaluated in isolation [4, 20, 41].

We illustrate the pitfalls of piecemeal evaluation by comparing dctcp [3] congestion control behavior in the ns-3 network simulator to a physical testbed. As network speed increases and bottlenecks move to end-hosts, congestion control incurs small variations in timing in the host hardware and software which can affect behavior [3, 29, 40]. However, ns-3 only models network and protocol behavior, and as a result, does not capture these factors. We set up two clients and two servers sharing a single 10G bottleneck link with a 4000B MTU, and one large TCP flow generated by iperf for each client-server pair. Fig. 1 shows the throughput for varying dctcp marking thresholds 𝐾. The marking threshold balances queuing latency and throughput; a lower threshold reduces queue length but risks under-utilizing links. ns-3 underestimates the necessary threshold [3] to achieve line rate, as it does not model host processing variations, particularly processing delay caused by OS interrupt scheduling. Only an end-to-end evaluation of the full system captures such intricacies.

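Not end-to-end

Nobody has done end-to-end so far; existing work stitches single components together in an unnatural way

The first problem: no single existing simulator covers all components a network system requires with sufficient features and detail

This precludes end-to-end evaluation

Existing simulators cover individual components such as computer architecture, hardware devices, and networks, but only in isolation and with no mechanism for combining them into complete systems. Researchers are therefore left with non-end-to-end, "piecemeal" evaluation, where different components are evaluated in isolation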

[Figure 1: throughput for varying DCTCP marking thresholds K]

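We illustrate the pitfalls of piecemeal evaluation by comparing DCTCP congestion control behavior in the ns-3 network simulator with a physical testbed. As network speeds increase and bottlenecks move to end hosts, small timing variations in host hardware and software affect congestion control and thereby change system behavior

ns-3, however, models only network and protocol behavior and cannot capture these host-side factors. We set up two clients and two servers sharing a single 10G bottleneck link (4000B MTU), with one large TCP flow generated by iperf for each client-server pair

Fig. 1 shows the throughput for varying DCTCP marking thresholds K. The marking threshold balances queuing latency against throughput: a lower threshold shortens the queue but risks under-utilizing the link.

Because ns-3 does not model host processing variations (in particular, processing delay caused by OS interrupt scheduling), it underestimates the threshold needed to reach line rate. Only an end-to-end evaluation of the full system captures such intricacies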
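To make the role of the marking threshold K concrete, the sketch below writes out the standard DCTCP rules as published for DCTCP itself; it is not code from SimBricks or ns-3, and the function names are chosen here for illustration. The switch marks a packet when the queue exceeds K, the sender keeps a running estimate `alpha` of the marked fraction (with gain `g`, for which 1/16 is the commonly suggested value), and it shrinks its window in proportion to `alpha`.

```python
def should_mark(queue_len_pkts, k):
    """Switch side: set the ECN mark when the instantaneous queue exceeds K."""
    return queue_len_pkts > k


def update_alpha(alpha, marked_fraction, g=1 / 16):
    """Sender side: moving average of the fraction of marked packets."""
    return (1 - g) * alpha + g * marked_fraction


def reduce_cwnd(cwnd, alpha):
    """Sender side: cut the window in proportion to the congestion estimate."""
    return cwnd * (1 - alpha / 2)


print(should_mark(70, 65), should_mark(60, 65))
print(reduce_cwnd(100.0, update_alpha(0.0, 1.0, g=0.5)))
```

With a small K the switch marks early, so senders back off while the queue is still short; if hosts deliver packets in bursts (for example because of interrupt scheduling), that can leave the link idle, which is the effect a network-only simulation misses.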

Not scalable. Network and distributed systems frequently require evaluation on clusters beyond tens of hosts to demonstrate scalability. But for most simulators, already long simulation times increase super-linearly with the size of the simulated system, making simulation of a large network system an infeasible task.

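Not scalable

Once the scale goes up, simulation time balloons

Network and distributed systems often need evaluation on clusters beyond tens of hosts to demonstrate scalability

For most simulators, however, already long simulation times grow super-linearly with the size of the simulated system, making simulation of a large network system infeasible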

Not modular. Using simulators for systems research often requires extending existing simulators with additional functionality, e.g., adding a new NIC to an architecture simulator. These extensions are tied to a particular simulator, as different simulators lack common internal interfaces. This complicates apples-to-apples comparisons for future work that may use a different simulator, e.g., to simulate a host with a different NIC, forcing the same simulator to be used throughout the project cycle. Finally, this tight integration complicates the implementation and releasing of such extensions, as they often require maintaining a fork of the full simulator.

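Not modular

Extensions tend to be tied to one simulator and are hard to reuse

Using simulators in systems research often requires extending them, e.g. adding a new NIC to an architecture simulator.

Because different simulators lack common internal interfaces, these extensions are usually tied to a particular simulator.

This complicates apples-to-apples comparisons for future work that uses a different simulator, and forces the same simulator to be used throughout the project cycle. Finally, the tight integration complicates implementing and releasing such extensions, since researchers often have to maintain a fork of the full simulator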

Modular Simulation

We argue that end-to-end simulations can be effectively assembled from multiple different interconnected and synchronized simulators for individual components. To demonstrate this, we present SimBricks, a new modular simulation framework that aims to provide end-to-end network system simulation.

End-to-end simulations are better. Returning to the dctcp example from earlier, Fig. 2 shows the simulation setup that produces the result shown in Fig. 1. We combine four instances of gem5 with four instances of the Intel i40e NIC simulator we developed, each pair connected through PCIe; all NIC simulators are in turn connected to an instance of ns-3. The gem5 instances are running a full Ubuntu image with unmodified NIC drivers and iperf. Fig. 1 shows that our SimBricks simulation approximates the behavior of the physical testbed much more closely than ns-3, and yields the same insight. We conclude that end-to-end evaluation with SimBricks improves accuracy for network system evaluation over non-end-to-end simulators.

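We argue that end-to-end simulations can be effectively assembled from multiple interconnected, time-synchronized simulators for individual components. To demonstrate this, we present SimBricks, a new modular simulation framework that aims to provide end-to-end network system simulation

End-to-end simulations are better: returning to the earlier DCTCP example, Fig. 2 shows the simulation setup that produces the result in Fig. 1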

[Figure 2: simulation setup that produces the result shown in Fig. 1]

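We combine four gem5 instances with four instances of the Intel i40e NIC simulator we developed, each pair connected through PCIe
All NIC simulators are in turn connected to one ns-3 instance
The gem5 instances run a full Ubuntu image with unmodified NIC drivers and iperf

Fig. 1 shows that the SimBricks simulation approximates the behavior of the physical testbed much more closely than standalone ns-3, and yields the same insight

We conclude that end-to-end evaluation with SimBricks improves the accuracy of network system evaluation over non-end-to-end simulators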

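Worth studying how the Design section is written

Design Idea: a few short sentences that pinpoint the core "clever trick"
Design Goal: xxx
Design Challenges: xxx
Design Principles: xxx
Design Non-Goals: limitations are framed as "non-goals" rather than as defects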

4.1 Design Goals

To address the challenges for using simulations in systems research (§3), we have the following design goals for SimBricks:

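End-to-end: simulate full network systems, with hosts, existing or custom devices, network topologies, and the full software stack, including unmodified OS and applications.
- Covers the whole system: hosts, existing or custom devices, network topologies, and the full software stack (including unmodified operating systems and applications)
Scalable: simulate large network systems consisting of tens or hundreds of separate hosts and devices.
- Supports large network systems made up of tens or hundreds of separate hosts and devices
Fast: keep simulation times as low as possible.
- Keep simulation run time as low as possible
Modular: enable flexible composition of simulators, where components can be added and swapped independently.
- Simulators compose flexibly; each component can be added or swapped independently
Accurate: preserve accuracy of constituent simulators, correctly interface and synchronize components to behave equivalent to a monolithic simulator with the same models.
- Preserve the accuracy of the constituent simulators
- With correct interfacing and synchronization, behave like a monolithic simulator using the same models (the "part within the whole" must not do worse than the standalone simulator)
Deterministic: keep end-to-end simulation deterministic when all individual simulators are deterministic.
- Provided every individual simulator is deterministic, the end-to-end simulation stays deterministic and repeatable
Transparent: provide deep and detailed visibility into end-to-end performance without affecting simulation behavior, to support debugging and performance analysis.
- Deep, detailed visibility into end-to-end performance to support debugging and performance analysis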

4.2 Technical Challenges

Achieving our design goals incurs the following challenges:

Simulation interconnection interfaces. Unfortunately, existing simulators are standalone and provide no suitable interfaces for interconnecting with other external simulators. Moreover, enabling modular "plug-and-play" configurations, where components can be independently swapped out, requires common, well-defined interfaces between different component types.

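Simulation interconnection interfaces:

Unfortunately, existing simulators are mostly standalone and lack suitable interfaces for interconnecting with other external simulators. Moreover, modular "plug-and-play" configurations, where components can be swapped out independently, require common, well-defined interfaces between different component types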

Scalable synchronization and communication. Individual component simulators maintain their own virtual simulation clocks that progress at different rates. To accurately connect simulators, we need to synchronize their virtual clocks. However, this synchronization comes at a performance cost, especially with increasing system scale. For example, we measure a 3.7× increase in runtime for the dist-gem5 [42] simulator when scaling from 2 to 16 simulated hosts, due to synchronization overhead (§7.3.1). Prior work shows synchronization overhead can be reduced by sacrificing accuracy and determinism through lax synchronization [12, 19]. Since this violates two of our design goals, we do not consider this.

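Scalable synchronization and communication:

Each component simulator maintains its own virtual simulation clock, and these clocks progress at different rates. To connect simulators accurately, their virtual clocks must be synchronized

Synchronization, however, comes at a performance cost that grows with system scale

For example, due to synchronization overhead, the run time of the dist-gem5 simulator increases 3.7× when scaling from 2 to 16 simulated hosts

Prior work shows that lax synchronization can reduce this overhead by sacrificing accuracy and determinism, but since that violates two of our design goals, we do not adopt it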

Incompatible simulation models. Finally, different simulators often employ mutually incompatible simulation models. For example, QEMU has a synchronous device model where calls in device code block until complete, while ns-3 schedules asynchronous events to model networks, and Verilator simulates hardware circuits cycle by cycle. We therefore need an interface compatible with all of these simulation models.

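Incompatible simulation models:

Finally, different simulators often employ mutually incompatible simulation models. For example:

QEMU has a synchronous device model in which calls in device code block until complete
ns-3 schedules asynchronous events to model networks
Verilator simulates hardware circuits cycle by cycle

We therefore need an interface compatible with all of these simulation models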

4.3 Design Principles

We address these challenges through four design principles:

Fix natural component simulator interfaces. To enable modular composition of simulators, SimBricks defines an interface for each component type (§5.1). We base these interfaces on the point-to-point component boundaries in real systems: PCI express (PCIe) connects today's hardware devices to servers, while network devices typically connect through Ethernet networks. We choose these interfaces as a starting point, but our approach generalizes to other interconnects and networks. These component interfaces form narrow waists, decoupling innovation on both sides: To integrate a simulator into SimBricks, developers need to add an adapter that implements the component interface, without needing to modify other simulators. We assume a static topology of components throughout a simulation.

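Fix natural component simulator interfaces:

To enable modular composition of simulators, SimBricks defines a standard interface for each component type.

These interfaces are based on the point-to-point component boundaries in real systems:

PCIe connects today's hardware devices to servers
Network devices typically connect through Ethernet

We choose these interfaces as a starting point, but the approach generalizes to other interconnects and networks.

These component interfaces form "narrow waists" that decouple innovation on both sides: to integrate a simulator into SimBricks, developers only add an adapter implementing the component interface, without modifying other simulators

We assume a static topology of components throughout a simulation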

Loose coupling with message passing. Instead of tightly integrating multiple simulators into one simulation loop, SimBricks runs component simulators as separate processes that communicate through message passing (§5.1) across our defined interfaces. This drastically simplifies integrating simulators into SimBricks, as we treat each simulator as a black-box that only needs to implement our interfaces. Using asynchronous message passing also maximizes compatibility with different simulation models: Discrete event and cycle-by-cycle simulations can issue requests and process responses at the scheduled times, while blocking simulations can block till the response message arrives — for peer simulators this is transparent. Message passing channels also provide inspection points for debugging and tracing system behavior without modifying component simulators.

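Loose coupling with message passing:

Instead of tightly integrating multiple simulators into one simulation loop, SimBricks runs component simulators as separate processes that communicate through message passing across the defined interfaces

This drastically simplifies integration, since each simulator is treated as a "black box" that only needs to implement the interfaces

Asynchronous message passing also maximizes compatibility with different simulation models:

Discrete event and cycle-by-cycle simulations can issue requests and process responses at the scheduled times
Blocking simulations can block until the response message arrives

This is transparent to peer simulators. Message passing channels also provide inspection points for debugging and tracing system behavior without modifying component simulators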

Parallel execution with shared memory queues. We run simulators in parallel on different host cores and connect them through optimized shared-memory queues (§5.2). As simulators run on separate cores and only communicate when necessary, this avoids unnecessary cache-coherence traffic and hidden scalability bottlenecks. These mechanisms allow us to (i) scale up to large simulations: Instead of simulating the complete system in one simulation instance, we simulate different components of the system in separate simulators running in parallel (§5.3). (ii) scale out with distributed simulations: We use a separate proxy that transparently forwards messages on shared memory queues over the network to and from simulators running on remote hosts (§5.4).

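Parallel execution with shared memory queues:

We run simulators in parallel on different host cores and connect them through optimized shared-memory queues

Because simulators run on separate cores and communicate only when necessary, this avoids unnecessary cache-coherence traffic and hidden scalability bottlenecks. These mechanisms allow us to:

Scale up to large simulations by running separate component simulators in parallel, instead of simulating the complete system in one simulation instance
Scale out with distributed simulations: a separate proxy transparently forwards messages on shared memory queues over the network to and from simulators running on remote hosts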

Accurate and efficient synchronization. We ensure accurate simulation through correct time synchronization among simulators, but with minimum runtime overhead. Synchronization is optional, and the user can disable it for unsynchronized emulations. For this, we combine three key insights:

1) Global synchronization is not necessary as our simulator boundaries at point-to-point interfaces limit which simulators directly communicate. As long as events at these pairwise interfaces are processed in a time-synchronized manner, simulation behavior is correct.

2) Latency at component interfaces provides slack, reducing frequency of component having to wait for others to coordinate [12] and thus synchronization overhead. An event sent at time 𝑇 only arrives at 𝑇 + Δ, as our component interfaces have an inherent latency Δ in physical systems that we model.

3) By inlining synchronization with efficient polled message transfers, synchronization overheads can be minimized and sometimes completely avoided. We combine these observations to design an accurate, efficient, and scalable synchronization mechanism for parallel end-to-end simulations (§5.5).

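Accurate and efficient synchronization:

We ensure accurate time synchronization among simulators while keeping run-time overhead to a minimum. Synchronization is optional, and users can disable it for unsynchronized emulations. To this end, we combine three key insights:

Global synchronization is not necessary: point-to-point interfaces limit which simulators communicate directly, so as long as events at these pairwise interfaces are processed in a time-synchronized manner, simulation behavior is correct
Interface latency provides slack: the inherent latency of component interfaces reduces how often a component has to wait for others to coordinate, and thus the synchronization overhead. In the model, an event sent at time \(T\) only arrives at \(T + \Delta\)
Inlined synchronization: by inlining synchronization with efficient polled message transfers, synchronization overhead can be minimized and sometimes avoided completely

We combine these observations to design an accurate, efficient, and scalable synchronization mechanism for parallel end-to-end simulations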
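The first two insights can be illustrated with a small calculation. The sketch below is not the SimBricks protocol itself (that is the subject of §5.5); it assumes, for illustration only, that every directly connected peer has reported its current virtual clock and will not send anything stamped earlier. The names `arrival_time`, `safe_horizon`, and the peer labels are hypothetical.

```python
def arrival_time(send_time, link_latency):
    """An event sent at virtual time T arrives at T + delta."""
    return send_time + link_latency


def safe_horizon(peer_clocks, link_latency):
    """Latest virtual time a simulator can advance to without missing an event.

    Only directly connected peers are consulted (pairwise, not global), and
    each peer's clock is extended by the latency of the link to that peer.
    """
    return min(
        arrival_time(peer_clocks[peer], link_latency[peer])
        for peer in peer_clocks
    )


# A network simulator connected to two NIC simulators over 500 ns links.
clocks = {"nic0": 1000, "nic1": 900}
latency = {"nic0": 500, "nic1": 500}
print(safe_horizon(clocks, latency))
```

Here the simulator may run up to virtual time 1400 even though its slowest peer is only at 900: the link latency is the slack that lets neighbours run ahead of each other without violating event order.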

4.4 Non-Goals

SimBricks is not a panacea. We explicitly view the following aspects as out of scope for this paper and leave them for future work:

Accelerating component simulators. SimBricks does not generally aim to reduce simulation times for individual component simulators as we only modify simulators to add SimBricks adapters. Simulation times for synchronized end-to-end SimBricks simulations are at least as high as the slowest component simulator, and may increase due to synchronization and communication overhead. However, in a few cases, SimBricks interfaces enable developers to decompose an existing component simulator into multiple smaller parallel pieces, thereby reducing simulation time (§7.3.2).

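Accelerating component simulators:

SimBricks does not generally aim to reduce simulation times for individual component simulators, since simulators are only modified to add SimBricks adapters

A synchronized end-to-end SimBricks simulation takes at least as long as the slowest component simulator, and may take longer because of synchronization and communication overhead

In a few cases, however, SimBricks interfaces let developers decompose an existing component simulator into multiple smaller parallel pieces, thereby reducing simulation time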

Avoiding need for validation. To obtain representative results, users need to validate component simulation configurations in SimBricks as with any other simulation. Validation effort is no higher in SimBricks than it would be in an equivalent monolithic simulator, as SimBricks forwards timestamped events accurately from one simulator-internal interface to another without modifying them (except for the configured link latency). We expect, however, that SimBricks could reduce validation effort by allowing users to re-combine validated component simulator configurations without validating from scratch. (§9)

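Avoiding the need for validation:

To obtain representative results, users still need to validate component simulation configurations in SimBricks, as with any other simulation

Validation effort is no higher in SimBricks than in an equivalent monolithic simulator, because SimBricks forwards timestamped events accurately from one simulator-internal interface to another without modifying them (except for the configured link latency)

We expect, however, that SimBricks could reduce validation effort by letting users re-combine validated component simulator configurations without validating from scratch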

Interfacing semantically incompatible simulators. While SimBricks can combine simulators that use different models for simulation, it cannot bridge semantic gaps between simulators. For example, SimBricks cannot connect a gem5 host sending packets through an RTL NIC with a flow-based network simulator. Such conversions may be possible in special cases, but are specific to the concrete simulators, and as such could be integrated as part of a SimBricks adapter in such a simulator.

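Interfacing semantically incompatible simulators:

While SimBricks can combine simulators that use different simulation models, it cannot bridge semantic gaps between simulators

For example, SimBricks cannot connect a gem5 host sending packets through an RTL NIC to a flow-based network simulator

Such conversions may be possible in special cases, but they are specific to the concrete simulators and would therefore be integrated as part of that simulator's SimBricks adapter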

Design

Using our design principles, we have built SimBricks, a modular, end-to-end simulation framework shown in Fig. 3. In this section, we detail the design of SimBricks, including simulator interfaces, fast message transport, techniques to scale up and out to larger simulations, and the synchronization mechanism.

5.1 Component Simulator Interfaces

SimBricks achieves modularity through well-defined interfaces between component simulators: Host simulators connect to device simulators through a PCIe interface; NIC and network simulators interconnect through an Ethernet interface. This results in a double hourglass architecture (Fig. 3) with narrow waists at component boundaries. In physical systems both interfaces are asynchronous and incur propagation delay (\(\Delta_i\)). We replicate both aspects.

[Figure 3: SimBricks double hourglass architecture]

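SimBricks achieves modularity through well-defined interfaces between component simulators:

Host simulators connect to device simulators through a PCIe interface
NIC and network simulators interconnect through an Ethernet interface

The result is a double hourglass architecture with "narrow waists" at component boundaries (see Fig. 3)

In physical systems both interfaces are asynchronous and incur propagation delay (\(\Delta_i\)); the simulation replicates both aspects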

5.1.1 PCIe: Host-Device Interface

PCIe itself is a layered protocol, ranging from the low-level physical layer to the transactional layer for data operations. We define SimBricks's host-device interface (Fig. 4) based on the PCIe transactional layer, and abstract away physical attributes of the PCIe link with simple parameters – link bandwidth and latency. Low-level complexity such as encoding and signaling are unnecessary for most system simulations and would incur substantial cost and complexity for each simulator. Should future use-cases need to model this, a detailed PCIe simulator could be integrated as an interposed component (§5.3).

Discovery and Initialization. A key PCIe feature is that hosts can enumerate and identify connected devices and the features they support. To this end, our interface defines the INIT_DEV message for registering device simulators with the host simulator. The device simulator includes device information in the message, such as the PCI vendor, device identifiers, base address registers (BARs), the number of MSI(-X) interrupt vectors, and addresses of the MSI-X table and PBA. The host simulator uses this information to expose a corresponding PCIe device to the system.

Data transfers: MMIO & DMA. PCIe data transfers are symmetrical: both sides can initiate reads and writes, which the other side completes. SimBricks's PCIe interface defines DMA_READ / WRITE messages for DMA transfers initiated by device simulators, and MMIO_READ / WRITE for MMIO accesses initiated by host simulators. As in PCIe, all data transfer operations are asynchronous. Once a request is finished, the device simulator issues a MMIO_COMPL completion message, while the host simulator adapter sends a DMA_COMPL. PCIe allows multiple outstanding operations and only guarantees that they will be issued to the memory system in the order of arrival. Completion events, however, may arrive out-of-order. To match completions with outstanding requests, all requests carry an identifier that the receiving simulator includes in the response.

Interrupts. Our interface supports all PCIe interrupt signaling methods: legacy interrupts (INTX), message signaled interrupts (MSI), and MSI-X. Physical PCIe devices implement MSI (including configuration, masking, and generating signalling operations) completely on the device side. To reduce repeated implementation effort in device simulators and integration challenges in host simulators, we instead opt to keep this functionality inside the host simulator. The device issues INTERRUPT messages to either trigger an interrupt vector for MSI(-X) or to set interrupt pin state for INTX. To support devices that require knowledge about which interrupt mechanisms the OS has enabled, our interface provides the INT_STATUS message which the host simulator sends on configuration changes.

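PCIe itself is a layered protocol, ranging from the low-level physical layer to the transactional layer for data operations.

We define the SimBricks host-device interface (see Fig. 4) based on the PCIe transactional layer, and abstract away the physical attributes of the PCIe link with simple parameters: link bandwidth and latency.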

alt text

编码和信号传输等底层复杂性对于大多数系统仿真而言并非必要,且会显著增加每个仿真器的开发成本与复杂度

???+ danger"链路与编码没做"

Text Only
明确提到:

编码和信号传输等底层复杂性对于大多数系统仿真而言并非必要, 本文不考虑这一部分的模拟

若未来的用例需要对此建模,可以集成一个详细的 PCIe 仿真器作为中间组件(见第 5.3 节)

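(1) Discovery and initialization:

A key PCIe feature is that hosts can enumerate and identify connected devices and the features they support. To this end, the interface defines the INIT_DEV message, which registers a device simulator with the host simulator. The device simulator includes device information in the message: the PCI vendor and device identifiers, base address registers (BARs), the number of MSI(-X) interrupt vectors, and the addresses of the MSI-X table and PBA. The host simulator uses this information to expose a corresponding PCIe device to the system.

(2) Data transfers - MMIO and DMA:

PCIe data transfers are symmetrical: both sides can initiate reads and writes, which the other side completes.

The SimBricks PCIe interface defines DMA_READ/WRITE messages for DMA transfers initiated by device simulators, and MMIO_READ/WRITE messages for MMIO accesses initiated by host simulators.

As in physical PCIe, all data transfer operations are asynchronous.

Once a request is finished, the device simulator issues an MMIO_COMPL completion message, while the host simulator adapter sends a DMA_COMPL message.

PCIe allows multiple outstanding operations and only guarantees that they are issued to the memory system in order of arrival.

Completion events, however, may arrive out of order.

To match completions with outstanding requests, every request carries an identifier that the receiving simulator includes in its response.

(3) Interrupts:

The interface supports all PCIe interrupt signaling methods: legacy interrupts (INTX), message signaled interrupts (MSI), and MSI-X.

Physical PCIe devices implement MSI (including configuration, masking, and generating the signaling operations) entirely on the device side.

To reduce repeated implementation effort in device simulators and to ease integration into host simulators, SimBricks instead keeps this functionality inside the host simulator.

Devices send INTERRUPT messages either to trigger an interrupt vector for MSI(-X) or to set the interrupt pin state for INTX.

To support devices that need to know which interrupt mechanisms the OS has enabled, the interface provides the INT_STATUS message, which the host simulator sends on configuration changes.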

5.1.2 Ethernet: Network Component Interface

In SimBricks's network interface, we similarly abstract away low-level details of the Ethernet standard, and only expose Ethernet frames, as PACKET messages, to NIC and network simulators. A PACKET message carries the length of the packet alongside packet payload, but omits CRCs to reduce overhead as none of our network simulators models them and most NICs strip them after validation. If future network or NIC simulators require CRCs, their SimBricks adapter can transparently generate and strip the checksums, as we currently do not model data corruption. We leave support for hardware flow control as future work.

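In the SimBricks network interface, low-level details of the Ethernet standard are likewise abstracted away; NIC and network simulators only see Ethernet frames, carried as PACKET messages.

A PACKET message carries the packet length and payload, but omits the cyclic redundancy check (CRC) to reduce overhead, because none of the current network simulators models it and most NICs strip it after validation.

If future network or NIC simulators require CRCs, their SimBricks adapter can transparently generate and strip the checksums, since data corruption is currently not modeled.

Support for hardware flow control is left as future work.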
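The idea of a CRC-free PACKET message, with an adapter that adds and strips the checksum for a simulator that needs it, can be sketched as below. The function names and the dictionary layout are hypothetical. The sketch assumes that `zlib.crc32` computes the same CRC-32 as the Ethernet frame check sequence and that the FCS is appended least-significant byte first; treat both as assumptions to verify against the Ethernet standard before relying on them.

```python
import struct
import zlib


def make_packet(frame: bytes) -> dict:
    """PACKET message as described in the text: length plus payload, no CRC."""
    return {"type": "PACKET", "len": len(frame), "data": frame}


def add_fcs(frame: bytes) -> bytes:
    """Adapter side: append a 4-byte frame check sequence for a simulator that wants one."""
    return frame + struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF)


def strip_fcs(frame_with_fcs: bytes) -> bytes:
    """Adapter side: validate and remove the FCS before building a PACKET message."""
    frame, fcs = frame_with_fcs[:-4], frame_with_fcs[-4:]
    if struct.pack("<I", zlib.crc32(frame) & 0xFFFFFFFF) != fcs:
        raise ValueError("FCS mismatch")
    return frame


# Made-up frame contents: destination MAC, source MAC, EtherType, padding.
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"\x00" * 46
on_wire = add_fcs(frame)            # what a CRC-aware simulator would see
msg = make_packet(strip_fcs(on_wire))
```

Since no data corruption is modeled, the check in `strip_fcs` can never fail inside a simulation; it is kept only to show where validation would sit.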

5.2 Inter-Simulator Message Transport¶

SimBricks runs component simulators as separate processes communicating through message passing. Thus, efficient inter-process communication is critical for the overall performance. We use optimized shared memory queues with polling for efficient message transport between simulators. For parallel processes on separate cores, shared memory queues enable low-latency communication with minimal overhead [5, 7]. Between any pair of communicating simulators, SimBricks establishes a bidirectional message channel consisting of a pair of unidirectional queues in opposite directions. During channel initialization, SimBricks uses a Unix socket to provide a named endpoint for connection setup and for communicating queue parameters and shared memory file descriptors.

SimBricks uses concurrent, circular, single-producer, single-consumer queues. They comprise an array of fixed-size, cache-line-aligned message slots. The last byte in each slot is reserved for metadata: one bit indicating the current owner of the slot (consumer or producer) and the rest for the message type. As queues are single-producer and single-consumer, we store the tail pointer locally at the producer and the head pointer at the consumer. Consumers poll for a message in the next slot, until the ownership flag indicates consumer. After processing the message, the consumer resets the ownership flag. Producers similarly wait for the next slot to be available, fill it, and switch the ownership flag.

The SimBricks message transport design avoids cache coherence overhead unless it is fundamentally necessary. Since head and tail pointers are local to consumer and producer respectively, only accesses to shared message slots result in coherence traffic. Moreover, as long as a consumer does not poll in between the producer writing a message to the corresponding slot and setting the ownership bit, all coherence traffic carries necessary data from producer to consumer [5]. We include additional detail and pseudocode in §A.2.

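SimBricks runs component simulators as separate processes that communicate through message passing, so efficient inter-process communication is critical for overall performance.

Messages travel between simulators over optimized, polling-based shared memory queues.

For parallel processes running on separate cores, shared memory queues provide low-latency communication with minimal overhead.

Between any pair of communicating simulators, SimBricks establishes a bidirectional message channel made of two unidirectional queues in opposite directions.
During channel initialization, SimBricks uses a Unix socket to provide a named endpoint for connection setup and for passing queue parameters and shared memory file descriptors.

SimBricks uses concurrent, circular, single-producer, single-consumer queues. A queue is an array of fixed-size, cache-line-aligned message slots. The last byte of each slot is reserved for metadata: one bit marks the current owner of the slot (consumer or producer), and the remaining bits hold the message type. Because queues are single-producer and single-consumer, the tail pointer is stored locally at the producer and the head pointer locally at the consumer. The consumer polls the next slot until the ownership flag indicates consumer, processes the message, and then resets the flag. The producer likewise waits for the next slot to become available, fills it, and switches the ownership flag.

The message transport design avoids cache coherence overhead unless it is fundamentally necessary. Since head and tail pointers are local to the consumer and producer respectively, only accesses to shared message slots cause coherence traffic. Moreover, as long as the consumer does not poll between the producer writing a message and setting the ownership bit, all coherence traffic carries data that must move from producer to consumer anyway.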
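The slot layout and ownership handshake described above can be sketched in a few lines. This is a single-threaded illustration, not the SimBricks implementation: a `bytearray` stands in for the shared memory region, the 64-byte slot size and the choice of the top metadata bit as the ownership flag are assumptions, and a real implementation additionally needs memory barriers so that the payload becomes visible before the ownership flag.

```python
SLOT_SIZE = 64          # assumed cache-line size
OWNER_CONSUMER = 0x80   # assumed: top bit of the metadata byte marks the owner
TYPE_MASK = 0x7F        # remaining bits carry the message type


class SpscQueue:
    def __init__(self, nslots):
        self.nslots = nslots
        self.buf = bytearray(nslots * SLOT_SIZE)  # stand-in for shared memory


class Producer:
    def __init__(self, queue):
        self.q = queue
        self.tail = 0  # kept locally, never shared

    def try_send(self, msg_type, payload):
        assert 0 <= msg_type <= TYPE_MASK and len(payload) < SLOT_SIZE
        base = self.tail * SLOT_SIZE
        meta = base + SLOT_SIZE - 1
        if self.q.buf[meta] & OWNER_CONSUMER:
            return False  # slot not yet released by the consumer: queue full
        self.q.buf[base:meta] = payload.ljust(SLOT_SIZE - 1, b"\0")
        self.q.buf[meta] = OWNER_CONSUMER | msg_type  # publish last
        self.tail = (self.tail + 1) % self.q.nslots
        return True


class Consumer:
    def __init__(self, queue):
        self.q = queue
        self.head = 0  # kept locally, never shared

    def try_recv(self):
        base = self.head * SLOT_SIZE
        meta = base + SLOT_SIZE - 1
        m = self.q.buf[meta]
        if not m & OWNER_CONSUMER:
            return None  # nothing to read yet; caller keeps polling
        payload = bytes(self.q.buf[base:meta])
        self.q.buf[meta] = 0  # hand the slot back to the producer
        self.head = (self.head + 1) % self.q.nslots
        return m & TYPE_MASK, payload


q = SpscQueue(nslots=2)
producer, consumer = Producer(q), Consumer(q)
producer.try_send(1, b"hello")
msg_type, payload = consumer.try_recv()
```

Note that neither side ever reads the other side's pointer: the only shared state is the slot itself, which matches the coherence argument in the text.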

5.3 Scaling Up with Decomposition¶

SimBricks can scale to larger simulations by adding more component simulators. For instance, a network simulator connecting to many devices may become a bottleneck as it needs to synchronize with all peers. We leverage the SimBricks architecture to improve scalability, by decomposing the network simulator into multiple processes that connect and synchronize via SimBricks Ethernet interfaces. Other simulators, such as a gem5 simulated host, can be accelerated in a similar fashion by decomposing into connected components. We will demonstrate the scalability benefit of our decomposition approach in §7.4.

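SimBricks can scale to larger simulations by adding more component simulators.

For example, a network simulator connected to many devices may become a bottleneck because it has to synchronize with all of its peers.

The SimBricks architecture helps here: the network simulator is decomposed into multiple processes that connect and synchronize through SimBricks Ethernet interfaces.

Other simulators, such as a gem5 simulated host, can be accelerated in the same way by decomposing them into connected components.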

5.4 Scaling Out with Proxies¶

Running simulators in parallel on dedicated cores maximizes parallelism, but the number of available cores in a single machine limits simulation size. Message passing and modular simulation in SimBricks enable us to scale out simulations by partitioning components across multiple hosts and replacing message queues between simulators on different hosts with network communication. However, directly implementing this in individual component simulators has two major drawbacks. First, it increases the complexity of integration, as each simulator adapter needs to implement an additional message transport. Second, it increases communication overhead in component simulators, leaving fewer processor cycles for simulation and increasing simulation time. To avoid these drawbacks, we instead implement network communication in proxies. SimBricks proxies connect to local component simulators through existing shared memory queues and forward messages over the network to their peer proxy, which operates symmetrically. This requires an additional processor core for the proxy on each side, but is fully transparent to component simulators and does not increase their communication overhead.

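Running simulators in parallel on dedicated cores maximizes parallelism, but the number of cores available in a single machine limits simulation size.

Message passing and modularity allow SimBricks to scale out: components are partitioned across multiple hosts, and message queues between simulators on different hosts are replaced with network communication.

However, implementing this directly inside individual simulators has two major drawbacks:

- It increases integration complexity, because every adapter has to implement an additional message transport.
- It increases communication overhead in the simulators, which lowers simulation efficiency.

To avoid these problems, network communication is implemented in proxies.

A SimBricks proxy connects to local simulators through the existing shared memory queues and forwards messages to a peer proxy on the remote host, which operates symmetrically.

This costs one additional processor core per side, but it is fully transparent to component simulators and does not increase their communication overhead.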

5.5 Simulator Synchronization Mechanism¶

To ensure accurate interconnection of component simulators, we design a synchronization mechanism that guarantees correctness while minimizing overhead, even when scaling to large simulations.

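To interconnect component simulators accurately, SimBricks uses a synchronization mechanism that guarantees correctness while minimizing overhead, even in large simulations.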

5.5.1 Naive Synchronization Mechanisms do not Scale

A conceptual straw-man for synchronizing components is a global barrier at each time step, keeping simulators in lockstep. When components are connected by communication links with non-zero latency, the frequency of global barriers can be reduced by dividing simulation time into epochs no larger than the lowest link latency. Global barriers are only required at epoch boundaries, since all cross-component events will be delivered after the end of the current epoch [1, 42, 47]. Unfortunately, epoch-based synchronization still relies on non-scalable global barriers across all simulators, with the barrier frequency determined by the lowest link latency in the whole simulation, incurring substantial synchronization overhead.

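A conceptual straw-man for synchronizing components is a global barrier at every time step, which keeps all simulators in lockstep.

When components are connected by links with non-zero latency, the barrier frequency can be reduced by dividing simulation time into epochs no longer than the lowest link latency.

Global barriers are then only needed at epoch boundaries, because every cross-component event is delivered after the end of the current epoch.

Unfortunately, epoch-based synchronization still relies on non-scalable global barriers across all simulators, and the barrier frequency is dictated by the lowest link latency in the whole simulation, which causes substantial synchronization overhead.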
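A small back-of-the-envelope calculation illustrates why the lowest link latency dominates. The function name and all numbers below are made up for illustration; they are not measurements from the paper.

```python
def barrier_rounds(sim_time_ns, link_latencies_ns):
    """Number of epochs (and thus global barrier rounds) for a simulated duration.

    The epoch length is bounded by the lowest link latency in the simulation.
    """
    epoch_ns = min(link_latencies_ns)
    return -(-sim_time_ns // epoch_ns)  # ceiling division


ONE_SECOND_NS = 1_000_000_000

# Hypothetical topology: 99 links with 1 ms latency.
slow_only = barrier_rounds(ONE_SECOND_NS, [1_000_000] * 99)

# Same topology plus a single 500 ns link (for example a PCIe-like connection).
with_fast_link = barrier_rounds(ONE_SECOND_NS, [1_000_000] * 99 + [500])
```

Under these assumed numbers, adding one 500 ns link raises the barrier rounds for one simulated second from 1,000 to 2,000,000, and every simulator takes part in each of them, including those connected only by the slow links.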

5.5.2 Scalable synchronization in SimBricks

We avoid global synchronization while guaranteeing accurate simulator interconnection by relying on properties specific to the SimBricks architecture. Fig. 5 shows pseudocode for the SimBricks synchronization protocol.

Enforcing message processing times is sufficient. In SimBricks, all communication between simulators is explicit through message passing along statically created point-to-point channels. Thus, the only requirement for accurate simulation is that messages are processed at the correct time [10, 11]. Additional synchronization does not affect the simulation, as simulators cannot otherwise observe or influence each other. To enforce this guarantee, senders tag messages with the time when the receiver must process the message. For determinism, simulators with multiple peers must order messages with identical timestamps consistently.

Pairwise synchronization is sufficient. All SimBricks message passing channels are point-to-point and statically determined by the simulation structure. This is where we differ from most prior synchronization schemes: they do not assume a known topology and thus require global synchronization. SimBricks only needs to implement pairwise synchronization, between each simulator and its a priori known peers [10].

Per-channel message timestamps are monotonic. Our message queues deliver messages strictly in order. Since each SimBricks connection between two simulators incurs a propagation latency \(\Delta_i > 0\), a message sent at time \(T\) over interface \(i\) arrives at \(T + \Delta_i\). Assuming simulator clocks advance monotonically, message timestamps on each channel are thus monotonic.

Message timestamps ensure correctness. A corollary of monotonic timestamps is that a message with timestamp \(t\) is an implicit promise that no messages with timestamps \(< t\) will arrive on that channel later. Therefore, once a simulator receives messages with timestamps \(\ge T\) from all its peers, it can safely advance its clock to \(T\) without more coordination.

Ensuring liveness with sync messages. The above conditions ensure accuracy, but do not guarantee liveness. Simulations can only make progress when every channel carries at least one message in each direction in every \(\Delta_i\) time interval [10, 11]. To ensure progress, we introduce SYNC messages that simulators send if they have not sent any messages for \(\delta_i \le \Delta_i\) time units. SYNC messages allow connected peers to advance their clocks in the absence of data messages. In our simulations we set \(\delta_i = \Delta_i\); lower values of \(\delta_i\) are valid, but we have not found configurations where the benefit of more frequent clock advances outweighed the cost of sending and processing additional SYNC messages.

Link latency provides synchronization slack. Non-zero link latencies further reduce synchronization overhead, since not even peer simulators need to execute in lockstep. Specifically, a message sent at \(T\) allows its peer to advance to \(T + \Delta_i\). At that point, the peer's clock is guaranteed to lie between \(T - \Delta_i\) (otherwise the local clock would not be at \(T\)) and \(T + \Delta_i\). Different channels in a SimBricks configuration can use different \(\Delta_i\) values. While synchronized simulations are fundamentally only as fast as the slowest component, this slack improves efficiency by absorbing small transient variation in simulation speed, without immediately blocking all simulators.

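SimBricks avoids global synchronization, while still guaranteeing accurate simulator interconnection, by relying on properties specific to its architecture.

Enforcing message processing times is sufficient:

In SimBricks, all communication between simulators is explicit message passing along statically created point-to-point channels.

The only requirement for accurate simulation is therefore that messages are processed at the correct time.

Because simulators cannot otherwise observe or influence each other, additional synchronization does not affect the simulation result.

To enforce this guarantee, senders tag each message with the time at which the receiver must process it.

For determinism, simulators with multiple peers must order messages with identical timestamps consistently.

Pairwise synchronization is sufficient:

All SimBricks message passing channels are point-to-point and statically determined by the simulation structure.

This is where SimBricks differs from most prior synchronization schemes: they do not assume a known topology and therefore require global synchronization.

SimBricks only needs pairwise synchronization between each simulator and its a priori known peers.

Per-channel message timestamps are monotonic:

The message queues deliver messages strictly in order.

Since each connection between two simulators incurs a propagation latency \(\Delta_i > 0\), a message sent at time \(T\) over interface \(i\) arrives at \(T + \Delta_i\). Assuming simulator clocks advance monotonically, message timestamps on each channel are monotonic as well.

Message timestamps ensure correctness:

A corollary of monotonic timestamps is that a message with timestamp \(t\) is an implicit promise that no message with a timestamp \(< t\) will arrive on that channel later.

Therefore, once a simulator has received messages with timestamps \(\ge T\) from all of its peers, it can safely advance its clock to \(T\) without further coordination.

Ensuring liveness with SYNC messages:

The conditions above ensure accuracy, but they do not guarantee liveness.

A simulation can only make progress if every channel carries at least one message in each direction in every \(\Delta_i\) time interval.

To ensure this, SimBricks introduces SYNC messages, which a simulator sends if it has not sent any message for \(\delta_i \le \Delta_i\) time units.

SYNC messages allow connected peers to advance their clocks in the absence of data messages.

The simulations in the paper set \(\delta_i = \Delta_i\).

Link latency provides synchronization slack:

Non-zero link latencies further reduce synchronization overhead, since not even peer simulators need to execute in lockstep.

Specifically, a message sent at \(T\) allows its peer to advance to \(T + \Delta_i\).

This slack improves efficiency by absorbing small transient variations in simulation speed without immediately blocking all simulators.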
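The properties above can be combined into a small executable sketch of pairwise synchronization. This is not the paper's Fig. 5 pseudocode: the classes `Channel` and `Simulator`, the helpers `connect` and `run`, and the round-robin driver are all hypothetical, and in-process deques stand in for the shared memory queues. The sketch assumes \(\delta_i = \Delta_i\), bidirectional links, and an initial SYNC at time 0 so that peers can leave the start state.

```python
from collections import deque


class Channel:
    """One direction of a link: in-order queue plus propagation latency."""

    def __init__(self, latency):
        self.latency = latency
        self.queue = deque()  # entries: (timestamp, kind, payload)
        self.last_sent = 0    # sender clock at the most recent send


class Simulator:
    def __init__(self, name, events=()):
        self.name, self.clock = name, 0
        self.inc, self.out = [], []
        self.events = deque(sorted(events))  # local (time, payload) sends
        self.log = []                        # DATA messages processed

    def send(self, ch, kind, payload=None):
        # Tag the message with the time at which the receiver must process it.
        ch.queue.append((self.clock + ch.latency, kind, payload))
        ch.last_sent = self.clock

    def step(self):
        """Advance to the next safe point; return False if blocked on a peer."""
        if any(not ch.queue for ch in self.inc):
            return False
        bound = min(ch.queue[0][0] for ch in self.inc)  # promise from every peer
        sync_due = min(ch.last_sent + ch.latency for ch in self.out)
        local = self.events[0][0] if self.events else bound
        self.clock = min(bound, sync_due, local)
        while self.events and self.events[0][0] == self.clock:
            _, payload = self.events.popleft()
            for ch in self.out:
                self.send(ch, "DATA", payload)
        for ch in self.inc:  # fixed channel order keeps ties deterministic
            while ch.queue and ch.queue[0][0] == self.clock:
                ts, kind, payload = ch.queue.popleft()
                if kind == "DATA":
                    self.log.append((ts, payload))
        for ch in self.out:  # liveness: SYNC after latency time units of silence
            if self.clock - ch.last_sent >= ch.latency:
                self.send(ch, "SYNC")
        return True


def connect(a, b, latency):
    ab, ba = Channel(latency), Channel(latency)
    a.out.append(ab); b.inc.append(ab)
    b.out.append(ba); a.inc.append(ba)


def run(sims, until):
    """Round-robin driver; returns the largest clock difference observed."""
    for s in sims:
        for ch in s.out:
            s.send(ch, "SYNC")  # initial promise at time 0
    max_skew = 0
    while min(s.clock for s in sims) < until:
        progressed = False
        for s in sims:
            progressed |= s.step()
            clocks = [x.clock for x in sims]
            max_skew = max(max_skew, max(clocks) - min(clocks))
        if not progressed:
            raise RuntimeError("deadlock: no simulator could advance")
    return max_skew


a = Simulator("A", events=[(3, "ping")])
b = Simulator("B")
connect(a, b, latency=10)
skew = run([a, b], until=50)
```

In this run, B processes the message that A sent at time 3 exactly at time 13, and the two clocks never drift apart by more than the link latency of 10, which is the slack described above. No barrier is involved: each simulator only inspects the head of its own incoming channels.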

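Parallel & Distributed Simulation

These systems focus on scaling simulations across multiple processes or machines.

| Representative system / tool | Technique and mechanism | Difference from SimBricks |
| --- | --- | --- |
| dist-gem5 / pd-gem5 | Connect multiple gem5 instances for distributed simulation, synchronized with global barriers. | SimBricks uses a synchronization protocol that exploits the simulation structure instead of global barriers. |
| Graphite | Parallelizes multi-core simulation across cores and machines, but uses approximate synchronization, which can cause causality errors. | SimBricks guarantees accurate and deterministic simulation. |
| Simics | Supports full-system simulation, runs unmodified OSes, and simulates networked systems by connecting multiple processes. | SimBricks decouples and connects heterogeneous simulators through fixed interfaces (e.g. PCIe, Ethernet). |
| ns-3 (v3.8+) | Uses a conservative lookahead protocol with explicit synchronization, and relies on MPI to connect multiple processes. | SimBricks couples synchronization with optimized shared memory queues to minimize communication overhead. |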

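Multi-Simulator Co-Simulation

These tools aim to integrate different simulation models into one unified runtime environment.

| Representative system / tool | Technique and mechanism | Difference from SimBricks |
| --- | --- | --- |
| gem5 + SystemC | Links SystemC code into the gem5 binary and runs it inside the gem5 event loop. | SimBricks connects heterogeneous simulators with entirely different simulation models, without deep code embedding. |
| SST (Structural Simulation Toolkit) | Modular HPC framework using parallel discrete event simulation with global epoch synchronization. | SST requires simulators to be deeply integrated into a single loop and does not define fixed, standardized component interfaces. |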

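System Emulation & Modeling

These tools typically trade speed against modeling detail.

| Representative system / tool | Technique and mechanism | Difference from SimBricks |
| --- | --- | --- |
| Mininet | Emulates topologies with Linux container features, runs real applications, and uses the host kernel for protocol processing. | Cannot model low-level hardware details such as caches or PCIe interactions; SimBricks can. |
| ns-3 DCE | Integrates the Linux kernel as a library OS (libOS) into ns-3 network topologies. | Compared with SimBricks, it cannot accurately model low-level bottlenecks of physical systems. |
| Emulation on dedicated processors | Emulates NICs or switches on dedicated processors while the rest of the system runs natively. | SimBricks simulations take longer, but give full control over model details and use virtual time to adjust performance. |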
