
Co-opting Linux Processes for High-Performance Network Simulation

TLDR

First, a quick pass generated with gemini

This paper presents the design and implementation of Phantom. It aims to resolve the trade-off in existing network experimentation tools, which cannot achieve scalability, realism, and control all at once.

Background and Problem

The authors point out that existing network experimentation architectures all have clear shortcomings.

(1) Emulators (e.g., Mininet)

  • Can run real programs, but scale poorly
  • Suffer time distortion under high load, making experimental results unreliable

(2) Simulators (e.g., NS-3)

  • Scale well and offer strong control
  • Can only run abstract models of applications, lacking realism

(3) Hybrid architectures

  • Plugin-based designs (e.g., the old Shadow)

    • Good performance
    • Poor compatibility:
      • applications must be recompiled
      • static linking is not supported
    • High maintenance cost
  • Process-based designs (e.g., gRaIL)

    • Good compatibility: can run standard Linux processes
    • Very poor performance, mainly because they rely on the inefficient ptrace mechanism for system call interception and process control

Phantom's Solution

Phantom proposes a new hybrid architecture that balances high performance with high compatibility.

It is a discrete-event network simulator that directly executes unmodified Linux processes.

Its core technical innovations include:

  1. Dual Interposition Strategy
    • Primary strategy: LD_PRELOAD
      • Preload a shim library
      • Intercept dynamically linked functions (e.g., those in libc)
      • Lowest overhead, so it is the preferred path
    • Secondary strategy: seccomp
      • Use a secure computing (seccomp) filter
      • Catch every system call that LD_PRELOAD misses
        • e.g., statically linked calls
        • or system calls issued directly via assembly
      • Guarantees that no system call leaks to the real kernel
  2. Efficient inter-process communication (IPC)
    • Abandons ptrace
    • Uses the following mechanisms for process control and system call delivery:
      • shared memory
      • semaphores
    • This markedly reduces context-switch and kernel-interaction overhead!
  3. Memory management optimization
    • An inter-process memory manager
    • A memory-mapping mechanism reads and writes the target process's memory directly
    • Avoiding expensive data copies significantly improves performance
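The request/response handshake above can be sketched with a toy Python model: shared memory plus two semaphores stand in for Phantom's real (C/Rust) channel, and all names here are invented for illustration, not Phantom's actual API.

```python
from multiprocessing import Process, Semaphore, Array, Value

# Toy model of ptrace-free IPC: a "syscall" request crosses from a managed
# process to the controller via shared memory, gated by two semaphores.

def managed(req, resp, num, length, buf):
    payload = b"hello"
    num.value = 1                        # pretend syscall 1 == write
    length.value = len(payload)
    buf[: len(payload)] = list(payload)  # arguments live in shared memory
    req.release()                        # wake the controller
    resp.acquire()                       # block until the syscall is emulated

def controller(req, resp, num, length, buf):
    req.acquire()                        # wait for a request
    data = bytes(buf[: length.value])    # read args directly from shared
    assert num.value == 1                # memory: no word-by-word PTRACE_PEEK
    assert data == b"hello"
    num.value = length.value             # store the emulated return value
    resp.release()                       # resume the managed process

def demo():
    req, resp = Semaphore(0), Semaphore(0)
    num, length, buf = Value("l", 0), Value("l", 0), Array("B", 64)
    p = Process(target=managed, args=(req, resp, num, length, buf))
    p.start()
    controller(req, resp, num, length, buf)
    p.join()
    return num.value                     # emulated return value of "write"

if __name__ == "__main__":
    print("emulated write returned", demo())
```

Note how the controller never copies data through the kernel: both sides simply read and write the same shared buffer, which is the point of replacing ptrace.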

Phantom preserves high simulation speed and strong reproducibility while greatly improving the realism and practicality of experiments.

Why are emulators prone to time distortion?
  • Intuition:
    • It is like playing a graphically demanding video game: if your machine is underpowered (overloaded), the game stutters
    • One second passes in the game while two seconds pass in the real world
  • Meaning:
    • A network emulator (e.g., Mininet) tries to run experiments against real "wall-clock time"
    • For example, it tries to make 1 second in the emulated network equal 1 second of real time
  • Why it happens: when the emulated network is large (thousands of nodes) or the traffic is heavy, the physical host's CPU cannot keep up (overload)
  • Consequence:
    • Packet processing in the emulated network falls behind the passage of real time
    • The experimental results become untrustworthy!
    • Packets that should have been delivered get dropped or delayed because the host CPU is too busy, not because the simulated network is congested, but because the machine running the emulation is overloaded
  • Phantom's advantage:
    • Phantom is a simulator and uses "virtual time"
    • If the CPU falls behind, it simply "pauses" the virtual clock and moves the hands forward only after the computation finishes
    • So no matter how high the load, there is no time distortion; the simulation just takes longer to run
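The "pause the virtual clock" idea fits in a few lines of Python. This is a toy discrete-event loop, not Phantom's code; the artificial sleep stands in for a slow host.

```python
import heapq, time

# A toy discrete-event loop with a *virtual* clock: simulated time is just a
# number attached to events, so host slowness can never distort it.
def run(events):
    """events: list of (virtual_time, name). Returns execution order."""
    heapq.heapify(events)
    clock, order = 0, []
    while events:
        t, name = heapq.heappop(events)
        clock = t            # jump the virtual clock; no wall-clock sleep
        time.sleep(0.01)     # pretend the host is slow: harmless to results
        order.append((clock, name))
    return order

# 10 ms of real compute per event, yet virtual timestamps stay exact:
print(run([(3, "recv"), (1, "send"), (2, "timer")]))
# -> [(1, 'send'), (2, 'timer'), (3, 'recv')]
```

However long `time.sleep` takes, the virtual timestamps come out identical on every run, which is exactly why simulators are immune to overload.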
Q&A: what do LD_PRELOAD, the shim, and seccomp each do?

(1) What is ptrace, and how does it intercept?

  • Intuition: ptrace is like an extremely strict supervisor. Every time the monitored program makes a move (issues a system call), it must stop and report; only after the supervisor inspects and approves may the program continue
  • What it is: a Linux mechanism that lets one process (e.g., the debugger GDB) observe and control the execution of another
  • How it intercepts:
    • Older simulator designs (e.g., gRaIL) use ptrace to monitor the application
    • Whenever the application touches the network (e.g., calls send), the OS pauses it and wakes the simulator process
    • The simulator inspects what the program wants to do, alters its behavior (e.g., routes the data into the simulated network instead of a real NIC), then lets the program continue
  • Why it is bad: the mechanism is very slow.
    • Frequent context switches: every system call bounces between the application and the simulator at least 4 times
    • Slow data copies: reading the application's memory with ptrace (e.g., the contents of a packet to send) moves one word at a time, which is extremely inefficient. This leads to a performance penalty of more than 10×

(2) What are LD_PRELOAD, the shim, and seccomp?

To avoid the ptrace performance trap described above, Phantom uses this combination of techniques.

A. LD_PRELOAD (the substitution entry point)

  • What it is: a Linux environment variable that tells the OS: "before loading any other library, load the one I specify"
  • Role:
    • Phantom uses it to get its own code loaded first
    • For example, when the application calls the standard send function, it actually invokes Phantom's "fake" send, because Phantom's library was loaded first, instead of the system's original function
    • This is a very efficient hijacking technique

B. Shim (the interposition layer)

  • What it is: a custom library injected into the application via LD_PRELOAD
  • Role: it sits wedged between the application and the operating system like a shim
    • The application believes it is talking to the Linux kernel
    • In reality the shim is answering the phone. If the application asks "what time is it?", the shim intercepts the question and replies "the simulated time is 12:00" rather than consulting the real CPU clock
    • The shim forwards intercepted requests to Phantom's controller over an efficient shared-memory channel

C. Seccomp (the safety net)

  • What it is: short for Secure Computing Mode, a Linux kernel security feature that restricts which system calls a process may issue
  • Why it is needed: LD_PRELOAD is fast but not complete (statically linked programs, or calls issued directly via assembly instructions, bypass LD_PRELOAD)
  • How Phantom uses it:
    • Phantom installs a seccomp filter as a backstop
    • Anything that slips past the shim and tries to reach the real kernel is immediately caught (via a trap) and forcibly handed to Phantom
    • This guarantees the simulation's correctness and prevents real requests from leaking
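Putting the three pieces together, here is a toy Python model of the dual strategy. All function names are invented for illustration; the point is the routing: preloaded symbols take the fast path, everything else is trapped, and nothing ever reaches the real kernel.

```python
# Toy model of dual interposition: preloaded wrappers catch dynamically
# linked calls; a seccomp-like trap catches everything else.
PRELOADED = {"send", "recv", "gettimeofday"}   # shim overrides these symbols
handled = []                                   # calls the "simulator" served

def emulate(call):
    handled.append(call)
    return f"emulated:{call}"

def dynamic_call(call):
    # App calls a libc wrapper by symbol name; the shim's symbol shadows
    # libc's, so this is just a cheap same-process function call.
    if call in PRELOADED:
        return emulate(call)
    return raw_syscall(call)       # not preloadable: falls through

def raw_syscall(call):
    # A direct `syscall` instruction. The seccomp filter traps every
    # syscall not originating from the shim itself.
    return sigsys_handler(call)

def sigsys_handler(call):
    # Slow path: extra mode transitions, but nothing leaks to the kernel.
    return emulate(call)

assert dynamic_call("send") == "emulated:send"      # via preloading
assert raw_syscall("openat") == "emulated:openat"   # via the seccomp trap
assert handled == ["send", "openat"]                # nothing leaked
```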

Abstract

Network experimentation tools are vitally important to the process of developing, evaluating, and testing distributed systems. The state-of-the-art simulation tools are either prohibitively inefficient at large scales or are limited by nontrivial architectural challenges, inhibiting their widespread adoption. In this paper, we present the design and implementation of Phantom, a novel tool for conducting distributed system experiments. In Phantom, a discrete-event network simulator directly executes unmodified applications as Linux processes and innovatively synthesizes efficient process control, system call interposition, and data transfer methods to co-opt the processes into the simulation environment. Our evaluation demonstrates that Phantom is up to 2.2× faster than Shadow, up to 3.4× faster than NS-3, and up to 43× faster than gRaIL in large P2P benchmarks while offering performance comparable to Shadow in large Tor network simulations.


Introduction

Network experimentation tools promote the progression of network science: they aim to realistically reproduce the effects of distributed networks at scale in a controlled environment, enabling the scientific evaluation of performance and security across a range of system characteristics. Experimentation tools are particularly useful for large-scale distributed systems that are deployed in the real world, such as the globally expansive domain name system [50], peer-to-peer and content distribution networks [14], decentralized data-storage networks [52], and overlay networks [17]. Due to the sizes of these deployments and the internet's great heterogeneity and rapid change [19], it would be extremely difficult to run scientifically controlled, replicable experiments with them in the real world. Tools that enable realistic, scalable, and controlled experimentation of large-scale distributed systems can help accelerate research, development, and education.

Large-scale distributed systems are often characterized by a complex set of algorithms and protocols that run in application-layer software. Previous work has found that it is prudent to directly execute this software as part of the experimentation process to promote realism [30, 54, 62]. However, there are nontrivial architectural challenges in designing tools that meet the scalability and realism requirements. Emulators such as Mininet [45] do not support large-scale systems because they are vulnerable to time distortion during periods of overload [44]. Simulators such as NS-3 [26] run application abstractions in place of real software which can cause unrealistic behavior and lead to invalid results [54].

To meet the large-scale distributed system requirements, the state-of-the-art tools are designed with hybrid architectures wherein a network simulator directly executes application code. However, tools that load and execute applications in plugin namespaces (i.e., NS-3-DCE [62] and Shadow [30]) suffer from compatibility and correctness issues and high maintenance costs: applications must be recompiled as plugins, complex code is required to load and run them, and the system calls they make often leak outside of the simulation. On the other hand, tools that run applications as Linux processes (i.e., gRaIL [54]) incur considerable inter-process overhead: we have measured at least a 10× performance penalty in running gRaIL due to inefficient process control, system call interposition, and data transfer mechanisms. No existing network simulator simultaneously overcomes the compatibility, correctness, maintenance, and performance challenges found in the state-of-the-art tools.

Introducing Phantom: We present Phantom, a novel, multiprocess network simulator that: (i) precludes the compatibility, correctness, and maintenance issues that have plagued plugin-based designs; and (ii) overcomes the performance challenges of existing multi-process designs by innovatively synthesizing efficient process control, system call interposition, and data transfer mechanisms. In Phantom, a discrete-event network simulation core directly executes unmodified applications as Linux processes, allowing us to take advantage of native Linux process isolation and management. Phantom co-opts the Linux processes into a simulation environment by (i) preloading a shim library (via LD_PRELOAD) that is used to establish efficient mechanisms for process control and function interception; (ii) installing a secure computing (i.e., seccomp) filter in the processes to guarantee interposition on system calls that are not preloadable; and (iii) using a novel inter-process memory mapper that allows us to directly read and write process memory without incurring inter-process communication (IPC) overhead. Once the processes are co-opted, Phantom efficiently emulates system calls they make and facilitates communication over a simulated network.

Novel Contributions: This paper makes the following novel contributions to the state of the art in network simulation:

– The innovative design of Phantom, which for the first time shows how to minimize inter-process overhead in a hybrid, multi-process network simulator.

– A high-performance implementation of Phantom.

– An extensive evaluation of Phantom through which we find that it is up to 2.2× faster than Shadow, up to 3.4× faster than NS-3, and up to 43× faster than gRaIL in large P2P benchmarks while offering performance comparable to Shadow in large Tor network simulations.

– A verification of Phantom's accuracy in small LAN and WAN networks and in large Tor overlay networks.

Impact: This work has high potential for broad impact across multiple communities for the purposes of research, development, and education. First, researchers building software prototypes can use Phantom to quickly evaluate their new distributed system designs in a large-scale network without needing to worry about complicated deployments that are difficult to manage. Second, Phantom can be built into developers' testing frameworks so that new code can be continuously tested and discovered bugs can be identically reproduced. Third, with facilities to introduce network events (e.g., intermittent delays or failures), Phantom could help teach network and distributed systems courses. The Tor Project has already started using Phantom to develop and test new congestion control protocols before deploying them to the Tor network [57].

Availability: Phantom is merged into the open-source Shadow project as of v2 [4] and our artifacts are publicly available [3].

Research Background and Pain Points

Network experimentation tools are vital for evaluating distributed systems; they aim to provide large-scale, realistic, and controlled experimental environments.

However, existing tools have clear shortcomings:

  1. Emulators (e.g., Mininet) are prone to time distortion under large-scale load
  2. Simulators (e.g., NS-3) run abstract models instead of real code, which distorts results

Limitations of Existing Hybrid Architectures

Current hybrid tools either adopt a plugin model (e.g., Shadow), which suffers from poor compatibility, high maintenance costs, and correctness problems,

or a process model (e.g., gRaIL), whose inefficient process control and data transfer mechanisms impose severe performance overhead (at least a 10× penalty).

Phantom's Design

This paper presents Phantom, a novel multi-process network simulator that runs unmodified applications directly as Linux processes, leveraging native Linux isolation and management facilities.

Phantom co-opts processes into the simulation through three core mechanisms:

  1. Preloading a shim library to establish control mechanisms
  2. Installing a seccomp filter to intercept system calls that cannot be preloaded
  3. Using a novel inter-process memory mapper to read and write process memory directly without IPC overhead

Main Contributions and Evaluation

Phantom shows for the first time how to minimize inter-process overhead in a hybrid, multi-process simulator.

The evaluation shows that in large P2P benchmarks Phantom is up to 2.2× faster than Shadow, up to 3.4× faster than NS-3, and up to 43× faster than gRaIL, while matching Shadow's performance and accuracy in large Tor network simulations.

Phantom greatly lowers the barrier to large-scale distributed system experiments, serving research prototyping, development testing (bug reproduction), and education. The Tor Project has already begun using it for protocol testing, and Phantom has been merged into the open-source Shadow project as of v2.

Background and Motivation

We motivate the need for Phantom by identifying the key requirements, existing architectures, and challenges for realistically simulating large-scale distributed systems. (See Appendix A for extended background on related tools.)

2.1 Requirements

Scalability: Recent work finds that it is imperative to run network experiments as close as possible to the deployed scale because reducing the scale can lead to a significant loss of confidence in the experimental results [40]. Although some statistical confidence can be recovered with repeated trials, it can take many more trials at a smaller scale to achieve the same confidence as larger scale simulations [40].

To increase the scale at which we can run network experiments, a correct and valid execution of the simulation workload should not depend on the computational abilities of, or passage of time on, the host machine. Decoupling the simulation from time and computational constraints allows us to scale without introducing artifacts in the results due to over-provisioning and time-distortion [44].

Realism: Distributed systems are often composed of a diverse set of applications that each contain complex logic. We should directly execute these applications in order to guarantee that our experiments identically replicate their logic and obtain the highest application fidelity possible [30, 54, 62].

Deployed system software is often under active development to fix bugs and develop enhancements. We should execute applications the same way they would be executed in deployment; we should not require recompilation or the maintenance of application patches or abstractions. Running unmodified applications enables us to decouple the application logic and programming language from that of the simulation.

Control: Large-scale distributed systems contain many variables, and changing any one of them can have cascading network effects that can lead to unexpected behaviors or results. We should support deterministic execution to obtain scientific control and to guarantee that the results produced by an experiment can be independently and identically replicated.

Three Core Requirements for Network Experiments

  • Scalability:
    • Experiments must run as close as possible to deployment scale to keep results trustworthy
    • The simulation should be decoupled from the host machine's compute capacity and passage of time, avoiding artifacts such as "time distortion" caused by resource limits
  • Realism:
    • Distributed system logic is complex; real, unmodified applications should be executed directly rather than abstract models, ensuring high fidelity without recompilation or patch maintenance
  • Control:
    • Deterministic execution is required: small changes to any of the many variables can cascade into unexpected behavior, and determinism prevents such problems
    • It provides scientific control and lets results be independently and identically replicated

2.2 Traditional Architectures

Tools implementing strictly traditional architectures are unsuitable for evaluating large-scale distributed systems with logic primarily contained in application-layer software.

Simulation: Network simulators such as NS-3 [26] scale independently of the wall-clock time [67] and offer precise experimental control due to deterministic execution [13]. However, simulators traditionally run application abstractions in place of real software which can cause unrealistic behavior and lead to invalid results [54]. As a result, traditional simulators do not fulfill the application realism requirement.

Emulation: Network emulators such as Mininet [45] directly execute applications using real kernel network stacks and therefore offer better application realism. However, emulators lack perfect scientific control due to non-determinism [12]. Moreover, emulators are generally unable to scale independently of computational constraints: if the experiment host machine is overloaded, time distortion will exacerbate reproducibility issues [44]. We confirm this claim with an experiment in which we find that as the host machine becomes more loaded with virtual peers, its packet forwarding capacity is limited and a decreasing fraction of the sent packets are correctly forwarded (see §5.4 and Figure 14 for details). As a result, traditional emulators are useful only at small scales.

Limitations of Traditional Architectures

  • Simulation:
    • e.g., NS-3
    • Pros: good scalability and deterministic control
    • Cons: replaces real software with application abstractions, sacrificing realism and potentially producing invalid results
  • Emulation:
    • e.g., Mininet
    • Pros: runs real kernel network stacks, so realism is better
    • Cons: lacks deterministic control and cannot scale independently of compute constraints (an overloaded host causes time distortion and packet loss), so it is only useful at small scales

2.3 Hybrid Architectures and Challenges

A hybrid architecture is characterized by the ability to directly execute applications to promote realism while still running them in the context of a cohesive network simulation. As a result, a hybrid architecture enjoys the advantages of both emulation and simulation and offers the best opportunity to fulfill the scalability, realism, and control requirements discussed in §2.1 (see Table 1). However, there are numerous challenges with hybrid architectures that we believe have inhibited tools implementing them from achieving widespread adoption. We describe these challenges by the method for executing applications: plugin namespaces and processes.

Plugin Namespaces: In this approach, the simulator loads each application into a new plugin namespace (e.g., using dlmopen) and directly executes the application in the context of that namespace while using function interposition (via LD_PRELOAD) to hook the loaded applications into the simulation environment. A plugin design is implemented in both NS-3-DCE [62] and Shadow [30] and has several limitations:

– Compatibility: The domain of supported applications is limited to those that are compiled as position-independent libraries (PIC) or executables (PIE) that export their symbols to the dynamic symbol table (rdynamic), are dynamically linked to libc, and make all system calls through libc. Rebuilding is tedious and impossible if the source code is not available (e.g., closed-source software or malware).

– Correctness: Relying solely on preloading is unreliable because only dynamically linked functions (e.g., those in libc) can be intercepted using LD_PRELOAD; system calls invoked via statically linked code or assembly instructions will leak outside of the simulation and cause errors.

– Maintainability: A custom dynamic loader [63] is required to load more than 16 namespaces at once, and a portable threading library [48] is used to support multi-threaded applications (these account for 62k LoC in Shadow; see §4). libc functions with nontrivial functionality must be reimplemented in order to intercept the system calls they make.

These challenges have limited Shadow's use to Tor network simulation [40] while work on simulating Bitcoin has been abandoned [48] and work on NS-3-DCE has mostly stalled.

Processes: In this approach, applications are executed as standard Linux processes and hooked into the simulation through the system call interface using standard kernel facilities. This design overcomes many of the limitations of the plugin approach: (i) the simulator can execute any existing application without rebuilding it; (ii) kernel subsystems guarantee reliable process isolation and correct system call interception; and (iii) the maintenance of a custom loader, threading libraries, and reimplemented libc functions is no longer required. However, the naïve way of connecting multiple processes in a cohesive simulation as demonstrated in gRaIL [54] requires the kernel's process control (ptrace) subsystem and is significantly less performant than the plugin approach: we show in §5.4 that the run time of gRaIL (which extends NS-3) is 13× that of NS-3 alone, and 43× that of Phantom in experiments with fixed P2P messaging workloads. Worse performance in gRaIL's multi-process design can be attributed to:

– Process control: The simulator needs to control the execution state of the processes as they progress through simulated time. The ptrace process control mechanism (PTRACE_ATTACH or PTRACE_TRACEME) incurs overhead that is quadratic in the total number of attached processes, limiting scalability (see Appendix B.1).

– System call interposition: The simulator needs to intercept system calls made in the processes so they can be emulated. The ptrace system call mechanism (PTRACE_SYSCALL) requires at least 4 context switches for every system call, contributing substantial overhead relative to a same-process function call (see Appendix B.2).

– Data transfer: The simulator needs to access system call arguments referencing process memory (e.g., data buffers). The ptrace memory access mechanism (PTRACE_PEEK and PTRACE_POKE) requires an additional system call and mode transition for each word of memory, making it inefficient for large structs and buffers (see Appendix B.3).

Ideally, we want a simulator with the higher performance of the uni-process, plugin-based Shadow design (which does not incur inter-process overhead) and the improved compatibility, correctness, and maintainability of the multi-process gRaIL design. However, it was previously unknown if this ideal is attainable due to the multi-process challenges; indeed, we show throughout §5 that even a more efficient use of ptrace (see Appendix B) is still less performant than a uni-process design.

Hybrid Architectures and Challenges

Hybrid architectures try to combine direct application execution with the advantages of network simulation, but face the following challenges:

  • Plugin namespaces: e.g., Shadow and NS-3-DCE.

    • Limitations:
      • Poor compatibility (only position-independent PIC/PIE code, dynamically linked against libc)
      • Unreliable correctness (system calls from statically linked code or assembly leak out)
      • High maintenance cost (a custom loader and threading library are required)
  • Processes: e.g., gRaIL.

    • Advantage: solves the plugin approach's compatibility and correctness problems, relying on kernel mechanisms for isolation
    • Performance bottleneck:
      • The traditional ptrace-based implementation is extremely slow (43× slower than Phantom)
      • Main causes: ptrace process-control overhead grows quadratically with the number of processes; system call interception involves frequent context switches (at least 4 per call); data transfer is inefficient (word-by-word memory access)

Goal: the ideal simulator needs both:

  1. The high performance of the uni-process plugin architecture (no inter-process overhead)
  2. The compatibility and correctness of the multi-process architecture

But this was previously thought unattainable!

Design

In this section we describe the novel multi-process Phantom design that eliminates the limitations of the state-of-the-art plugin-based architecture and overcomes the performance challenges of the state-of-the-art process-based simulator.


3.1 Overview

The main component in Phantom is a discrete-event simulator which drives the simulation (see Figure 1). After initialization, the simulator directly executes the real applications of an experiment as Linux processes while using interprocess communication channels (IPC) between the application and simulator processes. Phantom co-opts the application processes into the simulation by intercepting all system calls they make (e.g., socket, listen, connect, send, recv, poll, etc.) rather than allowing them to be handled by the Linux kernel. Phantom handles intercepted system calls by internally simulating common kernel functionalities that most applications expect to be available, such as networking facilities (e.g., buffers, protocols, and interfaces), event notification facilities (e.g., select, poll, and epoll), and file descriptor facilities (e.g., files, sockets, and pipes). As a result, Phantom emulates a Linux kernel to the applications while connecting them through a virtual, simulated network, and the applications need not be aware that they are running in a simulation.



In plain terms

Phantom provides a "virtual Linux kernel":

  1. Every syscall the user program issues is intercepted and handled by this virtual kernel instead of being sent to the real Linux kernel
  2. The virtual kernel simulates the real kernel's common facilities (networking, event notification, file descriptors, ...)

3.2 Components

3.2.1 Simulation Controller Process

Phantom is a parallel, conservative-time, discrete-event network simulator that emulates a Linux kernel to the applications it executes. Simulations are driven by a single controller process which has two primary functions that occur successively during an initialization phase and an execution phase.

Initialization Phase: During initialization, the controller reads and processes configuration inputs. The inputs specify a number of virtual hosts that should be simulated, a network graph model that should be used to model network characteristics such as routing, latency, and packet loss between the virtual hosts, and the file paths and arguments needed to directly execute the applications on the virtual hosts. The controller initializes internal simulation state accordingly.

Execution Phase: Simulation work is organized into events that each occur at a discrete simulation time. Each event is assigned to a virtual host and stored in a host-specific event queue: a min-heap that sorts events by their simulation time.

The controller manages the global simulation clock and synchronizes simulation time by using time barriers to establish discrete execution rounds: time intervals during which events may be safely executed in parallel. The time barrier in a round is set such that no event that is executed for any host in that round will enqueue a new event for any other host in the same round. This conservative-time algorithm guarantees that simulation time always advances on each host, even when concurrently executing distinct hosts' events. When the next event time in every host's event queue exceeds the time barrier for the current round, the controller updates the global clock and advances the execution round.


In plain terms: time barriers and the conservative-time algorithm

This resolves a core tension: we want many virtual hosts to run "in parallel" (to exploit multi-core CPUs) without scrambling the temporal order of events.

(1) What goes wrong without it:

  • Current simulated time is 10:00
  • Host A (running fast) races ahead to 10:05
  • Host B (running slow) is still processing events at 10:01

Problem: at 10:01, host B sends A a packet with 2 seconds of network latency, so it should arrive at A at 10:03

Bug: but A is already at 10:05! It has missed the 10:03 packet

This is the error caused by "time travel"!

(2) The solution: time barriers + the conservative-time algorithm

  1. Set a safety margin (the barrier): the coordinator knows that the fastest any data can travel in this network is 5 ms, i.e., the minimum propagation delay
  2. Define a round:
    • Current time: 0 ms
    • The coordinator announces: "Listen up! Since network latency is at least 5 ms, a message sent now (at 0 ms) cannot arrive before 5 ms. So between 0 ms and 5 ms, everyone runs independently; it is absolutely safe, and nobody can interrupt anyone else!"
    • That 5 ms is the time barrier
  3. Parallel execution: all CPU cores then process every host's events in the 0-5 ms window. Everyone runs in parallel, at full speed
  4. Synchronize: once every host has finished its events before 5 ms, everyone stops, and the coordinator advances the global clock to 5 ms
  5. Next round: the coordinator announces: "OK, the next safe window is 5 ms to 10 ms, keep going!"
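The round logic above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: events never enqueue new work, and `MIN_DELAY` stands in for the network's minimum propagation delay.

```python
import heapq

MIN_DELAY = 5  # ms; minimum propagation delay = width of a safe round

def run_rounds(host_queues):
    """host_queues: {host: [(time, event), ...]}. Executes events in
    barrier-aligned rounds; returns [(round_start, host, time, event)]."""
    for q in host_queues.values():
        heapq.heapify(q)                  # per-host min-heap by sim time
    clock, log = 0, []
    while any(host_queues.values()):
        barrier = clock + MIN_DELAY
        # Inside one round every host may run in parallel: an event before
        # the barrier cannot affect another host within the same round.
        for host, q in host_queues.items():
            while q and q[0][0] < barrier:
                t, ev = heapq.heappop(q)
                log.append((clock, host, t, ev))
        clock = barrier                   # all hosts drained: advance clock
    return log

log = run_rounds({"A": [(1, "send")], "B": [(3, "recv"), (7, "timer")]})
print(log)
# -> [(0, 'A', 1, 'send'), (0, 'B', 3, 'recv'), (5, 'B', 7, 'timer')]
```

In the real algorithm a host's event may enqueue new events for other hosts, but only at times at or beyond the barrier, which is exactly why the round is safe to parallelize.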

3.2.2 Parallel Worker Threads

Phantom concurrently executes the events in each execution round using worker threads (workers) that are managed with high level abstractions we call logical processors (LPs). Phantom allows a configurable number of LPs and controls the state of an independently configurable number of workers such that only a number of workers equal to the number of LPs are concurrently active.

The following algorithm employs a work stealing [10, 65] strategy to schedule the worker threads, ensuring that each LP will always be running a worker thread as long as one with remaining work exists. When an execution round begins, one worker thread starts running for each LP while the remaining workers remain waiting. While running, a worker dequeues and executes all events that occur within the current round (as set by the controller) for all hosts assigned to it. When a worker completes all outstanding events for the current round, it: (i) starts running another waiting worker that has yet to run in this round (if any exist); and (ii) starts waiting to be run again during the following round. An execution round ends when all workers have entered the waiting state.

Phantom executes the events of each round concurrently using worker threads, managed through a high-level abstraction called logical processors (LPs).

The number of LPs is configurable, and an independently configurable number of workers is controlled so that the number of concurrently active workers never exceeds the number of LPs, preventing performance degradation caused by CPU oversubscription.

The algorithm uses a work-stealing strategy to schedule the workers, ensuring that each LP is always running a worker thread as long as one with remaining work exists.

When a round begins, one worker starts running per LP while the remaining workers wait.

While running, a worker dequeues and executes all events that fall within the current round (as set by the controller) for all hosts assigned to it.

When a worker finishes all outstanding events for the round, it:

  1. Starts another waiting worker that has not yet run in this round (if any exists)
  2. Begins waiting to run again in the following round. The round ends when all workers have entered the waiting state.

For background, Stanford CS149: Parallel Computing is worth revisiting.

3.2.3 Direct Application Execution

During initialization, each virtual host is configured to directly execute some number of applications. Phantom internally creates virtual process and thread data structures to store the state needed to manage the execution of the applications (e.g., file descriptor tables and standard input/output handles).

Managed Processes and Threads: Phantom directly executes specified application binaries and allows for configuration of the command-line arguments and the start time within the simulation. Each application is launched by a Phantom worker with a vfork+execvpe sequence.

The application execution procedure results in the creation of one or more Linux processes and threads that are managed by their parent Phantom worker. Each worker (i) uses our preload shim library to co-opt their managed processes into the simulation, and (ii) uses our inter-process communication mechanisms to modulate the running state of the managed processes such that only one of a worker and its managed processes are running at any time (thus maintaining that only one task per LP is concurrently active).

Preload Shim: In order to assist with controlling the managed processes and threads, we create a custom shared library, subsequently referred to as "the shim", which is loaded into each managed process's address space using the LD_PRELOAD environment variable. We use the shim to: (i) execute initialization code in the shim's constructor functions and establish an inter-process communication channel (see §3.2.6); and (ii) intercept functions defined in libraries that are dynamically linked to the applications (e.g., libc; see §3.2.4).
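A minimal sketch of the launch step: all the worker has to do is place LD_PRELOAD into the child's environment before exec. Python's subprocess stands in for the vfork+execvpe sequence here, and `./shim.so` is a placeholder path, not Phantom's real layout.

```python
import os
import subprocess
import sys

# Copy the parent environment and ask the dynamic loader to map the shim
# first in every child we spawn.
env = dict(os.environ)
env["LD_PRELOAD"] = "./shim.so"   # placeholder path to the shim library

# Stand-in for vfork+execvpe of the real application binary: the child just
# proves that the variable survived into its environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('LD_PRELOAD'))"],
    env=env, capture_output=True, text=True,
)
print(child.stdout.strip())  # -> ./shim.so
```

If `./shim.so` actually existed, the loader would map it before libc and the shim's constructor would run before the application's `main`, which is where Phantom sets up its IPC channel.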


3.2.4 System Call Interposition

Phantom co-opts processes into the simulation by intercepting functions at the system call interface using two interception strategies: preloading and seccomp (see Figure 2).

Primary Strategy: Preloading: Recall from §3.2.3 that Phantom preloads a shared library shim into each process it executes using the LD_PRELOAD environment variable. Because the shim is preloaded, the dynamic loader loads the shim before all other shared objects linked to the managed process and the shim is the first library searched when attempting to dynamically resolve symbols. This feature allows us to selectively override functions in other shared libraries by supplying identically named functions with alternative implementations inside the shim. Preloading is efficient, as it changes only the address of the instruction that is next executed when a dynamically-linked function is invoked. Therefore, we use preloading as our primary interception strategy.

Notice that preloading works by intercepting shared library functions, not system calls. While preloading can interpose dynamically linked calls to libc system call wrapper functions made from outside of libc, it cannot interpose the statically linked calls made from inside of libc (e.g., internal calls from printf to write). If using preloading alone, we would need to reimplement printf and any other libc functionality we wanted to support and not just the system call wrappers, an untenable engineering burden. Preloading alone would also fail to intercept system calls made without using libc at all, e.g., those made by directly using a syscall instruction.

Secondary Strategy: seccomp: Phantom intercepts system calls that are not handled by the preloading strategy using the kernel's seccomp (secure computing) facility. The seccomp facility enables a process to set a filter on the system calls that are made by the process and to associate an action with the filter. We install a seccomp filter that traps all system calls except for: (i) sigreturn; and (ii) system calls originating from Phantom's own preloaded shim. We install a SIGSYS signal handler for system calls trapped by the seccomp filter; whenever a system call matching the filter is invoked, the kernel traps it and instead calls our signal handler function.

We use seccomp as our secondary interception strategy because, although it can intercept all system calls, it is less efficient than preloading; it requires: (i) a mode transition from the process to the kernel when the system call is invoked; (ii) execution of the seccomp filter; and (iii) a mode transition back to the process to invoke the shim callback function. Because most system calls are preloadable, we infrequently incur the additional overhead from seccomp in practice.

Phantom co-opts processes into the simulation by intercepting functions at the system call interface, using two interception strategies: preload and seccomp

[Figure 2]

Primary strategy: preloading:

Recall from §3.2.3 that Phantom uses the LD_PRELOAD environment variable to preload a shared-library shim into every process it executes

Because the shim is preloaded, the dynamic loader loads it before all the other shared objects linked into the "managed process", and the shim is the first library searched when symbols are resolved dynamically

This feature lets us selectively override functions in other shared libraries by supplying identically named functions, with alternative implementations, inside the shim. A neat hack!

Preloading is efficient, because it only changes the address of the next instruction executed when a dynamically linked function is invoked

It is therefore the primary interception strategy!

Note that preloading works by intercepting shared-library functions, not system calls:

Preloading can interpose dynamically linked calls made from outside libc to libc's system call wrappers, but it cannot interpose statically linked calls made inside libc (e.g., the internal call from printf to write)

With preloading alone, we would have to reimplement printf and every other piece of libc functionality we wanted to support, not just the system call wrappers. That would be an untenable engineering burden.

Preloading alone also fails to intercept system calls that bypass libc entirely, e.g., those issued directly with a syscall instruction

So a fallback is needed as well, which is the seccomp strategy described next

Secondary strategy: seccomp:

Phantom uses the kernel's seccomp facility to intercept the system calls that are not handled by the preloading strategy

seccomp lets a process set a filter on the system calls the process makes, and associate an action with that filter

We install a seccomp filter that traps all system calls except:

(i) sigreturn; and (ii) system calls originating from Phantom's own preloaded shim

We install a SIGSYS signal handler for the system calls trapped by the filter; whenever a system call matching the filter is invoked, the kernel traps it and calls our signal handler function instead

We use seccomp as the secondary interception strategy because, although it can intercept every system call, it is less efficient than preloading. It requires:

  1. a mode transition from the process into the kernel when the syscall is invoked
  2. execution of the seccomp filter
  3. a mode transition back into the process to invoke the shim callback function

Since most system calls are preloadable, the extra seccomp overhead is rarely incurred in practice

3.2.5 Emulating System Calls

Both system call interception strategies from §3.2.4 result in a syscall handler function being executed in the shim, i.e., within the managed process. System calls can be emulated either directly in the shim or in the controller (see Figure 2). In the Shim: Frequently made system calls that can be emulated using little state from the controller can be serviced directly in the shim without incurring additional overhead related to IPC. For example, the shim directly handles the time, gettimeofday, and clock_gettime system calls by arranging for the controller to share and maintain the current simulation time in a shared memory control block that is accessible to the shim as described in §3.2.6.

In the Controller: The remaining system calls are serviced in the simulator controller process. The system call number and arguments are sent to the controller using the IPC control channel as described in §3.2.6. The controller handles the system calls internally using lightweight implementations that effectively form a simulated kernel that completely replaces the functionality normally provided by the Linux kernel. The simulated kernel (re)implements (i.e., simulates) important system functionality, including: the passage of time; input and output operations on file, socket, pipe, timer, and event descriptors; packet transmissions with respect to transport layer protocols such as TCP and UDP; and aspects of computer networking including routing, queuing, and bandwidth limits. (See Appendix D for additional details.) Importantly, this approach enables us to establish a private, simulated network environment that is completely isolated from the real network, but is internally interoperable and entirely controllable. Determinism: Phantom uses a pseudorandom generator that is seeded with a configurable seed as its single source of randomness throughout the simulation. Care is taken to ensure that all random bytes that are needed during the simulation are initiated from this source, including during the emulation of system calls such as getrandom and when emulating reads from files like /dev/*random. This approach allows Phantom to produce deterministic simulations, improving scientific control over the experimentation process and enabling experimental results to be replicated.

Both interception strategies from §3.2.4 result in a syscall handler function executing in the shim, i.e., inside the managed process. A system call can be emulated either directly in the shim or in the controller:

(1) In the shim: frequently made system calls that need only a little state from the controller can be serviced directly in the shim, with no extra IPC-related overhead

For example, the shim handles the time, gettimeofday, and clock_gettime system calls directly, by arranging for the controller to share and maintain the current simulation time in a shared-memory control block accessible to the shim (as described in §3.2.6)

(2) In the controller: the remaining system calls are serviced in the simulator controller process

The system call number and arguments are sent to the controller over the IPC control channel described in §3.2.6

The controller handles the system calls internally with lightweight implementations that effectively form a simulated kernel, completely replacing the functionality normally provided by the Linux kernel

The simulated kernel reimplements important system functionality, including:

  1. the passage of time
  2. input and output operations on file, socket, pipe, timer, and event descriptors
  3. packet transmissions with respect to transport-layer protocols such as TCP and UDP
  4. aspects of computer networking, including routing, queuing, and bandwidth limits

Importantly, this approach lets us establish a private, simulated network environment that is completely isolated from the real network, yet internally interoperable and entirely controllable

Determinism: Phantom uses a pseudorandom generator, seeded with a configurable seed, as the single source of randomness throughout the simulation

Care is taken to ensure that all random bytes needed during the simulation originate from this source, including when emulating system calls such as getrandom and when emulating reads from files like /dev/*random

This lets Phantom produce deterministic simulations, improving scientific control over the experimentation process and enabling experimental results to be replicated

3.2.6 Managed Process-to-Controller Communication

We use control channels to exchange fixed-size messages with each managed process (e.g., system call arguments), and a memory manager to exchange dynamic amounts of data (e.g., a buffer passed to a send system call; see Figure 3). Control Channel: Phantom establishes a control channel with the shim of each managed process by allocating an initial block of shared memory and sharing the handle to this memory during process startup using an environment variable. This control block uses a fixed data structure layout that includes semaphores and messaging state (e.g., system call arguments). The semaphores provide a safe and efficient way for a message sender to signal that a new message is available and for a message receiver to wait for a new message; the controller uses this functionality to modulate the execution state of the process (see §3.2.7). We use shared memory and semaphores because we found this combination to perform better than alternative approaches (see Appendix C). Memory Manager: We designed an inter-process memory access manager to enable the controller to directly and efficiently read and write the memory of each managed process without extraneous data copies or control messages. The memory manager tracks the memory mappings that are active across various regions of a process's memory, which are analogous to the mappings found in the /proc/[pid]/maps file. Upon initialization, the memory manager creates a sparse memory file for each process, where a virtual address in the process corresponds to the same offset in the file. The memory manager initially remaps the process's stack and heap memory regions into this file.
As the process runs, the memory manager brokers all read, write, or other mapping requests that involve managed process memory in order to: (i) also map requests for anonymous private regions (such as those made when serving large allocation requests) into the shared file; (ii) maintain a consistent view of the process's address space; and (iii) simplify system call handling by translating memory pointers to shared memory pointers as needed. Whenever the memory manager receives an access request for an address that is not mapped into the shared file, it utilizes the kernel's process_vm_readv and process_vm_writev facilities to directly transfer data between the controller and the managed process's address space without copying it into kernel space.

我们使用"控制通道"与每个"受管进程"交换固定大小的消息(例如系统调用参数), 并使用"内存管理器"交换动态数量的数据(例如传递给 send 系统调用的缓冲区)

[Figure 3]

In plain words

The prose here is rather long-winded; the figure above conveys the design more intuitively and clearly

(1) Control channel: Phantom establishes a control channel with each managed process's shim by allocating an initial block of shared memory and sharing a handle to it, via an environment variable, during process startup

This control block uses a fixed data-structure layout that includes semaphores and messaging state (e.g., system call arguments)

The semaphores give the sender a safe and efficient way to signal that a new message is available, and the receiver a way to wait for one

The controller uses this facility to modulate the execution state of the process (see §3.2.7). Shared memory plus semaphores was chosen because the authors found this combination to perform better than the alternatives

(2) Memory manager: the authors designed an inter-process memory access manager that lets the controller read and write each managed process's memory directly and efficiently, without extraneous data copies or control messages

The memory manager tracks the memory mappings active across the various regions of a process's memory, analogous to the mappings found in the /proc/<pid>/maps file

On initialization, it creates a sparse memory file for each process, in which a virtual address in the process corresponds to the same offset in the file

It initially remaps the process's stack and heap regions into this file

As the process runs, the memory manager brokers every read, write, or other mapping request that involves managed-process memory, in order to:

  1. also map anonymous private regions (such as those created to serve large allocation requests) into the shared file
  2. maintain a consistent view of the process's address space
  3. simplify system call handling by translating memory pointers into shared-memory pointers as needed
    • Whenever the memory manager receives an access request for an address not mapped into the shared file, it uses the kernel's process_vm_readv and process_vm_writev facilities to transfer data directly between the controller and the managed process's address space, without copying it into kernel space

3.2.7 Managed Process/Thread Scheduling

We use the IPC control channel from §3.2.6 to control the execution state of each managed process. When a process first loads, it immediately waits on the channel semaphore to receive a message from Phantom before starting. When a Phantom worker runs (following the algorithm in §3.2.2), the worker initially sends a start message to the process it manages and waits to receive a message back from the process. The process then runs until it invokes a system call that is interposed as described in §3.2.4, sends a system call request message back through the control channel to the waiting Phantom worker, and waits to receive the system call result message from Phantom.

There are two possible scheduling outcomes when a Phantom worker handles a system call requested by a managed process. For system calls that can be handled immediately (non-blocking calls, or blocking calls for which a result is ready), the Phantom worker returns the result over the control channel and the scheduling cycle continues. For system calls that cannot be handled immediately (blocking system calls whose result is not ready), the Phantom worker must wait for some condition to become true (e.g., a packet to arrive or a timeout to occur). Such conditions are internally registered, and then the worker leaves the managed process in an idle state while it continues executing simulation events (and advancing simulation time). When the condition later becomes true (e.g., a timeout occurred), the worker executes an event that causes it to check the system call state and return the timeout result to the process over the control channel. The process continues executing and the scheduling cycle continues.

The effect of this scheduling process is that each Phantom worker only allows a single thread of execution across all processes it manages; each of the remaining managed processes/threads will always be idle, waiting for a result message from the worker for the previously requested system call. Using this scheduling process, Phantom has precise control over the execution state of all managed processes and guarantees nonconcurrent access of managed processes' memory through the memory manager from §3.2.6.

We use the IPC control channel from §3.2.6 to control the execution state of each managed process:

  1. When a process first loads, it immediately waits on the channel semaphore to receive a message from Phantom before starting
  2. When a Phantom worker runs (following the algorithm in §3.2.2), it first sends a start message to the process it manages and waits to receive a message back
  3. The process then runs until it invokes a system call interposed as described in §3.2.4, sends a syscall request message back over the control channel to the waiting worker, and waits to receive the syscall result message from Phantom

When a Phantom worker handles a system call requested by a managed process, there are two possible scheduling outcomes:

  1. System calls that can be handled immediately (non-blocking calls, or blocking calls whose result is ready)
    • the worker returns the result over the control channel and the scheduling cycle continues
  2. System calls that cannot be handled immediately (blocking calls whose result is not yet ready)
    • the worker must wait for some condition to become true (e.g., a packet arrives or a timeout fires)
    • such conditions are registered internally; the worker then leaves the managed process idle while it keeps executing simulation events (and advancing simulation time)
    • when the condition later becomes true (e.g., the timeout fired), the worker executes an event that checks the syscall state and returns the timeout result to the process over the control channel
    • the process resumes execution and the scheduling cycle continues

The effect of this scheduling process is that each Phantom worker allows only a single thread of execution across all the processes it manages; every other managed process/thread is always idle, waiting for the worker's result message for its previously requested system call.

This gives Phantom precise control over the execution state of all managed processes, and guarantees non-concurrent access to their memory through the memory manager of §3.2.6

3.2.8 Linux CPU Scheduling

Phantom is designed to work with the Linux CPU affinity (i.e., CPU pinning) scheduling feature. CPU affinity is a scheduling attribute associated with running Linux processes. A process's CPU affinity can be adjusted to restrict the process to run only on a specified subset of CPUs (e.g., a single CPU). CPU pinning can improve performance by reducing the frequency of cache misses, CPU migrations, and context switches. In particular, Linux semaphores shared between two same-core processes incur fewer context switches than when shared between cross-core processes (see Appendix C). Recall that Phantom will run either a worker thread or one of its managed processes, but never both at the same time. This design choice enables us to naturally pin each worker and all of its managed processes to the same core in order to capitalize on the CPU pinning performance benefits.

Phantom is designed to work with Linux's CPU affinity (i.e., CPU pinning) scheduling feature.

In plain words

Isn't this just the principle of locality ...

CPU affinity is a scheduling attribute of a running Linux process. It can be adjusted to restrict the process to run only on a specified subset of CPUs (e.g., a single CPU).

CPU pinning can improve performance by reducing the frequency of cache misses, CPU migrations, and context switches.

In particular, a Linux semaphore shared between two processes on the same core incurs fewer context switches than one shared between processes on different cores.

Recall that Phantom runs either a worker thread or one of its managed processes, but never both at the same time.

This design choice lets us naturally pin each worker and all of its managed processes to the same core, capitalizing on the performance benefits of CPU pinning.

Implementation

We implement Phantom using the plugin-based Shadow as a basis because: (i) we will show in §5.4 that Shadow outperforms other simulators; and (ii) it will be fairer to compare the plugin- and process-based architectures using tools that share the same foundation. See Appendix A.3 for Shadow details. Transforming Shadow: We forked Shadow v1.14.0 and identified the components that are no longer necessary for Phantom. Of the 94,259 lines of code (LoC) in Shadow v1.14.0, we removed 47,959 LoC (50.9%) containing a custom version of the GNU portable threads library that was used to simulate application threading [48], 14,498 LoC (15.9%) containing a custom loader that dynamically loads plugins using dlmopen [63], and 6,559 LoC (7.0%) that implemented the interface between Shadow and the libc functions it preloads. We also found that 6,315 LoC (6.7%) implemented tests and 2,123 LoC (2.3%) implemented tools, leaving just 16,805 LoC (17.8%) implementing core simulator functionality that Phantom integrates (see Appendix D for more details). Implementing Phantom: We implemented Phantom's design from §3 on top of our stripped down version of Shadow. Our full Phantom implementation supports 164 system calls and contains 56,742 LoC: tests account for about 15,653 LoC (27.6%), tools account for 1,956 LoC (3.4%), and the remaining 39,133 LoC (69.0%) implements core functionality.

  • Built on Shadow:

    • Phantom is developed on top of the plugin-based simulator Shadow (v1.14.0)
    • Reasons:
      1. §5.4 will show that Shadow itself outperforms the other simulators
      2. Sharing the same foundation makes the comparison between the plugin-based and process-based architectures fair
  • Transforming Shadow (stripping the old code):

    • Because Phantom adopts a process-based architecture, many of the complex components that supported Shadow's plugin architecture are no longer needed
    • Removed: about 50.9% of the code (a custom GNU portable threads library used to simulate application threading), 15.9% (a custom dynamic loader), and 7.0% (the interface to the preloaded libc functions)
    • Kept: only about 17.8% (16,805 LoC) of core simulator functionality remains as Phantom's foundation
  • Implementing Phantom:

    • The Phantom design from §3 is implemented on top of this stripped-down Shadow
    • Syscall support: 164 Linux system calls are currently supported
    • Code size: 56,742 LoC in total, of which core functionality is 69.0% (39,133 LoC); the rest is test code (27.6%) and tooling (3.4%)

From §5.4 of the paper, "Comparison to Related Tools":

| Tool | Architecture | Core mechanism | Strengths | Main limitations | Performance vs Phantom |
| --- | --- | --- | --- | --- | --- |
| Phantom | hybrid, multi-process | standard processes + seccomp + preloading | high performance, high compatibility, deterministic control, no time distortion | limited by the system calls the simulated kernel implements | baseline (fastest) |
| Mininet | emulation | Linux kernel + virtual network interfaces | high application realism (real kernel stack) | poor control and scalability; time distortion under high load | throughput degrades nonlinearly with load |
| NS-3 | simulation | application abstractions + network simulation | deterministic control, scalable | low application realism (runs abstractions, not real code) | 3.4x slower |
| gRaIL | hybrid, multi-process | processes + ptrace | good compatibility (real processes) | very inefficient (heavy ptrace overhead); IPC bottleneck | 43x slower |
| Shadow | hybrid, single-process | plugin namespaces + LD_PRELOAD | high performance (no IPC overhead) | poor compatibility (requires PIC/dynamic linking); correctness risk (missed static calls); hard to maintain | 2.2x slower |

Conclusion

We have designed, implemented, and thoroughly evaluated Phantom, a novel, high-performance network simulator for large-scale distributed systems. Phantom's multi-process design eliminates the compatibility, correctness, and maintainability limitations that we believe have inhibited the widespread adoption of existing plugin-based simulators. With our innovative synthesis of efficient process control, system call interposition, and data transfer mechanisms, Phantom also overcomes the inter-process performance challenges of the state-of-the-art multi-process simulator. Through our extensive evaluation, we have demonstrated that Phantom achieves better performance and is more scalable than alternative simulators across a variety of important benchmarks.

We have designed, implemented, and thoroughly evaluated Phantom, a novel high-performance network simulator for large-scale distributed systems. Phantom's multi-process design eliminates the compatibility, correctness, and maintainability limitations that, the authors believe, have inhibited the widespread adoption of existing plugin-based simulators.

Through an innovative synthesis of efficient process control, system call interposition, and data transfer mechanisms, Phantom also overcomes the inter-process performance challenges faced by the state-of-the-art multi-process simulator.

An extensive evaluation demonstrates that Phantom achieves better performance and scales further than the alternative simulators across a variety of important benchmarks.