Unison: A Parallel-Efficient and User-Transparent Network Simulation Kernel

Abstract

Discrete-event simulation (DES) is a prevalent tool for evaluating network designs. Although DES offers full fidelity and generality, its slow performance limits its application. To speed up DES, many network simulators employ parallel discrete-event simulation (PDES). However, adapting existing network simulation models to PDES requires complex reconfigurations and often yields limited performance improvement. In this paper, we address this gap by proposing a parallel-efficient and user-transparent network simulation kernel, Unison, that adopts fine-grained partition and load adaptive scheduling optimized for network scenarios. We prototype Unison based on ns-3. Existing network simulation models of ns-3 can be seamlessly transitioned to Unison. Testbed experiments on commodity servers demonstrate that Unison can achieve a 40× speedup over DES using 24 CPU cores, and a 10× speedup compared with existing PDES algorithms under the same CPU cores.


Keywords: Network simulation, Parallel discrete-event simulation, Data center networks

Introduction

Discrete-event simulation (DES) is a prevalent tool for evaluating network designs among various simulation approaches. Popular network simulators including ns-3 [30], OMNeT++ [31] and ns.py [18] are all based on DES. Many influential studies on networks, such as DCTCP [4], DCQCN [44], TIMELY [27], PowerTCP [2] and ABM [1] have used these DES-based simulators to provide convincing evaluation results. The full-fidelity accuracy and packet-level details of DES make it the de facto ground truth for network performance estimation [9, 36, 40, 42, 43]. The extensive generality of DES also enables researchers to understand network behaviors under buggy configurations (e.g., BGP route leak), heavy workload (e.g., DDoS attacks) and other extreme scenarios (e.g., TCP incast), which are costly to produce and analyze in the real world.

Although DES offers full fidelity and generality, its slow performance limits its application. As the scale of current networks evolves, such as expanding data center networks (DCNs), the simulation time for these giant networks is unacceptably long for DES. Based on our experience, simulating a 1536-host fat-tree connected with 100Gbps links for only 0.1 seconds using the DES kernel of ns-3 on a modern server CPU can take over a day to complete. Moreover, deploying technologies into these giant networks will encounter phenomena that a small-scale simulation or testbed cannot reflect [12]. For example, the performance of congestion control algorithms suffers considerable degradation in large-scale DCNs with a large amount of bursty traffic [14]. Therefore, a full-scale DES for these giant networks is necessary to understand their performance.


To speed up DES, many network simulators offer parallel discrete-event simulation (PDES). PDES requires the simulated network topology to be spatially divided into multiple partitions before the simulation starts. Each partition is called a logical process (LP). A synchronization mechanism between LPs is required to ensure the correctness of the whole simulation [10, 23]. As the simulated network is divided into multiple LPs, PDES can improve the simulation speed by processing these LPs in parallel.
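The synchronization mechanism described above can be illustrated with a minimal, single-threaded sketch of conservative barrier-based PDES. This is our illustration, not ns-3's or Unison's actual kernel: the names `LogicalProcess` and `run_conservative` are hypothetical, and a real kernel would run the per-LP loop on parallel threads.

```python
import heapq
import itertools

class LogicalProcess:
    """One spatial partition of the simulated network, holding its own
    future-event queue (timestamps are simulated seconds)."""
    def __init__(self, lp_id):
        self.lp_id = lp_id
        self._seq = itertools.count()  # breaks ties so heapq never compares callbacks
        self._queue = []

    def schedule(self, ts, callback):
        heapq.heappush(self._queue, (ts, next(self._seq), callback))

    def next_time(self):
        return self._queue[0][0] if self._queue else float("inf")

    def process_until(self, bound):
        # Run every local event strictly before the causality-safe bound.
        while self._queue and self._queue[0][0] < bound:
            ts, _, callback = heapq.heappop(self._queue)
            callback(ts)

def run_conservative(lps, lookahead, end_time):
    """Barrier-synchronized rounds: events in [t_min, t_min + lookahead)
    are safe to run, because any cross-LP message takes at least
    `lookahead` (e.g., the minimum link delay) of simulated time to
    arrive, so no LP can receive an event earlier than the bound."""
    while True:
        t_min = min(lp.next_time() for lp in lps)
        if t_min >= end_time:
            break
        bound = t_min + lookahead
        for lp in lps:  # a real PDES kernel runs this loop in parallel
            lp.process_until(bound)
```

The barrier at the end of each round is exactly where stragglers hurt: every core must wait for the slowest LP, which motivates the load-balancing discussion later in this paper.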


However, adapting existing network simulation models to PDES requires complex reconfigurations and often yields limited performance improvement. Therefore, many researchers still prefer the slower DES over PDES [23]. In this paper, we perform a deep profiling of existing PDES algorithms based on ns-3. We argue that there are two major limitations of the existing PDES algorithms for networks:

  • Complex manual configurations. Existing algorithms require extensive modifications to model code, due to the manual partitioning of the simulated network and the aggregation of results collected from each LP. The configurations are inflexible, rely on experience, and can directly compromise the parallel performance.

  • Limited speedup. The existing static partitioning schemes and synchronization mechanisms of PDES are not well suited to emerging low-latency, high-bandwidth, and large-scale networks, leading to significant synchronization overhead and parallel inefficiency.


These two limitations make PDES inflexible and infeasible to scale up [29, 32]. To avoid this problem, recent works cleverly borrow advances in machine learning (ML) to make the simulation faster with GPUs [9, 36, 40, 42]. Although these ML-based data-driven methods can obtain satisfactory execution efficiency in well-trained scenarios, their generality is still limited. For new scenarios and algorithms, it is difficult to apply data-driven methods since there is insufficient data to train the ML models. Moreover, they still suffer from 7- to 12-hour training times and can only obtain approximated results [40, 42].


In this paper, we address this gap by proposing a parallel-efficient and user-transparent network simulation kernel, named Unison. The main idea of Unison is to perform automatic, fine-grained partition and dynamic, load-adaptive scheduling optimized for network scenarios in PDES.

For fine-grained partition, Unison divides the simulated network into fine-grained LPs automatically before the simulation starts to improve cache efficiency and for further scheduling. With this approach, Unison frees users from complex manual configurations when setting up the simulated network.
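A minimal sketch of what such automatic partitioning might look like, under the simplifying assumptions that every node becomes its own LP and that partitions are cut at point-to-point links whose delays bound cross-LP message arrival. The helper `fine_grained_partition` is hypothetical, not Unison's actual interface:

```python
def fine_grained_partition(nodes, links):
    """Assign every simulated node to its own LP (the finest granularity),
    and derive the kernel's lookahead from the smallest link delay, since
    a packet needs at least that long to cross between partitions.
    `links` is a list of (u, v, delay_seconds) point-to-point links."""
    lp_of = {node: lp_id for lp_id, node in enumerate(nodes)}
    lookahead = min(delay for _, _, delay in links)
    return lp_of, lookahead
```

Because this mapping is derived from the topology the user already describes, no extra per-LP configuration is needed, and producing many more LPs than cores is precisely what gives the scheduler room to balance load.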

For load-adaptive scheduling, Unison decouples the relationship between LPs and processor cores, leaving the load-balancing job among each core to its scheduler. The scheduler of Unison dynamically balances the workload with several heuristic methods based on network characteristics. With this approach, all processor cores can finish their events in unison, resulting in an efficient parallelization under various topologies and traffic patterns.
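One plausible heuristic for such a scheduler is a greedy longest-processing-time (LPT) assignment, re-run each round using the previous round's per-LP event counts as load estimates. This is a sketch of the general idea, not necessarily the heuristic Unison implements:

```python
import heapq

def balance(lp_loads, num_cores):
    """Greedy longest-processing-time (LPT) mapping of LPs to cores.
    `lp_loads` maps LP id -> estimated load (e.g., last round's event
    count). Returns a dict: core id -> list of assigned LP ids."""
    core_heap = [(0, core) for core in range(num_cores)]  # (total_load, core_id)
    heapq.heapify(core_heap)
    assignment = {core: [] for core in range(num_cores)}
    # Place the heaviest LPs first, each on the least-loaded core so far.
    for lp_id, load in sorted(lp_loads.items(), key=lambda kv: -kv[1]):
        total, core = heapq.heappop(core_heap)
        assignment[core].append(lp_id)
        heapq.heappush(core_heap, (total + load, core))
    return assignment
```

Recomputing this mapping every synchronization round is what makes the scheduling adaptive: when traffic shifts and an LP's event rate changes, its LPs migrate to other cores instead of stalling the barrier.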


We prototype Unison based on ns-3 and address several practical challenges. We improve the performance and usability of Unison by introducing a lock-free execution workflow on a shared-memory architecture, which is fully transparent to users.

With this approach, Unison can support dynamic topologies (e.g., reconfigurable DCN [8]) and obtain global statistics (e.g., FCT). By introducing a tie-breaking rule for simultaneous events, these collected statistics are also deterministic and reproducible. To scale Unison on a cluster, we also implement a hybrid simulation kernel for distributed simulation. As a preview of our work, Figure 1 shows that Unison can achieve over 10× speedup compared with the existing PDES algorithms, turning a DES run of over two days into less than two hours. Moreover, all existing network simulation models included with ns-3 can be seamlessly transitioned to Unison to obtain performance improvements.
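The tie-breaking idea can be illustrated with a total order on event keys: timestamp first, then the scheduling context (e.g., the node that scheduled the event), then a per-context sequence number. The field names below are illustrative, not Unison's:

```python
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class EventKey:
    """Deterministic total order on events. Comparing by (timestamp,
    context_id, seq) means simultaneous events execute in the same
    order on every run, regardless of thread count or interleaving."""
    timestamp: float
    context_id: int
    seq: int
```

Because no two events share all three fields, the order is total, and any statistics aggregated across LPs come out bit-identical between runs.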


[Figure 1: preview of Unison's speedup over existing PDES algorithms; image not included.]

This paper contributes to the field of network simulation and network performance estimation through the following:

  • A deep profiling of existing PDES algorithms, highlighting important observations about the root cause of their complex configurations and unsatisfactory performance: static manual partitioning of the simulated network (§3).

  • A fine-grained partition scheme and a load-adaptive scheduling strategy for PDES, optimized for network scenarios, achieving a 40× speedup over DES using 24 CPU cores, and a 10× speedup over existing PDES algorithms (§4).

  • A user-transparent and lock-free implementation of Unison on ns-3, addressing several practical challenges to further improve its performance and usability (§5).

  • An extensive evaluation of Unison compared with existing PDES algorithms and data-driven approaches, demonstrating its high performance, scalability and usability under various topologies and traffic patterns (§6).

Our implementation of Unison is open-sourced. This work does not raise any ethical issues.
