
Evaluation

Note: this section matters mainly when reproducing the experiments; if you are just reading, there is no need to dwell on the exact experimental configurations and parameter settings.

In this section, we evaluate Unison in comparison with prior PDES algorithms and ML-based data-driven approaches. We mainly focus on the following questions:

  • Can Unison efficiently reduce the synchronization time while providing a transparent interface for users? (§6.1)

  • How accurate is Unison compared with sequential DES and data-driven approaches? (§6.2)

  • How do the partition and scheduling algorithms introduced in §4.2 and §4.3 improve the performance of Unison? (§6.3)

Our evaluation uses an optimized build of ns-3.36.1 on three different testbeds:

  • Testbed 1. To compare the overall performance of Unison with PDES, we run both Unison and other PDES algorithms over six identically configured machines. Each machine has 256GB RAM and two 12-core 2.2GHz CPUs with hyper-threading off.

  • Testbed 2. To examine packet behavior in detail, which requires recording 𝑃, 𝑆 and 𝑀 defined in §3.2, we run both Unison and other PDES algorithms on one machine which has 256GB RAM and two 8-core 2.0GHz CPUs with hyper-threading off.

  • Testbed 3. For ML-based data-driven approaches, we run their evaluations on one machine with two 28-core 2.0GHz CPUs and 512GB RAM, alongside two NVIDIA A100 GPUs with 40GB high bandwidth memory.

Unison is Fast and Transparent


Unison is faster than existing approaches. As shown in Figure 1, we first evaluate Unison, the sequential DES, and PDES algorithms under extremely skewed traffic (i.e., incast traffic ratio is 1). Unison can achieve over 10× speedup with the same number of CPU cores compared with the existing PDES algorithms.

We further evaluate the performance of Unison against an ML-based data-driven simulator, DeepQueueNet [40], under balanced traffic (i.e., incast traffic ratio is 0). We use three fat-tree topologies according to the configuration of DeepQueueNet [40]. Each link is 100Mbps with a delay of 500µs. For traditional PDES algorithms, we evenly divide the topologies as shown in Figure 3, which leads to 4 LPs on fat-tree 16 and 8 LPs on fat-tree 64 and fat-tree 128. For Unison, we launch 16 threads for all three topologies. For DeepQueueNet, since it supports parallel DNN inference on multiple GPUs, we use both GPUs in Testbed 3. We use the example device model provided by DeepQueueNet and only evaluate its inference time. As the number of injected packets increases, the DeepQueueNet implementation given by [40] crashes due to GPU memory overflow. Therefore, we only inject 0.32, 1.28 and 2.56 million packets into the three topologies, respectively. As shown in Figure 8a, Unison and the other PDES simulations are faster than the data-driven approach at small scale. As the scale of the fat-tree grows, Unison outperforms DeepQueueNet while providing full-fidelity simulation, achieving over 13× speedup over sequential DES using 16 threads. This is because DeepQueueNet operates at the packet level, so its simulation time is proportional to the number of packets. The parallel-efficient Unison achieves parallelism as high as a DNN's while avoiding long training times and providing full-fidelity simulation.

This experiment compares the performance of Unison, traditional parallel discrete-event simulation (PDES) algorithms, and the ML-based data-driven simulator DeepQueueNet under different traffic configurations and topologies. The variables and parameter settings involved are summarized below:

Experiment scenarios and variable configuration
  1. Comparison of Unison, traditional PDES algorithms and DeepQueueNet:

    • Traffic patterns: the comparison is run under two traffic patterns:
      • Extremely skewed traffic (incast traffic ratio = 1): used to compare Unison against traditional PDES algorithms.
      • Balanced traffic (incast traffic ratio = 0): used to compare Unison against DeepQueueNet.
  2. Topologies:

    • Three fat-tree topologies of different scales:
      • Fat-tree 16
      • Fat-tree 64
      • Fat-tree 128
  3. Network configuration:

    • Link bandwidth: 100 Mbps
    • Link delay: 500 µs
  4. PDES algorithm configuration:

    • For traditional PDES algorithms, the topologies are evenly divided into logical processes (LPs):
      • Fat-tree 16: 4 LPs
      • Fat-tree 64: 8 LPs
      • Fat-tree 128: 8 LPs
  5. Unison configuration:

    • Unison launches 16 threads for all three topologies.
  6. DeepQueueNet configuration:

    • Parallel DNN inference on the two GPUs of Testbed 3.
    • The example device model provided by DeepQueueNet is used; only its inference time is evaluated.
    • Because DeepQueueNet overflows GPU memory with too many packets, the following numbers of packets are injected into the three topologies:
      • Fat-tree 16: 0.32 million packets
      • Fat-tree 64: 1.28 million packets
      • Fat-tree 128: 2.56 million packets

Experiment results

  1. Unison vs. traditional PDES algorithms:

    • Under extremely skewed traffic, Unison is over 10× faster than traditional PDES algorithms with the same number of CPU cores.
  2. Unison vs. DeepQueueNet:

    • Under balanced traffic, Unison and traditional PDES algorithms are faster than the data-driven approach at small scale.
    • As the fat-tree scale grows, Unison outperforms DeepQueueNet with full-fidelity simulation, achieving over 13× speedup over sequential DES with 16 threads.
    • This advantage comes from Unison achieving DNN-level parallelism while avoiding long training times and providing full-fidelity simulation. In contrast, DeepQueueNet operates at the packet level, so its simulation time is proportional to the number of packets.

Unison is transparent to users. Since Unison's whole partition and scheduling process is automatic, all operations in the Unison kernel are transparent to users. Users only need to import the Unison kernel, and the simulator's parallelism can be freely adjusted without modifying the original DES model code. We enable Unison in several ns-3 examples [30], including queue discipline benchmarks, Nix-vector routing examples, multicast examples and a RIP routing model that changes the topology by tearing down links during the simulation to observe its convergence. All of these examples are simulated accurately without any modification to their model code.

The previous fat-tree experiment demonstrated that Unison can launch 16 threads, while the manual partition scheme of the other PDES algorithms can only produce up to 8 LPs for fat-tree 128. To further demonstrate Unison's flexibility, we increase the number of threads to 24 to fully utilize the hardware resources of Testbed 1. We reset the link bandwidth to 100Gbps and the link delay to 3µs to simulate a high-performance DCN for 1 second. The 𝑘 = 8 fat-tree, which other PDES algorithms can only divide symmetrically into 2, 4 and 8 LPs, can be simulated with 24 cores and achieves over 40× super-linear speedup, as shown in Figure 8b. Unison turns a two-day-long (45.2-hour) sequential DES run into 1.1 hours. The super-linear speedup is explained by the cache boost of fine-grained partition, which is further evaluated in §6.3. Unison's flexibility lets it fully utilize hardware resources without manually re-dividing the topology.
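As a quick sanity check on the numbers above (45.2 hours sequentially vs. 1.1 hours with 24 cores), a minimal computation:

```python
seq_hours = 45.2      # sequential DES runtime reported above
unison_hours = 1.1    # Unison runtime with 24 cores
cores = 24

speedup = seq_hours / unison_hours
print(round(speedup, 1))   # ≈ 41.1, i.e. over 40x
print(speedup > cores)     # True: super-linear (more than 24x on 24 cores)
```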

Experiment scenarios and variable configuration

Background

  • Transparency of Unison: the partition and scheduling process of the Unison kernel is fully automatic, so all operations are transparent to users. Users only need to import the Unison kernel and can freely adjust the simulator's parallelism without modifying the original DES model code.
  • Example applications: Unison is applied to several ns-3 examples, including queue discipline benchmarks, Nix-vector routing examples, multicast examples, and a RIP routing model that changes the topology by tearing down links during the simulation. All of these examples are simulated accurately without any modification to their code.

Experiment scenarios and variable configuration

  1. Goal

    • Demonstrate Unison's flexibility: by increasing the number of threads, show how Unison fully utilizes hardware resources, especially when simulating a high-performance data center network (DCN).
  2. Hardware

    • Testbed 1: the thread count is increased to fully utilize the hardware resources of Testbed 1.
  3. Topology

    • A 𝑘 = 8 fat-tree (fat-tree 128), which other PDES algorithms can only divide symmetrically into 2, 4 or 8 LPs by manual partitioning.
    • Thread counts:
      • Other PDES algorithms: manual partitioning yields at most 8 LPs.
      • Unison: automatic partitioning with 24 threads.
  4. Network configuration

    • Link bandwidth: 100 Gbps
    • Link delay: 3 µs
    • Simulated time: 1 second

Experiment results

  1. Super-linear speedup

    • Unison reduces a 45.2-hour sequential DES run to 1.1 hours, an over-40× super-linear speedup.
    • The super-linear speedup is caused by the cache boost of fine-grained partition, further evaluated in §6.3.
  2. Flexibility and hardware utilization

    • Unison fully utilizes hardware resources without manually re-dividing the topology.

This experiment demonstrates Unison's clear advantage in simulating high-performance DCNs: by automatically adjusting parallelism, Unison achieves higher performance than other PDES algorithms while fully utilizing hardware resources, further confirming its transparency and flexibility.

Unison eliminates the synchronization time. To understand the astonishing speedup brought by Unison, we run the 𝑘 = 8 fat-tree introduced in §3.2 and record 𝑃, 𝑆 and 𝑀 for each thread (instead of each LP). Here, the processing time 𝑃 covers both the event-processing and global-event-handling phases, and the messaging time 𝑀 covers both the event-receiving and window-update phases.

As illustrated in Figure 9a, the 𝑆 of Unison is surprisingly less than 2% of the total time in every case, and the 𝑀 is less than 0.3% of the total time, which is hardly visible in Figure 9a, so we omit it. Meanwhile, the 𝑃 is also about 20% lower than that of the other two algorithms, a benefit of the cache boost. These two factors together let Unison achieve 5× speedup relative to the other PDES algorithms. We also measure the 𝑆 of Unison under balanced traffic in the first 1000 rounds of the simulation. As shown in Figure 9b, the ratio of 𝑆 stays mainly under 1% in every round. Compared with Figure 5b, we can see that Unison solves the issue of highly unbalanced processing time within a transient time window. Therefore, the performance gain of Unison mainly comes from eliminating the synchronization time with adaptive scheduling optimized for network simulation, while reducing the processing time by leveraging the cache effect of fine-grained partition.
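To make the breakdown concrete, here is a small sketch of how the per-thread shares of 𝑃, 𝑆 and 𝑀 relate to the totals in Figure 9a. The numbers below are hypothetical, chosen only to mirror the reported shape, not measured values:

```python
def time_fractions(P, S, M):
    """Return each component's share of a thread's total wall-clock time."""
    total = P + S + M
    return P / total, S / total, M / total

# hypothetical per-thread measurements (seconds), mirroring the reported
# shape: S below 2% and M below 0.3% of the total time
p, s, m = time_fractions(P=100.0, S=1.5, M=0.2)
print(s < 0.02, m < 0.003)  # True True
```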

Experiment scenarios and variable configuration

Background

This experiment analyzes the speedup Unison obtains by eliminating synchronization time and optimizing scheduling, and compares its performance with other parallel discrete-event simulation (PDES) algorithms under different conditions. The contents and variable/parameter configurations are summarized below:

  • Goal: understand why Unison achieves such a large speedup, particularly the effects of eliminating synchronization time and optimizing processing time.
  • Topology: the 𝑘 = 8 fat-tree (fat-tree 128) introduced in §3.2.
  • Baselines: other PDES algorithms.

Experiment scenarios and variable configuration

  1. Key metrics

    • Processing time (𝑃): covers both the event-processing and global-event-handling phases.
    • Synchronization time (𝑆): time spent synchronizing between threads.
    • Messaging time (𝑀): covers both the event-receiving and window-update phases.
    • Thread-level recording: these metrics are recorded per thread (instead of per LP).
  2. Conditions

    • Extremely skewed traffic: mainly used to observe how Unison handles unbalanced processing time.
    • Balanced traffic: used to evaluate Unison's synchronization time under balanced load.

Experiment results

  1. Synchronization time (𝑆) and messaging time (𝑀)

    • Extremely skewed traffic (Figure 9a):
      • Unison's synchronization time (𝑆) is less than 2% of the total time in every case, and its messaging time (𝑀) is less than 0.3%, hardly visible in the figure and therefore omitted.
    • Balanced traffic (Figure 9b):
      • In the first 1000 rounds of the simulation, the ratio of 𝑆 stays mainly under 1%.
  2. Processing time (𝑃)

    • Unison's processing time (𝑃) is about 20% lower than that of the other PDES algorithms, thanks to the cache boost of fine-grained partition.
  3. Performance gain

    • By eliminating synchronization time (𝑆) and reducing processing time (𝑃) via the cache effect, Unison achieves 5× speedup over the other PDES algorithms.
    • Unison also solves the problem of highly unbalanced processing time within a transient time window, a key source of its performance gain.

Summary

This experiment shows that Unison greatly improves simulation speed by eliminating synchronization time and by using fine-grained partition to exploit the cache effect, which substantially reduces processing time (𝑃). Its adaptive scheduling is optimized for network simulation, making the synchronization and messaging overhead nearly negligible, so Unison performs well when simulating large network topologies.

Unison is effective in various scenarios. To evaluate the generality of Unison across different scenarios, we introduce four new topologies: 2D-torus [28], BCube [16], and two wide-area networks from the Internet Topology Zoo [22], each with two traffic patterns: web-search [4] and gRPC [27]. Additionally, we add a reconfigurable DCN model [8] to analyze the performance impact of dynamic topologies on Unison.


For the 2D-torus topology, we set the size to 48 × 48, with a link bandwidth of 10Gbps and a link delay of 30µs. Each node in the topology, located at the \(i\)-th row and \(j\)-th column, was assigned an ID of \(i + 48 \times j\). For other PDES algorithms requiring manual partitioning, we treated contiguous node sub-arrays of the ID range [0, 48 × 48) as logical processes (LPs). We ran the simulation using 96 and 144 cores for 1 second of simulated time, generating traffic at the bisection bandwidth and recording the results. As shown in Figure 10a, Unison outperformed the other algorithms by nearly 4×. The sequential DES could not complete within 5 days, only advancing the simulated time to 0.6 seconds, while Unison reduced this to less than 1.5 hours with 144 cores.
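The node numbering and manual LP partition described above can be sketched as follows. This is a minimal illustration; the helper names are ours, not Unison's API:

```python
N = 48  # the torus is 48 x 48

def node_id(i, j, n=N):
    # the node at row i, column j is assigned ID i + n * j
    return i + n * j

def manual_partition(num_lps, n=N):
    # split the ID range [0, n*n) into equal contiguous sub-arrays,
    # one per logical process (LP), for the classic PDES algorithms
    size = (n * n) // num_lps
    return [range(lp * size, (lp + 1) * size) for lp in range(num_lps)]

print(node_id(0, 0), node_id(47, 47))   # 0 2303
print(len(manual_partition(96)))        # 96 LPs of 24 nodes each
```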


For the BCube topology, we set \(n = 8\) and constructed the topology using 2 levels. The link bandwidth was set to 10Gbps, and the link delay to 3µs. We generated traffic according to the web-search and gRPC distributions, including incast traffic, which together accounted for 30% of the bisection bandwidth. Each BCube 0 was treated as an LP (resulting in 8 LPs), and the top-level switches were distributed evenly among these LPs for other PDES algorithms. As shown in Figure 10b, Unison achieved the highest speed among all algorithms. Under gRPC traffic, Unison achieved nearly 10× speedup with 8 cores and 15× speedup with 16 cores.


For the wide-area backbone networks GEANT and ChinaNet, we used RIP dynamic routing with a 50% traffic load generated according to the web-search distribution. Since finding a symmetric division for these irregular topologies was impossible, we excluded other PDES algorithms. We launched 8 threads for Unison. As illustrated in Figure 10c, Unison achieved over 10× super-linear speedup relative to sequential DES.


For the reconfigurable DCN, we used 4 cores to simulate a 10Gbps, \(k = 4\) fat-tree under balanced traffic for 1 second. At specific intervals, we replaced all core switches with an optical switch and then reverted to the original configuration by changing the link connectivity between ToR and core switches. The configurations were similar to TDTCP [8]. As shown in Figure 10d, the simulation time for Unison and the default sequential simulator slightly increased as the topology change frequency increased. Thus, we conclude that the performance impact of Unison under dynamic topologies is negligible.

Experiment scenarios and parameter configuration
  1. Goal

    • Evaluate the generality of Unison across different scenarios. Four new topologies are introduced: 2D-torus, BCube, and two wide-area networks (GEANT and ChinaNet) from the Internet Topology Zoo, each combined with two traffic patterns (web-search and gRPC). A reconfigurable data center network (DCN) model is also added to analyze the impact of dynamic topologies on Unison.
  2. Topologies and parameters

    • 2D-torus:
      • Size: 48 × 48
      • Link bandwidth: 10Gbps
      • Link delay: 30µs
      • Node ID assignment: the node at row \(i\), column \(j\) gets ID \(i + 48 \times j\)
      • Other PDES algorithms: manual partition over the ID range [0, 48 × 48), with node sub-arrays as logical processes (LPs)
      • Compute: 96 and 144 cores, 1 second of simulated time
    • Result: Unison outperforms the other algorithms by nearly 4×. Sequential DES could not finish within 5 days, advancing the simulated time only to 0.6 seconds, while Unison finished in less than 1.5 hours with 144 cores.

    • BCube:

      • Parameter n: \(n = 8\)
      • Levels: 2
      • Link bandwidth: 10Gbps
      • Link delay: 3µs
      • Traffic: generated from the web-search and gRPC distributions, including incast traffic, together taking 30% of the bisection bandwidth
      • LPs: each BCube 0 forms one LP (8 LPs in total); for other PDES algorithms, the top-level switches are distributed evenly among these LPs
      • Compute: 8 and 16 cores
    • Result: Unison is the fastest of all algorithms. Under gRPC traffic it achieves nearly 10× speedup with 8 cores and 15× speedup with 16 cores.

    • Wide-area backbones (GEANT and ChinaNet):

      • Routing: RIP dynamic routing
      • Traffic load: 50%, generated from the web-search distribution
      • PDES algorithms: excluded, since no symmetric division exists for these irregular topologies
      • Compute: 8 threads for Unison
    • Result: Unison achieves over 10× super-linear speedup relative to sequential DES.

    • Reconfigurable DCN:

      • Topology: 10Gbps, \(k = 4\) fat-tree
      • Traffic: balanced
      • Topology changes: at specific intervals, all core switches are replaced with an optical switch and then reverted, by changing the link connectivity between ToR and core switches
      • Configuration: similar to TDTCP
      • Compute: 4 cores, 1 second of simulated time
    • Result: the simulation time of both Unison and the default sequential simulator increases only slightly as the topology change frequency grows, so the performance impact of dynamic topologies on Unison is negligible.
  3. Conclusion

    • Unison performs well across diverse topologies and traffic patterns, and its performance is nearly unaffected under dynamic topologies.

Unison is Accurate and Deterministic

Unison is accurate and deterministic thanks to several practical optimizations, including the lock-free implementation, the tie-breaking rule and support for global thread-safe measurement.


Unison is as accurate as DES. To evaluate the accuracy of Unison and MimicNet, a data-driven approach [42], we simulate TCP New Reno with RED queues for 5 seconds in both 2-cluster and 4-cluster fat-trees. The number of hosts per rack is 2, so each cluster has 4 hosts. The link speed is 100Mbps with 500µs delay, similar to the setup given in MimicNet [42]. We generate the same traffic for Unison and MimicNet, sampled from the web-search distribution and taking up 70% of the bisection bandwidth. In addition, each time a flow is generated, its destination has a 10% chance of being changed to a random host in the rightmost cluster. For Unison, we record statistics with the FlowMonitor module of ns-3 and use the default sequential DES kernel of ns-3 as the baseline. For MimicNet, we train the ingress, egress and inter-mimic models according to their default configurations using the same traffic generation method but a different random seed, and use its underlying OMNeT++ as the baseline.

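The traffic pattern described above can be sketched roughly as follows. This is our own illustrative generator, not the paper's code; the actual web-search flow-size distribution from [4] is replaced by a placeholder sampler:

```python
import random

def generate_flows(hosts, rightmost_cluster, num_flows, sample_flow_size, seed=42):
    """Sketch of the traffic pattern: random src/dst pairs with sampled flow
    sizes; with 10% probability the destination is rewritten to a random host
    in the rightmost cluster (the incast component)."""
    rng = random.Random(seed)  # fixed seed keeps the workload reproducible
    flows = []
    for _ in range(num_flows):
        src, dst = rng.sample(hosts, 2)
        if rng.random() < 0.1:
            dst = rng.choice(rightmost_cluster)
        flows.append((src, dst, sample_flow_size(rng)))
    return flows

# placeholder size sampler standing in for the web-search distribution
flows = generate_flows(hosts=list(range(16)), rightmost_cluster=[12, 13, 14, 15],
                       num_flows=1000,
                       sample_flow_size=lambda r: r.randint(1, 10**6))
```

Fixing the seed makes the generated workload identical across simulators, matching the paper's requirement that Unison and MimicNet see the same traffic.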

As shown in Table 2, MimicNet can accurately predict the 2-cluster fat-tree. However, its accuracy drops for RTT and throughput when simulating the 4-cluster fat-tree. This is because MimicNet only trains one cluster and uses it to predict the other clusters' performance, which is unsuitable for traffic that does not scale proportionally in this incast situation. For Unison, there is only a slight difference, due to its different tie-breaking rule for simultaneous events (unlike sequential DES, which always processes the earliest-created event first). Notably, we cannot compare the accuracy of sequential DES and Unison, since both outcomes are possible if all simultaneous events are handled in random order.



Experiment summary and analysis
  1. Goal

    • Evaluate Unison's accuracy and determinism. Unison maintains both through several practical optimizations, including the lock-free implementation, the tie-breaking rule and support for global thread-safe measurement.
  2. Setup

    • Compared systems: Unison and MimicNet (a data-driven approach)
    • Workload: TCP New Reno with RED queues
    • Topologies: 2-cluster and 4-cluster fat-trees
    • Cluster configuration: 2 hosts per rack, 4 hosts per cluster
    • Link speed: 100Mbps
    • Link delay: 500µs
    • Traffic generation:
      • Sampled from the web-search distribution, taking 70% of the bisection bandwidth
      • Each generated flow has a 10% chance of having its destination changed to a random host in the rightmost cluster
    • Unison measurement: statistics recorded with the FlowMonitor module of ns-3, with the default sequential DES kernel of ns-3 as the baseline.
    • MimicNet training:
      • The ingress, egress and inter-mimic models are trained with their default configurations, using the same traffic generation method but a different random seed, with its underlying OMNeT++ as the baseline.
  3. Results

    • 2-cluster fat-tree: MimicNet predicts the performance accurately.
    • 4-cluster fat-tree: MimicNet's accuracy drops, especially for RTT and throughput, because it trains only one cluster and uses it to predict the other clusters, which does not fit this incast scenario where traffic does not scale proportionally.
    • Unison accuracy:
      • Because of its different tie-breaking rule for simultaneous events (unlike sequential DES, which always processes the earliest-created event first), Unison's results differ from sequential DES only slightly.
      • Note that sequential DES and Unison cannot be compared directly on accuracy, since both outcomes are possible when simultaneous events are processed in random order.
  4. Conclusion

    • Accuracy and determinism: Unison is about as accurate as DES, with small differences when handling simultaneous events that do not affect overall accuracy. MimicNet performs well on small clusters, but its accuracy drops on more complex topologies.

Unison is deterministic. The tie-breaking rule introduced by Unison guarantees that the simulation is deterministic. To verify this, we simulate a 𝑘 = 4 fat-tree 10 times and record the event count of each algorithm. As shown in Figure 11a, the event count of Unison is always the same. Yet the event counts of the other two PDES algorithms fluctuate from run to run, leading to inconsistent results, as illustrated in Figure 11b. Moreover, since the other two algorithms cannot measure global statistics, their results are neither precise nor accurate. We also run Unison with 1 to 16 threads; its event count and simulation results match exactly regardless of the number of threads used.

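Unison's exact tie-breaking rule is defined in the paper; the general idea of making simultaneous events deterministic can be sketched like this. The secondary-key fields below are our assumption for illustration, not Unison's actual rule:

```python
import heapq

class DeterministicEventQueue:
    """Events with equal timestamps are ordered by a deterministic secondary
    key (here: node ID, then a per-source sequence number), so the pop order
    never depends on thread timing or insertion order."""

    def __init__(self):
        self._heap = []

    def push(self, timestamp, node_id, seq, event):
        # tuples compare lexicographically, so ties on timestamp fall
        # through to (node_id, seq)
        heapq.heappush(self._heap, (timestamp, node_id, seq, event))

    def pop(self):
        timestamp, node_id, seq, event = heapq.heappop(self._heap)
        return event

q = DeterministicEventQueue()
q.push(5.0, node_id=2, seq=0, event="b")
q.push(5.0, node_id=1, seq=0, event="a")   # same timestamp: node 1 wins
q.push(1.0, node_id=9, seq=0, event="c")
print(q.pop(), q.pop(), q.pop())           # c a b
```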


Micro Benchmarks

We evaluate the benefits of fine-grained partition, and tune the configurations of load-adaptive scheduling for different networking scenarios.

Fine-grained partition reduces cache misses. As shown in Figure 8b and Figure 9a, Unison can achieve super-linear speedup thanks to its cache-friendly design. To further validate this claim, we run Unison with manual partition on a 12×12 torus network like the one discussed in §6.1, using 1 thread. As illustrated in Figure 12a, the number of cache misses decreases as the number of LPs increases, resulting in a 1.5× faster simulation with the most fine-grained partition (i.e., 144 LPs, where each node is an LP).


The finer the partition granularity, the fewer the cache misses and the shorter the simulation time.


To explore how different partition schemes affect performance, we use Unison with manual partition to simulate the DCTCP example [4] from §6.2 with 4 threads. In this model, a bottleneck link carrying heavy traffic connects the sender clusters to the receiver clusters. We use Unison's fine-grained partition, but avoid cutting this bottleneck link in our first manual partition scheme. In our second, coarse-grained manual partition scheme, we also avoid cutting links inside the clusters. As shown in Figure 12b, fine-grained partition reduces the cache misses caused by frequent interleaving when a link carrying heavy traffic is cut. It also produces balanced scheduling and reduces the simulation time compared with coarse-grained partition, despite a small increase in cache misses when using multiple threads.



Note
  1. Independent variable: coarse-grained vs. fine-grained partition
  2. Setting: the bottleneck link is not cut
  3. Conclusion: fine-grained partition reduces cache misses, improves scheduling balance and shortens the simulation time

Evaluation on scheduling metrics and periods. To measure the impact of the different metrics used in load-adaptive scheduling, we define a slowdown factor, 𝛼, which is the sum of the actual completion times of all rounds divided by the sum of the idealistic round times. The idealistic round time is calculated by assuming that the scheduler knows the exact processing time of each LP in advance, and schedules these LPs according to their exact processing times.

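In symbols, writing \(T_r\) for the actual completion time of round \(r\) and \(T_r^{*}\) for the idealistic round time (computed assuming the exact per-LP processing times are known in advance):

\[
\alpha = \frac{\sum_r T_r}{\sum_r T_r^{*}} \ge 1 ,
\]

so a scheduling metric is better the closer it brings \(\alpha\) to 1.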


We simulate a 𝑘 = 8 fat-tree. Figure 12c shows that the default metric used by Unison, the processing time of the last round, stands out: it reduces the slowdown factor by 6% compared with no scheduling (i.e., random LP priorities) as the number of threads increases to 16, and incurs only 2% more round time than the idealistic schedule.

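A minimal sketch of this kind of load-adaptive scheduling, assuming the priority metric is each LP's processing time in the previous round. This is our own greedy list-scheduling illustration; Unison's real scheduler is more involved:

```python
import heapq

def schedule_round(last_round_times, num_threads):
    """Assign LPs to threads: LPs with the longest last-round processing
    time go first, each to the currently least-loaded thread."""
    loads = [(0.0, t) for t in range(num_threads)]
    heapq.heapify(loads)
    assignment = {t: [] for t in range(num_threads)}
    # priority metric: processing time of the last round, descending
    for lp in sorted(range(len(last_round_times)),
                     key=lambda x: last_round_times[x], reverse=True):
        load, t = heapq.heappop(loads)      # least-loaded thread so far
        assignment[t].append(lp)
        heapq.heappush(loads, (load + last_round_times[lp], t))
    return assignment

# four LPs on two threads: times [4, 3, 3, 2] balance to loads of 6 and 6
print(schedule_round([4.0, 3.0, 3.0, 2.0], num_threads=2))  # {0: [0, 3], 1: [1, 2]}
```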

We evaluate the performance impact of different scheduling periods, since there is a trade-off between scheduling benefits and overheads (i.e., the cost of sorting LPs). As shown in Figure 12d, the performance of Unison improves as the period increases to 16. Further increases in the scheduling period degrade performance. Based on our experience, a period of 16 is large enough to handle \(2^{16}\) LPs with minimal impact on performance.
