Why Don't We Use PDES in Practice?
Open-source artifacts of network designs have rarely used PDES for evaluation in the past three years, with only a few exceptions [11, 25, 40, 42]. Through the following analysis, we find that the main obstacles to the application of PDES are complex manual configurations (§3.1) and slow speedup (§3.2).
Complex Manual Configurations
When adapting a DES network model to PDES, the following additional steps are required.
Dividing the network topology. As mentioned in §2.3, PDES requires us to divide the network topology manually into a set of LPs. For efficient parallelization, it is necessary to balance the workload of each LP. However, since the topology and workload in different simulation models can vary, there is no panacea for achieving optimal partition results in every scenario [39, 41]. This means that the partition is inflexible and heavily relies on experience. For example, in the case of a \(k\)-ary fat-tree topology [3] with a balanced workload, we can perform a symmetric partition where each pod is treated as an LP, and the core layer is evenly distributed among the clusters, as shown in Figure 3. In this case, the number of LPs is fixed at \(k\), meaning that the maximum possible speedup is also fixed at \(k\)-fold. If the current hardware resources do not have enough slots to launch \(k\) processes, or if we wish to further increase the speedup of the model, we must perform another partition, which can cause parallel inefficiency due to the spatial asymmetry of the new partition scheme. Additionally, if the simulated traffic pattern is not balanced among the LPs (e.g., an incast scenario), even a symmetric partition scheme may not reduce the waiting time. In this case, heuristic static partition methods that try to produce a balanced partition require users to provide hints or pre-run the whole simulation to acquire such hints [31, 38], which leads to extra time cost and also complex configurations.
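To make the partition step concrete, below is a minimal sketch (not the authors' code) of how a pod-per-LP assignment might look with ns-3's MPI-based distributed simulator, assuming a \(k = 8\) fat-tree; the channel, stack and application setup is elided, and the loop body only illustrates that every node must be created with an explicit systemId and that the number of ranks is tied to the chosen partition.

```cpp
// A minimal sketch (not the paper's code): assigning each fat-tree pod to one
// LP (MPI rank) with ns-3's distributed simulator. Pod/core wiring is elided.
#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/mpi-interface.h"

using namespace ns3;

int main(int argc, char* argv[])
{
    // Select the distributed (PDES) kernel before enabling MPI.
    GlobalValue::Bind("SimulatorImplementationType",
                      StringValue("ns3::DistributedSimulatorImpl"));
    MpiInterface::Enable(&argc, &argv);

    const uint32_t k = 8;                 // fat-tree parameter (assumed here)
    uint32_t nLp = MpiInterface::GetSize();
    if (nLp != k)
    {
        // The symmetric partition is rigid: k pods require exactly k ranks.
        MpiInterface::Disable();
        return 1;
    }

    // Every node must be created with the systemId (LP) it belongs to.
    for (uint32_t pod = 0; pod < k; ++pod)
    {
        for (uint32_t h = 0; h < k * k / 4; ++h)
        {
            Ptr<Node> host = CreateObject<Node>(pod);  // host lives on LP `pod`
            (void)host;
            // Edge/aggregation switches of this pod get systemId = pod as well,
            // and the core switches are spread evenly across the k ranks.
        }
    }
    // ... install channels, stacks and applications, then run ...

    Simulator::Run();
    Simulator::Destroy();
    MpiInterface::Disable();
    return 0;
}
```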
Collecting the result. PDES network simulators use ghost nodes to obtain global topology knowledge [35]. However, global data knowledge is still not achievable because the application code and the tracing code are separated into different LPs, and each LP cannot see the ongoing traffic of other LPs. As a result, it is difficult to track a flow when it passes through different LPs. To alleviate this issue, we have to either use inter-process communication to pass statistics or temporarily save the results of each LP and manually aggregate them after the simulation is finished. Both of these solutions require extra effort to create lengthy configuration scripts. More importantly, simultaneous events during the simulation can make the results non-deterministic [21], making it difficult to reproduce previous results and debug any issues.
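As an illustration of the "save per-LP results and merge afterwards" workaround, the hedged sketch below has each rank write its own trace file named after its systemId and leaves the cross-LP aggregation to an offline script; RecordFlowSample() is a hypothetical tracing hook, not an ns-3 API.

```cpp
// A hedged sketch of per-LP result collection: each rank dumps its local flow
// statistics to its own file; a separate offline step must merge and re-sort
// them by flow id. RecordFlowSample() is a hypothetical tracing hook.
#include "ns3/core-module.h"
#include "ns3/mpi-interface.h"
#include <fstream>
#include <string>

using namespace ns3;

static std::ofstream g_trace;   // one trace stream per LP (per process)

void RecordFlowSample(uint64_t flowId, uint64_t rxBytes)
{
    // Only traffic visible inside this LP is recorded here; the same flow may
    // appear in other ranks' files and must be aggregated after the run.
    g_trace << Simulator::Now().GetNanoSeconds() << ","
            << flowId << "," << rxBytes << "\n";
}

int main(int argc, char* argv[])
{
    GlobalValue::Bind("SimulatorImplementationType",
                      StringValue("ns3::DistributedSimulatorImpl"));
    MpiInterface::Enable(&argc, &argv);

    uint32_t rank = MpiInterface::GetSystemId();
    g_trace.open("flow-stats-rank-" + std::to_string(rank) + ".csv");

    // ... build the partitioned topology, install apps, hook RecordFlowSample ...

    Simulator::Run();
    Simulator::Destroy();
    g_trace.close();
    MpiInterface::Disable();
    return 0;
}
```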
Carrying out the aforementioned steps will result in significant code changes, as shown in Table 1. The code modification is complex and error-prone since it requires the user to distinguish different LPs and carefully consider the available CPU cores, a proper topology division and the specific data collection strategy. Therefore, we conclude that PDES is not user-friendly on existing network simulators because of the manual partition-and-collection process.
Slow Speedup
Even if users successfully adapt their model to PDES, the speedup is still unsatisfactory. To find the performance bottleneck of PDES, we divide the total running time \(T\) of a single LP into three parts: processing time \(P\), synchronization time \(S\), and messaging time \(M\), so that \(T = P + S + M\). \(P\) counts the time when the LP is processing events from its FEL. \(S\) counts the time when the LP is waiting for other LPs after event processing is finished². \(M\) counts the time when the LP is receiving events from other LPs.
We insert time profiling code into the ns-3 implementations of the two PDES algorithms discussed in §2.3, and record the \(P\), \(S\) and \(M\) of each LP. Through in-depth profiling, we make the following three important observations.
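The decomposition can be instrumented with plain wall-clock timers. The following is a minimal sketch of such profiling; ProcessLocalEvents(), WaitForOtherLps() and ReceiveRemoteEvents() are hypothetical stand-ins for the corresponding phases inside the distributed simulator implementation.

```cpp
// A minimal sketch of the wall-clock decomposition T = P + S + M.
// ProcessLocalEvents(), WaitForOtherLps() and ReceiveRemoteEvents() are
// hypothetical placeholders for the simulator's internal phases.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct LpProfile
{
    std::chrono::nanoseconds p{0};  // P: processing events from the local FEL
    std::chrono::nanoseconds s{0};  // S: waiting for other LPs at the barrier
    std::chrono::nanoseconds m{0};  // M: receiving events from other LPs
};

template <typename F>
std::chrono::nanoseconds Timed(F&& f)
{
    auto start = Clock::now();
    f();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - start);
}

// Called once per synchronization round; summing p, s and m over all rounds
// gives the per-LP breakdown of the total running time T.
void ProfileOneRound(LpProfile& prof)
{
    prof.p += Timed([] { /* ProcessLocalEvents();  */ });
    prof.s += Timed([] { /* WaitForOtherLps();     */ });
    prof.m += Timed([] { /* ReceiveRemoteEvents(); */ });
}
```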
Observation 1. The synchronization time gradually dominates as the traffic inhomogeneity increases.
We run a \(k = 8\) fat-tree with 100Gbps link bandwidth and 3µs link delay for 0.1 seconds of simulated time. We divide the topology symmetrically according to Figure 3, where each cluster is an LP. Figure 5a shows the time composition in this experiment. Note that we omit \(M\) in Figure 5a since it takes less than 5% of \(T\). We find that as the traffic inhomogeneity increases, \(S\) becomes a bottleneck for both algorithms. Under an extremely skewed traffic pattern, \(S\) can take up over 70% of the total execution time. This is because the LP of the victim host becomes the slowest one in the incast scenario; other LPs must wait for it to finish before entering the next round.
Note
Obviously, the "victim host" here refers to the host that is primarily targeted by the "extremely skewed traffic".
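The waiting behind Observation 1 follows directly from the structure of a barrier-synchronized round loop. The sketch below is schematic (ProcessEventsUntil() and ExchangeRemoteEvents() are placeholders), but it shows why every LP's pace is set by the slowest one in each round.

```cpp
// Schematic sketch of a conservative, barrier-synchronized round loop.
// ProcessEventsUntil() and ExchangeRemoteEvents() are placeholders for the
// simulator's internals; only the synchronization structure matters here.
#include <mpi.h>

static void ProcessEventsUntil(double /*t*/) { /* drain the local FEL (stub) */ }
static void ExchangeRemoteEvents()           { /* deliver cross-LP packets (stub) */ }

void RunBarrierSynchronized(double endTime, double lookahead)
{
    double now = 0.0;
    while (now < endTime)
    {
        // P: in an incast, the victim host's LP has far more events in this
        // window than anyone else and therefore reaches the barrier last.
        ProcessEventsUntil(now + lookahead);

        // S: all other LPs block here until the slowest LP arrives.
        MPI_Barrier(MPI_COMM_WORLD);

        // M: cross-LP events for the next window (reported above to be <5% of T).
        ExchangeRemoteEvents();

        now += lookahead;
    }
}
```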
Observation 2. The processing time of each LP is highly unbalanced within a transient time window, even if the traffic pattern is balanced at the macro level.
Even if the traffic pattern is load-balanced, \(S\) still takes about 20% of the total time. To figure out why, we measure \(P\) and \(S\) in each round of the barrier synchronization algorithm under load-balanced traffic. Interestingly, we find that the processing time of each LP in each round is highly unbalanced, resulting in long synchronization time within a transient time window, as Figure 5b shows. This implies that even heuristic static partition methods that try to produce a balanced partition based on load estimation still incur considerable synchronization time.
Observation 3. The synchronization time is long for low-latency, high-bandwidth and large-scale networks.
When simulating data center networks, the low propagation delay of the links makes the time window between two synchronization barriers smaller. Combined with their high bandwidth and traffic throughput, the synchronization time \(S\) caused by load variation within a transient time window becomes huge. This is demonstrated by Figure 5c, where we simulate a 10Gbps fat-tree with different link delays, and Figure 5d, where every host sends a total of 128Gbps of traffic in fat-trees with 30µs link delay and different link bandwidths. In addition, the large scale of such data center networks produces more LPs, further increasing the synchronization time. Based on our experience, the \(S\) ratio can reach over 50% for a 100Gbps, \(k = 16\) fat-tree with 1µs link delay under balanced traffic.
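As a rough, back-of-the-envelope illustration (assuming, as in barrier synchronization, that the interval between two barriers is bounded by the minimum link delay), the number of synchronization rounds grows inversely with the link delay:

\[
N_{\text{rounds}} \approx \frac{t_{\text{sim}}}{d_{\text{link}}},\qquad
\frac{0.1\,\text{s}}{3\,\mu\text{s}} \approx 3.3\times 10^{4}
\quad\text{vs.}\quad
\frac{0.1\,\text{s}}{1\,\mu\text{s}} = 10^{5}.
\]

Each round pays the barrier cost plus that window's load imbalance, and at 100Gbps a 1µs window carries only about 12.5KB per link (a handful of MTU-sized packets), so per-window event counts fluctuate sharply.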
Through the above discussions, we find that the existing synchronization mechanisms based on static partition do not fit well with bursty, dynamically unbalanced traffic or with today's high-bandwidth, low-latency, large-scale networks. Together with the manual partition process discussed in §3.1, we conclude that static manual partition is the root cause of both the complex configurations and the slow speedup.
Note
When simulating DCNs, we run into:
- a small time window, hence high load pressure
- a large network scale, hence high load pressure
PS: Low latency is a hallmark of DCNs and a very important network requirement, but here it instead drastically shrinks the per-round "host processing time", causing the load to surge.