
Unison Design

This section is very, very important ‼️

In this section, we introduce a new network simulation kernel, Unison, to address the limitations of existing PDES algorithms in §3. The design objectives of Unison are:

  • Parallel-efficient. The new kernel should reduce the synchronization time efficiently and be applicable to different traffic patterns and different topologies.

  • User-transparent. The new kernel should be transparent to users for topology partition and result aggregation. It should also let users flexibly specify the number of processor cores to be used for speedup, with zero configuration.

We first provide an overview of the workflow of Unison. Then we discuss the fine-grained partition algorithm and the load-adaptive scheduling algorithm that enable its parallel-efficient performance and user-transparent characteristics.


System Overview

To achieve the design objectives, Unison decouples the relationship between LPs and processor cores. Unison then performs the fine-grained partition (§4.2) and load-adaptive scheduling (§4.3) workflow illustrated in Figure 6.



The passage below is quite important; read it carefully. (All my notes for this part are written directly in the paper.)

In this workflow, Unison first performs a 2-dimensional partition automatically for user transparency. This divides the topology into a set of LPs, and divides the simulated time into multiple rounds, within which LPs can be processed in parallel according to a modified barrier synchronization algorithm. For parallel efficiency, Unison divides the topology with fine granularity. Fine-grained partition benefits the scheduler by creating more LPs of smaller sizes, making a balanced schedule more likely. Meanwhile, since LPs have to process their FEL in time order, fine-grained partition can reorder spatially related but temporally isolated simulated packets of a node by extracting them from a larger LP and grouping them to be processed together in a smaller LP, increasing cache affinity. In network simulation, a simulated flow corresponds to consecutive simulated packets (i.e., events) within a time window. Inspired by the cache boost that real-world NICs give network stacks and applications by reordering packets [13], grouping these simulated packets can also improve the performance of network simulation. Since the node IDs of consecutive events are not always the same within an LP, each LP frequently bounces between different nodes. As the number of LPs increases, each LP covers a smaller range of nodes, thus decreasing cache misses.


After partitioning, the load balancing job is left for the scheduler to handle. The scheduler uses a thread pool to process decoupled LPs. To address the synchronization time issues discussed in Observation 1, we have to find the optimally balanced workload of each thread, which can be abstracted as the multiway number partitioning problem. Although finding the optimal solution is NP-hard, Unison employs an approximation algorithm that combines the longest-job-first policy with heuristics tailored to network simulation. To address the transient unbalanced workload issue discussed in Observation 2 and Observation 3, the scheduler balances the workload of each thread by assigning LPs to different cores dynamically in each round according to this approximation algorithm.


By combining the techniques above, we directly eliminate the root causes of the PDES limitations discussed in §3. The whole partition and scheduling process is transparent to users, allowing them to easily simulate models in parallel without further configuration. Meanwhile, cache misses are reduced by fine-grained partition, and the mutual waiting time among threads is minimized by load-adaptive scheduling, resulting in efficient parallelization.


Fine-Grained Partition

The passage below is quite important; read it carefully. (All my notes for this part are written directly in the paper.)

The objective of this stage is to automatically divide the simulation model both spatially and temporally. The spatial partition should be efficient for scheduling while preserving parallelism under different link delays. The temporal partition should guarantee causality by utilizing lookahead, while supporting demands specific to network simulation such as dynamic topologies.



Spatial Partition. The partition algorithm is presented in Algorithm 1. The algorithm takes a network topology as input, assigns each node its LP identifier, and outputs the total number of LPs created. First, a lower bound on the lookahead value is set to the median of all link delays in the topology. This is because cutting off links with a low propagation delay value will lead to a small degree of parallelism. The lower bound is set to the median instead of the average to make sure at least half of the links will be cut off for fine granularity. Then, for every link in the topology, if the link delay value is greater than or equal to the lower bound on the lookahead value, the link should be logically cut off. Cutting off a link logically is used to produce LPs. It will not affect the connectivity in the simulation model. Finally, every node in each connected component forms an LP. Note that the links in this algorithm refer to stateless links, which have no state variables associated with them (i.e., point-to-point links and full duplex Ethernet links). This is because two LPs with different simulated clock times may read and change the state variables simultaneously or in a reversed order in real time, violating causality.


Here is a simple illustration of the partition scheme. Assuming the delay of links between hosts and aggregation switches of the topology in Figure 6 (i.e., the links at the bottom) is set to zero, Unison will produce 10 LPs in total. In contrast, for a space-symmetric static partition scheme, the optimal partition result is to cut the topology in half from the middle, yielding only 2 LPs.

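
The spatial partition described above can be sketched in a few lines of Python. This is my own reconstruction from the prose, not the paper's Algorithm 1: the data layout (`(u, v, delay)` link tuples), the function name, and the union-find implementation are all assumptions.

```python
from statistics import median

def spatial_partition(nodes, links):
    """Cut every link whose delay is >= the median delay (the lookahead
    lower bound); each connected component of what remains becomes one LP.
    Returns (lp_of_node, number_of_LPs)."""
    lower_bound = median(delay for _, _, delay in links)

    # Union-find over the links that are kept (delay below the bound).
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, delay in links:
        if delay < lower_bound:  # not cut: u and v stay in the same LP
            parent[find(u)] = find(v)

    # Assign a dense LP identifier to each connected component.
    lp_of, lp_ids = {}, {}
    for n in nodes:
        root = find(n)
        lp_of[n] = lp_ids.setdefault(root, len(lp_ids))
    return lp_of, len(lp_ids)
```

Note how the cut is purely logical: the returned mapping only labels nodes with LP identifiers and does not alter the topology itself.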


In addition to these automatically generated LPs, Unison introduces a public LP to handle global events. Global events are events that can potentially affect all LPs immediately. They are scheduled by users before the simulation begins, or by another global event. Typical global events include changing the topology dynamically, stopping the simulator at a given time, and printing the simulation progress. Existing PDES approaches only support limited global events like stopping the simulator, which are duplicated by every LP to make sure they can coordinate at the same time. For Unison, since memory is shared across LPs, global events have to be handled just once; this is what the public LP is for. The public LP is effectively connected to every other LP with zero delay. This implies that the lookahead value of the public LP is always zero, so only global events with the same timestamp can be handled in the same round.


Temporal Partition. In the temporal dimension, we use a modified barrier synchronization algorithm, which divides the whole simulated time into multiple rounds, within which LPs can be processed in parallel. The window size of each round is calculated using the lookahead value. To accommodate the public LP, the equation for calculating the window size, or LBTS, is modified as:


\(LBTS = \min\big(N_{pub},\ \min_i N_i + lookahead\big)\)  (2)

where \(N_{pub}\) is the timestamp of the next event of the public LP and \(N_i\) is the timestamp of the next event of the \(i\)-th LP. Equation (2) considers the constraint introduced by global events. Since global events can potentially affect all LPs, the current round must be interrupted when the next global event in the public LP occurs.

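
Equation (2) is simple enough to state as a one-liner. This is a sketch with assumed names: `n_pub` for the public LP's next event timestamp, `next_event_ts` for the next-event timestamps of the ordinary LPs.

```python
def next_lbts(n_pub, next_event_ts, lookahead):
    """Equation (2): the current round may extend to the smaller of the
    next global event and the earliest ordinary event plus the lookahead."""
    return min(n_pub, min(next_event_ts) + lookahead)
```

When the public LP has no imminent global event, the usual barrier-synchronization window applies; an early global event shortens the round.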

Moreover, since the topology is no longer static, the lookahead value in Equation (2) can change during the simulation when a link delay changes or a link is added or removed. Therefore, in addition to Equation (2), Unison recomputes the lookahead value if a topology change occurs while processing the public LP.


This mechanism ensures that when the topology changes dynamically, the simulator updates the lookahead value in time, preserving event causality and simulation accuracy.

Load-Adaptive Scheduling

The objective of this stage is to assign LPs to multiple parallel threads \(\mathcal{T}\) while balancing workloads among them, under the assumption that threads run on identical CPU cores (i.e., with the same clock frequency). Each LP must be processed exactly once in each round. Assume that the job size, or processing time, of the \(i\)-th LP in round \(r\) is known in advance and denoted as \(P_{i,r}\). This assumption can be satisfied by the scheduling metrics proposed in this section. A thread \(t \in \mathcal{T}\) in this round has a set of LPs \(\mathcal{L}(t,r)\) assigned to it, so the total processing time of thread \(t\) can be expressed as \(\sum_{i \in \mathcal{L}(t,r)} P_{i,r}\). Our goal is to determine the optimal assignment strategy \(\mathcal{L}(\cdot, r)\) that minimizes the largest total processing time (i.e., minimizes \(\max_{t \in \mathcal{T}} \sum_{i \in \mathcal{L}(t,r)} P_{i,r}\)) for every round \(r\).


Note
  • \(\mathcal{T}\): the set of parallel threads
  • \(P_{i,r}\): job size (processing time) of the \(i\)-th LP in round \(r\)
  • \(\mathcal{L}(t,r)\): the set of LPs assigned to thread \(t\) in round \(r\)
  • Goal: minimize \(\max_{t \in \mathcal{T}} \sum_{i \in \mathcal{L}(t,r)} P_{i,r}\)

Scheduling algorithms. For a given round, the objective can be abstracted as scheduling a set of jobs on \(|\mathcal{T}|\) identical machines such that the makespan is minimized. This is the identical machine scheduling problem, or multiway number partitioning problem, which is NP-hard [17]. One of the approximation algorithms [15] is the longest-job-first policy: schedule the longest job to the thread with the smallest current load. Unison employs this policy and stores LPs in a priority queue in descending order of their estimated processing time \(P_{i,r}\). When a thread is idle or has finished its previously assigned LP, it pops the queue to get the currently longest LP and processes it. With this approach, scheduling \(n\) LPs only takes \(O(n \log n)\) steps, plus the complexity of acquiring \(P_{i,r}\) in each step, with a worst-case approximation ratio of 4/3 [15].
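
The longest-job-first policy can be sketched as below. In Unison, idle threads pop a shared priority queue at runtime; this is a static rendering of the same policy (assign everything up front), and the function and parameter names are my own.

```python
import heapq
from collections import defaultdict

def schedule_round(est_time, num_threads):
    """Longest-job-first: repeatedly hand the LP with the largest
    estimated processing time P_{i,r} to the least-loaded thread.

    est_time: {lp_id: estimated processing time}
    Returns {thread_id: [lp_ids in assignment order]}."""
    jobs = sorted(est_time.items(), key=lambda kv: -kv[1])  # longest first
    loads = [(0.0, t) for t in range(num_threads)]          # min-heap by load
    heapq.heapify(loads)
    assignment = defaultdict(list)
    for lp, p in jobs:
        load, t = heapq.heappop(loads)   # currently least-loaded thread
        assignment[t].append(lp)
        heapq.heappush(loads, (load + p, t))
    return assignment
```

The sort dominates, so scheduling \(n\) LPs costs \(O(n \log n)\), matching the complexity stated above.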

Scheduling metrics. The exact processing time of each LP, or \(P_{i,r}\), is impossible to know in advance. Therefore, estimation is required to use the approximation algorithm. Finding an accurate and low-complexity estimation method becomes a challenge. We propose several heuristic methods suitable for network simulation to estimate this value efficiently.


One way is to evaluate the number of pending events of each LP in the next round, as LPs with more events tend to have a longer processing time. In network simulation, most event scheduling is related to packet transmission. These scheduled events typically have a delay equal to the lookahead, so they fall exactly into the next round, as illustrated in Figure 4. Therefore, we can count the number of events scheduled to be received by each LP in the next round (i.e., pending events), which can be done in linear time. Note that the longest-job-first policy described above only needs the relative order of job sizes, rather than the exact job size of each LP, so this method works.

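
A minimal sketch of this first metric, assuming each LP's future event list is available as a sorted list of timestamps (the representation and names are mine):

```python
from bisect import bisect_left

def pending_event_counts(fels, lbts):
    """For each LP, count the events in its future event list that fall
    before the next round boundary `lbts`; these counts order the LPs
    for the longest-job-first scheduler."""
    return {lp: bisect_left(fel, lbts) for lp, fel in fels.items()}
```

With sorted FELs a binary search suffices per LP; a plain linear scan over the pending events, as the text describes, works just as well.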


A second heuristic is to use the processing time of the previous round, \(P_{i,r-1}\), as an estimate. This works because network simulation programs often have a high degree of temporal locality: the propagation delay, or lookahead, is far less than the transmission time of an entire flow. As a result, the processing time of each LP in consecutive rounds tends to remain stable (although highly unbalanced), as Figure 13a shows. The time complexity of this method is constant, so it is faster than the previous one. Moreover, it is more accurate than the first method and is used by default in Unison when a high-resolution system clock is available. §6.3 further evaluates the performance of different scheduling metrics.

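
The second metric amounts to wrapping each LP's round with a high-resolution timer, a hypothetical sketch (function names and the `estimates` store are assumptions, not Unison's API):

```python
import time

def process_lp_timed(lp, process_events, estimates):
    """Process one LP's events for this round and record the measured
    wall time as its estimate P_{i,r} for the next scheduling decision."""
    start = time.perf_counter()
    process_events(lp)
    estimates[lp] = time.perf_counter() - start
```

Reading the clock twice per LP per round is a constant-time cost, which is why this metric is cheaper than counting pending events.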


Scheduling periods. The scheduler itself introduces performance costs as well. Although acquiring the scheduling metric is optimized to constant time using \(P_{i,r-1}\), sorting \(n\) LPs by their estimated processing time still takes \(O(n \log n)\) steps. When simulating a large network topology, combined with the fine-grained partition scheme, there will be a large number of LPs to schedule. To alleviate this cost, Unison runs the scheduler periodically rather than every round.

Due to the temporal locality of network simulation described above, the processing time of each LP tends to remain stable in consecutive rounds, so the schedule results should also remain stable. To balance scheduling accuracy and scheduling cost, Unison lets the scheduling period grow logarithmically with the number of LPs: a topology with \(n\) LPs has a schedule period of \(\lceil \log_2 n \rceil\) rounds. §6.3 further evaluates the performance impact of different scheduling periods.

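The \(\lceil \log_2 n \rceil\) period rule is a one-liner (the clamp to 1 for tiny topologies is my own assumption):

```python
from math import ceil, log2

def schedule_period(num_lps):
    """Re-run the scheduler every ceil(log2(n)) rounds, so the scheduling
    cost grows only logarithmically with the number of LPs."""
    return max(1, ceil(log2(num_lps)))
```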