APSimAI: Rethinking Dynamic Networks and Heterogeneous Computing with Automatic Parallelization¶
(1) Background & Motivation
- Background: the explosive growth in parameter counts of Transformer-based large language models (LLMs) has made distributed parallel training across GPU clusters indispensable
- Pain points:
    - Heterogeneity: real clusters often mix different GPU models (e.g., RTX4090D alongside V100), and even GPUs of the same model can show subtle performance differences
    - Dynamic Networks: network bandwidth fluctuates, and nodes can fail or slow down, creating communication bottlenecks
- Limitations of existing work: current automatic parallelization search frameworks (e.g., ALPA, Galvatron) typically rest on idealized assumptions and ignore device heterogeneity and dynamic changes in network topology, so they struggle to reach the optimum in real deployments
(2) Methodology
The paper proposes an integrated optimization framework that uses simulation to determine the best parallel configuration.
- Multi-Edge Physical Link Abstraction
    - Problem: conventional models often assume a single bandwidth value, ignoring that multiple physical links (e.g., NVLink and PCIe) coexist and contend with each other
    - Approach: introduce a "multi-edge" design that explicitly models the multiple physical connections between devices and their concurrency or conflict states (e.g., NVLink and PCIe cannot be active at the same time), allowing data-transfer scheduling to be managed more precisely (a minimal sketch follows this outline)
- Operator Splitting and Fusion
    - Splitting:
        - decompose traditional collective communication operations (e.g., All-Reduce) into finer-grained sub-operations (e.g., Reduce-Scatter + All-Gather)
        - this not only removes the single-node bottleneck but also creates more scheduling combinations
    - Fusion: similar to FlashAttention, fuse multiple operators to reduce memory accesses
    - Strategy: split operators and then recombine them, predict execution times in a simulator, and search for the mapping that best fits the heterogeneous hardware
- Parallel Branch-and-Bound Search
    - to cope with the exponentially growing search space, a parallelized branch-and-bound algorithm is proposed
    - heuristic rules and pruning quickly discard infeasible configurations, while the simulator (SimAI) evaluates the costs of candidate strategies in parallel
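To make the multi-edge link abstraction above concrete, here is a minimal Python sketch of how a device pair with several physical links and mutual-exclusion constraints could be represented; all class and field names are hypothetical illustrations, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalLink:
    """One physical edge between two devices (hypothetical structure)."""
    name: str              # e.g. "NVLink", "PCIe"
    bandwidth_gbps: float  # link bandwidth in Gb/s
    busy: bool = False

@dataclass
class MultiEdge:
    """All physical links between a device pair, plus exclusion groups."""
    links: list
    # groups of links that cannot be active at the same time, e.g. {"NVLink", "PCIe"}
    exclusive_groups: list = field(default_factory=list)

    def available_links(self):
        active = {l.name for l in self.links if l.busy}
        usable = []
        for link in self.links:
            if link.busy:
                continue
            # skip a link if some currently active link shares an exclusion group with it
            conflicted = any(
                link.name in group and active & (group - {link.name})
                for group in self.exclusive_groups
            )
            if not conflicted:
                usable.append(link)
        return usable

    def transfer_time_s(self, size_gb: float) -> float:
        """Estimated time to move `size_gb` over the best currently free link."""
        usable = self.available_links()
        if not usable:
            return float("inf")  # all links busy or blocked by contention
        best = max(usable, key=lambda l: l.bandwidth_gbps)
        return size_gb * 8 / best.bandwidth_gbps  # GB -> Gb, divided by Gb/s

# Example: GPU0 <-> GPU1 connected by NVLink and PCIe that cannot be used concurrently.
edge = MultiEdge(
    links=[PhysicalLink("NVLink", 300.0), PhysicalLink("PCIe", 64.0)],
    exclusive_groups=[{"NVLink", "PCIe"}],
)
print(edge.transfer_time_s(1.0))  # time to move 1 GB over the fastest free link
```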
Introduction¶
The rapid growth in parameter count of deep neural networks (DNNs), especially large language models (LLMs)[30][2][34][29][12][24] based on the Transformer[36] architecture, has made distributed parallel training across large GPU clusters indispensable. Thus, efficient implementation of distributed training is critical. Researchers have proposed various parallel strategies[27][31][32], including Data Parallelism (DP)[23], Tensor Parallelism (TP)[27], Pipeline Parallelism (PP)[14][9][22][25][10], Sequence Parallelism (SP)[19], and Fully Sharded Data Parallelism (FSDP)[41], to address computational, storage, and communication challenges in training large models. However, selecting appropriate parallel strategies in practical large-scale clusters typically requires extensive manual tuning. While existing automatic search frameworks, such as ALPA[42], AMP[21], Metis[35], and Galvatron[26], offer some degree of automation, their deployment in real-world scenarios is limited due to overly idealized assumptions.
This paper aims to address the problem of selecting distributed parallel strategies more effectively for realistic scenarios. Our core insight is that computation can be viewed as mapping data and algorithms onto computational devices, while communication corresponds to data transmission tasks across network links. Specifically, by selecting suitable parallel strategies, computational tasks can be efficiently assigned to heterogeneous computing devices and communicated through network links. However, due to device performance heterogeneity and dynamic network conditions, the actual execution time of tasks typically exhibits significant uncertainty.
The uncertainty in task execution time arises mainly from two aspects: first, variability in computation and communication times caused by heterogeneous device performance and fluctuations in network bandwidth; second, additional variation introduced by operator splitting and fusion. For example, operator fusion reduces memory accesses and thus shortens execution time, while operator splitting can exploit idle computational resources, also reducing execution time. Additionally, decomposing traditional collective communication operations such as all-reduce into reduce-scatter and all-gather can significantly enhance communication efficiency.
To address these practical challenges, we propose an integrated optimization framework combining strategic operator splitting and fusion, an adaptive task scheduling strategy based on parallelized branch-and-bound search, and resource management strategies tailored for heterogeneous computational environments and dynamic network conditions. We validate our framework using SimAI[37], an existing performance prediction model, and demonstrate significant performance improvements over mainstream frameworks. Specifically, our contributions include:
• A novel multi-edge physical link abstraction model that more accurately describes heterogeneous device connectivity characteristics and link contention conditions;
• A parallelized branch-and-bound optimization algorithm that systematically searches task scheduling strategies, significantly improving task execution efficiency (a minimal sketch follows this list);
• Preliminary experimental validation using SimAI, indicating the potential of our method to outperform existing mainstream frameworks under heterogeneous computational environments and dynamic network conditions.
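As a rough, self-contained illustration of the parallelized branch-and-bound idea (a sketch, not the paper's algorithm), the Python snippet below enumerates (dp, tp, pp) degrees whose product equals the GPU count, prunes candidates whose optimistic lower bound cannot beat the best cost found so far, and evaluates the survivors in parallel. `simulate_cost` is a toy stand-in for a SimAI-style simulator query, not its real API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
import math

NUM_GPUS = 16

def simulate_cost(dp, tp, pp):
    """Toy stand-in for a simulator-backed cost model (e.g. a SimAI query):
    an analytic formula so the example is self-contained."""
    compute = 100.0 / (dp * tp * pp)          # ideal speedup from parallelism
    comm = 2.0 * (tp - 1) + 1.0 * (dp - 1)    # toy communication penalties
    bubble = 5.0 * (pp - 1) / pp              # toy pipeline-bubble overhead
    return compute + comm + bubble

def lower_bound(dp, tp, pp):
    """Optimistic bound used for pruning: compute term only, no overheads."""
    return 100.0 / (dp * tp * pp)

def branch_and_bound(num_gpus=NUM_GPUS, batch=8):
    best_cost, best_cfg = math.inf, None
    degrees = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    candidates = sorted(
        (cfg for cfg in product(degrees, repeat=3)
         if cfg[0] * cfg[1] * cfg[2] == num_gpus),
        key=lambda c: lower_bound(*c),
    )
    with ThreadPoolExecutor(max_workers=batch) as pool:
        for i in range(0, len(candidates), batch):
            # prune: drop candidates whose optimistic bound cannot beat the incumbent
            survivors = [c for c in candidates[i:i + batch]
                         if lower_bound(*c) < best_cost]
            costs = list(pool.map(lambda c: simulate_cost(*c), survivors))
            for cfg, cost in zip(survivors, costs):
                if cost < best_cost:
                    best_cost, best_cfg = cost, cfg
    return best_cfg, best_cost

print(branch_and_bound())   # e.g. ((dp, tp, pp), cost) for a 16-GPU toy setup
```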
Background and Motivation¶
This section addresses critical challenges faced in realistic distributed training environments, specifically illustrated by the scenarios depicted in Figure 1. In practical GPU clusters, several factors significantly impact overall training efficiency and robustness: (1) heterogeneous GPU setups combining diverse device types, (2) unbalanced network bandwidth causing performance bottlenecks, and (3) node failures resulting in computational disruptions.
In Section 2.1, we first examine the impact of GPU performance heterogeneity on overall system throughput and discuss predictive performance modeling approaches. In Section 2.2, we analyze dynamic network conditions, emphasizing the necessity for adaptive bandwidth management and fault-tolerance mechanisms. Lastly, in Section 2.3, we explore strategic operator fusion and splitting methods, highlighting their potential to effectively mitigate performance degradation and improve resource utilization under these challenging conditions.

2.1 Performance Heterogeneity¶
Performance heterogeneity refers to variations in computational speed and capabilities among devices of the same type. Even if all nodes within a cluster employ GPUs that share the same instruction set (e.g., CUDA), significant performance disparities can still exist due to differences in micro-architecture or hardware generation.
The Roofline Model [38] is commonly used to analyze and predict computational system performance. It characterizes the attainable performance of a system using the following equation:
\[
\text{Perf} = \min\left(\text{FLOPs}_p,\; K \times \text{memBW}_p\right),
\]
where \(\text{FLOPs}_p\) is the peak floating-point operations per second and \(\text{memBW}_p\) is the peak memory bandwidth of the GPU. The term \(K\) represents the arithmetic intensity, defined as the number of floating-point operations per memory access, computed as:
\[
K = \frac{\text{number of floating-point operations}}{\text{number of bytes accessed}}.
\]
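A quick numerical sketch of this model in Python; the peak values below are rough, publicly quoted ballpark figures for FP16 tensor-core throughput and memory bandwidth, used purely for illustration rather than as exact specifications.

```python
def roofline_perf(flops_peak, mem_bw_peak, arithmetic_intensity):
    """Attainable performance (FLOP/s) = min(peak compute, K * peak memory BW)."""
    return min(flops_peak, arithmetic_intensity * mem_bw_peak)

# Illustrative ballpark peaks (FP16 tensor-core FLOP/s and memory bandwidth);
# not exact vendor specifications.
gpus = {
    "H100": {"flops": 1.0e15, "mem_bw": 3.35e12},   # ~1 PFLOP/s, ~3.35 TB/s
    "V100": {"flops": 1.25e14, "mem_bw": 0.9e12},   # ~125 TFLOP/s, ~0.9 TB/s
}

for K in (1, 8, 64, 512):            # arithmetic intensity in FLOPs per byte
    for name, p in gpus.items():
        perf = roofline_perf(p["flops"], p["mem_bw"], K)
        print(f"K={K:4d}  {name}: {perf / 1e12:8.1f} TFLOP/s attainable")
```

At low arithmetic intensity both GPUs are memory-bound and the gap tracks their bandwidth ratio; once the compute roof is reached, throughput flattens at a constant value, matching the behavior discussed around Figure 2.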
Figure 2 illustrates throughput differences between H100 and V100 GPUs executing the same attention kernel. We observe significant computational capability differences between these GPUs. Once the computational load reaches a certain threshold, the GPU throughput stabilizes at a constant value.
However, the Roofline Model has limitations in accurately modeling fused operators and operations with explicitly specified resource usage, as actual GPU execution times are heavily influenced by specific hardware attributes. Prior research [40, 20] has employed Multi-Layer Perceptrons (MLPs) to estimate GPU performance, treating performance as a nonlinear, multivariate function. In such scenarios, traditional optimization methods like Integer Linear Programming (ILP) and Dynamic Programming (DP) struggle to effectively map variable-length operators onto devices: ILP assumes a linear objective and DP relies on a decomposable cost structure, neither of which holds for such nonlinear, non-convex performance functions.
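In the same spirit as the MLP-based estimators cited above, the sketch below fits a small scikit-learn regressor that maps operator features (here just matmul dimensions and a device flag) to runtime; the feature set and synthetic training data are placeholder assumptions, not the setup used in [40, 20].

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic "measurements": features = (M, N, K, device_id), target = runtime.
# In practice these would come from profiling real kernels on each GPU type.
def fake_runtime(m, n, k, device):
    flops = 2 * m * n * k
    peak = 1e14 if device == 0 else 2.5e13      # toy peak throughputs for two device types
    return flops / peak * (1 + 0.05 * rng.standard_normal())

X, y = [], []
for _ in range(2000):
    m, n, k = rng.integers(64, 4096, size=3)
    dev = rng.integers(0, 2)
    X.append([m, n, k, dev])
    y.append(fake_runtime(m, n, k, dev))

# Log-transform inputs and targets so the nonlinear relation is easier to fit.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(np.log1p(np.array(X, dtype=float)), np.log1p(y))

# Predict the runtime of a new (M, N, K) matmul on device 1.
pred = np.expm1(model.predict(np.log1p([[2048, 2048, 2048, 1]])))
print(f"predicted runtime: {pred[0]:.4f} s")
```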
In contrast, simulators precisely predict CUDA kernel execution times, providing accurate operator execution time estimates. Moreover, simulators can concurrently evaluate execution times for multiple scheduling strategies, significantly accelerating the identification of optimal parallelization strategies.
GPU Performance Heterogeneity
- Observation: even when GPUs share the same instruction set (e.g., CUDA), computational speed and capability can differ markedly because of micro-architecture or hardware-generation differences
- Limitations of existing models:
    - the traditional Roofline Model struggles to accurately capture fused operators or operations with explicitly specified resource usage
    - GPU performance typically behaves as a nonlinear, multivariate function, an optimization problem over a non-convex space that traditional Integer Linear Programming (ILP) and Dynamic Programming (DP) methods handle poorly
- Takeaway: simulators can predict CUDA kernel execution times more precisely and evaluate multiple scheduling strategies concurrently, making them the more effective tool for finding the best parallel strategy
2.2 Dynamic Networks¶
Dynamic networks are characterized by topological changes over time, contrasting with static networks that maintain constant nodes and edges. Formally, a dynamic network can be modeled as a temporal graph, represented by a sequence \(G(0), G(1), \ldots, G(t)\), or a time-dependent edge set \(E(t)\).
Efficient parallel training of large language models (LLMs) across multiple GPUs inherently faces dynamic network conditions. Communication bandwidth fluctuates due to hardware limitations or network congestion, while long-running tasks frequently experience node slowdowns or failures.
2.2.1 Dynamic Bandwidth Variations. In practical distributed training scenarios, the available bandwidth among nodes and within nodes frequently fluctuates rather than remaining constant. Such variations stem from multi-tenant datacenter networks, hardware bottlenecks, and background workloads. However, current distributed training frameworks[23][1][33][17] typically cause GPUs with higher-bandwidth lanes to idle, waiting for GPUs with lower-bandwidth lanes to complete data transmission, despite similar computational capabilities.
2.2.2 Dynamic Node and Interconnect Adjustments. In long-running, large-scale training tasks, node failures or temporary disconnections are inevitable. Traditional approaches typically halt training upon encountering node failures, reloading from checkpoints, and restarting new nodes, resulting in substantial downtime and wastage of computational resources. Recent research has emphasized fault-tolerance capabilities, which enable distributed training systems to operate continuously despite node additions or removals. For instance, ReCycle[11] leverages the redundancy inherent in data-parallel training by dynamically reallocating workloads from failed nodes to the remaining active nodes, avoiding delays from node replacement. Oobleck[15] proactively computes pipeline-parallel configurations optimized for varying numbers of nodes, seamlessly transitioning to smaller-scale configurations upon node removal, thereby eliminating the need for retraining.
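To make the temporal-graph view \(E(t)\) concrete, the sketch below keeps a time-indexed bandwidth table per link and integrates the time-varying rate to find when a transfer finishes, which is enough to see how a congested link gates the whole step; link names, window lengths, and bandwidth numbers are illustrative assumptions.

```python
# Bandwidth of each link as a step function of time (GB/s per 10 s window).
# A dynamic network is just an edge set whose weights depend on t.
link_bw_over_time = {
    ("node0", "node1"): [25.0, 25.0, 10.0, 10.0, 25.0],   # congested in windows 2-3
    ("node0", "node2"): [25.0, 25.0, 25.0, 25.0, 25.0],
}
WINDOW_S = 10.0

def bandwidth(link, t):
    """Bandwidth (GB/s) of `link` at time t, with the last window extended indefinitely."""
    table = link_bw_over_time[link]
    idx = min(int(t // WINDOW_S), len(table) - 1)
    return table[idx]

def transfer_finish_time(link, start_t, size_gb):
    """Integrate the time-varying bandwidth until `size_gb` has been sent."""
    t, remaining = start_t, size_gb
    while remaining > 1e-9:
        bw = bandwidth(link, t)
        window_end = (int(t // WINDOW_S) + 1) * WINDOW_S
        sendable = bw * (window_end - t)
        if sendable >= remaining:
            return t + remaining / bw
        remaining -= sendable
        t = window_end
    return t

# A data-parallel step is gated by the slowest link (the straggler link).
gradients_gb = 600.0
finishes = {l: transfer_finish_time(l, 0.0, gradients_gb) for l in link_bw_over_time}
print(finishes)   # node0->node1 finishes later because of its congested windows
```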
Dynamic Networks
The network during large-model training changes over time (a temporal graph), mainly in two respects:
- Dynamic Bandwidth Variations:
    - bandwidth between and within nodes fluctuates frequently due to multi-tenant environments, hardware bottlenecks, or background workloads
    - existing frameworks often leave high-performance GPUs idle, waiting for low-bandwidth links to finish transmission (the weakest-link, or straggler, effect)
- Dynamic Node/Interconnect Adjustments:
    - node failures are unavoidable in long-running training
    - traditional practice: stop training on failure, reload from a checkpoint, and restart, which causes long downtime and wasted resources
    - newer trend: fault tolerance (e.g., ReCycle, Oobleck), which redistributes load or adjusts the configuration dynamically when nodes are added or removed, so training continues without starting over
2.3 Operation Fusion and Split¶
Modern machine learning frameworks [23][41][3][16] typically accelerate computation by forming efficient fused kernels through the fusion of multiple consecutive operators, thereby reducing data movements from external memory. A classic example is Flash Attention[8], which combines originally independent operations such as matmul, dropout, softmax, and mask into a single fused kernel, significantly shortening execution time.
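To show at the algorithm level what fusion buys, the NumPy sketch below computes single-query attention over key/value blocks with an online (streaming) softmax, so the full score vector is never materialized at once; this is a simplified illustration of the FlashAttention idea, not its actual fused CUDA kernel.

```python
import numpy as np

def attention_reference(q, K, V):
    """Unfused reference: materializes the full score vector before softmax."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def attention_streaming(q, K, V, block=64):
    """Fused-style variant: processes K/V in blocks with an online softmax,
    keeping only a running max, running normalizer, and running output."""
    d = q.shape[-1]
    running_max = -np.inf
    normalizer = 0.0
    out = np.zeros(V.shape[-1])
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = q @ k_blk.T / np.sqrt(d)
        blk_max = scores.max()
        new_max = max(running_max, blk_max)
        scale = np.exp(running_max - new_max)     # rescale old statistics to the new max
        weights = np.exp(scores - new_max)
        out = out * normalizer * scale + weights @ v_blk   # unnormalized accumulator
        normalizer = normalizer * scale + weights.sum()
        out /= normalizer
        running_max = new_max
    return out

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=(64,)), rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
print(np.allclose(attention_reference(q, K, V), attention_streaming(q, K, V)))  # True
```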
In contrast to operator fusion, distributed computations often utilize an operator-splitting strategy, exemplified by decomposing the standard All-Reduce operation into two sub-operations: Reduce-Scatter and All-Gather. As illustrated in Figure 3, the traditional All-Reduce aggregates gradients fully at a single node before broadcasting the result to all other nodes. The decomposed approach, however, first partitions and aggregates gradients across nodes via the Reduce-Scatter step, and subsequently disseminates these partial aggregation results to all nodes through the All-Gather step. This decomposition effectively eliminates single-node bottlenecks and enhances overall communication efficiency.
Moreover, the process of splitting and recombining operators introduces additional opportunities for optimization in parallel computation. By decomposing operators into smaller sub-operations and recombining them in novel configurations, new operators with varying execution characteristics emerge. Searching through these configurations enables identification of optimal mappings onto heterogeneous hardware, thus effectively mitigating the previously discussed straggler effect, and ultimately leading to more balanced workload distribution and improved overall performance.
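The decomposition described above can be checked in a few lines of NumPy: the sketch emulates Reduce-Scatter (each rank ends up with the sum of one gradient shard) followed by All-Gather (every rank collects all shards) and verifies that the result matches a naive aggregate-then-broadcast All-Reduce; the array slicing here merely stands in for real collective communication calls.

```python
import numpy as np

def all_reduce_naive(grads):
    """Aggregate everything at rank 0, then broadcast (the single-node bottleneck)."""
    total = np.sum(grads, axis=0)
    return [total.copy() for _ in grads]

def reduce_scatter_all_gather(grads):
    """Decomposed variant: every rank reduces one shard, then shards are gathered."""
    world = len(grads)
    shards = [np.array_split(g, world) for g in grads]          # split each rank's gradient
    # Reduce-Scatter: rank r sums shard r from every rank (work is spread across ranks).
    reduced = [sum(shards[src][r] for src in range(world)) for r in range(world)]
    # All-Gather: every rank collects all reduced shards.
    full = np.concatenate(reduced)
    return [full.copy() for _ in range(world)]

rng = np.random.default_rng(0)
grads = [rng.normal(size=16) for _ in range(4)]                 # 4 ranks, toy gradients
a = all_reduce_naive(grads)
b = reduce_scatter_all_gather(grads)
print(all(np.allclose(x, y) for x, y in zip(a, b)))             # True: same result
```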
Operation Fusion and Split
- Operator Fusion:
    - fusing consecutive operators (as in Flash Attention) reduces accesses to external memory and thereby significantly shortens execution time
- Operator Splitting:
    - canonical example: decompose the standard `All-Reduce` into two sub-steps, `Reduce-Scatter` and `All-Gather`
        - the traditional `All-Reduce` first fully aggregates the gradients on a single node and then broadcasts the result to all nodes
        - the decomposed variant first aggregates gradient shards across nodes via `Reduce-Scatter`, then propagates these partial aggregation results to all nodes via `All-Gather`
    - benefit: this decomposition removes the single-node communication bottleneck and improves overall communication efficiency

    ![image](https://i-blog.csdnimg.cn/img_convert/c53ccf47b706e168f8a288769a78d686.png)

- Strategic significance:
    - splitting and recombining operators creates new possible operator configurations
    - this lets the system search for better task mappings on heterogeneous hardware, effectively mitigating the aforementioned straggler effect and balancing the workload
Stopping here: this area is quite unfamiliar to me and has essentially nothing to do with what I follow.