SaTE: Low-Latency Traffic Engineering for Satellite Networks¶
This paper explores traffic engineering (TE) for large-scale Low-Earth-Orbit satellite constellations. While there is rich prior work on TE algorithms for global cloud wide-area networks (WANs), they are designed for static network topologies and often require significant computation time for largescale networks. Such limitations make existing WAN TE algorithms unsuitable for large-scale satellite networks which rapidly change topology and require computing optimal traffic allocation under stringent latency constraints.
We present SaTE, a low-latency TE algorithm for largescale satellite networks, computing traffic allocation at millisecond latency. SaTE formulates a heterogeneous graph to model the TE problem, adapting to dynamic satellite topologies. By removing redundant graph relations, SaTE reduces computational latency, allowing the graph to be efficiently learned by a graph neural network that leverages GPUs to rapidly infer traffic allocations. SaTE also exploits the similarity of satellite network topologies and the geospatial distribution of traffic demands to facilitate model training. We evaluate SaTE through extensive data-driven simulation on today’s largest satellite constellation, Starlink with 4236 satellites. Our results show over a 23.5% improvement in satisfied demand with an average TE runtime of 17 𝑚𝑠, achieving a 2738× speedup compared to commercial solvers.
本文探讨了面向大规模低地球轨道(LEO)卫星星座的流量工程(TE)技术。尽管针对全球云广域网(WAN)的TE算法已有丰富的研究,但这些算法专为静态网络拓扑设计,在处理大规模网络时通常需要大量的计算时间。这些局限性使得 现有的WAN TE算法不适用于拓扑结构快速变化、且需要在严格延迟约束下计算最优流量分配的大规模卫星网络。
我们提出了SaTE,一种面向大规模卫星网络的低延迟TE算法,它能够以毫秒级的延迟计算流量分配。
- SaTE构建了一个异构图来对TE问题进行建模,以适应动态的卫星拓扑
- 通过移除图中冗余的关系,SaTE降低了计算延迟,使得图神经网络(GNN)可以高效地学习该图,并利用GPU(图形处理单元)快速推断出流量分配方案
- SaTE还利用卫星网络拓扑的相似性以及流量需求的地理空间分布特性来简化模型训练
我们通过在当今最大的卫星星座——拥有4236颗卫星的“星链”(Starlink)上进行广泛的数据驱动仿真,对SaTE进行了评估。我们的结果表明,SaTE在平均17毫秒的TE运行时间内,使需求满足量提升了超过23.5%,与商业求解器相比,实现了2738倍的加速。
Introduction¶
Satellite networks have experienced rapid growth in recent years [3], providing high-speed internet [66], direct-to-phone services [12], as well as IoT connectivity for applications like global asset tracking [21] and precision agriculture [41]. The scale of these satellite networks has significantly evolved; for example, Starlink, serving 3 million users, has deployed over 6,000 satellites and plans to expand to 12,000 [50].
As constellations grow and traffic demands from users surge, satellite networks must carefully allocate traffic between satellite nodes to enable high-throughput services. Existing traffic allocation methods for satellite networks, such as congestion avoidance [16, 56, 64] and Quality of Service (QoS) constraints [4, 14, 46] based routing, often create congestion and hotspots in the network due to their distributed nature [17, 51]. Similar issues in terrestrial WANs have driven recent advancements in centralized traffic engineering (TE) algorithms [1, 29, 55, 61, 73, 78], which computes traffic allocation along preconfigured paths as a global optimization problem to achieve goals like maximizing overall throughput. While software-defined networking (SDN) for satellites paves the way for applying these TE algorithms to satellite networks [5, 43, 83], the dynamic nature of satellite networks presents challenges to existing TE algorithms.
(1) Frequent Topology Changes: Unlike the static topology of terrestrial WANs, the topology of LEO satellite networks (e.g., Starlink) changes frequently, ranging from seconds to as fast as every 70 ms (Sec. 2.3.1). These changes make configured network paths quickly obsolete, often outpacing even a few seconds of TE computation time (Sec. 2.3.2), causing considerable throughput degradation. (2) Traffic Fluctuations: user-facing network traffic experiences subsecond dynamics [23], while traffic predictions can naturally be erroneous [61]. These factors necessitate real-time reaction from TE systems to handle instantaneous traffic fluctuations. However, existing TE algorithms cannot meet both dynamics. Commercial solvers [24] and heuristic methods [1, 29, 55, 73] exhibit polynomial or higher complexity in TE computation as network scale increases, taking minutes to compute—too slow to keep up with fluctuating traffic. Recent learning-enabled TE solutions provide fast TE computation on static WANs [61, 78], but their trained models are tied to a single topology and require hours of re-training for each change—impractical for the vast number of varying topologies in satellite networks. Thus, there remains a gap for a low-latency TE solution adaptable to fast-changing topologies and dynamic traffic in satellite networks.
We present SaTE, a low-latency TE algorithm designed for dynamic satellite constellations. At the core of SaTE is a novel satellite TE graph design that models the entire TE problem, enabling the use of graph neural network (GNN) layers alone to rapidly compute TE and adapt to frequent changes in topology and traffic. In addition, SaTE facilitates its model training by exploiting the structural similarity of satellite topologies and the geospatial distribution of user-facing traffic demands, significantly reducing the size of the training dataset while generalizing well to unseen topologies. Our detailed evaluation on a simulated 4236-nodes Starlink constellation, reveals that SaTE achieves a 2738× improvement in computational latency compared to commercial solvers [24].
SaTE aims to compute TE for large-scale satellite networks with low latency. Existing GNN-based TE [78] relies on graph designs that capture only partial TE components (e.g., link connectivity) and require dense neural networks (DNNs) layers to model essential relations like traffic demands and path associations. However, DNN layers require fixed input dimensions, fail to generalize to unseen topologies, necessitating frequent retraining. To address this, SaTE formulates a heterogeneous graph 1 that models the entire TE problem. This design eliminates the need for DNN layers, allowing GNNs to learn and solve TE entirely. This approach generalizes to unseen topologies and accommodates dynamic changes in key satellite TE components, such as varying input dimensions for network paths and traffic demands. SaTE also reduces computational latency by removing redundant relations from the graph. The simplified graph is then used to build an attention-enabled GNN for traffic allocation.
Additional considerations in SaTE are training feasibility and effectiveness. For constellations with thousands of nodes, each training data point and its corresponding graph require memory (e.g., 335 GB for Starlink) far exceeding GPU capacity [59]. Moreover, distributing the training of a single graph across multiple GPUs with performance guarantees remains an open challenge [79, 82]. SaTE addresses this issue by pruning non-contributory elements from the TE input, leveraging the sparsity in global distribution of satellite users [7, 49, 69], as satellites over sparsely populated areas (e.g., deserts) often have no traffic demand, leaving many paths idle. This input pruning is enabled by our graph design, which allows the use of GNN layers alone, unlike prior works relying on DNNs that require fixed-sized inputs. Another challenge is the vast number of varying topologies for Starlink. Covering all possible topologies risks overfitting [81] and high training overhead. Instead, SaTE explores the similarity of satellite topologies—while the satellites’ geometric positions change as they orbit the Earth, certain network topologies remain similar or isomorphic. Thus, we develop a topology pruning method that reduces the number of training data points required while maintaining SaTE’s performance. By combining these pruning methods, SaTE condenses the dataset size, making training feasible while reducing training overhead.
We evaluate the feasibility of SaTE through extensive data-driven simulations on Starlink [18], with 4236 satellites 2 . Our evaluation includes scenarios where Starlink satellites are solely linked with space lasers (as they currently operate with over 10000 lasers [37, 66]) and scenarios where they are partially linked with ground relays. We generate traffic matrices for 3 million satellite users globally, with fluctuating traffic across services such as voice, video, and file transfer, and varying overall traffic loads (e.g., arrival intensity). The trained model of SaTE is evaluated on unseen topologies and traffic flows. Our results reveal that:
• SaTE achieves 17 𝑚𝑠 computational latency 3 in solving Starlink’s TE problems, improving by 2738× over commercial solvers [24].
• SaTE satisfies at least 11.0% more traffic demand in Starlink, outperforming state-of-the-art methods [24, 35, 55, 78] by over 23.5%.
• SaTE demonstrates generalizability across various scales of satellite networks, tested on unseen topologies and traffic from the trained model.
Contributions: SaTE’s core contributions include:
• A low-latency TE computation solution for dynamic large-scale LEO satellite constellations.
• A dataset pruning technique that significantly reduces the training dataset volume, facilitating model training.
• Detailed evaluation of SaTE performance on a simulated mega-constellation, Starlink.
近年来,卫星网络经历了快速发展,提供了高速互联网、直连手机服务,以及全球资产追踪和精准农业等物联网连接应用。这些卫星网络的规模已显著扩大;例如,为300万用户提供服务的“星链”已经部署了超过6000颗卫星,并计划扩展到12000颗。
随着星座规模的增长和用户流量需求的激增,卫星网络必须精心分配卫星节点间的流量,以支持高吞吐量服务。现有的卫星网络流量分配方法,如基于拥塞避免和服务质量(QoS)约束的路由,由于其分布式特性,常常导致网络拥塞和热点问题。地面广域网中的类似问题推动了集中式流量工程(TE)算法的最新发展,这类算法将预配置路径上的流量分配作为一个全局优化问题来计算,以实现最大化总吞吐量等目标。尽管卫星软件定义网络(SDN)为将这些TE算法应用于卫星网络铺平了道路,但卫星网络的动态特性给现有TE算法带来了挑战。
(1) 频繁的拓扑变化:与地面广域网的静态拓扑不同,LEO卫星网络(如“星链”)的拓扑变化频繁,周期从数秒到快至每70毫秒。这些变化使得已配置的网络路径迅速失效,其变化速度常常超过了TE算法所需的数秒计算时间,从而导致显著的吞吐量下降
(2) 流量波动:面向用户的网络流量存在亚秒级的动态变化,同时流量预测本身也可能存在误差。这些因素要求TE系统能够实时响应,以处理瞬时的流量波动。然而,现有的TE算法无法同时应对这两种动态变化。商业求解器和启发式方法在计算TE时,其复杂度随网络规模呈多项式或更高阶增长,计算耗时长达数分钟,无法跟上流量的快速波动。近期基于学习的TE解决方案虽能在静态广域网上实现快速TE计算,但其训练好的模型与单一拓扑绑定,每次拓扑变化都需要数小时的重新训练,这对于卫星网络中海量的多变拓扑而言是不切实际的。因此,目前仍缺乏一种能够适应快速变化的拓扑和动态流量的低延迟TE解决方案
我们提出了SaTE,一种专为动态卫星星座设计的低延迟TE算法。SaTE的核心是一种新颖的卫星TE图设计,它对整个TE问题进行建模,使得仅使用图神经网络(GNN)层就能快速计算TE,并适应拓扑和流量的频繁变化。此外,SaTE通过利用卫星拓扑的结构相似性以及面向用户的流量需求的地理空间分布特性来简化其模型训练,显著减少了所需训练数据集的规模,同时能很好地泛化到未见过的拓扑。我们在一个模拟的包含4236个节点的“星链”星座上进行了详细评估,结果显示,与商业求解器相比,SaTE在计算延迟方面实现了2738倍的提升。
SaTE旨在以低延迟为大规模卫星网络计算TE。现有的基于GNN的TE方法依赖于仅能捕获部分TE要素(如链路连通性)的图设计,并且需要密集神经网络(DNN)层来建模诸如流量需求和路径关联等关键关系。然而,DNN层要求固定的输入维度,无法泛化到未见过的拓扑,因此需要频繁地重新训练。为了解决这个问题,SaTE构建了一个能够对整个TE问题进行建模的异构图。这种设计无需DNN层,使得GNN能够完全学习并解决TE问题。该方法能够泛化到未见过的拓扑,并能适应卫星TE关键组件的动态变化,例如网络路径和流量需求的可变输入维度。SaTE还通过从图中移除冗余关系来降低计算延迟。简化后的图被用于构建一个带有注意力机制的GNN,以进行流量分配。
SaTE的额外考量在于训练的可行性与有效性。对于拥有数千个节点的星座,每个训练数据点及其对应的图所需的内存(例如,“星链”需要335 GB)远超GPU容量。此外,在多个GPU上分布式训练单个图并保证性能仍然是一个开放性挑战。SaTE通过从TE输入中剪除无贡献的元素来解决这个问题,这利用了全球卫星用户分布的稀疏性,因为飞越人口稀疏地区(如沙漠)的卫星通常没有流量需求,导致许多路径处于空闲状态。我们的图设计支持了这种输入剪枝,因为它允许单独使用GNN层,这与依赖DNN并需要固定大小输入的先前工作不同。另一个挑战是“星链”存在海量多变的拓扑。覆盖所有可能的拓扑会带来过拟合风险和高昂的训练开销。因此,SaTE探索了卫星拓扑的相似性——尽管卫星的几何位置随着其绕地运行而改变,但某些网络拓扑保持相似或同构。因此,我们开发了一种拓扑剪枝方法,该方法在保持SaTE性能的同时,减少了所需的训练数据点数量。通过结合这些剪枝方法,SaTE压缩了数据集的大小,使训练变得可行,同时降低了训练开销。
我们通过在拥有4236颗卫星的“星链”上进行广泛的数据驱动仿真,来评估SaTE的可行性。我们的评估场景包括“星链”卫星仅通过空间激光链路连接(目前其运行超过10000个激光器)以及部分通过地面中继连接的情况。我们为全球300万卫星用户生成了流量矩阵,涵盖了语音、视频和文件传输等服务的波动流量,以及不同的整体流量负载(如到达强度)。训练好的SaTE模型在未见过的拓扑和流量上进行了评估。我们的结果显示:
- SaTE在解决“星链”的TE问题时,实现了17毫秒的计算延迟,比商业求解器提升了2738倍
- SaTE使“星链”中满足的流量需求至少增加了11.0%,性能比现有先进方法高出23.5%以上
- SaTE在不同规模的卫星网络中展示了其泛化能力,在训练模型未见过的拓扑和流量上通过了测试
本文贡献:SaTE的核心贡献包括:
- 一种适用于动态大规模LEO卫星星座的低延迟TE计算解决方案
- 一种能够显著减少训练数据集体量、简化模型训练的数据集剪枝技术
- 在模拟的巨型星座“星链”上对SaTE性能进行的详细评估
核心概括
基于 GNN, 采用两种"剪枝"来提升效率, 构建极低延迟的TE
- 用户分布的集中性 (地球上70%是海洋)
- 卫星星座整体拓扑具有相对不变性
(1) 为什么我们要这么在意"速度/效率"?
- TE计算只考虑那些在计算运行时间内保持一致的星间链路。更长的计算时间会导致更多的链路被排除, 从而造成巨大的带宽浪费
- 更长的计算时间会延迟调整, 因为在此期间流量需求可能已发生显著变化
(2) 前人工作就是在WAN上, 拿GNN解决TE, 这一篇的区别是?
我们使用的是 异构图+剪枝, 不再需要 DNN 了!
- 不再需要DNN来辅助, 提升泛化性
- 每个源-目的对的所有预配置路径都必须作为输入显式地表示给DNN, 内存开销太大了
SaTE GNN
核心指导思想: 传统的优化问题 可以转换成 推理过程
系统构建:
- 异构图: GNN不再需要DNN就可以正常工作
- 只留下3个关系对: R1, R2, R3
- GNN:
- 使用了三个GNN模块, 每个模块分别用于处理一种关系类型
- 基于GATs(图注意力网络)
- 数据集剪枝: 不仅仅是优化, 还是系统工作的必要条件!
- "地球上真正的高负载区域很有限"
- 仅保留流量矩阵中的非零元素, 并从训练数据集中排除流量输入为零的空闲路径
- "整个星座处于具有动态相对一致性"
- 我们不是遍历所有可能的拓扑快照, 而是 在一个具有多样化结构的小型代表性拓扑集上训练模型
- "地球上真正的高负载区域很有限"
Background and Motivation¶
2.1 Primer on Satellite Networks¶
This paper focuses on non-geostationary satellite constellations that implement inter-satellite links (ISLs). Examples include Iridium [48] with radio-frequency links, and Starlink using space lasers (up to 200 Gbps [68]) across over 4,000 satellites [37], with some scenarios using "bent-pipe" links [28, 58] (ground relays linking close satellites). Historically, satellite communication relied solely on ground stations; requiring satellites to wait until they were within range of ground stations for data relay. ISLs [28, 34, 53] now enable direct satellite-to-satellite communication, forming an independent switched network—a satellite network. Our work focuses on in-space end-to-end communication scenarios, where traffic is transmitted through multiple satellite hops. Fig. 1 shows a typical satellite network architecture.
Satellite constellations are organized into orbital shellsdistinct layers of satellites each operating at specific altitudes. For instance, Starlink’s four shells are at altitudes of 540, 550, 560, and 570 km. These shells form the network topology that connects satellites both within the same shell and across different shells. As satellites orbit, the network topology varies, mainly induced by two types of connections:
(1) Links within the same shell: Within each shell, each satellite usually connects with four neighbors: two on the same orbit (intra-orbit links) and two on adjacent orbits (inter-orbit links), as shown in Fig. 2 (a). The intra-orbit links are generally more stable and rarely change, while inter-orbit links are deactivated when satellites reach high latitudes due to the excessive viewing angles between satellites in neighboring orbits [11], leading to changes in network topology. For Starlink, these links operate on space lasers.
(2) Cross-shell links: Satellites in different orbital shells can be interconnected via two primary methods, each depicted in Fig. 2 (b & c). The first method involves direct laser links that connect satellites across different orbital shells. These links frequently re-pair due to the relative motion among satellites located in separate shells [44]. The second method connects satellites through ground station relays, known as "bent-pipe" links [47], which are commonly utilized when direct laser systems are not yet fully operational [28, 58]. These ground relay-based radio-frequency links undergo frequent changes as non-geostationary satellites rapidly move relative to the ground.
本文专注于采用星间链路(ISLs)的非对地静止卫星星座。例如,铱星(Iridium)系统采用射频链路,而“星链”(Starlink)则在超过4000颗卫星之间使用空间激光(速率高达200 Gbps),在某些场景下也使用“弯管”(bent-pipe)链路(即通过地面中继连接临近的卫星)。历史上,卫星通信完全依赖地面站,卫星需要等到进入地面站的通信范围后才能进行数据中继。如今,星间链路(ISLs)实现了卫星间的直接通信,形成了一个独立的交换网络——即卫星网络。我们的工作重点关注端到端的空间通信场景,其中流量通过多次卫星跳转进行传输。图1展示了一个典型的卫星网络架构。
卫星星座被组织成多个轨道层(orbital shells)——即在特定高度运行的不同卫星层。例如,“星链”的四个轨道层分别位于540、550、560和570公里的高度。这些轨道层构成了网络拓扑,连接着同一轨道层内以及不同轨道层之间的卫星。随着卫星的运行,网络拓扑会发生变化,主要由两类连接引起:
(1) 同一轨道层内的链路: 在每个轨道层内部,每颗卫星通常与四个邻居相连:两个在同一轨道上(轨道内链路),两个在相邻轨道上(轨道间链路),如图2(a)所示。轨道内链路通常更稳定,极少变化;而当卫星到达高纬度地区时,由于相邻轨道卫星间的视角过大,轨道间链路会被停用,从而导致网络拓扑变化。对于“星链”而言,这些链路通过空间激光运行。
(2) 跨层链路: 不同轨道层中的卫星可以通过两种主要方式互连,如图2(b & c)所示。
第一种方法是使用直接的激光链路连接不同轨道层的卫星。由于位于不同轨道层的卫星之间存在相对运动,这些链路会频繁地重新配对
第二种方法是通过地面站中继来连接卫星,即所谓的“弯管”链路,这种方式在直接激光系统尚未完全普及时被广泛使用。由于非对地静止卫星相对于地面快速移动,这些基于地面中继的射频链路会经历频繁的变化
2.2 Primer on Traffic Engineering¶
Centralized TE [76] is a well-established concept that has been extensively researched and deployed in production WANs [30]. The TE workflow shown in Fig. 3, involves a control center that periodically gauges traffic demands (by a bandwidth broker [35]), computes traffic allocation, and translates the results into router configurations deployed through SDN [6, 19, 36]. Central to TE is an optimization algorithm that computes the allocation of the traffic demand across preconfigured network paths, aiming to optimize a TE objective, such as maximizing overall throughput. We outline a typical satellite TE workflow as follows:
(1) Traffic Matrix Acquisition: Satellite users must establish connections prior to data transmission to facilitate the allocation of channel resources (e.g., beam time slots). This procedure typically involves authentication by the control center, which can also be leveraged to estimate traffic demands based on traffic types and service-level agreements. Examples are provided in Appendix D. The aggregated demands form a traffic matrix, with each entry representing total authorized demand between specific satellite pairs.
(2) Topology Determination: Satellite network topologies change over time but are predefined by satellites’ designated orbits. A satellite topology refers to the network’s connectivity structure. The TE control center monitors satellite coordinates and network state [32] in real time, adapting to subtle changes in the topology. Thus, the topology is determined in advance of TE computation.
(3) Network Path Preconfiguration: Traffic paths (e.g., from source A to destination B) are precomputed based on the determined network topology. Network operators use routing algorithms and configure these paths with techniques like MPLS labels [22, 55, 78] prior to TE computation.
(3) TE Computation: With the traffic matrix, topology, and paths as inputs, the TE algorithm computes the optimal allocation for each traffic flow across its candidate paths. The TE objective, defined by the operator, typically aims to achieve goals such as maximizing throughput. The mathematical formulation of TE is detailed in Appendix A.
(4) Traffic Rule Distribution and Loading: Traffic allocation results are converted into traffic rules for onboard satellites switches. These rules are distributed to satellites via the control center and propagated through ISLs with minimal delays. Satellites then compile and load the rules into their flow tables.
Scope of The Work: The paper focuses on the TE computation step within the broader TE workflow (Fig. 3). The workflow operates periodically over fixed time intervals, with the duration primarily determined by the runtime of TE computation step. While other steps in the workflow typically take within sub-second timescales 4 , traditional TE computation methods [1, 22, 29, 38, 55] require several minutes to even hours for large-scale networks with thousands of nodes. This creates a runtime bottleneck in the workflow, resulting in significant bandwidth waste and delays in traffic allocation adjustments (as discussed in Sec. 2.3.2). Our goal is to develop low-latency TE algorithms designed for large-scale, dynamic satellite networks.
集中式流量工程(TE)是一个成熟的概念,已在生产环境的广域网(WAN)中得到广泛研究和部署。
如图3所示的TE工作流程,涉及一个控制中心,该中心周期性地评估流量需求(通过带宽代理),计算流量分配,并将结果转换为路由器配置,通过SDN进行部署。
TE的核心是一个优化算法,该算法计算流量需求在预配置的网络路径上的分配,旨在优化一个TE目标,例如最大化总吞吐量。我们概述一个典型的卫星TE工作流程如下:
(1) 流量矩阵获取:卫星用户在数据传输前必须建立连接,以便分配信道资源(如波束时隙)
- 此过程通常涉及控制中心的身份验证,这一环节也可被用来根据流量类型和服务等级协议来估计流量需求
- 附录D中提供了示例。 汇聚后的需求构成一个流量矩阵,其中每个条目代表特定卫星对之间的总授权需求
(2) 拓扑确定:卫星网络拓扑随时间变化,但由卫星指定的轨道预先定义
卫星拓扑指的是网络的连接结构。TE控制中心实时监控卫星坐标和网络状态,以适应拓扑的微小变化。因此,拓扑在TE计算之前就已确定
(3) 网络路径预配置:流量路径(例如,从源A到目的B)基于已确定的网络拓扑预先计算
网络运营商使用路由算法,并在TE计算前通过MPLS标签等技术配置这些路径
(4) TE计算:以流量矩阵、拓扑和路径作为输入,TE算法计算每个流量在其候选路径上的最优分配
由运营商定义的TE目标通常旨在实现最大化吞-吐量等目标。TE的数学公式详见附录A
(5) 流量规则分发与加载:流量分配结果被转换为卫星星上交换机的流量规则
- 这些规则通过控制中心分发给卫星,并通过星间链路以极小的延迟传播
- 随后,卫星编译这些规则并将其加载到流表中
本文工作范围: 本文专注于整个TE工作流程(图3)中的TE计算步骤。该工作流程以固定的时间间隔周期性运行,其周期时长主要由TE计算步骤的运行时间决定。虽然工作流程中的其他步骤通常耗时在亚秒级,但传统的TE计算方法对于拥有数千个节点的大规模网络,需要数分钟甚至数小时。这在工作流程中造成了运行时间瓶颈,导致显著的带宽浪费和流量分配调整的延迟(详见2.3.2节讨论)。我们的目标是为大规模、动态的卫星网络开发低延迟的TE算法。
2.3 Dynamics and Implications for TE¶
2.3.1 How Long Does A Topology Hold?
Changes in network topology are induced by changing ISLs due to the relative motion between satellites and asynchronous movement of the satellites relative to Earth. Here, we provide an analysis of the Topology Holding Time (THT), which measures how long a topology remains unchanged. We focus our analysis on Starlink with 4236 satellites.
Topology Generation for Starlink: We generate the topology snapshots for Starlink based on FCC specifications [18]. A) Each Starlink satellite connects with four neighbors within the same shell using lasers, see Fig. 2 (a). These ISLs remain stable and only break when the satellites reach high latitudes larger than 75 ◦ . B) Each satellite also connects with the nearest satellite in neighboring shells, forming cross-shell links. Our analysis includes two scenarios with different types of cross-shell links deployed: (B.1) Cross-shell links via lasers: Each satellite connects to the nearest satellite in adjacent orbital shells via lasers (see Fig. 2 (b)). The link remains until the distance exceeds 2,000 km. (B.2) Cross-shell links via ground relays: Each satellite connects to the nearest ground relay using RF (see Fig. 2 (c)). The link remains until the satellite is at an elevation angle lower than 25°relative to the ground relay [18]. Ground relays are globally deployed at 222 real-world locations [49].
THT Analysis: We sample 40,000 consecutive topology snapshots every 12.5 ms. THT is measured as 12.5𝑘 ms, where 𝑘 is the number of sampled intervals in which the topology remains unchanged. Fig. 4 (a) shows the CDF of THT, with a maximum of around 700 ms and an average of 70 ms. Crossshell link types do not significantly impact THT.
网络拓扑的变化是由卫星间的相对运动以及卫星相对于地球的异步运动所导致的星间链路变化引起的。
在此,我们分析了 拓扑保持时间(THT),该指标衡量一个拓扑保持不变的时长。 我们的分析聚焦于拥有4236颗卫星的“星链”网络。
“星链”的拓扑生成: 我们基于FCC的规范生成了“星链”的拓扑快照。
A) 每颗“星链”卫星在同一轨道层内通过激光与其四个邻居连接,见图2(a)。这些星间链路保持稳定,仅在卫星到达大于75°的高纬度地区时才会断开
B) 每颗卫星还与相邻轨道层中最近的卫星连接,形成跨层链路。我们的分析包括两种部署了不同类型跨层链路的场景:
(B.1) 通过激光的跨层链路:每颗卫星通过激光连接到相邻轨道层中最近的卫星(见图2(b))。该链路将一直保持,直到距离超过2000公里
(B.2) 通过地面中继的跨层链路:每颗卫星使用射频(RF)连接到最近的地面中继(见图2(c))。该链路将一直保持,直到卫星相对于地面中继的仰角低于25°。地面中继部署在全球222个真实世界的位置
THT分析: 我们每隔12.5毫秒连续采样了40,000个拓扑快照。THT被测量为 \(12.5k\) 毫秒,其中\(k\)是拓扑保持不变的采样间隔数量。图4(a)显示了THT的累积分布函数(CDF),其最大值约为700毫秒,平均值为70毫秒。跨层链路的类型对THT没有显著影响。
2.3.2 Implications.
(1) Implication of Topology Changes: Traditional TE computation for large-scale networks with thousands of nodes often requires several minutes to solve, creating a runtime bottleneck in the periodic TE workflow and delaying subsequent rounds. Then, many configured paths become outdated over time during a TE workflow round, as shown in Fig. 4 (b). To ensure performance, TE computation considers only ISLs that remain consistent during the computation runtime. However, longer computation times result in greater link exclusion and substantial bandwidth waste.
To quantify this impact, we analyze 40,000 consecutive topology snapshots of Starlink sampled every 12.5 ms. For a given interval (i.e., 12.5𝑘 ms), we retain only ISLs present in every snapshot. We then calculate the ratio of excluded ISLs to the total number of potentially changing ISLs (a number primarily contributed by cross-shell links), with results shown in Fig. 4 (c) for intervals ranging from 12.5 ms to 250 s. We note that longer intervals result in a significant fraction of ISLs becoming outdated and excluded. This underscores the importance of low-latency TE computation to reduce link exclusion and bandwidth waste.
(2) Implication of Traffic Fluctuations: Networks serve user-facing traffic, which fluctuates over time. These fluctuations prevent pre-allocation of traffic, requiring TE allocation to be computed rapidly in response to received traffic demands. Longer computation times delay adjustments, as traffic demand may shift significantly during the period.
Motivation for Faster TE Computation: Frequent topology changes and fluctuating traffic demands in satellite networks demand low-latency TE algorithms. Faster TE computation can shorten the workflow interval, mitigate the exclusion of ISLs caused by topology changes and enable timely traffic allocation adjustments. This motivates the development of SaTE, designed to compute TE at low latency.
(1) 拓扑变化的影响: 传统TE算法对拥有数千个节点的大规模网络进行计算,通常需要数分钟才能求解,这在周期性的TE工作流程中造成了运行时间瓶颈,并延迟了后续轮次的计算。因此,在一个TE工作流程周期内,许多已配置的路径会随着时间的推移而失效,如图4(b)所示。为保证性能, TE计算只考虑那些在计算运行时间内保持一致的星间链路。然而,更长的计算时间会导致更多的链路被排除,从而造成巨大的带宽浪费。
为了量化这一影响,我们分析了每12.5毫秒采样的40,000个连续的“星链”拓扑快照。对于给定的时间间隔(即\(12.5k\)毫秒),我们只保留在每个快照中都存在的星间链路。然后,我们计算被排除的星间链路与可能发生变化的总星间链路(该数量主要由跨层链路贡献)的比率,结果如图4(c)所示,时间间隔范围从12.5毫秒到250秒。我们注意到,更长的时间间隔会导致大部分星间链路变得过时而被排除。这凸显了低延迟TE计算对于减少链路排除和带宽浪费的重要性。
(2) 流量波动的影响: 网络服务于面向用户的流量,这些流量随时间波动。这些波动使得预先分配流量变得不可行,要求TE分配必须根据接收到的流量需求快速计算。更长的计算时间会延迟调整,因为在此期间流量需求可能已发生显著变化。
加速TE计算的动机: 卫星网络中频繁的拓扑变化和波动的流量需求,迫切需要低延迟的TE算法。更快的TE计算可以缩短工作流程的间隔,减轻由拓扑变化引起的星间链路排除问题,并实现及时的流量分配调整。这推动了我们开发SaTE,旨在以低延迟计算TE。
2.4 Applying ML to Satellite TE¶
Scalability Challenges: Solving TE for satellite networks with thousands of nodes is a large-scale optimization challenge, involving NP-hard linear programming. Existing heuristic methods [1, 22, 29, 38, 55, 71, 73] typically require minutes or even hours to compute solutions. Commercial solvers [24] take 46 seconds for Starlink (see Sec. 5.1), which far exceeds the THT shown in Fig. 4 (a). Consequently, many preconfigured network paths become obsolete due to changes in topologies, leading to degraded network performance.
Acceleration via Machine Learning: ML offers opportunities to overcome TE computation bottlenecks via GPU parallelization. Recent studies [23, 61] use deep neural networks and multi-agent reinforcement learning to accelerate TE in terrestrial WANs but struggle with topology changes in satellite networks, requiring frequent retraining following topology changes. In contrast, model architectures like GNNs are well-suited for learning from graph-structured data with dynamic connectivity (e.g., topologies). Advanced GNNs, like graph attention networks (GATs), are particularly adept at handling variable-sized inputs and generalizing to unseen graphs [9, 72]. This capability presents a promising opportunity to address the challenges of satellite TE.
Limitations of Existing GNN-Based TE Solutions: However, recent GNN-based TE approaches [78] are hamstrung for satellite networks due to their design of the input graph. Specifically, their graphs only represent how individual links (e.g., direct connections between nodes) are used in the configured paths that span source-destination pairs. While this relation can be learned using GNN layers, the graph [78] fails to represent essential TE relations, such as the paths tied to traffic demand. This critical relation is not captured by the graph and instead requires additional learning layers, such as DNN. Consequently, their models, consisting of GNN and DNN layers, introduce the following limitations:
• Limited Generalization: As topologies change, the DNN layers trained to process embeddings from a specific set of paths cannot generalize to preconfigured paths in topologies that are not part of the training dataset (referred to as “unseen” topologies). This requires extensive re-training (e.g., 6-10 hours [78]) whenever the topology changes.
• Memory Challenges: DNN layers require fixed-sized inputs, which means all preconfigured paths for each sourcedestination pair must be explicitly represented as inputs to the DNN. In Starlink, with 4,236 satellites and 10 preconfigured paths per source-destination pair, a single DNN input has a path dataset of size 263 GB, exceeding commercial GPU memory. The fixed dimensions prevent the application of flexible pruning techniques, like those developed by SaTE (Sec. 3.4). These limitations make existing learning-based TE approaches struggle to address the scale and dynamics of satellite networks.
可扩展性挑战: 为拥有数千个节点的卫星网络求解TE是一个大规模优化挑战,涉及NP难的线性规划问题。现有的启发式方法通常需要数分钟甚至数小时来计算解决方案。商业求解器对“星链”网络的计算耗时为46秒(见5.1节),这远超过图4(a)所示的THT。因此,许多预配置的网络路径会因拓扑变化而失效,导致网络性能下降。
通过机器学习加速: 机器学习(ML)通过GPU并行化为克服TE计算瓶颈提供了机会。近期的研究使用深度神经网络和多智能体强化学习来加速地面广域网的TE,但难以应对卫星网络的拓扑变化,因为拓扑变化后需要频繁地重新训练。相比之下,像图神经网络(GNNs)这样的模型架构非常适合学习具有动态连接性(如拓扑)的图结构数据。先进的GNN,如图注意力网络(GATs),尤其 擅长处理可变大小的输入并能泛化到未见过的图。这种能力为解决卫星TE的挑战提供了一个有前景的机会。
现有基于GNN的TE解决方案的局限性: 然而,由于输入图的设计问题,近期基于GNN的TE方法在应用于卫星网络时受到了限制。具体来说,它们的图仅表示了单个链路(如节点间的直接连接)在跨越源-目的对的已配置路径中是如何被使用的。虽然这种关系可以用GNN层来学习,但该图未能表示一些必要的TE关系,例如与流量需求绑定的路径。这种关键关系没有被图捕获,而是需要额外的学习层,如深度神经网络(DNN)。因此,它们由GNN和DNN层组成的模型引入了以下局限性:
- 泛化能力有限: 随着拓扑的变化,为处理特定路径集合的嵌入而训练的DNN层,无法泛化到不属于训练数据集的拓扑(即“未见过”的拓扑)中的预配置路径。这导致每当拓扑变化时,都需要进行耗时很长的重新训练(例如,6-10小时)
- 内存挑战: DNN层要求固定大小的输入,这意味着每个源-目的对的所有预配置路径都必须作为输入显式地表示给DNN。在“星链”网络中,有4236颗卫星,每个源-目的对有10条预配置路径,单个DNN输入的路径数据集大小就达到了263 GB,超过了商用GPU的内存。固定的维度也阻碍了灵活剪枝技术的应用,例如SaTE所开发的技术(见3.4节)。这些局限性使得现有的基于学习的TE方法难以应对卫星网络的规模和动态性
SaTE - Satellite TE¶
3.1 Overview¶
SaTE aims to achieve low-latency TE computation, enabling rapid adaptation to frequent topology changes and fluctuating traffic demands in satellite networks. To achieve this, we transform classic TE optimizations into an inference process of neural networks, which can be accelerated by GPU, leveraging modern deep learning frameworks. The key enabler of SaTE is a graph representation that models the satellite TE problem. The graph is then used to build a GNN framework. To make training on commercial GPUs feasible, SaTE introduces a topology pruning method as well as a traffic and path pruning approach, as shown in Fig. 5. The paper addresses the two main challenges in SaTE’s design:
(1) Satellite Graph Design and Learning: Our solution must design a graph that captures the changing interconnectivity in satellite network topologies. Existing approaches [33, 78] construct graphs based solely on the physical interconnectivity, serving as inputs to GNN layers. However, they miss crucial correlations between traffic flows and feasible paths—key for solving TE. To compensate for this, existing methods rely on additional DNN layers, but this combination of GNN and DNN struggles to generalize to dynamic and unseen topologies, requiring frequent retraining. SaTE addresses this by formulating a heterogeneous graph that comprehensively represents all elements and relations critical to the satellite TE problem. This graph enables the use of GNN layers alone, allowing the model to generalize to dynamic changes in the graph (e.g., topology, traffic demands, and paths) and even entirely unseen graphs, thereby alleviating the need for frequent retraining with each topology change. Furthermore, SaTE reduces the number of relations within the graph to remove computational redundancy and enhance inference latency. Sec. 3.2 and 3.3 detail our approach.
(2) Dataset Pruning: The size of the input to the GNN modules increases quadratically with the number of satellites. For example, Starlink, with its 4,236 satellites, generates a data point of 335 GB for TE computation, which exceeds the memory capacity of GPUs [59]. While multi-GPU training is an active field of study, efficiently distributing and computing a singular data point and its corresponding graph across multiple GPUs remains an open question [79, 82]. Instead, SaTE prunes massive input data to make it feasible to train its model on a single GPU. Further, the vast number of varying satellite topologies makes the training challenging as well. Sec. 3.4 addresses these challenges. We note that the pruning step is not merely an optional optimization, but a necessary precondition for scalable training and inference.
SaTE旨在实现低延迟的TE计算,从而能够快速适应卫星网络中频繁的拓扑变化和波动的流量需求。为实现此目标, 我们将经典的TE优化问题转化为神经网络的推理过程 ,该过程可利用现代深度学习框架通过GPU进行加速。SaTE的关键赋能技术是一种对卫星TE问题进行建模的图表示方法。该图随后被用于构建一个GNN框架。为了使在商用GPU上的训练变得可行,SaTE引入了一种拓扑剪枝方法以及一种流量与路径剪枝方法,如图5所示。本文解决了SaTE设计中的两个主要挑战:
(1) 卫星图的设计与学习: 我们的解决方案必须设计一个能够捕捉卫星网络拓扑中不断变化的互连性的图。现有的方法仅基于物理互连性构建图,并将其作为GNN层的输入。然而,它们忽略了流量与可行路径之间的关键关联——这对求解TE至关重要。为了弥补这一点,现有方法依赖于额外的DNN层,但这种GNN与DNN的组合难以泛化到动态和未见过的拓扑,需要频繁地重新训练。SaTE通过构建一个异构图来解决此问题,该图全面表示了对卫星TE问题至关重要的所有元素和关系。这种图结构使得模型可以仅使用GNN层,从而能够泛化到图中的动态变化(如拓扑、流量需求和路径),甚至能泛化到完全未见过的图,从而减轻了每次拓扑变化时都需要频繁重新训练的需求。此外,SaTE减少了图中的关系数量,以消除计算冗余并提升推理延迟。第3.2节和3.3节详细介绍了我们的方法。
(2) 数据集剪枝: GNN模块的输入规模随卫星数量呈二次方增长。例如,“星链”拥有4236颗卫星,为一次TE计算生成的单个数据点大小高达335 GB,这超过了GPU的内存容量。虽然多GPU训练是一个活跃的研究领域,但在多个GPU上高效地分布式计算单个数据点及其对应的图仍然是一个开放性问题。因此,SaTE通过剪枝海量输入数据,使其能够在单个GPU上进行模型训练。此外,卫星拓扑数量庞大且种类繁多,也给训练带来了挑战。第3.4节解决了这些挑战。我们注意到,剪枝步骤不仅仅是一个可选的优化,而是可扩展训练和推理的必要前提。
核心机制
优化问题可以转换成推理过程
- 异构图 GNN: 不再需要DNN了!
- 数据集剪枝: 不仅仅是优化, 而是必要条件!
3.2 Satellite Graph Modeling¶
In this section, we describe how SaTE designs a heterogeneous graph representation tailored for satellite networks.
Graph Design: A graph essentially represents a relational mapping (i.e., edges) between elements of interest (i.e., nodes) and serves as the input for any GNN model. Previous works construct graphs to represent computer networks solely from physical interconnectivity and associated preconfigured paths [78], relying on additional DNN layers (e.g., an MLP) that limit their ability to generalize to changing topologies. In contrast, SaTE constructs a heterogeneous satellite TE graph to represent the entire TE problem. This graph captures changes in topology, time-varying traffic matrices, and path reconfigurations. These dynamics can then be modeled by GNNs, which leverage their inherent generalization capabilities to adapt even to unseen topologies.
The satellite TE graph includes the following key elements: (1) Satellite – denotes the end nodes of ISL, including satellites and potentially ground relays when "bent-pipe" links are used; (2) Traffic – denotes traffic demands from users; (3) Path – specifies the route taken by traffic from a source to a destination satellite; (4) Link – connects two satellites via ISL. We then specify the relations among these elements: e.g., satellites inter-connect each other; each path crosses chains of satellites from a source to a destination, being able to transport network traffic; each path contains a group of links. Fig 6 (a) details all possible relations in the graph.
Graph Reduction for Enhanced Computational Latency: GNNs learn complex dependencies and interactions within a graph using message passing [26], where neighboring elements exchange and aggregate information based on specific relations to update their embeddings (i.e., vector representations). However, SaTE’s satellite TE graph contains multiple heterogeneous elements and relations, typically requiring distinct message passing for each relation type (e.g., connects, contains). As the number of distinct message passing increases, so does the computational latency of the GNN. Therefore, SaTE must simplify its heterogeneous graph.
To address this, SaTE reduces redundancy within the graph by eliminating unnecessary relations while retaining all essential information for TE computation. For example, the "access" relation between Satellite and Traffic is redundant because it is implicitly covered by the “crosses” and “transports” relations. Specifically, the “crosses” relation connects Path to Satellite, indicating which satellites are traversed by each preconfigured path. Similarly, the “transports” relation connects Path to Traffic, capturing the allocation of traffic demands to paths. Together, these relations enable message passing from Traffic to Satellite via Path, ensuring that satellites receive traffic-related information without requiring a direct “access” edge. In addition, the representation of 𝐿𝑖𝑛𝑘 can be merged into the "connect" relation, with its link incorporated into the weight of the "connect" relation, facilitating the use of edge-weighted attention mechanism [40]. SaTE retains the Satellite element due to the dynamic nature of satellite interconnections, which frequently break and reform. Thus, this time-varying relation—where satellites connect with each other—must be explicitly modeled and learned by the GNN. Fig. 6 (b) shows the simplified satellite TE graph. This simplification reduces the learning complexity, leaving three types of relational pairs: (R1) Inter-Satellite connections; (R2) Path-Satellite; and (R3) Path-Traffic.
在本节中,我们描述SaTE如何为卫星网络设计一种量身定制的异构图表示方法。
图设计: 图本质上表示了目标元素(即节点)之间的关系映射(即边),并作为任何GNN模型的输入。先前的工作仅从物理互连性和相关的预配置路径来构建图以表示计算机网络,并依赖额外的DNN层(如MLP),这限制了它们泛化到变化拓扑的能力。相比之下,SaTE构建了一个异构卫星TE图来表示整个TE问题。该图能够捕捉拓扑的变化、时变的流量矩阵以及路径的重新配置。这些动态变化随后可以由GNN进行建模,利用其固有的泛化能力,甚至可以适应未见过的拓扑。
卫星TE图包括以下关键元素:
(1) 卫星(Satellite) - 表示星间链路的端节点,包括卫星本身,当使用“弯管”链路时也可能包括地面中继
(2) 流量(Traffic) - 表示来自用户的流量需求
(3) 路径(Path) - 指定流量从源卫星到目的卫星所采用的路由
(4) 链路(Link) - 通过星间链路连接两颗卫星。然后我们定义这些元素之间的关系:例如,卫星之间相互连接;每条路径都跨越从源到目的的一系列卫星,能够传输网络流量;每条路径包含一组链路。图6(a)详细说明了图中所有可能的关系
为提升计算延迟而进行图简化: GNN通过消息传递机制学习图内复杂的依赖关系和相互作用,其中相邻元素根据特定关系交换和聚合信息,以更新它们的嵌入向量(即向量表示)。然而,SaTE的卫星TE图包含多种异构元素和关系,通常需要为每种关系类型(如connects
, contains
)进行不同的消息传递。随着不同消息传递类型的增加,GNN的计算延迟也会增加。因此,SaTE必须简化其异构图。
为了解决这个问题,SaTE通过消除不必要的关系来减少图中的冗余,同时保留TE计算所需的所有基本信息。例如,“卫星”和“流量”之间的“接入(access)”关系是冗余的,因为它被“跨越(crosses
)”和“传输(transports
)”关系所隐式覆盖。具体来说,“跨越”关系连接了“路径”与“卫星”,指明了每条预配置路径经过了哪些卫星;类似地,“传输”关系连接了“路径”与“流量”,捕捉了流量需求到路径的分配。这两个关系共同实现了信息从“流量”经由“路径”到“卫星”的消息传递,确保卫星能够接收到与流量相关的信息,而无需直接的“接入”边。此外,“链路(Link)”的表示可以合并到“连接(connect)”关系中,其链路属性(如容量)可以作为“连接”关系边的权重,从而便于使用边加权的注意力机制。SaTE保留了“卫星”元素,因为卫星间的互连具有动态性,会频繁地断开和重组。因此,这种时变的关系——卫星相互连接——必须被GNN显式地建模和学习。
图6(b)展示了简化后的卫星TE图。这种简化降低了学习复杂度,最终剩下三类关系对:(R1) 卫星间连接;(R2) 路径-卫星;以及 (R3) 路径-流量
3.3 Learning the Graph¶
Our discussion so far has covered how SaTE formulates a comprehensive yet simplified satellite TE graph to adapt to changing topologies and other TE elements while reducing computational redundancy. This graph serves as the input for graph neural networks, which are designed to learn representations of graph-structured data [72].
SaTE’s simplified satellite TE graph includes three distinct types of relations (R1, R2, R3, as shown in Fig. 6 (b)). To handle this heterogeneity [70], SaTE uses three GNN modules, each applied to process a type of relation. The architecture for all GNN modules is based on the graph attention network [9], which dynamically prioritizes the importance of neighboring nodes during information aggregation. This allows the neural network to focus on the most relevant connections in the graph, reducing the influence of irrelevant relations. The three GNN modules sequentially aggregate and propagate information according to their respective relations, as shown in Fig. 7. Through this sequential structure, embeddings updated by one module are passed to the next, capturing interdependencies across embeddings and facilitating complex inference through multiple layers of abstraction.
Embedding Initialization: We empirically select the embedding dimension to be 768 for nodes and edges in the simplified satellite TE graph (Fig. 6 (b)). As shown in the table of Fig. 7, each embedding is initialized by multiplying its respective TE input with a \(1 \times 768\) learnable weight matrix W, randomly initialized at the start. For example, node embeddings of satellites (NE1) are initialized using the number of a satellite’s neighbors; while edge embeddings (e.g., EE1) are initialized with the link capacity. Link or satellite failures are handled by setting the corresponding capacities to zero, which is inherently supported by the GNN design.
Inference through Message Passing: SaTE stacks multiple GNN layers in each module to iteratively update the initialized embeddings. A layer’s input consists of a set of node embeddings {\(v_i \in \mathbb{R}^d | i \in N\)} where \(N\) is the node set and a set of edge embeddings {\(e_{i,j}\)} for all related nodes (e.g., an edge exists between the graph nodes of two connected satellites in the ‘GNN for R1’ module). Each layer computes a weighted average of the embeddings of the neighbor nodes and related edges, followed by a nonlinear function LeakyReLU(·)), to produce a new set of node embeddings {\(v_i' \in \mathbb{R}^d | i \in N\)}:
\(v_i' \leftarrow \Theta_s \cdot v_i + \left\Vert_{k=1}^K \left( \sum_{j \in r(i)} \alpha_{j,i}^k (\Theta_n^k \cdot v_j + \Theta_e^k \cdot e_{j,i}) \right) \right.\), (1) \(v_i' \leftarrow \text{LeakyReLU}(v_i')\),
where \(\alpha\), \(\Theta\) are learned attention scores and weights, \(r(i)\) denotes neighbors of node \(i\), and \(\Vert\) denotes vector concatenation. A detailed description of these parameters is provided in Appendix F. Within the framework, embeddings are updated sequentially across three GNN modules: ‘GNN for R1’ module computes satellite embeddings. ‘GNN for R2’ module updates satellite and path embeddings concurrently to mutually enhance their representations. ‘GNN for R3’ module refines path embeddings and traffic embeddings together. Finally, an MLP decoder computes the traffic allocation \(x_{jp}\), which represents the amount of traffic \(j\) assigned to path \(p\).
Correction for Constraint Violation: Neural networks including GNNs are inherently soft constraint models, meaning they approximate solutions but do not strictly enforce hard constraints during optimization. For example, they cannot guarantee that the total bandwidth allocated to a flow will not exceed its demand or that traffic allocation will remain within link capacity limits. As a result, the computation outcome \(x_{jp}\) may violate TE constraints. To address this, we trim overloaded traffic, ensuring the final TE solution is feasible. Evaluation results in Sec. 5 account for this trimming.
Training Method: We train the GNN model using supervised learning, with ground-truth labels for traffic allocation generated by the commercial solver Gurobi [24]. These labels are obtained by solving the TE problem with the same set of inputs, including traffic demand, network topology, and paths, ensuring alignment with SaTE’s input data. During training, we iteratively calculate the loss between the label \(x_{jp}^*\), and the traffic allocation \(x_{jp}\) computed by SaTE. The loss is backpropagated to update all model parameters: attention scores \(\alpha\), weights \(\Theta\), and embedding initialization matrices W in each message passing layer. The loss function and hyperparameter are detailed in Appendix B.
至此,我们的讨论涵盖了SaTE如何构建一个全面而简化的卫星TE图,以适应变化的拓扑和其他TE元素,同时减少计算冗余。该图作为图神经网络的输入,用于学习图结构数据的表示。
SaTE的简化卫星TE图包含三种不同类型的关系(R1, R2, R3,如图6(b)所示)。为了处理这种异构性,SaTE使用了三个GNN模块,每个模块分别用于处理一种关系类型。所有GNN模块的架构都基于图注意力网络,该网络在信息聚合过程中动态地优先考虑邻居节点的重要性。这使得神经网络能够专注于图中最相关的连接,减少不相关关系的影响。三个GNN模块按照它们各自的关系顺序地聚合和传播信息,如图7所示。通过这种顺序结构,由一个模块更新的嵌入向量被传递给下一个模块,从而捕捉不同嵌入向量间的相互依赖关系,并通过多层抽象促进复杂的推理。
嵌入初始化: 我们根据经验选择简化卫星TE图(图6(b))中节点和边的嵌入维度为768。如图7的表格所示,每个嵌入向量都是通过将其各自的TE输入与一个\(1 \times 768\)维的可学习权重矩阵W相乘来初始化的,该矩阵在开始时随机初始化。例如,卫星的节点嵌入(NE1)使用卫星的邻居数量进行初始化;而边嵌入(如EE1)则使用链路容量进行初始化。链路或卫星故障可以通过将其对应的容量设置为零来处理,GNN的设计本身就支持这种操作。
通过消息传递进行推理: SaTE在每个模块中堆叠多个GNN层,以迭代更新初始化的嵌入向量。一个层的输入包含一组节点嵌入 {\(v_i \in \mathbb{R}^d | i \in N\)}(其中\(N\)是节点集合)和一组相关节点间的边嵌入 {\(e_{i,j}\)}(例如,在“GNN for R1”模块中,两个相连卫星的图节点之间存在一条边)。每个层计算邻居节点和相关边的嵌入的加权平均值,然后通过一个非线性函数LeakyReLU(·),生成一组新的节点嵌入 {\(v_i' \in \mathbb{R}^d | i \in N\)}:
\(v_i' \leftarrow \Theta_s \cdot v_i + \left\Vert_{k=1}^K \left( \sum_{j \in r(i)} \alpha_{j,i}^k (\Theta_n^k \cdot v_j + \Theta_e^k \cdot e_{j,i}) \right) \right.\), (1) \(v_i' \leftarrow \text{LeakyReLU}(v_i')\),
其中\(\alpha\)、\(\Theta\)是学习到的注意力分数和权重,\(r(i)\)表示节点\(i\)的邻居,\(\Vert\)表示向量拼接。这些参数的详细描述在附录F中提供。在该框架内,嵌入向量按顺序跨越三个GNN模块进行更新:“GNN for R1”模块计算卫星嵌入;“GNN for R2”模块同时更新卫星和路径嵌入,以相互增强它们的表示;“GNN for R3”模块共同优化路径嵌入和流量嵌入。最后,一个MLP解码器计算出流量分配\(x_{jp}\),表示分配给路径\(p\)的流量\(j\)的量。
约束违规校正: 包括GNN在内的神经网络本质上是软约束模型,这意味着它们在优化过程中近似求解,但并不严格强制执行硬约束。例如,它们无法保证分配给一个流的总带宽不超过其需求,或者流量分配保持在链路容量限制内。因此,计算结果\(x_{jp}\)可能会违反TE约束。为了解决这个问题,我们修剪超载的流量,确保最终的TE解是可行的。第5节的评估结果考虑了这种修剪操作。
训练方法: 我们使用监督学习来训练GNN模型,其真实标签(ground-truth labels)由商业求解器Gurobi生成的流量分配方案构成。 这些标签是通过使用与SaTE相同的输入集(包括流量需求、网络拓扑和路径)求解TE问题获得的,以确保与SaTE的输入数据对齐。在训练过程中,我们迭代计算标签\(x_{jp}^*\)与SaTE计算的流量分配\(x_{jp}\)之间的损失。该损失被反向传播以更新所有模型参数:每个消息传递层中的注意力分数\(\alpha\)、权重\(\Theta\)以及嵌入初始化矩阵W。损失函数和超参数详见附录B。
3.4 Dataset Dimension Pruning¶
The input size to the GNN model increases quadratically with the scale of satellite constellations. Mega-constellations like Starlink, with thousands of satellites, result in large-sized inputs. For example, Starlink, with its 4,236 satellites, generates a data point of 335 GB. This includes a 4,236×4,236 float traffic matrix (72 GB), where each element specifies the demand between source-destination pairs, and a 4,236×4,236 path matrix with 10 preconfigured paths per entry (263 GB). This total size far exceeds typical GPU capacity (e.g., 80 GB on an NVIDIA A100 [59]). While multi-GPU training is being explored, training a single data point and its corresponding large graph across multiple GPUs with performance guarantees remains an open challenge [79, 82]. Furthermore, mega-constellations exhibit numerous varying topologies, and training on all possible topologies risks overfitting and incurs high training overhead.
Existing GNN-based TE approaches rely on DNN layers to model key relations to TE [78], but these layers require fixed-sized input dimensions. Furthermore, these DNN layers cannot generalize to dynamic and unseen topologies, requiring frequent retraining. In contrast, in Sec. 3.2 and 3.3, SaTE introduces a graph design that models the entire TE problem and is learned solely by GNN layers. By leveraging attentionenabled GNN [72], SaTE can process variable-sized inputs, enabling traffic and path pruning. Additionally, the inductive learning capabilities of GNNs [9] allow SaTE to generalize to unseen topologies, enabling training on a small set of representative topologies instead of all possible snapshots. This section details two pruning strategies:
Traffic & Path Pruning: SaTE prunes the traffic and path matrices by leveraging the uneven distribution of satellite users [65, 66]. While the entire satellite network provides global coverage, some satellites often receive zero traffic, especially when passing over sparsely populated areas such as deserts [7]. Thus, we retain only the non-zero elements in the traffic matrix and exclude idle paths with zero traffic input from the training dataset. Such pruning still preserves the spatial dynamics of satellite networks by recording the source and destination of each traffic demand, thus maintaining representational fidelity. Table 1 shows that pruning reduces Starlink data volumes by up to 22,381×, shrinking a data point from 335 GB to 15 MB.
The pruning method is applicable only to fully GNN-based frameworks. It cannot be applied to hybrid models with DNNs. This is because DNNs require fixed-size and positionspecific input structures, and a model trained on a specific pattern of non-trivial traffic matrix entries cannot generalize to new patterns where the positions of active entries differ.
Topology Pruning: SaTE leverages the fact that certain satellite network topologie snapshots are similar or isomorphic as satellites orbit the Earth. Instead of traversing all possible topology snapshots, we train the model on a small set of representative topologies with diverse structures, where the set size can be customized. Specifically, our empirical study in Sec. 5.3 shows that when trained with 512 representative topologies for Starlink, the model can already outperform state-of-the-art approaches. Appendix E details how we extract those representative topologies from a large set of snapshots using Determinantal Point Process sampling [42].
GNN模型的输入大小随卫星星座的规模呈二次方增长。像“星链”这样的巨型星座拥有数千颗卫星,会导致输入规模巨大。例如,拥有4236颗卫星的“星链”,其单个数据点大小可达335 GB。这包括一个4236×4236的浮点型流量矩阵(72 GB),其中每个元素指定了源-目的对之间的需求,以及一个4236×4236的路径矩阵,每个条目包含10条预配置路径(263 GB)。这个总大小远超典型的GPU容量(例如,NVIDIA A100为80 GB)。虽然多GPU训练正在被探索,但在多个GPU上训练单个数据点及其对应的大型图并保证性能,仍然是一个开放性挑战。此外,巨型星座表现出大量变化的拓扑,对所有可能的拓扑进行训练可能会导致过拟合,并产生高昂的训练开销。
现有的基于GNN的TE方法依赖DNN层来建模与TE相关的关键关系,但这些层需要固定大小的输入维度。此外,这些DNN层无法泛化到动态和未见过的拓扑,需要频繁重新训练。相比之下,在3.2和3.3节中,SaTE引入了一种能够对整个TE问题建模的图设计,并且完全由GNN层进行学习。
- 通过利用支持注意力的GNN,SaTE可以处理可变大小的输入,从而实现了流量和路径剪枝
- 此外,GNN的归纳学习能力使得SaTE能够泛化到未见过的拓扑,从而可以在一个小的代表性拓扑集上进行训练,而不是所有可能的快照
本节详细介绍两种剪枝策略:
流量与路径剪枝: SaTE利用卫星用户分布不均的特点来剪枝流量和路径矩阵。虽然整个卫星网络提供全球覆盖,但一些卫星在经过人口稀疏地区(如沙漠)时通常接收不到任何流量。因此, 我们仅保留流量矩阵中的非零元素 ,并从训练数据集中排除流量输入为零的空闲路径。这种剪枝通过记录每个流量需求的源和目的地,仍然保留了卫星网络的空间动态性,从而保持了表示的保真度。表1显示,剪枝可将“星链”的数据量减少高达22,381倍,使单个数据点从335 GB缩小到15 MB。
这种剪枝方法仅适用于完全基于GNN的框架,无法应用于包含DNN的混合模型。这是因为DNN需要固定大小且位置特定的输入结构,一个在特定非零流量矩阵模式上训练的模型,无法泛化到活动条目位置不同的新模式。
拓扑剪枝: SaTE利用了这样一个事实:随着卫星绕地球运行,某些卫星网络拓扑快照是相似或同构的。我们不是遍历所有可能的拓扑快照,而是 在一个具有多样化结构的小型代表性拓扑集上训练模型 ,该集合的大小可以自定义。具体来说,我们在5.3节的实证研究表明,当使用512个“星链”的代表性拓扑进行训练时,模型的性能已经可以超越现有先进方法。附录E详细介绍了我们如何使用行列式点过程采样(Determinantal Point Process sampling)从大量的快照中提取这些代表性拓扑。
Implementation and Evaluation¶
SaTE Software: SaTE is implemented in Python, trained and deployed on Microsoft Azure with an NVIDIA A100 GPU, 24 cores, and 220 GB memory. SaTE’s model uses traffic matrix, network topology, and ten pre-calculated shortest paths between each source-destination pair to compute optimal traffic allocation. We implemented a satellite trajectory emulator to generate realistic network topologies based on orbital parameters from FCC official specifications [18]. Since Starlink has not disclosed its traffic traces or per-country gateway distribution, we developed an in-house satellite network simulator to produce traffic matrices for evaluation. This work does not raise any ethical issues.
SaTE软件: SaTE使用Python实现,在Microsoft Azure平台上进行训练和部署,该平台配备了NVIDIA A100 GPU、24核CPU和220 GB内存。SaTE的模型利用流量矩阵、网络拓扑以及每个源-目的对之间预先计算的十条最短路径来计算最优流量分配。我们实现了一个卫星轨迹模拟器,根据FCC官方规范中的轨道参数来生成真实的网络拓扑。由于“星链”尚未公开其流量轨迹或各国网关的分布情况,我们开发了一个自研的卫星网络模拟器来生成用于评估的流量矩阵。本项工作不涉及任何伦理问题。
Starlink Constellation: Based on the orbital parameters from FCC specifications [18], we emulate satellite trajectories using the open-source orbital mechanics emulator [63]. The emulation replicates the largest actively operating satellite constellation as of April 2024: Starlink (Phase 1), with 4,236 satellites. These satellites are distributed across four orbital shells, as detailed in Table 4 in Appendix G.
Starlink Topology: We generate the topology snapshots for Starlink based on the emulated satellite trajectories. As of March 2024, Starlink has been operating over 10,000 lasers [67, 68]. Additional details about ISLs, including the conditions under which they form and break, are provided in Sec.2.1 and 2.3.1. Our evaluation presents results for two scenarios with different types of cross-shell links. All links via lasers and ground relay have a capacity limit of 200 Mbps 5 .
Other Constellations: To evaluate SaTE’s performance on various network scales, we simulate additional constellations: (1) Iridium: A constellation of 66 satellites in a single shell at 781 km altitude, using Ka-band ISLs [34]. (2) Mid-Sized Constellations: By retaining only the first two orbital shells of Starlink and reducing the number of orbital planes by factors of 8 and 2, we created two mid-sized constellations with 396 and 1584 satellites, serving as intermediaries between Iridium and full-scale Starlink.
Satellite Traffic Matrices: Our simulations place 3 million globally distributed users and 1000 gateways based on population density [20], with adjustments using a smoothing factor to account for remote areas (detailed in Appendix G). Traffic flows include user-to-user services such as voice and video, and gateway-to-user traffic, where users access Internet data through gateways. Traffic is routed directly through satellite links, with users and gateways directly connected to satellites. The traffic parameters are detailed in Table 2.
“星链”星座: 我们基于FCC规范中的轨道参数,使用开源的轨道力学模拟器来模拟卫星轨迹。该模拟复现了截至2024年4月仍在运行的最大卫星星座:“星链”(第一阶段),拥有4236颗卫星。这些卫星分布在四个轨道层中,详情见附录G的表4。
“星链”拓扑: 我们基于模拟的卫星轨迹生成“星链”的拓扑快照。截至2024年3月,“星链”已运行超过10,000个激光器。关于星间链路(ISL)的更多细节,包括其建立和断开的条件,已在第2.1节和第2.3.1节中提供。我们的评估展示了两种不同类型跨层链路场景下的结果。所有通过激光和地面中继的链路容量限制均为200 Mbps。
其他星座: 为了评估SaTE在不同网络规模下的性能,我们还模拟了其他星座:(1) 铱星(Iridium):一个由66颗卫星组成的星座,位于781公里高度的单一轨道层,使用Ka波段的星间链路。(2) 中等规模星座:通过仅保留“星链”的前两个轨道层,并将轨道平面数量分别减少8倍和2倍,我们创建了两个中等规模的星座,分别拥有396颗和1584颗卫星,作为铱星和全规模“星链”之间的中间规模。
卫星流量矩阵: 我们的模拟根据人口密度在全球范围内部署了300万用户和1000个网关,并使用平滑因子进行调整以覆盖偏远地区(详见附录G)。流量包括用户到用户的服务(如语音和视频),以及网关到用户的流量(用户通过网关访问互联网数据)。流量直接通过卫星链路路由,用户和网关直接与卫星相连。流量参数详见表2。
In each 1-second interval 6 , a Poisson process with traffic intensity 𝜆 (flows per second) generates new flows between randomly selected user pairs or between a gateway and a user. This process introduces stochastic fluctuations in traffic generated. The parameter 𝜆 controls traffic loads. Each flow’s traffic demand is associated with the satellites directly connected to the user or gateway and is represented as a demand between the corresponding satellite pair. The traffic matrix is generated by aggregating the total demands of both new and ongoing flows between each satellite pair. Uplink and downlink capacities are set to 50 Mbps per connection.
在每个1秒的时间间隔内,一个流量强度为 \(λ\)(流/秒)的泊松过程在随机选择的用户对之间或网关与用户之间生成新的流量。该过程引入了流量生成的随机波动。参数 \(λ\) 用于控制流量负载。每个流的流量需求与其直接连接的用户或网关的卫星相关联,并表示为相应卫星对之间的需求。通过聚合每个卫星对之间所有新流和持续流的总需求,生成流量矩阵。上行和下行链路容量设置为每个连接50 Mbps。
Path Calculation: 10 shortest paths are precomputed per satellite pair for routing. Instead of recalculating all paths every interval, only those impacted by topology changes—such as newly consistent ISLs or removed links—are updated. This incremental approach detailed in Appendix C updates fewer than 2% of paths per second, with an average computation time of 56 ms.
路径计算: 每个卫星对预先计算10条最短路径用于路由。我们并非在每个时间间隔都重新计算所有路径,而是仅更新那些受拓扑变化影响的路径——例如由新建立的稳定星间链路或被移除的链路所影响的路径。这种在附录C中详述的增量式方法每秒更新的路径不到2%,平均计算时间为56毫秒。
Training and Testing Datasets: We train separate models for different satellite constellations. For each constellation, we collect 10,000 time-varying topologies at an 1-second interval as satellites move and associate them with fluctuating network flows at various levels of traffic intensity. We maintain a training to testing data size ratio of 4:1. For all experiments, the testing dataset consists of completely unseen topologies and traffic matrices from the trained model. After applying traffic and path pruning and topology pruning (Sec. 3.4) on the initial training dataset, we train SaTE on a reduced set of representative topologies and their corresponding traffic matrices.
训练与测试数据集: 我们为不同的卫星星座训练了不同的模型。对每个星座,我们以1秒的间隔收集了10,000个随卫星移动而变化的时变拓扑,并将它们与不同流量强度水平下的波动网络流相关联。我们保持训练集与测试集数据大小为4:1的比例。在所有实验中,测试数据集包含模型完全未见过的拓扑和流量矩阵。在对初始训练数据集应用了流量与路径剪枝以及拓扑剪枝(第3.4节)后,我们在一个简化的代表性拓扑集及其对应的流量矩阵上训练SaTE。
Objectives and Baselines: Our default objective is to maximize total throughput, with Appendix H.2 focusing on minimizing maximum link utilization (MLU). We compare SaTE against six competing schemes: (1) Gurobi [24]: a commercial solver widely used in TE computation, (2) POP [55]: a resource allocation method that decomposes a network into fractions with 1/𝑘 capacities and execute sub-allocation with a part of flows in each fraction for faster TE. (3) ECMP with Water Filling [35]: a scheme equally allocates traffic to shortest paths with equal hops until the paths are saturated. (4) Satellite Routing [56]: a backpressure-based routing approach for satellite networks and other intermittently connected networks [51, 64]. We compare only performance for this scheme, not computational latency, since its computation is distributed across the routers rather than on a centralized controller. (5) Teal [78]: a GNN-based TE method known for its fast inference performance in WANs. (6) HARP [2]: a GNN-based TE method for changing topologies.
目标与基准: 我们的默认目标是最大化总吞吐量,附录H.2中则关注最小化最大链路利用率(MLU)
我们将SaTE与六个对比方案进行比较:
- Gurobi:一种广泛用于TE计算的商业求解器
- POP:一种资源分配方法,将网络分解为容量为 \(1/k\) 的多个部分,并在每个部分中对一部分流量执行子分配以加速TE
- ECMP with Water Filling:一种将流量均等地分配到跳数相等的各最短路径上,直到路径饱和的方案
- Satellite Routing:一种用于卫星网络和其他间歇性连接网络的基于背压的路由方法。对于此方案,我们仅比较性能,不比较计算延迟,因为其计算是分布在各个路由器上,而非在集中式控制器上
- Teal:一种基于GNN的TE方法,以其在广域网中的快速推理性能而闻名
- HARP:一种用于变化拓扑的基于GNN的TE方法
Performance Metrics: We focus on two critical metrics:
• Computational Latency: The time required to compute the optimal traffic allocation for a given network topology, paths, and traffic matrix.
• Satisfied Demand: The percentage of total demand met (throughput divided by total traffic demand). We measure satisfied demand in online settings, accounting for TE computation delays, where the current traffic allocation remains in effect until the new one is computed. This online metric highlights the impact of computational latency on real-world network performance.
性能指标: 我们关注两个关键指标:
- 计算延迟:为给定的网络拓扑、路径和流量矩阵计算最优流量分配所需的时间
- 需求满足率:满足的总需求百分比(吞吐量除以总流量需求)
- 我们在在线场景下测量需求满足率,该场景考虑了TE计算延迟的影响,即当前的流量分配将持续生效,直到新的分配方案计算完成
- 这个在线指标突显了计算延迟对真实世界网络性能的影响
Related Work¶
Traffic Engineering and Fast Computation: With the development of software-defined networking (SDN), network operators can steer traffic and orchestrate resources from a global network view and perform online traffic control through programmable switches based on P4 [8, 39, 62]. Centralized TE solutions that use heuristic algorithms [29, 38, 71, 73] outperform conventional approaches in performance metrics, including throughput [1, 55] and link utilization [35]. As network scales increase, commercial solvers [24] require minutes, even hours to compute TE for thousands of nodes. Some TE approaches [1, 22, 55] tackles this by decomposing TE into sub-problems with reduced complexity [60]. However, these methods fail to achieve low computational latency for mega-constellations due to their polynomial increase in complexity as the network scales. While recent learningbased approaches [2, 23, 61, 78] accelerate TE for WANs via GPU parallelization, some require costly retraining when network topologies change, while others are not inherently designed for throughput maximization [2]—our primary TE objective. In contrast, SaTE rapidly computes throughput maximization for topology-changing mega-constellations.
流量工程与快速计算: 随着软件定义网络(SDN)的发展,网络运营商可以从全局网络视角引导流量和调配资源,并通过基于P4的可编程交换机执行在线流量控制。使用启发式算法的集中式TE解决方案在吞吐量和链路利用率等性能指标上优于传统方法。
随着网络规模的扩大,商业求解器为数千个节点计算TE需要数分钟甚至数小时。一些TE方法通过将TE分解为复杂度更低的子问题来解决这一挑战。然而,由于这些方法的复杂度随网络规模呈多项式增长,它们无法为巨型星座实现低计算延迟。
尽管近期的基于学习的方法通过GPU并行化加速了广域网的TE,但其中一些方法在网络拓扑变化时需要高昂的重新训练成本,而另一些方法(如HARP)并非为吞吐量最大化(我们的主要TE目标)而设计。相比之下,SaTE能够为拓扑变化的巨型星座快速计算吞吐量最大化方案。
Routing and Traffic Engineering for Satellite Networks: Recent routing protocols for satellite networks are more efficient than Dijkstra, calculating feasible routes by leveraging the grid structure of satellite networks and ISL distances [15, 45]. For constellations without direct ISLs, routing using ground relays has also been explored [27, 28]. Layered on top of routing, existing TE solutions for satellite networks manage the traffic flows on feasible routes by applying QoS constraints [4, 14, 17, 46] or congestion avoidance [16, 51, 56, 64]. However, these approaches lack centralized control and do not explicitly optimize for desirable network characteristics. [33] adopts the same graph design as [78], facing similar scalability and frequent retraining issues. In contrast, SaTE introduces a TE algorithm designed for low-latency computation and generalization to large-scale, dynamic satellite constellations.
卫星网络的路由与流量工程: 近期的卫星网络路由协议比Dijkstra算法更高效,它们利用卫星网络的网格结构和星间链路(ISL)距离来计算可行路由。对于没有直接星间链路的星座,使用地面中继的路由也得到了探索。在路由之上,现有的卫星网络TE解决方案通过应用QoS约束或拥塞避免来管理可行路由上的流量。然而,这些方法缺乏集中式控制,并且没有为期望的网络特性进行显式优化。文献[33]采用了与[78]相同的图设计,面临着类似的可扩展性和频繁重新训练的问题。相比之下,SaTE引入了一种专为低延迟计算和泛化到大规模动态卫星星座而设计的TE算法。
Discussion and Limitations¶
Fine-tuning Considerations: As shown in Sec. 5.5, SaTE’s model trained on a specific scale has the potential to generalize to other scales, with slight performance degradation. To further improve performance in such cases, appropriate fine-tuning using techniques like curriculum learning [25]which incrementally trains the model with data from other scales—may be considered. This is particularly relevant for companies gradually expanding their satellite networks.
微调考量: 如第5.5节所示,SaTE在一个特定规模上训练的模型有潜力泛化到其他规模,尽管性能会略有下降。在这种情况下,为了进一步提升性能,可以考虑使用课程学习(curriculum learning)等技术进行适当的微调——即用来自其他规模的数据增量式地训练模型。这对于那些逐步扩展其卫星网络的公司尤为重要。
Starlink’s Workflow Interval: Recent studies reveal that Starlink operates on a globally synchronized 15-second workflow interval for periodic network reconfigurations [52] However, this interval corresponds to the frequency of controlplane operations—including routing updates and policy reconfigurations—which is longer than the physical stability duration of the link topology, as observed in Fig. 4 (a). If the 15-second interval is applied to the ISL-enabled scenario considered in this paper, approximately 10–20% of ISLs become unstable or are excluded, as illustrated in Fig. 4 (b). By significantly reducing the computation bottleneck with an average runtime of 17 ms, SaTE offers the potential to shorten this interval and improve network performance.
“星链”的工作流间隔: 近期研究表明,“星链”在一个全球同步的15秒工作流间隔内运行,以进行周期性的网络重构。然而,这个间隔对应的是控制平面操作(包括路由更新和策略重构)的频率,其时长要大于链路拓扑的物理稳定持续时间,正如在图4(a)中观察到的。如果将这15秒的间隔应用于本文所考虑的支持星间链路的场景,如图4(b)所示,大约10-20%的星间链路会变得不稳定或被排除。通过将计算瓶颈显著降低到平均17毫秒的运行时间,SaTE为缩短这一间隔、提升网络性能提供了可能。
Centralized Control in Satellite Networks: SaTE’s TE computation operates within the centralized SDN for satellite networks [6, 19, 36], where satellites are managed by a centralized control center [5, 43, 83]. While the bandwidth required for control messages in such frameworks poses challenges, especially for constellations with thousands of satellites, Starlink’s planned laser links (up to 200 Gbps) provide sufficient capacity for traffic rule distribution. Additionally, strategies like improved controller placement [13], multi-controller load balancing [31], and edge-assisted traffic control [10, 77] can further mitigate bandwidth overhead.
卫星网络中的集中式控制: SaTE的TE计算在用于卫星网络的集中式SDN框架内运行,其中卫星由一个集中式控制中心管理。尽管在此类框架中,控制消息所需的带宽带来了挑战,特别是对于拥有数千颗卫星的星座,但“星链”规划的激光链路(高达200 Gbps)为流量规则的分发提供了充足的容量。此外,诸如优化的控制器部署、多控制器负载均衡以及边缘辅助的流量控制等策略可以进一步减轻带宽开销。
Conclusion¶
This paper presents SaTE, a low-latency TE algorithm designed for dynamic, large-scale satellite constellations. We present an analysis of how frequently topology changes in a large and dense satellite constellation with over thousands of nodes. We present a graph design that comprehensively models the satellite TE problem, and is learned solely using GNN layers, enabling fast TE computation without relying on additional DNN layers that may restrict its generalizability. SaTE then develops a dataset pruning method to make model training feasible on commercial GPUs. We present extensive data-driven simulation results of SaTE on Starlink. This work does not raise any ethical issues.
本文提出了SaTE,一种专为动态、大规模卫星星座设计的低延迟TE算法。我们分析了在一个拥有数千个节点的密集大型卫星星座中拓扑变化的频率。我们提出了一种图设计,该设计全面地对卫星TE问题进行建模,并仅使用GNN层进行学习,从而实现了快速的TE计算,而无需依赖可能限制其泛化能力的额外DNN层。随后,SaTE开发了一种数据集剪枝方法,使得在商用GPU上进行模型训练变得可行。我们展示了SaTE在“星链”上广泛的、数据驱动的仿真结果。本项工作不涉及任何伦理问题。