Transmitting, Fast and Slow: Scheduling Satellite Traffic through Space and Time

Earth observation Low Earth Orbit (LEO) satellites collect enormous amounts of data that must be transferred first to ground stations and then to the cloud for storage and processing. Satellites today transmit data greedily to ground stations, fully utilizing the available bandwidth during each contact period. We show that, due to the layout of ground stations and orbital characteristics, this approach overloads some ground stations and underloads others, leading to lost throughput and large end-to-end latency for images. We present a new end-to-end scheduling system called Umbra, which plans transfers from large satellite constellations through ground stations to the cloud by accounting for both spatial and temporal factors, i.e., orbital dynamics, bandwidth constraints, and queue sizes. At the heart of Umbra is a new class of scheduling algorithms called withhold scheduling, wherein the sender (i.e., the satellite) selectively under-utilizes some links to ground stations. We show that Umbra’s counter-intuitive approach increases throughput by 13-31% and reduces P90 latency by 3-6×.

Introduction

The current generation of Low Earth Orbit (LEO) satellite constellations is a unique new class of mobile networking systems characterized by scale and spatio-temporal dynamics. Advancements in technology over the last decade have led to a five-fold increase in the number of LEO satellites in orbit [42]. Many companies [1, 14] have launched constellations containing hundreds of satellites to perform frequent high-resolution monitoring of Earth. These satellites orbit the Earth at low altitudes (< 1000 km above Earth) and track planet-scale events. In 2021-22 alone, LEO constellations were used to monitor the war in Ukraine [48], the Tonga volcanic eruption [18], and California forest fires [8].

Each satellite generates approximately one terabyte of data per day. A satellite communicates this data to the cloud (for analysis and distribution of Earth imagery) via multiple fixed ground stations located thousands of kilometers apart on Earth. This data transfer problem is challenging because of the scale of the imagery and because of the temporal and spatial constraints that arise from the natural orbital dynamics of LEO satellites (see Fig. 1) and the uneven layout of ground stations. The contact between a satellite and a ground station is short-lived: four to six ten-minute windows per day per satellite-ground station pair. To reduce interference from ambient signals and blockages, and to increase the number of satellite-ground station contacts, ground stations are typically located in remote regions, e.g., closer to the poles and away from large populations. This limits the backhaul bandwidth from the ground station to the cloud. Due to these factors, getting data from satellites to the cloud suffers from day-level delays.

This paper makes the following contributions:

• We discover a new phenomenon that we call the Uneven Queuing Effect or UQE (pronounced “You-k”), wherein the prevalent strategy of greedy, full-utilization data transfer from satellites to ground stations creates load imbalance across ground stations and leads to sub-optimal end-to-end throughput and latency, for both temporal and spatial reasons.

• We propose withhold scheduling, a new class of satellite-ground station transfer algorithms for LEO constellations.

• We build the Umbra data transfer system, where we design, implement, and evaluate a new withhold scheduling algorithm based on time-expanded networks.

• We perform a large-scale trace-driven simulation using data collected from a real 153-satellite constellation.

1.1 Uneven Queuing Effect (UQE)

Due to orbital dynamics, a satellite moves past a ground station receiver in less than ten minutes. Therefore, the conventional wisdom in satellite networks has been to transmit data greedily (or “fast”), i.e., send as much data as possible to a ground station during its contact, using the full available bandwidth. Naturally, the bulk of past work focuses on improving the radio design at the satellite and the ground stations so that they can maximize the amount of data transferred during the short contacts [12, 13, 41]. This line of work has made great progress, and today even small cubesats in low Earth orbit can achieve Gbps links to Earth [13].

With these advances in satellite-ground links, the status quo “fast” transmission style for data from satellites to ground stations leads to long outgoing queues (to the cloud) at some ground stations, and relatively shorter queues, and thus idling, at other ground stations. This arises for two reasons. First, ground stations have an uneven (heterogeneous) spatial distribution, due to logistical reasons involving spectrum licensing, country-wise regulations, proximity to poles, etc. This means that the amount of new data that a satellite accumulates between consecutive ground station contacts may vary widely, leading to unbalanced queues at ground stations.

We term this newly discovered phenomenon the Uneven Queuing Effect, or UQE. Fig. 2a shows an example of UQE. The satellite shown passes over consecutive ground stations A, B, and C. However, the A-B distance is longer than the B-C distance. This means the satellite collects far more data during its A-B segment than during its B-C segment. Under greedy transfer, B would receive 9 GB from the satellite, while C would receive only 1 GB. Thus, UQE leads to unbalanced queues at B vs. C.

The UQE problem is further exacerbated because different ground stations can have different backhaul bandwidths to the cloud, from hundreds of Mbps to a few Gbps. This means that outgoing queue lengths at ground stations can vary wildly across time and space. Therefore, images stuck at high-queue, low-bandwidth stations experience large delays. This situation is worsening as more compute resources are added to ground stations for “edge”-style processing, which further exacerbates load imbalance because both network delays and computational delays can be imbalanced. Therefore, even if backhaul bandwidths increase in the future, UQE will continue to back up queues.
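
To make the effect of heterogeneous backhaul concrete, the following minimal sketch computes how long each ground station needs to drain its queue after the greedy transfer of Fig. 2a. The backhaul rates (100 Mbps for B, 1 Gbps for C) are hypothetical values chosen from the range quoted above, and `drain_time_s` is an illustrative helper, not part of Umbra.

```python
def drain_time_s(queued_gb: float, backhaul_mbps: float) -> float:
    """Seconds needed to push queued_gb gigabytes through a backhaul link."""
    megabits = queued_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / backhaul_mbps


# Greedy transfer from Fig. 2a: B receives 9 GB, C receives 1 GB.
# The backhaul rates are assumptions for illustration only.
greedy_drain = {
    "B": drain_time_s(9, backhaul_mbps=100),
    "C": drain_time_s(1, backhaul_mbps=1000),
}

for station, seconds in greedy_drain.items():
    print(f"{station}: {seconds:.0f} s to drain")
```

Under these assumed rates, B stays busy for 720 s while C finishes in 8 s and then idles, which is the imbalance the paragraph above describes.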

Fig. 3 shows that, under the greedy approach (“Baseline”), UQE causes idling at some ground stations and uneven egress throughput at the cloud, while our system (“Umbra”, described soon) offers stable throughput. Section 3.3 formally proves that UQE causes quadratic growth in Greedy’s queues.

1.2 Withhold Scheduling

To counter UQE, we define a new scheduling paradigm for satellite data transfers called withhold scheduling. The key idea in withhold scheduling is to allow a satellite to selectively under-utilize a subset of its ground station contacts and intelligently withhold data for subsequent links if it identifies an opportunity for better end-to-end latency in the future. Withhold scheduling aims to equalize queue sizes across ground stations, which leads to higher throughput and lower latency for the transfer of satellite data to the cloud. Returning to Fig. 2a, if the satellite were to intelligently withhold 4 GB of data from B, B and C would each receive 5 GB. If, on the other hand, C had a 1.5× higher backhaul bandwidth (i.e., to the cloud) than B, transferring 4 GB to B and 6 GB to C would be preferable.
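
The arithmetic in the Fig. 2a example can be reproduced with a short sketch. It assumes that the preferred split is simply proportional to each station's backhaul bandwidth, which matches both numbers quoted above but is not Umbra's actual algorithm; `split_by_backhaul` is a hypothetical helper.

```python
def split_by_backhaul(total_gb, backhaul):
    """Split total_gb across stations in proportion to backhaul bandwidth."""
    total_bw = sum(backhaul.values())
    return {gs: total_gb * bw / total_bw for gs, bw in backhaul.items()}


greedy = {"B": 9, "C": 1}        # GB delivered under greedy transfer (Fig. 2a)
total = sum(greedy.values())     # 10 GB overall

# Case 1: B and C have equal backhaul bandwidth (relative units).
equal = split_by_backhaul(total, {"B": 1.0, "C": 1.0})
# Case 2: C's backhaul is 1.5x that of B.
faster_c = split_by_backhaul(total, {"B": 1.0, "C": 1.5})

print(equal, "withhold from B:", greedy["B"] - equal["B"])
print(faster_c, "withhold from B:", greedy["B"] - faster_c["B"])
```

With equal backhaul the satellite withholds 4 GB at B so that each station gets 5 GB; with the faster backhaul at C it withholds 5 GB at B, giving the 4 GB / 6 GB split.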

Withhold scheduling needs to tell each satellite when to withhold and how much to withhold. This is complex because any decision to withhold data needs to account for both spatial and temporal factors. Spatial factors include the relative positions of satellites and ground stations.

Temporal factors for withhold scheduling include: (a) the evolution of link quality and visibility over time due to the orbital motion of the satellite, and (b) the queue size variation at ground stations. Furthermore, any decision made by a satellite (say, X) to withhold data in a time slot has multiple downstream effects: (i) a different satellite (say, Y) may choose to use this slot to transfer data to the same ground station, and (ii) satellite X now needs a future slot at a different ground station (and with increased urgency).

Withhold Scheduling via Time Expanded Networks: We formulate the spatial and temporal factors uniquely using a time expanded network (or TEN). This allows us to capture spatial factors, like connections between satellites, ground stations, and the cloud, via a graph representation at each instant of time. Further, we also add holdover edges from a vertex to itself in the future, signifying the possibility of the vertex (satellite) withholding outgoing data in spite of available bandwidth. Any withhold scheduling algorithm or heuristic can be captured via this TEN.

Given this TEN, we design a polynomial-time algorithm that combines bipartite matching, max flow, and binary search. We also compare our approach against simpler heuristics. Time expanded networks have been used to plan flows in the Internet [16, 17] and in sneakernets [7]. Our work is the first to adapt them for satellite data transfers.

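The TEN is a long-established technique.

Time-Expanded Network (TEN)

It is not a physical network but a mathematical modeling technique. The core idea is to turn a network problem that changes dynamically over time into a larger but static graph problem.

Split the whole time horizon (say, one day) into discrete time slots (say, 10 minutes each). Then create a complete copy of every node for each time slot.
- A ground station A becomes \(A_{t0}\), \(A_{t1}\), \(A_{t2}\), ... \(A_{tn}\), where \(A_{t1}\) denotes "ground station A at time t1".
Create new edges: add directed edges between these node copies across time. Each edge represents an "action" or a "state transition". There are two main kinds of edges.
- Transfer edges:
  - If satellite S can transmit data to ground station A between times t1 and t2,
  - draw an edge from \(S_{t1}\) to \(A_{t2}\), whose weight is the amount of data that can be transmitted between \(t_1\) and \(t_2\).
- Holdover edges:
  - Draw an edge between consecutive time copies of the same node.
  - For example, the edge from \(S_{t1}\) to \(S_{t2}\) represents the action of "waiting" or "holding" data.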
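
The construction above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Umbra implementation: the node naming, the `SRC`/`CLOUD` terminals, the 3 GB-per-slot backhaul, and the helper names `build_ten` and `max_flow` are all hypothetical. It models the Fig. 2a scenario (9 GB on board at the B contact, 1 GB more by the C contact) and uses a plain Edmonds-Karp max flow to compare a satellite that may hold data over against one that may not.

```python
from collections import defaultdict, deque

INF = float("inf")


def build_ten(num_slots, generated, contacts, backhaul, sat_holdover):
    """Return capacities cap[u][v] of a time-expanded network.

    Node (x, t) is the copy of satellite or ground station x at time t.
    """
    cap = defaultdict(lambda: defaultdict(float))
    sats = {sat for sat, _ in generated}
    for (sat, t), gb in generated.items():
        cap["SRC"][(sat, t)] += gb                  # data on board at time t
    for sat, gs, t, gb in contacts:
        cap[(sat, t)][(gs, t + 1)] += gb            # transfer edge
    for t in range(num_slots):
        for sat in sats:
            cap[(sat, t)][(sat, t + 1)] += sat_holdover  # holdover (withhold)
        for gs in backhaul:
            cap[(gs, t)][(gs, t + 1)] = INF         # ground-station queue
    for gs, gb in backhaul.items():
        for t in range(num_slots + 1):
            cap[(gs, t)]["CLOUD"] += gb             # backhaul to the cloud
    return cap


def max_flow(cap, src="SRC", dst="CLOUD"):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    residual = defaultdict(lambda: defaultdict(float))
    for u in cap:
        for v, c in cap[u].items():
            residual[u][v] += c
    total = 0.0
    while True:
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:          # BFS over residual graph
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if dst not in parent:
            return total                            # no augmenting path left
        path, v = [], dst
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        total += push


scenario = dict(
    num_slots=2,
    generated={("S", 0): 9, ("S", 1): 1},           # GB, as in Fig. 2a
    contacts=[("S", "B", 0, 9), ("S", "C", 1, 9)],  # (sat, station, slot, GB)
    backhaul={"B": 3, "C": 3},                      # GB per slot (assumed)
)
with_holdover = max_flow(build_ten(sat_holdover=INF, **scenario))
greedy_only = max_flow(build_ten(sat_holdover=0, **scenario))
print(with_holdover, greedy_only)
```

In this toy instance, 9 GB reach the cloud within the horizon when the satellite may use holdover edges, versus 7 GB when it must hand everything over at each contact, because station B's backhaul becomes the bottleneck while C sits mostly idle.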

We build Umbra, a new system for scheduling data transfers from satellite constellations to the cloud via ground stations. Umbra accounts for the dynamics of satellite motion, back-end bandwidth constraints of ground stations, and queue sizes at ground stations. We implement the Umbra scheduler, and our trace-driven evaluation uses 6 million images captured over 15 days by the Planet Dove constellation [37], which comprises 153 satellites. This data is the real set of images collected by the constellation. We simulate the orbital dynamics of the satellites and a ground station layout by using Planet’s published and frequently updated orbital information [25, 29]. Our paper is, to the best of our knowledge, the largest evaluation performed using data collected by a real operational satellite constellation. Our evaluation shows that the Umbra scheduler improves the satellite constellation’s throughput by 13-31% and its 90th percentile latency by 3-6× compared to a greedy baseline and a heuristic-based scheduler.
