Towards Global Outage Detection for LEO Networks
Low Earth Orbit (LEO) satellite networks are increasingly deployed, yet users continue to experience frequent, short-lived outages. We present Roman-HitchHiking, a system for measuring LEO satellite outages globally and in near-real time. Roman-HitchHiking significantly reduces the measurement overhead by leveraging path redundancy to eliminate duplicate probes to shared pre-satellite routers, thereby reducing overall network traffic and increasing coverage. With Roman-HitchHiking, we identify large clusters of simultaneous outages across geographically diverse regions, pointing to potential centralized failures that traditional outage detection systems overlook. Roman-HitchHiking is open-sourced to enable reproducibility and foster further research on LEO outages.
Introduction
Despite the growing deployment of Low Earth Orbit (LEO) satellite networks, users continue to experience frequent outages that differ in nature from those observed in terrestrial networks [19]. Disruptions in LEO connectivity stem from satellite mobility, the dish’s susceptibility to obstructions, and sporadic geomagnetic storms [11, 22]—all of which can trigger outages at both the individual and regional levels [16, 19]. Understanding whether particular regions and customers are predisposed to outages would allow researchers to determine user quality of experience, identify disparities in service reliability, and prioritize improvements for the most affected populations. However, to date, there has been no systematic global way to study outages in LEO satellite networks because such analysis requires high-resolution data captured in near-real time across a wide geographic scale.
Conventional outage detection systems designed for terrestrial networks are inadequate for analyzing outages affecting LEO satellite networks [12, 23]. These systems focus on identifying large-scale, long-duration outages, whereas LEO satellite disruptions often occur at a much finer timescale, on the order of seconds. While Izhikevich et al. [13] introduced HitchHiking, a methodology for measuring global Starlink customer latency, we demonstrate that this approach does not scale to the demands of collecting global, simultaneous outage data. In practice, naively running HitchHiking leads to massive packet loss.
In this work, we introduce a methodology to measure LEO satellite outages at a global scale: Roman-HitchHiking. Roman-HitchHiking solves HitchHiking’s packet loss problem by capitalizing on a key insight: widespread path redundancy in the network allows us to reduce the number of paths we need to probe. Roman-HitchHiking is inspired by the idiom “all roads lead to Rome”: at any given moment, if a destination just before the satellite hop is reachable (or unreachable) via one network path, it is highly likely to be reachable (or unreachable) via many others.
Roman-HitchHiking reduces measurement-induced packet loss from 73.4%, as seen in prior methods, to under 0.01%, enabling over four orders of magnitude greater coverage for simultaneously measuring customer outages. We apply Roman-HitchHiking to collect measurements from Starlink [26], the largest LEO satellite provider, over a three-day period. Our analysis reveals that large clusters of simultaneous outages account for most disruptions, with Australia consistently among the most impacted areas. We open-source Roman-HitchHiking at https://github.com/UCLA-SCaN/roman_hitchhiking and hope it serves as a foundation for future research on LEO network outages.
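To make the path-redundancy insight concrete, the following minimal Python sketch compares the number of probes sent per measurement round with and without deduplication. It is an illustration only, not the released implementation: the function names, the `client_to_router` mapping, and the example addresses (drawn from documentation ranges) are all hypothetical, and we assume the mapping from each client to its pre-satellite router is already known from earlier traceroutes.

```python
from collections import defaultdict


def plan_probes(client_to_router):
    """Plan one measurement round that pings each pre-satellite router once.

    client_to_router: dict mapping a client IP to the IP of its
    pre-satellite hop (hypothetical input, assumed already discovered).
    Returns (router_targets, client_targets, clients_by_router).
    """
    clients_by_router = defaultdict(list)
    for client, router in client_to_router.items():
        clients_by_router[router].append(client)
    # Each shared router appears once, no matter how many clients sit behind it.
    router_targets = sorted(clients_by_router)
    client_targets = sorted(client_to_router)
    return router_targets, client_targets, dict(clients_by_router)


def probe_counts(client_to_router):
    """Return (naive, deduplicated) probe counts for one round."""
    # Naive approach: one router ping plus one client ping for every client.
    naive = 2 * len(client_to_router)
    routers, clients, _ = plan_probes(client_to_router)
    deduplicated = len(routers) + len(clients)
    return naive, deduplicated


if __name__ == "__main__":
    example = {
        "192.0.2.1": "198.51.100.1",
        "192.0.2.2": "198.51.100.1",  # shares a router with 192.0.2.1
        "192.0.2.3": "198.51.100.2",
    }
    print(probe_counts(example))
```

In this toy example two of the three clients share a pre-satellite router, so one router probe is saved per round. The saving grows with the number of clients behind each shared router.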
Background
To enable global, scalable outage detection in LEO satellite networks, a low-barrier, widely deployable technique is essential. We begin by reviewing an existing methodology, HitchHiking, and exploring how it might be naively adapted for outage detection. In Section 3, we experimentally demonstrate the scalability limitations of the HitchHiking approach, which motivates the development of Roman-HitchHiking.
Defining Outages. Starlink’s Service Level Agreement defines an “outage” as “...a period where the Starlink is unable to send/receive pings to/from servers at a Starlink Point of Presence” [27]. In this paper, we employ a narrower definition and focus on outages that occur within the satellite link path rather than any outage that occurs within the Starlink network path. In particular, we define an outage as a period of time in which the Starlink client is unresponsive to the Starlink pre-satellite hop router.
HitchHiking Overview. LEO HitchHiking, introduced by Izhikevich et al. [13], infers network characteristics by probing publicly accessible satellite-routed devices. As illustrated in Figure 1, HitchHiking first identifies the pre-satellite hop using ICMP Paris traceroute. It maps the path from the last visible pre-satellite router to the exposed Starlink client IP (e.g., Client Dish 1, 2, 3). A TTL-limited ping is then used to isolate latency at specific hops: one ping targets the pre-satellite hop, and another the client dish. The difference in latency estimates the satellite link delay. We adopt this technique to identify pre-satellite hops and isolate the satellite segment in our methodology.
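The latency subtraction described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not the HitchHiking code: the function name and the convention of representing a lost probe as `None` are hypothetical.

```python
from typing import Optional


def satellite_link_delay_ms(
    pre_sat_rtt_ms: Optional[float],
    client_rtt_ms: Optional[float],
) -> Optional[float]:
    """Estimate the satellite-link delay from one pair of pings.

    pre_sat_rtt_ms: round-trip time to the pre-satellite hop (TTL-limited ping).
    client_rtt_ms:  round-trip time to the client dish.
    A lost probe is represented as None (an assumption of this sketch),
    in which case no estimate can be made.
    """
    if pre_sat_rtt_ms is None or client_rtt_ms is None:
        return None
    # The shared terrestrial portion of the path cancels out in the difference.
    return client_rtt_ms - pre_sat_rtt_ms


if __name__ == "__main__":
    print(satellite_link_delay_ms(20.0, 55.5))
    print(satellite_link_delay_ms(20.0, None))
```

Both pings traverse the same path up to the pre-satellite router, so their difference attributes the remaining delay to the satellite segment.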
HitchHiking Outages. HitchHiking can be naively applied to identify outages. For instance, in Figure 1, Client Dish 1 experiences an outage if the probe Ping1_p is responsive but Ping1_e is not. Because we conduct only one-way measurements from the “outside-in,” an unresponsive client may be unable either to receive our pings or to send replies, and we cannot distinguish between the two cases. Even if the endpoint receives the ping, the absence of a response is still an outage by our definition. While filtering policies could in theory explain missing responses, we observe that clients respond to probes both before and after the outages presented in Section 4. This transient behavior makes persistent filtering rules an unlikely explanation.
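The classification rule above, together with our definition of an outage as a period of unresponsiveness, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the released tool: the names `Status`, `classify`, and `outage_periods` are hypothetical, and treating the case where neither probe is answered as "unknown" (because the loss cannot be attributed to the satellite link) is our own choice for this sketch.

```python
from enum import Enum


class Status(Enum):
    UP = "up"
    OUTAGE = "outage"
    UNKNOWN = "unknown"


def classify(pre_sat_replied: bool, client_replied: bool) -> Status:
    """Classify one probe pair (pre-satellite hop ping, client ping)."""
    if client_replied:
        return Status.UP
    if pre_sat_replied:
        # Router answers but the client does not: outage on the satellite link.
        return Status.OUTAGE
    # Neither answered: the problem may lie before the satellite hop
    # (assumption of this sketch: do not count it as a client outage).
    return Status.UNKNOWN


def outage_periods(samples):
    """Merge consecutive OUTAGE samples into (start, end) periods.

    samples: time-ordered list of (timestamp, pre_sat_replied, client_replied).
    """
    periods = []
    start = last = None
    for ts, pre_sat_replied, client_replied in samples:
        if classify(pre_sat_replied, client_replied) is Status.OUTAGE:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            periods.append((start, last))
            start = None
    if start is not None:
        periods.append((start, last))
    return periods


if __name__ == "__main__":
    samples = [
        (0, True, True),
        (1, True, False),
        (2, True, False),
        (3, True, True),
        (4, False, False),
        (5, True, False),
    ]
    print(outage_periods(samples))
```

A client that responds before and after a run of missed probes, as in the example, produces a bounded outage period rather than a permanently unresponsive target, which matches the transient behavior that distinguishes outages from filtering.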
Related Work
While prior work has laid the groundwork for network measurement and macroscopic outage detection, none addresses the need for a scalable, fine-grained system capable of capturing second-scale LEO satellite outages across a global footprint. This is the gap our methodology aims to fill.
LEO satellite measurement systems have primarily relied on community-driven platforms (e.g., RIPE Atlas [1]), research testbeds and simulations (e.g., LEOScope [17], Hypatia [24], and StarryNet [15]), and industry tools (e.g., Ookla SpeedTest [21], M-Lab [10]). While valuable for characterizing performance metrics like latency and throughput, these systems were not built to detect outages and are limited by sparse vantage points and coarse temporal resolution.
In contrast, outage detection systems developed by both academia and industry have focused on long-duration disruptions. For example, IODA [12] uses BGP updates, darknet traffic, and active probing to detect macroscopic Internet outages, probing /24 blocks every 10 minutes. Similarly, Hubble [14] targets persistent Internet reachability issues lasting over 15 minutes. USC’s ANT Lab [28] has built large-scale infrastructure (e.g., Trinocular [23]) to monitor IPv4 space through extensive active probing at 11-minute granularity.
Industry platforms like Cloudflare Radar [7] and Cisco ThousandEyes [6] also monitor broad outages, but with limited temporal granularity (typically 10–15 minutes) and often proprietary data. Downdetector [3] aggregates user-reported issues but provides only coarse, daily time-series data, limiting its utility for fine-grained analysis. Moreover, few of these tools support public access or offer visibility into customer-level outages.
Conclusion
We present Roman-HitchHiking, a scalable methodology for measuring customer-level outages in LEO satellite networks at a global scale and in near-real time. By leveraging path redundancy, Roman-HitchHiking reduces packet loss by four orders of magnitude compared to prior approaches. We analyze Starlink customer outage patterns over three days and find large clusters of simultaneous outages across geographically diverse regions. We open-source Roman-HitchHiking to support the future study of global LEO outages.