Radshield: Software Radiation Protection for Commodity Hardware in Space
Exponentially declining launch costs have led to an explosion of inexpensive satellites launched into space, often equipped with off-the-shelf chips. These chips, however, lack hardware radiation protection, leaving them vulnerable to space radiation. We thus design Radshield, a software system that protects against the two most ubiquitous and costly radiation fault scenarios: (a) radiation-induced short circuits that lead to permanent hardware failure; and (b) radiation-induced transient charges that result in single-bit silent data corruption (SDC). Radshield counters these failure scenarios with two components. First, it uses a short-circuit detector that can detect tiny increases in the device’s current draw by estimating the normal current draw when resource utilization is low. Second, it duplicates the execution of spacecraft workloads in a CPU- and memory-efficient manner, and catches SDCs even when they affect the CPU’s pipeline or cache. In our experiments, we show that Radshield is very effective at preventing both errors, and is 1.4−35.5× more power-efficient than state-of-the-art protection mechanisms in detecting SDC. Radshield is deployed on missions in low-earth orbit and in deep space.
Introduction
The last decade has seen an explosion in commercial and public interest in space exploration, driven by exponentially decreasing launch costs, depicted in Figure 1. The cost of launching a kilogram to low-earth orbit (LEO) was $88K on the Space Shuttle in 1981 and has since dropped to just $1.4K on SpaceX’s Falcon Heavy today [1]. Private and government organizations are launching spacecraft at a rapidly increasing rate [2], aiming to create large satellite constellations that support a wide variety of use cases, from Internet connectivity [3–5], Earth imaging [6, 7], and blockchain processing [8, 9] to wireless power delivery [10]. Taking advantage of decreased launch costs, LEO satellite constellations like Starlink have proven capable of providing significant performance and latency improvements over large, expensive satellites in higher orbits [11], attracting significant interest in the systems and architecture research communities [11–17].
Minimizing the unit cost of the thousands of satellites in each constellation is a key concern for operators [18]. To operate under the harsh space environment, missions traditionally use costly specialized hardware that can withstand ionizing radiation. Due to their specialized nature and their redundant hardware mechanisms, these chips are decades behind commodity ones in terms of their computational capabilities. For example, state-of-the-art radiation-hardened chips currently under development “boast” performance on the order of GFLOPS [19], while even low-power mobile chips today are capable of TFLOPS of compute [20].
Moreover, due to bandwidth limitations at ground stations [3, 4, 6–9, 21], there is increased demand to run sophisticated computations locally onboard the spacecraft [22, 23]. However, the computational requirements of such tasks cannot be met with existing radiation-hardened hardware [24].
Thus, many missions have started to deploy commodity devices without any radiation protection [25–27]. For example, many low-cost CubeSats use Linux on off-the-shelf boards like the Raspberry Pi for flight control [26]. Even SpaceX and NASA are adopting off-the-shelf hardware for their aircraft, with the SpaceX Falcon 9 rocket running Linux on x86 CPUs [26] and the Ingenuity Mars Helicopter running Linux on a Snapdragon CPU [27]. These devices are not radiation-hardened, and are thus vulnerable to radiation-induced errors such as silent data corruption (SDC) and hardware overheating, which we examine in §2.
To this end, we introduce Radshield, a system that uses software mechanisms to protect commodity hardware components from radiation effects while minimizing the performance penalty. Radshield is a collaborative effort between two spacecraft operators and academic researchers. It protects against the two most common and costly error scenarios: single-event latchups (SELs), which are radiation-induced localized short circuits that cause the device to overheat over time and eventually malfunction; and single-event upsets (SEUs), which are single-bit errors that can cause SDCs or crashes.
Previous work on mitigating these errors approaches the software and computer system being protected as a black box. For example, SEL detection relies solely on monitoring the current draw and flagging an SEL when it exceeds a threshold [28–30]. By treating the system as a black box, these approaches suffer from very high false negative or false positive rates, since they are oblivious to natural variation in current draw, such as the increase in power draw caused by an application with high CPU utilization.
The current approach to addressing SEUs is simply to run the entire computation 𝑅 times sequentially. Diverging results between runs indicate a potential SEU. However, this is computationally wasteful and multiplies the device’s energy consumption and heat generation. It also stresses the limited thermal and power envelopes of these satellites, affecting their productivity and uptime.
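As a rough illustration of the sequential baseline described above, the following minimal sketch runs a computation `r` times in a row and compares the results. The function names, the majority vote used to pick a result, and the simulated bit flip are all assumptions made for illustration; the text only states that diverging results indicate a potential SEU.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def run_redundant(compute: Callable[[], Hashable], r: int = 3) -> Tuple[Hashable, bool]:
    """Run `compute` r times sequentially and majority-vote the results.

    Returns (winning_result, diverged); `diverged` is True if any run disagreed.
    The majority vote is an illustrative choice, not taken from the paper.
    """
    results = [compute() for _ in range(r)]  # total cost is r times one run
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= r // 2:
        raise RuntimeError("no majority among redundant runs")
    return winner, votes != r


def make_faulty_sum(data, faulty_run: int) -> Callable[[], int]:
    """Hypothetical workload: sums `data`, flipping one result bit on one run."""
    calls = {"n": 0}

    def compute() -> int:
        calls["n"] += 1
        total = sum(data)
        if calls["n"] == faulty_run:
            total ^= 1 << 3  # simulate a single-bit upset in the result
        return total

    return compute


if __name__ == "__main__":
    result, diverged = run_redundant(make_faulty_sum([1, 2, 3], faulty_run=2), r=3)
    print(result, diverged)
```

The list comprehension makes the cost explicit: every extra replica adds one full run of the workload, which is the runtime and energy multiplication the paragraph criticizes.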
Instead, we show that with a white-box approach relying on software-visible metrics of system performance, Radshield can provide much more efficient error mitigation. Radshield consists of two primary components. The Idle Latchup Detector (ILD) is a high-fidelity detector for SEL events. It relies on a simple yet powerful observation: since SEL events often trigger very small changes in the system’s current draw, the only reliable time to observe such a change is when the system is idle. Therefore, ILD uses OS-visible performance counters to automatically determine when the system is naturally idle; when it is, ILD uses the performance counters and a lightweight ML model to detect, with minimal overhead, whether an SEL has occurred.
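The contrast between a black-box threshold and an idle-gated check can be sketched as follows. This is a simplified illustration, not ILD itself: a single CPU-utilization figure stands in for the performance counters, a fixed idle baseline plus margin stands in for the lightweight ML model, and every name and numeric value is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    cpu_util: float    # fraction of CPU busy, 0.0-1.0 (stand-in for performance counters)
    current_ma: float  # measured current draw in milliamps


def blackbox_alarm(s: Sample, threshold_ma: float) -> bool:
    """Black-box check: alarm whenever current exceeds a fixed threshold."""
    return s.current_ma > threshold_ma


def idle_gated_alarm(
    s: Sample,
    idle_util: float = 0.05,          # assumed definition of "idle"
    idle_baseline_ma: float = 200.0,  # assumed normal idle current
    margin_ma: float = 10.0,          # assumed tolerance above the baseline
) -> Optional[bool]:
    """Idle-gated check: only judge the current draw while the system is idle.

    Returns None when the system is busy, because workload current would mask
    a small latchup-induced increase; otherwise True if the idle current
    exceeds the expected baseline by more than the margin.
    """
    if s.cpu_util > idle_util:
        return None
    return s.current_ma > idle_baseline_ma + margin_ma


if __name__ == "__main__":
    busy = Sample(cpu_util=0.90, current_ma=650.0)      # heavy workload, no fault
    idle_sel = Sample(cpu_util=0.01, current_ma=230.0)  # idle, small extra draw
    print(blackbox_alarm(busy, 600.0), blackbox_alarm(idle_sel, 600.0))
    print(idle_gated_alarm(busy), idle_gated_alarm(idle_sel))
```

With these made-up numbers, the fixed threshold fires on the busy but healthy sample and misses the small increase at idle, while the idle-gated check abstains on the busy sample and flags the idle one.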
Radshield’s Efficient Modular Redundancy (EMR) component efficiently protects against SEU events. EMR is inspired by existing approaches that run the same computation several times [31]. Unlike existing approaches, it does so in parallel rather than sequentially, relying on the fact that operators are adopting multi-core commodity CPUs. However, one cannot naively run the computation in parallel, as redundant computations may access the same memory location in a shared cache. If radiation corrupts the shared cache, the corruption would affect all redundant runs and go undetected. To solve this problem, EMR carefully and automatically parallelizes the application to avoid the scenario where parallel tasks store the same data in shared caches at the same time.
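One way to picture the constraint described above is a staggered schedule in which, at every time step, each replica works on a different chunk of the input, so no two concurrent tasks touch the same data. The sketch below is a hypothetical illustration of that constraint only; it is not EMR's actual scheduling algorithm, it simulates the time steps sequentially rather than on separate cores, and all names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def staggered_schedule(n_chunks: int, replicas: int) -> List[List[int]]:
    """schedule[t][k] is the chunk index that replica k processes at step t.

    Replica k is offset by k chunks, so within one step all replicas work on
    distinct chunks, and over n_chunks steps each replica covers every chunk.
    """
    if replicas > n_chunks:
        raise ValueError("need at least as many chunks as replicas")
    return [[(t + k) % n_chunks for k in range(replicas)] for t in range(n_chunks)]


def run_staggered(
    chunks: Sequence, work: Callable, replicas: int = 3
) -> Tuple[List[list], List[int]]:
    """Apply `work` to every chunk once per replica, following the schedule.

    Returns (results, mismatched), where results[c][k] is replica k's output
    for chunk c and `mismatched` lists chunks whose replicas disagree.
    """
    results = [[None] * replicas for _ in chunks]
    for step in staggered_schedule(len(chunks), replicas):
        # On real hardware the tasks of one step would run on separate cores.
        for k, c in enumerate(step):
            results[c][k] = work(chunks[c])
    mismatched = [c for c, r in enumerate(results) if any(x != r[0] for x in r)]
    return results, mismatched


if __name__ == "__main__":
    print(staggered_schedule(4, 3))
    print(run_staggered([[1, 2], [3, 4], [5, 6]], sum, replicas=3))
```

Because replicas never process the same chunk in the same step, a corrupted shared copy of one chunk at a given moment can reach at most one replica's task in that step, which is what lets the final comparison expose the mismatch.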
In over 960 hours of testing using real-world workloads on a ground-based testbed, ILD was able to catch all induced SELs, and has a false positive rate of only 0.02%, with a runtime overhead of only 2% (§4.1). Similarly, EMR achieves the same reliability as state-of-the-art mitigations with an average of 63% less runtime overhead and 60% less energy consumption. Radshield’s SEL mitigation is actively being tested on LEO SmallSats, and the SEU mitigation is deployed onboard a spacecraft on the surface of Mars (§5). We will open source Radshield’s code and experiments.