EXAMINING A FEW LEO PATHS
We first analyze connectivity between a few GS-pairs in depth to give a view of how an end-end connection behaves.
RTT fluctuations
We examine how the end-end RTTs vary over time. These experiments use the Kuiper K1 shell. We run the analysis for 200 seconds, as for Kuiper-scale networks this is sufficient to show nearly the full range of variations.
For each source-destination pair, 𝑠-𝑑, 𝑠 sends 𝑑 a ping every 1 ms, and logs the response time. We also compare the measured RTTs to those generated using networkx computations for the same end-points and the same constellation. For these networkx computations, we use snapshots of the system every 100 ms, and compute the shortest paths using the Floyd-Warshall algorithm. Analysis based on such computations has already appeared in recent work [5, 29]; we use it both as a validation for some of our simulator’s satellite-specific code, and to highlight and explain the subtle differences that actual packets sometimes experience compared to paths computed from a static snapshot.
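As a concrete illustration of this snapshot computation, here is a minimal sketch (not Hypatia's actual code) that recomputes the shortest-path RTT every 100 ms with networkx; `build_snapshot_graph` is a hypothetical helper returning a graph whose edge weights are one-way propagation delays at the given time.

```python
# Minimal sketch of the snapshot-based RTT computation described above
# (illustrative only, not Hypatia's code). build_snapshot_graph(t_ms) is a
# hypothetical helper returning a networkx graph whose edges carry one-way
# propagation delays (ms) between satellites and ground stations at time t_ms.
import networkx as nx

def computed_rtt_series(src, dst, duration_s=200, step_ms=100):
    """Return a list of (t_ms, rtt_ms) pairs, one per 100 ms snapshot."""
    series = []
    for t_ms in range(0, duration_s * 1000, step_ms):
        G = build_snapshot_graph(t_ms)  # hypothetical helper
        # All-pairs shortest paths via Floyd-Warshall, as in the text;
        # for a single pair, Dijkstra would give the same answer.
        dist = nx.floyd_warshall(G, weight="delay_ms")
        one_way = dist[src][dst]
        rtt = 2 * one_way if one_way != float("inf") else float("inf")
        series.append((t_ms, rtt))
    return series
```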
Fig. 3 shows the results for three 𝑠-𝑑 pairs. The ping measurements from Hypatia (‘Pings’) and the snapshot computations from networkx (‘Computed’) match closely for most of the time. For instance, in Fig. 3(a) at 𝑡 = 32.9 s the path changes, which causes the RTT to rise from 96 ms to 111 ms. Occasionally, like in Fig. 3(c) around 130 seconds, we see spikes in the ping RTT compared to networkx. These spikes result from forwarding state changes across the path: as a packet travels on what was the shortest path when it departed the source, the packet arrives at some satellite no longer on the new shortest path, as satellites have moved. This results effectively in the packet having taken a detour compared to the instant path computation in networkx.
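To make the detour effect concrete, the sketch below (an illustration under simplified assumptions, not the simulator's forwarding logic) computes the delay seen by a packet that follows the departure-time shortest path up to some hop and the updated shortest path from there on; the result can exceed the snapshot shortest-path delay, which is what the ping spikes show.

```python
# Illustrative sketch of the detour effect: a packet follows its
# departure-time shortest path up to hop k, after which updated forwarding
# state sends it along the new shortest path toward the destination.
import networkx as nx

def path_delay(G, path, weight="delay_ms"):
    """Sum of edge weights along a node path."""
    return sum(G[u][v][weight] for u, v in zip(path, path[1:]))

def detour_delay(G_old, G_new, src, dst, k, weight="delay_ms"):
    old_path = nx.shortest_path(G_old, src, dst, weight=weight)
    k = min(k, len(old_path) - 1)
    prefix = old_path[: k + 1]                  # hops already traversed
    rest = nx.shortest_path(G_new, prefix[-1], dst, weight=weight)
    # Can exceed the shortest-path delay of either snapshot alone,
    # which shows up as a spike relative to the networkx computation.
    return path_delay(G_old, prefix) + path_delay(G_new, rest)
```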
The path from Rio de Janeiro to St. Petersburg sees a disruption around 150 seconds into the simulation, shown as the shaded region in all related plots. We found that for this period, St. Petersburg does not have any visible Kuiper satellites at sufficiently high angle of elevation, which, obviously, results in the satellite network path being disconnected. For Kuiper, its other two shells do not address this missing connectivity either; high-latitude cities like St. Petersburg will not see continuous connectivity over Kuiper.
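The disconnection comes down to a simple visibility test: a ground station can only use satellites above the constellation's minimum elevation angle. Below is a sketch of that test under a spherical-Earth approximation (illustrative only; the threshold, e.g., 35° for Kuiper, is a parameter of the constellation).

```python
# Sketch of the visibility test implied above (spherical-Earth approximation,
# illustrative only): a ground station can use a satellite only if the
# satellite's elevation angle exceeds the constellation's minimum.
import numpy as np

def elevation_deg(gs_ecef, sat_ecef):
    """Elevation angle (degrees) of a satellite seen from a ground station,
    with both positions given as ECEF coordinate vectors in meters."""
    gs = np.asarray(gs_ecef, dtype=float)
    los = np.asarray(sat_ecef, dtype=float) - gs   # line-of-sight vector
    # sin(elevation) = LoS component along the local vertical (≈ gs direction)
    sin_el = np.dot(los, gs / np.linalg.norm(gs)) / np.linalg.norm(los)
    return np.degrees(np.arcsin(np.clip(sin_el, -1.0, 1.0)))

def visible(gs_ecef, sat_ecef, min_elevation_deg=35.0):
    return elevation_deg(gs_ecef, sat_ecef) >= min_elevation_deg
```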
For the other two paths, there are smaller but still substantial variations in the RTT over time. Across time, the Manila-Dalian path has a minimum RTT of 25 ms and a maximum RTT of 48 ms, thus changing by nearly 2×. For the Istanbul-Nairobi path, this RTT range is 47-70 ms.
For real-time applications that care about jitter, these variations could necessitate a somewhat large “jitter buffer” to store and deliver packets to the application at an even rate. The determining latency in such cases will be the maximum latency of an end-end connection over time.
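A back-of-the-envelope sketch of this point, using the RTT ranges above and assuming (as an approximation not stated in the text) that one-way delay is roughly half the RTT:

```python
# Back-of-the-envelope jitter-buffer sizing (illustrative; assumes one-way
# delay ≈ RTT/2 and ignores queueing). The buffer must absorb the spread
# between the minimum and maximum path delay, so the application's effective
# latency is set by the maximum delay, not the minimum.
def playout_latency_ms(rtt_min_ms, rtt_max_ms):
    owd_min, owd_max = rtt_min_ms / 2, rtt_max_ms / 2
    jitter_buffer = owd_max - owd_min    # depth needed for an even playout rate
    return owd_min + jitter_buffer       # == owd_max

print(playout_latency_ms(25, 48))   # Manila-Dalian: 24.0 ms, i.e., the max one-way delay
print(playout_latency_ms(47, 70))   # Istanbul-Nairobi: 35.0 ms
```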
Takeaway for applications
The maximum end-end RTT over time can be much higher than the minimum, and will determine the latency for jitter-sensitive applications.
Congestion control, absent congestion
We also explore how congestion control works on changing satellite paths. For this, we first use a congestion-free setting: the measured end-end connection is the only one sending traffic, with the rest of the network being entirely empty.
Fig. 3 also includes the per-packet RTT observed by TCP (NewReno) packets. This TCP observed RTT is calculated as the time difference between sending a packet and receiving its ACK. As expected, TCP continually fills and drains the buffer, thus increasing the RTT. To make the simulations faster, the shown experiments use a 10 Mbps line-rate. The buffers are sized to 100 packets, i.e., 1 bandwidth-delay product (BDP) for a 100 ms RTT. With higher rate, we expect the same trend, with a smaller increase in RTTs as queues drain faster.
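The buffer sizing quoted above works out roughly as follows (a back-of-the-envelope check assuming ~1500-byte packets, which is not stated in the text):

```python
# Rough check of the buffer sizing above (illustrative; assumes ~1500-byte packets).
RATE_BPS = 10e6          # 10 Mbps line rate used in the experiments
RTT_S = 0.100            # ~100 ms end-to-end RTT
PKT_BYTES = 1500

bdp_bytes = RATE_BPS * RTT_S / 8          # = 125,000 bytes
bdp_packets = bdp_bytes / PKT_BYTES       # ≈ 83 packets
print(bdp_packets)  # ~83, so a 100-packet buffer is roughly 1 BDP at 100 ms
```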
Fig. 4 shows the TCP congestion window evolution for the same 3 connections over the same period. The instantaneous BDP, aggregated with queue capacity, i.e., BDP+Q, is also shown at each point in time – this is the maximum number of packets that can be in-flight without drops (assuming there is one bottleneck). The network device queue size, 𝑄 , for both ISLs and GSLs is set to 100 packets. For the times when BDP+Q is stable, TCP, as expected, repeatedly hits it, incurs a drop, cuts the rate, and ramps up again. But the changes in RTT, and thus BDP+Q, result in TCP changing its behavior. The disconnection event for St. Petersburg is evident from Fig. 4(a), but additionally, we can see drops in the congestion window for the other connections too, e.g., in Fig. 4(c), around 140 s, TCP drops the congestion window because of packet reordering. At this time, as the path is shortened by ∼16 ms, packets transmitted later use the new shorter path, and arrive first at the destination. The resulting duplicate ACKs are interpreted as loss by the sender. The TCP RTT oscillations at the right end of Fig. 3(a) and 5(a) are caused by delayed acknowledgements. We checked that disabling delayed ACKs eliminates these, but does not change the rest of the observed behavior, which is our focus.
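The window drop around 140 s is TCP's standard duplicate-ACK heuristic firing on reordering rather than loss; the toy sketch below (not ns-3 code) shows how out-of-order arrivals on the shortened path generate enough duplicate ACKs to trigger fast retransmit.

```python
# Minimal sketch of why reordering looks like loss to a NewReno sender
# (illustrative, not ns-3 code): out-of-order arrivals generate duplicate
# ACKs, and three duplicates trigger fast retransmit and a window cut even
# though nothing was dropped.
def dupacks_from_arrivals(arrival_order, expected=0):
    """Return the cumulative ACK sent in response to each arriving segment."""
    received, acks = set(), []
    for seq in arrival_order:
        received.add(seq)
        while expected in received:
            expected += 1
        acks.append(expected)            # cumulative ACK: next segment expected
    return acks

# Segments 3..6 take the new, ~16 ms shorter path and overtake segments 1-2:
acks = dupacks_from_arrivals([0, 3, 4, 5, 6, 1, 2])
print(acks)            # [1, 1, 1, 1, 1, 2, 7] -> four duplicate ACKs for segment 1
dup = sum(1 for a, b in zip(acks, acks[1:]) if a == b)
print(dup >= 3)        # True: the sender infers a loss and cuts its window
```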
Review: Delayed ACKs
In TCP, delayed ACKs (delayed acknowledgements) are an efficiency mechanism: the receiver holds back an acknowledgement for a short time in the hope of receiving more segments, and then acknowledges them together. This reduces the number of ACKs on the network by avoiding an immediate ACK for every single segment.
- Immediate ACKs: when a segment arrives, the receiver sends an ACK right away to confirm receipt.
- Delayed ACKs: the receiver holds the ACK, typically for up to 200 ms or until a second segment arrives; if a second segment does arrive within that window, a single ACK confirms both.
Why use delayed ACKs?
- Fewer ACKs: with many small segments, acknowledging each one individually generates a large volume of ACK traffic. Coalescing several acknowledgements into one reduces the ACK count and the load on the network.
- Higher throughput: when the bandwidth and delay suit delayed ACKs, the sender can sustain a larger effective window and thus higher throughput.
Example
Suppose two segments, A and B, arrive at the receiver:
- Immediate ACKs: after receiving A, the receiver immediately sends ACK(A+1); after receiving B, it immediately sends ACK(B+1). Every arrival triggers its own ACK.
- Delayed ACKs: after receiving A, the receiver does not ACK right away but waits, say up to 200 ms. Once B arrives, it sends a single ACK covering both A and B, so only one ACK is sent in that window instead of two.
Role in the paper
The paper attributes the TCP RTT oscillations to delayed ACKs: disabling them removes the oscillations but leaves the rest of the observed behavior unchanged, so delayed ACKs are not the key factor behind the phenomena the paper focuses on.
Why can delayed ACKs cause problems?
Delayed ACKs can hurt TCP performance in some situations. For example:
- Extra RTT variation: because the receiver holds acknowledgements for a while, the sender may not hear back promptly; it may then conclude that the network is slow or lossy and trigger unnecessary retransmissions or window adjustments.
- Misinterpreted loss: in some cases, especially when packets are reordered, the resulting duplicate ACKs (as in the paper) lead the sender to conclude that a loss occurred and to invoke congestion control.
In short, delayed ACKs are a TCP optimization that reduces ACK traffic and improves efficiency, but they can also cause RTT oscillations and spurious loss inferences, particularly when paths change or packets arrive out of order.
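A stripped-down sketch of the receiver-side decision described above (illustrative only; real TCP stacks add further conditions, and 200 ms is the classic upper bound on how long an ACK may be held):

```python
# Stripped-down sketch of a receiver's delayed-ACK decision (illustrative).
DELACK_TIMEOUT_MS = 200   # classic upper bound on holding back an ACK
DELACK_COUNT = 2          # ACK at least every second full-sized segment

class DelayedAckReceiver:
    def __init__(self):
        self.unacked_segments = 0
        self.timer_ms = None

    def on_segment(self, now_ms):
        """Return True if an ACK should be sent immediately for this segment."""
        self.unacked_segments += 1
        if self.unacked_segments >= DELACK_COUNT:
            self.unacked_segments, self.timer_ms = 0, None
            return True                       # second segment: ACK right away
        self.timer_ms = now_ms + DELACK_TIMEOUT_MS
        return False                          # hold the ACK, wait for more data

    def on_timer(self, now_ms):
        """Return True if the held ACK must be flushed because the timer fired."""
        if self.timer_ms is not None and now_ms >= self.timer_ms:
            self.unacked_segments, self.timer_ms = 0, None
            return True
        return False
```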
TCP’s filling up of buffers and the resulting deterioration in per-packet latency is a widely recognized problem [3, 11, 27]. For LEO networks that promise low-latency operation, this is perhaps even more undesirable. We thus also test delay-based transport by repeating the same experiments, except using TCP Vegas. Note that the algorithms are not competing with each other, rather, each transport is tested entirely separately, i.e., without any competing traffic – the issue of Vegas not being aggressive enough against Reno or Cubic is entirely orthogonal and immaterial here. Any transport implementable in ns-3 can be evaluated in Hypatia.
Fig. 5 shows the behavior of both NewReno and Vegas for one of the paths, Rio de Janeiro to St. Petersburg. Across the 200 s simulations, the per-packet RTT is shown in Fig. 5(a), the congestion window in Fig. 5(b), and the achieved throughput averaged over 100 ms intervals in Fig. 5(c). Vegas, as expected, often operates with a near-empty buffer, e.g., until around 140 s, it matches the ping RTT measurements in Fig. 3(a) closely. Unfortunately, however, Vegas interprets the increase in latency at ∼33 s as a sign of congestion, drastically cuts its congestion window (Fig. 5(b)), and achieves very poor throughput (Fig. 5(c)) after this point.
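This failure mode follows from how Vegas estimates queueing: it uses the minimum RTT seen so far as the propagation baseline, so a propagation-delay jump caused by a path change is indistinguishable from a queue building up. Below is a simplified per-RTT update (illustrative; thresholds differ across implementations):

```python
# Simplified Vegas-style window update (illustrative; thresholds vary across
# implementations, e.g., alpha=2 and beta=4 in some stacks). base_rtt is the
# minimum RTT observed so far, which Vegas treats as pure propagation delay.
ALPHA, BETA = 2, 4   # target range for packets estimated to be queued

def vegas_update(cwnd, base_rtt_ms, rtt_ms):
    expected = cwnd / base_rtt_ms                # rate if nothing is queued
    actual = cwnd / rtt_ms                       # rate actually achieved
    queued = (expected - actual) * base_rtt_ms   # estimated packets in queue
    if queued < ALPHA:
        return cwnd + 1                          # room in the network: grow
    if queued > BETA:
        return cwnd - 1                          # "queue building": back off
    return cwnd

# A path change that raises propagation RTT from 96 ms to 111 ms looks, to
# Vegas, like a large standing queue, so it keeps shrinking the window even
# though no queue exists:
cwnd = 100.0
for _ in range(5):
    cwnd = vegas_update(cwnd, base_rtt_ms=96.0, rtt_ms=111.0)
print(cwnd)   # steadily decreasing
```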
We tested NewReno and Vegas primarily because they are two well-known algorithms using loss- and delay-based congestion detection, and are already implemented in ns-3. However, Hypatia can be used with any congestion control algorithm implemented in ns-3. For instance, once a mature implementation of BBR [11] is available, evaluating its behavior on LEO networks would be of high interest. As of this writing, while there are some BBR implementations available online [17, 39], these have not been merged into ns-3, and we did not invest effort in testing these.
Our above results highlight challenges for congestion control in LEO networks: both loss and delay are poor signals of congestion in this setting. Loss, besides suffering from its well-known problem of only arising after buffers are full and latencies are inflated, is additionally vulnerable to being inferred incorrectly due to reordering. On the other hand, delay is also an unreliable signal because delay fluctuations occur even without queueing. This makes congestion control in this setting a difficult problem. Of course, if the sender knows the satellite path’s variations, they can “subtract” them out and adapt. However, in general, the end-points need not even be aware that they are using a satellite path: an end-point that is directly connected to a fixed connection could have its traffic sent to the nearest ground station by its ISP, as suggested in recent work [26]. Solutions like splitting the transport connection are also becoming difficult to support with transports such as QUIC that do not support man-in-the-middle behavior.
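If a sender did know the path's propagation delay over time (e.g., from satellite ephemerides), the "subtracting out" mentioned above would amount to using the residual delay, rather than the raw RTT, as the congestion signal; a purely hypothetical sketch of the idea, not a proposed or evaluated design:

```python
# Hypothetical sketch of "subtracting out" known path variation: only the
# residual above the known propagation RTT is treated as a congestion signal.
def queueing_delay_ms(measured_rtt_ms, known_propagation_rtt_ms):
    return max(0.0, measured_rtt_ms - known_propagation_rtt_ms)

# With the raw RTT, a path change from 96 ms to 111 ms looks like 15 ms of
# queueing; with the known propagation delay subtracted, it looks like none.
print(queueing_delay_ms(111.0, 96.0))    # naive baseline: 15.0 ms "queueing"
print(queueing_delay_ms(111.0, 111.0))   # ephemeris-aware baseline: 0.0 ms
```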
Takeaway for congestion control
Both loss and delay can be poor signals for congestion control in LEO networks.
Review: QUIC vs. TCP vs. UDP
QUIC, TCP, and UDP are three transport-layer protocols with different design goals and characteristics. Their commonalities and differences:
Commonalities
- Transport-layer protocols: QUIC, TCP, and UDP all operate at the transport layer, taking data from the application layer and delivering it to the right endpoint on the destination host.
- End-to-end communication: all three are designed for end-to-end delivery from source to destination, without directly handling network-layer routing or physical-layer details.
- Basic error detection: all three carry checksums for basic error detection, but how much flow control and error handling they provide beyond that differs greatly (see below).
Differences
- QUIC (Quick UDP Internet Connections)
  - Protocol type: runs on top of UDP.
  - Connection setup: QUIC supports 0-RTT handshakes for connections to servers it has talked to before, so a resuming client can send application data in its very first flight; even a fresh connection completes transport and cryptographic setup in a single round trip. This greatly reduces connection-setup latency.
  - Encryption: TLS 1.3 is built into the handshake, so every connection is encrypted by design.
  - Multiplexing: multiple streams are carried independently over one connection, avoiding TCP-style head-of-line blocking; a loss on one stream does not stall the others.
  - Congestion control and flow management: QUIC reuses TCP-style congestion control, with refinements that make it better suited to lossy, high-latency environments.
  - Use: the transport underneath HTTP/3, giving faster page loads and better behavior on mobile networks.
- TCP (Transmission Control Protocol)
  - Protocol type: connection-oriented.
  - Connection setup: a three-way handshake is required before data flows, which adds setup latency.
  - Reliability: data is delivered in order and without gaps; lost segments are retransmitted.
  - Flow control: a window mechanism keeps the sender from transmitting more than the receiver can handle.
  - Congestion control: the sending rate is adjusted to the network's congestion state to avoid overloading it.
  - Use: widely used wherever reliable delivery matters, such as file transfer and email.
- UDP (User Datagram Protocol)
  - Protocol type: connectionless.
  - Connection setup: none; data can be sent immediately.
  - Reliability: no guarantees; lost or corrupted datagrams are not retransmitted and ordering is not preserved.
  - Flow and congestion control: none built in; the sender can transmit at whatever rate it likes.
  - Efficiency: with no connection management or recovery machinery, overhead is low, which suits real-time communication, video streaming, DNS queries, and other latency-sensitive uses.
  - Use: applications that need speed and can tolerate some loss or reordering, such as VoIP, video conferencing, and streaming media.
| Feature | QUIC | TCP | UDP |
|---|---|---|---|
| Protocol type | Runs over UDP, with reliability added on top | Connection-oriented, reliable delivery | Connectionless, no reliability guarantees |
| Connection setup | 0-RTT for resumed connections, 1 RTT otherwise | Three-way handshake (extra latency) | None; transmit immediately |
| Encryption | TLS 1.3 built in | Separate TLS/SSL layer required | None |
| Multiplexing | Independent streams on one connection | No stream multiplexing | No multiplexing |
| Congestion control | TCP-derived, with refinements | Built in | None |
| Flow control | Yes | Yes | None |
| Typical uses | Low-latency, high-performance web (HTTP/3) | Reliable file transfer, email, etc. | Real-time communication, streaming, DNS queries |
- QUIC suits high-performance web applications, especially over high-latency, lossy networks, thanks to faster connection setup and stream multiplexing.
- TCP suits applications that need reliable delivery, such as file transfer and database access.
- UDP suits applications with tight real-time requirements that can tolerate some loss or reordering, such as video calls and streaming media.
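As a rough illustration of the connection-setup differences above, the sketch below counts the round trips before application data can flow, using RTTs in the range seen on the LEO paths earlier (an approximation that ignores processing time and assumes TLS 1.3 over TCP):

```python
# Rough arithmetic on connection setup over a LEO path (illustrative only).
HANDSHAKE_RTTS = {
    "TCP + TLS 1.3": 2,             # 1 RTT TCP handshake + 1 RTT TLS before data
    "QUIC (first connection)": 1,   # transport and crypto handshakes combined
    "QUIC (0-RTT resumption)": 0,   # application data in the first flight
}

for rtt_ms in (48, 111):            # RTTs in the range seen on the paths above
    for proto, rtts in HANDSHAKE_RTTS.items():
        print(f"RTT {rtt_ms:3d} ms  {proto:24s} ~{rtts * rtt_ms:3d} ms before data flows")
```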