
BACKGROUND AND MOTIVATION

RTC Requires Low Communication Latency

User-perceived end-to-end communication latency is one of the key QoE metrics for real-time communication (RTC) services such as video-conferencing, Internet telephony, and interactive VR/AR applications. Quantitatively, ITU-T G.114 [12] recommends a maximum one-way latency of 150ms for telephony, and 3GPP TS 22.105 [11] suggests a two-way limit of 400ms (150ms preferred) for videophone.

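As a back-of-the-envelope aid, the latency budgets above can be encoded in a tiny helper (the function name and defaults are our own illustration, not reference code from either standard):

```python
# Hypothetical helper encoding the latency budgets cited above:
# ITU-T G.114 recommends at most 150ms one-way for telephony, and
# 3GPP TS 22.105 suggests a 400ms two-way limit (150ms preferred) for videophone.

def within_latency_budget(one_way_ms: float, budget_ms: float = 150.0) -> bool:
    """Return True if the measured one-way latency meets the budget."""
    return one_way_ms <= budget_ms

print(within_latency_budget(120.0))  # a 120ms one-way path meets the 150ms budget
print(within_latency_budget(200.0))  # a 200ms one-way path does not
```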

Figure 1 plots a typical architecture of multi-user RTC systems, which is built upon cloud-based relay server(s) offered by wide-area cloud platforms (e.g., Amazon AWS, Microsoft Azure). Such a cloud-based overlay network can achieve improved performance and better scalability [15], [28], [29] as compared to a full-mesh peer-to-peer data transmission model over the public Internet. Commercial RTC applications such as Skype, Zoom, and Google Hangouts have already (partially) migrated their services to cloud platforms [32], [34], [50]. To establish a multi-user RTC session, a set of proper cloud relay servers is first selected by the RTC service provider, and then the routing policy (i.e., how to forward RTC flows) is applied to each selected relay. One relay server is picked as the control unit, which aggregates RTC traffic on the server side so that each peer does not need to send an independent flow to every other peer. In this typical RTC architecture, the end-to-end communication path is divided into two segments: a client-cloud segment over the public Internet, which may traverse multiple autonomous systems (ASes), and an inter-cloud-site segment over the private WAN of the cloud provider.

[Figure 1: typical architecture of multi-user RTC systems]


TL;DR
  1. Select a set of cloud relay servers
  2. Install the routing policy on every server in that set
  3. Pick one CU (Control Unit), dividing the path into a client-cloud segment + an inter-cloud segment
  4. For multicast, the key is forwarding at the CU, not the sender replicating and sending many copies itself 🌟 🔥
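The benefit of aggregating traffic at the control unit can be quantified with a simple flow count: a full mesh among N peers needs N(N-1) directed flows, while a CU-based star topology needs only 2N (one uplink plus one downlink per peer). A minimal sketch, with function names of our own choosing:

```python
def full_mesh_flows(n_peers: int) -> int:
    # Every peer sends a separate copy of its stream to each other peer.
    return n_peers * (n_peers - 1)

def cu_relay_flows(n_peers: int) -> int:
    # Every peer sends one uplink flow to the control unit (CU), which
    # aggregates traffic server-side and fans out one downlink per peer.
    return 2 * n_peers

# The gap grows quadratically vs. linearly with the session size.
for n in (4, 10, 50):
    print(n, full_mesh_flows(n), cu_relay_flows(n))
```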

Wide-Area RTC is Still Suffering from High Latency

While sustaining low latency is important, through our measurements and in-depth analysis of two datasets collected from state-of-the-art RTC applications, we observe that wide-area RTC is still suffering from high communication latency.

Dataset description. The first dataset is a large-scale anonymous trace collected from a commercial video-conferencing application. The dataset consists of 17507 video-conferencing sessions drawn from a 21-day period across 193 countries/regions with 32686 end users. Each session is associated with the round-trip time (RTT) and packet loss rate experienced by each user, as reported by the client software. The second dataset is measured from four popular RTC applications (i.e., Zoom [9], Google Meet [6], Cisco Webex [4] and VooV [8]) using a number of controllable virtual hosts rented from virtual private server (VPS) providers. Because some applications perform differently in their free and premium versions (e.g., we observe that the free version of Zoom can only use cloud relays deployed in the U.S., while its premium version can use far more geo-distributed relay servers around the world to attain significantly reduced latency), in our experiment we upgrade the client software to obtain their best performance. In particular, we rent 102 geo-distributed hosts from 7 different VPS providers to emulate wired RTC users, and measure the end-to-end communication latency among them. Note that the virtual hosts are fully controllable, so we are able to run traceroute and tcpdump to track and analyze how media packets are routed over the network, especially for the bad cases suffering from very poor network performance. For each RTC session, we sample the latency once per second, and use the minimum sample in each session for statistics to mitigate the impact of transient network congestion.
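The sampling methodology above (one latency sample per second, then the per-session minimum to filter out transient congestion) can be sketched as follows; the data layout is our own assumption:

```python
from typing import Dict, List

def per_session_min_rtt(sessions: Dict[str, List[float]]) -> Dict[str, float]:
    """Map each session id to the minimum RTT (ms) among its 1 Hz samples.

    Taking the per-session minimum discards transient congestion spikes,
    leaving an estimate of the path's baseline latency.
    """
    return {sid: min(samples) for sid, samples in sessions.items() if samples}

samples = {
    "s1": [310.0, 305.0, 520.0, 298.0],  # the 520ms spike is filtered out
    "s2": [120.0, 118.0, 119.0],
}
print(per_session_min_rtt(samples))  # {'s1': 298.0, 's2': 118.0}
```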

Our analysis makes three observations as described below:


O(i): from a global perspective, RTC users in many regions are still suffering from high communication latency.

Figure 2 plots an overview of the measured user-perceived latency and packet loss across all collected traces in the above two datasets. We group the network performance results by their associated continents. As shown in Figure 2, we observe that while latency in many populated and developed areas is low, a large number of users still suffer from high communication latency (e.g., > 500ms RTT), especially users in remote or underdeveloped areas. Specifically, about 25.5%/10.1% of users in total suffer from RTTs higher than 300ms/500ms. Even in developed regions like EU and NA, we still observe 2% and 13% of sessions with RTTs higher than 300ms. The latency problem is more severe in AF, where more than 47.7%/18.2% of users suffer from RTTs higher than 300ms/500ms, probably due to the under-served network infrastructure and the lack of available cloud nodes. As user-perceived experience is sensitive to network performance, high RTTs inevitably impair the QoE of user interactions.

[Figure 2: overview of measured user-perceived latency and packet loss, grouped by continent]

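The threshold statistics above (the share of sessions whose RTT exceeds 300ms/500ms) can be reproduced from a list of per-session RTTs; the sample values below are made up for illustration:

```python
from typing import Iterable

def frac_above(rtts_ms: Iterable[float], threshold_ms: float) -> float:
    """Fraction of sessions whose (per-session minimum) RTT exceeds the threshold."""
    rtts = list(rtts_ms)
    return sum(r > threshold_ms for r in rtts) / len(rtts)

# Illustrative per-session RTTs (ms); five of the ten exceed 300ms.
rtts = [80, 120, 250, 310, 340, 510, 620, 95, 140, 480]
print(f"{frac_above(rtts, 300):.0%} above 300 ms")  # 50% above 300 ms
```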

O(ii): wide-area, international RTC sessions are more likely to suffer from high communication latency.

Figure 3 depicts the average user-perceived one-way latency between Zoom users in representative populated areas, together with their inherent physical distance (i.e., the great-circle distance). Results of other applications are similar and omitted due to space limits. The one-way latency is measured as the time consumed by delivering a certain video frame from the sender to the receiver. We observe that despite their growing importance, international or wide-area RTC sessions are more likely to suffer from poor network performance than short-distance, domestic calls. For populated city pairs, all intercontinental sessions are associated with one-way latencies higher than 300ms, and some can even reach 500ms, which may cause a poor experience for RTC interactions.

[Figure 3: average one-way latency between Zoom users in representative populated areas vs. great-circle distance]

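The distance axis of such a comparison is the great-circle distance, computable with the haversine formula; pairing it with propagation at about two-thirds of c in fiber gives a best-case one-way propagation floor for any terrestrial route. The coordinates and helper names below are our own illustration:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
FIBER_SPEED_KM_S = 0.67 * 299_792.458  # ~67% of c in terrestrial fiber

def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two (lat, lon) points, in km."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin(radians(lat2 - lat1) / 2) ** 2 \
        + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def fiber_one_way_lower_bound_ms(distance_km: float) -> float:
    """Best-case one-way propagation delay over a straight fiber path, in ms."""
    return distance_km / FIBER_SPEED_KM_S * 1000

# London -> New York: roughly 5,600 km great-circle, i.e. a ~28 ms one-way
# propagation floor even over a perfect, geographically-direct fiber route.
d = great_circle_km(51.5074, -0.1278, 40.7128, -74.0060)
print(round(d), round(fiber_one_way_lower_bound_ms(d), 1))
```

Any measured one-way latency far above this floor points at routing detours or queueing rather than raw distance.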

O(iii): the high latency issue is widely observed in existing RTC applications.

Finally, Figure 4 shows the one-way latency measured from four state-of-the-art applications for international communication between populated city pairs around the world. There is no clear winner, as each application may outperform the others for certain communication sessions. However, we observe that for most long-distance, cross-continent RTC sessions, the user-perceived latency is higher than 300ms for all applications, indicating that high latency is widely observed in existing popular RTC applications.

[Figure 4: one-way latency of four RTC applications between populated city pairs]


TL;DR
  1. RTC experience in remote/rural areas is poor
  2. Long-distance communication is more likely to suffer from poor network performance

Root Cause Analysis

We reproduce the high-latency sessions on our rented hosts to track and analyze how data packets are routed. Specifically, we identify three factors in the existing cloud-based RTC architecture that can lead to the high-latency issue.

(i) Inherent high propagation delay in long transmission path.

In terrestrial fiber, data packets travel at about 67% of the speed of light in a vacuum. Even if the two communication ends establish the session over the optimal route (e.g., the great-circle path), the inherent propagation delay can still be high over very long physical communication distances.


[Figure 5: examples of meandering routes]

(ii) Additional delay introduced by meandering routes over the public Internet.

While existing RTC architectures exploit the private cloud WAN to improve their network performance, the paths between clients and clouds are still built upon the public Internet, and may cross more than one autonomous system (AS). How data packets are forwarded between ASes is decided by policy-based inter-AS routing protocols (e.g., BGP). Constrained by many practical considerations other than latency performance, existing inter-AS communication may suffer from meandering routes under certain circumstances [16], [19]. Figure 5a plots an example of a prolonged path caused by a tortuous client-to-cloud route. Two Webex users are located in Beijing (CN) and Johannesburg (ZA), and the relays are selected on two cloud servers in Singapore and Tianjin (CN). However, through our traceroute analysis we observe a detour over the client-cloud segment between Johannesburg and Singapore, made by the underlying inter-AS routing protocols: data packets are forwarded to the relay server through London (UK). This detour from the client to the relay server significantly prolongs the RTT to 322ms, and can lead to a poor user experience for the RTC session.

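The cost of such a detour can be approximated from great-circle distances alone (the city coordinates and haversine helper below are our own illustration; real fiber paths are longer still):

```python
from math import asin, cos, radians, sin, sqrt

def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two (lat, lon) points, in km."""
    p1, p2 = radians(lat1), radians(lat2)
    a = sin(radians(lat2 - lat1) / 2) ** 2 \
        + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

JNB = (-26.20, 28.05)   # Johannesburg
SIN = (1.35, 103.82)    # Singapore
LON = (51.51, -0.13)    # London

direct = great_circle_km(*JNB, *SIN)
detour = great_circle_km(*JNB, *LON) + great_circle_km(*LON, *SIN)
# The London detour more than doubles the propagation distance vs. direct.
print(round(direct), round(detour), round(detour / direct, 2))
```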

(iii) Meandering inter-cloud routes caused by uneven deployment of cloud servers.

The network performance of the relay-based architecture is significantly affected by the distribution of available cloud servers. However, today's terrestrial Internet is an uneven network, where resources (computation, storage and network) are aggregated in "hot"/developed areas and limited by topographic constraints. In practice, the deployment of cloud servers might be insufficient to construct a delay-optimal route for wide-area interactions. Figure 5b depicts an example of a meandering route over the inter-cloud-site network. Two Zoom users are located in London (UK) and Rio (BR), respectively. Two Amazon AWS cloud servers in Sao Paulo (BR) and Frankfurt (DE) are selected as the access relays, and RTC traffic between them is carried over the private cloud WAN. In this case, although the latency over the client-cloud segment is low, there are no direct terrestrial/submarine fibers between the two cloud sites in Sao Paulo and Frankfurt. Our traceroute analysis exposes that the inter-cloud-site route goes through a cloud site in Washington (US), creating a meandering route with prolonged end-to-end communication latency.


Terrestrial difficulties. None of the above culprits is easy to tackle in today's terrestrial networks: (i) the propagation speed in terrestrial fiber is inherently bounded; (ii) cross-AS routes are hard to make globally optimal, as they are jointly determined by multiple independent AS operators; (iii) the deployment and coverage of cloud servers are constrained by complex economic and geographic factors. How can we alleviate these root causes?

Root causes
  1. The propagation speed in terrestrial fiber is inherently bounded -> long-distance communication is naturally worse
  2. Cross-AS routing is essentially not shortest-path -> higher RTT
  3. Uneven cloud server deployment causes meandering inter-cloud routes -> detours -> higher RTT

Can Emerging LEO Mega-Constellations Help?

Nowadays many “NewSpace” companies are actively planning and deploying their mega-constellations [5], [45] which will consist of thousands of low Earth orbit (LEO) satellites, promising to provide low-latency and high-throughput Internet services globally [21], [24], [26], [27], [48].

Thus we ask an exploratory and futuristic question: can emerging mega-constellations help to reduce latency for wide-area RTC?

Theoretically, a satellite network constructed upon a mega-constellation enables the following three critical opportunities.


(i) Free-space laser links can accelerate long-haul data transmission.

Many mega-constellations plan to deploy inter-satellite links (ISLs) for data communication in space [2], [7], [30].

Laser ISLs communicate at c (i.e., the speed of light in a vacuum) in free space, which is about 47% faster than propagation in terrestrial fiber [18], [26], [36], indicating that over a similar distance, routes over ISLs can potentially offer lower propagation latency.

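The speed advantage can be checked numerically; note that with propagation at exactly 67% of c in fiber, free space works out to roughly 49% faster, so the cited ~47% presumably assumes a slightly different fiber refractive index. The path length below is illustrative:

```python
C_KM_S = 299_792.458          # speed of light in vacuum
FIBER_KM_S = 0.67 * C_KM_S    # ~67% of c in terrestrial fiber

def one_way_ms(distance_km: float, speed_km_s: float) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_km / speed_km_s * 1000

d = 10_000  # an illustrative long-haul path length, km
fiber = one_way_ms(d, FIBER_KM_S)
laser = one_way_ms(d, C_KM_S)
# Over the same distance, free-space laser beats fiber by 1/0.67 - 1 ≈ 49%.
print(round(fiber, 1), round(laser, 1), f"{fiber / laser - 1:.0%}")
```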

(ii) Satellite paths can potentially outperform circuitous terrestrial fiber routes in terms of end-to-end latency.

First, free-space satellite paths are free from geographical constraints (e.g., mountains or oceans where cables are hard to deploy).

Second, unlike today's public Internet which is operated by a large number of operators with their own routing policies, a satellite network built upon a certain constellation is likely to be managed by a single operator (e.g., Starlink by SpaceX).

Therefore, routes in space are likely to form nearly-shortest paths (excluding the overhead of up/down-links) with low latency as compared to terrestrial fiber routes [18], [26], [27], [37].


(iii) Extending the coverage and availability of existing cloud platforms.

Emerging satellites with evolved cloud-like capabilities [17], [25], [42] can collaborate with existing cloud platforms and work as additional "cloud relays in space" [22], covering areas with insufficient cloud deployment, and aggregating, processing and forwarding RTC traffic while avoiding meandering inter-cloud-site routes.


Takeaways. Given that space-borne hardware and capability are evolving rapidly [25], [42], we argue that future LEO constellations are likely to be feasible for assisting cloud platforms and relaying RTC traffic, and thus can reduce the communication latency especially for long-distance, wide-area sessions.

Fully unleashing the low-latency potential of emerging constellations requires solving a fundamental problem introduced by the high dynamicity of LEO satellites:

how should we dynamically and judiciously select relay server(s) from a collection of available (dynamic) satellites and (static) clouds, and allocate RTC flows over the relay network?

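As a toy illustration of this selection problem (not the paper's actual algorithm), a naive approach picks, from a snapshot of candidate relays, the one that minimizes the worst peer-to-peer latency through it; all names and latency values below are hypothetical, and a real system must re-solve this continuously as satellites move:

```python
from typing import Dict, Tuple

def pick_relay(
    latency_ms: Dict[str, Dict[str, float]],  # relay -> peer -> one-way latency
) -> Tuple[str, float]:
    """Choose the relay minimizing the worst peer-to-peer latency through it.

    In a star topology, latency between peers a and b via relay r is
    latency[r][a] + latency[r][b]; the maximum over all peer pairs is the
    sum of the two largest peer latencies at r.
    """
    best_relay, best_cost = "", float("inf")
    for relay, to_peers in latency_ms.items():
        worst_two = sorted(to_peers.values(), reverse=True)[:2]
        cost = sum(worst_two)
        if cost < best_cost:
            best_relay, best_cost = relay, cost
    return best_relay, best_cost

# Hypothetical snapshot: one static cloud relay vs. one passing LEO satellite.
snapshot = {
    "cloud-frankfurt": {"london": 15.0, "rio": 110.0, "beijing": 90.0},
    "leo-sat-42":      {"london": 30.0, "rio": 60.0,  "beijing": 55.0},
}
print(pick_relay(snapshot))  # ('leo-sat-42', 115.0)
```

Because LEO relays enter and leave view on the order of minutes, the snapshot itself is short-lived, which is precisely what makes the real problem hard.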

Why LEO can help
  1. Free-space laser links accelerate long-haul data transmission <- a physical-layer gain
  2. Satellite paths may offer better end-to-end latency
    • Gain: within the LEO constellation, paths are better (no geographic obstacles)
    • Pain point: the up/down ground-satellite links (GSLs) add extra overhead
  3. Satellites act as additional "cloud relays in space" -> traffic forwarding